KM to extract text from .pdf, .docx, etc. and concatenate into one file?

Hi, we have multiple folders of files which contain mostly PDFs, docx's, and pptx's, low hundreds of documents per folder. We want to take each folder and create one big text file that is a mashup of all the text in the documents in that folder. It doesn't have to be cleanly formatted; we're just looking for all the text to be in one file per folder. Ideally I'd like to use a combination of "For Each... (file in directory)" and "Append Text to File", but I'm struggling to automatically get the text out of the PDFs, docx's, and pptx's without opening the documents individually. Anyone have any suggestions here? Thanks in advance.

How wild are you able to go? If you can use Homebrew to install GNU grep, you should be able to use unzip -qc /path/to/word/file.docx word/document.xml | grep -oP '(?<=\<w:t\>).*?(?=\</w:t\>)' to extract text from Word documents (Word stores its text in w:t elements) and unzip -qc /path/to/powerpoint/file.pptx 'ppt/slides/slide*.xml' | grep -oP '(?<=\<a:t\>).*?(?=\</a:t\>)' to do the same for PowerPoint files (shamelessly stolen from here). The slide pattern is quoted so that unzip, not the shell, expands the wildcard.

Unfortunately the grep included with macOS doesn't support the -P option, which the lookbehind and lookahead in those patterns need. Unless someone can rewrite the above patterns to work without them?
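One possible way around the lookarounds, as a rough sketch rather than a tested recipe: let grep -o keep each whole text element, then strip the tags with sed. Both steps use only -E and -o, which the stock macOS tools should accept. The function name extract_text_runs is made up for illustration, and XML entities such as &amp; are left undecoded.

```shell
# Hypothetical helper: reads Office XML on stdin, prints one text run per line.
# Step 1 keeps only <w:t ...>text</w:t> (Word) or <a:t>text</a:t> (PowerPoint).
# Step 2 deletes the opening and closing tags, leaving just the text.
extract_text_runs() {
  grep -oE '<(w|a):t( [^>]*)?>[^<]*</(w|a):t>' |
    sed -E 's/<[^>]+>//g'
}

# Usage (paths are placeholders):
#   unzip -qc /path/to/word/file.docx word/document.xml | extract_text_runs
#   unzip -qc /path/to/powerpoint/file.pptx 'ppt/slides/slide*.xml' | extract_text_runs
```

Unlike the exact-match lookbehind, this pattern also picks up runs written as <w:t xml:space="preserve">, which Word uses when the text has leading or trailing spaces.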

And if you can use Homebrew you can install pdftotext or similar to do your PDF text extraction.
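For the PDF side, a minimal sketch might look like the following. It assumes pdftotext is already on the PATH (with Homebrew it normally comes from the poppler package), is written for sh/bash, and uses a made-up function name, pdfs_to_text.

```shell
# Hypothetical helper: writes NAME.txt next to every NAME.pdf in the given folder.
# Assumes pdftotext is installed and on the PATH.
pdfs_to_text() {
  pdf_dir=$1
  for pdf in "$pdf_dir"/*.pdf; do
    # Skip the literal pattern when the folder holds no PDFs.
    [ -e "$pdf" ] || continue
    # -layout tries to keep the page layout; drop it for plain reading order.
    pdftotext -layout "$pdf" "${pdf%.pdf}.txt" || echo "failed: $pdf" >&2
  done
}

# Usage (path is a placeholder):
#   pdfs_to_text /path/to/folder
```

Scanned PDFs with no text layer will come out empty, since pdftotext does no OCR.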

Otherwise yes, you might be looking at opening docs in apps, one by one...

Once the files have been converted, perhaps use the cat command, e.g. for each folder use something like:
cat *.txt > all_text.txt to create a text file with the contents of all files with the extension .txt.
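To run that over many folders at once, something along these lines could work. It is only a sketch for sh/bash: the function name combine_folders and the _all_text.txt suffix are invented here, and it assumes the converted .txt files sit directly inside each subfolder. The combined file is written to the parent folder, so it never matches *.txt inside the subfolder on a rerun.

```shell
# Hypothetical helper: for each subfolder of PARENT, concatenate its .txt files
# into PARENT/<subfolder>_all_text.txt.
combine_folders() {
  parent=$1
  for dir in "$parent"/*/; do
    [ -d "$dir" ] || continue
    name=$(basename "$dir")
    out="$parent/${name}_all_text.txt"
    # Start from an empty output file so reruns do not accumulate.
    : > "$out"
    for txt in "$dir"*.txt; do
      # Skip the literal pattern when the subfolder holds no .txt files.
      [ -e "$txt" ] || continue
      cat "$txt" >> "$out"
      # Blank line so consecutive documents do not run together.
      printf '\n' >> "$out"
    done
  done
}

# Usage (path is a placeholder):
#   combine_folders /path/to/parent
```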