Hi, we have multiple folders of files, mostly PDFs, DOCXs, and PPTXs, low hundreds of documents per folder. We want to take each folder and create a big text file that is a mashup of all the text in the documents in that folder. It doesn't have to be cleanly formatted; we're just looking for all the text to be in one file per folder. Ideally I'd like to use a combination of "For Each... (file in directory)" and "Append Text to File", but I'm struggling to automatically get the text out of the PDFs, DOCXs, and PPTXs without opening the documents individually. Anyone have any suggestions here? Thanks in advance.
How wild are you able to go? If you can use Homebrew to install GNU grep, you should be able to use
unzip -qc /path/to/word/file.docx word/document.xml | grep -oP '(?<=<w:t>).*?(?=</w:t>)'
to extract text from Word documents, and
unzip -qc /path/to/powerpoint/file.pptx 'ppt/slides/slide*.xml' | grep -oP '(?<=<a:t>).*?(?=</a:t>)'
for PowerPoint files (shamelessly stolen from here, with two tweaks: .docx wraps its text runs in <w:t> tags while .pptx slides use <a:t>, and the slide glob needs quoting so unzip, not the shell, expands it).
Unfortunately the grep included with macOS doesn't support the -P (Perl-compatible regex) option that those lookbehind/lookahead patterns rely on. Unless someone can rewrite the above patterns to work without them?
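One lookaround-free workaround, since all we want is a text mashup rather than clean formatting: replace every XML tag with a space using the stock BSD sed, so no GNU grep is needed at all. A rough sketch (only exercised against the toy input below, not a real document):

```shell
# Lookaround-free alternative: swap every XML tag for a space.
# Works with the stock BSD sed that ships with macOS.
# For a real document you'd feed it the unzipped XML, e.g.:
#   unzip -qc /path/to/file.docx word/document.xml | sed 's/<[^>]*>/ /g'
printf '<w:p><w:r><w:t>Hello</w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>\n' \
  | sed 's/<[^>]*>/ /g'
```

You lose the one-run-per-line output the grep -oP version gives you, but for a mashup file that may not matter.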
And if you can use Homebrew, you can install pdftotext (it ships with the poppler formula) or similar to do your PDF text extraction.
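For example, a small helper along these lines (the convert_pdfs name is just for illustration, and it assumes pdftotext is on your PATH) would turn every PDF in a folder into a sibling .txt file:

```shell
# Hypothetical helper: convert every PDF in the given folder to a
# matching .txt file. With no output name, pdftotext writes
# file.txt right next to file.pdf.
convert_pdfs() {
  for f in "$1"/*.pdf; do
    [ -e "$f" ] || continue   # glob matched nothing; skip
    pdftotext "$f"
  done
}

# Usage (hypothetical path):
# convert_pdfs /path/to/folder
```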
Otherwise yes, you might be looking at opening docs in apps, one by one...
Once the files have been converted, perhaps use the cat command, e.g. for each folder use something like:
cat *.txt > all_text.txt
to create a text file with the contents of all files with the extension .txt. One caveat: all_text.txt itself matches *.txt, so if you run the command a second time the output file gets swept into its own input; giving the output a different extension avoids that.
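Putting the pieces together, here's a rough end-to-end sketch that builds one mashup file per folder. It assumes Homebrew's GNU grep (which installs as ggrep) and pdftotext are available, and mashup_folder and the root path are just illustrative names; adjust to your layout.

```shell
# Rough sketch: one mashup.txt per folder, appending extracted text
# from every .docx, .pptx, and .pdf found in that folder.
mashup_folder() {
  dir=$1
  [ -d "$dir" ] || return 0        # skip anything that isn't a folder
  out="$dir/mashup.txt"
  : > "$out"                       # start the mashup fresh
  for f in "$dir"/*; do
    case "$f" in
      *.docx) unzip -qc "$f" word/document.xml \
                | ggrep -oP '(?<=<w:t>).*?(?=</w:t>)' >> "$out" ;;
      *.pptx) unzip -qc "$f" 'ppt/slides/slide*.xml' \
                | ggrep -oP '(?<=<a:t>).*?(?=</a:t>)' >> "$out" ;;
      *.pdf)  pdftotext "$f" - >> "$out" ;;   # "-" sends text to stdout
    esac
  done
}

# Run it over every subfolder of a (hypothetical) root:
for d in /path/to/folders/*/; do
  mashup_folder "${d%/}"
done
```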