How to Extract a String from a PDF with Keyboard Maestro

Caroline · February 23, 2023, 6:23pm

Hi there,

I'd like to rename my documents using Keyboard Maestro. I already have a large part of it working, but not all.

I would like to automatically read out a sequence of numbers whose format always stays the same. This string, XXXXXX/XX/X, should also come first in the document name. Can you help me find the right action to read the string?

I just tried this:

Aktenzeichen auslesen Macro (v10.2)

Aktenzeichen auslesen.kmmacros (55 KB)

I appreciate your help.

Caroline

Automatische Benennung von PDF-Dateien _ Version 2022-01-17_1 copy.kmmacros (123.2 KB)

ccstone · February 23, 2023, 10:06pm

Hey Jüsten,

Welcome to the forum!

It would be helpful if you provided some practical real-world examples of relevant file names.

-Chris

Caroline · February 23, 2023, 10:36pm

Hi Chris,

Sure - there you go:

Image

I want to read the code 112361/22/1 out of the document and name the PDF with that string. Different documents contain different numbers, but the string always has the form XXXXXX/XX/X.

Thanks a lot!

ccstone · February 23, 2023, 10:52pm

Can you copy text out of these PDFs manually?
Do you have any experience with the Terminal and Homebrew?

Caroline · February 23, 2023, 11:01pm

Hi again,

Yes, I've already downloaded various scripts to install tesseract, OCRmyPDF, and Poppler's pdftotext. Using these programs, I've managed to get the date and the topic of the PDF into its name; I uploaded those functions above. They work perfectly. But I can't get the string out of the PDF and into the name of the PDF.

Thank you!

Caroline

Nige_S · February 24, 2023, 9:30am

What we really need is:

1. A sample of the text you've extracted from the PDF
2. An example of the file name before you add the string obtained from the text in step 1
3. An example of the file name after you add the string

...because we can't run your macro to find those things out without having one of the PDFs and installing a bunch of utilities.

Caroline · February 24, 2023, 11:13am

Hey Nige_S,

thank you for your answer!

I'm gonna send you the PDF.

I want to extract the part 115500/23/0, which appears in most of the documents. If this part is missing, I want the document to be named using only the rest of the script, which looks for a specific word. The last part is the date of the document.

Currently the PDF is named: S220QXEZ6_Q65jhicl.pdf

and with the whole script I want it to be named like: 115500-23-0-Ermittlungsakte-2023-02-17.pdf

The word "Ermittlungsakte" is already being found in the document; that part of the script works. The date function works as well. What's missing is the string XXXXXX/XX/X in the document's name.
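
For illustration, the target name described here can be assembled in an Execute Shell Script action roughly as follows. This is only a sketch: `build_name` and its three arguments are hypothetical names, and it assumes the case number, keyword, and date have already been extracted by earlier steps.

```shell
#!/bin/sh
# Hypothetical helper: join case number, keyword, and date into a file name.
build_name() {
    case_no=$1    # e.g. 115500/23/0
    keyword=$2    # e.g. Ermittlungsakte
    doc_date=$3   # e.g. 2023-02-17
    # A slash cannot be part of a file name, so swap each one for a hyphen.
    safe_no=$(printf '%s' "$case_no" | tr '/' '-')
    printf '%s-%s-%s.pdf\n' "$safe_no" "$keyword" "$doc_date"
}

build_name "115500/23/0" "Ermittlungsakte" "2023-02-17"
# prints: 115500-23-0-Ermittlungsakte-2023-02-17.pdf
```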

Thanks a lot for your help!

Caroline · February 24, 2023, 11:18am

I am sorry! Here is the PDF:

S220QXEZ6_Q65jhicl.pdf.zip (39.8 KB)

Nige_S · February 24, 2023, 12:08pm

We don't want the PDF -- we want the text that your OCR steps are extracting from the PDF.

That's important because your OCR routines may behave differently from ours, especially if you've any language-specific training/dictionaries involved.

Try adding an action that puts the variable local Text der Ursprungsdatei (I think that's the right one!) onto the clipboard just before you do your regular expression search, then you can paste it into a TextEdit document or similar and upload that here so people can see it.

thoffman666 · February 24, 2023, 4:18pm

I routinely extract text from PDFs, but I don't involve OCR (and its errors). I use pdftotext (from Xpdf), then process the resulting text files with regex. All of this is done via KM.
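
A minimal sketch of that pdftotext-plus-regex approach, applied to the XXXXXX/XX/X pattern from this thread. The function names are hypothetical, and it assumes pdftotext is installed and the PDF already contains a text layer (no OCR step).

```shell
#!/bin/sh
# Print the first XXXXXX/XX/X-style string found in the text on stdin.
first_case_number() {
    grep -Eo '[0-9]{6}/[0-9]{2}/[0-9]' | head -n 1
}

# Extract the text of the PDF given as "$1" to stdout ("-") and search it.
pdf_case_number() {
    pdftotext -layout "$1" - | first_case_number
}
```

In a Keyboard Maestro shell action this might be called as `pdf_case_number "$KMVAR_local_File"`, where the variable name is only an example.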

dglancy · February 24, 2023, 4:32pm

Hi Caroline,

Is the string you are trying to get always preceded by Ihr Zeichen: AZ: ?
If so, this regex would find the string you are after and save it to a variable.

Ihr Zeichen: AZ: (.+)

My approach was to open the PDF in Acrobat Reader, select all, copy to the clipboard, and then search the clipboard for the string.
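
The same label-anchored idea can be sketched in a shell script. Here the capture is narrowed from `(.+)` to the XXXXXX/XX/X shape so that trailing text on the line is not picked up. The function name is hypothetical, and it assumes the extracted text reproduces the label "Ihr Zeichen: AZ:" exactly.

```shell
#!/bin/sh
# Print the reference that follows the label "Ihr Zeichen: AZ:" in the text on stdin.
case_number_after_label() {
    sed -nE 's#.*Ihr Zeichen: AZ: *([0-9]{6}/[0-9]{2}/[0-9]).*#\1#p' | head -n 1
}

printf 'Ihr Zeichen: AZ: 115500/23/0 Seite 1\n' | case_number_after_label
# prints: 115500/23/0
```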

find string.kmmacros (3.0 KB)

Caroline · February 24, 2023, 5:08pm

Hi guys,

thanks a lot. I just managed to find the string by using: (\d{6})/(\d{2})/(\d{1})

Most of the documents are now named with the found string.

The macro found all of the strings in the documents, but it also extracted random numbers from a document when it did not find the exact string.

In that case the writer of the document had not filled in the complete string, so I'd like the macro to name the file without any string, using the other parts of the macro. And when the macro finds neither the string nor the scripted content, I want the file to be named something like "not found - please rename it manually".
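
One possible way to handle both wishes in a shell script is sketched below. The function names are hypothetical, and the fallback name format is an assumption based on the description above. The first function only accepts a match that is not embedded in a longer run of digits, which is one plausible source of the random numbers; the second picks a name depending on which parts were found.

```shell
#!/bin/sh
# Print the first case number that is not embedded in a longer run of digits.
strict_case_number() {
    # The first grep keeps one neighbouring non-digit (or the line edge) on
    # each side; the second grep trims those neighbours off again.
    grep -Eo '(^|[^0-9])[0-9]{6}/[0-9]{2}/[0-9]([^0-9]|$)' |
        grep -Eo '[0-9]{6}/[0-9]{2}/[0-9]' |
        head -n 1
}

# Hypothetical naming rule: case number ($1) and keyword ($2) may be empty.
choose_name() {
    case_no=$(printf '%s' "$1" | tr '/' '-')
    keyword=$2
    doc_date=$3
    if [ -z "$case_no" ] && [ -z "$keyword" ]; then
        printf '%s\n' 'not found - please rename it manually.pdf'
        return 0
    fi
    name=$doc_date
    if [ -n "$keyword" ]; then name="$keyword-$name"; fi
    if [ -n "$case_no" ]; then name="$case_no-$name"; fi
    printf '%s.pdf\n' "$name"
}
```

With an empty case number, `choose_name "" "Ermittlungsakte" "2023-02-17"` yields `Ermittlungsakte-2023-02-17.pdf`.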

Thank you guys!

Advo| Autobenennung Advo AZ-Inhalt.kmmacros (223.0 KB)

ccstone · February 24, 2023, 7:24pm

Hey Caroline,

I've only taken this to the text-extraction point.

Select your PDF file in the Finder, and run the macro.
You'll get a pop-up window with the string.

Hopefully you can take it from there.

-Chris

Download Macro(s): Extract Text from a PDF v1.00.kmmacros (10 KB)

Macro-Notes

Macros are always disabled when imported into the Keyboard Maestro Editor.
- The user must ensure the macro is enabled.
- The user must also ensure the macro's parent macro-group is enabled.

System Information

macOS 10.14.6
Keyboard Maestro v10.2

Caroline · February 24, 2023, 10:53pm

Thanks a lot!

I hope one day I will be as fit as you are!

andreamike · June 23, 2023, 9:50pm

Hi there, I'm struggling to make this work on Apple Silicon (macOS 13.4).
The macro stops at the shell script execution. If I run the pdftotext command in Terminal it works fine. Might it have something to do with the Homebrew $PATH?

andreamike · June 30, 2023, 11:03am

This is the error I'm getting:

Execute a Shell Script failed with script error: text-script: line 1: /usr/bin:/bin:/usr/sbin:/sbin=/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/homebrew: No such file or directory
text-script: line 5: pdftotext: command not found. Macro “Extract Text from a PDF v1.00” cancelled (while executing Execute Shell Script).
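
The first line of that error is consistent with a script line written as `$PATH=...`: the shell expands `$PATH` on the left-hand side and then tries to run the whole result as a command. An assignment uses the bare name instead. A sketch of a corrected first line follows; the directories are assumptions (Homebrew's default is /opt/homebrew/bin on Apple Silicon and /usr/local/bin on Intel) and may differ on a given machine.

```shell
#!/bin/sh
# Assign with the bare name PATH; "$PATH=..." would expand the left side
# and try to run the result as a command.
# Assumed locations: Homebrew defaults for Apple Silicon and Intel Macs.
PATH="/opt/homebrew/bin:/usr/local/bin:$PATH"
export PATH

# Report which pdftotext, if any, the script will now pick up.
command -v pdftotext || echo "pdftotext not found on PATH" >&2
```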

mrpasini · June 30, 2023, 4:03pm

Try adding (or editing) the full path to your pdftotext.

andreamike · July 3, 2023, 4:53pm

I tried to, but it is returning the same kind of error. Can the problem be in $PATH=?

@ccstone you might be able to give a hint here :)

mrpasini · July 3, 2023, 5:13pm

$PATH is irrelevant if you've specified the path to the application. It comes into play when you don't. But it doesn't hurt to confirm the location:

which pdftotext

Chris Stone wrote a nice macro to write your $PATH to a Keyboard Maestro ENV_PATH (or some such) which duplicates the environment for Keyboard Maestro shell scripts. I couldn't find it on a recent search, though. It's a once-and-done thing. You could also paste your $PATH into a variable of that name.
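
As a sketch of the full-path approach, a script can look for the tool in a few likely directories and use the absolute path it finds. `find_tool` is a hypothetical helper, and the candidate directories are assumptions (Homebrew's defaults on Apple Silicon and Intel); the output of `which pdftotext` in Terminal is the authoritative answer for a given machine.

```shell
#!/bin/sh
# Print the first executable named "$1" found in the directories that follow.
find_tool() {
    tool=$1
    shift
    for dir in "$@"; do
        if [ -x "$dir/$tool" ]; then
            printf '%s\n' "$dir/$tool"
            return 0
        fi
    done
    return 1
}

# Assumed candidate directories; replace them with what `which pdftotext` reports.
PDFTOTEXT=$(find_tool pdftotext /opt/homebrew/bin /usr/local/bin) ||
    echo "pdftotext not found; check 'which pdftotext' in Terminal" >&2
```

The script can then call `"$PDFTOTEXT"` in place of the bare command name, which sidesteps $PATH entirely.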

tiffle · July 3, 2023, 5:29pm

Here’s Chris’s macro:

How to Extract a String from a PDF with Keyboard Maestro
