Search For Specific Text in a PDF File

Hi Everyone,

I'm new here and new to using KM.

I need to extract specific text from PDF files and copy it to a .txt file. To be clearer, here is an example of what I need to do.

From this kind of document I need to create a .txt like this:

Name= name of the current document + _GironeVerde
First line empty
2021
GRANDA COLLEGE CUNEO
TAM TAM
B.C. GATORS
SCUOLA BASKET ASTI
A. SPORTIVA DIL. PALL. ABA SALUZZO
O.A.S.I. LAURA VICUNA
ASD BASKET CLUB SERRAVALLE
A.S.D. PALL. FARIGLIANO
USAC RIVAROLO BK 2009
PALL. GRUGLIASCO
PALLACANESTRO CIRIE'
ALFIERI
A.S.D. 5 PARI
A.S.D. OLEGGIO JUNIOR BASKET
FULGOR OMEGNA
ETEILA BASKET

Name= name of the current document + _GironeRosso
First line empty
2021
O.A.S.I. LAURA VICUNA
ASD BASKET CLUB SERRAVALLE
A.S.D. PALL. FARIGLIANO
USAC RIVAROLO BK 2009
PALL. GRUGLIASCO
PALLACANESTRO CIRIE'
ALFIERI
A.S.D. 5 PARI
A.S.D. OLEGGIO JUNIOR BASKET
FULGOR OMEGNA
ETEILA BASKET

Please help, no idea how to solve this.

Hey Riccardo,

Can you post a zipped copy of the PDF file?

This should be doable, but it's beyond the scope of Keyboard Maestro's built-in tools and will require installing one or more Unix executables.

I'll know right quick once I have my hands on an example file.

-Chris

Gironi.zip (1.8 KB)

Here it is, thank you Chris

The Keyboard Maestro macro Textcavator 2021 does that, Riccardo.

Hey Riccardo,

You posted two Text files and No PDF files.

I need an example PDF file.

-Chris

My bad, this should be the exact one

Under_15_Gold.pdf.zip (33.5 KB)

This AppleScript will return the text content of a given PDF file whose path you supply:

use framework "PDFKit"
use scripting additions

property this : a reference to current application

property NSString : a reference to NSString of this
property NSURL : a reference to NSURL of this
property PDFDocument : a reference to PDFDocument of this

on pdfText from pdf_filepath as text
		set pdf_pages to {}
		tell PDFDocument's alloc()
				initWithURL_(fileURLWithPath_((NSString's ¬
						stringWithString:pdf_filepath)'s ¬
						stringByStandardizingPath()) of NSURL)
				
				repeat with i from 0 to the pageCount() - 1
						set end of pdf_pages to the |string|() ¬
								of the pageAtIndex_(i) as text
				end repeat
		end tell
		
		return the pdf_pages
end pdfText


get pdfText from "/Users/CK/Downloads/Under_15_Gold.pdf"

It grabs the text verbatim, going from left-to-right, and top-to-bottom through each text object.

The result is a list, each item containing the text from an individual page in order. I’ll leave the parsing to you to isolate the bits you need versus the stuff you don’t.

Hey @CJK,

“PDFKit” blows up in my face.

Can’t get framework "PDFKit".

-1,728

Was that added after Mojave?

I do have a working solution for Mojave that uses WebKit and Quartz instead.

-Chris

Hey @Rick_4,

Okay, extracting text from the PDF works as expected.

Unfortunately @CJK's AppleScriptObjC method produces garbled text that's really hard to work with, but fortunately there's a tool available with a switch that maintains the original document layout as much as possible.

Download the 64-bit Xpdf tools from here:

DO NOT TRY TO USE THE INSTRUCTIONS IN THE “INSTALL” FILE OF THE ARCHIVE.

Install the items from the bin64 folder in the archive to this folder:

/usr/local/bin/

In a Finder window, ⇧⌘G (Go to Folder) will open a field where you can paste the path and go there.

Install the items from the doc folder of the archive to:

/usr/local/share/man/man1/

Create the folders if you have to.
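If Terminal is more comfortable than dragging files around in the Finder, the two copy steps above could be sketched as a small shell function. This is only a sketch under assumptions: `install_xpdf` is a made-up helper name, and its first argument is assumed to be the folder you get after unpacking the download (the one that contains bin64 and doc).

```shell
#!/bin/sh
# Sketch only: install_xpdf is a hypothetical helper, not part of Xpdf.
# $1 = unpacked archive folder (assumed to contain bin64/ and doc/)
# $2 = install prefix, defaulting to /usr/local as in the post above
install_xpdf() {
    archive_dir=$1
    prefix=${2:-/usr/local}

    # "Create the folders if you have to."
    mkdir -p "$prefix/bin" "$prefix/share/man/man1" || return 1

    # Executables go to <prefix>/bin, the doc items to <prefix>/share/man/man1.
    cp "$archive_dir"/bin64/* "$prefix/bin/" || return 1
    cp "$archive_dir"/doc/* "$prefix/share/man/man1/" || return 1
}
```

You would call it with the path to your own unpacked folder, for example `install_xpdf "$HOME/Downloads/xpdf-tools"` (that folder name is just a placeholder). Copying into /usr/local may require administrator rights on your Mac.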

Once you've accomplished this we'll move forward.

-Chris

Hi @ccstone,

Thank you so much.

Installation done, ready to follow you forward.

Okay.

Open the Keyboard Maestro Editor's Preferences and create a variable named:

ENV_PATH

Place this path string in it:

/opt/local/bin:/opt/local/sbin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:
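The idea, as I understand it, is that Keyboard Maestro uses ENV_PATH as the PATH for its shell script actions, so tools in /usr/local/bin can be found by name. A quick way to check that a path string like the one above actually reaches a given tool is sketched here; `has_tool` is a hypothetical helper name, not something Keyboard Maestro provides.

```shell
#!/bin/sh
# Sketch only: has_tool is a hypothetical helper.
# $1 = command name to look for, $2 = colon-separated search path
# Returns 0 if the shell can resolve the command using that path.
# Note: command -v also reports shell builtins and functions, so this is
# only meaningful for external tools such as pdftotext.
has_tool() {
    # Run in a subshell so the caller's PATH is left untouched.
    ( PATH=$2; command -v "$1" >/dev/null 2>&1 )
}
```

For example, `has_tool pdftotext "/opt/local/bin:/opt/local/sbin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:" && echo found` should print `found` once the Xpdf tools are installed in /usr/local/bin.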

Download and install BBEdit.

If you don't own it already it will be fully featured during the 30-day demo period and then revert to the freeware “lite” version. The lite version is still very powerful and scriptable.

I use it for programming and for viewing plain text documents.

We'll use BBEdit later.

In the meantime try this macro.

-Chris


Extract Text from the Selected PDF File in the Finder v1.00.kmmacros (7.0 KB)
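The macro's contents aren't visible in the thread, so here is a rough guess at what its shell step could look like. It assumes the layout-preserving switch mentioned earlier is Xpdf's `pdftotext -layout`, and that passing `-` as the output file sends the text to standard output. `extract_pdf_text` is a hypothetical function name.

```shell
#!/bin/sh
# Sketch only: extract_pdf_text is a hypothetical helper, and this is not
# necessarily what the attached macro does.
# $1 = path to the PDF file
# Prints the PDF's text to standard output, keeping the page layout
# as far as pdftotext can manage.
extract_pdf_text() {
    pdftotext -layout "$1" -
}
```

Inside a Keyboard Maestro shell script action the path would typically come from a variable, e.g. `extract_pdf_text "$KMVAR_pdfPath"`, where `pdfPath` is a made-up variable name holding the Finder selection.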

Hey @ccstone ,

The macro works perfectly. It prints the text of the PDF, so I suppose the next step is for BBEdit to strip out the numbers and the rest of each line after the '-'.
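That kind of cleanup could also be sketched in the shell with sed. The line format here is purely an assumption (a leading number, the team name, then a '-' followed by text to discard), since the extracted text isn't shown in the thread. `clean_lines` is a hypothetical name.

```shell
#!/bin/sh
# Sketch only: clean_lines is a hypothetical helper and the line format
# is assumed, not taken from the real PDF.
# Reads lines on standard input and, for each line:
#   1. removes a leading number and the whitespace around it
#   2. removes the first '-' on the line and everything after it
clean_lines() {
    sed -E 's/^[[:space:]]*[0-9]+[[:space:]]*//; s/[[:space:]]*-.*$//'
}
```

Note that a team name containing a hyphen would be cut short by the second rule, so the pattern would need adjusting against the real text.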

Thanks a lot for your help, really appreciated.

Hey @Rick_4,

Alright – try this one.

Then show me by example what is missing and/or what needs to be changed.

Also – what is the ultimate destination for the extracted text?

-Chris


Extract Text from the Selected PDF File in the Finder v1.01 @Working.kmmacros (7.2 KB)

This one works perfectly. The final destination of the text file is the same folder as the selected PDF file, and the PDF file should be deleted afterwards.
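In shell terms, that requirement could be sketched like this, assuming `pdftotext -layout` does the extraction. `pdf_to_txt_in_place` is a hypothetical name, and the sketch only removes the PDF once a non-empty text file exists next to it.

```shell
#!/bin/sh
# Sketch only: pdf_to_txt_in_place is a hypothetical helper.
# $1 = path to the PDF file
# Writes <name>.txt next to the PDF, then deletes the PDF.
pdf_to_txt_in_place() {
    pdf=$1
    # Swap a trailing .pdf (any letter case) for .txt; same folder as the PDF.
    txt="${pdf%.[Pp][Dd][Ff]}.txt"

    pdftotext -layout "$pdf" "$txt" || return 1

    # Keep the PDF if the text file is missing or empty.
    [ -s "$txt" ] || return 1

    # rm is permanent; the file does not go to the Trash.
    rm -- "$pdf"
}
```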

Hey @Rick_4,

Okay, try v1.02.

-Chris


Extract Text from the Selected PDF File in the Finder v1.02.kmmacros (13 KB)

Hey @ccstone,

I've tried your macro many times and it works perfectly, thank you very much.

But now I'm also trying to capture the line "Girone: Girone something" in order to create a new .txt file named "name of the current document + _Gironesomething".

Here's my attempt to script "Girone something":

Match_Gironi.kmmacros (1.9 KB)

Thank you in advance,

-Rick

Hey Rick,

What's your starting point?

Are you starting with your example PDF file?

You're not giving me a very solid picture of your current process.

-Chris

Hey Chris,

Sorry for my poor explanation.

Take the PDF file I posted as an example. I'm using the macro you posted (Extract Text from the Selected PDF File in the Finder v1.02) to execute the shell script on it.

From this kind of PDF file I would like to create new .txt files like this:

Example using the above PDF file:

Name = name of the current document + _GironeA
Content =
POL ORATORIO DON BOSCO
GSO GSO GOTTTOLENGO
BASKET AQUILE LONATO
....

Name = name of the current document + _GironeB
Content =
UISP MANERBIO
POL CABRIOLESE
GSO RUDIANO
...

(I've written just the names of the first 3 teams for simplicity.)

I hope that's clearer.

-Rick

So you want to break up the PDF into individual text files containing each group's information and named for the original document plus the group name – yes?

Just like with the previous task – you need to post the actual PDF files for testing.
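Without the real PDF this can't be tested properly, but the split described above could be sketched with awk along these lines. It assumes the extracted text has already been saved as a .txt file, that each group starts with a line containing "Girone: Girone X", and that every non-empty line after it belongs to that group until the next such line. `split_by_girone` is a hypothetical name.

```shell
#!/bin/sh
# Sketch only: split_by_girone is a hypothetical helper and the input
# format is assumed from the description in this thread.
# $1 = path to the extracted text file, e.g. /path/Under_15_Gold.txt
# Creates /path/Under_15_Gold_GironeA.txt, ..._GironeB.txt, and so on.
split_by_girone() {
    text_file=$1
    base=${text_file%.txt}

    awk -v base="$base" '
        /Girone:/ {
            # "Girone: Girone A" -> "GironeA"
            name = $0
            sub(/^.*Girone:[ \t]*/, "", name)
            gsub(/[ \t]+/, "", name)

            if (out != "") close(out)
            out = base "_" name ".txt"
            printf "" > out      # create (or empty) the group file
            next
        }
        # Lines before the first group header are ignored.
        out != "" && NF > 0 {
            line = $0
            sub(/^[ \t]+/, "", line)
            sub(/[ \t]+$/, "", line)
            print line > out
        }
    ' "$text_file"
}
```

If the header line carries other text after the group name, or the team lines include numbers and other columns, the patterns would need adjusting once the real file is available.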

-Chris

Exactly,

I've posted a JPG version of the actual PDF file because I can't post the PDF itself.