Search / Copy / Paste Text From PDFs? Life Changing

I have about 500 resumes, mostly in PDF format, in one folder. I need to pull the years of experience from all of these PDFs.

Here is what I think is the workflow...

Open a text file in Sublime

Do while I have pdf files in a folder

  • Read a pdf
  • Search and copy the employee name to x
  • Search for "Experience"
  • If found, pull the previous 5 words or so ("5 years of experience") to xx
  • Paste x (employee name) to a text file
  • Keyboard a comma (for a csv file)
  • Paste xx to the text file
  • Copy the pdf file to an archive folder
    Rinse, repeat

Then I can take this csv file, import it in a spreadsheet and do some further cleaning....
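The "search for Experience and pull the words before it" step can be sketched with a regular expression. This is only a rough illustration in Python (not the AppleScript used later in the thread), and the pattern and the name years_of_experience are assumptions made up for this example. It assumes the resume says something like "5 years of experience" and will miss other phrasings.

```python
import re

# Hypothetical pattern: a number (optionally followed by "+"), then "year(s)",
# then up to three filler words before "experience".
EXPERIENCE_RE = re.compile(
    r"(\d+\+?)\s*years?\b(?:\s+\w+){0,3}?\s+experience",
    re.IGNORECASE,
)


def years_of_experience(text):
    """Return the first 'N years ... experience' figure as a string, or None."""
    match = EXPERIENCE_RE.search(text)
    return match.group(1) if match else None
```

Resumes that never state a number next to "experience" return None, so those rows would still need manual cleanup in the spreadsheet.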

Any hints / help with this would be greatly appreciated. I use KM (Keyboard Maestro), but I'm a "basic" user and this is beyond my ability.

Hey Kurt,

This is easily done, but you must install a Unix command-line tool to read the PDFs.

Xpdf Download Page

Xpdf Mac Binary Direct Download

Download the binary.

Follow the instructions in the INSTALL file to install the command-line tools and man pages.

Give the INSTALL file a .txt suffix, so you can read it in TextEdit (or whatever).

Create any directories that don’t already exist.

Use ⇧⌘G in the Finder to bring up the Go to Folder sheet and navigate to directories such as:

/usr/local/bin

Once you have finished installing everything get back to me, and we’ll talk some more.

Also – if you can find one or two of these PDFs that I can test with it would be helpful.

-Chris

Thanks @ccstone!

My requirements have changed a bit since I wrote this. Here is what I think the process looks like:

Open a text file
Do while I have pdf files in a folder

  • Read a pdf
  • Search and copy the employee name to variable x
  • Search for the Education Level
  • If found, assign to variable xx
  • Paste x (employee name) to a text file
  • Keyboard a comma (for a csv file)
  • Paste xx to the text file
  • Move the pdf file to an archive folder
    Rinse, repeat
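The last three bullets (write the name, a comma, the value, then move the file) could look roughly like the sketch below. It is in Python rather than AppleScript, and the function name record_and_archive and its parameters are hypothetical, chosen only for this illustration.

```python
import csv
import shutil
from pathlib import Path


def record_and_archive(pdf_path, name, education, csv_path, archive_dir):
    """Append one 'name,education' row to the CSV, then move the PDF to the archive."""
    pdf_path = Path(pdf_path)
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)

    # csv.writer handles quoting, so a name like "Smith, Jane" stays in one column.
    with open(csv_path, "a", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerow([name, education])

    # Move only after the row is written, so a failed write leaves the PDF in place.
    shutil.move(str(pdf_path), str(archive_dir / pdf_path.name))
```

Letting the csv module write the comma instead of "keyboarding" it avoids broken rows when a name itself contains a comma.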

I don’t know how to search for the education level. Should we use some kind of list like:
AS
A.S.
BA
B.A.
BS
B.S.
MS
M.S.
M.A.
MA
MBA
PhD
etc…
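
A list like this can be turned into a single pattern so that "BS" and "B.S." do not need separate entries. The following is a minimal Python sketch, not something from the thread; DEGREES, degree_pattern, and find_degrees are names invented for the example. The match is case-sensitive on purpose, since "as" and "ma" are ordinary words, but abbreviations such as "MS" can still produce false positives (for example "MS Office").

```python
import re

# Degree abbreviations from the list above; extend as needed.
DEGREES = ["AS", "BA", "BS", "MS", "MA", "MBA", "PhD"]


def degree_pattern(degrees):
    """Build a regex matching each abbreviation with or without periods."""
    alternatives = []
    # Longest first so "MBA" is tried before the two-letter degrees.
    for degree in sorted(degrees, key=len, reverse=True):
        # "BS" becomes B\.?S\.? which matches both "BS" and "B.S."
        alternatives.append("".join(re.escape(ch) + r"\.?" for ch in degree))
    # Lookarounds instead of \b so a trailing period can be part of the match.
    return re.compile(
        r"(?<![A-Za-z.])(?:" + "|".join(alternatives) + r")(?![A-Za-z])"
    )


def find_degrees(text):
    """Return every degree abbreviation found in the text, in order."""
    return [m.group(0) for m in degree_pattern(DEGREES).finditer(text)]
```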

I have Xpdf installed.
The directory (“Sample pdfs”) is on my Desktop, with an archive subfolder (“Archive”).

I would love to upload my resume, but this website won’t allow me to upload a pdf.

I do have PDFpenPro 8.0.2

I appreciate your help with this!

Hey Kurt,

Zip it.  :wink:

-Chris

Hey Kurt,

Oh, heck. Let’s do this the easy way (for me).

Please also download and install the Satimage.osax AppleScript Extension.

http://www.satimage.fr/software/en/downloads/downloads_companion_osaxen.html

It adds regular expressions to AppleScript (amongst other things) and will make this task much easier.

-Chris

Kurt Kessler - resume.pdf.zip (257.0 KB)

Thanks! I installed the Satimage software. Also, I OCR'd the PDF using PDFpenPro and figured out how to use Hazel to OCR all the files in the folder...

Hey Kurt,

Don’t do that.  :wink:

The command-line tool will be more accurate – provided these are real PDFs and not images saved as PDF.

-Chris

ok. Got it.

Hey Kurt,

Okay, let’s start out simple.

Make sure you have at least one PDF in the “Sample pdfs” folder on the Desktop, and run this script from Script Editor.app.

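-- Build an alias to the "Sample pdfs" folder on the Desktop.
set sourceFolder to alias ((path to desktop as text) & "Sample pdfs")
tell application "Finder"
   -- Take the first file in the folder as the test PDF.
   set thePdfFile to first file of sourceFolder as alias
end tell
-- Quote the POSIX path so spaces survive the shell.
set thePdfFile to quoted form of (POSIX path of thePdfFile)

-- Extend PATH so the shell finds pdftotext, then send the extracted text to stdout ("-").
set shCMD to "
export PATH=/opt/local/bin:/opt/local/sbin:/usr/local/bin:$PATH;
pdftotext -layout " & thePdfFile & " -
"
do shell script shCMD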

You should end up with the text of the resume.
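
For comparison, the same pdftotext call can be made from Python. This is only a sketch under a few assumptions: pdftotext is already on the PATH, the helper names pdftotext_command and pdf_to_text are made up here, and the "-enc UTF-8" option is an addition that is not in the command above (it is there so the output decodes predictably).

```python
import subprocess


def pdftotext_command(pdf_path):
    """Argument list equivalent to: pdftotext -layout -enc UTF-8 <file> -"""
    # "-layout" keeps the physical page layout; the final "-" sends text to stdout.
    return ["pdftotext", "-layout", "-enc", "UTF-8", str(pdf_path), "-"]


def pdf_to_text(pdf_path):
    """Run pdftotext (assumed to be on PATH) and return the extracted text."""
    result = subprocess.run(
        pdftotext_command(pdf_path),
        capture_output=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout
```

Passing the arguments as a list plays the role of "quoted form of" in the AppleScript: a path with spaces, such as the "Sample pdfs" folder, needs no extra quoting.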

-Chris

hmmm…

I have a pdf in the folder. I get the error message

error “File alias /Users/KurtKessler/Desktop/Sample pdfs of «script» wasn’t found.” number -43

My bad…this script works fine! It returns the text.

Hey Kurt,

Okay, now run this one.

I’m using your resume as a model for this, so drop any others into a temp folder for this test.

-------------------------------------------------------------------------------------------
# dNam: Kurt Kessler → KM Forum → Working
# dCre: 2016/07/29 13:12 
# dMod: 2016/07/29 14:02
-------------------------------------------------------------------------------------------

set sourceFolder to alias ((path to desktop as text) & "Sample pdfs")
tell application "Finder"
   set thePdfFile to first file of sourceFolder as alias
end tell
set thePdfFile to quoted form of (POSIX path of thePdfFile)

set shCMD to "
export PATH=/opt/local/bin:/opt/local/sbin:/usr/local/bin:$PATH;
pdftotext -layout " & thePdfFile & " -
"
set pdfText to do shell script shCMD
set educationText to fndUsing("(?m)^(Education.*\\s.*)(?=(^\\w|\\Z))", "\\1", pdfText, false, true) of me

-------------------------------------------------------------------------------------------
--» HANDLERS
-------------------------------------------------------------------------------------------
on cng(_find, _replace, _data)
   change _find into _replace in _data with regexp without case sensitive
end cng
-------------------------------------------------------------------------------------------
on fnd(_find, _data, _all, strRslt)
   try
      find text _find in _data all occurrences _all string result strRslt with regexp without case sensitive
   on error
      return false
   end try
end fnd
-------------------------------------------------------------------------------------------
on fndUsing(_find, _capture, _data, _all, strRslt)
   try
      set findResult to find text _find in _data using _capture all occurrences _all ¬
         string result strRslt with regexp without case sensitive
      return findResult
   on error
      return false
   end try
end fndUsing
-------------------------------------------------------------------------------------------

NOTE that this is only a demonstration. I expect the various resume formats will not be uniform and will require more clever parsing.
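
As one possible direction for less uniform resumes, here is a line-based sketch in Python (not the Satimage regex above, and not tested against real resumes). It assumes section headings start in the first column and section contents are indented, which is roughly what pdftotext -layout tends to produce but is by no means guaranteed. The function name education_section is hypothetical.

```python
def education_section(text):
    """Return the block starting at an 'Education' heading, or None if absent."""
    section = []
    collecting = False
    for line in text.splitlines():
        if not collecting:
            # Start at the first line that begins with "Education" (any case).
            if line.lower().startswith("education"):
                collecting = True
                section.append(line)
        elif line and not line[0].isspace():
            # Next unindented line: assume it is the following section heading.
            break
        else:
            section.append(line)
    return "\n".join(section).strip() or None
```

A resume whose education entries are not indented would return only the heading line, so the heuristic would need adjusting per format.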

-Chris

I copied the script into Script Editor and it appears to error on this line…

set educationText to fndUsing("(?m)^(Education.\s.)(?=(^\w|\Z))", “`”, pdfText, false, true) of me

It doesn’t like the “’”

Expected “"” but found unknown token.

Hell. That’s a bug in the Discourse forum software.

I’ll post a script file for you in a sec.
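
If the only damage were the curly quotes, they could be straightened before pasting into Script Editor. A small Python sketch of that idea follows; the name straighten_quotes is made up for this example. It does not restore characters the forum dropped outright, such as the backslashes and asterisks in the regex, which is why a script file is the safer route.

```python
# Map typographic quotes to the straight quotes AppleScript expects.
SMART_QUOTES = {
    "\u201c": '"',  # left double quote
    "\u201d": '"',  # right double quote
    "\u2018": "'",  # left single quote
    "\u2019": "'",  # right single quote
}


def straighten_quotes(source):
    """Replace curly quotes in pasted source code with straight ones."""
    return source.translate(str.maketrans(SMART_QUOTES))
```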

-ccs

Hey Kurt,

Okay – download this script file – and give it a try:

Kurt Kessler → KM Forum → Working.scpt.zip (10.2 KB)

-Chris

Yep. Works fine. Returns a “false”

It shouldn’t.

Are you running it on your resume?

-Chris

Yes I am.

The exact same one you posted for me?

-Chris

Yes it is.