Extract date from scanned document PDFs (letters, invoices etc.)

I don't think the following can be done with Keyboard Maestro by itself. But perhaps some of you automation gurus have an idea how to do it:

I'm getting lots of scanned letters, invoices, etc. as PDF files. To sort them, I need them to be named after the date of the document they contain.

For instance, a PDF containing a letter that says "June 2nd 2017" would have to be named "170602 XYZ.pdf".

Right now, I'm doing this manually. I go through the PDFs in Finder, open Quick Look by pressing [space], look for the date inside the PDF, and then rename the file by pressing [Enter] and typing the number.

That is an idiot's task, and I'd be thrilled to get this "outsourced" to my computer's CPU.

I know that there are a couple of online services that use artificial intelligence / machine learning to identify data points such as dates or invoice amounts in PDF documents (one such example being autoentry.com).

Another approach I think could be possible is to go down the path OCR > Regular Expression search. Meaning,

  1. OCR each document
  2. Search the text that was recognized for typical date patterns
  3. Convert date patterns into format YYMMDD
  4. Rename the files where recognition was successful
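Steps 2 and 3 could be sketched roughly as below. This is only an illustration, assuming the OCR text is already available as a string; the pattern list, `find_date`, and `new_name` are made-up names, and only a few English date formats are covered. Note that it tries the patterns in order, so it returns the first match of the first pattern that hits, not necessarily the earliest date on the page.

```python
import re
from datetime import date

# Month names mapped to their numbers (English only; an assumption).
MONTHS = {name: i for i, name in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"], start=1)}
MONTH_NAMES = "|".join(MONTHS)

# Each entry: (compiled pattern, function turning a match into (year, month, day)).
PATTERNS = [
    # "June 2nd 2017" or "June 2, 2017"
    (re.compile(rf"\b({MONTH_NAMES})\s+(\d{{1,2}})(?:st|nd|rd|th)?,?\s+(\d{{4}})\b", re.I),
     lambda m: (int(m[3]), MONTHS[m[1].lower()], int(m[2]))),
    # "2 June 2017" or "2nd June 2017"
    (re.compile(rf"\b(\d{{1,2}})(?:st|nd|rd|th)?\.?\s+({MONTH_NAMES})\s+(\d{{4}})\b", re.I),
     lambda m: (int(m[3]), MONTHS[m[2].lower()], int(m[1]))),
    # ISO style "2017-06-02"
    (re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b"),
     lambda m: (int(m[1]), int(m[2]), int(m[3]))),
]

def find_date(text):
    """Return the first valid date found by the patterns above, or None."""
    for pattern, to_ymd in PATTERNS:
        for m in pattern.finditer(text):
            try:
                return date(*to_ymd(m))
            except ValueError:
                # Looks like a date but is not one (e.g. 2017-02-30); keep looking.
                continue
    return None

def new_name(text, suffix):
    """Build a 'YYMMDD suffix.pdf' file name, or None if no date was recognized."""
    d = find_date(text)
    if d is None:
        return None
    return f"{d:%y%m%d} {suffix}.pdf"

print(new_name("Berlin, June 2nd 2017\nDear Sir or Madam,", "XYZ"))
```

Files for which `new_name` returns None would simply be left alone for manual renaming, which matches step 4.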

Any ideas?

I'm attaching three sample documents (contents blurred with the exception of the date which would have to be extracted).

This looks like a job for Hazel.

Hey Guys,

A bit of discussion on the topic:

-Chris

Thank you, @ccstone.

I use OCRKit - it works well with AppleScript.

ABBYY has Automator workflows, too, but I found it unreliable - especially when OCRing documents in bulk. Accuracy is slightly better with ABBYY, I think.

My current workflow is:

  1. Bulk OCR incoming documents using OCRKit
  2. A KBM script that…
  • runs mdimport to get the PDF’s embedded text
  • runs Regular Expressions on that in order to identify the date

It’s not perfect. It fails, for instance, if the OCR introduced wrong characters into the date, or if the sender used a date format that I did not foresee when writing my extraction rules.

It could be made better if I could somehow make the script factor in the position of the date string. So if multiple date patterns match in the document body, the script would always prefer the one that is isolated on the page, and on the top right as opposed to middle/center or bottom of the page.
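To make the idea concrete, here is a rough sketch of how such a preference could be scored. It assumes some OCR step already delivers each date match together with a normalized position and the full line it sits on; the `Candidate` type, its fields, and the weights are all hypothetical and would need tuning against real documents.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    text: str       # the matched date string
    x: float        # horizontal position, 0.0 = left edge, 1.0 = right edge
    y: float        # vertical position, 0.0 = top edge, 1.0 = bottom edge
    line_text: str  # the whole line the match was found on

def score(c):
    """Higher is better: near the top, towards the right, alone on its line."""
    top = 1.0 - c.y
    right = 0.5 * c.x  # arbitrary weight: vertical position counts more
    isolation = len(c.text) / max(len(c.line_text), 1)
    return top + right + isolation

def best_candidate(candidates):
    """Return the most plausible document date, or None if there are no matches."""
    return max(candidates, key=score, default=None)

header = Candidate("June 2nd 2017", x=0.8, y=0.1, line_text="June 2nd 2017")
body = Candidate("May 1st 2017", x=0.3, y=0.5,
                 line_text="As discussed on May 1st 2017, we will")
print(best_candidate([header, body]).text)
```

The isolation term is just the share of the line taken up by the date, so a date standing alone scores 1.0 and a date buried in a sentence scores much lower.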

Perhaps this could be achieved with Google Cloud Vision API, which seems to output recognized text blocks along with their coordinates on the page.

Hello Everyone,

I see this is a 5-year-old thread, but I have the same request.

Have there been any improvements or dedicated scripts in the last 5 years that would make it easier to sort PDFs by an OCR-recognized date (ideally with automated renaming of the file)?

Thank you in advance.

Hazel is the app to get. I use it all the time.

She reads files that enter my Downloads folder, scans them for the date (using OCR), and names each file with a format I set up that includes the date.

Hey @Marcs,

This is a fairly simple task, but to do it well requires some tinkering...

Read at least the first post here:

Keyboard Maestro “Convert PDF Files into Text Files in the Front Finder Window” Macro

You need the 64-bit executables from here:

Download Xpdf and XpdfReader

Once installed, you can feed the path of any given PDF file to an Execute a Shell Script action to extract structured text from it.

pdftotext -layout <POSIX_Path_of_Your_File> -

From there you can go to town.
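If you would rather drive this from a script than from a shell action, a minimal Python wrapper around the same command might look like the sketch below. It assumes `pdftotext` is on the PATH; the helper names `pdftotext_command` and `extract_text` are invented for illustration.

```python
import shutil
import subprocess

def pdftotext_command(pdf_path):
    """Build the argument list for: pdftotext -layout <file> -"""
    return ["pdftotext", "-layout", str(pdf_path), "-"]

def extract_text(pdf_path):
    """Return the layout-preserving text of a PDF, or None if pdftotext is not installed."""
    if shutil.which("pdftotext") is None:
        return None
    # Passing a list (not a shell string) means spaces in the path need no quoting.
    result = subprocess.run(pdftotext_command(pdf_path),
                            capture_output=True, text=True, check=True)
    return result.stdout
```

The returned string can then be searched with whatever date patterns you settle on. Keep in mind that this only yields text if the PDF already has a text layer, so scanned documents still need to be OCRed first.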

-Chris

Thank you!!!

Thank you for your help!!!
