“PDF2Text Test Macro With Word "Confined"” Macro

Can anyone explain why the pdftotext conversion handles the word "confined" the way it does (see both the PDF and the converted file)? The same result occurs in any field where the word "confined" is entered! I don't believe this is a Keyboard Maestro issue, but I often use the pdftotext command for processing files and I'm hoping others do as well. I know the -raw option is a hack that isn't recommended, but it works best for this file and application.

Keyboard Maestro 8.2 “PDF2Text Test Macro With Word "Confined"” Macro

PDF2Text Test Macro With Word "Confined".kmmacros (4.7 KB)

Bill Jones Travel Request To & From Atlanta, GA For Travel Purpose Itinerary Template.zip (175.7 KB)

That certainly sounds peculiar. I can’t, unfortunately, offer any insight into what pdftotext is doing but I have a suggestion.

You might try using the built-in textutil program just to see if it gives you the same problem.

Just replace your red action command line with:

textutil -convert txt -stdout "$KMVAR_Itinerary_Filename"

(You need the -stdout option to write the output to a variable rather than a file.)

mrpasini,

I’m not familiar with textutil, but at first glance it appears it would only be useful for processing the text file after pdftotext has converted the PDF, so the problem would already be there. Remember, I’m starting with a PDF. Am I missing something?

The command I suggested will convert your PDF to text, just like pdftotext. It's an alternative.

Run it from Terminal without the -stdout option to create a text file to compare to your pdftotext output.

Hey Mike,

I believe textutil does not grok PDFs.

-Chris

Hey Anthony,

I’ve poked around this thing a bit with several tools, and it looks to me like the problem lies with whatever app did the OCR.

-Chris

Chris,

I used PDFPenPro to enter the information into the fields, so that’s likely the source of the issue based on your investigation. It’s just odd that I’ve only encountered this issue with the word “confined”. Perhaps I can make an inquiry to Smile.

Thanks

You’re right. It creates a text file but it’s identical to the PDF. Sorry for the goose chase, KM_Panther.

mrpasini,

No worries! I still learned something from your post.

Thanks for the support

Me, too! 🙂

I’ve noticed in my PDF conversions that “fi” and certain other letter combinations (known in typesetting as “ligatures”) often convert weirdly. It would be interesting to see if you get the same result converting a PDF with a word like “confirmation” or “first” or some other word with an “fi” in it.
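
One rough way to check this is to list the lines of the converted text that contain bytes outside plain ASCII, since a ligature that survives conversion as a single character would show up there. This is only a sketch: the function name show_non_ascii_lines and the file name converted.txt are made up for illustration, and it assumes the odd output contains non-ASCII bytes at all.

```shell
# Hypothetical helper: print, with line numbers, every input line that
# contains a byte that is neither printable ASCII nor whitespace.
# LC_ALL=C makes grep work byte by byte; -a keeps it from treating the
# input as a binary file.
show_non_ascii_lines() {
  LC_ALL=C grep -a -n '[^[:print:][:space:]]'
}

# Example use (converted.txt is an assumed file name):
#   show_non_ascii_lines < converted.txt
```

Lines made up only of ordinary letters, digits, punctuation, tabs and page breaks are skipped, so whatever is left is worth a closer look.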

Rolian,

Again, recognizing that this is not a KM problem, thanks for your input. I tested converting a PDF using “First Last” as the employee name and “Training For Resolution Confirmation” as the trip purpose. First did not trigger an issue, but Confirmation did. Seeing that, I then tried “Training For Conflicted Confirmation” for the purpose, which yielded problems with both Conflicted and Confirmation! I’ve emailed Smile technical support about PDFPenPro in hopes they can provide some understanding.

Yes, “fl” would be another ligature. “First” probably worked because the upper-cased F followed by an “i” isn’t a ligature. Based on your testing, I would be surprised if lower-cased “first” didn’t generate the issue as well.

Rolian,

You’re right! I just looked up character ligatures to better understand your input and also tested lower-case “first”, which did trigger the issue. I’ll pass this along to Smile.
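
If the converted text turns out to contain the Unicode ligature characters themselves (U+FB00 through U+FB04), a small filter could spell them back out after conversion. This is only a sketch under that assumption: expand_ligatures is a made-up name, input.pdf is a placeholder, and it won't help if pdftotext emits something else in place of the ligature.

```shell
# Hypothetical filter: replace the five Latin ligature characters
# (U+FB00 ff, U+FB01 fi, U+FB02 fl, U+FB03 ffi, U+FB04 ffl)
# with their plain-letter spellings. Reads stdin, writes stdout.
expand_ligatures() {
  sed -e 's/ﬃ/ffi/g' \
      -e 's/ﬄ/ffl/g' \
      -e 's/ﬀ/ff/g' \
      -e 's/ﬁ/fi/g' \
      -e 's/ﬂ/fl/g'
}

# Example use (a "-" output name sends pdftotext's text to stdout):
#   pdftotext -raw input.pdf - | expand_ligatures
```

Because it runs after the conversion, the -raw layout stays as it is and only the ligature characters change.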

Thank you!

According to this Super User thread, https://superuser.com/questions/220363/cleaning-up-pdftotext-font-issues, you can deal with ligatures in pdftotext by forcing it to encode as ASCII:

pdftotext -enc ASCII7 input.pdf output.txt

Hey Mike,

Good to know, but unfortunately it doesn't work with the document in question.

I've gotten “Confined” in Anthony's object document to render accurately by converting to a Word document, but not everything comes through neatly.

Some of the OCR issues are due to the watermark image and Anthony's signature, but OCR just isn't as good as we want it to be either.

-Chris

PDFs use ligatures for several letter combinations: ‘ff’, ‘ffi’ and, in some fonts, ‘fl’.
You may be able to set your word processor to recognise and render ligatures. Alternatively, you should be able to set PDFPenPro to not use ligatures.

Hey Mike,

Unfortunately I don't see a way to do that.

-Chris

As Chris noted, while the -enc option handles the ligatures, its text output isn’t as usable as what the -raw option gives. The output looks very much like what one gets with the -layout option. Smile technical support has acknowledged the ligature issue, and perhaps as a solution they’ll provide a mechanism to set PDFPenPro not to use ligatures.

PDFPenPro technical support found that using a different font, specifically Arial instead of Helvetica (where the issue was observed), avoids the ligature issue when using pdftotext to convert the PDF to text.