Text columnar layout made from PDF to text scraped into single column

Tunes · February 16, 2017, 6:12am

Need some suggestions from the group. I ran pdftotext using the -layout option and the resulting file output is text in three columns of:
Last Name, First Name
Address
Notes
Email Address

In some cases there are missing lines. For example, some don’t have an address but have children’s names under the notes. Others have an email address and no phone. Basically the common inconsistency you’d see with a contact info sheet.

I want to get the data into a format that can be imported into a database. What’s my best option to move these around and get them into either csv, tab, or paragraph blocks (1 column only)?
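For reference, here is a minimal Python sketch of one way to turn side-by-side columns into a single column. It is only a sketch under assumptions: that the `-layout` output pads with spaces (not tabs), that the columns are separated by a vertical gutter at least two characters wide on every line, and that it is run on one page at a time. The function names `column_spans` and `to_single_column` and the `min_gutter` parameter are made up for illustration.

```python
def column_spans(lines, min_gutter=2):
    """Return (start, end) character offsets of each text column.

    Assumption: a column boundary is a run of at least `min_gutter`
    positions that are blank on every line.
    """
    width = max((len(line) for line in lines), default=0)
    padded = [line.ljust(width) for line in lines]
    # A position counts as blank when every line has a space there.
    blank = [all(row[i] == " " for row in padded) for i in range(width)]

    spans, start, i = [], None, 0
    while i < width:
        if not blank[i]:
            if start is None:
                start = i
            i += 1
            continue
        # Measure the length of this blank run.
        j = i
        while j < width and blank[j]:
            j += 1
        # Only a wide blank run (or trailing blanks) ends a column.
        if start is not None and (j - i >= min_gutter or j == width):
            spans.append((start, i))
            start = None
        i = j
    if start is not None:
        spans.append((start, width))
    return spans


def to_single_column(text, min_gutter=2):
    """Emit column 1 top to bottom, then column 2, and so on."""
    lines = text.split("\n")
    out = []
    for start, end in column_spans(lines, min_gutter):
        for line in lines:
            # Empty strings are kept so blank rows still separate entries.
            out.append(line[start:end].rstrip())
    return "\n".join(out)
```

Whether this works depends entirely on how regular the real spacing is. If the gutters drift from line to line, the offsets would have to be worked out another way.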

JMichaelTX · February 16, 2017, 10:45pm

If we can understand the real-world format of the data, then most likely a RegEx will get you what you want.

But you state "three columns", then show 4 lines.

Please post some real-world examples that will cover the range of possible data. If the data is sensitive, you can change parts of the content (name, phone#, etc), but be sure to keep the format/layout EXACTLY the same as the source data you will need to process.

Tunes · February 16, 2017, 11:58pm

JM, I know you're a stickler for details. The only thing that I didn't mention and should have is that each person's entry is not consistently 4 lines. Entries range from one line (name) up to five lines. The usual format, but not always, is three columns wide, with up to five lines for each name, usually in this order:

NameLast, NameFirst
Spouse:
Street
City, ST, ZIP
Email:

Here's a screen shot:

I circled one that's typical but as you can see from the redacted records they may be really short entries with just a name.

Tom · February 17, 2017, 12:06am

I think @JMichaelTX asked for a real-world example of the output of pdftotext. That is, the plain-text data that will go into the KM macro.

Ideally you would also post a valid example PDF. Because pdftotext is not the only way to get text out of a PDF. Maybe some other solution works better here.

Tunes · February 17, 2017, 12:11am

Thanks Tom. So what you’re saying is rather than a screen shot, instead a text file that’s like what I am working on then?

Tunes · February 17, 2017, 12:13am

I have some dummy data I could mold into a reasonable text file example. The problem I see with it is that the spacing is inconsistent: the list is not laid out with symmetrical spacing the way, say, address labels would be.

Tom · February 17, 2017, 12:17am

Yes.

The data we have to work with, for example:

Temple, Shirley
Spouse: Bill Mayer
Street: Main Street
City, ST, ZIP: Washington, 947564
Email: bla@bla.com

or is it rather…

Temple, Shirley
Spouse: Bill Mayer
Main Street
Washington, 947564
bla@bla.com

It’s not clear. (At least not for me.)
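To illustrate why the difference matters, here is a rough Python sketch that handles only the first, labelled variant. It assumes the first line of a block is always the name and that every other field is written as `Label: value`. The function name `parse_labelled` and the `Unlabelled` key are hypothetical.

```python
import re

# "Label: value" where the label itself contains no colon.
LABELLED = re.compile(r"^(?P<key>[^:\n]+):\s*(?P<value>.*)$")


def parse_labelled(block):
    """Parse one contact block into a dict (assumes line 1 is the name)."""
    lines = [ln.strip() for ln in block.strip().split("\n") if ln.strip()]
    if not lines:
        return {}
    record = {"Name": lines[0]}
    for ln in lines[1:]:
        m = LABELLED.match(ln)
        if m:
            record[m.group("key").strip()] = m.group("value").strip()
        else:
            # No label: we cannot tell a street from a note or an email.
            record.setdefault("Unlabelled", []).append(ln)
    return record
```

With the second variant, everything after the spouse line lands in `Unlabelled`, and telling street, city, and email apart would need heuristics that can only be written against the real data.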

Tom · February 17, 2017, 12:19am

Best thing to do, would be producing a minimal working PDF with 2 or 3 data sets, at least one of them with missing values.

Then post the example PDF as well as the text output from pdftotext.

Edit:

To be clear: it’s way easier to build a RegEx based on real-world text, than on some vague assumptions what the text might look like.

JMichaelTX · February 17, 2017, 12:33am

@Tunes:
Yes. That is exactly what I meant.

No, please do NOT "mold" or change the text file format/layout in any way.
To craft an effective, flexible RegEx, I need to see the data as you will get it from the pdftotext tool.

It needs to include an example of every possible format that you might encounter in the real data.

JMichaelTX · February 17, 2017, 12:37am

@Tunes, I don't need the PDF because I don't plan on processing it.
I want to start with the pdftotext output.

If @Tom finds the PDF useful, then fine, you can post it for him.
But I will ignore the PDF and use only the text file.

ccstone · February 17, 2017, 2:05am

Hey John,

Parsing data requires EXACTING detail.

Without good real-world examples it is impossible to do properly.

Test data with as many of the conceivable anomalies as possible is ESSENTIAL to building routines that work and have any degree of reliability.

People who don't do this kind of work often fail to understand this.

Lacking useful test data you can only make guesses about parsing it — it's like trying to assemble a complex puzzle without being able to see it.

Real work and real testing is required.

-Chris

Tunes · February 17, 2017, 2:26am

I completely understand, Chris. I really do. The other work I do has similar parallels, and in addition I am also doing FileMaker dev. So I know that solutions must be built in place to fit existing architectures.

I have a list of 600 high school alumni names and addresses. I am sure none of the people I have spoken to would care if I posted it here but it’s the 500 on the list I have never spoken to to ask if that would be OK. The list is 20 years old and not likely even accurate. Most of the emails would likely bounce but it’s where I have to start from to attempt to track down classmates for an upcoming reunion.

I suppose what I really need to do is stop working on learning JavaScript right now and chew on this thick steak I’ve been given with RegEx. I have been fiddling with it and am starting to understand the short hand so maybe this is the project for me to finally just give in and finally learn it. RegEx is so necessary and useful it would be stupid for me not to.

JMichaelTX · February 17, 2017, 3:42am

Actually, the JavaScript RegEx engine is very powerful and easy to use.
So a JXA solution might be the best method.

Regardless, whether it is for a project you want to do yourself, or one that you want help with, having a good set of requirements/specs is essential. Time spent getting a good example of source data will repay you in saved time many times over.

You need this good example of source data to use at RegEx101.com.

peternlewis · February 17, 2017, 6:27am

A way of anonymizing the data would be to change all the digits to 3 and all the letters to c. If there are fixed strings you want to preserve (eg “Spouse”), change them to something else, then change them back later.

For example, say the data is like this:

Temple, Shirley
Spouse: Bill Mayer
Street: Main Street
City, ST, ZIP: Washington, 947564
Email: bla@bla.com

Stick it in BBEdit, and then do the sequence:

Search and Replace regex [0-9] with 3
Search and Replace Spouse (case sensitive) with 1
Search and Replace Email (case sensitive) with 2
Search and Replace Name (case sensitive) with 4
Search and Replace regex [a-zA-Z] with c
Search and Replace ccc+ with ccc
Search and Replace 1 with Spouse
Search and Replace 2 with Email
Search and Replace 4 with Name

to get something like:

ccc, ccc
Spouse: ccc ccc
Street: ccc ccc
City, ST, ZIP: ccc, 333333
Email: ccc@ccc.ccc

That way all the required data for figuring out the regex is there, but the confidential information is almost entirely obliterated.
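For anyone who would rather script it than click through BBEdit, the same sequence could be sketched in Python roughly like this. The function name `anonymize` and the `PLACEHOLDERS` table are made up for illustration. The placeholder digits only work because every real digit has already been turned into 3.

```python
import re

# Keywords to keep, each mapped to a spare digit (anything but 3).
PLACEHOLDERS = {"Spouse": "1", "Email": "2", "Name": "4"}


def anonymize(text, placeholders=PLACEHOLDERS):
    # 1. Every digit becomes 3, which frees the other digits as placeholders.
    text = re.sub(r"[0-9]", "3", text)
    # 2. Protect the keywords (case sensitive) behind their placeholder digit.
    for keyword, digit in placeholders.items():
        text = text.replace(keyword, digit)
    # 3. Every ASCII letter becomes c, then long runs collapse to ccc.
    text = re.sub(r"[a-zA-Z]", "c", text)
    text = re.sub(r"ccc+", "ccc", text)
    # 4. Put the keywords back.
    for keyword, digit in placeholders.items():
        text = text.replace(digit, keyword)
    return text
```

Note that any label not in the table (Street, City and so on) gets turned into c's as well, so add every label you want to keep.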

Double check before posting!

JMichaelTX · February 17, 2017, 10:37pm

OK, I'm new to BBEdit.
Is this ("do the sequence") a BBEdit feature?
I don't see how to use it.
In particular I don't understand statements like this one:
Search and Replace Spouse (case sensitive) with 1

I know how to use the BBEdit Find/Replace with RegEx (grep), but I'm not following this sequence of instructions.

ccstone · February 18, 2017, 1:44am

Hey JM,

No. It's not a feature that already exists.

You'd have to do it manually or build it with one of the following methods:

AppleScript
Text Filter
Text Factory

It's easy enough to build a sequence of replace commands with AppleScript:

tell application "BBEdit"
   tell front text window's text
      replace "\\d" using "9" options {search mode:grep, case sensitive:false, starting at top:true}
   end tell
end tell

** JXA would give even more processing options.

Or you can use a simple text filter using sed:

#!/usr/bin/env bash
sed -E 's![[:digit:]]!9!g'

** It's very easy to build up multiple find/replace statements with sed.

Filtering with real intelligence could be done with AppleScript but would be slow.

Real intelligence - meaning to retain the exact number of characters and capitalization in spousal names for instance.

To do this level of processing I'd use Perl, because I know it well enough – and it would be very fast.

I think Peter misspoke on that one.

-Chris

JMichaelTX · February 18, 2017, 1:51am

Thanks for the clarification, Chris.

If Peter's method (corrected) will work in general, it would be cool if we could build a KM Macro that does this automatically. Then, when any user asks for help with stuff like this, we can give them the Macro they can use before posting the sample data.

peternlewis · February 18, 2017, 9:59am

I only really intended the sequence to be done manually, one after the other, though BBEdit does have Text Factories that do exactly this (and Keyboard Maestro has an action to run a Text Factory).

You could probably build a macro that asked for a list of keywords, and then applied the transformations. Generalising the preserving of the keywords would not be too hard (assuming the keywords did not have numbers in them, and even then with a bit more thought as to how to represent them).

ccstone · February 18, 2017, 12:39pm

I was going to do this with Perl, but I had some trouble I couldn't solve with my regex pattern. The pattern works fine in BBEdit and ICU regex but not in Perl, and I don't know why yet.

So – with help from Shane Stanley I assembled something in AppleScriptObjC I'm reasonably happy with.

The goal was:

Replace all lowercase letters with an x and uppercase letters with an X, thus preserving the exact layout.
Replace all digits with a 9.
Leave punctuation intact.
Leave designated label names intact.

TextEdit is currently the input and output agent.

The label delimiter is currently fixed as a colon, although if the macro sees any real use it is bound to gain some flexibility.
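The macro itself is AppleScriptObjC. Purely as an illustration of the goals above, and not the macro's actual code, the same masking could be sketched in Python like this. The names `mask` and `sanitize` are hypothetical, and the sketch assumes a label only counts when it starts a line (after any indentation) and is followed directly by a colon.

```python
def mask(s):
    """Mask letters and digits one for one so the layout is unchanged."""
    out = []
    for ch in s:
        if ch.isdigit():
            out.append("9")
        elif ch.islower():
            out.append("x")
        elif ch.isupper():
            out.append("X")
        else:
            out.append(ch)  # punctuation and whitespace stay intact
    return "".join(out)


def sanitize(text, labels):
    """Mask every line, leaving a leading 'Label:' from `labels` readable."""
    result = []
    for line in text.split("\n"):
        stripped = line.lstrip()
        indent = line[: len(line) - len(stripped)]
        for label in labels:
            prefix = label + ":"
            if stripped.startswith(prefix):
                result.append(indent + prefix + mask(stripped[len(prefix):]))
                break
        else:
            # No designated label on this line: mask all of it.
            result.append(mask(line))
    return "\n".join(result)
```

Calling `sanitize(text, ["Name", "Phone", "Spouse", "Address"])` on the sample text below would leave those four labels readable and mask everything else character for character.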

---

### Instructions

Place the text you want to sanitize in the front TextEdit document.

Run the macro.

A new document with the sanitized text will appear.

---

### Sample Text

Name: Robert A. Heinlein
Phone: (666) 666-6666
Spouse: Virginia Gerstenfeld Heinlein
Address: 2019 E. Tycho Ave.

      Name: George Stephanopoulos
      Spouse: Alexandra Wentworth

Extra text 01

Extra text 02

Extra text 03

Sanitize Text for Public Consumption.kmmacros (6.3 KB)

Enjoy.

-Chris

ccstone · February 18, 2017, 12:55pm

Text factories are relatively slow – even when run directly from BBEdit. Their main advantage is they require little expertise to assemble.

AppleScripting BBEdit directly is much faster.

Parsing text directly is faster still.

I would have preferred to use BBEdit (or TextWrangler) as the input/output agent for my macro, but since everyone who has a Mac has TextEdit it got the nod.

True.

My macro currently relies on the user to provide keywords they want to leave in the clear, but it shouldn't be too hard to automate that to a fair degree.

-Chris
