Macro for removing text from PDF and importing into Numbers spreadsheet

Hi guys,

I'm a bit of a beginner, so please bear with me here.

For my work, I have to deal with very large PDF documents, remove snippets of information from them and reduce that information into a chronology (I'm a lawyer). I find the best tool for a chronology is actually Numbers, which allows basic formatting and sort by date etc.

The workflow I currently employ (with PDF and spreadsheet open side-by-side) is:

  1. Select text from PDF.
  2. Revert to spreadsheet
  3. Manually input the date in column one. Tab into column two.
  4. Manually input the page number from the pdf in column two (so I can find the source again if I need it). Tab into column three.
  5. Paste the information from the PDF into column three.
  6. Revert to PDF, repeat when necessary.

I've been thinking about how I could automate this more. I would be really grateful for any input from the hive mind.

I am currently thinking:

  1. Select text from PDF.
  2. Launch KBM shortcut (stages 1 and 2 could be integrated into a single step).
  3. Then, a prompt user box could appear for input of the date and possibly the page number of the PDF. But this would still require manual input.
  4. All the information is then dropped into the spreadsheet without me having to actually do that bit manually.

At stage 3 I was wondering if there is a way for KBM to "read" the number of the page I have taken the information from on the PDF (I use PDF Expert)? But I suspect that might require AppleScript, which I fear is a bit beyond my capabilities at the moment.

I'm also wondering how, at stage 4, I could import this data into the spreadsheet without having to actually go to it, and ensure the data is added to a new row each time. Some help here would be gratefully received!

I was thinking that the best way to assign the data into columns would be to use tokens, rather than a clipboard manager. Does that sound about right?

If anyone could offer me some help and guidance, I would really appreciate it.

Thanks so much for taking the time to read.

Manually input what date? If it's the date you are doing this then that's easy enough; if you're trying to extract the date from the PDF (or the copied selection) that might be trickier (depending on formatting, consistency, etc.).

Unfortunately, PDF Expert has no AppleScript support -- it could be that you'll have to do the page number manually or switch to another PDF reader, but you might be able to do something with KM's OCR action if you set PDF Expert's Preferences to always show the page number indicator (that appears to have absolute positioning relative to the bottom-right corner of the document window). I'm assuming you want the page number of the PDF, not the number "written" on that page of the PDF!

Stage 4 will also require some cleverness -- if you don't interact with the spreadsheet during a "session" and are happy to select the first empty cell at the start, it's as simple as "paste tab-delimited string from clipboard, then keypress down-arrow" (tab-delimited because Numbers auto-splits it into columns on paste). If you do manually change the selected cell during the "session" we'll have to find some way of getting to the "first empty cell of column X" every time...
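
As a rough illustration of the tab-delimited idea, here is a minimal Python sketch. Python is just a choice for illustration, and the function name `build_row` and the sample page number are made up; none of this comes from a macro in this thread. It only builds the string. Putting it on the clipboard, pasting, and pressing down-arrow would still be Keyboard Maestro actions.

```python
def build_row(date: str, page: str, text: str) -> str:
    """Join the three fields with tabs so one paste fills three adjacent cells."""
    return "\t".join([date, page, text])


# Hypothetical sample values, for illustration only.
row = build_row("4/3/21", "12", "Paul went to the shops")
print(repr(row))
```

Note that this assumes none of the fields contains a tab or a return, since either would push data into the wrong cell.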

I'm sorry, I should have been clearer!

Let's say the PDF contains a detail that says "On 4 March 2021, Paul went to the shops"

I could select the text "Paul went to the shops" then manually input 4/3/21. I'm not opposed to that, as I can see trying to automate the date completely could add a whole load of complications that would be almost impossible to get around.
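
For illustration only, here is a minimal Python sketch of what parsing a selected date might look like, returning nothing when the text isn't recognised so that a manual prompt can take over. The function name `parse_date`, the list of accepted formats, and the day/month/two-digit-year output are all assumptions, and it assumes English month names.

```python
from datetime import datetime
from typing import Optional

# Formats to try, in order (an assumed list; extend it to suit your documents).
FORMATS = ["%d %B %Y", "%d %b %Y", "%d/%m/%Y", "%d/%m/%y"]


def parse_date(selection: str) -> Optional[str]:
    """Return the date as d/m/yy, or None if no format matches."""
    cleaned = selection.strip()
    for fmt in FORMATS:
        try:
            parsed = datetime.strptime(cleaned, fmt)
        except ValueError:
            continue  # this format didn't fit; try the next one
        return f"{parsed.day}/{parsed.month}/{parsed:%y}"
    return None  # caller falls back to asking the user


print(parse_date("4 March 2021"))       # 4/3/21
print(parse_date("sometime in March"))  # None
```

This only works when the selection is the date itself; a phrase like "On 4 March 2021, Paul went to the shops" would come back as None and need manual entry.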

I'm not opposed to moving to another PDF reader if necessary. I also have access to Adobe and, of course, Preview, albeit Preview isn't my favourite. I do want the page number of the PDF that is displayed in that box, which incidentally I do have set to always display anyway.

Stage 4 is where I am struggling most.

If by selecting the first empty cell you mean once, at the very start of the session, then that is fine. I can live with that without any difficulties.

If I had to select the empty cell every time I wanted to add a new row to the spreadsheet, then I think that would be the type of labour-intensive evil I am trying to avoid.

Thank you for your response, I really do appreciate it.

Here's a down payment on a full macro. It grabs the selected text and page number from Acrobat Reader, asks you for the date, and displays the three pieces of data in a window.

Acrobat Reader Selection.kmmacros (5.8 KB)

The macro assumes that Reader is active and you have the text selected. It uses Keyboard Maestro's Copy action to get the selection and put it in a variable (it also gets rid of the extra line feeds you sometimes get when copying multiple lines of text in a PDF—you can delete that action if you want). It uses a short AppleScript to get the page number of the PDF.

Currently, the macro doesn't put the data in Numbers; it just displays the information in a window. If this works more or less the way you want, a bit of AppleScript can be added to put the data in a spreadsheet.

I built the macro to work with Acrobat Reader because you said you could use it. Skim could also be used, but it's a more obscure PDF reader. Preview, unfortunately, is not an option because it doesn't have any real AppleScript support.

Update
I guess you don't really need AppleScript to insert the data into Numbers—you could use a series of Tabs and Insert Texts. I tend to think AppleScript would be more reliable, though.

Firstly, thank you so much for taking the time to reply and, secondly, for providing a macro. That is very kind of you.

When I said Adobe, I actually meant Pro DC, but rather than messing with the macro, I've downloaded Reader for the sake of ease and I have imported the macro.

Unfortunately, it doesn't work. When I press the shortcut, nothing happens. When I go into the editor and click "play" I get a beep.

Do I need to run debugging?

Thanks again.

Update

I inserted a pause of 0.1 seconds between each step, and it now works!

I'm getting the following confirmation box up on completion.

"To be completed" was the highlighted text.

B3 is the page number, so this is working perfectly up to this point.

Thanks

I played with OCRing PDF Expert's page number display, but it simply wasn't reliable. So I rewrote what I'd done to work with Acrobat Pro, shamelessly stealing @DrDrang's AppleScript for getting the page number.

Change the shortcuts to suit yourself but, as it stands:

  • Set up Numbers so the empty cell you want to start filling from this session is selected.
  • Switch to Acrobat and select the date you want to record, then ⌘⌥C to trigger the macro. It'll attempt to parse the date -- if unsuccessful you'll get a dialog asking you to enter it.
  • It'll then pause while you select the other text you want to copy -- once that's selected hit ⌘⌥⌃C and that'll be collected and the whole pasted into Numbers.
  • You'll be returned to Acrobat, ready for the next

It'll break if you've got tabs or returns in the copied text (Numbers will shift cells/rows accordingly) -- if that's an issue you could "Search and Replace" them for other characters.
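
To show the kind of replacement meant here, a minimal Python sketch follows. The function name `clean_for_cell` is hypothetical and this is not part of the attached macro; it assumes you are happy for every run of tabs, returns, and spaces to become a single space. In Keyboard Maestro itself, a "Search and Replace" action on the clipboard or a variable would do the equivalent job.

```python
def clean_for_cell(text: str) -> str:
    """Collapse every run of whitespace (tabs, returns, spaces) into one space."""
    # str.split() with no arguments splits on any whitespace and drops empties.
    return " ".join(text.split())


# Hypothetical copied text containing a return and a tab.
print(clean_for_cell("Paul went\r\nto the\tshops"))  # Paul went to the shops
```

The trade-off is that deliberate paragraph breaks in the copied text are lost too, so the whole snippet ends up as one line in the cell.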

Text From Adobe to Numbers.kmmacros (9.9 KB)

And @DrDrang is right -- it'd be much better to add the data to Numbers via AppleScript. He's typing right now -- I'm sure a better solution than mine is incoming!

I suspect you can get exactly the same behavior with Pro, but because I don't have it, I can't tell you how to adjust the macro. Maybe someone else here can.

That seems like too much pausing, but different computers run at different speeds. If you're annoyed by the delays, I'm sure you can delete some of the pauses. But I can't tell you which ones; you'll have to experiment.

Do you want to finish this off by adding actions that switch to Numbers and insert the three pieces of data with Insert Text? Or do you want me to try out an AppleScript that will do it?

Thanks to both @Nige_S and @drdrang for your helpful replies.

I aim to set some time aside this Saturday to work through this and try and get it sorted!

Thanks again