Extract PDF page numbers of specific Text? - Kind of indexing


#1

I have a 400-page document with a lot of #xxxx tags (random unique numbers) in the text, which I use to easily access/search for things.

I was wondering if there's a way I could create a list with the following details:

#xxxxx (PDF-page-number) (Possibly the header text that's before that #xxxx in the line)

As there's a table of contents in the file, the #xxxxx tags can also appear more than once.
(I will later have to figure out a way to ignore the ones on the TOC pages.)

Would really appreciate if someone could help.

Thanks.


#2

Hey @forums2012,

That's probably doable, but a sample document would be required to test with.

-Chris


#3

Thanks a lot for the help.
Finally created the sample document.
Sample document.zip (49.4 KB)

Uploaded Sample Document.pdf as zip file.

Would be amazing if this works out. I have a 400-page document with ~2000 such tags to process, so this will make my life so much easier.

Really appreciate your help.

Thanks a lot!


#4

Hey @forums2012,

Okay, now we're cookin' with gas...

Open Script Editor (called AppleScript Editor on older systems) and paste in this script:

----------------------------------------------------------------
# Auth: Christopher Stone
# dCre: 2019/01/20 06:45
# dMod: 2019/01/20 07:06
# Appl: AppleScriptObjC, Finder
# Task: Extract Text from a PDF File Selected in the Finder Page by Page.
# Libs: None
# Osax: None
# Tags: @Applescript, @Script, @ASObjC, @Extract, @Text, @PDF, @Selected, @Finder, @Page
# Vers: 1.00
----------------------------------------------------------------
use AppleScript version "2.4"
use framework "Foundation"
use framework "Quartz"
use scripting additions
----------------------------------------------------------------

tell application "Finder"
   set finderSelectionList to selection as alias list
   if length of finderSelectionList is not 1 then error "Problem with Finder Selection!"
   set thePath to POSIX path of item 1 of finderSelectionList
end tell

set theText to current application's NSMutableString's |string|()
set anNSURL to current application's |NSURL|'s fileURLWithPath:thePath
set theDoc to current application's PDFDocument's alloc()'s initWithURL:anNSURL
set theCount to theDoc's pageCount() as integer

repeat with i from 1 to theCount
   set thePageText to (theDoc's pageAtIndex:(i - 1))'s |string|() as text
   
   if i = 1 then --» DEBUGGING -- Change the Index Number to See a Given Page.
      return thePageText
   end if
   
end repeat

----------------------------------------------------------------

Then open the Log History window (under the Window menu).

SELECT your test PDF file in the Finder and run the script.

Change the index in the DEBUG line to see different pages.

This is our starting point.

Once you've seen the text we can extract per page with the script, you need to show me by example the kind of output you're looking for.

-Chris


#5

Just tried this. It seems like it's just using the TOC to extract data.

If you go to page 2 of the PDF, you'll see that some subheadings also have the #xxxxx tag.

Also, there will be a few (definitely very few, though) #xxxxx tags in normal text lines.

Neither of these is in the TOC.

Also, there will be multiple TOC pages, since each set of subject notes has its own TOC and the final PDF is multiple notes PDFs combined.

The Output I need is pretty easy:
#xxxxx - Actual Page number it's present on (not from TOC) - (If possible heading text)

My purpose of giving the TOC page was just to mention that the #xxxxxx will be on each TOC page as well, so it would be great if those could be excluded

(although not too much of a hassle as I can just manually remove all references of TOC pages {maybe about 20 pages} later on)

Again, Thanks so much.


#6

Hey @forums2012,

I don't know what you mean by that.

The script is pulling the text from page 1 of the PDF document, and that's what it looks like.

Unfortunately the formatting of text data is NOT preserved when extracting it from a PDF.

And so? (You haven't made a point.)

Unless you can provide a good way to discriminate what is and is not a TOC page they will be hard to filter out.

"The Output I need is pretty easy:"

Easy for you to say...  :sunglasses:

Run this script the same way you ran the last one and look at the output:

----------------------------------------------------------------
# Auth: Christopher Stone
# dCre: 2019/01/20 06:45
# dMod: 2019/01/20 11:30
# Appl: AppleScriptObjC, Finder
# Task: Extract Text from a PDF File Selected in the Finder Page by Page.
# Libs: None
# Osax: None
# Tags: @Applescript, @Script, @ASObjC, @Extract, @Text, @PDF, @Selected, @Finder, @Page, @KMForum
# Vers: 1.00
----------------------------------------------------------------
use AppleScript version "2.4"
use framework "Foundation"
use framework "Quartz"
use scripting additions
----------------------------------------------------------------

tell application "Finder"
   set finderSelectionList to selection as alias list
   if length of finderSelectionList is not 1 then error "Problem with Finder Selection!"
   set thePath to POSIX path of item 1 of finderSelectionList
end tell

set theText to current application's NSMutableString's |string|()
set anNSURL to current application's |NSURL|'s fileURLWithPath:thePath
set theDoc to current application's PDFDocument's alloc()'s initWithURL:anNSURL
set theCount to theDoc's pageCount() as integer

repeat with i from 1 to theCount
   set thePageText to (theDoc's pageAtIndex:(i - 1))'s |string|() as text
   
   set AppleScript's text item delimiters to linefeed
   
   if i = 1 then --» DEBUGGING -- Change the Index Number to See a Given Page.
      set outputText to (its regexFindWithCapture:"(?m-s)^(.+?)(?=#\\d{2,})(#\\d{2,}.*)" fromString:thePageText resultTemplate:("$2 " & "p." & i & " $1")) as text
      return outputText
   end if
   
end repeat

----------------------------------------------------------------
on regexFindWithCapture:thePattern fromString:theString resultTemplate:templateStr
   set theString to current application's NSString's stringWithString:theString
   set theRegEx to current application's NSRegularExpression's regularExpressionWithPattern:thePattern options:0 |error|:(missing value)
   set theFinds to theRegEx's matchesInString:theString options:0 range:{0, theString's |length|()}
   set theResult to current application's NSMutableArray's array()
   
   repeat with aFind in theFinds
      set foundString to (theRegEx's replacementStringForResult:aFind inString:theString |offset|:0 template:templateStr)
      (theResult's addObject:foundString)
   end repeat
   
   return theResult as list
   
end regexFindWithCapture:fromString:resultTemplate:
----------------------------------------------------------------

-Chris
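
For anyone who wants to try the pattern outside of AppleScript, here is a minimal Python sketch of the same regular expression, applied to every page rather than a single debug page. It assumes the per-page text has already been extracted by some other means; `find_references`, `TAG_LINE` and the sample page texts are made-up names and data for illustration, not part of the script above.

```python
import re

# Same pattern as in the AppleScript, written for Python's re module:
# group 1 = text before the first tag on a line, group 2 = the tag(s) to end of line.
TAG_LINE = re.compile(r"^(.+?)(?=#\d{2,})(#\d{2,}.*)", re.MULTILINE)

def find_references(page_texts):
    """Return '<tags> p.<page> <heading>' strings for all pages (1-based)."""
    results = []
    for page_number, text in enumerate(page_texts, start=1):
        for match in TAG_LINE.finditer(text):
            heading = match.group(1).strip()
            tags = match.group(2).strip()
            results.append(f"{tags} p.{page_number} {heading}")
    return results

# Hypothetical page texts standing in for what the PDF extraction returns.
sample_pages = [
    "Contents\nHeader Text #3748\nAnother header Text #3248\n",
    "Header Text #3748\nSome body text without a tag\n",
]
references = find_references(sample_pages)
for line in references:
    print(line)
```

Like the original pattern, this only matches lines that have some text before the first tag, so a line that starts with a tag is skipped.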


#7

Sorry for not being clear. :slight_smile:

I want to extract the #xxxxx tags from all pages of the PDF, not just the 1st page.

In the output I received, I only got #xxxxx tags from the TOC page (page 1 of the PDF) and not from the other pages, which is not what I am trying to capture. As a start, it would help if this became a search across all pages rather than just the first one.

Current Output:

"#3748 p.1 - Header Text

#3233 p.1 - Would be cool if we can extract this text also

#3272 #1234 p.1 - Some headings might have multiple tags

#3248 p.1 - Another header Text

#3133 p.1 - Would be cool if we can extract this text also

#3242 #1634 p.1 - Some headings might have multiple tags "

The Most Ideal Output I am looking for:

"#3748 p.2 - Header Text

#3233 p.2 - Would be cool if we can extract this text also

#3272 #1234 p.2 - Some headings might have multiple tags (If this can be separated as one line for each as below)

#3272 p.2 - Some headings might have multiple tags
#1234 p.2 - Some headings might have multiple tags

#12213 p.2 - Very rarely some text may also have some tags (this is added text on other pages, not in the TOC) (TBH, I don't need the text here, just the #xxxx p.x)
#3248 p.2 - Another header Text

#3133 p.2 - Would be cool if we can extract this text also

#3242 #1634 p.2 - Some headings might have multiple tags " (similar breakdown as above)

As you can see, I am basically looking for the page where the actual notes are, not the page number where they're mentioned in the TOC.
(However, I am OK if both instances {one in the TOC and one in the actual notes} are recorded; I can then manually remove all instances from the TOC pages. It will be a manual task, but I don't mind doing that.)

Again, really appreciate all the help!
Thanks a lot.
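
The "one line per tag" breakdown requested above could look roughly like the following Python sketch. It assumes a tag is a `#` followed by two or more digits (the same assumption the regex in the earlier script makes) and that everything else on the line is the heading; `split_tags` is a hypothetical helper name.

```python
import re

# A tag is assumed to be '#' followed by at least two digits.
TAG = re.compile(r"#\d{2,}")

def split_tags(line, page_number):
    """Return one '<tag> p.<page> - <heading>' string per tag found in line."""
    tags = TAG.findall(line)
    # Whatever remains after removing the tags is treated as the heading text.
    heading = TAG.sub("", line).strip(" -\t")
    heading = re.sub(r"\s{2,}", " ", heading)
    return [f"{tag} p.{page_number} - {heading}" for tag in tags]

rows = split_tags("Some headings might have multiple tags #3272 #1234", 2)
for row in rows:
    print(row)
```

A line without any tag simply yields an empty list, so it can be run over every line of every page.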


#8

Yes, I know.

As I told you before:

This line is there to let you examine the output of a given page:

if i = 1 then --» DEBUGGING -- Change the Index Number to See a Given Page.

You change i = 1 to i = 2, so you can see the 2nd page, 3rd page, etc.

This is all very preliminary, and you have not until now shown by example what you want the output to look like.

Explaining in words is next to useless – real examples are required (ideally real-world and not made-up).

Yes – and as I told you – unless you are able to provide a reliable method for determining what entries are from the TOC and what are not – there is no way to automatically filter them out.

-Chris
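
One possible heuristic, offered only as a sketch: if every TOC comes before the notes it indexes, then for any tag found on more than one page, the earlier hits are probably the TOC entries, and only the last page needs to be kept. That ordering is an assumption about the document, not something confirmed in this thread, and a tag that appears only in a TOC would still be reported with its TOC page. `drop_earlier_duplicates` and the sample data are hypothetical.

```python
def drop_earlier_duplicates(references):
    """Keep only the last page on which each tag occurs.

    references: list of (tag, page_number) tuples.
    """
    last_page = {}
    for tag, page in references:
        # Remember the highest page number seen for each tag.
        last_page[tag] = max(page, last_page.get(tag, page))
    return [(tag, page) for tag, page in references if page == last_page[tag]]

# Hypothetical hits: page 1 is a TOC, page 2 holds the actual notes.
found = [("#3748", 1), ("#3233", 1), ("#3748", 2), ("#3233", 2), ("#12213", 2)]
kept = drop_earlier_duplicates(found)
print(kept)
```

If a tag is also mentioned again later in body text, this would keep the later mention instead of the heading, so the result still deserves a manual check.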


#9

Hey @forums2012,

Okay, here's a basic proof-of-concept macro.

Select your single PDF in the Finder and run the macro.

You'll get a pop-up window with a list of all found references.

-Chris


Extract References from PDF v1.00.kmmacros (8.9 KB)


#10

Holy smokes. This worked perfectly!

Can't thank you enough!

Was able to get 2300 references spread over 400 pages flawlessly!

Manually removed the TOC pages later on by using a For Each command. :slight_smile:

Thanks a lot!

You're amazing @ccstone!


#11

This reminds me of a problem I have: adding bookmarks to a PDF for every piece of e.g. 24 point text.

This PDF is generated in a way where I can’t add the bookmarks at generation time.

I’m wondering if the method can:

  1. Find text at 24 point, possibly restricted to a particular font.
  2. Add a bookmark where it finds it.

As this thread is about iterating over a PDF I wonder if it’s the basis for what I want.


#12

Great!

Not including the headers and handlers, that was accomplished in only 26 lines of code.

Such is the power of knowing something about regular expressions and scripting.

And how did you do this? How did you identify the TOC pages?

If I know how you're doing that I can add it into the main script.

I stand on the shoulders of giants...  :sunglasses:

Don't get the idea that I cranked that out in no time though. That macro probably took 3 solid hours of work and testing -- not including communications with you trying to get a complete picture of the task specifications.

-Chris


#13

Hey Martin,

All the AppleScriptObjC in my macro does is extract plain text on a page-by-page basis.

You can ask Shane Stanley over on the Script Debugger Forum if some of what you want is possible.

Otherwise the only option I can think of is Skim, but it is no pleasure to script – nor do I see a way to edit content via AppleScript.

You can find things based on Attribute Runs, so you can do something like this:

tell application "Skim"
   tell front document
      tell its text
         properties of attribute runs whose font is "Helvetica" and size is 9
      end tell
   end tell
end tell

But as far as I can see this is only good for extracting stuff.

-Chris


#14

Thanks. I’m contemplating seeing if PDF.js can be run from Node.js on my laptop. That might be able to do interesting things.