Open Script Editor (called AppleScript Editor on some older systems) and paste in this script:
----------------------------------------------------------------
# Auth: Christopher Stone
# dCre: 2019/01/20 06:45
# dMod: 2019/01/20 07:06
# Appl: AppleScriptObjC, Finder
# Task: Extract Text from a PDF File Selected in the Finder Page by Page.
# Libs: None
# Osax: None
# Tags: @Applescript, @Script, @ASObjC, @Extract, @Text, @PDF, @Selected, @Finder, @Page
# Vers: 1.00
----------------------------------------------------------------
use AppleScript version "2.4"
use framework "Foundation"
use framework "Quartz"
use scripting additions
----------------------------------------------------------------
tell application "Finder"
    set finderSelectionList to selection as alias list
    -- Require exactly one selected item; an empty selection would otherwise fail at "item 1".
    if length of finderSelectionList ≠ 1 then error "Problem with Finder Selection!"
    set thePath to POSIX path of item 1 of finderSelectionList
end tell
set theText to current application's NSMutableString's |string|()
set anNSURL to current application's |NSURL|'s fileURLWithPath:thePath
set theDoc to current application's PDFDocument's alloc()'s initWithURL:anNSURL
set theCount to theDoc's pageCount() as integer
repeat with i from 1 to theCount
    -- pageAtIndex: is zero-based, so page i lives at index (i - 1).
    set thePageText to (theDoc's pageAtIndex:(i - 1))'s |string|() as text
    if i = 1 then --» DEBUGGING -- Change the Index Number to See a Given Page.
        return thePageText
    end if
end repeat
----------------------------------------------------------------
Then open the Log History window (under the Window menu).
SELECT your test PDF file in the Finder and run the script.
Change the index in the DEBUG line to see different pages.
This is our starting point.
Once you've seen the text we can extract per page with the script, you need to show me by example the kind of output you're looking for.
Just tried this. It seems to be using only the TOC to extract data.
If you go to page 2 of the PDF, you'll see that some subheadings also have the #xxxxx tag.
There will also be a few (definitely very few) #xxxxx tags in normal text lines.
Neither of these is in the TOC.
There will also be multiple TOC pages, since each set of subject notes has its own TOC and the final PDF is several notes PDFs combined.
The Output I need is pretty easy: #xxxxx - the actual page number it appears on (not the one from the TOC) - (if possible) the heading text
My purpose in giving the TOC page was just to mention that the #xxxxx tags will be on each TOC page as well, so it would be great if those could be excluded
(although it's not too much of a hassle, as I can manually remove all references to TOC pages, maybe about 20 pages, later on).
The script is pulling the text from page 1 of the PDF document, and that's what it looks like.
Unfortunately the formatting of text data is NOT preserved when extracting it from a PDF.
And so? (You haven't made a point.)
Unless you can provide a good way to discriminate what is and is not a TOC page, they will be hard to filter out.
[quote]
The Output I need is pretty easy:
[/quote]
Easy for you to say...
Run this script the same way you ran the last one and look at the output:
----------------------------------------------------------------
# Auth: Christopher Stone
# dCre: 2019/01/20 06:45
# dMod: 2019/01/20 11:30
# Appl: AppleScriptObjC, Finder
# Task: Extract Text from a PDF File Selected in the Finder Page by Page.
# Libs: None
# Osax: None
# Tags: @Applescript, @Script, @ASObjC, @Extract, @Text, @PDF, @Selected, @Finder, @Page, @KMForum
# Vers: 1.00
----------------------------------------------------------------
use AppleScript version "2.4"
use framework "Foundation"
use framework "Quartz"
use scripting additions
----------------------------------------------------------------
tell application "Finder"
    set finderSelectionList to selection as alias list
    -- Require exactly one selected item; an empty selection would otherwise fail at "item 1".
    if length of finderSelectionList ≠ 1 then error "Problem with Finder Selection!"
    set thePath to POSIX path of item 1 of finderSelectionList
end tell
set theText to current application's NSMutableString's |string|()
set anNSURL to current application's |NSURL|'s fileURLWithPath:thePath
set theDoc to current application's PDFDocument's alloc()'s initWithURL:anNSURL
set theCount to theDoc's pageCount() as integer
repeat with i from 1 to theCount
    -- pageAtIndex: is zero-based, so page i lives at index (i - 1).
    set thePageText to (theDoc's pageAtIndex:(i - 1))'s |string|() as text
    -- The list of matches is joined with linefeeds when coerced to text below.
    set AppleScript's text item delimiters to linefeed
    if i = 1 then --» DEBUGGING -- Change the Index Number to See a Given Page.
        -- $1 = text before the tag on the line; $2 = the tag and the rest of the line.
        set outputText to (its regexFindWithCapture:"(?m-s)^(.+?)(?=#\\d{2,})(#\\d{2,}.*)" fromString:thePageText resultTemplate:("$2 " & "p." & i & " $1")) as text
        return outputText
    end if
end repeat
----------------------------------------------------------------
on regexFindWithCapture:thePattern fromString:theString resultTemplate:templateStr
    set theString to current application's NSString's stringWithString:theString
    set theRegEx to current application's NSRegularExpression's regularExpressionWithPattern:thePattern options:0 |error|:(missing value)
    set theFinds to theRegEx's matchesInString:theString options:0 range:{0, theString's |length|()}
    set theResult to current application's NSMutableArray's array()
    repeat with aFind in theFinds
        set foundString to (theRegEx's replacementStringForResult:aFind inString:theString |offset|:0 template:templateStr)
        (theResult's addObject:foundString)
    end repeat
    return theResult as list
end regexFindWithCapture:fromString:resultTemplate:
----------------------------------------------------------------
I want to extract the #xxxxx tags from all pages of the PDF, not just the 1st page.
In the output I received, I only got #xxxxx tags from the TOC page (page 1 of the PDF) and not from the other pages, which is not what I am trying to capture. As a start, it would help if this became a global search over the whole document rather than just the first instance.
Current Output:
"#3748 p.1 - Header Text
#3233 p.1 - Would be cool if we can extract this text also
#3272#1234 p.1 - Some headings might have multiple tags
#3248 p.1 - Another header Text
#3133 p.1 - Would be cool if we can extract this text also
#3242#1634 p.1 - Some headings might have multiple tags"
The Most Ideal Output I am looking for:
"#3748p.2 - Header Text
#3233p.2 - Would be cool if we can extract this text also
#3272#1234p.2 - Some headings might have multiple tags (if possible, separated into one line for each tag, as below)
#3272p.2 - Some headings might have multiple tags
#1234p.2 - Some headings might have multiple tags
#12213 p.2 - Very rarely some text may also have some tags (this is added text in other pages, not in the TOC; TBH, I don't need the text here, just the #xxxx p.x)
#3248p.2 - Another header Text
#3133p.2 - Would be cool if we can extract this text also
#3242#1634p.2 - Some headings might have multiple tags" (similar breakdown as above)
As you can see, I am basically looking for the page where the actual notes are, not the page number where they're mentioned in the TOC.
(However, I am OK if both instances, one in the TOC and one in the actual notes, are recorded. Removing all the TOC-page instances will be a manual task, but I don't mind doing that.)
Again, really appreciate all the help!
Thanks a lot.
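The one-line-per-tag breakdown requested above could be sketched roughly as follows. This is only a minimal sketch, not tested against the real PDF: the handler name `tagLinesFromText:pageNumber:` is hypothetical, it assumes the heading text sits before the tag run on the same line (as the pattern in the script above does), and the " - " separator is an assumption copied from the example output.

```applescript
use AppleScript version "2.4"
use framework "Foundation"
use scripting additions

-- Example call with one of the sample lines from this thread.
set sampleLines to my tagLinesFromText:"Some headings might have multiple tags #3272#1234" pageNumber:2
return sampleLines

-- Hypothetical helper: returns one "#tag p.N - heading" entry per tag found in pageText.
on tagLinesFromText:pageText pageNumber:pageNumber
    set theString to current application's NSString's stringWithString:pageText
    -- Group 1 = text before the first tag run on a line; group 2 = the run of adjacent tags.
    set lineRegEx to current application's NSRegularExpression's regularExpressionWithPattern:"(?m)^(.*?)((?:#\\d{2,})+)" options:0 |error|:(missing value)
    set tagRegEx to current application's NSRegularExpression's regularExpressionWithPattern:"#\\d{2,}" options:0 |error|:(missing value)
    set whiteSpace to current application's NSCharacterSet's whitespaceCharacterSet()
    set resultList to {}
    set lineFinds to lineRegEx's matchesInString:theString options:0 range:{0, theString's |length|()}
    repeat with aFind in lineFinds
        set headingString to (theString's substringWithRange:(aFind's rangeAtIndex:1))
        set headingText to (headingString's stringByTrimmingCharactersInSet:whiteSpace) as text
        set tagRun to (theString's substringWithRange:(aFind's rangeAtIndex:2))
        -- Split a run such as "#3272#1234" into its individual tags.
        set tagFinds to (tagRegEx's matchesInString:tagRun options:0 range:{0, tagRun's |length|()})
        repeat with aTag in tagFinds
            set tagText to (tagRun's substringWithRange:(aTag's range())) as text
            set end of resultList to tagText & " p." & pageNumber & " - " & headingText
        end repeat
    end repeat
    return resultList
end tagLinesFromText:pageNumber:
```

For the sample line, this should give one entry for #3272 and one for #1234, each followed by "p.2 - Some headings might have multiple tags". Only the first tag run on each line is picked up, so a line with tags in two separate places would need a different pattern.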
This line is there to let you examine the output of a given page:
if i = 1 then --» DEBUGGING -- Change the Index Number to See a Given Page.
Change i = 1 to i = 2 to see the 2nd page, to i = 3 to see the 3rd page, and so on.
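Once the per-page output looks right, the debugging test can be dropped so that every page is searched. Here is a minimal sketch of that, assuming the same pattern and the same regexFindWithCapture handler as in the script above; the variable names `outputLines` and `pageHits` and the extra error messages are my own additions for illustration.

```applescript
use AppleScript version "2.4"
use framework "Foundation"
use framework "Quartz"
use scripting additions

tell application "Finder"
    set finderSelectionList to selection as alias list
end tell
if (count of finderSelectionList) ≠ 1 then error "Select exactly one PDF file in the Finder."
set thePath to POSIX path of item 1 of finderSelectionList

set anNSURL to current application's |NSURL|'s fileURLWithPath:thePath
set theDoc to current application's PDFDocument's alloc()'s initWithURL:anNSURL
if theDoc is missing value then error "Could not open the selected file as a PDF."

-- Collect the matches from every page instead of returning after one page.
set outputLines to {}
repeat with i from 1 to (theDoc's pageCount() as integer)
    set pageString to (theDoc's pageAtIndex:(i - 1))'s |string|()
    if pageString is not missing value then
        set pageHits to (my regexFindWithCapture:"(?m-s)^(.+?)(?=#\\d{2,})(#\\d{2,}.*)" fromString:(pageString as text) resultTemplate:("$2 p." & i & " $1"))
        set outputLines to outputLines & pageHits
    end if
end repeat

-- Join the collected lines with linefeeds, then restore the default delimiters.
set AppleScript's text item delimiters to linefeed
set outputText to outputLines as text
set AppleScript's text item delimiters to ""
return outputText

on regexFindWithCapture:thePattern fromString:theString resultTemplate:templateStr
    set theString to current application's NSString's stringWithString:theString
    set theRegEx to current application's NSRegularExpression's regularExpressionWithPattern:thePattern options:0 |error|:(missing value)
    set theFinds to theRegEx's matchesInString:theString options:0 range:{0, theString's |length|()}
    set theResult to current application's NSMutableArray's array()
    repeat with aFind in theFinds
        set foundString to (theRegEx's replacementStringForResult:aFind inString:theString |offset|:0 template:templateStr)
        (theResult's addObject:foundString)
    end repeat
    return theResult as list
end regexFindWithCapture:fromString:resultTemplate:
```

This still reports TOC pages along with everything else; it only removes the one-page limit.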
This is all very preliminary, and you have not until now shown by example what you want the output to look like.
Explaining in words is next to useless – real examples are required (ideally real-world and not made-up).
Yes – and as I told you – unless you are able to provide a reliable method for determining what entries are from the TOC and what are not – there is no way to automatically filter them out.
Not including the headers and handlers, that was accomplished in only 26 lines of code.
Such is the power of knowing something about regular expressions and scripting.
And how did you do this? How did you identify the TOC pages?
If I know how you're doing that I can add it into the main script.
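Purely as an illustration of the kind of test that could be added to the main script: if TOC pages turn out to carry many more tagged lines than ordinary note pages, a page could be classified by counting them. That is an assumption about the PDF, not something established in this thread, and both the handler name `looksLikeTOCPage` and the threshold value are hypothetical.

```applescript
use AppleScript version "2.4"
use framework "Foundation"
use scripting additions

-- Hypothetical threshold: a page with at least this many tagged lines is treated as a TOC page.
property tocLineThreshold : 10

-- Returns true when the page text contains enough lines carrying a #nnnn tag.
on looksLikeTOCPage(pageText)
    set theString to current application's NSString's stringWithString:pageText
    -- Each match is one whole line that contains at least one tag.
    set theRegEx to current application's NSRegularExpression's regularExpressionWithPattern:"(?m)^.*#\\d{2,}.*$" options:0 |error|:(missing value)
    set hitCount to (theRegEx's numberOfMatchesInString:theString options:0 range:{0, theString's |length|()}) as integer
    return hitCount ≥ tocLineThreshold
end looksLikeTOCPage
```

Inside the page loop this would be used as `if not (my looksLikeTOCPage(thePageText)) then ...` so that only non-TOC pages are searched. The threshold would have to be tuned against the real document, and a note page with unusually many tagged subheadings would be misclassified.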
I stand on the shoulders of giants...
Don't get the idea that I cranked that out in no time though. That macro probably took 3 solid hours of work and testing -- not including communications with you trying to get a complete picture of the task specifications.
All the AppleScriptObjC in my macro does is extract plain text on a page-by-page basis.
You can ask Shane Stanley over on the Script Debugger Forum if some of what you want is possible.
Otherwise the only option I can think of is Skim, but it is no pleasure to script – nor do I see a way to edit content via AppleScript.
You can find things based on Attribute Runs, so you can do something like this:
tell application "Skim"
    tell front document
        tell its text
            properties of attribute runs whose font is "Helvetica" and size is 9
        end tell
    end tell
end tell
But as far as I can see this is only good for extracting stuff.