Macro Request: Exporting pdf search results into excel

Hello all,

I’m new to KM and scripting and wondering if I can get suggestions or feedback. Some background info:

I’m working with research articles (in .pdf format) and have a list of the articles in excel cloumn A (the list is a list of citations of the articles all in column A with different citations in different rows). Columns B, C, D and E I have labeled as different “keywords” that need to be found in the articles. Normally, I go through the articles and manually search each article with each keyword, switching back to excel to mark if the keyword is present or not (marking the keyword in the corresponding column B, C, D or E for the corresponding citation in column A).

I would like to automate the process for each of the files in a specific folder (all .pdf files). Here is what I have so far:

  1. I start the KM Macro by prompting the user to enter a keyword, which I set as a variable (KeyWord) and have KM copy this variable to clipboard.

  2. Next, I wrote an Applescript that searches a pdf with a keyword and outputs the page number(s) on which the keyword was found. If the word is not present, the script adds “NONE FOUND” to the clipboard (see script at the end of this post). I added an action in KM to run the script, in which I use the output of the system clipboard as the input into the search box for the keyword search in pdf.

  3. I then have KM save the output (page numbers) in a new keyboard (named PageNumbers)

The macro works perfectly and does everything I need (for one selected file), but I’m having trouble integrating it into an actual automation. Here are the ideas/questions I would like to implement and would really appreciate any feedback (on this or the approach in general):

  1. If the column A of the excel sheet I have would match the name of the pdf files being processed, is there a way for me to copy the name of the file to a new clipboard for each file being processed? I’m thinking if I’m able to do this, I can create an action that will search the excel sheet for the “name of .pdf file” in column A, then insert a keystroke to move to column B/C/D or E to paste the page numbers retrieved with the script from step 2 above.

  2. The Applescript I wrote works only on an opened pdf file, so I’m wondering if there is a way to create a loop to open each file sequentially and apply the script to the opened file, then closing the opened file, so there are no opened window next time the loop repeats. I’ve tried working with the “For each item in a collection execute actions” action, but haven’t made any progress (unable to figure out how to open each of the files before running the other actions and then closing the file), and still not sure if this is the most efficient way to proceed with what I’m trying to accomplish.

As I mentioned in the beginning, I’m brand new to scripting (it took me the last week and a half to write the applescript to do the search in the pdf file), so please feel free to throw out suggestions, however basic. Here is the applescript I wrote–please do let me know if you can think of any ways I can improve it:

activate application "Preview"
tell application "System Events"
	tell process "Preview"
		set frontmost to true
		click menu item "Find…" of menu 1 of menu item "Find" of menu 1 of menu bar item "Edit" of menu bar 1
		delay 1
		try
			tell window 1
				-- Perform the search
				set keyWord to the clipboard
				set value of text field 1 of group 3 of toolbar 1 to keyWord
				delay 3
				-- Get the search results from the sidebar
				tell UI element 1 of rows 2 thru -1 of outline 1 of scroll area 1 of splitter group 1
					set page_numbers to value of static text 1
				end tell
			end tell
			return {page_numbers}
			-- Error number -1719 means no keyword was found
			-- if -1719 error displays, copy "NONE FOUND" into clipboard
		on error error_message number error_number
			if error_number is -1719 then
				set the clipboard to "NONE FOUND"
			else if error_number is -2753 then
				display dialog "An error has occured:" & return & error_message
			end if
		end try
	end tell
end tell
-- variable page_numbers not defined gives error -2753 from "page_numbers"

Thank you in advance,

Sincerely,

V.

Thanks for the very detailed background. I would be most helpful if you could now succinctly specify exactly what you would like help on now.

OK, thank you for reading. My question specifically is as follows:

How would I go about creating a macro with a repeat loop that would accomplish the following for Each file in the folder (steps within parenthesis are listed for clarity of order only and are not steps with which I’m having trouble):

  1. Open pdf file in Preview
  2. Copy name of opened file
  3. (Perform page number extraction on opened file)
  4. (Perform necessary operations in excel)
  5. Close pdf file
  6. Open next pdf file
  7. Repeat for each file in folder

Please let me know if this makes sense or if I can provide any additional information.

Thank you.

Hello V,

Please provide an example of what your listings look like in Excel (more than just 1 line). Excel is very scriptable.

An actual zipped worksheet would be handy. (You can drag & drop into the forum editor.)

Please also provide a couple of links to sample PDFs of the sort you’re working with.

Chances are pretty good that you can do this job with considerably greater efficiency.

For instance:

Preview is not scriptable except in the most basic fashion.

Skim is quite scriptable.

AppleScriptObjC can work with PDFs directly.

There’s a great command-line tool for reading PDFs called pdftotext. I like it, because it has an option for preserving the layout of the PDF when it converts to text.

In combination with the Satimage.osax AppleScript Extension you have very powerful tools for batch processing of PDFs.

-Chris

Hello Chris,

Thank you for responding. Here is an example of the type of excel listing with which I'm working (contains only the necessary info--the actual file typically has much more data, but data is different depending on the type of project):

Example_List_File_Names.zip (26.8 KB)

The column A in the excel file lists the file names; although normally I list article citations (Author, year), it wouldn't be a problem to extract file names and list them in excel (but do let me know if you have a more efficient idea). Note that Column B is labeled as "Research"--an example keyword I would like to extract from the article (if present). I normally have several keywords for which I search (e.g. column B would be labeled "Consortium", column C would be labeled "Standard Error", etc.). I would like to have the page numbers on which the keyword was found in the article to be listed under the keyword column (in this case "Research"), but I would be OK with listing whether or not the word was found in the article, if that would make it easier to automate.

As far as the type of articles, here are 4 articles as an example (names are part of the list in Column A in the example excel file):

Articles_Example.zip (2.3 MB)

I'm not familiar with the programs you mentioned and would appreciate any advice at all in getting this started.

Thank you,

V.

Hello V,

I specifically want to see the structure with keywords you're searching for included. I need to see how these are laid out and have examples to work with.

I realize this data is a subset which is fine, but I need an example of a working data-set.

What version of OSX are you using?

On preliminary inspection it appears this task can be done entirely via script. E.g. no driving the UI — just a push of a hotkey (or other trigger) — wait a few seconds — and done.

Searching the PDFs would appear to be dead-simple — even working page-by-page. The more difficult part will be to enter data back into Excel.

Although... It might be very easy to dump to a data file that could be reimported into Excel or possibly pasted.

You noticed that I provided links to the software I mentioned?

I don't think you're gong to need either pdftotext or the Satimage.osax — just Mavericks or Yosemite — although you should definitely scope-out Skim (it's a PDF-reader designed for research).

-Chris