Which tool to chunk PDF texts?

Hi all,

Short version: I want to split a large PDF into chapters by calculating page ranges from a text file, using native Mac/KM tools instead of Python. Is this possible?


Longer version:

I have a working PDF chunking system built with Python and shell scripts inside Keyboard Maestro. It does exactly what I need, but I'd like to rebuild it using native Mac tools and KM actions where possible, following advice I've gotten here before about preferring KM actions over shell scripts.

The context: I work with large reference books (500–1400 pages) in PDF format. Each book is organized into chapters, one per subject. My goal is to extract each chapter as its own PDF file, named after the chapter title.

What I am asking:

The step I don't know how to do natively is taking a source PDF and extracting a specific page range (say pages 105 to 203) into a new named PDF file.

In Python we used the pypdf library for this. Is there a KM action, an AppleScript approach, or a Mac-native tool that can do the same thing, given a source file path, a start page number, an end page number, and an output file name?

Can you suggest tools that are part of KM or native to the Mac?

Here's what the system currently does:

I prepare a Table of Contents text file in advance. Each line looks like this:

[[Chapter One]] ((45))
[[Chapter Two]] ((112))
[[Chapter Three]] ((203))

The double brackets contain the chapter title. The double parentheses contain the page number from the book's printed table of contents.

The challenge is that printed TOC page numbers don't match physical PDF page numbers. A book typically has front matter (title page, foreword, preface) before Chapter 1, so the physical PDF pages run ahead of the printed numbers. This gap, which I call the offset, is consistent across the entire book.

I find the offset by opening the PDF in PDF Expert, navigating to the first chapter, and reading the physical page number it shows there. I enter that one number into a KM prompt at the start of the macro.

What the macro does:

  1. Prompts me for the offset (one number)
  2. Reads the TOC text file line by line
  3. For each chapter: calculates start page = TOC page + offset, end page = next chapter's TOC page + offset − 1
  4. Extracts those pages from the source PDF and saves as a new PDF named after the chapter title
  5. Outputs all chunks into a CHUNKS_FINISHED folder
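
As a rough illustration of the arithmetic in step 3 (a minimal sketch, not the actual script from this system): the function name `chapter_ranges` is hypothetical, the offset of 12 and the page count of 260 are made-up values, and the last chapter is assumed to run to the final page of the PDF.

```python
def chapter_ranges(toc, offset, total_pages):
    """Return (title, start, end) physical page ranges, 1-based and inclusive.

    toc: list of (title, printed_page) tuples in book order.
    offset: physical page number minus printed page number.
    total_pages: page count of the source PDF (assumed end of the last chapter).
    """
    ranges = []
    for i, (title, printed) in enumerate(toc):
        start = printed + offset
        if i + 1 < len(toc):
            # End one page before the next chapter starts.
            end = toc[i + 1][1] + offset - 1
        else:
            # No next chapter: assume the chapter runs to the last page.
            end = total_pages
        ranges.append((title, start, end))
    return ranges


toc = [("Chapter One", 45), ("Chapter Two", 112), ("Chapter Three", 203)]
for title, start, end in chapter_ranges(toc, offset=12, total_pages=260):
    print(f"{title}: pages {start}-{end}")
```

With these example values, Chapter One covers physical pages 57 to 123.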

Currently steps 2–5 are handled by a Python script called from a shell script in KM. The KM macro handles the prompt and passes the offset to the shell script.

I also have safety checks: verify the TOC file exists, verify there is exactly one source PDF in the folder, verify the offset was actually entered.
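
A minimal sketch of what those three checks could look like in Python. This is an illustration under assumptions, not the checks as actually implemented: `check_inputs`, `toc.txt`, and `book.pdf` are hypothetical names.

```python
import tempfile
from pathlib import Path


def check_inputs(folder, toc_name, offset_text):
    """Return the single source PDF path, or raise ValueError naming the problem."""
    folder = Path(folder)
    # 1. The TOC file must exist.
    if not (folder / toc_name).is_file():
        raise ValueError(f"TOC file not found: {toc_name}")
    # 2. There must be exactly one PDF in the folder.
    pdfs = sorted(p for p in folder.iterdir() if p.suffix.lower() == ".pdf")
    if len(pdfs) != 1:
        raise ValueError(f"expected exactly one PDF, found {len(pdfs)}")
    # 3. The offset must have been entered and must be a whole number.
    try:
        int(offset_text)
    except ValueError:
        raise ValueError("offset must be a whole number") from None
    return pdfs[0]


# Demo in a throwaway folder holding one TOC file and one PDF.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "toc.txt").write_text("[[Chapter One]] ((45))\n")
    Path(tmp, "book.pdf").write_bytes(b"%PDF-1.4\n")
    print(check_inputs(tmp, "toc.txt", "12").name)
```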


To be concrete about where we are technically:

Steps 1–3 I believe I can handle with KM native actions, based on a working macro I already have:

  • Reading a text file line by line → For Each Item in a Collection (lines of file)
  • Parsing each line for title and page number → Search Variable Using Regular Expression, capturing two groups
  • Calculating physical page numbers → Set Variable to Calculation (TOC page + offset)
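
For the parsing bullet, a two-group pattern might look like the one below. It is sketched in Python so it can be tried quickly; the pattern and the name `parse_toc_line` are assumptions, and whether the same expression works unchanged in KM's regex action has not been checked here.

```python
import re

# Hypothetical pattern: group 1 is the title inside [[ ]], group 2 the page inside (( )).
TOC_LINE = re.compile(r"^\[\[(.+?)\]\]\s*\(\((\d+)\)\)\s*$")


def parse_toc_line(line):
    """Return (title, printed_page), or None if the line is not a TOC entry."""
    match = TOC_LINE.match(line.strip())
    if match is None:
        return None
    return match.group(1), int(match.group(2))


print(parse_toc_line("[[Chapter One]] ((45))"))
# A title that itself contains parentheses still parses.
print(parse_toc_line("[[Acari (Mites & Ticks)]] ((1))"))
```

Returning `None` for non-matching lines makes it easy to skip blank or malformed lines in the loop.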

Repeating my question:

The step I don't know how to do natively is step 4: taking a source PDF and extracting a specific page range (say pages 105 to 203) into a new named PDF file.

In Python we used the pypdf library for this. Is there a KM action, an AppleScript approach, or a Mac-native tool that can do the same thing, given a source file path, a start page number, an end page number, and an output file name?

Thank you,

Ellen Madono

macOS used to offer extracting PDF pages in Automator, and I think in Shortcuts. They've removed both, though, leaving (as far as I know) Preview as the only way to do this. And that's far from ideal for use in a Keyboard Maestro macro.

So your Python script, which seems to be working fine, looks like your best bet. Is there some reason you don't like using it?

If you find that script overly large or complex to manage, there are some third-party command line tools that are small and fast. If you have Homebrew installed, installation is a command away. If not, they're still not overly hard, but a bit tougher.

Here's one you can install just by putting the binary in the folder where you want it to live:

pdfcpu

Once installed, you can run it in a Keyboard Maestro shell script action with something like:

/path/to/pdfcpu trim -p "5-8" /path/to/source.pdf /path/to/extracted.pdf

It's fast and easy to use, and has many additional features.
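
A minimal sketch of how that command could be assembled once per chapter, written in Python since the existing system already uses it. It only builds the argument list and does not run anything. The helper name `pdfcpu_trim_args`, the paths, and the file-name cleanup are assumptions for illustration; the `trim -p` form is the one shown above.

```python
from pathlib import Path


def pdfcpu_trim_args(pdfcpu, source, out_dir, title, start, end):
    """Build the argument list for one 'pdfcpu trim' call (nothing is executed)."""
    # Replace characters that are awkward in macOS file names (assumed cleanup rule).
    safe_title = title.replace("/", "-").replace(":", "-")
    target = Path(out_dir) / f"{safe_title}.pdf"
    return [str(pdfcpu), "trim", "-p", f"{start}-{end}", str(source), str(target)]


args = pdfcpu_trim_args(
    "/path/to/pdfcpu",
    "/path/to/source.pdf",
    "/path/to/CHUNKS_FINISHED",
    "Chapter One",
    57,
    123,
)
print(args)
```

The resulting list could then be handed to `subprocess.run(args, check=True)`, which avoids shell quoting problems with titles that contain spaces.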

-rob.

Why not use python for this? Or, and for less effort, @griffman's suggested pdfcpu?

I'm a great advocate for using "native" KM Actions when possible, but that's partly because it seems silly to replace simple KM Actions with your own scripts and partly because this is, after all, the KM Forum :wink:

You could do this "natively", but you would still need to leverage other applications. Both Automator and Shortcuts have "Split PDF" and "Combine PDF" actions, so you could explode your original into single pages then combine page ranges to get your chapters. If you have Acrobat then you could drive the "Organise" window's interface with "native" KM actions -- filling in a page range, pressing "Extract", and manipulating the resulting "Save" dialog (the same may be possible with PDF Expert and other apps). But all that seems a lot more work than using a tool someone else has provided!

Just a thought -- doesn't DEVONthink have a "Split PDF: Into Chapters" tool that uses the PDF's table of contents? It would mean importing the whole document into DT, splitting it, then deleting the original -- but you wouldn't have to mess around creating your own TOC then creating and importing the chapter files.

As an alternative, here's a solution using the Execute a Swift Script action in Keyboard Maestro.

Warning This macro uses 100% AI-generated code. It reads a file from your drive, and writes a new file back to the drive. Use at your own risk.

Download Macro(s): Extract PDF pages.kmmacros (11 KB)

Macro notes
  • Macros are always disabled when imported into the Keyboard Maestro Editor.
    • The user must ensure the macro is enabled.
    • The user must also ensure the macro's parent macro-group is enabled.
System information
  • macOS 15.7.4
  • Keyboard Maestro v11.0.4

I knew that macOS had the ability to do quite a bit with PDFs, and Keyboard Maestro supports Swift scripts, so I spent a few minutes with an AI, describing what I wanted. In the end, it came up with this fully functional solution in 32 lines of Swift code (after we had some back-and-forth discussion, of course).

The demo macro is very basic: Select a single PDF file in Finder, then run the macro. It will prompt you for which pages to extract, then creates a new file named the same as the source, but with the extracted pages included in its name.

But you could easily adapt this to what you need: just make sure you send four values (start page, end page, original file path, new file path) to the script.

-rob.

As a continuation of @griffman's comment: I was in a similar position to you, wanting to do everything in Keyboard Maestro, but there are situations where an AppleScript or Python script is preferable because it is more readable, simpler, etc.

I am not nearly as authoritative as @griffman, @Nige_S or others here but do prefer Keyboard Maestro actions where possible and reasonable (i.e., I do opt for a 3 to 5 line Python Script over 10+ Keyboard Maestro actions when wrangling dates).

I did turn to Python and OCRmyPDF for one of my macros and note that it was simple (I was able to figure it out) and it worked great. While my purpose is different from yours, it does have the functionality you are looking for (note: I have yet to try it). The syntax is simple:

I hope this is helpful and good luck.

Nice.

It does lose metadata that Acrobat's Organise->Extract method keeps ("Title", "Author", and so on). But that's a small price to pay compared to the cost of an Adobe subscription :wink:

Well, then you'll want to use the updated version of the macro that I just posted; it seems to keep everything (though it'll take a minute to populate after you create the file). After a brief chat with Claude, it turns out it's really easy to keep that data in Swift:

if let documentAttributes = inputDoc.documentAttributes {
    outputDoc.documentAttributes = documentAttributes
}

That just tells it to copy the attributes from the input file to the newly-created file.

-rob.

It's still not bringing me a fresh coffee and a Chelsea Bun on completion -- could you get on that?

Claude tells me that such tasks require tokens to hire real world workers, so that version will cost you $175,252.32, give or take.

-rob.

That is not exactly true.

We still have the option of using a built-in framework to work with PDFs.

Below is a KM macro that uses ONLY macOS tools (in this case the PDFKit framework and JavaScript for Automation). It asks for the source and destination PDF files, and the third question is about the operation:

CPYPAGES: 1,3,7-12,13 - copies the selected pages from the source to the destination.

DELPAGES: 1,3,7-12,13 - removes the selected pages while writing to the destination.

7-12 denotes a range and includes all of its pages (7, 8, 9, 10, 11, 12) in the selected operation.
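
To illustrate how a page selection like this expands, here is a sketch in Python. It is not the macro's actual JavaScript for Automation code, and `expand_pages` is a hypothetical name.

```python
def expand_pages(spec):
    """Expand a selection such as '1,3,7-12,13' into sorted, unique 1-based pages."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            first, last = (int(n) for n in part.split("-", 1))
            if first > last:
                raise ValueError(f"descending range: {part}")
            # Ranges are inclusive at both ends.
            pages.update(range(first, last + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


print(expand_pages("1,3,7-12,13"))  # [1, 3, 7, 8, 9, 10, 11, 12, 13]
```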

@Ellenm - it is possible to use that macro in a bigger solution. Just specify, as precisely as possible, what the input should be (e.g. file path, page range) and what the output should be, and I may try to extend it.

There is no need for Python, external libraries, etc.

Copy-Delete selected pages from PDF file with prompt Macro (v11.0.4)

Copy-Delete selected pages from PDF file with prompt.kmmacros (7.2 KB)

Thank you, @Griffman. The Swift script works perfectly for the page extraction. Now I need help with the TOC reading part.

I have a text file where each line looks like this:

[[Acari (Mites & Ticks)]] ((1))

I need to loop through every line, extract the title and page number, and calculate the end page by looking at the next line's page number. Then pass title, start page, and end page to the Swift script as KM variables.

I have a For Each loop reading lines of a text file. Each line has a chapter title and start page number. I need to calculate the end page of each chapter, which is the next line's start page minus 1. A For Each loop can only see one line at a time. What is the best native KM approach for look-ahead like this, or is AppleScript the right tool here?

Someone mentioned AppleScript. Is that the right tool, or is there a better native KM approach?

Is Swift now available on every Mac, or does it still need to be installed with Xcode, and do I have to register with Apple as a developer?

I copied the action from the macro prepared by @griffman. An AI changed the content to fit my situation. I don't know if it works yet.

Either:

  1. "Filter: Reverse" your TOC so you can "For Each" the lines in reverse order
  2. Treat the TOC as a pseudo array, where you access each line via an index, i, using the form %Variable%Local_toc[i]\n%. You can then "For Each: range of numbers" from 1 to LINES(%Variable%Local_toc%) getting the chapter end via %Variable%Local_toc[i + 1]\n% (but watch out for the last line, which won't have an [i + 1]!).

Give both a go and see how you get on.
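
The idea behind the first option can be sketched outside KM as well. This is a minimal Python illustration, not a KM macro: `ranges_via_reverse` is a hypothetical name, and the last chapter is assumed to end on the final page of the PDF. Walking the list backwards means each chapter's end is already known when the chapter is reached, so no look-ahead is needed.

```python
def ranges_via_reverse(toc, last_page):
    """Walk the TOC backwards so each chapter's end page is already known.

    toc: list of (title, start_page) tuples in book order.
    last_page: final page of the PDF, used as the end of the last chapter.
    """
    result = []
    next_start = last_page + 1  # pretend a chapter starts just past the end
    for title, start in reversed(toc):
        result.append((title, start, next_start - 1))
        next_start = start  # this chapter's start bounds the one before it
    result.reverse()  # restore book order
    return result


for row in ranges_via_reverse([("A", 1), ("B", 10), ("C", 25)], last_page=40):
    print(row)
```

The index-based second option gives the same ranges; it just needs a special case for the last line instead of the "one past the end" starting value used here.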

(It's actually more efficient to pass the entire TOC to the script and iterate through the lines there -- but that won't help you learn how to use KM :wink: )

You mean, there's a better Keyboard Maestro action than what you are suggesting?

No.

When you run a script via the "Execute a Shell Script" Action there is a time and resource cost in creating the environment, running the script, destroying the environment.

If you have ten chapters in your TOC and "For Each" the lines to the script one at a time you have to create/run/destroy ten times.

If you pass the entire TOC into your script and process the lines there then you will

  1. Create the environment
  2. Run the script, looping over those 10 lines and generating a PDF for each
  3. Destroy the environment

For a small number of lines like you have here it doesn't make much difference -- the running of the script is the majority of the work. But if you were processing 1,000 lines of a log file or similar then you'd see a big advantage to passing the log file in one go and not line-by-line from KM.

As of macOS 14.8.5, it's available on every Mac. I happen to know that release version because I just updated my wife's Mac to that release, and Swift was then accessible without the Xcode command line tools installed. (That's what you needed before: No dev account, no $99 a year, and no full Xcode; macOS would just prompt you to install the Xcode command line tools.)

Here's what happens on macOS 14.8.5 and newer when you type swift for the first time:

-rob.

Thanks

So now I have no reason to avoid Swift in the solutions I suggest to other people.

@Ellenm

Here is a macro that asks for the input file and the TOC file, then builds all the chunks in the same directory as the input file, with names taken from the TOC file. It could be changed so that it asks only for the input file and tries to open doc.file in the same directory.

All the logic is inside the Swift script.

Extract PDF pages - Swift only Macro (v11.0.4)

Extract PDF pages - Swift only.kmmacros (5.6 KB)