Help Me Help My Missus - Two PDF Columns To Variables

My better half is a music publisher and has to add up hundreds of royalty statements from PDFs. There are two columns of interest:

Until now, she's been doing it by hand with a calculator, but I've made a macro that will add up one column at a time. It would make her life a bit easier if I could do them both at once. If I select and copy the highlighted columns above, I get this:

Expand for data...

59 0.0053 67 0.0014 88 0.0024

135 0.0017 294 0.0105 740 0.0421

12049 0.1021 383 0.0150 482 0.0190 370 0.0158 440 0.0184 383 0.0161 431 0.0181

If we see A as a Column A value and B as a Column B value, the above result is:

A B A B A B A B A B A B A B A B A B A B A B A B....

How can I separate these to get every A to lines in one variable and every B to lines in another?

Edit:

I've come up with something by literally just alternating which variable a space-delimited value goes to. I suppose my question is, is this the best way to do this or is there are smarter method?

Add Two PDF Columns.kmmacros (39 KB)

Macro screenshot

There are lots of apps/services that convert PDF tables to other formats. I use Tabula to extract data and put it into CSV files.

2 Likes

I would personally do it with a chunksOf function in a Keyboard Maestro Execute JavaScript for Automation action:

Pair of JSON arrays from alternating values.kmmacros (5.6 KB)


SAMPLE OUTPUT


Expand disclosure triangle to view JS Source
(() => {
    "use strict";

    // Rob Trew @2022

    const main = () =>
        unzip(
            chunksOf(2)(
                words(
                    Application("Keyboard Maestro Engine")
                    .getvariable("sampleRows")
                )
            )
        );


    // --------------------- GENERIC ---------------------

    // chunksOf :: Int -> [a] -> [[a]]
    const chunksOf = n => {
        // xs split into sublists of length n.
        // The last sublist will be short if n
        // does not evenly divide the length of xs .
        const go = xs => {
            const chunk = xs.slice(0, n);

            return Boolean(chunk.length) ? [
                chunk, ...go(xs.slice(n))
            ] : [];
        };

        return go;
    };


    // unzip :: [(a,b)] -> ([a],[b])
    const unzip = xys =>
        xys.reduce(
            ([a, b], [x, y]) => [
                [...a, x],
                [...b, y]
            ],
            [
                [],
                []
            ]
        );


    // words :: String -> [String]
    const words = s =>
        // List of space-delimited sub-strings.
        s.split(/\s+/u);


    // MAIN ...
    return JSON.stringify(
        main(),
        null, 2
    );
})();
3 Likes

@drdrang Thanks for the heads up about Tabula. I just had a go with it and it produced similar results to ABBYY Finereader; mostly accurate but with quite a few displaced columns, despite maintaining identical selections for every page. I'll see how easy it is to tidy up.

@ComplexPoint Thanks for the script idea. It required a bit of post-processing, so is there a benefit to this method over the KM-action one or just personal preference and familiarity?

I think that's probably right – the post-processing of course could also be done in JavaScript, but it really a question of what comes most quickly to hand.

1 Like

For example, I don't think you've described the output format that you need, but if it was two textual columns, then perhaps:

Pair of JSON Strings from alternating values.kmmacros (4.9 KB)


Expand disclosure triangle to view JS Source
(() => {
    "use strict";

    // Rob Trew @2022

    const main = () =>
        unzip(
            chunksOf(2)(
                words(
                    Application("Keyboard Maestro Engine")
                    .getvariable("sampleRows")
                )
            )
        )
		.map(unlines);


    // --------------------- GENERIC ---------------------

    // chunksOf :: Int -> [a] -> [[a]]
    const chunksOf = n => {
        // xs split into sublists of length n.
        // The last sublist will be short if n
        // does not evenly divide the length of xs .
        const go = xs => {
            const chunk = xs.slice(0, n);

            return Boolean(chunk.length) ? [
                chunk, ...go(xs.slice(n))
            ] : [];
        };

        return go;
    };


    // unlines :: [String] -> String
    const unlines = xs =>
        // A single string formed by the intercalation
        // of a list of strings with the newline character.
        xs.join("\n");


    // unzip :: [(a,b)] -> ([a],[b])
    const unzip = xys =>
        xys.reduce(
            ([a, b], [x, y]) => [
                [...a, x],
                [...b, y]
            ],
            [
                [],
                []
            ]
        );


    // words :: String -> [String]
    const words = s =>
        // List of space-delimited sub-strings.
        s.split(/\s+/u);


    // MAIN ...
    return JSON.stringify(main());
})();

1 Like

A reliable PDF to CSV conversion would be the ideal, but there always seem to be issues, and it might end up taking longer to clean up in Excel than it does to use KM like this.

The idea is to select a number of rows in those two columns, hit a hotkey and the sum of each will pop up on-screen for her to note down. My version does it and so does yours now that I've added the adding-up bit.

Here's mine:
Add Two PDF Columns.kmmacros (39 KB)

Macro screenshot

Here's yours:
Pair of JSON Strings from alternating values.kmmacros (27 KB)

Macro screenshot

1 Like

Got it.

CSV would be a simpler target than separate columns here, and you perhaps could get your sums (or any other reporting) from sqlite3.

Sums of two columns from alternating values copy.kmmacros (4.2 KB)



1 Like

It's all voodoo to me. Wish I understood it! The only reason I think I'll stick to my version is that she prefers Display Text Large so she doesn't have to manually close the window, and it looks a bit cleaner without the quotation marks. Well, that and the fact that if I ever have to adapt it, I'll know what I'm looking at. I'm sure other people who understand what you're doing will benefit from this though.

1 Like

I don't know if it's smarter, but you could keep count of the line number in your "For Each" then riff on the fact that odd-numbered lines add to LocalSum1 and even-numbered lines to LocalSum2.

This also makes it easily extendable for more columns, just initialise the appropriate extra LocalSum<n> variables, set the "Switch" calculation to ...MOD <n>, and add the extra cases. Simply replace your Set Variable "Local__Column" and For Each Item... with:

1 Like

I like the sound of that. This MOD thing is new to me. Is it a way of identifying even/odd numbers? How does the line count end up being 0?

MOD is modulo division -- what we called "the remainder" at Primary school :wink: So...

  • 1 MOD 2 is "remainder of 1 divided by 2", or 1
  • 2 MOD 2 is "remainder of 2 divided by 2", or 0
  • 3 MOD 2 is "remainder of 3 divided by 2", or 1
  • etc

...and you can see that odd numbers are always 1 and evens are 0.

It's the counterpart of DIV, or integer division -- a DIV b is the number of times b goes into a, ignoring the remainder.

These are useful in things like time calcs, like "what's 320 seconds in minutes and seconds?":

  • mins = 320 DIV 60
  • secs = 320 MOD 60

I was always crap at maths but actually understood that because it was so well explained!

So how would you determine a third or fourth column?

So if you want to split your return-delimited list into 3 columns:

  • Line 1 goes into 1 MOD 3 = 1, so the first column
  • Line 2 goes into 2 MOD 3 = 2, so the second column
  • Line 3 goes into 3 MOD 3 = 0, when we see a 0 we make that a 3, so the third column
  • Line 4 goes into 4 MOD 3 = 1, so the first column
  • Line 5 goes into 5 MOD 3 = 2, so the second column
  • etc

For 4 columns:

  • Line 1 goes into 1 MOD 4 = 1, so the first column
  • Line 2 goes into 2 MOD 4 = 2, so the second column
  • Line 3 goes into 3 MOD 4 = 3, so the third column
  • Line 4 goes into 4 MOD 4 = 0, when we see a 0 we make that a 4, so the fourth column
  • Line 5 goes into 5 MOD 4 = 1, so the first column
  • Line 6 goes into 6 MOD 4 = 2, so the second column
  • etc

...and you should be able to see a pattern emerging:

For number of columns required, n

if (lineNumber MOD n) = 0 then
    do something referring to column n
else
    do something referring to column (lineNumber MOD n)
endif

The problem with that in KM is that you're now looking at "variable variables" -- you want to add the current number to the variable "Local_column<number_determined_by_a_calculation>". You could probably do something with variable indirection, but if you are going to grab that hot potato and its associated obfuscation (and maintenance burden!) you may as well use @ComplexPoint's excellent routines instead.

The downside of using a KM "Switch/Case" action to do this is that you have to edit the macro when the number of columns changes -- the upside is that it is very obvious what's going on and so is very easy to edit... Make sure you have "Local_colNumX" variables for each of columns 1 to n, change the Switch's ...MOD n calculation so n matches your number of columns, and make sure you have a Case for every result of 0 through n-1.

So to go from 2 columns to 3 you'd change the previously posted to

Summary

...and from 2 columns to 4 it would be

Summary

1 Like

Did I mention I was crap at maths?

I'm going to have to have another coffee and re-read all that. :exploding_head:

Once the coffee's taken effect, here's your bonus section -- arrays and loops!

This kind of "working through a list" is perfect for loops -- you use a "For Each" action in your macro. But think of how you'd wrangle the list of numbers by hand -- left index finger on the first line, keep it there to hold your place, put your right index finger there and look at the number, move right finger down one line, look at the number... Once you've moved your right finger down to account for all the columns you jump your left finger down to that point as a place holder and start all over again at that point with your right finger.

Easier to do than explain -- print the numbers out and give it a go :wink:

So what you got is a loop within a loop. Better yet, that "inner" loop going "1, 2, 1, 2,..." matches your column numbers and you can use that to decide which column to add to. Array elements can be referenced by index, so we can use an array and our inner loop numbers to do the totalling.

Exactly how you do that depends on the language, but here's an AppleScript example because AppleScript is "readable" and indexes its lists (a poor man's array) from 1 rather than JavaScript's 0 which makes things easier to understand.

I've used your numbers but repeated the last 4 rows to give 30 lines -- the 2-column totals won't match but it means you can change numOfCols to 2, 3, 5, 6, or 10 to see how versatile such a simple approach can be.

Pop the following into a new Script Editor document, press the "Run" button, see the list of totals (one number per column) in the Results window. Change numOfCols and see what happens... (Note: There's way more "here's a list" and comments than there is actual code!)

Summary
-- Our return-delimited list of numbers
-- Note that empty lines will be treated as an
-- entry with the value of '0'
set theList to "59
0.0053
67
0.0014
88
0.0024
135
0.0017
294
0.0105
740
0.0421
12049
0.1021
383
0.0150
482
0.0190
370
0.0158
440
0.0184
383
0.0161
431
0.0181
383
0.0161
431
0.0181"

-- Set the number of columns you want here
set numOfCols to 2

-- This gets our "end of list" condition for the outer loop"
set numOfLines to count of paragraphs of theList
-- Initialise the "Totals" list to the right length, every value as '0'
set totalsList to {}
repeat numOfCols times
	set end of totalsList to 0
end repeat

-- Set our outer loop start to the first line of the text
set i to 1
-- and do the actual work
repeat while i < numOfLines
	-- Each time we restart the inner loop we set its counter to 0
	-- Using 0 means we have to add 1 to access the correct element of
	-- totalsList, but it can be used unmodified as the line counter 'offset'
	-- when we access the text
	set j to 0
	repeat while j < numOfCols
		set item (j + 1) of totalsList to (item (j + 1) of totalsList) + (paragraph (i + j) of theList)
		set j to j + 1
	end repeat
	-- after the inner loop finishes, jump down the text list a number lines
	-- equal to our number of columns
	set i to i + numOfCols
end repeat
1 Like

What's stronger than coffee? :exploding_head::exploding_head::exploding_head:

1 Like

I love it when you realise you've been overthinking something. The left column is always a whole number and the right column is always to four decimal places. So all I have to do to separate them in a For Each loop is to determine whether or not the line contains ".".

This is doubly helpful because some pdf pages only let you select one column at a time unless you hold ⌥. If you do that, then the data comes back as:

AAAAAAAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBBBBBBBB

...meaning you need a separate calculation method.

By using the decimal point as a differentiator, I've made it fool-proof.

Why didn't I think of that before?!

1 Like