Convert organized Listed Data from Pages to excel

forums2012 · July 15, 2018, 2:53am

I have got great support to my questions on the forums. Loving Keyboard Maestro more and more everyday.
I don't know how much is too much to ask, so.

I have a Pages document that has a list of Question answers in the format:

Question?
[indent] * Answer (ends with unique identifier #1234)

Trying to create a macro to automatically convert this into an Excel file with 3 columns:
Question | Answer | unique ID

Would really be on the top of the world if you guys can do this as well!

Thanks a lot.

JMichaelTX · July 15, 2018, 3:29am

If you want someone to deliver a complete macro to you, then, yes, this is quite a task, unless they happen to already have a similar solution.

Both Pages and Excel are very scriptable, and I think an AppleScript solution is probably your best bet. While I have a lot of experience with Excel, I have almost none with Pages.

So, you might do some 'net searches on "AppleScript Pages" to see what you can find. I did find one web site that you might find helpful:
iWork Automation -- AppleScript and Pages

Here are some instructions and examples of getting text from a Pages document:
AppleScript and Pages: Working with Text

Of course, it highly depends on how you have organized your data in Pages. Is each question a line, paragraph, or a row in a table?

If you, or someone, can write an AppleScript that returns the list of questions in an AppleScript list of lists (each item would be 3 items: question, answer, ID), then I'll take a look at saving that to Excel.

Of course, if you anticipate you are going to have ongoing similar needs that require scripting, then this might be the perfect opportunity to learn AppleScript. KM and AppleScript work together very well. If you are interested, then I'd suggest you take a look at:

AppleScript: Beginner's Tutorial
Script Debugger 7.
SD7 now comes with a 20-day free trial, and then will gracefully downgrade to the FREE SD7 Lite (which is still much better than Script Editor).

You can get lots of AppleScript help here in this forum, and in the Script Debugger Forum.

Questions?

forums2012 · July 15, 2018, 3:41am

Thanks a lot for the detailed reply.
I think the solution to this goes beyond my knowledge of programming.

Had got some great solutions so thought I'll try my luck.

To answer your question, the text is pretty well organized: Eg.

Which is the best automation tool for mac?

Keyboard Maestro #0101

Who has been amazing person to help me with my KM queries?

@JMichaelTX #0007

These even have Text styles applied to them. All Questions are "Question" Text Style and answers "Answer"
No exceptions, well-organized.

Thanks a lot for your time.
Will try looking into it (but don't think this is my cup of tea)

I mean it seems like
I'll have to run a foreach loop for every ordered list item then select all text with style "Question" set it to variable Question then similarly for answer.
Then break down answer variable into answer and #tag
Then append these 3 variables to a row in xls document

But the problem is I don't know AppleScript at all to figure this out.

Anyway Thanks a lot for the help!

forums2012 · July 15, 2018, 3:52am

Oh I just realized I had tried this.
When I export to an html file (using TextEdit)
The organization seemed pretty organized to me:

Which is the best automation tool for mac?

Keyboard Maestro #0101

Who has been amazing person to help me with my KM queries?

@JMichaelTX #0007

Source Code:

<ol class="ol1">
  <li class="li3"><span class="s8"></span><span class="s3">Which is the best automation tool for mac?</span></li>
</ol>
<ol class="ol1">
  <ul class="ul1">
    <li class="li3"><span class="s3">Keyboard Maestro </span><span class="s6"> #0101</span></li>
  </ul>
</ol>
<ol class="ol1">
  <li class="li3"><span class="s8"></span><span class="s3">Who has been amazing person to help me with my KM queries?</span></li>
</ol>
<ol class="ol1">
  <ul class="ul1">
    <li class="li3"><span class="s3">@JMichaelTX </span><span class="s6"> #0007</span></li>
  </ul>
</ol>

Can this be useful? I know there are php scripts to parse this. But now I like using KM, so worth a try.

CJK · July 15, 2018, 9:36am

Hi @forums2012

Can you upload the Pages document (or a portion of it) ? It should be a fairly trivial task with AppleScript, but Pages does have a couple of bugs that can stop perfectly good scripts from working, so this can only be fruitful if I have genuine source data to go by.

The HTML could be used, but I would probably only resort to that if one of the aforementioned bugs makes the source material unviable for scripting, as otherwise, the scriptability of Pages makes processing text data very simple.

forums2012 · July 15, 2018, 9:07pm

Sample Pages document.zip (109.7 KB)
I have uploaded the zip file containing the pages document.
Thanks a lot @CJK for the interest. Really Appreciate the help!

CJK · July 16, 2018, 4:28am

It ended up being slightly less trivial than I expected it to be, but manageable nonetheless.

Although you may not be familiar with AppleScript personally, I do suggest you read through this script and get a sense of what it's doing at each stage. The nice thing about AppleScript is that it reads fairly naturally, almost like a set of spoken instructions that you're giving to the computer. Therefore, even if you've not much experience with scripting in general, you will still gleen what is going on in a simple AppleScript.

tell application "Pages"
	tell the body text of the document named "Sample Pages document" to ¬
		set _QA to a reference to (every paragraph ¬
			where class of words contains placeholder text and ¬
			the font is "Helvetica")
	
	set D to make new document
	
	set my text item delimiters to linefeed
	set D's body text to _QA as text
	
	tell D's body text
		set every character where it is "#" to linefeed
		
		delete the last character of (every paragraph ¬
			where it ends with " ")
		
		set _P to a reference to every paragraph
		
		set |rows| to {}
		set my text item delimiters to "|" --> or tab
		
		repeat with i from 1 to count _P by 3
			set end of |rows| to contents of {¬
				item i, ¬
				item (i + 1), ¬
				item (i + 2)} of _P as text
		end repeat
	end tell
	
	set my text item delimiters to linefeed
	set D's body text to |rows| as text
	
	export D as unformatted text to POSIX file "/Users/%you%/Desktop/Q&A.txt"
	close D without saving
end tell

Please note that this script only generates a text file with CSV-formatted plain text, where each line represents one row of question and answer data, delimited into three fields by a |, i.e. {question}|{answer}|{id}. As I don't own or use Excel, you'll have to ask someone if they can script the next part that utilises the text file and imports the data into Excel (although, you may find that Excel can import CSV-type text itself without the need for any scripting).

Two things in the script you'll need to change, and one you may wish to change:

On line 2, I reference the source document by its name. You'll need to change this to the actual name of your document, which should be open and waiting before this script runs.
On one of the lines at the end, the path to the output text file will need changing to wherever you want your CSV text file to be located. %you% is to be replaced with the name of your user home directory.
Optionally, there's a line where I've added the comment "--> or tab". If you (or Excel) would prefer to use a tabstop as the field delimiter, simply replace "|" in that line with tab (no quotes), so it reads set my text item delimiters to tab.

And, finally, a couple of notes: the script used the source material you provided and assumed it reflects accurately the nature of your actual document. Therefore, it has assumed that the questions and answers are in font "Helvetica", whilst the subtopic headings are not; and it also assumes that a hash sign ("#") only occurs as part of an answer's unique id, and won't occur in the text content of a question or answer itself.

JMichaelTX · July 16, 2018, 6:03am

OK, turned out to be easier than I thought.

Given This

Pages Source Document

Excel Receiving Document

Then this script will transfer the Q&A sets from Pages to Excel.
As always, feel free to post any questions, comments, issues, and/or suggestions that you might have concerning this macro.

Running:
Pages 7.1 (5683)
Microsoft Excel 14.7.2 (Office 2011)
macOS 10.12.6

Example Results

NOTE: The dup Q&A is in the source document. Note the changed Ref# which I changed in source document.

AppleScript to Transfer

property ptyScriptName : "Transfer Q&A from Pages to Excel"
property ptyScriptVer : "1.0"
property ptyScriptDate : "2018-07-16"
property ptyScriptAuthor : "JMichaelTX"

use AppleScript version "2.5" -- El Capitan (10.11) or later
use framework "Foundation"
use scripting additions

property LF : linefeed

--- Get Body Text of Current Page of Front Document of Pages ---

tell application "Pages"
  set oDoc to document 1 --  its document "Sample Pages document.pages"
  tell oDoc
    tell current page
      set bodyStr to body text as text
    end tell
  end tell
end tell

--- Extract Question, Answer, Ref# from Each Q&A Set ---

set reFind to "(?m)^(.+\\?)\\R(.+) #(\\d+)"
set quesList to my regexFind(reFind, bodyStr, true)

--- Set Excel Rows with Columns of Question, Answer, Ref# ---
### MUST Open Excel Workbook & Worksheet, & select Cell in First Col of Header Row ###

tell application "Microsoft Excel"
  activate
  set oCell to active cell -- MUST be in First Col of Header Row
  
  repeat with iQ from 1 to (count quesList)
    set oQues to item iQ of quesList
    set value of (get offset oCell row offset iQ) to item 2 of oQues
    set value of (get offset oCell row offset iQ column offset 1) to item 3 of oQues
    set value of (get offset oCell row offset iQ column offset 2) to item 4 of oQues
    
  end repeat
  
end tell
--~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
on regexFind(pFindRegEx, pSourceString, pGlobalBool) -- @RegEx @Find @Search @Strings @ASObjC @Shane
  --–––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
  (*  VER: 1.1    2017-11-22
    PURPOSE:  Find Match(s) & Return Match with All Capture Groups as List of Lists
    METHOD:    Uses ASObjC RegEx (which is based on ICU Regex)
    PARAMETERS:
      • pFindRegEx    | text |  RegEx pattern to search for
      • pSourceString | text |  Source String to be searched
      • pGlobalBool   | bool |  Set true for Global search (all matches)

    RETURNS: IF pGlobalBool:  List of Lists, one list per match
                ELSE:  Single List of first match
                Each Match List is a List of Full Match + One Item per Capture Group
                  {<Full Match>, <CG1>, <CG2>, <CG3>, ...}
                  IF CG not found, Item is returned as empty string
                If NO matches, return empty list {}

    AUTHOR:  JMichaelTX
    ## REQUIRES:  use framework "Foundation"
    REF:  
      1. 2017-11-22, ShaneStanley, Does SD6 Find RegEx Support Case Change?
           • Late Night Software Ltd., 
          • http://forum.latenightsw.com//t/does-sd6-find-regex-support-case-change/816/8
      2.  ICU RegEx Users Guide
          http://userguide.icu-project.org/strings/regexp
    --–––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
  *)
  
  local theFinds, theResult, subResult, groupCount
  
  try
    
    set pSourceString to current application's NSString's stringWithString:pSourceString
    set {theRegEx, theError} to current application's NSRegularExpression's regularExpressionWithPattern:pFindRegEx options:0 |error|:(reference)
    if theRegEx is missing value then error ("Invalid RegEx Pattern." & LF & theError's localizedDescription() as text)
    
    if (pGlobalBool) then ### FIND ALL MATCHES ###
      set theFinds to theRegEx's matchesInString:pSourceString options:0 range:{0, pSourceString's |length|()}
      
    else ### FIND FRIST MATCH ###
      set theFind to theRegEx's firstMatchInString:pSourceString options:0 range:{0, pSourceString's |length|()}
      set theFinds to current application's NSMutableArray's array()
      (theFinds's addObject:theFind)
    end if
    
    set theResult to current application's NSMutableArray's array()
    
    repeat with aFind in theFinds
      set subResult to current application's NSMutableArray's array()
      set groupCount to aFind's numberOfRanges()
      
      repeat with i from 0 to (groupCount - 1)
        
        set theRange to (aFind's rangeAtIndex:i)
        if |length| of theRange = 0 then
          --- Optional Capture Group was NOT Matched ---
          (subResult's addObject:"")
        else
          --- Capture Group was Matched ---
          (subResult's addObject:(pSourceString's substringWithRange:theRange))
        end if
      end repeat
      
      (theResult's addObject:subResult)
      
    end repeat -- theFinds
    
  on error errMsg number errNum
    set errMsg to "ASObjC RegEx ERROR #" & errNum & LF & errMsg
    set the clipboard to errMsg
    display dialog errMsg & LF & ¬
      "(error msg is on Clipboard)" with title (name of me) with icon stop
    error errMsg
    
  end try
  
  return theResult as list
end regexFind
--~~~~~~~~~~~~~~~~~~~~ END of Handler ~~~~~~~~~~~~~~~~~~~~~~~~~~~~

ccstone · July 16, 2018, 6:09am

Hey @forums2012,

Apple removed all scripting from Pages when they rewrote it all in Cocoa, and they still haven't restored all the functionality it used to have.

I personally find it to be more trouble than it's worth and almost never use it.

That said – your task is relatively easy IF you simply massage your data for pasting into Excel.

Massage Text of Front Document for Pasting into Microsoft Excel -- Place on the Clipboard.kmmacros (6.3 KB)

I've used AppleScript to extract the text and Keyboard Maestro's Find/Replace (from using the Keyboard Maestro Engine) to massage that data into a form Excel will accept for paste (on my system with Office 2011), although Paste Special might be needed with Office 2016.

-Chris

ComplexPoint · July 16, 2018, 11:01am

Rich harvest of good solutions !

Just in case this question (or the tags attached to it) provide a useful reference for later searchers, here is one more approach:

also starting with the bodyText, but
leaving Regexes aside, and
writing in JS. (tho AS would work just as well)

Taking only the body text lines which are questions or answers,

filter(
    x => x.includes('?') || x.includes('#'),
    lines(strText)
)

(see filter, lines)

partitioned, on the basis of a test, into a pair (or 'tuple') of lists:

tpl = partition(
    x => x.includes('?'),
    filter(
        x => x.includes('?') || x.includes('#'),
        lines(strText)
    )
),

(see partition)

In other words partitioning the filtered lines into:

a list of questions, and
a list of answers.

qs = fst(tpl),
as = snd(tpl)

(see fst, snd)

Having these two lists allows for a check that the count of questions and answers matches, and for
returning either a usable Excel translation in a Right channel,
or a helpful message to the user in a Left channel.

(Values which can have either a Left or Right type of content turn up, for example, as

Optional values in Swift, or
Either values in Haskell)

return qs.length === as.length ? Right(
    
    // Translation to an Excel-readable format ...

) : Left(
    "Mismatch in counts of '?' and '#' on this page."
);

(see Left, Right, and, for further down, bindLR)

From our two matching sets of questions and answers,
we can obtain a single 'zipped' list of matching Q&A pairs:

zip(qs, as)

(see zip)

and mapping some kind of string-constructing function over that list of pairs,
gives us a list of (for example, tab-delimited) lines.

map(
    pair => `${fst(pair)}\t${
        snd(pair).split(' #').join('\t')
    }`,
    zip(qs, as)
)

(see map)

A concatenation of the list of lines (as a single string) with a prepended header,

'Question\tAnswer\tUnique ID\n' +
unlines(
    map(
        pair => `${fst(pair)}\t${
            snd(pair).split(' #').join('\t')
        }`,
        zip(qs, as)
    )
)

(see unlines)

produces a usable result for the Right channel,

Right(
    'Question\tAnswer\tUnique ID\n' +
    unlines(
        map(
            pair => `${fst(pair)}\t${
                snd(pair).split(' #').join('\t')
            }`,
            zip(qs, as)
        )
    )
)

and all of this can be wrapped in a context which returns either that usable Right value or a Left channel message that no document is open in Pages,

const
    ds = Application('Pages').documents,
    lrTabbed = bindLR(
        0 < ds.length ? (
            Right(ds.at(0).bodyText())
        ) : Left('No document open in Pages'),
        strText => {
            const
                tpl = partition(
                    x => x.includes('?'),
                    filter(
                        x => x.includes('?') || x.includes('#'),
                        lines(strText)
                    )
                ),
                qs = fst(tpl),
                as = snd(tpl);

            return qs.length === as.length ? Right(
                'Question\tAnswer\tUnique ID\n' +
                unlines(
                    map(
                        pair => `${fst(pair)}\t${
                            snd(pair).split(' #').join('\t')
                        }`,
                        zip(qs, as)
                    )
                )
            ) : Left(
                "Mismatch in counts of '?' and '#' on this page."
            );
        }
    )

bindLR takes two arguments:

An Either value (i.e. either a Left(something) or a Right(something))
a function.

If the either value is a Left, bindLR just passes it straight on, unchanged.
If the either value is a Right, then bindLR:

extracts its contents from the Right wrapper,
applies the function to it,
returns the result, re-wrapped as Right(something).

As a result, nested uses of bindLR:

Pass anything in the Left channel straight through to the final result, or
continue a pipeline of function applications through the Right channel.

( This pipelining of optional values turns out to be simplifying and very useful, and is becoming central to code-structuring in newer languages like Swift and Rust )

Given either a message or some tab-delimited text, bindLR can add a further stage to the pipe-line, either passing on a message, or passing tabbed text into the clipboard:

(Full script for a KM Execute a Javascript for Automation action)

(() => {
    'use strict';

    const main = () => {
        const
            ds = Application('Pages').documents,
            lrTabbed = bindLR(
                0 < ds.length ? (
                    Right(ds.at(0).bodyText())
                ) : Left('No document open in Pages'),
                strText => {
                    const
                        tpl = partition(
                            x => x.includes('?'),
                            filter(
                                x => x.includes('?') || x.includes('#'),
                                lines(strText)
                            )
                        ),
                        qs = fst(tpl),
                        as = snd(tpl);

                    return qs.length === as.length ? Right(
                        'Question\tAnswer\tUnique ID\n' +
                        unlines(
                            map(pair => `${fst(pair)}\t${
                                    snd(pair).split(' #').join('\t')
                                }`,
                                zip(qs, as)
                            )
                        )
                    ) : Left(
                        "Mismatch in counts of '?' and '#' on this page."
                    );
                }
            ),
            lrResult = bindLR(
                lrTabbed,
                strTabbed => {
                    // standardAdditions :: () -> Application
                    const standardAdditions = () =>
                        Object.assign(Application.currentApplication(), {
                            includeStandardAdditions: true
                        });

                    return (
                        console.log(strTabbed),
                        standardAdditions().setTheClipboardTo(strTabbed),
                        Right('Tabbed text copied to clipboard from Pages Q&A')
                    );
                }
            );

        return lrResult.Left || lrResult.Right;
    };

    // GENERIC FUNCTIONS --------------------------------------

    // https://github.com/RobTrew/prelude-jxa

    // Left :: a -> Either a b
    const Left = x => ({
        type: 'Either',
        Left: x
    });

    // Right :: b -> Either a b
    const Right = x => ({
        type: 'Either',
        Right: x
    });

    // Tuple (,) :: a -> b -> (a, b)
    const Tuple = (a, b) => ({
        type: 'Tuple',
        '0': a,
        '1': b,
        length: 2
    });

    // bindLR (>>=) :: Either a -> (a -> Either b) -> Either b
    const bindLR = (m, mf) =>
        m.Right !== undefined ? (
            mf(m.Right)
        ) : m;

    // filter :: (a -> Bool) -> [a] -> [a]
    const filter = (f, xs) => xs.filter(f);

    // fst :: (a, b) -> a
    const fst = tpl => tpl[0];

    // lines :: String -> [String]
    const lines = s => s.split(/[\r\n]/);

    // map :: (a -> b) -> [a] -> [b]
    const map = (f, xs) => xs.map(f);

    // partition :: Predicate -> List -> (Matches, nonMatches)

    // partition :: (a -> Bool) -> [a] -> ([a], [a])
    const partition = (p, xs) =>
        xs.reduce(
            (a, x) =>
            p(x) ? (
                Tuple(a[0].concat(x), a[1])
            ) : Tuple(a[0], a[1].concat(x)),
            Tuple([], [])
        );

    // snd :: (a, b) -> b
    const snd = tpl => tpl[1];

    // unlines :: [String] -> String
    const unlines = xs => xs.join('\n');

    // zip :: [a] -> [b] -> [(a, b)]
    const zip = (xs, ys) =>
        xs.slice(0, Math.min(xs.length, ys.length))
        .map((x, i) => Tuple(x, ys[i]));

    // MAIN
    return main();
})();

ComplexPoint · July 16, 2018, 11:37am

PS a more efficient (perhaps also more pleasing ?) way to write:

map(
    pair => `${fst(pair)}\t${
        snd(pair).split(' #').join('\t')
    }`,
    zip(qs, as)
)

is as:

zipWith(
    (q, a) =>
        `${q}\t${a.split(' #').join('\t')}`,
    qs,
    as
)

which combines map and zip into zipWith