RegEx White Space Question

Potts_Jeff · March 31, 2023, 1:18am

I'm trying to call TextSoap and run a series of regex sequences.

I'm stuck at my last regex. I have an ingredients list that I parse using regex replacement statements. My last regex should run on any item that begins with a character that is not a vulgar fraction or a digit. So, per the example, I only wish to affect the line starting with Black.

Black pepper, ground fresh from the mill
½	cup	dry Marsala wine
1½	lbs   veal scaloppine

Expected Result

My first step is using this regex:

(?<=^\d|¼|½|¾|⅓|⅔|⅕|⅖|⅗|⅘|⅙|⅚|⅛|⅜|⅝|⅞)\s

to replace a space with a tab.
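For anyone following along outside TextSoap, here is a minimal JavaScript sketch of this first step. The helper name `tabAfterQuantity` is made up for illustration, and the sketch assumes a JavaScript engine with lookbehind support. Note that in the pattern as written the `^` anchor applies only to the `\d` alternative, so a fraction is matched anywhere in the line.

```javascript
// Hypothetical helper: replace the single whitespace character that follows
// a line-initial digit, or any vulgar fraction, with a tab.
const tabAfterQuantity = s =>
    s.replace(
        /(?<=^\d|¼|½|¾|⅓|⅔|⅕|⅖|⅗|⅘|⅙|⅚|⅛|⅜|⅝|⅞)\s/gmu,
        "\t"
    );

console.log(JSON.stringify(tabAfterQuantity("1½ lbs veal scaloppine")));
```

Lines that start with a letter, such as the Black pepper line, pass through unchanged.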

I have used:

(?<=(\b(tsp|tbsp|oz|fl oz|cup|pt|qt|gal|lb|lbs|g|kg|L|mL|ds|pn|smdg|min|dr\.|Lg|Med|Sm|Petit|Square)\b))\s

to find an ingredient and then replace a space with a tab.
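The same second step, sketched in JavaScript under the same assumptions (lookbehind support; `tabAfterUnit` is a made-up name). The only change to the pattern is that the inner groups are non-capturing, since nothing is read back from them.

```javascript
// Hypothetical helper: replace the single whitespace character that follows
// a recognised unit or size word with a tab.
const tabAfterUnit = s =>
    s.replace(
        /(?<=\b(?:tsp|tbsp|oz|fl oz|cup|pt|qt|gal|lb|lbs|g|kg|L|mL|ds|pn|smdg|min|dr\.|Lg|Med|Sm|Petit|Square)\b)\s/gu,
        "\t"
    );

console.log(JSON.stringify(tabAfterUnit("½\tcup dry Marsala wine")));
```

Only one whitespace character is replaced, so a run of several spaces after the unit leaves the extra spaces behind the new tab.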

Since

(?<=^)

captures the non-digit whitespace at the beginning of the line, I've written my regex as:

(?<=\R\s*)[a-zA-Z]\s*\D\s*

I've tried:

^[a-zA-Z](?<=^)\s*

Any ideas on how to accomplish this?

Kirby_Krieger · March 31, 2023, 1:53am

Starts with a letter?

Possibly too shallow for your needs, but as I often see the irregular steps in the surface of the bark and forget the trees, offered as an easily-dismissed queffort.

Yes, the bees in the forest contract things. The dance is complex enough. That's "quick effort".

ccstone · March 31, 2023, 5:20am

I'm amazed this pattern works...

PCRE requires that lookbehinds be fixed width, but it appears that the ICU regex flavor allows variable widths, since TextSoap, Keyboard Maestro, CotEditor, and Script Debugger 8 all work with that syntax.

(BBEdit does not.)

Try this:

(?m-s)^[a-zA-Z].+
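A rough JavaScript equivalent, as a sketch only. The post does not say what the replacement should be, so prefixing the line with two tabs is an assumption here, and `prefixLetterLines` is a made-up name. The `m` flag stands in for `(?m)`, and leaving the `s` flag off corresponds to `-s`, so `.` stops at the end of each line.

```javascript
// Hypothetical helper: find every line that starts with an ASCII letter
// and put two tabs in front of it ($& is the whole matched line).
const prefixLetterLines = s =>
    s.replace(/^[a-zA-Z].+/gm, "\t\t$&");

const sample = [
    "Black pepper, ground fresh from the mill",
    "½\tcup\tdry Marsala wine",
    "1½\tlbs   veal scaloppine"
].join("\n");

console.log(JSON.stringify(prefixLetterLines(sample)));
```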

ComplexPoint · March 31, 2023, 6:08am

In the spirit of:

How Do I Get The Best Answer in the Shortest Time to Questions on the KM Forum? - Tips & Tutorials - Keyboard Maestro Discourse

you have very helpfully given us a sample of your input data,
but you haven't yet shown us a sample of your desired output.

What are you trying to obtain?

ComplexPoint · March 31, 2023, 9:59pm

Given the screenshot you have added, which suggests (I think) 3 tabbed output columns, i.e. something like:

"\t\tBlack pepper, ground fresh from the mill\n½\tcup\tdry Marsala wine\n1½\tlbs\tveal scaloppine"

I would personally tend to use a script with a few very simple and generic pre-packaged regular expressions.

Perhaps something like:

Recipe listings in columns.kmmacros (3.4 KB)

Expand disclosure triangle to view JS source

(() => {
    "use strict";

    // As three tabbed columns: Qty, Weight, Ingredient

    // Rob Trew @2023
    // Ver 0.01

    // main :: IO ()
    const main = () =>
        lines(
            Application("Keyboard Maestro Engine")
            .getvariable("recipeListings")
        )
        .map(
            s => 0 < s.trim().length
                ? isAlpha(s[0])
                    ? `\t\t${s}`
                    : (() => {
                        const
                            ws = words(s),
                            quant = ws.slice(0, 2).join("\t"),
                            rest = ws.slice(2).join(" ");

                        return `${quant}\t${rest}`;
                    })()
                : ""
        )
        .join("\n");


    // --------------------- GENERIC ---------------------

    // isAlpha :: Char -> Bool
    const isAlpha = c =>
        (/[A-Za-z\u00C0-\u00FF]/u).test(c);


    // lines :: String -> [String]
    const lines = s =>
        // A list of strings derived from a single string
        // which is delimited by \n or by \r\n or \r.
        Boolean(s.length) ? (
            s.split(/\r\n|\n|\r/u)
        ) : [];


    // words :: String -> [String]
    const words = s =>
        // List of space-delimited sub-strings.
        // Leading and trailing space ignored.
        s.split(/\s+/u).filter(Boolean);

    // MAIN ---
    return main();
})();

UPDATE
Edited to disable the default "Trim Results" in the Execute JavaScript action

Thanks to @Nige_S for shedding light on a puzzling evaporation of leading tabs in the first line

( and thanks to @unlocked2412 for not giving up on the mystery !)

Nige_S · April 1, 2023, 2:05pm

Echoing others -- why can you not just prefix every line that begins with a letter with two tabs? In KM terms (and adding an extra "to be processed" line for good measure):

Recipe Regex.kmmacros (2.9 KB)


If that does work then your first regex looks over-complicated too -- why not
(?m)^([^a-zA-Z]+)\s+([^\s]+)\s replacing with \1\t\2\t ? Putting it all together:

Recipe Regex v2.kmmacros (3.5 KB)
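One way to sketch the two replacements outside Keyboard Maestro, in JavaScript. The name `toColumns` is made up for illustration, and two details are assumptions rather than the exact patterns in the macro: the quantity group excludes whitespace and repeats inside the capture, so that a multi-character quantity such as `1½` is kept whole, and the separators match only spaces and tabs, so that a match cannot run across a line break.

```javascript
// Hypothetical two-pass version of the replacements described above.
const toColumns = s =>
    s
    // Pass 1: a line starting with a non-letter becomes
    // quantity <tab> unit <tab> rest of line.
    .replace(/^([^a-zA-Z\s]+)[ \t]+(\S+)[ \t]+/gmu, "$1\t$2\t")
    // Pass 2: a line starting with a letter gets two leading tabs.
    .replace(/^(?=[a-zA-Z])/gmu, "\t\t");

const listing = [
    "Black pepper, ground fresh from the mill",
    "½\tcup\tdry Marsala wine",
    "1½\tlbs   veal scaloppine"
].join("\n");

console.log(toColumns(listing));
```

Pass 2 uses a lookahead so that nothing is consumed; the two tabs are simply inserted at the start of the line.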


Potts_Jeff · April 1, 2023, 3:52pm

Thanks for the insights. Your Macro looks great. I will surely try to tuck in the rest of the data processes that need to be accomplished for a nice tight solution.

My regex knowledge is based on lots of trial and error since it's not in my wheelhouse.

Thanks Again

unlocked2412 · April 1, 2023, 7:17pm

A variant using Haskell:

Download Macro(s): Recipe listings.kmmacros (8.3 KB)

Expand disclosure triangle to see "haskell" source

module Main where

import Data.Bifunctor
import qualified Data.ByteString.Lazy as BL
import Data.Char
import Data.List

recipeListing :: String -> String
recipeListing [] = ""
recipeListing (x : xs)
  | isAlpha x = "\t\t" <> [x] <> xs
  | otherwise = concat [quantity, "\t", rest]
  where
    (quantity, rest) =
      bimap (intercalate "\t") unwords $
        splitAt 2 $
          words (x : xs)

main :: IO ()
main =
  interact
    ( unlines
        . map recipeListing
        . lines
    )


Macro-Notes

Macros are always disabled when imported into the Keyboard Maestro Editor.
- The user must ensure the macro is enabled.
- The user must also ensure the macro's parent macro-group is enabled.

System Information

macOS 13.1
Keyboard Maestro v10.2

ccstone · April 2, 2023, 12:03am

RegEx101.com
BBEdit or CotEditor
- I prefer the former as even the freeware version is seriously powerful.
RegExRX
