RegEx Help To Condense a List of Data Records

I have a file that contains variations of the data below. The formatting is consistent throughout the file, and each record has these fields:

Last name
First Name
Middle Initial (if available)
Date
Chart
Status

I want to ignore the first line and then add each person's last and first name to a variable, leaving out the middle initial for those who have one (for example, "B" is the middle initial in "Stie, Urry B"). I need to know who is in the list, not how many times they appear in it.

So the new variable using the data below would contain this:

Account, Paug
Ggus, Idy
Fere, Psdsdsd
Garn, Psdsdsd
Kee, Jee
Merian, Psdsdsd
Potr, Mjmy
Pes, Ger
Stie, Urry
Tin, En
lls, eve

I have been trying to do this with RegEx, working from other posts, but I haven't quite been able to figure it out. Maybe there is a better or different way to get the information.

This is where I am so far, so close but yet so far...

(?sm)^([^\t]+?) \t.+?(?:\n(?!\1)|\Z)

Your help is always appreciated!

Sample Data

Appointment Provider Name	Appointment Date	Patient Acct No	Chart Lock Status
Account, Paug 	2022-02-14	E2324942	Unlocked
Ggus, Idy L	2022-02-07	E2633609	Unlocked
Fere, Psdsdsd 	2022-02-01	E1434073	Unlocked
Garn, Psdsdsd 	2022-01-15	E2427959	Unlocked
Garn, Psdsdsd 	2021-12-29	E2530721	Unlocked
Garn, Psdsdsd 	2021-12-29	E2565698	Unlocked
Garn, Psdsdsd 	2021-12-29	E2565727	Unlocked
Garn, Psdsdsd 	2021-12-29	E2566921	Unlocked
Kee, Jee A	2022-02-02	E2394662	Unlocked
Kee, Jee A	2022-02-02	E2517265	Unlocked
Kee, Jee A	2022-02-02	E2529267	Unlocked
Kee, Jee A	2022-02-02	E2548321	Unlocked
Kee, Jee A	2022-02-02	E2628612	Unlocked
Kee, Jee A	2022-02-02	E2628640	Unlocked
Kee, Jee A	2022-02-02	E2628646	Unlocked
Kee, Jee A	2022-02-02	E2629113	Unlocked
Kee, Jee A	2022-02-02	E901997	Unlocked
Merian, Psdsdsd 	2022-02-10	E2635549	Unlocked
Meriian, Psdsdsd 	2022-02-10	E2635559	Unlocked
Potr, Mjmy 	2021-04-08	E2183611	Unlocked
Pes, Ger 	2022-02-14	E2638587	Unlocked
Stie, Urry B	2020-07-06	E975501	Unlocked
Stie, Urry B	2020-09-28	E1329262	Unlocked
Stie, Urry B	2022-02-14	E2639022	Unlocked
Stie, Urry B	2022-02-14	E2639025	Unlocked
Stie, Urry B	2022-02-14	E2639041	Unlocked
Stie, Urry B	2022-02-14	E883741	Unlocked
Tin, En L	2022-01-19	E1051821	Unlocked
Tin, En L	2022-01-19	E1319091	Unlocked
Tin, En L	2022-01-19	E1319095	Unlocked
Tin, En L	2022-02-02	E2629037	Unlocked
Tin, Enn L	2022-02-02	E2629098	Unlocked
lls, eve 	2022-02-03	E1323245	Unlocked
lls, eve 	2022-02-03	E2232045	Unlocked

You're doing four things here:

  • Eliminating the first line.
  • Extracting the first field (everything up to the first tab character).
  • Deleting the middle initial, if it exists.
  • Deleting duplicates.

Rather than trying to do all of these in a single action, try doing them one at a time. To me, the natural approach is a shell pipeline of four separate commands:

sed '1d' | cut -f1 | sed 's/ [^ ]$//' | uniq

where the individual steps are separated by pipe (|) characters, and the output of each command is passed on as the input to the next. Here's a macro that does it:

(I've tried to upload the macro itself several times, but I keep getting the error "Sorry, but the file you provided is empty." The file is definitely not empty, so there seems to be a temporary bug somewhere.)

The first sed command deletes Line 1. The cut command extracts the first field of each line. The second sed searches for lines that end with a space followed by a nonspace and replaces those two characters with an empty string. The final command eliminates duplicated lines.
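As a quick check of those four stages, here is a small sketch that runs the same pipeline on a handful of rows copied from the sample data. The function names `sample` and `condense` are only for illustration and are not part of the macro.

```shell
# A few tab-separated rows from the sample data, header line first.
sample() {
  printf 'Appointment Provider Name\tAppointment Date\tPatient Acct No\tChart Lock Status\n'
  printf 'Account, Paug \t2022-02-14\tE2324942\tUnlocked\n'
  printf 'Ggus, Idy L\t2022-02-07\tE2633609\tUnlocked\n'
  printf 'Garn, Psdsdsd \t2022-01-15\tE2427959\tUnlocked\n'
  printf 'Garn, Psdsdsd \t2021-12-29\tE2530721\tUnlocked\n'
  printf 'Kee, Jee A\t2022-02-02\tE2394662\tUnlocked\n'
  printf 'Kee, Jee A\t2022-02-02\tE2517265\tUnlocked\n'
}

# The four-stage pipeline from above, reading from standard input:
#   1. drop the header, 2. keep the first field,
#   3. strip a one-character middle initial, 4. collapse adjacent duplicates.
condense() {
  sed '1d' | cut -f1 | sed 's/ [^ ]$//' | uniq
}

sample | condense
```

With these rows the result is four names. Names that had no middle initial keep the space that sits before the tab in the data (`Account, Paug ` rather than `Account, Paug`). That does not affect duplicate removal, because every copy of a given name carries the same trailing space.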

The uniq command works because your input seems to be already sorted. If it weren't, I'd have used

sed '1d' | cut -f1 | sed 's/ [^ ]$//' | sort -u

which sorts the lines first to put duplicates next to one another and then eliminates them. It's probably safer to use this even if you don't expect the input to ever be unsorted.
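To make the difference concrete, here is a sketch on a made-up, unsorted list of names (the `names` function is hypothetical test input, not data from the file). The last line uses `awk '!seen[$0]++'`, a common idiom that is not from this thread, in case the original order of first appearance matters more than sorted output.

```shell
# Hypothetical unsorted input: duplicates exist but are not all adjacent.
names() {
  printf 'Kee, Jee\nGarn, Psdsdsd\nKee, Jee\nGarn, Psdsdsd\nGarn, Psdsdsd\n'
}

# uniq only collapses duplicates that sit on neighbouring lines,
# so the repeated names separated by other lines survive.
adjacent_only=$(names | uniq)

# sort -u sorts first, so every duplicate is removed.
fully_unique=$(names | sort -u)

# Alternative: keep the first occurrence of each line, in input order.
order_kept=$(names | awk '!seen[$0]++')

printf 'uniq:\n%s\n\nsort -u:\n%s\n\nawk:\n%s\n' \
  "$adjacent_only" "$fully_unique" "$order_kept"
```

Here `uniq` still returns four lines, while `sort -u` and the awk variant each return two. Note that `sort -u` reorders the names, and the exact order can depend on the locale.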

I'm sure there are native Keyboard Maestro ways to do each of these steps, and you may prefer that solution, but this problem really fits well with Unix command line tools.

Thank you very much for the explanation and the help, works perfectly!

I like @drdrang's approach to this problem, however when I use sed I try (if I can) to consolidate its commands.

Like so:

Condense a List of Data Records v1.00.kmmacros (7.4 KB)

Getting the job done with the minimum amount of fuss is the most important thing, but the aesthetics of a single command call appeal to me. 😎
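The macro's shell command is not reproduced in the text of this thread, so the following is only a sketch of what a consolidated call could look like, not necessarily what the macro contains. It assumes the same tab-separated input, folds the `cut` and both `sed` steps into one `sed` invocation, also trims the trailing space left on names without a middle initial, and still hands de-duplication to `sort -u`. The names `tab`, `sample`, and `condense` are illustrative.

```shell
# A literal tab, so the sed script does not depend on \t being understood.
tab=$(printf '\t')

# A few tab-separated rows from the sample data, header line first.
sample() {
  printf 'Appointment Provider Name\tAppointment Date\tPatient Acct No\tChart Lock Status\n'
  printf 'Kee, Jee A\t2022-02-02\tE2394662\tUnlocked\n'
  printf 'Account, Paug \t2022-02-14\tE2324942\tUnlocked\n'
  printf 'Ggus, Idy L\t2022-02-07\tE2633609\tUnlocked\n'
  printf 'Garn, Psdsdsd \t2022-01-15\tE2427959\tUnlocked\n'
  printf 'Kee, Jee A\t2022-02-02\tE2517265\tUnlocked\n'
}

condense() {
  # 1d                 delete the header line
  # s/<tab>.*//        keep only the first field
  # s/ [^ ]\{0,1\}$//  drop a trailing space, plus one initial if present
  sed -e '1d' -e "s/${tab}.*//" -e 's/ [^ ]\{0,1\}$//' | LC_ALL=C sort -u
}

sample | condense
```

Setting `LC_ALL=C` for `sort` is a choice made here to keep the ordering predictable across machines. It can be dropped if locale-aware ordering is preferred.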

Performing this task with only Keyboard Maestro native actions gets a little more complicated, particularly since KM doesn't have a unique filter. But you can still get the job done.

Condense a List of Data Records (KM-Native-Only) v1.00.kmmacros (12 KB)

-Chris

Awesome, thank you so much!

A footnote for later visitors:

Keyboard Maestro's Execute a JavaScript for Automation action also serves well here, and may provide more flexibility.

Data records condensed.kmmacros (3.8 KB)

JS Source:
(() => {
    "use strict";

    const kme = Application("Keyboard Maestro Engine");

    return [
            ...new Set(
                kme.getvariable("dataRecords")
                .split(/[\r\n]+/u)
                .slice(1)
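                .map(
                    // Keep the first tab-separated field, then drop a
                    // trailing one-character middle initial if present.
                    x => x.split(/\t/u)[0]
                    .trim()
                    .replace(/\s+\S$/u, "")
                )
                .filter(x => x.length > 0) // skip blank lines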
            )
        ]
        .sort()
        .join("\n");
})();

FWIW, a Haskell solution to this problem using Hutton's parser combinators library. Only the first name and surname are retained (the middle initial is discarded).

Condensed data records.kmmacros (8.0 KB)

Haskell Source:
{-# LANGUAGE LambdaCase #-}
{-# LANGUAGE TupleSections #-}

module Parsing
  ( module Parsing,
    module Control.Applicative,
  )
where

import Control.Applicative
import Control.Monad (replicateM, void)
import Data.Bifunctor
import Data.Char
import Data.List
import Data.Map.Strict (fromList)
import Data.String

between open close p = open >> p >>= \v -> close >> pure v

-- Based on functional parsing library from chapter 13
-- of Programming in Haskell,
-- Graham Hutton, Cambridge University Press, 2016.

newtype Parser a
  = P (String -> [(a, String)])

parse :: Parser a -> String -> [(a, String)]
parse (P p) = p

item :: Parser Char
item =
  P
    ( \case
        [] -> []
        (x : xs) -> [(x, xs)]
    )

instance Functor Parser where
  -- fmap :: (a -> b) -> Parser a -> Parser b
  fmap f p =
    P
      ( \inp ->
          case parse p inp of
            [] -> []
            [(v, out)] -> [(f v, out)]
      )

instance Applicative Parser where
  -- pure :: a -> Parser a
  pure v = P (\inp -> [(v, inp)])

  -- <*> :: Parser (a -> b) -> Parser a -> Parser b
  (<*>) pg px =
    P
      ( \inp ->
          case parse pg inp of
            [] -> []
            [(g, out)] -> parse (fmap g px) out
      )

instance Monad Parser where
  -- (>>=) :: Parser a -> (a -> Parser b) -> Parser b
  p >>= f =
    P
      ( \inp ->
          case parse p inp of
            [] -> []
            [(v, out)] -> parse (f v) out
      )

-- Making choices
instance Alternative Parser where
  -- empty :: Parser a
  empty = P (const [])

  -- (<|>) :: Parser a -> Parser a -> Parser a
  p <|> q =
    P
      ( \inp ->
          case parse p inp of
            [] -> parse q inp
            [(v, out)] -> [(v, out)]
      )

-- Derived primitives
-- sepBy1 p sep = liftM2 (:) p (many (sep >> p))

sepBy1 :: (Alternative f, Monad f) => f a1 -> f a2 -> f [a1]
sepBy1 p sep = (:) <$> p <*> many (sep >> p)

sepBy2 p = (:) <$> p <*> many p

commaSep p = sepBy1 p (string ",")

tagSep p = sepBy1 p (string " @")

count :: Int -> Parser a -> Parser [a]
count = replicateM

satisfy :: (Char -> Bool) -> Parser Char
satisfy p = item >>= go
  where
    go x
      | p x = pure x
      | otherwise = empty

digit :: Parser Char
digit = satisfy isDigit

digits :: Parser [Char]
digits = some digit

lower :: Parser Char
lower = satisfy isLower

upper :: Parser Char
upper = satisfy isUpper

letter :: Parser Char
letter = satisfy isAlpha

alphanum :: Parser Char
alphanum = satisfy isAlphaNum

char :: Char -> Parser Char
char x = satisfy (== x)

noneOf :: String -> Parser Char
noneOf cs = satisfy (`notElem` cs)

oneOf :: String -> Parser Char
oneOf cs = satisfy (`elem` cs)

string :: String -> Parser String
string [] = pure []
string (x : xs) = char x >> string xs >> pure (x : xs)

ident :: Parser String
ident = lower >>= \x -> many alphanum >>= \xs -> pure (x : xs)

constructor :: Parser String
constructor = upper >>= \x -> many alphanum >>= \xs -> pure (x : xs)

nat :: Parser Int
nat = read <$> some digit

int :: Parser Int
int = (char '-' >> nat >>= \n -> pure (-n)) <|> nat

-- Handling spacing
space :: Parser ()
space = void $ many (satisfy isSpace)

quoted :: Parser [Char]
quoted = (oneOf "'\"") >>= \x -> (many (noneOf "'\"")) >>= \xs -> (oneOf "'\"") >>= \y -> pure (x : xs <> [y])

singleQuoted :: Parser [Char]
singleQuoted = between (char '\'') (char '\'') (many (satisfy ('\'' /=))) >>= \xs -> pure ("'" <> xs <> "'")

doubleQuoted :: Parser [Char]
doubleQuoted = between (char '"') (char '"') (many (satisfy ('"' /=))) >>= \xs -> pure ("'" <> xs <> "'")

digitString :: Parser String
digitString = many (satisfy isDigit)

naturalNumber :: Parser Int
naturalNumber = read <$> digitString

token :: Parser a -> Parser a
token p = space >> p >>= \v -> space >> pure v

identifier :: Parser String
identifier = token ident

natural :: Parser Int
natural = token nat

integer :: Parser Int
integer = token int

symbol :: String -> Parser String
symbol xs = token (string xs)

-- Surname followed by a comma and a first name
-- (any middle initial after the first name is left unparsed).
p :: Parser [Char]
p =
  many letter
    >>= \n ->
      token (char ',')
        >> many letter >>= \s ->
          pure (n <> ", " <> s)

interact' :: (String -> String) -> IO ()
interact' f = do
  path <- getContents
  s <- readFile path
  putStr (f s)

main :: IO ()
main =
  interact' $
    unlines
      . concatMap (map fst . parse p . takeWhile (/= '\t'))
      . lines