RegEx Search and Replace Formatting Problem

@Nige_S

I took your advice to heart. Kindly see Section 2, which has been newly added to this version of the technical overview document.

If you think anything else is missing, please let me know.

Neither do I -- but the linked AppleScript should allow you to pump in a string and extract a date/time. As noted there I don't know enough to deal with multiple detected dates -- I'd assume you'd get an array, and dealing with that should only be a web search or two away.

As to the rest -- the point is that you can determine the attribution line. You can use the Date Data Detector to extract the date and you already have a method to extract the sender name and email.

Then I'm confused -- and not only because mail headers don't contain MIME data. extract_pass2_from_response appears to work on the body of the outgoing reply email, and that is where "unexpected" attribute lines will be found (the first attribution should be fine, because that's yours and will be whatever your localisation says it should be).

Again -- not a Python coder! But you should field test this, bouncing replies between systems set to different locales, to see how it copes. I would, but with no instruction about setting up an initial environment (as required for source "${KMVAR_Global_PythonScriptsPath/#\~/$HOME}/venv/bin/activate" ) I don't know if

is an error in my setup, an error in the script, or an error in the macro.

Excellent, now we are getting into an area I know a little about!

Agree to a certain extent in that:

  1. Attribution line can be determined but all it provides is the date, time and sender. The attribution line does not provide any of a) the recipient b) the recipient cc or c) the subject.

  2. At first I tried deriving the missing fields in that a) recipient = me b) recipient cc = blank / not important and c) subject was the same as the current e-mail.

  3. I tested the above for 10+ days and concluded that my derivation / fall back approach was flawed in a number of ways, including a) the recipient was not always me: in a number of cases I was cc'ed on the e-mail and, in some, bcc'ed b) the recipient cc, for the reason just noted, became important and necessary to determine and c) the subject changed from time to time as people edited the subject on the fly.

How did I determine and discover these nuances? I ran macOS Mail and Outlook in parallel on every e-mail for the entire 10+ day testing period. The results were accurate.

  4. I then investigated different methods of determining the missing recipient, recipient cc and subject fields as follows:

a) I first looked at the raw MIME data contained in the e-mail being responded to and determined that it contained the Missing Fields (i.e., recipient, recipient cc and subject) for all cases of the e-mail being responded to (i.e., Pass 1, using my nomenclature).

I looked more closely at the raw MIME data contained in the e-mail being responded to and discovered that it also contained the Missing Fields for most, but not all, of the cases of the previous e-mails being responded to (i.e., Pass 2, using my nomenclature). It contained the information for all Outlook generated Pass 2 e-mails and all macOS Mail Forwarded generated Pass 2 e-mails (but not macOS Mail Reply / Reply All generated Pass 2 e-mails or other generated e-mail whose metadata is not included in the Pass 1 raw MIME data file).

b) I then investigated methods of determining the Missing Fields for Pass 2 macOS Mail Reply / Reply All e-mails. This led to the Mail Store scraping because I knew -- from 4a -- that the missing information would always be contained in the raw MIME data of the previous e-mail.

c) Putting this all together means:

i) All fields (i.e., Sender, Recipient, Recipient CC, Date and Subject) are pulled from the raw MIME data. There are no exceptions. There is nothing derived, there are no fallbacks.

ii) I search at least one raw MIME data file that being the raw MIME data from the Pass 1 e-mail. If the Pass 2 e-mail was a macOS Reply / Reply All e-mail or other generated e-mail whose metadata is not included in the Pass 1 raw MIME data file then I scrape the Mail Store (over all e-mail accounts and all folders).

This data scraping resulted in searching a second raw MIME data file, which produced accurate results. The only time this failed was when the necessary e-mail had been deleted, in which case the user is informed; in all other cases it was bulletproof.

iii) The resulting Mail's reformatted headers are perfect matches to Outlook based on my testing regardless of the situation (i.e., being cc'ed / bcc'ed on an e-mail, a subject change, adding / deleting people included, etc.).

As noted above and in the document, none of the fields come from or are pulled from the body of the outgoing e-mail reply, not one!

If you look at the raw MIME data files closely you will discover that they do not contain the term "On [date] [time] [sender] wrote:".

It was therefore necessary to come up with a method of accurately finding the data / missing fields in the raw MIME data files. The sender, time and date are accurate and factual sources of the truth (as you yourself noted).

These values are used to find the corresponding entries (i.e., Sender, Recipient, Recipient CC, Date, Subject) from the raw MIME data files! They are the search criteria as detailed in Sections 6 to 9 of my document.

With that, I suggest to you that localization does not come into play and that there is a near-zero chance of "unexpected" attribute lines being found. I would be very interested in your thoughts after reading Sections 6 to 9 of my document.

Worth noting: when the underlying Pass 2 e-mail is needed (i.e., macOS Mail Reply / Reply All response) and cannot be found, then not only is the user advised but the fields are left blank. Fields are only populated from the source of the truth, the raw MIME data!

I have tested this as noted above and had no issues.

The first error suggests that there may be an edge case which the Python script is not handling properly, which is possible. Sections 14 and 15 of my document speak to the edge cases that were discovered, tested and resolved. Although I have not encountered others, it is possible that you have stumbled onto one, which I would be happy to track down.

The second error is a result of the previous e-mail not being found in the Mail Store (i.e., it has been deleted from the Mail Store). This too is noted above and in the document (i.e., the macro cannot scrape what it cannot find).

With respect to setting up the environment see the below which is exactly what I did other than naming my Python folder a bit differently but the process is identical. This should remove any and all environmental differences.

-[Python Setup] Macros.kmmacros (129.5 KB)

One more idea to consider.

What do you think about simplifying this further and making it even more robust?

What do you think about the idea of building, in real time, a database of the top-level header of every email in the Mail Store (which always includes all fields), and then rebuilding the headers based on that database by searching on the date, name and recipient in the Attribution Line?

What would you recommend for a database?

Thank you.

I can't speak to the overall concept, but sqlite3 is bundled on every Mac, and integrates well with Keyboard Maestro (via shell script commands), though it does require some work. I wrote a pretty detailed how-to a few years back:

It walks through setting up a full database, and storing and retrieving values via Keyboard Maestro.

-rob.

Appreciated.

I recall you recommending that before with the caveat that it has a steep learning curve. I will read through it and let you know.

The concern is that it may be too much to take on at once, and it may be better to get the concept working with something simple like Excel and then add sqlite3 or another database.

The current construct works well (at least for me) but a database would make it more robust. Will need to think through the time commitment.

Thanks.

Any database you add will have a steep learning curve, because you'll have to figure out how to work with it from Keyboard Maestro. And because there isn't built-in database support (for any database) in Keyboard Maestro, that will mean sending commands via scripts (be they Python, AppleScript, shell, etc.) or manipulating the GUI of the database, which seems like a really bad way to do it.

If there's no screaming urge for a database in your solution—that is, you're sitting there knowing that you can't do something with your data in its current form that you could do with it in a database—it's probably not worth the hassle of converting.

-rob.

Appreciate the words of wisdom.

I believe that the current solution is fairly robust (don't know whether you have read the document I shared). It could be more robust but that would involve moving to a database.

There are two possible reasons for taking the plunge:

  1. A more robust solution; and

  2. Learning as I think these "big projects" I have done have been very helpful.

With that, and as always, much thanks!

You'll have to explain this to me.

You capture the raw source of the email, and I suppose that there's a high probability these days that the message content is in MIME format. But the message headers aren't -- they are just plain text. (There is, of course, a MIME header to let the receiving client know which version of MIME has been used, eg: MIME-Version: 1.0.)

So:

Pass 1 field values are read directly from the Raw MIME File top-level MIME headers by field name (From, To, Subject, Date, Cc).

...should be:

...are read directly from the message headers, by field name (From, etc...)

You then say:

Format 2 — Reply or Reply All: A line matching 'On [date], at [time], [name] <[email]> wrote:' The Python Script handles all known variations.

...and list the variants handled -- all of which are quite obviously North American in style. Have you tried French? Or even UK English with its On 4 Mar 2026, at 11:51,...?

Mail's "Reply" format is localised -- your format is known, but what happens when you Reply to an email that includes someone else's version of the Mail Reply attribution? Easy enough to test, just bounce emails between two Macs with different localisations, or use one Mac and switch locales between replies.

On "raw MIME" terminology

You're right. The headers (From, To, Subject, Date, Cc) are plain text fields that precede the MIME body — calling them "raw MIME headers" is loose usage, I should have been more precise. More precisely: Pass 1 field values are read directly from the message's top-level headers by field name. I'll correct the documentation.
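To make the corrected wording concrete, here is a minimal sketch of reading those fields from the top-level message headers with Python's standard `email` package. It is not the macro's actual script: the function name `read_top_level_headers` and the sample message are made up for illustration, and a missing field is simply returned as an empty string.

```python
from email import message_from_string
from email.policy import default

# Hypothetical raw source; in practice this would be the saved raw source of the message.
RAW_SOURCE = """\
From: Joel <joel@example.com>
To: Nige S <nige@example.com>
Cc: Rob <rob@example.com>
Subject: Beer Time!
Date: Mon, 23 Feb 2026 09:05:00 +0000
Message-ID: <abc123@example.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

Would you like some cans of beer?
"""


def read_top_level_headers(raw_source):
    """Return the five fields of interest from the top-level headers only."""
    msg = message_from_string(raw_source, policy=default)
    fields = {}
    for name in ("From", "To", "Cc", "Subject", "Date"):
        value = msg.get(name)  # None when the header is absent
        fields[name] = str(value) if value is not None else ""
    return fields


headers = read_top_level_headers(RAW_SOURCE)
print(headers["From"], "|", headers["Subject"])
```

Nothing here touches the MIME body; the lookup is purely by header field name, which is the point of the terminology correction.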

On attribution line localisation

This is a real limitation and I won't pretend otherwise (subject to the caveat below). The current regex patterns cover the North American Apple Mail format (On [Month] [D], [YYYY], at [H:MM AM/PM], [Name] <[email]> wrote:) and a handful of Gmail and Outlook variants. You are correct that:

  • UK English Mail uses On 4 Mar 2026, at 11:51, ... (day-first, abbreviated month)
  • French Mail uses Le 4 mars 2026 à 11:51, ...
  • Other locales will produce further variations

The Gmail day-first year-absent format (On Mon, 2 Mar at 7:00 PM, ...) was actually encountered in real use and added as a handled variant. But systematic coverage of all Mail localisations has not been done.
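As a rough illustration of what a per-locale pattern set could look like, here is a sketch using Python's `re` module. These are not the patterns from the actual script: the labels, the group names, and the exact expressions are assumptions built only from the four formats quoted above, and the French ending "a écrit :" in the comment is likewise assumed.

```python
import re

# Shared tail: display name, <email>, then any closing words ending in a colon
# (e.g. "wrote:", or "a écrit :" which is an assumed French ending).
TAIL = r"(?P<name>[^<>]*?)\s*<(?P<email>[^<>\s]+@[^<>\s]+)>\s*[^<>]*:\s*$"

PATTERNS = [
    # On Mar 4, 2026, at 11:51 AM, Name <email> wrote:
    ("apple_mail_us", re.compile(
        r"^On (?P<date>[A-Z][a-z]+\.? \d{1,2}, \d{4}), "
        r"at (?P<time>\d{1,2}:\d{2}(?:\s?[AP]M)?), " + TAIL)),
    # On 4 Mar 2026, at 11:51, Name <email> wrote:
    ("apple_mail_uk", re.compile(
        r"^On (?P<date>\d{1,2} [A-Z][a-z]+\.? \d{4}), "
        r"at (?P<time>\d{1,2}:\d{2}), " + TAIL)),
    # Le 4 mars 2026 à 11:51, Name <email> ...:
    ("apple_mail_fr", re.compile(
        r"^Le (?P<date>\d{1,2} \w+\.? \d{4}) "
        r"à (?P<time>\d{1,2}:\d{2}), " + TAIL)),
    # On Mon, 2 Mar at 7:00 PM, Name <email> wrote:   (no year)
    ("gmail_no_year", re.compile(
        r"^On (?P<date>[A-Z][a-z]{2}, \d{1,2} [A-Z][a-z]{2}) "
        r"at (?P<time>\d{1,2}:\d{2}\s?[AP]M),? " + TAIL)),
]


def match_attribution(line):
    """Return the first matching format and its captured parts, or None."""
    stripped = line.strip()
    for label, pattern in PATTERNS:
        found = pattern.match(stripped)
        if found:
            return {"format": label, **found.groupdict()}
    return None


result = match_attribution(
    "On 23 Feb 2026, at 09:10, Joe Bloggs <jb111@example.com> wrote:")
print(result)
```

Adding a locale is then one more tuple in `PATTERNS`, but every new format still has to be seen in the wild first, which is the limitation being discussed.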

On macOS Mail Data Detectors as an alternative

The caveat: how does one avoid localisation, given the commentary on Data Detectors below, which I researched from multiple sources?

The underlying framework, NSDataDetector, can detect dates, addresses, links, and phone numbers from natural language text. In principle this means it could extract a date or email address from an attribution line regardless of locale. However, there is a hard limit: NSDataDetector has no supported type for Subject lines, recipient fields, or the attribution line as a structured whole, so it cannot solve the full parsing problem even in theory. Data Detectors are therefore not a viable alternative path, and locale-specific regex pattern coverage remains the only practical approach I can think of. Can you think of another?

Your suggested test — bouncing emails between two Macs with different locale settings — is exactly the right way to surface failures. If you're willing to share examples of attribution lines from non-North-American locales you encounter, I will add them to the pattern set. The architecture handles new variants cleanly; it's purely a matter of adding the regex patterns as they're identified, very easy to do.

Your suggested test also gave me an idea. Is there a database of all possible formats that I could work from?

Let's begin at the beginning...

It's one thing to customise your attribution line and, given Apple Mail's lack of support for such, this is a good way to do it.

But there's a certain level of arrogance in you re-writing my attribution line when you reply. Perhaps I have my own workflows that highlight my parts of the conversation, based on my attribute line? Just as importantly, if I know you are changing some of the text within the body of the reply then I have to assume you might change other bits -- the quoted reply text can no longer be trusted so why include it at all?

Which is why I think you should stick to changing only your own attribution line and skip the whole issue of parsing other people's formats.

That said...

Obviously not -- many email clients let you customise the attribute line, so the number of possible formats is infinite. But certain things will be common across most emails so you could try and use those. Off the top of my head...

Most attribute lines will:

  1. Contain a date in some format
  2. Contain an email address in some format
  3. End with a :

That last is the least reliable indicator -- it is also the easiest to add to.

So I'd start with (pseudo code):

for each_line in the_reply
   if each_line ends with `:`
      if EmailDataDetector(each_line) returns 1 address
         if DateDataDetector(each_line) returns 1 date
            -- treat as attribution line
         endif
      endif
   endif
endfor
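For comparison, the pseudo code above might translate to Python roughly as follows. This is only a sketch: the two regular expressions are crude stand-ins for the email and date Data Detectors (they are not locale-aware, and the date check only asks for at least one clock time or four-digit year rather than exactly one date), and the function names are made up.

```python
import re

# Stand-in for an email detector.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Crude stand-in for a date detector: a clock time (09:10) or a year (2026).
DATE_HINT_RE = re.compile(r"\b\d{1,2}[:.h]\d{2}\b|\b(?:19|20)\d{2}\b")


def looks_like_attribution(line):
    """Apply the three tests: ends with ':', one address, some date hint."""
    text = line.strip().lstrip(">| ").rstrip()  # drop quote markers
    if not text.endswith(":"):
        return False
    if len(EMAIL_RE.findall(text)) != 1:
        return False
    return DATE_HINT_RE.search(text) is not None


def find_attribution_lines(reply_body):
    """Return every line of the reply that passes all three tests."""
    return [ln for ln in reply_body.splitlines() if looks_like_attribution(ln)]


SAMPLE_REPLY = """\
I'll drop them round later. See you soon!

> On 23 Feb 2026, at 09:10, Nige S <nige@example.com> wrote:
>
> Yes, I'll have two please!
>
>> On 23 Feb 2026, at 09:05, Joel <joel@example.com> wrote:
>>
>> Would you like some cans of beer?
"""

candidates = find_attribution_lines(SAMPLE_REPLY)
for candidate in candidates:
    print(candidate)
```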

Apologies for the delayed response, I have been swamped with work today.

As a start I think I failed to accurately communicate what is being done by the macro. It is creating a modified header -- only when the data exists in the raw data from past and current e-mails -- to precisely mimic a response that would have been sent from Outlook instead of macOS Mail.

Should the raw data not contain the metadata then the fields are left blank; nothing is ever derived, reconstructed, or written back into any email.

To be clear, the recipient(s) receive a response that is identical to what they would receive were I using Outlook as my e-mail client; that is it. The responding text is never touched and can be fully trusted.

Where / why do you think the recipient would ever think that I am modifying any content / text?

I again apologize as I think I failed to accurately communicate what the macro is actually doing. The macro's Python script already does essentially what you are suggesting — it searches each candidate line for the presence of a date and an email address, and uses those as the basis for identifying the attribution line. So the logic is aligned with your pseudocode, just implemented differently.

I will suggest that the difference is in the mechanism. The current implementation uses regex patterns, which means it is format-dependent and locale-specific, exactly the limitation you identified. NSDataDetector would do the same job but with Apple's own localised natural language detection underneath, making it genuinely locale-independent without needing to enumerate formats. So your suggestion is conceptually aligned with what's already there as well as being a more robust version of it.

The practical constraint that led to regex being used in the first place is that NSDataDetector has no direct AppleScript or Python interface. A small Swift command-line tool that accepts a line of text and returns detected addresses and dates as JSON would bridge that gap, with Python calling it via subprocess. That's a one-time build dependency but would make the detection locale-independent going forward and is worth pursuing.

Nigel, your Data Detectors / Swift command-line tool suggestion is genuinely good and I want to be clear that I agree it would be an improvement over the current regex approach for attribution line detection. However, I've been thinking about a different architectural direction that I think addresses the root problem more fundamentally and wanted to get your thoughts.

The core issue with any attribution line parsing approach — regex, Data Detectors, or otherwise — is that the attribution line is inherently an unreliable source of data. It is rendered text, not structured data. It reflects the sender's local time with no timezone information, it is subject to client localisation, and it can be customised or malformed in ways that no detection algorithm can fully anticipate. This applies equally to forwarded message headers — Begin forwarded message: blocks introduce their own format variations across clients, and the embedded headers within them are again rendered text rather than structured data. We are essentially trying to reverse-engineer information that was never designed to be machine-parsed.

The alternative I am considering is to build a local database populated at the moment each email arrives, is sent, or is forwarded, reading directly from the top-level message headers — the same authoritative source the macro already uses for Pass 1. Specifically, for every inbound, outbound, and forwarded email, a Keyboard Maestro trigger would fire and write the following to the database:

  • From
  • To
  • Cc
  • Subject
  • Date — critically, the full RFC 2822 date including UTC offset (e.g. Wed, 4 Mar 2026 10:34:23 +0000), which eliminates all timezone ambiguity
  • Message-ID — which uniquely and unambiguously identifies every email regardless of content

When the macro needs to identify the Pass 2 email — whether it originated as a reply or a forwarded message — instead of parsing the attribution line or forwarded message header block and searching Mail with heuristics, the Python script would extract the sender email and approximate date from that block, query the database, and return the exact matching record with full verified header data. The timezone problem, the locale problem, the attribution line format problem, and the forwarded message header variation problem all effectively disappear — because the quoted block is used only as a rough search key, not as a data source.
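A minimal sketch of that lookup, using Python's built-in `sqlite3` module. Everything named here is hypothetical (the `headers` table, its columns, `ingest`, `find_candidates`). It also assumes the date taken from the quoted block is a local time with no zone, so the search window is set to 14 hours either side (the widest UTC offset in use) and the closest match is listed first.

```python
import sqlite3
from datetime import datetime, timedelta, timezone
from email.utils import parseaddr, parsedate_to_datetime


def open_db(path=":memory:"):
    """Create (or open) the hypothetical header database."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS headers (
               message_id   TEXT PRIMARY KEY,
               sender_email TEXT NOT NULL,
               from_field   TEXT,
               to_field     TEXT,
               cc_field     TEXT,
               subject      TEXT,
               date_field   TEXT,
               sent_utc     REAL NOT NULL
           )"""
    )
    return conn


def ingest(conn, hdr):
    """Store one message's top-level headers, keyed by Message-ID."""
    sent = parsedate_to_datetime(hdr["Date"])
    if sent.tzinfo is None:  # a "-0000" zone parses as naive; assume UTC
        sent = sent.replace(tzinfo=timezone.utc)
    conn.execute(
        "INSERT OR REPLACE INTO headers VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (hdr["Message-ID"], parseaddr(hdr["From"])[1].lower(), hdr["From"],
         hdr.get("To", ""), hdr.get("Cc", ""), hdr.get("Subject", ""),
         hdr["Date"], sent.timestamp()),
    )
    conn.commit()


def find_candidates(conn, sender_email, approx_naive, window_hours=14):
    """Use sender and rough date only as a search key; return stored headers."""
    centre = approx_naive.replace(tzinfo=timezone.utc).timestamp()
    span = timedelta(hours=window_hours).total_seconds()
    return conn.execute(
        "SELECT message_id, from_field, to_field, cc_field, subject, date_field "
        "FROM headers WHERE sender_email = ? AND sent_utc BETWEEN ? AND ? "
        "ORDER BY ABS(sent_utc - ?)",
        (sender_email.lower(), centre - span, centre + span, centre),
    ).fetchall()


conn = open_db()
ingest(conn, {"Message-ID": "<m1@example.com>", "From": "Joel <joel@example.com>",
              "To": "Nige S <nige@example.com>", "Subject": "Beer Time!",
              "Date": "Mon, 23 Feb 2026 09:05:00 +0000"})
rows = find_candidates(conn, "joel@example.com", datetime(2026, 2, 23, 9, 5))
print(rows)
```

If more than one row comes back (two messages from the same sender within the window), the caller would still need a tie-breaker, for example the subject or the closest timestamp.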

This is why I think it is architecturally stronger than the Data Detectors approach. Data Detectors would make attribution line parsing more robust and locale-independent, which is a genuine improvement. But it still relies on the quoted block as the source of truth, which means it still inherits all of the limitations of that source — timezone ambiguity, customised formats, missing fields, and client-specific forwarded message formatting. The database approach sidesteps those limitations entirely by going back to the authoritative source at ingestion time.

The obvious caveat is that the database is only as complete as its coverage — emails that arrived before it was built would not be in it, and the ingestion trigger needs to fire reliably on every send, receive, and forward without exception. Those are real engineering challenges. But assuming they can be solved, this feels like the right direction.

Interested in your thoughts, particularly on whether you see weaknesses in the approach that I'm not accounting for.

No, it isn't. Please get the terminology right. It's changing the attribution, the

On 23 Feb 2026, at 09:10, Joe Bloggs <jb111@example.com> wrote:

...in your reply.

If it only changed your attribution you wouldn't need the raw source and you wouldn't need to process other attribution formats -- you could use the message headers, or you could use the attribution of your reply because you know your format, it's fixed by your localisation (which is the initial problem!).

The problem is "Pass 2" -- where you are taking my attribution and changing it to your style. It's one thing to change your reply from Mail's default

I'll drop them round later. See you soon!

| On 23 Feb 2026, at 09:10, Nige S <nige@example.com> wrote:
|
| Yes, I'll have two please!
| 
|| On 23 Feb 2026, at 09:05, Joel <joel@example.com> wrote:
|| 
|| Would you like some cans of beer?

...to the Outlook-style

I'll drop them round later. See you soon!

------------------------------------
From: Nige S <nige@example.com>
Date: 23 Feb 2026, at 09:10
To: Joel <joel@example.com>
Subject: Beer Time!

Yes, I'll have two please!

| On 23 Feb 2026, at 09:05, Joel <joel@example.com> wrote:
| 
| Would you like some cans of beer?

...but quite another to change it to

I'll drop them round later. See you soon!

------------------------------------
From: Nige S <nige@example.com>
Date: 23 Feb 2026, at 09:10
To: Joel <joel@example.com>
Subject: Beer Time!

Yes, I'll have two please!

------------------------------------
From: Joel <joel@example.com>
Date: 23 Feb 2026, at 09:05
To: Nige S <nige@example.com>
Subject: Beer Time!

Would you like some cans of beer?

Hopefully my attempted formatting makes it clear what the difference is.

Of course, people can change the quoted text within their replies whenever they want -- so it should never be relied on. But there's an expectation, especially when whole-message bottom-quoting, that it isn't edited. So it may be better if you don't change other people's attribution lines just to suit your idea of what's right.

But I am not your keeper! :wink: If you do want to carry on then it will be a great learning experience, if nothing else.

@Nige_S

Apologies for not responding.

Currently 10:35 PM where I live and I am just finishing working for the day.

I will respond to you tomorrow in the morning. Did not want you to think I was ignoring.

Dude! The Forum is an async communications method -- no sensible person will mind if you don't respond instantly. There's no need to apologise :wink:

Noodling further on this (and still not recommending you do any more than Pass 1)...

I think you'll only need every inbound message. More importantly, an outgoing message doesn't have a date or message ID until after it has been sent, so the only way you can get those is to always CC or BCC yourself and process the inbound version.

Whether you need to record all that could also depend on your email habits. Whether you need to record any of it is also arguable -- Mail indexes all this stuff, it's just a matter of asking the right question:

tell application "Mail"
	set theMsg to item 1 of (every message of mailbox "Inbox" of account id "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX" whose all headers contains "Message-ID: <example.com/zHYkRIQas5ylMrQTysJMqiGw/a9n6YeQ6l4HjyvC=.017605023453035221.0344117460@joel.example.com>")
	return {sender, date sent, subject} of theMsg
end tell

--> {"Joel <joel@example.com>", date "Thursday, 5 March 2026 at 10:53:05", "Re: Beer o'clock!"}

(Obvs you'll need to set your own account id and a valid message-id if you want to try it!)

Arguable because querying your home-rolled database will probably be quicker than an all headers search of a mailbox. A compromise would be to store the message-id and corresponding Mail message id in your database -- search the DB by message-id to get the Mail id then ask Mail for the details of that message:

tell application "Mail"
	return {sender, date sent, subject} of item 1 of (every message of every mailbox of account id "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX" whose id is 11111)
end tell

Apologies for not using the correct terminology. I am new to all this and am trying to be as accurate as possible.

As to the definition of the attribution line, I understand it to be the phrase that appears immediately above a quoted message in an email chain (and is often called a "reply header" or "separator line"). It explicitly identifies who sent the original email, the date it was sent, and sometimes the time.

While we are discussing definitions: as to the definition of localization, I understand it to be the process of adapting the subject line, preheader, and overall content to meet the specific language, cultural, and regional needs of the recipient.

If these are not correct then please provide the correct definitions.

Is my understanding correct that by "your" attribution you are referring to the Pass 1 header (i.e., the attribution line that macOS Mail generates when I press Forward / Reply / Reply All)?

Is my understanding correct that you "are comfortable" with Pass 1 because it does not change any of the previous attribution lines and therefore does not leave the recipient with the concern / thought of "what else has changed?"

Is there something else that you are concerned about?

In terms of the macro (in its current state)

  1. I confirm that Pass 1 only ever uses the header information. You are correct that limiting the macro to Pass 1 would remove the need to process other attribution line formats.

  2. I confirm that the macro as currently constructed can be limited to processing only Pass 1 simply by changing the variable Local_HeaderReformatRepeatNumber to 1 (from its current value of 2).

I appreciate the example and can clearly / visually see the difference, no disagreement there.

Is it correct that your concern:

  1. Is that changing the Pass 2 attribution line may leave the recipient with the concern / thought of "what else has changed?"

  2. Is not the accuracy of the reformatted Pass 2 header as it uses the raw source data or the content as the body is never changed?

As to changing quoted text, agreed and, as you noted in your example, I am not touching the text at all!

As to changing Pass 2 attribution lines, see above.

Agree that it has been a great learning experience!

If I decide to limit this macro to Pass 1 then it will be the second great learning experience to be shelved, as the Display Preset was shelved when I found a better and easier solution. Not fussed about it at all, the learning has been terrific!

Agreed and understood.

It is just that you have been so generous with your suggestions and time, as well as extremely helpful, I wanted you to know that I was intending to respond.

Agreed with the alternative being to monitor the Sent folder thereby removing the need to CC or BCC myself.

It appears the first step, should I wish to pursue this route, is to learn about Mail's indexing capabilities and how best to access them.

Greatly appreciated.

Will certainly bounce a few ideas and AppleScripts your way should I choose to pursue this. Currently focusing on cleaning up, optimizing and standardizing everything built to date.

With respect to where to go from there, it will certainly be something to further my skills and likely between i) building a different approach here and ii) building a macro to automate the taking and summary of meeting notes, tasks and timeline between Jamie and GoodTask. :slight_smile:

To me, who has been reading this thread for general information and with no intent to implement your macro (I'm a very lazy email person :)), this would definitely get my attention: If I received a reply where the attribution line had been completely reconstructed, I would absolutely wonder if anything else had been changed.

Of course, this also points out the fundamental flaw in email: It's all just text, and can be easily changed at any point. Ideally, we should all be checking the mail thread 100% of the time to ensure someone hasn't removed a "no" or added a "you'll pay us $1,000,000" line somewhere :).

So in the overall scheme of things, I think it's a minor point, given the editability of email, but still, it's something I wouldn't expect to see in a reply chain. My replies should look like my replies originally looked.

Just my $0.02.

-rob.

Rob, appreciated and understand your concern / thinking.

The variable Local_HeaderReformatRepeatNumber is currently set to 1 (i.e., Pass 1 only)! :winking_face_with_tongue:

Will give it some more thought, though I do find it interesting because the Outlook style header adds to the readability; it does not detract from it. With that, I do understand the point that Nigel and you are raising!