Find all words with special characters [SOLVED]

alltiagocom · April 6, 2024, 11:38am

In my language (Portuguese) there are a lot of special "characters" (maybe they are called accents?) such as é, ã, etc...
Since my computer is all in English, because that's how I communicate 99.9% to the time, all my notes are in English, etc, it's very rare that I type in Portuguese.
Once in a while I save some words in the Text Replacement pane.
Today I had to write a looooong letter in Portuguese and since I'm already replacing some of the words with the accented words (for example when I wanted to type "healthy" in Portuguese, I typed "saudavel", but the real word is "saudável"). So since all words are now fixed, I would like to create a macro that searches all words that are not accented and delete them so I can have a list of all accented words. Then I can go and start adding those to the Text Replacement pane.

Is there a RegEx formula that includes all accents or do I have to manually type each option and then find a way to compare each word in the letter with each accented character?

Also, how would I be able to find duplicates? I know that I use some words a lot and I would like to just keep one of those words instead of 30 versions on the same word in the list.

EDIT: I found that this will find all accented characters individually [à-üÀ-Ü]
For context, in Portuguese there are 4 variants: words that are a single character such as É (this means something like it is, depending on context), then the accented characters could be in the beginning Área, middle saudável, or end ficará.

So I need to find a way to include all 4 options. I understand the concept, kinda, but I can't find the way to achieve it in RegEx

freewind1974 · April 6, 2024, 11:46am

Hello,
this might be a good starting point (you might personalize it even further):

[^(A-Za-z0-9 )]

This excludes (^) all characters A-Z, a-z, numbers 0-9 and space.
If you need to exclude more characters, simply add them before the final space.

Jan

alltiagocom · April 6, 2024, 11:50am

This doesn't seem to work that well:

Using my pattern seems to accomplish something, but I can't seem to find a way to expand it...

freewind1974 · April 6, 2024, 12:33pm

I edited my previous answer, I'm sure it's more understandable now.

J

Nige_S · April 6, 2024, 12:34pm

The trick is that you want to find all words that contain accented characters -- or, conversely, delete every word that doesn't -- so you'll need to include "word boundary" metacharacters in your pattern. \b matches a "word boundary", while \B matches any character that isn't a word boundary.

If going the deletion route you'll need to replace non-accented words with something so everything doesn't run together -- a new line seems a reasonable choice:

"Replace any word that doesn't contain at least one of the listed accented characters with a new-line."

When run on your text

EDIT: I found that this will find all accented characters individually [à-üÀ-Ü]
For context, in Portuguese there are 4 variants: words that are a single character such as É (this means something like it is , depending on context), then the accented characters could be in the beginning Área , middle saudável , or end ficará .

...that will return

à
üÀ
Ü
É
Área
saudável
ficará.

Is that near enough for starters?

You could then use a "Filter" action to sort the lines but, AFAIK, there's no easy way to remove duplicates. So I'd turn to a quick shell script: sort | uniq and do everything in one.

Example macro:

List Accented Words.kmmacros (4.3 KB)

Image

This could certainly be improved -- I'm sure there's a better, more general, way to determine accented characters -- but this may be a good start, or even "good enough" for a one-off.

ComplexPoint · April 6, 2024, 12:44pm

One quick sketch:

(UPDATED for punctuation and a fuller Unicode block)

(UPDATED again to converge case variants, and avoid word1/word2 conflations).

(FURTHER UPDATED (in failed attempt to improve Portuguese sort order)

(FINAL update – better Portuguese sort order, I think)

(and subsequently delegated segmentation to Intl.Segmenter)

Words with Latinate diacritics.kmmacros (4.5 KB)

Expand disclosure triangle to view JS source

const
    diacritics = /[\u00c0-\u00ff]/u,
    punctuation = /\p{P}/gu,
    collator = new Intl.Collator("pt"),
    ptComparison = collator.compare,
    segmenter = new Intl.Segmenter(
        "pt", {granularity: "word"}
    );

const accented = Object.keys(
    Array.from(
        segmenter.segment(
            kmvar.local_Source
            .replace(punctuation, " ")
            .toLocaleLowerCase()
        )[Symbol.iterator]()
    )
    .reduce(
        (dict, item) => {
            const w = item.segment.trim();

            return !(w in dict) && diacritics.test(w)
                ? Object.assign(
                    dict, {
                        [w]: 1
                    }
                )
                : dict;
        },
        {}
    )
);

return accented
.sort(ptComparison)
.join("\n");

alltiagocom · April 6, 2024, 12:55pm

Oh wow this seems to do both things in one...
I noticed that the word ódio wasn't there, though.
It's a lower case ó in the beginning of a sentence. Maybe something in the script isn't including that option?

alltiagocom · April 6, 2024, 12:59pm

Thanks for sharing that.
I don't remember who mentioned that boundary trick, but I had it in my notes already.
Since it's something I don't use very often, I completely forgot about it.

I noticed that the üÀ was seen as a group, though, even tough that's not a combination we have in Portuguese, but I wonder why that is?
We have some words where we have 2 accented characters such as acção for example

ComplexPoint · April 6, 2024, 1:10pm

My typo – ended the unicode block a few characters short – should be:

[\u00c0-\u00ff]

See: Latin-1 Supplement - Wikipedia

alltiagocom · April 6, 2024, 1:19pm

I tested it and it indeed adds the word ódio, but it also adds the comma:

Not that this is a big deal, I can always run a Search and Replace action, but maybe you know what that's happening so we can have a script that doesn't include other characters?

I'm already super happy with the current macro anyway
Thank you so much for taking the time to create this.

Nige_S · April 6, 2024, 1:23pm

Because it's there in the sample text:

characters individually [à-üÀ-Ü]

The "-" is a word boundary, so the regex sees "üÀ" as a word.

ComplexPoint · April 6, 2024, 1:23pm

Now updated (in the orginal post) to prune out punctuation:

punctuation = /\p{P}/gu

alltiagocom · April 6, 2024, 1:25pm

Oh ok. It makes perfect sense.

alltiagocom · April 6, 2024, 1:33pm

Thanks again!

I tested it and it now works as expected

I don't want to push my luck and it's totally fine if these 2 "requests" are too much work and time. Again, the current macro does all the work. I just noticed 2 things with my actual letter:
1 - Can we ignore words that are the same? For example the words Não, não, or NÃO, all mean the same. There's no difference between Ã and ã. Would it be simple to implement this? Basically when I use the Text Replacement, I always add the lower case version of words, because when it's the beginning of the sentence it always capitalizes the letter anyway.

2 - I noticed that when I have a slash / in between words, it puts the words together, for example this: víamos/falávamos became víamosfalávamos and unless I use it like this víamos / falávamos with a space before and after the slash, it will always show it as a single word.

Again, the current macro is already a blessing and I appreciate your contribution, if these 2 requests are too complex and time consuming

Nige_S · April 6, 2024, 1:41pm

I left my version case sensitive, to account for proper nouns etc (I also turn off auto-capitalisation wherever possible -- the control freak in me ).

But if you want to make it case insensitive just add the -i to uniq, so you have sort | uniq -i in the shell script action.

ComplexPoint · April 6, 2024, 1:52pm

UPDATED again in the original post to fix these two issues (lower-cased, and replacing punctuation with a space – rather than empty string – to preserve word boundaries)

    .replace(punctuation, " ")
    .toLocaleLowerCase()

alltiagocom · April 6, 2024, 1:57pm

For this particular scenario, the Text Replacement, it doesn't make a big difference. It's even better that I use the lowercase, because if it saves the word Não instead of não, then it will look wrong in the middle of a sentence, but in the beginning of the sentence it will auto-capitalize anyway.

What do you mean?

I just tried your solution and it's not removing the duplicates

and it's also not sorting alphabetically, at least not "logically" speaking.
It seems to first sort all words where the first letter is capital, but no accented character. Then it sorts all lowercase first letter, no accent.
Then first letter capitalized and accented
Finally first letter lower case and accented

alltiagocom · April 6, 2024, 2:01pm

Thank you again!
I noticed that it indeed does what you said.
One thing I noticed, and that seems to be an "issue" with @Nige_S macro as well: it's sorting the words that don't start with an accented character, and then sorts the ones with accented first character, like this:

Is this fixable or is it a limitation of those characters?

For example in Finder it doesn't make a difference:

EDIT: I just added a Filter set to "Sort" and it works, in case that's complicated to do with the script itself. Anyway, thank you so much for your help on this! Super valuable stuff!

ComplexPoint · April 6, 2024, 2:11pm

Further updated above to improve Portuguese sort order:

const collator = new Intl.Collator(
    "pt", {
        numeric: true,
        sensitivity: "base"
    }
);

const ptComparison = (a, b) =>
    collator.compare(a.name, b.name);

.sort(ptComparison)

(but not sure that that has worked – sounds as if you have a better solution)

Nige_S · April 6, 2024, 2:13pm

I'm gonna have to hit google to work that out but, in advance -- nice!

Find all words with special characters [SOLVED]

Options