Find and Replace Vowel Sequences

Desalegn · November 14, 2021, 6:25am

I want to insert a dot after every vowel in a sequence of text.

Here is an example text: x a y u b i m
Find Regex:

([b-df-hj-np-tv-z]) (e|i|o|u|a) ([b-df-hj-np-tv-z]) (e|i|o|u|a)

Replace Regex:

$1 $2 . $3 $4

It is inserting it only on the first one. I am getting x a . y u b i m

But, I want it to be like this: x a . y u . b i . m

I know KM has a repeat function. I don't want to use it because I don't know the length of the vowels and consonants.

The next word I want to process could be as short as just a single vowel, or as long as eight vowels.

Can you guys help me please?

stevelw · November 14, 2021, 8:49am

Your search is not looking for each vowel, it’s looking for each consonant-vowel-consonant-vowel sequence, and there’s only one of those in your example text.

To find all vowels and add a ‘.’ after you just need to find ‘(a|e|i|o|u)’ and replace ‘$1 .’
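For readers following along outside Keyboard Maestro, here is a minimal Python sketch of the difference between the two searches. It assumes Python's `re` module, which writes backreferences as `\1` rather than `$1` and may differ in small details from the engine Keyboard Maestro uses. The variable names are only illustrative.

```python
import re

text = "x a y u b i m"

# The original pattern needs a whole consonant-vowel-consonant-vowel run.
# Matches cannot overlap, so only "x a y u" is found in this text.
cvcv = r"([b-df-hj-np-tv-z]) (e|i|o|u|a) ([b-df-hj-np-tv-z]) (e|i|o|u|a)"
cvcv_matches = [m.group(0) for m in re.finditer(cvcv, text)]
first_attempt = re.sub(cvcv, r"\1 \2 . \3 \4", text)

# Matching each vowel on its own puts a dot after every one of them.
dotted = re.sub(r"(a|e|i|o|u)", r"\1 .", text)

print(cvcv_matches)   # ['x a y u']
print(first_attempt)  # x a . y u b i m
print(dotted)         # x a . y u . b i . m
```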

Desalegn · November 14, 2021, 9:11am

You are right. I was asking it to match the whole long pattern. That was a mistake on my side. But the idea was to dot only the first two vowels if there is an additional vowel after them.

So, what I actually wanted was to get x a . y u . b i m, not x a . y u . b i . m.

I am trying to apply phonological syllabification.

Insert a dot after a vowel X if X is followed by another consonant-vowel combination (a syllable); otherwise, leave it alone. That was the idea.

Find:

([b-df-hj-np-tv-z]) (e|i|o|u|a) ([b-df-hj-np-tv-z]) (e|i|o|u|a)

Replace:

$1 $2 . $3 $4 .

This gets the right result for the above case. But the problem is that it would fail if I have a word with 1 or 4 syllables, like: x a y u b i m d a

I am sorry, I messed up the original request.

stevelw · November 14, 2021, 9:28am

Could you do something like:

Find character length
Repeat (1) times:
1. Find ‘(VOWEL)(.+VOWEL)’ replace ‘$1 .$2’

Bonus points for checking if the text is unchanged meaning you can end the repeat early.

Edit: you’d need to exclude dots already inserted, so ‘(a|e|i|o|u) ([^.]+(a|e|i|o|u))’
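The repeat-until-unchanged idea above can be sketched in Python as follows. This is only an illustration under assumptions: the function name `dot_until_stable` is made up, the replacement is written as `\1 . \2` to keep the tokens space-separated, and the loop stops as soon as a pass changes nothing (the "bonus points" early exit).

```python
import re

VOWEL = "[aeiou]"
# A vowel, a space, then a stretch containing no dot that ends in a later vowel.
PATTERN = re.compile(rf"({VOWEL}) ([^.]+{VOWEL})")


def dot_until_stable(text: str) -> str:
    """Re-apply the substitution until a pass leaves the text unchanged."""
    while True:
        updated = PATTERN.sub(r"\1 . \2", text)
        if updated == text:
            return text
        text = updated


print(dot_until_stable("x a y u b i m"))  # x a . y u . b i m
```

Note that this approach dots every vowel that has any later vowel, so "x a y u b i m d a" becomes "x a . y u . b i . m d a". That is not the consonant-vowel rule that is spelled out later in the thread.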

Desalegn · November 14, 2021, 9:35am

Total character length is irrelevant because the words could contain a number of consonants.

x a y u b i m d a m n

I want regex to change as above:

x a . y u . b i m d a m n

But, I don't understand your number (2).

stevelw · November 14, 2021, 9:39am

So you don’t want the output to be ‘ x a . y u . b i . m d a m n’? You only want to dot vowels that are followed by a later vowel, up to a maximum of two times?

Desalegn · November 14, 2021, 9:44am

Insert a dot after a vowel iff it is followed by [CV].

(C is for consonant, V is for vowel.)
x a y u b i m d a m n --> x a . y u . b i m d a m n

a gets a dot because it is followed by [yu] which is CV; u gets a dot because it is followed by [bi]; i doesn't get a dot because it is followed by [md] (which is CC).

There is no maximum. Insert the dot so far as the condition is satisfied.

stevelw · November 14, 2021, 9:54am

I think I understand. You can do this using lookahead assertions.

https://www.rexegg.com/regex-lookarounds.html

On mobile at moment, but this should do what you want if I understand your problem and if I understand lookahead which I’ve not used before.

Find: (a|e|i|o|u)(?= [^(a|e|i|o|u)] (a|e|i|o|u))
Replace: $1 .$2

Edit: replace should be ‘$1 .’
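As a rough check of the lookahead approach, here is a small Python sketch using the pattern exactly as posted, with only `$1` (written `\1` in Python) in the replacement. The function name `syllabify` is hypothetical, and Python's `re` module is assumed to behave like the Keyboard Maestro engine for this pattern.

```python
import re

# A vowel, followed by a lookahead for " <non-vowel> <vowel>".
# The lookahead only inspects the following text without consuming it,
# so the next vowel can still be matched and dotted in its own right.
FIND = r"(a|e|i|o|u)(?= [^(a|e|i|o|u)] (a|e|i|o|u))"


def syllabify(text: str) -> str:
    """Insert ' .' after each vowel that is followed by a C V pair."""
    return re.sub(FIND, r"\1 .", text)


print(syllabify("x a y u b i m d a m n"))  # x a . y u . b i m d a m n
```

Because nothing inside the lookahead is consumed, a single pass is enough and no repeat loop is needed.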

Desalegn · November 14, 2021, 9:59am

I think you have found the solution, though there still seems to be an issue.

For x a y u b i m d a, I am getting x a . u y u . i b i m d a or x a . u y u . i b i m d a if I put a space after $2

There is something off with the space. Thanks for the help. We are almost there.

stevelw · November 14, 2021, 10:02am

See my edit: you only want to use $1 in the replace
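A short Python sketch of why `$2` causes the stray vowels: the second capture group sits inside the lookahead, so it still captures the upcoming vowel even though that vowel is not consumed. Putting it in the replacement copies the vowel in early, and it then appears again when the scan reaches it. This assumes Python's `re` module; exact spacing in Keyboard Maestro's output may differ.

```python
import re

FIND = r"(a|e|i|o|u)(?= [^(a|e|i|o|u)] (a|e|i|o|u))"
text = "x a y u b i m d a"

# Group 2 holds the vowel seen by the lookahead, so it gets duplicated.
with_group_2 = re.sub(FIND, r"\1 .\2", text)

# Using only group 1 leaves the text after the matched vowel untouched.
with_group_1_only = re.sub(FIND, r"\1 .", text)

print(with_group_2)       # x a .u y u .i b i m d a
print(with_group_1_only)  # x a . y u . b i m d a
```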

Desalegn · November 14, 2021, 10:03am

Yes, it is working. Wonderful

Thank you so much.

stevelw · November 14, 2021, 10:04am

And I learnt that lookahead is a thing in regex — that’s really useful for me.

Desalegn · November 14, 2021, 10:06am

And it is going to save me many (probably hundreds of) hours of manual labor. I am really grateful!

stevelw · November 14, 2021, 10:13am

I think this can be shortened to ([aeiou])(?= [^aeiou] [aeiou])

If that does the same thing then 1) it’s clearer when you look back at it in 5 years’ time 2) it may run faster, which is important if you have a large dataset.
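One way to check that the shortened pattern "does the same thing" is to run both over a few sample strings and compare. The sketch below does that in Python; the sample list and the `STRICT` variant (which reuses the consonant class from the first post) are additions for illustration, not something proposed in the thread.

```python
import re

LONG = r"(a|e|i|o|u)(?= [^(a|e|i|o|u)] (a|e|i|o|u))"
SHORT = r"([aeiou])(?= [^aeiou] [aeiou])"
# Hypothetical stricter variant: require an actual consonant letter.
STRICT = r"([aeiou])(?= [b-df-hj-np-tv-z] [aeiou])"

samples = ["x a y u b i m", "x a y u b i m d a m n", "a", "m d a"]

# Map each sample to its (LONG result, SHORT result) pair.
results = {
    s: (re.sub(LONG, r"\1 .", s), re.sub(SHORT, r"\1 .", s))
    for s in samples
}
all_equal = all(long == short for long, short in results.values())

# [^aeiou] accepts any non-vowel character, including digits.
short_on_digit = re.sub(SHORT, r"\1 .", "a 1 e")
strict_on_digit = re.sub(STRICT, r"\1 .", "a 1 e")

print(all_equal)        # True
print(short_on_digit)   # a . 1 e
print(strict_on_digit)  # a 1 e
```

The two posted patterns differ only on input that contains "(", "|" or ")", because the longer character class also excludes those characters. For data made of letters and spaces they should give the same result.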

Desalegn · November 14, 2021, 11:11am

Yes, it is working.

Sleepy · November 14, 2021, 2:17pm

I know this was marked as solved. My solution would be very different. I would submit any word to a website which provides a phonetic spelling and return the results. For example, if my word was "antiestablishment," I would use a KM macro to get the result of this website:

http...//www.merriam-webster.com/dictionary/antiestablishment

Then I would use a filter to extract only the phonetic spelling of the word:

an·ti·es·tab·lish·ment

In fact, if I really wanted to impress someone, I would also download the audio file that's visible on that page so KM can actually read the word aloud.

ComplexPoint · November 14, 2021, 3:36pm

filter to extract only the phonetic spelling of the word

That looks like an interesting possibility, and underscores the fact that @Desalegn hasn't yet spelled out the context or the goal.

Focusing on a blockage to one possible solution, rather than explaining the problem itself, is understandable, but in practice it does always diminish the quality and speed of the harvested responses.

Is syllabic segmentation the real goal here ?

(a rule of thumb is that if a regular expression has grown to longer than c. 10 characters, then using a regex could well be creating more problems than it is solving)

Desalegn · November 14, 2021, 4:06pm

I am working on a non-English, indeed barely known language.

For that, I couldn't find a better option than regex. I am not going to rely on it fully, but it is going to ease the process and cut down the manual work to a large extent.

For now, the exact task is converting IPA transcriptions to X-SAMPA so that the data can be fed to a computer for developing speech-to-text.

That is the context.

ComplexPoint · November 14, 2021, 4:28pm

Understood – is the syllabic model of this phonology or lexicon actually recursive ?

( I wondered about the presence of the word 'recursive' in the thread title. The key limitation of the 'finite automaton' model which underlies regular expressions is that it can't encode recursion.

One or two regular expression engines do attempt a bolt-on fudge – some special syntax – to get around that issue, but if what you are trying to parse is genuinely recursive, then a regular expression is unlikely to be a natural fit. )

Desalegn · November 14, 2021, 4:34pm

@ComplexPoint
Actually, I used the term "recursive" to mean that it would keep on matching to the end of the word. The right word turns out to be "lookahead" or "lookbehind", as @stevelw correctly noted; at least that is how he solved the problem.

My regex was stopping on the first match.

I wanted it to match and replace the first syllable, then proceed to the second, then the third, and so on until the word ends.

I will remove the term if it is going to cause confusion for future readers.
