Find and Replace Vowel Sequences

I want to insert a dot after every vowel in a sequence of text.

Here is an example text: x a y u b i m
Find Regex:

([b-df-hj-np-tv-z]) (e|i|o|u|a) ([b-df-hj-np-tv-z]) (e|i|o|u|a)

Replace Regex:

$1 $2 . $3 $4

It is inserting it only on the first one. I am getting x a . y u b i m

But, I want it to be like this: x a . y u . b i . m

I know KM has a repeat function. I don't want to use it because I don't know the length of the vowels and consonants.

The next word I want to process could be as short as a single vowel, or as long as eight vowels.

Can you guys help me please?

Your search is not looking for each vowel, it’s looking for each consonant-vowel-consonant-vowel sequence, and there’s only one of those in your example text.

To find all vowels and add a ‘.’ after you just need to find ‘(a|e|i|o|u)’ and replace ‘$1 .’
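Both points can be checked outside Keyboard Maestro. This is a minimal sketch that uses Python's re module as a stand-in for KM's regex engine (an assumption; `\1` plays the role of `$1`). The variable names are only illustrative.

```python
import re

text = "x a y u b i m"

# The original search: consonant, vowel, consonant, vowel (space separated).
cvcv = re.compile(r"([b-df-hj-np-tv-z]) (e|i|o|u|a) ([b-df-hj-np-tv-z]) (e|i|o|u|a)")
cvcv_matches = [m.group(0) for m in cvcv.finditer(text)]

# The simpler search: every single vowel, replaced by itself plus " .".
dotted_all = re.sub(r"(a|e|i|o|u)", r"\1 .", text)

print(cvcv_matches)  # only one CVCV sequence exists in the text
print(dotted_all)
```

The four-token pattern finds a single match ("x a y u"), which is why only one dot was inserted, while the one-vowel pattern dots every vowel.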

You are right. My regex was asking to match the whole sequence at once; that was a mistake on my side. But the idea was to dot the first two vowels only when another vowel follows.

So, what I actually wanted was to get x a . y u . b i m, not x a . y u . b i . m.

I am trying to apply phonological syllabification.

Insert a dot after a vowel X if X is followed by another consonant-vowel combination (a syllable); otherwise, leave it alone. That was the idea.

Find:

([b-df-hj-np-tv-z]) (e|i|o|u|a) ([b-df-hj-np-tv-z]) (e|i|o|u|a)

Replace:

$1 $2 . $3 $4 .

It gets the right result for the above case. But the problem is that it would fail on a word with 1 or 4 syllables, like: x a y u b i m d a

I am sorry, I screwed up the original request.

Could you do something like:

  1. Find character length
  2. Repeat (1) times:
    1. Find ‘(VOWEL)(.+VOWEL)’ replace ‘$1 .$2’

Bonus points for checking if the text is unchanged meaning you can end the repeat early.

Edit: you’d need to exclude dots already inserted, so ‘(a|e|i|o|u)([^.]+(a|e|i|o|u))’
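The repeat-until-unchanged idea could be sketched as follows. This uses Python rather than a KM macro (an assumption), balances the parentheses in the pattern from the edit, and the name `dot_until_stable` is hypothetical.

```python
import re

# A vowel, then a run of non-dot characters that ends in another vowel.
STEP = re.compile(r"(a|e|i|o|u)([^.]+(a|e|i|o|u))")

def dot_until_stable(text):
    """Repeat the find/replace, stopping early once a pass changes nothing."""
    # len(text) is a generous upper bound on the number of passes needed.
    for _ in range(len(text)):
        updated = STEP.sub(r"\1 .\2", text)
        if updated == text:
            break  # the early exit suggested above
        text = updated
    return text

print(dot_until_stable("x a y u b i m"))
print(dot_until_stable("x a y u b i m d a m n"))
```

Note that this dots every vowel that has any later vowel, so the second example also gets a dot after the i.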

Total character length is irrelevant because the words could contain a number of consonants.

x a y u b i m d a m n

I want the regex to change the above to:

x a . y u . b i m d a m n

But, I don't understand your number (2).

So you don’t want the output to be ‘ x a . y u . b i . m d a m n’? You only want to dot vowels that are followed by a later vowel, up to a maximum of two times?

Insert a dot after a vowel iff it is followed by [CV].

(C is for consonant, V is for vowel.)
x a y u b i m d a m n --> x a . y u . b i m d a m n

a gets a dot because it is followed by [yu] which is CV; u gets a dot because it is followed by [bi]; i doesn't get a dot because it is followed by [md] (which is CC).

There is no maximum. Insert the dot as long as the condition is satisfied.

I think I understand. You can do this using lookahead assertions.

https://www.rexegg.com/regex-lookarounds.html

On mobile at the moment, but this should do what you want, if I understand your problem and if I understand lookahead, which I’ve not used before.

Find: (a|e|i|o|u)(?= [^(a|e|i|o|u)] (a|e|i|o|u))
Replace: $1 .$2

Edit: replace should be ‘$1 .’

I think you have found the solution, though there still seems to be an issue.

For x a y u b i m d a, I am getting x a . u y u . i b i m d a or x a . u y u . i b i m d a if I put a space after $2

There is something off with the space. Thanks for the help. We are almost there.

See my edit: you only want to use $1 in the replace
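Why $2 causes trouble can be seen in a small sketch. It uses Python's re module in place of KM's engine (an assumption; `\1` and `\2` correspond to `$1` and `$2`), and the variable names are illustrative.

```python
import re

# The lookahead pattern exactly as posted above.
PATTERN = re.compile(r"(a|e|i|o|u)(?= [^(a|e|i|o|u)] (a|e|i|o|u))")
text = "x a y u b i m d a"

# Group 2 is captured inside the lookahead. The lookahead does not consume
# any text, so the next vowel stays in place and \2 inserts a second copy.
with_group_2 = PATTERN.sub(r"\1 .\2", text)

# Using only group 1 adds the dot and leaves the following text untouched.
only_group_1 = PATTERN.sub(r"\1 .", text)

print(with_group_2)
print(only_group_1)
```

Only the first vowel is part of the match itself, so it is the only thing the replacement needs to put back.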

Yes, it is working. Wonderful

Thank you so much.

And I learnt that lookahead is a thing in regex — that’s really useful for me. :blush:

And it is going to save me many (probably hundreds of) hours of manual labor. I am really grateful!

I think this can be shortened to ([aeiou])(?= [^aeiou] [aeiou])

If that does the same thing then 1) it’s clearer when you look back at it in 5 years’ time 2) it may run faster, which is important if you have a large dataset.
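One way to check that the two versions agree is to run both over the examples from this thread. This is a sketch in Python rather than KM (an assumption), and `syllabify` is a hypothetical helper name.

```python
import re

LONG = re.compile(r"(a|e|i|o|u)(?= [^(a|e|i|o|u)] (a|e|i|o|u))")
SHORT = re.compile(r"([aeiou])(?= [^aeiou] [aeiou])")

def syllabify(text, pattern=SHORT):
    """Add ' .' after each vowel that is followed by ' C V'."""
    return pattern.sub(r"\1 .", text)

samples = ["a", "x a y u b i m", "x a y u b i m d a", "x a y u b i m d a m n"]
results = {s: syllabify(s) for s in samples}
same = all(syllabify(s, LONG) == syllabify(s, SHORT) for s in samples)

for source, dotted in results.items():
    print(source, "-->", dotted)
print("long and short agree:", same)
```

On these inputs the two patterns give identical output. They differ only on the characters (, | and ), which the longer character class also excludes. Both treat any non-vowel character as a "consonant", so input containing digits or punctuation between the spaces may need the explicit consonant class from the first post instead.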

Yes, it is working.

I know this was marked as solved. My solution would be very different. I would submit any word to a website which provides a phonetic spelling and return the results. For example, if my word was "antiestablishment," I would use a KM macro to get the result of this website:

http...//www.merriam-webster.com/dictionary/antiestablishment

Then I would use a filter to extract only the phonetic spelling of the word:

an·​ti·​es·​tab·​lish·​ment

In fact, if I really wanted to impress someone, I would also download the audio file that's visible on that page so KM can actually read the word aloud.

filter to extract only the phonetic spelling of the word

That looks like an interesting possibility, and underscores the fact that @Desalegn hasn't yet spelled out the context or the goal.

Focusing on a blockage to one possible solution, rather than explaining the problem itself, is understandable, but in practice it does always diminish the quality and speed of the harvested responses.

Is syllabic segmentation the real goal here ?

(a rule of thumb is that if a regular expression has grown to longer than c. 10 characters, then using a regex could well be creating more problems than it is solving)

I am working on a non-English, indeed barely known language.

For that, I couldn't find a better option than regex. I am not going to rely on it fully, but it is going to facilitate the process and reduce the manual work to a large extent.

For now the exact task is converting from the IPA system to the X-SAMPA system, so that the data can be fed to a computer for developing Speech-to-Text.

That is the context.

Understood – is the syllabic model of this phonology or lexicon actually recursive ?

( I wondered about the presence of the word 'recursive' in the thread title. The key limitation of the 'finite automaton' model which underlies regular expressions is that it can't encode recursion.

One or two regular expression engines do attempt a bolt-on fudge – some special syntax – to get around that issue, but if what you are trying to parse is genuinely recursive, then a regular expression is unlikely to be a natural fit. )

@ComplexPoint
Actually, I used the term "recursive" to mean that it would keep on matching to the end of the word. The right word turns out to be "lookahead" or "lookbehind", as @stevelw correctly noted; at least, that is how he solved the problem.

My regex was stopping on the first match.

  • I wanted it to match and replace the first syllable --> proceed to the second syllable --> to the third, until it ends.

I will remove the term if it is going to cause confusion for future readers.
