Find and Replace Vowel Sequences

Desalegn · November 14, 2021, 4:40pm

Dear @Sleepy: I am achieving almost the same result actually.
Do you know what I get when I run the macro on antiestablishment?

This is the result: **a n . t i e s . t a b . l i s h . m e n t**

Given that the language I am working on has no double vowels, the result is almost perfect.

ComplexPoint · November 14, 2021, 4:44pm

If:

you bump into a noticeable number of edge cases where that finite-state (regular expression) model is hitting snags, and
you have a formal CV model which you can show us,

then we can probably translate that CV model into a parser for you, in a scripting language suitable for a KM Execute JavaScript or Execute Shell Script action.

Desalegn · November 14, 2021, 4:53pm

I actually have a number of interesting problems that are mind-bending to me, but that you guys can solve. I will think about them, frame the issue carefully (so as not to mess up as I did in this post), and post it here.

You guys are amazing. Thank you so much.

Desalegn · November 15, 2021, 8:24am

I am back guys. I am having a slight issue with the lookaround regex. Apparently, it doesn't support complex characters.

In the search field, I listed the consonants, which are normally simple characters such as b and d. But X-SAMPA also contains complex (multi-character) consonants such as tS and t_>.

In the regular regex, putting these complex consonants alongside the simple ones works fine. But the lookaround (lookahead) version is not working.

Find: ([(a|e|i|o|u)]) ([b-df-hj-np-tv-z|t_>])
finds t_>

But, ([(a|e|i|o|u)]) (?=[b-df-hj-np-tv-z|t_>])
doesn't find/recognize the complex consonant t_>

Here is the actual data:

Find:

([(a|e|i|o|u)])(?= [b-df-hj-np-tv-z|t_>] [a|e|i|o|u)])

Replace command (@stevelw's solution):
$1 .

Input:
n u t_> u n e q u

It is supposed to produce: n u . t_> u . n e . q u. But, actually, it is producing n u t_> u . n e . q u.
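
A possible explanation, sketched in Python rather than in Keyboard Maestro's own regex engine (so engine details may differ): a character class such as `[b-df-hj-np-tv-z|t_>]` matches exactly one character, so `t_>` inside the brackets is read as the separate characters `t`, `_` and `>`, and the `|` is a literal bar. Inside the lookahead, the class then matches only the `t`, and the space that should follow is not there. Writing the multi-character consonant as an alternative outside the class avoids this. The `fixed` pattern below is an illustrative assumption, not the macro's actual pattern.

```python
import re

# A character class matches ONE character: here "t", "_", ">" and "|" are
# four separate members of the class, not the symbol "t_>".
char_class = re.compile(r"[b-df-hj-np-tv-z|t_>]")
# Alternation, with the multi-character symbol listed first, matches it whole.
alternation = re.compile(r"(?:t_>|[b-df-hj-np-tv-z])")

print(char_class.match("t_> u").group())   # t
print(alternation.match("t_> u").group())  # t_>

text = "n u t_> u n e q u"

# The pattern as posted: the lookahead needs "space, ONE class character, space, vowel".
posted = re.compile(r"([(a|e|i|o|u)])(?= [b-df-hj-np-tv-z|t_>] [a|e|i|o|u)])")
posted_result = posted.sub(r"\1 .", text)
print(posted_result)  # n u t_> u . n e . q u

# Illustrative fix: the complex consonant becomes an alternative of its own.
vowel = "[aeiou]"
consonant = r"(?:t_>|[b-df-hj-np-tv-z])"
fixed = re.compile(rf"({vowel})(?= {consonant} {vowel})")
fixed_result = fixed.sub(r"\1 .", text)
print(fixed_result)   # n u . t_> u . n e . q u
```

The non-lookahead version appears to work only because nothing after the class has to match: it stops after the `t` of `t_>`.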

ComplexPoint · November 15, 2021, 8:53am

Do you have a formal or informal CV model of the language at this stage ?

Desalegn · November 15, 2021, 9:04am

CV is the standard syllable structure.

But, the syllable can be CVC as well as CVCC (at the end of words).

List of vowels: (a|e|i|o|u|@|1)
List of Consonants: tS_>|J|p_>|t_>|Z|?|tS|ts_>|dZ|[b-df-hj-np-tv-z]

t_> @ r r a => t_> @ r . r a
l @ b b @ s @ => l @ b . b @ . s @
g @ r r @ f @ => g @ r . r @ . f @
f @ t @ n @ => f @ . t @ . n @
? a b a r @ r @ => ? a . b a . r @ . r @
? a z @ n @ => ? a . z @ . n @
? a l @ q q @ s @ => ? a . l @ q . q @ . s @
g e b s => g e b s

If there is an xx consonant (an identical consonant reduplicated), insert the [dot] between the two.
There should be no CCV sequence.

I was almost there with this macro, if not for the failure of the lookahead regex for complex consonants.
X-sampa.kmmacros (12.0 KB)

In the macro, the first step inserts [.] between two consonants (CC) that come before a V, to avoid the prohibited CCV sequence.
In the second step, I insert [.] after every vowel (when a consonant and a vowel follow). This is the step discussed in this thread. It generates the right syllable structure for the rest of the word.
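
The two steps described above can be sketched as two regex substitutions. This is a reconstruction in Python from the description only; the attached macro is not shown, so the names `PASS1`, `PASS2` and `syllabify_two_pass` are hypothetical, and the vowel and consonant inventories are taken from the lists earlier in this post.

```python
import re

VOWEL = r"[aeiou@1]"
# Multi-character symbols come first, longer before shorter (tS_> before tS),
# and "?" is escaped so that it is read as a literal character.
CONSONANT = r"(?:tS_>|ts_>|p_>|t_>|tS|dZ|J|Z|\?|[b-df-hj-np-tv-z])"

# Step 1: C C V  ->  C . C V   (avoids a CCV onset)
PASS1 = re.compile(rf"({CONSONANT}) (?={CONSONANT} {VOWEL})")
# Step 2: V C V  ->  V . C V
PASS2 = re.compile(rf"({VOWEL}) (?={CONSONANT} {VOWEL})")

def syllabify_two_pass(word):
    """Assumes symbols are separated by single spaces, as in the examples."""
    word = PASS1.sub(r"\1 . ", word)
    return PASS2.sub(r"\1 . ", word)

examples = {
    "t_> @ r r a": "t_> @ r . r a",
    "l @ b b @ s @": "l @ b . b @ . s @",
    "g @ r r @ f @": "g @ r . r @ . f @",
    "f @ t @ n @": "f @ . t @ . n @",
    "? a b a r @ r @": "? a . b a . r @ . r @",
    "? a z @ n @": "? a . z @ . n @",
    "? a l @ q q @ s @": "? a . l @ q . q @ . s @",
    "g e b s": "g e b s",
}
for source, expected in examples.items():
    print(source, "=>", syllabify_two_pass(source), "| expected:", expected)
```

Because the lookahead in each pattern does not consume the following consonant, neighbouring matches do not interfere with each other.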

ComplexPoint · November 15, 2021, 10:53am

That's very helpful – do you think that those rules form a more or less determinate syllabic grammar of the material, or do they sometimes allow for ambiguities which are, for example, lexically or positionally resolved ?

e.g. in English:

the -> CV (determiner, suffix of name Benthe)

but also:

the -> C (suffixes of absinthe, bathe etc)

ComplexPoint · November 15, 2021, 11:21am

Could you expand a bit on what that last term means ?

I presume that the preceding alternatives are all multi-glyph C.

Desalegn · November 15, 2021, 11:23am

There could be factors such as word category (noun, verb, names etc). That will be resolved manually when necessary because the rules will be very complicated if we include other factors. My aim is to capture the basic verb paradigms.

ComplexPoint · November 15, 2021, 11:24am

and this term is ?

Desalegn · November 15, 2021, 11:25am

This is supposed to be the list of regular English/Latin consonants: [b,c,d,f etc] written in the regex language: validation - Regex to match repeated consonant - Stack Overflow

ComplexPoint · November 15, 2021, 11:59am

And the input source segments the glyphs with spaces, except for inside multi-glyph consonants ?

So always x a y u b i m rather than xayubim ?

Desalegn · November 15, 2021, 12:00pm

Yes, absolutely.

ComplexPoint · November 15, 2021, 2:03pm

Mmm ... not easy. It would need some back-tracking ...

If we sequence the patterns to try as:

[cvccEnd, cvc, cv]

that yields a mixed bag of success and failure:

t_> @ r r a => t_> @ r . r a
l @ b b @ s @ => l @ b . b @ s
g @ r r @ f @ => g @ r . r @ f
f @ t @ n @ => f @ t
? a b a r @ r @ => ? a b
? a z @ n @ => ? a z
? a l @ q q @ s @ => ? a l
g e b s => g e b s

and if we simply reorder the sequence of testing to:

[cvccEnd, cv, cvc]

then the result is different, but still patchy:

t_> @ r r a => t_> @
l @ b b @ s @ => l @
g @ r r @ f @ => g @
f @ t @ n @ => f @ . t @ . n @
? a b a r @ r @ => ? a . b a . r @ . r @
? a z @ n @ => ? a . z @ . n @
? a l @ q q @ s @ => ? a . l @
g e b s => g e b s

I'll try to give it some thought this evening.
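
To illustrate the difference between trying the patterns in a fixed order and adding back-tracking, here is a minimal Python sketch. It is not the code that produced the results above. It assumes that every symbol which is not a listed vowel is a consonant, and it writes the word-final CVCC pattern as the hypothetical shape string `"CVCC$"`.

```python
VOWELS = {"a", "e", "i", "o", "u", "@", "1"}

def skeleton(tokens):
    # Assumption: any token that is not a listed vowel counts as a consonant.
    return "".join("V" if t in VOWELS else "C" for t in tokens)

def fit(shape, skel, pos):
    """Length of `shape` if it matches `skel` at `pos`, else 0.
    A trailing "$" means the shape must also reach the end of the word."""
    body = shape.rstrip("$")
    if not skel.startswith(body, pos):
        return 0
    if shape.endswith("$") and pos + len(body) != len(skel):
        return 0
    return len(body)

def greedy(skel, shapes):
    """Take the first shape that fits at each step; stop when nothing fits."""
    pos, lengths = 0, []
    while pos < len(skel):
        n = next((fit(s, skel, pos) for s in shapes if fit(s, skel, pos)), 0)
        if n == 0:
            break
        lengths.append(n)
        pos += n
    return lengths

def backtracking(skel, shapes, pos=0):
    """Like greedy, but undo a choice when the rest of the word cannot be parsed."""
    if pos == len(skel):
        return []
    for shape in shapes:
        n = fit(shape, skel, pos)
        if n:
            rest = backtracking(skel, shapes, pos + n)
            if rest is not None:
                return [n] + rest
    return None  # no full parse from this position

def render(word, parser, shapes):
    tokens = word.split()
    lengths = parser(skeleton(tokens), shapes)
    if lengths is None:
        return None
    syllables, pos = [], 0
    for n in lengths:
        syllables.append(" ".join(tokens[pos:pos + n]))
        pos += n
    return " . ".join(syllables)

SHAPES = ["CVCC$", "CVC", "CV"]
for word in ["l @ b b @ s @", "f @ t @ n @", "g e b s"]:
    print(word, "| greedy:", render(word, greedy, SHAPES),
          "| backtracking:", render(word, backtracking, SHAPES))
```

With the order `[CVCC$, CVC, CV]`, the greedy version stops early on `l @ b b @ s @` (giving `l @ b . b @ s`), while the back-tracking version gives up the second CVC, takes CV instead, and reaches `l @ b . b @ . s @`.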

Desalegn · November 15, 2021, 2:06pm

Thank you. But, if you don't have time, I have found some means of improving the solution by @stevelw.

This one ([(a|e|i|o|u|@|1)]) (?=([b-df-hj-np-tv-z]|(tS_>|J|p_>|t_>|Z|?|tS|ts_>|dZ)) [(a|e|i|o|u|@|1)]) seems to capture the complex consonants as well.

t_> @ r . r a
l @ b . b @ . s @
g @ r . r @ . f @
f @ . t @ . n @
? a . b a . r @ . r @
? a . z @ . n @
? a . l @ q . q @ . s @
g e b s

ComplexPoint · November 15, 2021, 2:06pm

Perhaps the trick is to add a tokenizing phase.
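
A tokenizing phase could look roughly like the Python sketch below. Since the input already separates symbols with spaces, tokenizing is just a split, after which each token is classified as C or V, and multi-character consonants stop being a special case. The boundary rule used here (put a dot before any non-initial consonant that is directly followed by a vowel) is one reading of the examples given earlier in the thread, not a confirmed grammar of the language, and the function names are hypothetical.

```python
VOWELS = {"a", "e", "i", "o", "u", "@", "1"}
# Complex consonants from the thread, plus the letters of [b-df-hj-np-tv-z].
CONSONANTS = {"tS_>", "J", "p_>", "t_>", "Z", "?", "tS", "ts_>", "dZ"} | set(
    "bcdfghjklmnpqrstvwxyz"
)

def classify(token):
    if token in VOWELS:
        return "V"
    if token in CONSONANTS:
        return "C"
    raise ValueError(f"unknown symbol: {token!r}")

def syllabify(word):
    tokens = word.split()              # tokenizing phase
    kinds = [classify(t) for t in tokens]
    out = []
    for i, token in enumerate(tokens):
        # Assumed rule: a non-initial consonant directly before a vowel
        # opens a new syllable.
        opens_syllable = (
            i > 0
            and kinds[i] == "C"
            and i + 1 < len(tokens)
            and kinds[i + 1] == "V"
        )
        if opens_syllable:
            out.append(".")
        out.append(token)
    return " ".join(out)

print(syllabify("? a l @ q q @ s @"))  # ? a . l @ q . q @ . s @
print(syllabify("g e b s"))            # g e b s
```

Classifying tokens up front also reports any symbol missing from the inventories instead of silently skipping it.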

ComplexPoint · November 15, 2021, 2:09pm

One thing that jumps to the eye there is that you may be getting a glitch with your ? consonant.

As that has a special meaning in regular expressions (it is a quantifier that makes the preceding item optional), you need to escape it with a preceding backslash \? to treat it as a literal character.

(and if you are entering the backslash itself in a string context where that also has a special use, then you may need to double it: \\?)
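
As a small illustration in Python (whose `re` module is not the same engine as Keyboard Maestro's, so error behaviour there may differ), an unescaped `?` inside the alternation is rejected, while escaping each symbol before joining them avoids the problem. The `symbols` list is copied from the consonant list earlier in the thread.

```python
import re

symbols = ["tS_>", "J", "p_>", "t_>", "Z", "?", "tS", "ts_>", "dZ"]

# Unescaped, "?" is a quantifier with nothing in front of it to repeat.
try:
    re.compile("Z|?|tS")
    unescaped_ok = True
except re.error:
    unescaped_ok = False
print(unescaped_ok)  # False

# re.escape adds the backslash where needed; sorting longest-first keeps
# tS_> ahead of its prefix tS in the alternation.
alternation = "|".join(re.escape(s) for s in sorted(symbols, key=len, reverse=True))
consonant = re.compile(rf"(?:{alternation}|[b-df-hj-np-tv-z])")

print(bool(consonant.fullmatch("?")))     # True
print(consonant.match("tS_> a").group())  # tS_>
```

Using a raw string (`r"..."` or `rf"..."`) in Python means the backslash does not have to be doubled.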

Desalegn · November 15, 2021, 2:11pm

Yes, I did that. I think it is the markdown in the forum that is deleting it.

([(a|e|i|o|u|@|1)]) (?=([b-df-hj-np-tv-z]|(tS_>|J|p_>|t_>|Z|\?|tS|ts_>|dZ)) [(a|e|i|o|u|@|1)])
