Find and Replace Vowel Sequences

Dear @Sleepy: I am achieving almost the same result actually.
Do you know what I get when I run the macro on antiestablishment?

This is the result: **a n . t i e s . t a b . l i s h . m e n t**

Given that the language I am working on has no double vowels, the result is almost perfect.

If:

  1. you bump into a noticeable number of edge cases where that finite-state (regular expression) model is hitting snags, and
  2. you have a formal CV model which you can show us,

then we can probably translate that CV model into a parser for you, in a scripting language, for a KM Execute JavaScript or Execute Shell Script action.

I actually have a number of interesting problems that are mind-bending to me but that you guys can solve. I will think about it, frame the issue carefully (so as not to mess up like in this post), and post it here.

You guys are amazing. Thank you so much.

I am back guys. I am having a slight issue with the lookaround regex. Apparently, it doesn't support complex characters.

In the search field, I said I have consonants, which are normally simple characters such as b and d. But X-SAMPA contains complex consonants such as tS and t_>.

In the regular regex, putting these complex characters along with the simple consonants works fine. But the lookaround (lookahead) is not working.

Find: ([(a|e|i|o|u)]) ([b-df-hj-np-tv-z|t_>])
finds t_>

But, ([(a|e|i|o|u)]) (?=[b-df-hj-np-tv-z|t_>])
doesn't find/recognize the complex consonant t_>

Here is the actual data:

Find:

([(a|e|i|o|u)])(?= [b-df-hj-np-tv-z|t_>] [a|e|i|o|u)])

Replace command (@stevelw's solution):
$1 .

Input:
n u t_> u n e q u

It is supposed to produce: n u . t_> u . n e . q u. But, actually, it is producing n u t_> u . n e . q u.
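A possible explanation, sketched in Python rather than in the Keyboard Maestro action itself: inside square brackets, t_> is not one unit. The class [b-df-hj-np-tv-z|t_>] matches a single character that is a consonant letter, |, t, _ or >, so the lookahead " C V" sees t, then expects a space and finds _ instead. Listing t_> as an alternative in a group avoids that. The vowel class is simplified to [aeiou] here, and Python writes the replacement as \1 where the macro uses $1; both are assumptions of this sketch, not part of the original macro.

```python
import re

text = "n u t_> u n e q u"

# Original approach: t_> sits inside a character class, so the class can only
# match ONE of the characters t, _ or >, never the three-character unit.
char_class = re.compile(r"([aeiou])(?= [b-df-hj-np-tv-z|t_>] [aeiou])")

# Alternative: list multi-character consonants as alternatives in a group,
# and keep the character class for the single-letter consonants only.
alternation = re.compile(r"([aeiou])(?= (?:t_>|[b-df-hj-np-tv-z]) [aeiou])")

with_class = char_class.sub(r"\1 .", text)
with_alternation = alternation.sub(r"\1 .", text)

print(with_class)        # n u t_> u . n e . q u
print(with_alternation)  # n u . t_> u . n e . q u
```

The plain (non-lookahead) search appears to "find" t_> only because it matches the t and stops there.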

Do you have a formal or informal CV model of the language at this stage ?

CV is the standard syllable structure.

But, the syllable can be CVC as well as CVCC (at the end of words).

  • List of vowels: (a|e|i|o|u|@|1)
  • List of Consonants: tS_>|J|p_>|t_>|Z|?|tS|ts_>|dZ|[b-df-hj-np-tv-z]

t_> @ r r a => t_> @ r . r a
l @ b b @ s @ => l @ b . b @ . s @
g @ r r @ f @ => g @ r . r @ . f @
f @ t @ n @ => f @ . t @ . n @
? a b a r @ r @ => ? a . b a . r @ . r @
? a z @ n @ => ? a . z @ . n @
? a l @ q q @ s @ => ? a . l @ q . q @ . s @
g e b s => g e b s

  1. If there is an xx consonant (an identical consonant reduplicated), insert the [dot] in between.
  2. There should not be a CCV sequence.

I was almost there with this macro, if not for the failure of the lookahead regex for complex consonants.
X-sampa.kmmacros (12.0 KB)

In the macro, the first step inserts [.] between two consonants when they come before a vowel (CC before a V), to avoid the prohibited CCV sequence.
In the second step, I insert [.] after every vowel that is followed by C V. This is the one discussed in this forum. That generates the right syllable structure for the rest of the word.
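The two steps can be sketched outside Keyboard Maestro as two regex passes. This is a minimal Python sketch, not the contents of the attached macro: the names V, C, STEP1, STEP2 and syllabify are made up for illustration, and the consonant list is reordered so that longer spellings come first.

```python
import re

# Vowels and consonants as listed in the thread (X-SAMPA). Multi-character
# consonants come first so the alternation tries the longest spelling first.
V = r"[aeiou@1]"
C = r"(?:tS_>|ts_>|p_>|t_>|tS|dZ|J|Z|\?|[b-df-hj-np-tv-z])"

# Step 1: a consonant followed by " C V" closes its syllable (avoids CCV).
STEP1 = re.compile(rf"({C})(?= {C} {V})")
# Step 2: a vowel followed by " C V" closes its syllable.
STEP2 = re.compile(rf"({V})(?= {C} {V})")

def syllabify(word: str) -> str:
    """Insert ' .' after syllable-final segments of a space-separated word."""
    word = STEP1.sub(r"\1 .", word)
    return STEP2.sub(r"\1 .", word)

for w in ["t_> @ r r a", "l @ b b @ s @", "g e b s"]:
    print(syllabify(w))
# t_> @ r . r a
# l @ b . b @ . s @
# g e b s
```

Running step 1 first matters: once the dot sits between the two consonants, the vowel before them is no longer followed by " C V", so step 2 leaves it alone.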

That's very helpful – do you think that those rules form a more or less determinate syllabic grammar of the material, or do they sometimes allow for ambiguities which are, for example, lexically or positionally resolved ?

e.g. in English:

the -> CV (determiner, suffix of name Benthe)

but also:

the -> C (suffixes of absinthe, bathe etc)

Could you expand a bit on what that last term in the consonant list, [b-df-hj-np-tv-z], means ?

I presume that the preceding alternatives are all multi-glyph C.

There could be factors such as word category (noun, verb, names etc). That will be resolved manually when necessary because the rules will be very complicated if we include other factors. My aim is to capture the basic verb paradigms.

and this term, [b-df-hj-np-tv-z], is ?

This is supposed to be the list of regular English/Latin consonants (b, c, d, f, etc.) written as a regex character class; see "validation - Regex to match repeated consonant" on Stack Overflow.

And the input source segments the glyphs with spaces, except for inside multi-glyph consonants ?

So always x a y u b i m rather than xayubim ?

Yes, absolutely.

Mmm ... not easy :) It would need some back-tracking ...

If we sequence the patterns to try as:

[cvccEnd, cvc, cv]

that yields a mixed bag of success and failure:

t_> @ r r a => t_> @ r . r a
l @ b b @ s @ => l @ b . b @ s
g @ r r @ f @ => g @ r . r @ f
f @ t @ n @ => f @ t
? a b a r @ r @ => ? a b
? a z @ n @ => ? a z
? a l @ q q @ s @ => ? a l
g e b s => g e b s 

and if we simply reorder the sequence of testing to:

[cvccEnd, cv, cvc]

then the result is different, but still patchy:

t_> @ r r a => t_> @
l @ b b @ s @ => l @
g @ r r @ f @ => g @
f @ t @ n @ => f @ . t @ . n @
? a b a r @ r @ => ? a . b a . r @ . r @
? a z @ n @ => ? a . z @ . n @
? a l @ q q @ s @ => ? a . l @
g e b s => g e b s 
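For comparison, here is a rough Python sketch of what trying those patterns with back-tracking could look like: each pattern is tried in order, and a choice is abandoned when the rest of the word cannot be parsed. The function names are invented for this sketch, and it assumes that every space-separated token that is not a vowel is a consonant.

```python
VOWELS = {"a", "e", "i", "o", "u", "@", "1"}

def kind(token: str) -> str:
    """Classify a space-separated token as 'V' or 'C' (anything not a vowel)."""
    return "V" if token in VOWELS else "C"

def parse(tokens, start=0):
    """Return a list of syllables covering tokens[start:], or None if none fits."""
    if start == len(tokens):
        return []
    shape = "".join(kind(t) for t in tokens[start:])
    # Patterns in the order [cvccEnd, cvc, cv]; CVCC is only allowed word-finally.
    for pattern, final_only in (("CVCC", True), ("CVC", False), ("CV", False)):
        end = start + len(pattern)
        if not shape.startswith(pattern):
            continue
        if final_only and end != len(tokens):
            continue
        rest = parse(tokens, end)
        if rest is not None:          # otherwise backtrack to the next pattern
            return [tokens[start:end]] + rest
    return None

def syllabify(word: str) -> str:
    syllables = parse(word.split())
    if syllables is None:
        return word                   # no full parse: leave the word untouched
    return " . ".join(" ".join(s) for s in syllables)

print(syllabify("? a l @ q q @ s @"))  # ? a . l @ q . q @ . s @
print(syllabify("f @ t @ n @"))        # f @ . t @ . n @
```

With back-tracking, a greedy CVC guess such as f @ t is dropped as soon as the remainder (@ n @) turns out to be unparseable, and CV is tried instead.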

I'll try to give it some thought this evening.

Thank you. But, if you don't have time, I have found some means of improving the solution by @stevelw.

This one seems to capture the complex consonants as well:

([(a|e|i|o|u|@|1)]) (?=([b-df-hj-np-tv-z]|(tS_>|J|p_>|t_>|Z|?|tS|ts_>|dZ)) [(a|e|i|o|u|@|1)])

t_> @ r . r a
l @ b . b @ . s @
g @ r . r @ . f @
f @ . t @ . n @
? a . b a . r @ . r @
? a . z @ . n @
? a . l @ q . q @ . s @
g e b s

Perhaps the trick is to add a tokenizing phase.
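A tokenizing phase might look something like the following Python sketch. Since the input is already segmented with spaces, it only has to split on spaces and tag each segment as V or C, using the vowel and consonant lists given earlier in the thread. The function names tokenize and skeleton are invented here.

```python
import re

VOWEL = re.compile(r"[aeiou@1]")
CONSONANT = re.compile(r"tS_>|ts_>|p_>|t_>|tS|dZ|J|Z|\?|[b-df-hj-np-tv-z]")

def tokenize(word):
    """Split on spaces and tag each token 'V' or 'C'; reject anything else."""
    tagged = []
    for token in word.split():
        if VOWEL.fullmatch(token):
            tagged.append((token, "V"))
        elif CONSONANT.fullmatch(token):
            tagged.append((token, "C"))
        else:
            raise ValueError(f"unknown segment: {token!r}")
    return tagged

def skeleton(word):
    """Reduce a word to its CV skeleton, one letter per segment."""
    return "".join(tag for _, tag in tokenize(word))

print(skeleton("t_> @ r r a"))        # CVCCV
print(skeleton("? a l @ q q @ s @"))  # CVCVCCVCV
```

Once every segment is a single C or V, the syllable rules can be written against the skeleton, and multi-character consonants stop being a special case.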

One thing that jumps to the eye there is that you may be getting a glitch with your ? consonant.

As that has a meaning in regular expressions (it is a quantifier that makes the preceding item optional), you need to escape it with a preceding backslash \? to treat it as a literal character.

(and if you are entering the backslash itself in a string context where that also has a special use, then you may need to double it: \\?)
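To illustrate the escaping point, here is a small Python sketch. It shows Python's re module only; the engine behind the Keyboard Maestro action may report the unescaped ? differently. re.escape is a standard-library helper that adds the backslashes, so the consonant list can be kept as plain strings.

```python
import re

consonants = ["tS_>", "J", "p_>", "t_>", "Z", "?", "tS", "ts_>", "dZ"]

# Unescaped, the lone ? has nothing to repeat; Python's re rejects the pattern.
try:
    re.compile("|".join(consonants))
    unescaped_ok = True
except re.error:
    unescaped_ok = False

# re.escape backslash-escapes metacharacters such as ?, so each entry is literal.
escaped = "|".join(re.escape(c) for c in consonants)
pattern = re.compile(rf"(?:{escaped}|[b-df-hj-np-tv-z])")

print(unescaped_ok)                  # False
print(bool(pattern.fullmatch("?")))  # True
print(re.escape("?"))                # \?
```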

Yes, I did that. I think it is the markdown in the forum that is deleting it.

([(a|e|i|o|u|@|1)]) (?=([b-df-hj-np-tv-z]|(tS_>|J|p_>|t_>|Z|\?|tS|ts_>|dZ)) [(a|e|i|o|u|@|1)])