How to replace word groups when parts of them are overlapping?

I have these replacements:

protective packaging=beschermende verpakking
packaging materials=verpakkingsmaterialen
protective=beschermende

And this source text:

protective packaging materials

When I run the three replacement actions, I get this:

beschermende verpakking materials

Note that 'materials' hasn't been replaced.

The result that I want is this:

beschermende verpakkingsmaterialen

How should I approach this?

Sort by lexeme count and apply shorter transformations later?

Or perhaps, more generally, identify potential clashes and adjust the transform sequence accordingly?

Sounds good. Now I only have to figure out the implementation ;).

On a serious note: I guess that this task is quite complex, as I've never seen a solution.

It's not easy to simplify: this is a classic problem of grammatical parsing.

The choice between:

(protective packaging) materials

and

protective (packaging materials)

looks more or less indeterminate.

(An LLM may turn out to have a usable probability weighting baked in, but that sounds like taking a hammer to an eggshell, quite apart from the consumption of wattage.)

The easiest way is to start your S'n'R routine with:

protective packaging materials=beschermende verpakkingsmaterialen

And yes, you'll have to explicitly include every "overlapping" term you want to handle.
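
A minimal Python sketch of that idea, in case it helps: the glossary below is the one from the question plus the explicit three-word entry, and `replace_terms` is just an illustrative name, not part of any existing macro. It tries longer source terms first and makes a single pass over the text, so output that has already been translated is never scanned again.

```python
import re

def replace_terms(text, glossary):
    """Replace glossary terms in a single pass, preferring longer source terms."""
    if not glossary:
        return text
    # Longest source term first, so a full phrase wins over its parts.
    terms = sorted(glossary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(term) for term in terms) + r")\b"
    )
    # One pass: text that has already been replaced is not matched again.
    return pattern.sub(lambda match: glossary[match.group(0)], text)

# Glossary from the question, plus the explicit "overlapping" entry.
glossary = {
    "protective packaging materials": "beschermende verpakkingsmaterialen",
    "protective packaging": "beschermende verpakking",
    "packaging materials": "verpakkingsmaterialen",
    "protective": "beschermende",
}

print(replace_terms("protective packaging materials", glossary))
```

Without the three-word entry, the same routine still stops at 'beschermende verpakking materials', which is why the overlapping terms have to be listed explicitly.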

I was hoping that @ComplexPoint would come up with something better than I could but, as he says, this is a classic parsing problem which even literate humans get wrong on a regular basis.

Can't give it thought today, but I wonder whether it's possible to define a function over the set of replacements, which identifies all possible aggregations like that.

(so that they could be manually pre-edited and, at run time, applied before the smaller rewrites, that is, those involving fewer input words)
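
One way such a function might look, as a rough Python sketch: it treats two source terms as clashing when the last word(s) of one equal the first word(s) of the other, and returns the combined phrases that are not yet in the glossary. The name `overlapping_phrases` and the whole-word overlap rule are assumptions for illustration. The target side would still have to be written by hand.

```python
def overlapping_phrases(source_terms):
    """Return combined phrases for term pairs that overlap on whole words."""
    known = set(source_terms)
    split_terms = [term.split() for term in source_terms]
    candidates = set()
    for first in split_terms:
        for second in split_terms:
            if first == second:
                continue
            # Try every overlap size k: last k words of `first`
            # must equal the first k words of `second`.
            for k in range(1, min(len(first), len(second))):
                if first[-k:] == second[:k]:
                    phrase = " ".join(first + second[k:])
                    if phrase not in known:
                        candidates.add(phrase)
    return sorted(candidates)

terms = ["protective packaging", "packaging materials", "protective"]
print(overlapping_phrases(terms))
```

For the three terms from the question, this reports 'protective packaging materials' as the one phrase that needs a manual entry. Note that comparing every pair is quadratic, so on a very large glossary it would probably need an index on first and last words.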

Normally, term recognition and replacement work from the source text via the glossary, scanning left to right (LTR).

In this case (especially if the source language is English instead of German), problems can already arise if a glossary entry starts with 'the': this will break the recognition of the other parts of multi-word terms:

For instance, the source terms:

the packing 
machine
packing

will never lead to the recognition of 'the packing machine', because of the 'the'.

I have a macro (Replace Strings in Input Box via Tab-Del Glossary) that works the other way around: from a glossary to the source text. Each line of the glossary is split into a source term and a target term. Then the source term is replaced by the target term.

Of course, this glossary could be sorted by decreasing length of the source term, but with 200,000 lines, the replacement actions would take a lot of time.

Replacement speed can be increased by collecting all lines from the glossary that contain any part of the source segment (the line/fragment/piece of text) that is currently being translated, by using:

grep -E "protective|packaging|materials" glossary.txt > glossary_extract.txt

Now I need a command to sort glossaries with two tab-delimited columns on decreasing length of the first column items. I have googled for it, but to no avail ...

For the time being, I used Excel. I've uploaded two examples: extract.txt (extracted matches from the glossary) and sorted.txt (sorted on length of the first column).

examples.zip (5.5 KB)

I have run a test and with 168 lines of term pairs in the temporary sorted.txt file, the replacements are almost instantaneous.

I would be grateful if someone could help me with a command line command (or other solution) to sort a tab-delimited glossary with two columns by decreasing length of the items in the first column.
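
For what it's worth, a small Python sketch of such a sort, assuming one 'source<TAB>target' pair per line and UTF-8 files. The function names and the file paths passed to `sort_glossary_file` are placeholders, not anything from the existing macro.

```python
def sort_by_source_length(lines):
    """Sort 'source<TAB>target' lines by decreasing length of the source term."""
    rows = [line.rstrip("\n") for line in lines if line.strip()]
    # The key is the length of the first tab-delimited column.
    # Python's sort is stable, so equally long terms keep their original order.
    return sorted(rows, key=lambda row: len(row.split("\t", 1)[0]), reverse=True)

def sort_glossary_file(in_path, out_path, encoding="utf-8"):
    """Read a glossary file, sort it, and write the result (encoding is assumed)."""
    with open(in_path, encoding=encoding) as source:
        rows = sort_by_source_length(source)
    with open(out_path, "w", encoding=encoding) as target:
        target.write("\n".join(rows) + "\n")

sample = [
    "protective\tbeschermende\n",
    "protective packaging\tbeschermende verpakking\n",
    "packaging materials\tverpakkingsmaterialen\n",
]
for row in sort_by_source_length(sample):
    print(row)
```

With the grep extract from above, a call such as `sort_glossary_file("glossary_extract.txt", "sorted.txt")` would produce the sorted file. Length here is counted in characters. Counting words instead would only need `len(row.split("\t", 1)[0].split())` as the key.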

In the meantime, some creative and kind persons here have provided some solutions. Thank you very much!

Now I can proceed with trying to solve this task ...