How to replace word groups when parts of them are overlapping?

ALYB · June 26, 2024, 12:39pm

I have these replacements:

protective packaging=beschermende verpakking
packaging materials=verpakkingsmaterialen
protective=beschermende

And this source text:

protective packaging materials

When I run the three replacement actions, I get this:

beschermende verpakking materials

Note that 'materials' hasn't been replaced.

The result that I want, is this:

beschermende verpakkingsmaterialen

How should I approach this?

ComplexPoint · June 26, 2024, 1:02pm

Sort by lexeme count and apply shorter transformations later?

Or perhaps, more generally, identify potential clashes and adjust the transform sequence accordingly?

ALYB · June 27, 2024, 6:00am

Sounds good. Now I only have to figure out the implementation ;).

On a serious note: I guess that this task is quite complex, as I've never seen a solution.

ComplexPoint · June 27, 2024, 6:03am

Not easy to reduce: this is a classical problem of grammatical parsing.

The choice between:

(protective packaging) materials

and

protective (packaging materials)

looks more or less indeterminate.

(An LLM may turn out to have cooked a probability weighting, but that sounds like hammer and eggshell, quite apart from the consumption of wattage)

Nige_S · June 27, 2024, 11:10am

The easiest way is to start your S'n'R routine with:

protective packaging materials=beschermende verpakkingsmaterialen

And yes, you'll have to explicitly include every "overlapping" term you want to handle.
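A minimal Python sketch of this ordering idea, assuming the replacements are held in a dictionary and applied in a single pass with the longest source terms tried first. The function name `replace_longest_first` is hypothetical, and the `\b` word boundaries assume every source term starts and ends with a letter or digit. It shows that longest-first ordering alone still leaves 'materials' untranslated, while adding the explicit three-word entry gives the wanted result.

```python
import re

def replace_longest_first(text, rules):
    """Replace every source term in one pass, preferring longer terms."""
    # Longest source terms first, so the regex alternation tries them first.
    terms = sorted(rules, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b"
    )
    # A single pass means already replaced text is never scanned again.
    return pattern.sub(lambda m: rules[m.group(0)], text)

rules = {
    "protective packaging": "beschermende verpakking",
    "packaging materials": "verpakkingsmaterialen",
    "protective": "beschermende",
}
source = "protective packaging materials"

# Without the explicit entry: beschermende verpakking materials
print(replace_longest_first(source, rules))

# With the explicit entry: beschermende verpakkingsmaterialen
rules["protective packaging materials"] = "beschermende verpakkingsmaterialen"
print(replace_longest_first(source, rules))
```

Because the text is scanned only once, a target term can never be picked up again as a source term by a later rule, which is a risk when separate replacement actions are run one after another.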

I was hoping that @ComplexPoint would come up with something better than I could but, as he says, this is a classic parsing problem, which even literate humans get wrong on a regular basis.

ComplexPoint · June 27, 2024, 11:19am

Can't give it thought today, but I wonder whether it's possible to define a function over the set of replacements, which identifies all possible aggregations like that.

(so that they could be manually pre-edited, and at run-time applied before the smaller rewrites, i.e. those involving fewer input words)
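One possible shape for such a function, sketched in Python under the assumption that an "aggregation" means two source terms where the last word(s) of one are the first word(s) of the other. The name `overlapping_aggregations` is hypothetical. It only proposes merged source terms; the target side would still have to be written by hand.

```python
def overlapping_aggregations(source_terms):
    """Return merged terms where the end of one term is the start of another."""
    split_terms = [term.split() for term in source_terms]
    merged = set()
    for left in split_terms:
        for right in split_terms:
            if left == right:
                continue
            # Only partial overlaps: k is smaller than both word counts,
            # so the merged term is longer than either input term.
            for k in range(1, min(len(left), len(right))):
                if left[-k:] == right[:k]:
                    merged.add(" ".join(left + right[k:]))
    # Drop candidates that are already in the glossary.
    return sorted(merged - set(source_terms))

terms = ["protective packaging", "packaging materials", "protective"]
print(overlapping_aggregations(terms))
```

For the three terms in this thread it proposes 'protective packaging materials'. The pairwise comparison is quadratic in the number of terms, so it is better suited to a small per-segment extract than to a glossary of 200,000 lines.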

ALYB · June 27, 2024, 11:30am

Normally, term recognition and replacement work from the source text via the glossary, from left to right (LTR).

In this case (especially if the source language is English instead of German), problems can already arise if a glossary entry starts with 'the': this will break the recognition of the other parts of multi-word terms:

For instance, the source terms:

the packing 
machine
packing

will never lead to the recognition of 'the packing machine', because of the 'the'.

I have a macro ("Replace Strings in Input Box via Tab-Del Glossary") that works the other way around: from a glossary to the source text. Each line of the glossary is split into a source term and a target term. Then the source term is replaced by the target term.

Of course, this glossary could be sorted by decreasing length of the source term, but with 200,000 lines, the replacement actions would take a lot of time.

Replacement speed can be increased by collecting all lines from the glossary that contain any part of the source segment (the line/fragment/piece of text) that is currently being translated, by using:

grep -E "protective|packaging|materials" glossary.txt > glossary_extract.txt

Now I need a command to sort glossaries with two tab-delimited columns on decreasing length of the first column items. I have googled for it, but to no avail ...

For the time being, I used Excel. I've uploaded two examples: extract.txt (extracted matches from the glossary) and sorted.txt (sorted on length of the first column).

examples.zip (5.5 KB)

I have run a test and with 168 lines of term pairs in the temporary sorted.txt file, the replacements are almost instantaneous.

I would be grateful if someone could help me with a command line command (or other solution) to sort a tab-delimited glossary with two columns by decreasing length of the items in the first column.
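A minimal sketch of such a sort in Python rather than as a shell one-liner, assuming each line has the form source term, tab, target term, and that length is counted in characters. The function name `sort_glossary_by_source_length` is hypothetical; the file names in the comment are the ones mentioned above.

```python
def sort_glossary_by_source_length(lines):
    """Sort 'source<TAB>target' lines by decreasing length of the source term."""
    # Strip line endings and skip blank lines.
    entries = [line.rstrip("\n") for line in lines if line.strip()]
    # sorted() is stable, so equally long terms keep their original order.
    return sorted(entries, key=lambda e: len(e.split("\t", 1)[0]), reverse=True)

sample = [
    "protective\tbeschermende\n",
    "protective packaging\tbeschermende verpakking\n",
    "packaging materials\tverpakkingsmaterialen\n",
]
for entry in sort_glossary_by_source_length(sample):
    print(entry)

# To process a file instead of the in-memory sample, something like:
# with open("glossary_extract.txt", encoding="utf-8") as src:
#     ordered = sort_glossary_by_source_length(src)
# with open("sorted.txt", "w", encoding="utf-8") as dst:
#     dst.write("\n".join(ordered) + "\n")
```

Sorting on word count instead of character count would only need a different key, for example `len(e.split("\t", 1)[0].split())`.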

ALYB · July 17, 2024, 8:25am

In the meantime, some creative and kind persons here have provided some solutions. Thank you very much!

Now I can proceed with trying to solve this task ...
