Normally, term recognition and replacement work from the source text via the glossary, from left to right.
In this case (especially if the source language is English rather than German), problems can already arise when a glossary entry starts with 'the', because this breaks the recognition of the remaining parts of multi-word terms.
For instance, the source terms:
the packing
machine
packing
will never lead to the recognition of 'the packing machine', because of the leading 'the'.
I have a [macro] (Replace Strings in Input Box via Tab-Del Glossary) that works the other way around: from a glossary to the source text. Each line of the glossary is split into a source term and a target term. Then the source term is replaced by the target term.
Of course, this glossary could be sorted by decreasing length of the source term, but with 200,000 lines, the replacement actions would take a lot of time.
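To illustrate why the order matters, here is a minimal sketch (not the actual macro, just a hypothetical shell equivalent of its replace loop): if 'packing' were processed before 'the packing machine', the longer term could never match, which is exactly what length-descending sorting prevents.

```shell
# Hypothetical example segment; the real macro works on the current input box.
SEGMENT="the packing machine runs"

# Read each tab-delimited glossary line (source<TAB>target) and replace
# every occurrence of the source term with the target term.
while IFS=$'\t' read -r src tgt; do
  SEGMENT=${SEGMENT//"$src"/$tgt}
done < sorted.txt

echo "$SEGMENT"
```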
Replacement speed can be increased by first collecting only those glossary lines that contain any word of the source segment currently being translated, for example with:
grep -E "protective|packaging|materials" glossary.txt > glossary_extract.txt
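The alternation pattern for that grep call can also be built automatically from the segment itself. A minimal sketch, assuming the current segment is available in a variable (the variable name is hypothetical):

```shell
# Hypothetical: $SEGMENT holds the segment currently being translated.
SEGMENT="protective packaging materials"

# Turn the segment's words into an alternation pattern:
# "protective|packaging|materials"
PATTERN=$(echo "$SEGMENT" | tr ' ' '|')

# Keep only the glossary lines that match any of those words.
grep -E "$PATTERN" glossary.txt > glossary_extract.txt
```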
Now I need a command to sort a glossary with two tab-delimited columns by decreasing length of the first-column items. I have googled for it, but to no avail ...
For the time being, I have used Excel. I've uploaded two examples: extract.txt (the matches extracted from the glossary) and sorted.txt (the same lines sorted by length of the first column).
examples.zip (5.5 KB)
I have run a test and with 168 lines of term pairs in the temporary sorted.txt file, the replacements are almost instantaneous.
I would be grateful if someone could help me with a command line command (or other solution) to sort a tab-delimited glossary with two columns by decreasing length of the items in the first column.
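One approach I came across but haven't fully verified is the classic decorate-sort-undecorate pattern: prepend the length of the first field, sort numerically in reverse, then strip the helper column again. A sketch:

```shell
# Prepend the length of the first tab-delimited field to each line,
# sort numerically in descending order on that helper field,
# then cut the helper field away again (cut's default delimiter is tab).
awk -F'\t' -v OFS='\t' '{print length($1), $0}' glossary_extract.txt \
  | sort -rn -k1,1 \
  | cut -f2- > sorted.txt
```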