I'm currently building a macro to speed up my legal work. As part of this, I would like to identify all sentences in a text (and subsequently run actions on them).
Right now I am using this regex
[^.!?]*[.!?]
to identify the sentences, but I get some false positives with German legal abbreviations. For instance, the dot after "Abs" and "Nr" should be excluded, since it does not end a sentence. These strings ("Abs" and "Nr") are always the same and always appear in this order, but "Nr" can be missing, e.g. "text § 50 Abs. 5 text".
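A quick way to see the false positive, sketched in Python (an assumption for illustration only; the macro language is not stated here, and the name naive_sentences is made up):

```python
import re

# The regex from above: any run of non-terminators followed by one terminator.
NAIVE = re.compile(r"[^.!?]*[.!?]")

def naive_sentences(text):
    """Split text with the naive regex and trim surrounding whitespace."""
    return [match.strip() for match in NAIVE.findall(text)]

# The dot after "Abs" is treated as a sentence end, so one sentence
# comes back as two fragments.
print(naive_sentences("text § 50 Abs. 5 text."))
```

The single sentence is cut right after "Abs.", which is exactly the behaviour that needs to be suppressed.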
After some more testing, I found another solution: insert a line break after each dot and exclude the dots after "Abs" and "Nr" using a negative lookbehind. This is the regex:
It's been some time since I've done extensive document parsing requiring sentence-level awareness with all the punctuation issues that come with the job.
One trick I've used is to make a pass at the text that changes problematic punctuation to something else.
A simple example:
etc.
Becomes
etc•
At this point I'll do my major parsing, and after that's complete I'll make another pass to change things back.
This sort of thing can require quite a number of passes and still be very, very fast.
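Roughly, the mask, parse, restore idea could look like the sketch below. It is written in Python purely for illustration. The abbreviation list, the function names, and the assumption that "•" never occurs in the real text are all mine, so adjust to taste.

```python
import re

# Assumed list; extend it with whatever abbreviations show up in your texts.
ABBREVIATIONS = ["Abs", "Nr", "etc"]
PLACEHOLDER = "\u2022"  # "•", assumed not to appear in the input text

# Matches a whole-word abbreviation followed by a dot.
_ABBREV_DOT = re.compile(
    r"\b(" + "|".join(re.escape(a) for a in ABBREVIATIONS) + r")\."
)

def mask_abbreviations(text):
    """Pass 1: swap the dot after each known abbreviation for the placeholder."""
    return _ABBREV_DOT.sub(lambda m: m.group(1) + PLACEHOLDER, text)

def split_sentences(text):
    """Pass 2: split with the simple regex. Pass 3: put the dots back."""
    masked = mask_abbreviations(text)
    parts = re.findall(r"[^.!?]*[.!?]", masked)
    return [p.strip().replace(PLACEHOLDER, ".") for p in parts]

print(split_sentences("Siehe § 50 Abs. 5 Nr. 2 BGB. Das gilt auch hier!"))
```

One known limitation of this sketch: an abbreviation that really does end a sentence (say, "etc." as the last word) gets masked too, so that sentence runs on into the next one.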
Rule of thumb: never make text parsing more complicated than it needs to be.
When I ignore that rule, I inevitably get bitten somewhere down the line.
Thank you very much for that valuable input. After fiddling around with regex a bit more, I got the idea to scan typical German texts for abbreviations and build an abbreviation list to be excluded from further text processing. Very handy!
This makes the individual regex rather easy to handle. Will post some details soon.
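Until the details are posted, here is a rough sketch of how the abbreviation-list idea might look in Python. Everything in it is an assumption for illustration, not the actual macro: the heuristic (a short word plus a dot, followed by a digit or lowercase letter, seen at least twice), the function names, and the min_count parameter are all hypothetical. Since Python's re module only accepts fixed-width lookbehinds, the sketch chains one negative lookbehind per abbreviation.

```python
import re
from collections import Counter

def candidate_abbreviations(text, min_count=2):
    """Heuristic scan: short word + '.' followed by a digit or lowercase
    letter is probably an abbreviation rather than a sentence end."""
    hits = re.findall(r"\b([A-Za-zÄÖÜäöüß]{1,5})\.(?=\s+[0-9a-zäöü])", text)
    counts = Counter(hits)
    return sorted(word for word, n in counts.items() if n >= min_count)

def sentence_end_pattern(abbreviations):
    """Build a regex matching . ! ? unless preceded by a listed abbreviation."""
    guards = "".join(r"(?<!\b" + re.escape(a) + ")" for a in abbreviations)
    return re.compile(guards + r"[.!?]")

def split_sentences(text, abbreviations):
    """Cut the text after every terminator the pattern accepts."""
    sentences, start = [], 0
    for match in sentence_end_pattern(abbreviations).finditer(text):
        sentences.append(text[start:match.end()].strip())
        start = match.end()
    return sentences

sample = "Nach § 50 Abs. 5 gilt das. Siehe auch § 3 Abs. 2 Nr. 1 und Nr. 4 dort. Ende!"
abbreviations = candidate_abbreviations(sample)
print(abbreviations)
print(split_sentences(sample, abbreviations))
```

The scan is only a starting point. In practice the candidate list would need a manual review, since the heuristic cannot tell an abbreviation from a short word that ends a sentence before a lowercase continuation.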