Regex: find sentences including specific abbreviations

dwarfy · August 13, 2018, 7:46am

Dear Keyboard Maestros,

I'm currently building a macro to speed up my legal work. As part of this, I would like to identify all sentences in a text (and subsequently run actions on them).

So far I have worked out this regex

[^.!?]*[.!?]

to identify the sentences, but I get some false positives with German legal abbreviations. For instance, the dot after "Abs" and "Nr" should not count as the end of a sentence. These strings ("Abs" and "Nr") are always spelled the same and always appear in this order, but "Nr" can be missing, e.g. "text § 50 Abs. 5 text".
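To make the false positives concrete, here is a minimal Python sketch of the pattern above applied to a made-up sample sentence. The function name and the sample text are only illustrative, and Python's `re` stands in for whatever regex engine the macro actually uses.

```python
import re

# The pattern from the post: any run of non-terminators, then one terminator.
NAIVE_SENTENCE = re.compile(r"[^.!?]*[.!?]")

def naive_sentences(text):
    """Return every match of the naive pattern, with surrounding spaces trimmed."""
    return [m.group(0).strip() for m in NAIVE_SENTENCE.finditer(text)]

# Made-up sample containing both abbreviations.
sample = "Siehe § 50 Abs. 5 Nr. 2 des Gesetzes. Das gilt auch hier."
parts = naive_sentences(sample)

for part in parts:
    print(repr(part))
# 'Siehe § 50 Abs.'
# '5 Nr.'
# '2 des Gesetzes.'
# 'Das gilt auch hier.'
```

The sample has two sentences, but the pattern reports four because it also stops at the dots after "Abs" and "Nr".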

The regex and an example can be found at:

Any idea how to solve this?

Thank you very much!

dwarfy · August 13, 2018, 10:56am

After some more testing, I found another solution: insert a line break after each dot and exclude the dots after "Abs" and "Nr" using a negative lookbehind. This is the regex:

(?<!Abs|Nr)\.
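A rough Python sketch of this line-break idea, in case anyone wants to try it outside the macro. One assumption to flag: Python's `re` module only accepts fixed-width lookbehinds, so the single lookbehind with two alternatives of different lengths is written here as two chained lookbehinds, which should match the same positions. The function name and sample text are made up.

```python
import re

# Python's `re` rejects (?<!Abs|Nr) because the alternatives differ in length,
# so the same condition is expressed as two separate negative lookbehinds.
SENTENCE_DOT = re.compile(r"(?<!Abs)(?<!Nr)\.")

def break_after_sentence_dots(text):
    """Insert a line break after every dot not preceded by 'Abs' or 'Nr'."""
    return SENTENCE_DOT.sub(".\n", text)

sample = "Siehe § 50 Abs. 5 Nr. 2 des Gesetzes. Das gilt auch hier."
result = break_after_sentence_dots(sample)
print(result)
```

Note that this also skips the dot after any longer word that happens to end in "Abs" or "Nr"; adding a `\b` in front of each abbreviation would tighten that if it ever matters.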

ccstone · August 14, 2018, 3:00am

Hey Marc,

It's been some time since I've done extensive document parsing requiring sentence-level awareness with all the punctuation issues that come with the job.

One trick I've used is to make a pass at the text that changes problematic punctuation to something else.

A simple example:

etc.

Becomes

etc•

At this point I'll do my major parsing, and after that's complete I'll make another pass to change things back.

This sort of thing can require quite a number of passes and still be very, very fast.
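Here is a minimal sketch of that multi-pass trick in Python, under a few assumptions: the abbreviation tuple only holds the two strings from the original question, the bullet placeholder never occurs in the input text, and the sentence pattern is the simple one from the first post. All names are hypothetical.

```python
import re

PLACEHOLDER = "•"               # assumed not to appear in the input text
ABBREVIATIONS = ("Abs", "Nr")   # illustrative list from the question

# Pass 1 pattern: an abbreviation as a whole word, followed by its dot.
PROTECT = re.compile(r"\b(" + "|".join(map(re.escape, ABBREVIATIONS)) + r")\.")
# Pass 2 pattern: the simple sentence regex from the first post.
SENTENCE = re.compile(r"[^.!?]*[.!?]")

def split_sentences(text):
    # Pass 1: swap the problematic dots for the placeholder.
    protected = PROTECT.sub(lambda m: m.group(1) + PLACEHOLDER, text)
    # Pass 2: do the main parsing on the protected text.
    pieces = [m.group(0).strip() for m in SENTENCE.finditer(protected)]
    # Pass 3: change the placeholders back to dots.
    return [piece.replace(PLACEHOLDER, ".") for piece in pieces]

sentences = split_sentences(
    "Siehe § 50 Abs. 5 Nr. 2 des Gesetzes. Das gilt auch hier."
)
print(sentences)
# ['Siehe § 50 Abs. 5 Nr. 2 des Gesetzes.', 'Das gilt auch hier.']
```

The appeal of this design is that the sentence regex itself stays trivial; all the special cases live in the protect and restore passes.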

Rule of thumb: never make text parsing more complicated than it needs to be.

When I ignore that rule I inevitably get bitten somewhere down the line.

-Chris

ALYB · August 14, 2018, 7:00am

That's true, but I'd still like to add a few other approaches:

Require every sentence to start with an uppercase letter.
Require a minimum sentence length of 4 characters, to skip most (not all!) abbreviations, perhaps with something like .{4,}?
Work with a predefined abbreviations list.
Create an abbreviations list on the fly.
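Two of these ideas, the uppercase rule and the predefined abbreviations list, can be combined in a short Python sketch. This is only one possible reading of the suggestions: the set of abbreviations, the function name and the sample text are assumptions for illustration.

```python
import re

ABBREVIATIONS = {"Abs", "Nr"}  # predefined list (illustrative)

# Candidate boundary: whitespace that follows a terminator and precedes
# an uppercase letter (including German umlauts).
BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-ZÄÖÜ])")

def split_sentences(text):
    sentences, start = [], 0
    for m in BOUNDARY.finditer(text):
        candidate = text[start:m.start()]
        words = candidate.rstrip(".!?").split()
        # Not a real boundary if the last word before the dot is an abbreviation.
        if words and words[-1] in ABBREVIATIONS:
            continue
        sentences.append(candidate)
        start = m.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

sentences = split_sentences("Nach Abs. 5 gilt Nr. Drei nicht. Das ist klar.")
print(sentences)
# ['Nach Abs. 5 gilt Nr. Drei nicht.', 'Das ist klar.']
```

In the sample, the dot after "Abs" is ruled out by the uppercase rule alone (a digit follows), while the dot after "Nr" needs the abbreviations list because a capitalized word follows it.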

dwarfy · August 14, 2018, 3:13pm

Hey guys,

Thank you very much for the valuable input. After fiddling around with regex a bit more, I got the idea to scan typical German texts for abbreviations and build an abbreviation list to exclude from further text processing. Very handy!
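For anyone curious how such a scan might look, here is a rough Python sketch. The heuristic is an assumption of this sketch and not necessarily the one used in the macro: a short word followed by a dot counts as an abbreviation candidate when the next token starts with a lowercase letter, a digit or "§", since a real sentence would normally continue with an uppercase letter. The names, the length limit and the threshold are all hypothetical.

```python
import re
from collections import Counter

# Short word (1 to 5 letters) + dot, followed by something that does not
# look like the start of a new sentence.
CANDIDATE = re.compile(r"\b([A-Za-zÄÖÜäöüß]{1,5})\.(?=\s+[a-zäöüß0-9§])")

def abbreviation_candidates(text, min_count=2):
    """Collect words that repeatedly appear before a non-sentence-ending dot."""
    counts = Counter(CANDIDATE.findall(text))
    return {word for word, n in counts.items() if n >= min_count}

corpus = (
    "Nach § 50 Abs. 5 Nr. 2 gilt das. "
    "Siehe auch Abs. 3 und Nr. 4 dort. Ende."
)
found = abbreviation_candidates(corpus)
print(sorted(found))
# ['Abs', 'Nr']
```

Because German capitalizes nouns, an abbreviation followed by a capitalized word will be missed by this heuristic, so the resulting list is best treated as a starting point to review by hand.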

This makes the individual regex rather easy to handle. Will post some details soon.

Thanks again!
Marc
