Missing tesseract OCR digits only option

FroZen_X · March 24, 2020, 9:48am

Hello everyone,

In Tesseract itself you can restrict recognition to numbers only:
How to make tesseract to recognize only numbers

I don't see that option in Keyboard Maestro. Is this something that will be added?

Cheers,

FroZen

peternlewis · March 26, 2020, 9:27am

Blacklisting and whitelisting are not supported in Tesseract 4.0, which is what Keyboard Maestro uses.

I believe this may be resolved in Tesseract 4.1, so when I look at updating Tesseract next, this might become viable.

However, it is still problematic, since the definition of “numbers only” tends not to be so clear cut. Often you want a few other characters as well: decimal points, commas, dollar signs, spaces, returns, dashes, etc.

It is plausible I could expose the whitelist and blacklist fields entirely, but then what happens if they drop support for them in 5.0 (like they did in 4.0)?

FroZen_X · March 28, 2020, 1:56pm

Thanks for the information. For now I will go with regex, then; that should be sufficient. Interesting, and unfortunate, that they dropped support.
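
A minimal sketch of the regex route, assuming the OCR result is post-processed in Python (for example from a script action). The function names and the default set of extra characters are hypothetical; they are not Keyboard Maestro or Tesseract settings. The `extra_chars` parameter reflects the point above that "numbers only" usually needs a few additional characters.

```python
import re

def keep_numeric(ocr_text, extra_chars=".,"):
    """Drop every character that is not a digit, whitespace, or in extra_chars."""
    # re.escape makes characters such as "$" or "-" safe inside the character class.
    allowed = "0-9\\s" + re.escape(extra_chars)
    return re.sub(f"[^{allowed}]", "", ocr_text)

def find_numbers(ocr_text):
    """Return runs of digits, optionally followed by a decimal part."""
    return re.findall(r"\d+(?:\.\d+)?", ocr_text)

if __name__ == "__main__":
    print(keep_numeric("Total: $1,234.50 USD").strip())
    print(find_numbers("Qty 12 at 3.50"))
```

This only filters what the OCR engine already produced; unlike a real whitelist, it cannot steer recognition toward digits in the first place.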

DJLunacy · March 29, 2020, 1:08am

For what it's worth, I did a massive project leveraging KM and a handful of other things a couple of years ago. It used Tesseract 3, I believe; this was before Tesseract was baked into KM.

What I can tell you is that getting universally consistent results, at least in what I was attempting, was a major, major pain in the ass.

In my use case this was because the characters often used non-standard fonts. Also, there were many times it would detect letters as numbers and vice versa.

IMO, unless you have a very standardized font or something that is easily readable, you're going to want a white/black list and possibly your own trained "language" file for best results.

If you don't want to go to that extreme, you'll likely be able to leverage some regex to fix your data. Just keep in the back of your head that you're likely going to have random issues with your data set. (Not many, but don't assume it's perfect.)
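
As a rough illustration of fixing letter/digit mix-ups after the fact, here is a sketch in Python. The confusion table and the function name are assumptions, not the code from my project; which letters get mistaken for which digits depends on the font, so the table needs adjusting per data set.

```python
import re

# Assumed lookalike table (letter misread in place of a digit); tune it for your font.
LOOKALIKES = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"})

def fix_numeric_tokens(ocr_text):
    """Replace lookalike letters, but only inside tokens that already contain a digit."""
    def repair(match):
        token = match.group(0)
        if any(ch.isdigit() for ch in token):
            return token.translate(LOOKALIKES)
        return token  # pure-letter words are left untouched
    return re.sub(r"\S+", repair, ocr_text)

if __name__ == "__main__":
    print(fix_numeric_tokens("Order l2O5 Item SOLD"))
```

Restricting the fix to tokens that contain a digit keeps ordinary words intact, but it will also mangle genuine mixed codes such as part numbers, so it only suits fields that are expected to be numeric.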

I'm not sure if these tips are lurking in this board but I would also do the following...

Convert your images to grayscale.
Blow them up before running them through OCR (300-400%).
Sometimes playing around with the brightness of the image helps.

There's a law of diminishing returns with regard to how much you blow up your image. In my experience there was a lot of trial and error before I got my process refined enough that I could just let regex clean up what wasn't OCR'ed properly. (Again, I was running text and numbers at the same time.)
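
The three preprocessing steps above can be sketched in plain Python on a grid of pixels. This is illustrative only: the function names are hypothetical, the image is assumed to be rows of (r, g, b) tuples, and the nearest-neighbour enlargement is a simplification. In practice an image library or editor with proper resampling would do this work.

```python
def to_grayscale(pixels):
    """Convert rows of (r, g, b) tuples to rows of 0-255 luminance values."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
            for row in pixels]

def upscale(gray, factor):
    """Enlarge by an integer factor (3 is 300%) by repeating each pixel."""
    out = []
    for row in gray:
        wide = [value for value in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out

def adjust_brightness(gray, gain):
    """Scale every value by gain and clamp the result to the 0-255 range."""
    return [[min(255, max(0, round(value * gain))) for value in row]
            for row in gray]

if __name__ == "__main__":
    image = [[(255, 255, 255), (0, 0, 0)]]
    prepared = adjust_brightness(upscale(to_grayscale(image), 3), 1.2)
    print(len(prepared), len(prepared[0]))
```

The scale factor and brightness gain are the knobs to experiment with; as noted, larger is not always better.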

Good luck. If you get stuck with the regex: I suck at it LOL, but I might have my code saved, which may or may not be helpful in your case.
