KM's OCR Actions vs Monterey's Live Text OCR... And The Winner Is

KM's OCR and whether it might someday integrate with Monterey's Live Text was discussed in the following thread when Monterey was in Beta, but I want to make a fresh comment now that Monterey is out. (Hooray, KM 10.0 here!)

It appears that KM 10.0 continues to use its own OCR engine rather than the new Monterey Live Text OCR Engine. I guess that's not a surprise, but I've moved on. In other words, all my OCR needs in KM are met by Monterey's "Live Text" feature. And I use it a lot. It seems to have FAR better accuracy and seems to work much faster. Here's how others can copy my success. First, the short explanation:

Step 1: You create a Shortcut in the Shortcuts App in Monterey called "OCRscreen" which performs an OCR of the screen and returns the result.
Step 2: You create an Execute Shell Script action in KM which executes this: "shortcuts run OCRscreen" and saves the result into a variable.
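
As a rough sketch of Step 2, the body of the Execute Shell Script action might look something like this. It assumes the "OCRscreen" shortcut from Step 1 already exists and returns the recognized text as its result; the wrapper name `ocr_screen` is only illustrative.

```shell
#!/bin/sh
# Sketch only: assumes a Shortcut named "OCRscreen" exists (Step 1)
# and returns the recognized text as its result.

# ocr_screen is a hypothetical wrapper name, not part of KM or macOS.
ocr_screen() {
    # "shortcuts run" is the macOS (Monterey or later) command-line tool
    # that runs a Shortcut by name; its result is written to stdout.
    shortcuts run "OCRscreen"
}
```

Calling `ocr_screen` at the end of the script prints the text, and KM's action can then be set to save the script's output to a variable.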

That's the basic idea. But that is missing some important details. For one thing, do you really want the entire screen OCR'd? Usually it's better to isolate a specific region. But keep in mind that Monterey's OCR is very fast. Something that took 30 seconds with KM's engine took me only 1 second with Monterey's Engine. Monterey is also amazingly accurate: it can even read text that looks "handwritten." So because of its speed you probably can get away with reading the whole screen.

However, it seems like such a waste to OCR the whole screen, so I've created a Monterey Shortcut called OCRrectangle which does OCR only on a portion of the screen. While it's possible to pass the coordinates to the Shortcut, and I have successfully done that, I find it easier to have KM capture the area of interest and send it to Monterey's Shortcut utility as a file. So that's what I'm going to show here.

Next, in order to call this shortcut, you need to use KM to capture the relevant portion of the screen and then call the shortcut, perhaps like this example:
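
For readers who prefer to see the idea as a script, here is a minimal sketch of "capture a region, then hand the file to the shortcut". It swaps KM's own screen capture action for the macOS `screencapture` command, purely for illustration. It also assumes the "OCRrectangle" shortcut accepts an image file as its input and returns the text; the function name `ocr_region` and the default file path are made up for this example.

```shell
#!/bin/sh
# Sketch only. Assumptions: a Shortcut named "OCRrectangle" accepts an
# image file as input and returns the recognized text. ocr_region and
# the default path are illustrative names, not from the original post.

ocr_region() {
    # Arguments: x y width height (screen coordinates), optional image path.
    x=$1; y=$2; w=$3; h=$4
    img=${5:-/tmp/ocr_region.png}

    # screencapture -x: no capture sound; -R x,y,w,h: capture this rectangle.
    screencapture -x -R "${x},${y},${w},${h}" "$img" || return 1

    # --input-path hands the image file to the Shortcut as its input.
    shortcuts run "OCRrectangle" --input-path "$img"
}
```

For example, `ocr_region 100 200 600 80` would OCR a 600x80 rectangle whose top-left corner is at (100, 200).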

This approach has been so fantastic, it has changed everything for me. It's so fast and accurate that it feels like KM now has a human trigger reading the screen. Moreover, I can run this OCR in an infinite loop without any apparent system sluggishness, and my M1 Mac doesn't even get warm. (I think Intel Macs can also access Live Text.) For me this has been a game changer.

Perhaps later I may post some of the higher-level routines that exploit this. For example, I have a macro that stores all kinds of triggers and actions in a Dictionary, so that I can specify, for example, "if you see the words 'Are you sure you want to quit?' then click on Yes." In fact, my macros can actually FIND the location of words on a screen, which is something neither KM's OCR nor Monterey's OCR can do. And that's amazing too!
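
To make the trigger/action idea a little more concrete, here is a tiny sketch of the lookup step only. The original keeps its pairs in a KM Dictionary; this version uses a shell `case` statement instead, and the function name and the second phrase/action pair are hypothetical.

```shell
#!/bin/sh
# Sketch only: the original macro stores triggers and actions in a
# KM Dictionary. The function name and the second pair are hypothetical.

# Print the action for the given OCR text, or nothing if no known
# phrase appears in it.
action_for_text() {
    case $1 in
        *"Are you sure you want to quit?"*) echo "click Yes" ;;
        *"Do you want to save"*)            echo "click Save" ;;  # hypothetical pair
    esac
}
```

The caller would pass in the OCR result and then perform whatever action name comes back, if any.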

I have just one complaint: while Monterey's OCR can read an entire screen in one second, the screen capture action in KM takes two seconds to complete. This feels so strange, because capturing a screen and saving it to file should be about 10 times faster than actually reading/processing it, especially since processing it requires reading the screen from file. Is there anything Peter can do to speed up the screen capture action? (Perhaps there's an API for screen capturing instead of calling the screencapture utility in macOS.)

I'm on a self-imposed sabbatical from these forums for a year because I don't seem to get along well with people, but I'm breaking that today to provide these cool ideas to the community. And if the community likes these ideas, it may help to get KM to incorporate Monterey's OCR.

Keyboard Maestro uses an open-source OCR engine called Tesseract, which is available for 100+ languages. I've heard good things about Apple's Live Text, but it only works in 4 languages, so it's not really a replacement for Tesseract just yet.

At launch, Live Text supports seven languages: English, Chinese (Simplified and Traditional), French, Italian, German, Spanish, and Portuguese. I didn't realize the number of languages was more important than speed and accuracy. Apple's Live Text can read my screen 30x faster than Tesseract with 30x fewer mistakes.

I guess this depends on the need to OCR something in any one of the other 97 languages supported by Tesseract and unsupported by Live Text.

Luckily we can have both of them, right?

By the way, I was unable to recreate your workflow. Could you perhaps share the macro and the shortcut?

Hi @Sleepy,

Thanks a lot for sharing.
I just made a macro and shortcuts to work together. It's very useful.

The only difference is that I use the Prompt for Screen Rectangle action before the Screen Capture Area action:

With this setting, after activating the macro, I can just select the area I would like to OCR.

This is terrific @Sleepy, thanks very much. I didn't even know I wanted this but I certainly do. I couldn't find a "Highlight rectangle" action, but this variation works well for me. Thanks again!

I'll add that my main use case, in automation, is OCR'ing a specific area of an image. (Always the same area.)

The image might be displayed or in a file (mostly the former but the displayed image is stored as a file anyway). It would always be the same area. The font would be old and a bit rickety.

I'll have to experiment.

This was a great tip on using Apple's Live Text.
My KM Macro is a bit smaller. I use the interactive screencapture option from the terminal tool.
I do not need the coordinates I capture for anything else afterwards.
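
A minimal sketch of that interactive variant might look like this. It assumes the same "OCRrectangle" shortcut as earlier in the thread and that it accepts an image file as input; the function name and the default path are illustrative only.

```shell
#!/bin/sh
# Sketch only. Assumes a Shortcut named "OCRrectangle" that accepts an
# image file as input; ocr_selection and the default path are made up.

ocr_selection() {
    img=${1:-/tmp/ocr_selection.png}
    rm -f "$img"

    # screencapture -i: let the user drag out a selection; -x: no sound.
    screencapture -i -x "$img"

    # If no file was written (for example, the selection was cancelled),
    # there is nothing to OCR.
    [ -f "$img" ] || return 1

    shortcuts run "OCRrectangle" --input-path "$img"
}
```

Because the selection is made by hand each time, no coordinates need to be stored or passed around.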

I don't know how to "share shortcuts" from the Shortcut app so I can't do that. Until you get the shortcut working from the screenshot, any KM macro that I send you will simply fail. Let me know when you get the shortcut working.

Getting started with Shortcuts: 1 Basics – The Eclectic Light Company

Search for “export”.

-Chris

That's a good tip, but my shortcut has my folder and username hard-coded into it, which means it's of no use to anyone else without editing it, so I'll leave it to the user to manually type in their own shortcut with their own preferred folder name and location.

Stumbled onto a killer use case for this idea this morning: Too many times in a video meeting, someone will briefly present a Google document or other web site that I want to follow up with. In the old days I would either interrupt to ask them to share or screen capture the URL. Now I can just OCR the URL in a matter of a few seconds. Excellent!

You could also have KM running in a loop checking the video for a URL, and if it sees one, it could automatically open the URL in a window. Or a notification with a link could auto-appear and you could click on the link, I guess.
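
The "spot a URL in the OCR text" part could be sketched roughly like this. It is only a filter for text that has already been OCR'd, and it assumes the URL is shown with its `http://` or `https://` prefix; the function name `first_url` is hypothetical.

```shell
#!/bin/sh
# Sketch only: first_url is a hypothetical helper. It assumes the URL
# appears in the OCR text with its http:// or https:// prefix.

# Read OCR text on stdin and print the first URL found, if any.
first_url() {
    grep -Eo 'https?://[^[:space:]]+' | head -n 1
}
```

On a Mac, the result could then be passed to `open` to bring the page up in the default browser, or shown in a notification instead.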

I already do this for computer games on an M1 Mac with a utility (that I haven't shared on this website yet) which lets you automatically take actions, like click on a button on the screen, if it detects words like "Do you really want to quit?"

Dear @Sleepy,
Thank you indeed for this great hack! I loved also the prompt rectangle idea from @martin.
As you mentioned, the Apple OCR works extremely well. It helped me with a project that I spent almost a whole day on, and the result is flawless.
My use case:
I log in to Citrix via the Workspace app and receive a passcode by SMS (visible in Messages.app). I couldn't find a way to read or copy the new SMS without any input, but with the OCR trigger I can be less careful and select a much bigger portion of the screen; after some regex filtering, I get exactly the code I need.
Now the whole procedure of opening and switching windows and pasting the passcode into the input window is automated. The only part I do manually is taking the screenshot, but I believe that if I preset the window size of the Messages app and work out the coordinates where the new SMS appears, I can fully automate the screenshot part too and spend those annoying 2-3 minutes of my life sipping a coffee instead.

EDIT: I have actually told the messages app to resize as I need it to, then I know where the SMS is appearing and was able to set the coordinates for the screen capture. Now all is automated!
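
The regex filtering step might look something like the sketch below. The post does not say what the passcode looks like, so the 6-digit format here is purely an assumption, and `extract_passcode` is a made-up name.

```shell
#!/bin/sh
# Sketch only: the 6-digit passcode format is an assumption, and
# extract_passcode is a hypothetical helper name.

# Read OCR text on stdin and print the first standalone 6-digit number.
extract_passcode() {
    # First pass: six digits not touching other digits on either side,
    # so longer numbers (such as phone numbers) are skipped.
    # Second pass: strip the surrounding non-digit characters.
    grep -Eo '(^|[^0-9])[0-9]{6}([^0-9]|$)' | grep -Eo '[0-9]{6}' | head -n 1
}
```

Selecting a generous area of the screen is fine with this approach, since everything except the code is discarded by the filter.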

Thank you all for this great post and the comments!
Stan

Following the discussion above, I have exported the shortcut and the macro for those who are too lazy to reproduce everything or are stuck somewhere in the process. I hope it helps.
The OCR Shortcut for the Shortcuts app: once imported, you only need to change the path, as it obviously points to my computer - link
The OCR grab macro (limited to the languages Apple supports: I have tried English, German and Spanish, and the text recognition was flawless, including special characters; Cyrillic doesn't work) -
OCR_Read_OSX.kmmacros (3.4 KB)
The macro doesn't have a trigger enabled for the time being, so you can choose your own. Please edit the path in step 3 (Write System Clipboard to File) to the same location you set in the Shortcuts app.

After importing everything and triggering the macro, the rectangle appears. Select the area you want to grab; the process runs and the OCR'd content is copied to the clipboard. All that is left for the user is to paste it wherever they want.

Once again, many thanks for the amazing community here.
Stan

That's me! I'm definitely lazy, that's why I like to automate anything I can.

Thank you @stanivanov for sharing that and of course @Sleepy for coming up with this idea and sharing. And @Martin for the bit about Screen Capture Area - sorry if I missed anyone...

It works great and I've adapted for my own needs.

The path that the .jpeg file is saved to needs to be changed in both the OS Shortcut and the Keyboard Maestro macro (just wanted to mention that for others following along).

Also, I like to use Local Variable names so I changed the Variable names in your example Macro. It's a good practice to get into as then the Variables don't persist after the Macro has run. Just a tip to bear in mind :grinning:

The actual recognition of text is quite incredibly accurate - even with white text on a black background. Again well done @Sleepy for being the pioneer!! :clap: :clap: :clap:

Thanks. I measured Monterey's OCR speed once (a single test, so your results may differ), and it really was 30x faster. My claim that it's 30x more accurate is more subjective, based on inspecting some sample results. For my purposes, "false negatives" matter more than "false positives": it's more important that the OCR doesn't misread words that are actually there than that it never produces the occasional spurious word. Even Monterey comes up with a spurious word from time to time, but that doesn't matter when it almost always reads the words it does see correctly. I hope the distinction is clear.

There is another fabulous new feature of Monterey that I'm working on integrating with KM, and when I'm confident that my technique will earn three claps, I will also release that. Perhaps others will beat me to it; that's okay.

And just discovered something else very good... I thought I would have to go through the setup for each of my Macs but the Shortcuts App syncs via iCloud so the OS Shortcut was already on my second Mac and as I have Keyboard Maestro syncing, everything was in place already and just worked on the second Mac.

The path I save the screenshot to is not the Desktop but the one below (my previous OCR macro was already saving there, and I think this path would be the same on any Mac):

/tmp/screencap.png

Thanks @stanivanov. That was helpful.

I wonder if someone managed to OCR entire PDFs using Live Text (ignoring already existing layers of text in the PDF). I have been trying to use shortcuts to do it, but without success.

I have lots of experience with Monterey OCR. The main issue with what you are asking is that Monterey OCR returns text from the top down, regardless of horizontal position. So if the source has two columns, you won't get very useful results. If it's something really simple, like pages in a book, I think that would be rather easy to get working.
