I'm trying to build a crazy Macro that can navigate a phone-tree system based on its voice prompts. I was initially thinking of using OpenAI's Whisper to transcribe the voice prompts, and then figure out a way to pass the info to Keyboard Maestro for triggering, but I learned that--at least at the time of this writing--Whisper can't do real-time transcriptions.
Any suggestions for alternative options that can do real-time transcriptions, and then somehow pass that info to KM as some form of trigger?
Wow! That's right on the cutting edge of what I was experimenting with yesterday! It's like you are reading my mind. I'm willing to share what I have so far.
So far, my system even works if the audio is spoken out of the speaker of my speakerphone. But there are ways to directly connect audio to a Mac for better accuracy. What is the nature of the setup for getting your phone's audio into your Mac? What version of macOS do you have?
macOS (Sonoma) has a feature called "Live Captions" which automatically transcribes anything that it "hears." By "hearing," I mean that it can choose from at least two sources: the microphone, or its own internal sounds that it generates through its speakers. Both of these choices have interesting possibilities. And there may be a third possibility, by getting Audio Hijack, which can isolate audio channels in the Mac even further. (I own Audio Hijack, but I haven't tested it in this solution yet.)
The "output" for Live Captions is a window on the screen where the transcriptions are displayed. For our purposes, it would be much better if the output were sent to a text file instead, but I can see why that's not possible: Live Captions frequently erases what it has displayed and puts up new text to replace the old text. So that means we have to accommodate this issue.
Fortunately, KM v11 has built-in support for Apple's latest, greatest OCR, called Apple Text Recognition. It's extremely fast and accurate. We could place it in a loop to read the Live Captions output. Therefore, we have the ability to do what you want to do.
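Since Live Captions keeps erasing and replacing the text it displays, the OCR loop has to stitch successive snapshots into one running transcript. Here's a rough Python sketch of one way to do that (the function name and approach are my own invention, not anything KM ships; KM could run something like this via an Execute Script action):

```python
def merge_snapshot(transcript, snapshot):
    """Append only the new portion of an OCR snapshot to the running transcript.

    Because the captions window redraws itself, successive OCR passes overlap;
    we find the longest suffix of what we already have that matches the start
    of the new snapshot, and append only what follows it.
    """
    snapshot = " ".join(snapshot.split())  # normalise whitespace from OCR
    if not transcript:
        return snapshot
    for k in range(min(len(transcript), len(snapshot)), 0, -1):
        if transcript.endswith(snapshot[:k]):
            return transcript + snapshot[k:]
    # No overlap found: the captions window was cleared; start a new segment.
    return transcript + " " + snapshot
```

Each OCR pass would feed its result through this, so the transcript only ever grows.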
Fortunately, phone tree systems are usually fairly patient. They don't hang up immediately. So if our OCR takes a few seconds, that shouldn't be a problem. But I don't think it will be slow at all.
You said you wanted "some form of a trigger." I wouldn't necessarily call it a "trigger" here. It's just a process that our software will have to undertake to determine when it has finished hearing an option.
How were you planning to test this? Do you have your own voice prompt generator? Or do you have a particular company's system that you want to test this on? BIG IDEA: I think it would be possible with KM to actually generate "voice prompt output" that would be sent out through the Mac's own speakers, and then have the same Mac running a "voice prompt input" macro analyzing those voice prompts. If you do the former, I'd be happy to attempt the latter, since I've already done half the work there.
I'm using both Audio Hijack and Loopback to route audio to and from FaceTime, which handles the actual phone call via Handoff from my iPhone. FaceTime seems to be a bit fragile when trying to outsmart its input and output routing, but for the most part it works. It's also annoying that you can no longer use the Mac keyboard to type in numbers on the FaceTime keypad during a call, so I have to resort to using SoX for playing back DTMF tones through the FaceTime audio input (via Loopback).
I've just learned that Live Captions (which is officially still in beta) uses something that appears similar to Apple TV's copy protection technology, which prevents screenshots of the region of the screen where Live Captions are being displayed. With Apple TV video, if you try to take an image of the screen, all you get is a black box, but with Live Captions, all you get is a grey box. (Although curiously, the OCR action works on Apple TV windows, but not on the Live Captions window. Weird.)
There may still be a way around this. Yesterday, I think I had it working because I was using it in conjunction with Voice Control, which is a separate macOS feature.
Because I prefer to avoid brute-force mouse-clicking whenever possible, as it adds extra layers of complexity and unpredictability (and makes the Macro much less portable). The keypad cannot be opened until the call is actually connected, so you'd need to build in a Found Image condition to click the Keypad button (after first properly giving the FaceTime HUD focus), then click each number manually. You can obviously do it that way, but I prefer to do things in a more portable and reliable fashion, such as telling SoX to play the built-in macOS system DTMF tones through the audio output with the proper tone lengths and pauses in between tones.
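For anyone curious how the DTMF side works: each key is just the sum of two sine waves, one "row" frequency and one "column" frequency. Here's a minimal Python sketch that generates the samples for a key (the function name is mine; this is just to show the math, not how SoX or macOS implements it):

```python
import math

# Standard DTMF pairs: each key is a "row" frequency plus a "column" frequency.
DTMF = {
    "1": (697, 1209), "2": (697, 1336), "3": (697, 1477),
    "4": (770, 1209), "5": (770, 1336), "6": (770, 1477),
    "7": (852, 1209), "8": (852, 1336), "9": (852, 1477),
    "*": (941, 1209), "0": (941, 1336), "#": (941, 1477),
}

def dtmf_samples(key, duration=0.2, rate=8000):
    """Mono float samples for one key: the sum of its two sine tones,
    scaled to stay within [-1, 1]."""
    low, high = DTMF[key]
    return [0.5 * (math.sin(2 * math.pi * low * t / rate)
                   + math.sin(2 * math.pi * high * t / rate))
            for t in range(int(duration * rate))]
```

If I'm reading the SoX docs right, something like `play -n synth 0.2 sin 941 sin 1336 remix 1,2` should synthesize the "0" key directly, but I haven't tested that exact invocation.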
You can do anything you want, and your way looks good, but you wouldn't need a Found Image action to locate the button. A simple math formula would work fine. But yes, the window would have to be active. I didn't know that you didn't want that.
Okay, I think I have a system that works reliably. It requires two macOS features running: (1) Voice Control, and (2) Text to Speech, plus a KM macro that reads the output of Text to Speech, which isn't copy protected like Live Captions is. There is no requirement for Live Captions, although I suspect that Voice Control is using the same underlying code as Live Captions.
However, Text to Speech requires "focus" and you said you didn't even want focus for FaceTime, so my solution won't work for you. But I'll tell you what my solution is, anyway.
Step 1. Get Voice Control activated (use the audio source of your liking; for testing I just use the Mic)
Step 2. Get Text to Speech opened up. Place the window somewhere on the bottom of your screen. Make sure your screen background doesn't contain any words. I recommend using black as your screen background.
Step 3. Give focus to the Text to Speech application by clicking in its typing box.
Prior to this, you should set up a KM macro that repeatedly uses Apple Text OCR to read that portion of the screen and puts the result into a variable, which I currently call "SpokenWords".
Now that's the basis for a solution to your needs, apart from the fact that my solution requires focus. But now the real work begins. You would have to write code that looks for "words" and takes the appropriate action. For example, it could monitor the contents of SpokenWords and if it sees the words "press zero for more information" then you could do whatever you do to press a zero (I would use a mouse click, but you can use what you want.)
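To illustrate the "look for words and take action" step, here's a hedged Python sketch of how the contents of SpokenWords could be scanned for prompts like "press zero for more information". The pattern and helper names are my own guesses; a real phone tree would need more patterns and more robustness:

```python
import re

# Hypothetical helper names; nothing here is part of KM itself.
WORD_TO_DIGIT = {"zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
                 "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9"}

# Matches prompts like "press zero for more information" or "press 2 for billing".
# The description part stops at the next "press" so consecutive prompts stay separate.
PROMPT = re.compile(
    r"press\s+(zero|one|two|three|four|five|six|seven|eight|nine|\d)"
    r"\s+(?:for|to)\s+((?:(?!press\b)[a-z]+\s*)+)",
    re.IGNORECASE,
)

def find_option(transcript, keyword):
    """Return the key to press for the first prompt whose description mentions keyword."""
    for match in PROMPT.finditer(transcript):
        spoken = match.group(1).lower()
        description = match.group(2).lower().strip()
        if keyword.lower() in description:
            return WORD_TO_DIGIT.get(spoken, spoken)
    return None
```

So monitoring SpokenWords for "billing" would hand back "2" on a transcript containing "press two for billing", and you'd then press the key however you like (mouse click, DTMF tone, etc.).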
I am not sure if any built-in macOS features allow transcription to a file in real time, which is what it sounds like you want. You may need a third-party application for that. But I'm the kind of person who demands cheap solutions, and this is free.
Just for clarification, it's not that I "didn't even want focus for FaceTime." I was just saying that I prefer to build automations that avoid having to manually click the mouse at fixed coordinates or force-focus a window if there's another way to accomplish the same end result. If the only literal way is brute-force mouse clicking, then of course I would do that.
The more I think about this, I'm wondering if a more practical feature request would be for a "Matched Audio" condition, similar to how the "Found Image" conditions work. You could designate a sound clip to match against, and the condition would be met when it detects a similar waveform in the live audio stream.
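For what it's worth, the simplest version of such a "Matched Audio" condition would be normalized cross-correlation: slide the reference clip along the incoming audio and fire when the similarity score crosses a threshold. A toy Python sketch of the idea (my own illustration, not anything KM actually does, and real audio matching would want spectral features rather than raw samples):

```python
import math

def normalized_xcorr_peak(stream, clip):
    """Slide clip across stream; return the best normalized correlation
    (in [-1, 1]) and the sample offset where it occurs."""
    n = len(clip)
    clip_energy = math.sqrt(sum(x * x for x in clip))
    best, best_at = -1.0, 0
    for offset in range(len(stream) - n + 1):
        window = stream[offset:offset + n]
        win_energy = math.sqrt(sum(x * x for x in window))
        if win_energy == 0 or clip_energy == 0:
            continue  # silence can't match anything
        score = sum(a * b for a, b in zip(window, clip)) / (win_energy * clip_energy)
        if score > best:
            best, best_at = score, offset
    return best, best_at
```

A "Matched Audio" condition would then be something like `best > 0.9`, with the threshold chosen to tolerate line noise.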
Of course that makes sense to me in theory, but I'm sure from Peter's standpoint it would be a lot more complicated than that. But just a thought, regardless.
Well then, you could try my approach. I gave you the elements of a working setup.
However, I didn't say the mouse would have to click at "fixed coordinates." What I said was that "a simple math formula" could calculate the coordinates of the buttons, based on the coordinates of the FaceTime window, not some fixed address. You could still move the FaceTime window freely, and the buttons would be instantly located.
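As a sketch of the kind of "simple math formula" I mean (the offsets below are made-up placeholders; you'd measure them once for your FaceTime version, and after that the buttons stay findable wherever the window moves):

```python
# Hypothetical keypad geometry, relative to the window's top-left corner.
# These numbers are placeholders, not real FaceTime measurements.
KEYPAD_ORIGIN = (120, 200)   # centre of the "1" button, relative to the window
BUTTON_SPACING = (70, 60)    # horizontal and vertical distance between centres

LAYOUT = ["123", "456", "789", "*0#"]

def button_centre(key, window_left, window_top):
    """Screen coordinates of a keypad button, derived from the window position,
    so the window can be moved freely without breaking the macro."""
    for row, keys in enumerate(LAYOUT):
        col = keys.find(key)
        if col != -1:
            return (window_left + KEYPAD_ORIGIN[0] + col * BUTTON_SPACING[0],
                    window_top + KEYPAD_ORIGIN[1] + row * BUTTON_SPACING[1])
    raise ValueError(f"not a keypad key: {key!r}")
```

KM would supply the window's frame via its window tokens, and the click lands on the right button no matter where the window sits.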