I have a task I've been looking to automate and am struggling a bit. I have jpg files that have some numbers in them that I'd like to be able to extract and automatically enter into a web form to perform some calculations. I have a flow for getting OCR into a plaintext file, so what I'm looking to do is search within the text of the file and pull out these numbers and store them in the clipboard history, similar to Tom's suggestion in this post.
I'm not awesome with regex, but I can figure out how to search for each term I want. I've been having some difficulty with the "For Each..." action: I have multiple search terms and want to copy each match to the clipboard, but the macro stops after the first "For Each" instance executes (I've tried swapping the order, and each instance works, but only when it runs first).
I've attached my macro and some sample text is below; any thoughts would be greatly appreciated.
Thanks for the sample data, but sometimes the actual characters do not come across when posted directly. Could you please zip this file, /Users/jon/Dropbox/Lenstar.txt, and upload to the forum.
Also, if you’d like to lay out the step-by-step manual process you use, from selecting data in the file to posting it on the web form, perhaps we can better automate your process. And please provide the URL of the web form if you can.
P.S. I made an edit of your OP to put the data between triple backquotes:
Thanks for the speedy reply and willingness to help! By way of background, the purpose here is to double-check calculations used for implants in cataract surgery. Usually, the data are entered into the website by opening the jpg file on our imaging server (or looking at a printed hard copy) and manually typing each of the values into the web form. We operate on 8-15 patients per week, every week, so it becomes tedious and time consuming to do it all manually.
When looking at that file, I want to pull the same information for both the right and left eye, so there will be two instances of each pattern and the numbers will be different for all. I'll use examples from the file and bold the numbers I need to get for the form:
AL [mm] 23.49
R1[mm/D/°] 7.68 / 43.93 - the number 7.68 will change but always be a single digit with two decimal places
R2[mm/D/°] 7.63 / 44.26 - the number 7.63 will change but always be a single digit with two decimal places
ACD [mm] 3.35
LT [mm] 4.20
WTW [mm] 11.76
The website for the form is https://www.apacrs.org/barrett_universal2/
AL goes in the Axial Length field
R1 goes in Measured K1
R2 goes in Measured K2
ACD goes in Optical ACD
LT goes in Lens Thickness
WTW goes in WTW
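Outside of Keyboard Maestro, the extraction itself comes down to one regex per label. Here's a minimal Python sketch against the sample lines above; the exact spacing and the ° character are assumptions based on this one sample, so the patterns may need loosening for real OCR output:

```python
import re

sample = """\
AL [mm] 23.49
R1[mm/D/°] 7.68 / 43.93
R2[mm/D/°] 7.63 / 44.26
ACD [mm] 3.35
LT [mm] 4.20
WTW [mm] 11.76
"""

# Form field -> regex capturing the first number after each label.
# \d\.\d{2} matches a single digit with two decimals (e.g. 7.68);
# \d{1,2}\.\d{2} also allows two-digit values like 23.49 or 11.76.
patterns = {
    "Axial Length":   r"AL \[mm\]\s*(\d{1,2}\.\d{2})",
    "Measured K1":    r"R1\[mm/D/°\]\s*(\d\.\d{2})",
    "Measured K2":    r"R2\[mm/D/°\]\s*(\d\.\d{2})",
    "Optical ACD":    r"ACD \[mm\]\s*(\d\.\d{2})",
    "Lens Thickness": r"LT \[mm\]\s*(\d\.\d{2})",
    "WTW":            r"WTW \[mm\]\s*(\d{1,2}\.\d{2})",
}

values = {}
for field, pat in patterns.items():
    m = re.search(pat, sample)
    if m:
        values[field] = m.group(1)

print(values)
```

The same capture groups translate directly to Keyboard Maestro's "Search using Regular Expression" action, one per variable.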
I hope that clarifies things. Many thanks again for taking a look, I appreciate your efforts.
The data structure of your Lenstar.txt file is ambiguous. I'd love to see how it was derived from the JPEG.
The problem is that while there's a Right Eye and Left Eye line, not all the Right Eye variables exist before the Left Eye line. So those lines can't be used to distinguish between the values of some variables for each eye.
There is, of course, a first and a second occurrence of each variable, but those don't actually indicate which eye they belong to. We might assume the first occurrence belongs to the Right Eye, but the data structure doesn't confirm that. So, you know, not with my eye.
Anyway, here's a macro that does parse the sample file into values for both eyes. It relies on first occurrence of the Right Eye and uses both Left Eye and second occurrence for the left eye values.
You might want to run this on a few data files to see if it's reliable.
But I'd be interested in knowing if this macro parses more than one data file correctly first. And if not, the next move would be to look at the JPEG header (presumably) itself.
Are you aware of the very real possibility of errors when using OCR? It is far from a perfect process.
So your eye data could have errors and/or missing data.
I’d highly recommend that you try to obtain the eye data from the original source in a digital form, NOT in an image form.
Having said that, if the pattern in the eye data is consistent, then there is a clear pattern for data extraction. Please review your sample data and ensure it is complete and consistent, and without any duplicate data. Then post this revised sample data.
I have a working draft Macro, but the RegEx is very sensitive to errors/changes. Your data must be in good order for RegEx to work.
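One way to confirm a file is in good order before extracting: count the occurrences of each expected label and flag anything missing or duplicated. A minimal Python sketch (the label list and line format are assumed from the sample above):

```python
import re

EXPECTED_LABELS = ["AL", "R1", "R2", "ACD", "LT", "WTW"]

def check_ocr_text(text):
    """Return a list of problems found in a two-eye report:
    each label should appear exactly twice, once per eye."""
    problems = []
    for label in EXPECTED_LABELS:
        # " ?\[" tolerates both "AL [mm]" and "R1[mm/D/°]" spacing.
        n = len(re.findall(rf"^{label} ?\[", text, flags=re.M))
        if n != 2:
            problems.append(f"{label}: expected 2 occurrences, found {n}")
    return problems
```

Running this on each OCR output before the regex extraction would catch the inconsistent files early rather than silently extracting the wrong eye's numbers.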
I ran a few more files through, and while the data extracted are all correct, I’m finding that the OCR does not always follow a consistent pattern in the text file it generates. In the above file, as you rightly mentioned, some of the data for the right eye got placed with the left eye data, but in other instances it’s correctly grouped.
Unfortunately, with our current imaging system, there isn’t a way to obtain the data directly from the source: the data are generated on the machine that makes the measurements, and the calculations are then exported to the imaging system, which only exports image files, as opposed to a PDF or something else that could have a text layer.
The jpg files have all the data for the left and right eyes on opposite sides of the page, so I could try to find a way to automatically split each one into separate files for the left and right eye. That would guarantee that only the data for a single eye is present in a given output file, though it would complicate things further down the workflow.
Perhaps you could contact the manufacturer of the measurement machine/system, and see if there is a way to get the data directly, like in XML format. Most likely the measurement system is sending the data in XML to the imaging system, which superimposes the data on the image.
Is your doctor aware of this process, using OCR to get the data? To be honest, I would doubt it meets medical standards.
I didn’t realize you were scanning an image to recognize text. Surprised not to see more OCR errors, actually.
Without seeing the image you are scanning, I suspect the first occurrence/second occurrence would be the most accurate approach, although the variation you cite is concerning.
And you can simplify the scan by masking off one side and then the other so you have two OCR scans for each image.
But I’m going to second JMichaelTX’s concern that this isn’t the way to run a railroad. And I’ll second his suggestion you discuss it with the manufacturer of the system, too. Who knows, you might get a Nobel Prize for figuring this out.
I have found that ABBYY does a good job with OCR. It helps that the jpgs are directly exported from the machine and not scanned images, so the quality of the text is about as good as it gets. I wish I were able to submit a sample image, but I don’t think I could do that without releasing potentially identifiable information.

In retrospect, even if it were able to be automated, I’d still have to double-check everything, so it probably wouldn’t save all that much time.

I looked at the specs, and our machine theoretically has the ability to do these calculations. I’m not sure if it’s another module that you have to pay a licensing fee for, but it’s worth looking into, as more and more of us are using this formula for calculations.