Remove Superscript from Text Copied from a PDF Document and Paste It into Notes

Let me start over and try to make it more clear.

I have a PDF document with the following text. I created this screenshot on the Mac using Cmd + Shift + 4, then selecting the part of the PDF I wanted.

Screenshot 2023-01-26 at 1.56.44 PM

I want to remove all of the superscript characters from the text and place the updated text into Notes. The superscript characters are the small characters, like the small a just before the word "first". This image contains superscripts a-f and I want to remove all of them.

The output that is placed into Notes should look like this:

4 For it came to pass in the commencement of the first year of the reign of Zedekiah, king of Judah, (my father, Lehi, having dwelt at Jerusalem in all his days); and in that same year there came many prophets, prophesying unto the people that they must repent, or the great city Jerusalem must be destroyed.

Hopefully this is better.
Roger

I hesitate to respond because this is a very deep hole you are digging and I don't leap over deep holes in a single bound.

But to clarify the problem you are dealing with, I OCRed your screen shot and got this:

if 4 For it came to pass in the com-
mencement of the “first year of the
reign of °Zedekiah, king of Judah,
(my father, Lehi, having dwelt at
‘Jerusalem in all his days); and in

>
that same year there came many
4prophets, prophesying unto the
people that they must ‘repent, or
the great city Jerusalem must be

[ destroyed. -

This shows that your superscripts are read as characters like “ ° ‘ when they are seen at all (the f is invisible, for example). So OCR is not going to reliably solve this problem for you.

Not even running a spellcheck would find the problems, because some of them are simply legitimate punctuation.

So my suggestion is to walk around the deep hole by accessing the PDF text directly (you might try to make a selection of the text), assuming it is text and not an image, and seeing how you might identify the superscripts.

That can be a deep hole too, as this Extract Subscript / Superscript discussion shows. (Clever, though, trying to find a change in baseline.)

Sorry I can't be more help.

Hey Roger,

This is not clear at all...

You're now suggesting, as @mrpasini conjectures, that you're OCR'ing the text, but you're actually copying it from a PDF viewer of some sort – right?

  • What PDF viewer?
  • What software are you using in all this?
  • On what version of macOS?
  • With what version of Keyboard Maestro?

If you can – zip a copy of the PDF and make it available for testing.

If not, then export a page (if you can) and make it available for testing.

-Chris

Yes, I have tried both OCR and actual copy and paste, but I would like to highlight the verse and use Cmd-C to get it onto the clipboard, then run the macro to remove the superscripts and put the finished output back on the clipboard so I can paste it into another program such as Notes.

I am using Mac Preview as the PDF viewer.
macOS Ventura 13.1
KBM 10.2

Zip Copy of a sample page

Book-of-Mormon-PDF.pdf.zip (406.1 KB)

I am using verse 4 at the bottom right as an example. It has superscripts a-f, but other verses may have more or fewer.

Output desired using verse 4 as input:

4 For it came to pass in the commencement of the first year of the reign of Zedekiah, king of Judah, (my father, Lehi, having dwelt at Jerusalem in all his days); and in that same year there came many prophets, prophesying unto the people that they must repent, or the great city Jerusalem must be destroyed.

I think you are going to have to go the "edit the raw RTF" route. Luckily -- at least, in the sample provided -- the bits to remove all follow the same form:

\f1\i\b\fs18\fsmilli9333 \up10 b 
\f0\i0\b0\fs32 \up0 

...where the b at the end of the first line is the footnote character (which will vary, obviously) and there's a trailing space at the end of the second line. The \f1 is always at the start of the line.

So ^\\f1(.|\n)*?\\up0 looks like a reasonable regex...
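For anyone who wants to try the pattern outside Keyboard Maestro, here is a minimal Python sketch of the same substitution. The sample RTF fragment and the function name are made up for illustration, modelled on the two lines quoted above; real RTF copied from Preview may differ. In Python the `re.MULTILINE` flag is needed so that `^` matches at the start of each line.

```python
import re

# Regex from the post, unchanged. re.MULTILINE lets ^ match at every line start.
SUPERSCRIPT_RUN = re.compile(r"^\\f1(.|\n)*?\\up0", re.MULTILINE)


def strip_superscript_runs(rtf: str) -> str:
    """Delete each run from a line-initial \\f1 through the next \\up0."""
    return SUPERSCRIPT_RUN.sub("", rtf)


# Hypothetical fragment (an assumption, not taken from the actual PDF).
SAMPLE_RTF = (
    "\\f0\\fs32 in the commencement of the \n"
    "\\f1\\i\\b\\fs18\\fsmilli9333 \\up10 a \n"
    "\\f0\\i0\\b0\\fs32 \\up0 first year of the reign of \n"
    "\\f1\\i\\b\\fs18\\fsmilli9333 \\up10 b \n"
    "\\f0\\i0\\b0\\fs32 \\up0 Zedekiah, king of Judah,\n"
)

if __name__ == "__main__":
    # Prints the fragment with both footnote runs removed.
    print(strip_superscript_runs(SAMPLE_RTF))
```

The lazy `*?` matters here: it stops at the first `\up0` after each `\f1`, so the verse text between two footnote markers is left alone.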

Best I can come up with is to bounce the RTF through a temporary file -- perhaps a guru can step up with a way to do this directly on the Clipboard? Anyway, this should allow you to copy the text in Preview, run the macro, then paste sans superscripts into Notes (if your sample text is representative):

Regexing RTF.kmmacros (2.1 KB)

Nice_S

Thanks, that worked. I don't understand how, but I will try to figure it out.

Roger

Copy a text block from your PDF and paste directly in TextEdit, without the macro. Save the TextEdit document and close it. Still in TextEdit, do "File"->"Open...", click the "Options" button, tick "Ignore rich text commands", and open the just-saved file.

You'll now see all the RTF formatting code and will see where the superscripts are, the code before/after them, and can match that up with the macro's regex.
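If scanning the raw RTF by eye gets tedious, a short script can list the raised runs for you. This is only a sketch: it assumes the superscripts are marked with the RTF `\upN` control word (as in the sample earlier in the thread), and the function name and sample fragment are hypothetical.

```python
import re

# Match \upN, an optional delimiter space, then text up to the next backslash or newline.
RAISED = re.compile(r"\\up(\d+) ?([^\\\n]*)")


def find_raised_runs(rtf: str) -> list[tuple[int, str]]:
    """Return (offset, text) for each \\upN run where N is greater than 0."""
    runs = []
    for match in RAISED.finditer(rtf):
        offset = int(match.group(1))
        if offset > 0:  # \up0 means back on the baseline, so skip it
            runs.append((offset, match.group(2).strip()))
    return runs


# Hypothetical fragment (an assumption, not taken from the actual PDF).
SAMPLE_RTF = (
    "\\f0\\fs32 in the commencement of the \n"
    "\\f1\\i\\b\\fs18\\fsmilli9333 \\up10 a \n"
    "\\f0\\i0\\b0\\fs32 \\up0 first year of the reign of \n"
    "\\f1\\i\\b\\fs18\\fsmilli9333 \\up10 b \n"
    "\\f0\\i0\\b0\\fs32 \\up0 Zedekiah, king of Judah,\n"
)

if __name__ == "__main__":
    for offset, text in find_raised_runs(SAMPLE_RTF):
        print(f"raised by {offset}: {text!r}")
```

To run it on the file saved from TextEdit, read the file's contents into a string and pass that to `find_raised_runs` in place of the sample.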