Best way to Search and replace large text into concise notes

So I want to write notes from a PDF textbook that is written in complete sentences.
I want to remove all the "useless words" from the text.

I want some more fine tuning. And obviously there must be a better way.
Some problems:
Handling exceptions:
when the text says "which is", I don't want to search and replace.
I will obviously find more of these exceptions later, and wanted to find a better way to handle them.

There are certain times when text recognition software creates a weird double line break in the middle of a sentence. I want to be able to tackle that as well.

Thanks in anticipation

Right now, I have this:

Seems a little tortuous to use Keyboard Maestro for this. I would consider using a text editor to deal with this task. As text it would be more flexible and easier to manage and debug IMO. (BBEdit has "factories" that you can construct to do such a thing as you have outlined). With a rudimentary knowledge of a programming language like Python this would be a lot easier to try and maintain and grow.

  1. Anyway, back to what you have actually asked: one suggestion for the "which is" problem is to concatenate the two words into "whichis" (Search for "which is", Replace with "whichis"). Then you can either rely on your ability to decode "whichis" in your head or, at the end of the script, reverse the process (Search for "whichis", Replace with "which is").

  2. It would seem that using Regex would help with some of the problems that your code implies you have encountered. It is designed to see patterns, such as when the letters "and" are an independent word and when they are simply inside another word (like panda). You have tried to identify "and" the word by searching for " and " (surrounded by spaces). That only sort of works. What if the text includes "… and, when they come, go to the store"? Regex is designed to deal with such issues. For example, the pattern \band\b matches "and" only where it stands alone as a word.
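Suggestion 1 could be sketched like this in Python (the language is my choice for illustration, and the function name is made up). It assumes the fused placeholder "whichis" never occurs in the real text, and it only protects the lowercase phrase:

```python
import re

PLACEHOLDER = "whichis"  # fused form from suggestion 1; assumed absent from the text

def strip_is_except_which_is(text: str) -> str:
    # Step 1: fuse the phrase to keep, so the next step cannot match inside it.
    text = re.sub(r"\bwhich is\b", PLACEHOLDER, text)
    # Step 2: remove the standalone word "is" plus one following space, if any.
    text = re.sub(r"\bis\b ?", "", text)
    # Step 3: restore the protected phrase.
    return text.replace(PLACEHOLDER, "which is")

print(strip_is_except_which_is("The heart is a pump which is muscular."))
# prints: The heart a pump which is muscular.
```

Each new exception would need its own protect/restore pair, so this only stays manageable while the list of exceptions is short.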

Your code has bugs. It is going to take "breathe" and turn it into "brea".
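To make the "breathe" problem concrete, here is a small Python sketch (the sample sentence is invented) comparing a plain substring replace with a word-boundary Regex:

```python
import re

text = "Breathe in and hold the pose, and, when ready, relax like a panda."

# Plain substring removal also hits the "the" inside "Breathe".
naive = text.replace("the", "")

# \b matches only at a word edge, so "Breathe" and "panda" are left intact,
# and "and," is still matched even though a comma follows it.
no_the = re.sub(r"\bthe\b ?", "", text)
ampersand = re.sub(r"\band\b", "&", no_the)

print(naive)      # Brea in and hold  pose, and, when ready, relax like a panda.
print(ampersand)  # Breathe in & hold pose, &, when ready, relax like a panda.
```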

The free version of BBEdit has nice ordinary and Regex search and replace capabilities. Debugging your ideas will be much easier because you can try various ideas, see the effect immediately, and reverse them when necessary. If you do not know Regex, BBEdit is a good environment to practice in. And you do not have to learn all of Regex to accomplish what you appear to need to do. Your needs are pretty simple.

If you did not want to deal with the "factories" of BBEdit, once you had figured out all the Search and Replace commands that you need, you could use Keyboard Maestro to automate BBEdit: go through the commands one by one, fill in that program's Search and Replace dialogue, and apply each to the text document that you are trying to "rid" of useless words. The list of Search and Replace texts could be maintained in a simple text document that Keyboard Maestro could draw from. And as you learn and debug additional Search and Replace texts, they could just be added to that document.
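The stray double line breaks from the text recognition software can probably be handled with a Regex as well. A rough Python sketch, under the assumption (mine, not from the question) that a blank line is an artifact only when the text before it does not end a sentence and the text after it starts with a lowercase letter:

```python
import re

def join_broken_sentences(text: str) -> str:
    # Match a blank line only if the character before it is not whitespace or
    # sentence-ending punctuation, and the next text starts in lowercase.
    pattern = r"(?<![.!?:;\"')\s])[ \t]*\n[ \t]*\n[ \t]*(?=[a-z])"
    return re.sub(pattern, " ", text)

sample = "the heart\n\npumps blood.\n\nNext paragraph."
print(join_broken_sentences(sample))
```

This is a heuristic: a real paragraph that begins with a lowercase word after an unpunctuated heading would be joined too, so it is worth checking the result on a sample page first.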

Hey Guys,

It should be noted that BBEdit is one of the most AppleScriptable apps on the planet, and even the free version will run AppleScripts and Shell Scripts – so brute-forcing the UI isn't necessary.

A very simple Perl script for doing multiple instances of find/replace:

#!/usr/bin/env perl -0777 -nsw

s!\bLorem\b!•••••!ig;
s!\bipsum\b!••••!ig;
s!\bdolor\b!•••••!ig;
s!\bsit\b!•••!ig;
s!\bamet\b!•••••!ig;

print;

** I would do this differently for a really huge document, but for most things this method is great.

-Chris

It might be easier to maintain this project as you originally outlined it if you kept all your Search/Replace commands in a single text document. Then you could have Keyboard Maestro access this text document and apply its commands to whatever document you wish to abbreviate.

Your current approach requires you to have an ever-expanding Keyboard Maestro macro. Each Search/Replace that you wish to use requires yet another line in the macro, which will soon become unwieldy as it continually grows.

An alternative is to maintain a text document. Each line in that document specifies a Search/Replace command. The macro goes through this document, line by line, and applies the commands to the text that you are trying to abbreviate. You can "test" each pattern before you add it. Once you are confident that it is appropriate, you can add it to the document as a new line. And it makes sense to use Regex.

This Search/Replace document can be formatted as

Search Text TAB Replace Text
Search Text TAB Replace Text
Search Text TAB Replace Text
etc.

The Keyboard Maestro macro goes through this text, line by line, and extracts the Search Text and extracts the Replace Text and applies them to the document that you are trying to abbreviate.

The Search Text should be a Regex (regular expression) pattern, which will make your goals easier. If you are simply trying to replace the sequence of characters dog with the sequence of characters cat you can do that, but more commonly you will want to replace the word dog with the word cat, and a more complicated Regex pattern will be appropriate. With something simple like associated with TAB a/w, Regex works just like a simple search and replace.

SAMPLE SEARCH/REPLACE DOCUMENT
\band\b &
\bbilateral\b B/L
associated with a/w
\bthis\b
\bis\b

Using a text editor like BBEdit allows you to "see" invisible characters like TAB (the red triangle) and SPACE (the red dot). Otherwise it is hard to know exactly what your document contains. The last line contains only spaces and a TAB and is not visible when looking at the plain text, but it is there. (See image below)

[Image: screenshot of the Search/Replace document in BBEdit with invisible characters shown]

Things to note.

  1. The Regex command to search for the word "and" is \band\b. \b is the Regex command to find the beginning or end of a word. I could not get this to work in my script without escaping the \. I am not sure why this is. (Anyone?) However, if you escape the \, it will work. To escape it, you have to double the \. That is why instead of
    \band\b
    the document contains
    \\band\\b

  2. The last line replaces double spaces with single spaces. Double spaces start accumulating when you replace some word like "is" with nothing. The spaces that were originally on both sides of "is" are now joined as a double space. This last line cleans this up. It is possible that you might have to do this more than once to get all the double spaces. I did not test this.
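On the "more than once" question in note 2: replacing exactly two spaces with one turns a run of four spaces into two, so a second pass would indeed be needed. A quantifier avoids that. A minimal Python sketch (the function name is made up):

```python
import re

def collapse_spaces(text: str) -> str:
    # " {2,}" matches a run of two or more spaces, so a single pass
    # reduces a run of any length to one space.
    return re.sub(r" {2,}", " ", text)

print(collapse_spaces("this    is  a   test"))  # this is a test
```

The same pattern, ` {2,}`, should also be usable as the Search Text of the last line in the Search/Replace document.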

HOW DOES THE MACRO WORK?

  1. The first line just assigns the contents of the Search/Replace document to the variable MySrchRplc. If you update your Search/Replace document, you would update this first line. You cannot see the TAB characters, but they are there.
Set Variable “MySrchRplc” to Text
\\band\\b	&
\\bbilateral\\b	B/L
associated with	a/w
\\bthis\\b	
\\bis\\b	
  2. Before running the macro, you put the document that you wish to shorten onto the Clipboard. The second line takes that document you are trying to abbreviate and assigns it to the variable TextToShorten.
Set Variable “TextToShorten” to Text
%SystemClipboard%
  3. The third line initiates a for loop going through each line of the Search/Replace document.
For Each Item in the Collection Execute Actions
The lines in Variable “MySrchRplc”
Execute the Following Actions:
  4. & 5. The fourth and fifth lines use some Regex magic to extract the appropriate text into the SearchFor and ReplaceWith variables. For the first line in the search/replace document, SearchFor is assigned \band\b and ReplaceWith is assigned &.
Search Variable “SingleLine” Using Regular Expression (ignoring case)
Search for “(^[^\t]*)”
And capture to:
Ignored
SearchFor
Stop macro and notify on failure.
Search Variable “SingleLine” Using Regular Expression (ignoring case)
Search for “([^\t]*$)”
And capture to:
Ignored
ReplaceWith
Stop macro and notify on failure.
  6. The sixth line does the actual work of going through the TextToShorten document and making the appropriate substitutions.
Search and Replace Variable “TextToShorten” Using Regular Expression (ignoring case)
Search for “%Variable%SearchFor%”
Replace with “%Variable%ReplaceWith%”
Notify on failure.
  7. The seventh line displays the now abbreviated document.
Display Text in Window
%Variable%TextToShorten%
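For comparison, the same steps can be sketched outside Keyboard Maestro in Python. This is an illustration under assumptions, not a translation of the attached macro: the names RULES and apply_rules are made up, the rule list mirrors the sample document, and the space-collapsing rule uses ` {2,}` rather than a literal double space. In Python the \b needs no doubling when it is written in a raw string or read from a file.

```python
import re

# One rule per line: search pattern, TAB, replacement (may be empty).
RULES = "\n".join([
    r"\band\b" + "\t" + "&",
    r"\bbilateral\b" + "\t" + "B/L",
    "associated with" + "\t" + "a/w",
    r"\bthis\b" + "\t",
    r"\bis\b" + "\t",
    " {2,}" + "\t" + " ",  # collapse the spaces left behind by deletions
])

def apply_rules(rules_text: str, text: str) -> str:
    for line in rules_text.splitlines():
        if "\t" not in line:
            continue  # skip blank or malformed lines
        # Everything before the first TAB is the pattern, the rest the replacement.
        search_for, replace_with = line.split("\t", 1)
        # The lambda inserts the replacement literally, so characters such as
        # "\" in it are not reinterpreted by re.sub.
        text = re.sub(search_for, lambda m: replace_with, text, flags=re.IGNORECASE)
    return text.strip()

print(apply_rules(RULES, "This is associated with bilateral pain and swelling."))
# prints: a/w B/L pain & swelling.
```

Because the rules are applied in order, the space-collapsing rule has to stay last, just as in the Search/Replace document.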

StripUselessV2.kmmacros (4.4 KB)

Out of curiosity, why would a really huge document be troublesome? I guess I do not know what "really huge" is :grinning:

But my experience with BBEdit and Search/Replace is that it is very fast and trouble-free with the documents that I have been dealing with, which are on the order of 1,000 pages.

I agree with this approach. My only suggestion would be to use the vertical bar (|) instead of TAB as a delimiter, since it is visible.
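One caveat with | as the delimiter: it is also the Regex alternation operator, so a pattern like \b(is|are)\b would be cut in the wrong place if the line were split on the first |. A hedged Python sketch of one way around that (parse_rule is a made-up helper) is to split on the last | instead, which works as long as the replacement text itself never contains a |:

```python
from typing import Tuple

def parse_rule(line: str, delimiter: str = "|") -> Tuple[str, str]:
    # rpartition splits on the LAST delimiter, so any "|" used for
    # alternation inside the search pattern is left alone.
    search_for, sep, replace_with = line.rpartition(delimiter)
    if not sep:
        raise ValueError(f"no {delimiter!r} delimiter in rule: {line!r}")
    return search_for, replace_with

print(parse_rule(r"\band\b|&"))       # ('\\band\\b', '&')
print(parse_rule(r"\b(is|are)\b|"))   # ('\\b(is|are)\\b', '')
```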

Hey Robert,

It depends upon how much memory you have on your Mac (and how fast the hardware is).

On my old 17" Mid-2010 i7 MacBook Pro with only 8 GB of memory my Perl script takes ± 2 seconds to run on an 80 MB 500,000 line document in BBEdit.

That's good enough for most things, but if you're working with a file that's hundreds of GBs you'll want to do something different.

-Chris
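Chris does not say what he would do differently for a huge file. One common option, sketched here in Python as an assumption rather than as his method, is to process the file one line at a time so it never has to fit in memory. This only works for patterns that do not span line breaks, and the file names in the usage comment are placeholders:

```python
import re
from typing import Iterable, Iterator, List, Tuple

def shorten_lines(lines: Iterable[str], rules: List[Tuple[str, str]]) -> Iterator[str]:
    # Compile each pattern once, then reuse it for every line.
    compiled = [(re.compile(pattern, re.IGNORECASE), repl) for pattern, repl in rules]
    for line in lines:
        for pattern, repl in compiled:
            # r=repl binds the current replacement; it is inserted literally.
            line = pattern.sub(lambda m, r=repl: r, line)
        yield line

# Usage with files (hypothetical names):
# with open("big.txt", encoding="utf-8") as src, open("out.txt", "w", encoding="utf-8") as dst:
#     dst.writelines(shorten_lines(src, [(r"\band\b", "&")]))
```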

I agree that this may not be a KM job.

I manipulate quite a lot of text and for jobs like this I find TextSoap to be a good solution. It's probably over-capable for most of my needs and perhaps yours, but the Batch Find and Replace Text will allow you to put your Search and Replace instances into two columns.