Search within a string for certain values, but exclude others?

Hello - I have a list of a few hundred thousand job titles that I want to sort into general categories.
I have a process now where I filter the list on a word, like "physician", export that list and then search in Excel for "exclusion" words that are not appropriate for the job type I am categorizing.

For the example described above, first pass was IF includes "physican"
Second pass was to search for the following words and then tag with a YES if the words existed, and <blank> if they did not.

Final pass was to filter for <blank> and there was my list.

Excel formula below if you are interested in the key words that were used as not pertaining to physician.

=IF(COUNT(SEARCH({"svcs","rep","relation",",liason","recruit","HR","contact","mgr","train","revenue","asst","Manager","office","coord","practice","special","scribe","bill","coding","code"},A2)),"Yes","")

Job Title
Transplant Physician
PHYSICIAN OFFICE RECEPTION
MANAGER PHYSICIAN PRACTICE
NON-EMPLOYED PHYSICIAN
PHYSICIAN-Independent
LPN - PHYSICIAN OFFICE
MLP-Physician Asst
Physician Coding Spec II
NON-PHYSICIAN (HOUSE)
Assoc Physician-in-Chief
Physician's Asst.
RN Physician Prac Coord
Physician Communications Spec
PHYSICIAN EXTENDER
Resident Physician Pediatrics
PHYSICIAN RELATIONSHIP MANAGER
PHYSICIAN PRACTICE BILLING COORDINATOR
PHYSICIAN
PHYSICIAN INTERNAL MED
PHYSICIAN-CLINIC
PHYSICIAN/MEDICAL DIRECTOR-INFECTION CONTROL
PHYSICIAN PRAC MGR
PHYSICIAN ASST CARDIAC MD OFFICE
Physician
PHYSICIAN ADVISOR CA
Physician Relations Coord
PHYSICIAN OFFICE - OFFICE
NON PHYSICIAN PRACTI
Physician FP and General Surgery and ER
CHIEF PHYSICIAN ASST
Physician Office Coordinator
SUPERVISOR PHYSICIAN PRACTICE

Hi, Jacques —

Your post is missing a few words. My guess is that you enclosed them in angle brackets (“<” and “>”), which the Discourse forum software suppresses under the supposition that they are HTML code.

Hi. Welcome to the Keyboard Maestro Forum.

Could you please clearly state the question/problem/issue you are having?
I've reread your post several times, and it is not clear what you want from us.

If you want help with processing data in your workflow, then we can provide you with better and faster help if you will post the following:

  1. Clearly state the objective of your workflow.
  2. Your manual workflow steps to achieve your objective.
  3. Real-world examples of your source, input data.
  4. Real-world examples of the results of your workflow.
  5. What tools (apps) you will be using.

If you have files of the example data, you will need to zip them first in order to upload to the forum.

Erik -
You are right! I wrote the word “blank” with the “<” brackets on both sides - I will clarify. Thank you for contacting me.
Jacques

1 Like

Hey Jacques,

I've fixed this for you.

It is possible to get Markdown to render the brackets by using HTML entitites.

&lt;blank&gt;

<blank>

Or you can use back-ticks to display as in-line code (plain-text).

`<blank>`

<blank>

-Chris

[quote=“JMichaelTX, post:3, topic:6362”]
Hello - I have a list of a few hundred thousand job titles in an Excel/CSV file that I want to sort into categories. I have about 40 categories. I have a manual process now where I filter the entire list in Excel, using a filter that contains a word that is a primary category, like “physician”. Then, I export that list and then create a formula in Excel that looks for “exclusion” words. For example, the job title Physician Assistant should not be in my final list of physicians, even though it contains the word physician. So I use the word “Assistant” as one of the exclusion criteria for category “Physician”.

For the example described above, first pass was IF any items in the list included “physican”. Second pass was to search for the words in the formula below and then tag with a YES if the words existed, and a blank ("") if they did not. Final pass was to filter for blank ("") and there was my list.

The goal for this filter is to find text strings that include “Physician” but NOT “svcs”,“rep”,“relation”,",liason",“recruit”,“HR”,“contact”,“mgr”,“train”,“revenue”,“asst”,“Manager”,“office”,“coord”,“practice”,“special”,“scribe”,“bill”,“coding”,“code”.

Attached is a file of job titles.

Physicians List.xlsx.zip (41.6 KB)

Hey Folks,

I'm talking to Jacques about this to find out more about his workflow, but in the meantime...

He wants to find lines containing “physicians” and then exclude lines containing words in an exclusion-list.

The task turns out to be really easy to do with Perl (rusty as I am). I chose Perl for the job, because he said he's dealing with a few hundred thousand lines of data.

The script won't work out-of-the-box, unless you use it as a BBEdit filter – I need more info to tailor it specifically to Jacques' task. This is just a proof-of-concept.

I've extracted the text from his Excel worksheet and posted the text file for testing.

Jacques Datafile.txt.zip (7.9 KB)

-Chris

#!/usr/bin/env perl -sw
# Auth: Christopher Stone
# dCre: 2017/02/20 22:30
# dMod: 2017/02/20 00:23
# Task: Filter text file for items the word “physician” and not containing the words in the exclusion list.

my @exclusionList = qw(
asst
bill
code
coding
contact
coord
HR
liason
Manager
mgr
office
practice
recruit
relation
rep
revenue
scribe
special
svcs
train
);

my $exclusions = join("|", @exclusionList);
$myRegEx = "\\b($exclusions)\\b";

while (<>) {
   if (m!physician!i) {
      if (! m!$myRegEx!i) {
         print;
      }
   }
}

I do have BBedit, but let me add some more details, that may be helpful.

  • I can extract the info to text file if necessary, but it comes to me as a csv file.
  • It seems like the code that Chris has offered would let me filter the data for one category at a time, which would be useful, but (greedy me) I would like to be able to process multiple rule-sets on each line of text, resulting in tagging that text line or data with an identifier that I can then pivot to see the number of items in each category.

Below is an example of the process that I am thinking of - hopefully it makes my need more clear.
Thank you for looking at this.