RegEx to find specific word

iamdannywyatt · November 20, 2023, 1:03pm

Warning: I'm still learning RegEx, so this might be basic for some of you. In this particular case I just copied this from a website.

After some research I found a RegEx that matches a specific word while excluding longer words that contain it. For example, I want to match rocket but not rocketed:

(?:^|\W)rocket(?:$|\W)

Now when I test it at https://regex101.com, the match includes the 2 spaces before and after the word.

Also, if my string is "rocket." (with the trailing period), it also matches, period included. I want the match to be exactly the word I type and nothing else, no matter what comes before or after.

I also noticed that using roCket for example, will not validate using that expression.
I was able to change that by using:
(?i)(?:^|\W)rocket(?:$|\W)(?-i)
Is this a good approach?

RogerB · November 20, 2023, 1:18pm

Try using

\brocket\b

The "\b" means match a word boundary.
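A minimal sketch of the difference between the two patterns, assuming Python's `re` module (the thread itself uses regex101 and KM, whose engine may differ in details). The sample sentence is made up for illustration. The original pattern consumes the character on each side of the word, while `\b` only asserts a position and consumes nothing:

```python
import re

# Hypothetical test sentence, not from the thread.
text = "The rocket. It rocketed past a big rocket today"

# Original pattern: \W consumes one non-word character on each side.
wide = [m.group(0) for m in re.finditer(r"(?:^|\W)rocket(?:$|\W)", text)]

# \b is zero-width: it checks for a word boundary without consuming text.
tight = [m.group(0) for m in re.finditer(r"\brocket\b", text)]

print(wide)   # [' rocket.', ' rocket ']
print(tight)  # ['rocket', 'rocket']
```

Neither pattern matches "rocketed", but only the `\b` version returns the bare word.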

iamdannywyatt · November 20, 2023, 1:23pm

That's great and simple! Thanks!
So if I want to match anything (case insensitive), would this be the way to do it?
(?i)\brocket\b(?-i)

I tested it and it seems to work on both the regex101 website and KM.

Also, for readability purposes, I think this would be better
(?i)\b(rocket)\b(?-i)

It also seems to work. Do you see it creating any issues in other scenarios?

RogerB · November 20, 2023, 2:11pm

I'll start with a disclaimer that I am by no means a regex expert, so if one of them happens along, listen to them, not me.

I don't think you need the final "(?-i)" as there is nothing after it, the opening "(?i)" is sufficient to make the subsequent pattern match case insensitive.
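As a quick check of that point, here is a small sketch assuming Python's `re` module, with made-up sample strings. A single leading `(?i)` is enough to make the whole pattern case insensitive:

```python
import re

# The leading (?i) applies to everything that follows it.
pattern = re.compile(r"(?i)\brocket\b")

# Hypothetical sample strings.
samples = ["rocket", "roCket", "ROCKET.", "rocketed", "skyrocket"]
matched = [s for s in samples if pattern.search(s)]

print(matched)  # ['rocket', 'roCket', 'ROCKET.']
```

"rocketed" and "skyrocket" are rejected because there is no word boundary directly after or before "rocket" in those strings.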

Adding the parentheses around "rocket" creates a capturing group. You're not doing anything with the captured data but that doesn't matter, and if you find it makes it more readable in this case then I don't see it's doing any harm. However, it's probably best not to get into the habit of adding parentheses to every regex for readability as it might cause confusion if you are trying to use capturing groups in the future.
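To illustrate what the parentheses change, here is a sketch assuming Python's `re` module and a made-up test string. The overall match is the same in all three cases, but plain parentheses add a numbered group, while the non-capturing form `(?:...)` groups without capturing:

```python
import re

text = "Launch the roCket now"  # hypothetical test string

plain = re.search(r"(?i)\brocket\b", text)
grouped = re.search(r"(?i)\b(rocket)\b", text)            # capturing group
non_capturing = re.search(r"(?i)\b(?:rocket)\b", text)    # grouping only

print(plain.group(0), plain.groups())      # roCket ()
print(grouped.group(0), grouped.groups())  # roCket ('roCket',)
print(non_capturing.groups())              # ()
```

If a pattern later gains real capture groups, a readability-only group like `(rocket)` shifts their numbers, which is the kind of confusion described above.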

Whether it will create issues in other scenarios is impossible to answer, as the scenarios the regex pattern could be used against are almost infinite. In my experience (which largely consists of blundering my way to a solution through trial and error) you can't assume that a regex that works perfectly to match some text in document A will work at all in document B, since, for example, an extra space or line break in document B can mean it doesn't match what you expect it to. It's best if you know the text that you're using the regex against, so that you can test it thoroughly and be confident it will work as you expect. Having said all of that, in the case of putting parentheses around "rocket" here, I can't see it causing any problems, but I can't promise it won't.

iamdannywyatt · November 20, 2023, 3:48pm

Well, you are way above my level of expertise anyway, so I'll take your advice, especially when it works as expected.

I tried it without and it works, as you said.
I'm not sure I understand what you mean by "there is nothing after it"?
Can you share a real example where (?-i) would be used?

But in that case, wouldn't that be a scenario where the parentheses for readability could represent an empty (or useless) capture group? I would be able to see, according to the context (such as a macro), that that particular case is for readability. Or is there a real downside to that? Would something actually stop working?

Yeah, my question was more about something obvious, like an issue you could think of that happens 90% of the time.
Sure, each scenario is different, and when the time comes that something doesn't seem to work, I will most likely have to check why it isn't working and will learn an exception to the rule.

Appreciate your contribution to this! I'm taking notes here so I can learn more as I go.

RogerB · November 20, 2023, 4:39pm

I simply meant that it's at the end of the regex pattern, nothing else follows it. What your regex was saying is:

Pattern Element	Meaning
(?i)	switch to case insensitive mode for the pattern that follows
\b	match a word boundary
rocket	match the literal string "rocket"
\b	match a word boundary
(?-i)	switch to case sensitive mode for the pattern that follows

Since you have the (?-i) at the very end of the pattern, with nothing after it, it doesn't do anything, so it might as well not be there.

As for a "real" example, you would use it if you needed part of your pattern to match with case insensitivity and part without; it allows you to switch between the two modes. To see it working, copy and paste this pattern into regex101 and then experiment with test strings to see what matches it:

(?i)case (?-i)insENsitiVE

Start with a test string of "case insENsitiVE", which matches the pattern, and notice that you can change the case of any of the letters in the word "case" and it continues to match, but if you change the case of any letter in "insENsitiVE" it no longer matches.
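The same experiment can be sketched in Python, with one caveat: Python's `re` module does not accept a bare `(?-i)` in the middle of a pattern, so this version swaps in the scoped form `(?i:...)`, which limits case insensitivity to the group. That substitution is an assumption made for the sketch, not what regex101 or KM require.

```python
import re

# Only the text inside (?i:...) is matched case-insensitively;
# "insENsitiVE" must match exactly as written.
pattern = re.compile(r"(?i:case )insENsitiVE")

tests = [
    "case insENsitiVE",
    "CASE insENsitiVE",
    "CaSe insENsitiVE",
    "case insensitive",
    "case INSENSITIVE",
]
results = {t: bool(pattern.fullmatch(t)) for t in tests}

for t, ok in results.items():
    print(f"{t!r}: {ok}")
```

The first three strings match and the last two do not, mirroring the behaviour described for `(?i)case (?-i)insENsitiVE`.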

I don't think it will stop working as such. I'm just pointing out that parentheses have a specific meaning to the regex engine (forming a group), and if you are using them for readability, just be aware that you might confuse yourself at some point in the future if you're trying to create a complex pattern with capture groups. As long as you're aware of that you can watch out for it.

iamdannywyatt · November 20, 2023, 5:37pm

Thanks for clarifying.

You see, this is one of the "issues" when you know a bit of "everything": things can start becoming confusing when you mix things up.
Even though I'm not an expert when it comes to HTML, I'm pretty comfortable with it, and so I looked at (?i) as an opening "tag", the way I would use one in HTML, and then at (?-i) as the matching closing tag. So to me, the way I was reading the RegEx was like "everything inside the opening and closing tag, make it case insensitive".

Now I understand it. (?i) makes the pattern that follows it case insensitive. If I wanted a later part of the pattern to be case sensitive, I would put (?-i) before that part.
In this case the -i means it's negating the insensitivity.

It is clear now. Again, thanks for taking the time to clarify this for me. A new thing to add to my notes.

Yes, that's what I mean. As long as I'm aware of what that is, since I don't use RegEx to share it with other people who could misinterpret it, it's ok.
