Writing macro to filter invalid characters out of automatically generated filenames: which characters are invalid?

Mac OS Catalina
I am writing a macro to regex filter invalid characters out of automatically generated filenames (journal references). Surprisingly I am having a difficult time figuring out which characters are invalid. When I google what appears to be a simple question, answers vary widely from a long list to only one (backslash).
thanks in advance for your time and help

The only "invalid characters" in Catalina are ":" in the Finder and "/" in Unix -- each is the "directory separator" in their realm (though your Mac does a remarkable job translating between the two for you). "." can be problematic when the first character of a file name -- it makes the file invisible. Otherwise you should be able to use anything in UTF-16.

BUT you should think ahead. Not all applications are as accepting of "strange" characters. If you use things like "<", ">", "$", "!" in your file names you'll make things more difficult should you ever want to process them via the command line. If you want to pass a file on to someone else, what does their system accept? Or perhaps you try to move data onto a NAS, only to find it's limited to Windows file name conventions...

The further you stray from A-Za-z0-9-_ (and no, there's no space character in that list) the more chance there is of something going wrong. But most of us break that guideline many times daily and never even care :slight_smile:

For reasonably expressive naming along with a reasonable level of safety, at work we recommend sticking to Windows/NTFS conventions and avoiding ' " * / : < > ? \ | -- adding the single quote in there to try and avoid "straight quote/curly quote/backtick" confusion...

Sidenote: There are other ways to reference articles -- have you considered eg saving them under their DOIs and maintaining a database of titles->DOIs? More convoluted, sure, but also more future-proof.


Thanks very much for your reply.
Yes, I use DOI but sometimes it gets a bit tedious.
I was confused because I did some testing and, for example, ":" in a filename is automatically converted to "/", which suggested to me that it is invalid.
If I want to take your list and create a macro to filter out all those characters ( ' " * / : < > ? \ | ) in the clipboard, do you know if it can be done in one shot with a regex, or should I just do a series of search and replace actions, one for each character?
thanks

One shot -- you create a character class between [ and ], meaning "match any character in this list". The only gotcha is that \ has special meaning in a regex character class so you have to "escape" it with a preceding \ (it also has special meaning in a KM text box, which is why we also escape it in the first action to get a literal \):

Search and Replace Specials.kmmacros (2.8 KB)
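
For anyone reading without Keyboard Maestro to hand, the same character-class idea can be sketched in Python. This is not the contents of the attached macro: the function name `strip_specials` and the choice to delete the characters (rather than swap in a placeholder) are assumptions for illustration.

```python
import re

# Character class matching any ONE of: ' " * / : < > ? \ |
# Inside [...] the backslash must itself be escaped, hence \\ in the raw string.
SPECIALS = re.compile(r"""['"*/:<>?\\|]""")

def strip_specials(name: str, replacement: str = "") -> str:
    """Remove (or replace) every avoid-list character in a single pass."""
    return SPECIALS.sub(replacement, name)

print(strip_specials('What is "truth"? A/B test: part 1'))
```

Passing a `replacement` such as `"_"` keeps word boundaries visible instead of closing up the gaps, which may suit some naming schemes better.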


great. thanks very much !
Is there a way to handle the fact that period (".") should be considered invalid except if it precedes the file extension as in test.pdf ?

Probably -- except why are you arbitrarily deciding that it's invalid? Any "decent" software will only consider the final period of a file name to be the extension separator -- eg, using KM's "Base Name" filter on I am a file.name.txt correctly returns I am a file.name -- so I don't know what problem you are trying to solve by setting your own rules, and can't guess at the possible consequences...
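
The "only the final period counts" behaviour is not specific to KM. As a small illustration outside Keyboard Maestro, Python's standard `os.path.splitext` splits at the last period in the same way:

```python
import os.path

# splitext splits at the LAST period only, so earlier periods stay in the base name.
base, ext = os.path.splitext("I am a file.name.txt")
print(base)  # I am a file.name
print(ext)   # .txt
```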

I did some tests and you are (obviously) right, so I will drop . from the list.
You are brilliant ! thanks again very much

Not so much, it turns out :wink:

The one place you might want to remove a period is when it's the first character of the file name, to stop the file being "hidden" by macOS. I was going to say I'd be surprised if a paper's title started with a period, but... academics do the weirdest things :wink:

So you might want to add another action, a simple "delete one or more periods at the start of the string":
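
The action itself is a Keyboard Maestro search and replace. As a rough equivalent, the pattern `^\.+` (the one picked apart in the next reply) behaves like this in Python; the function name `strip_leading_periods` is made up for the example.

```python
import re

def strip_leading_periods(name: str) -> str:
    # ^  anchors the match at the start of the string
    # \. is a literal period (unescaped, . would match any character)
    # +  means "one or more of the preceding item"
    return re.sub(r"^\.+", "", name)

print(strip_leading_periods("...filename.txt"))
```

Periods elsewhere in the name are left alone, because the `^` anchor only lets the match begin at the first character.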

thanks very much.
When I read the regex:
^ start of line
\ escapes the .
. the period itself
what does the + mean?

"One or more of the preceding character" -- so this will take .filename.txt, ..filename.txt, ...filename.txt, and even .........................filename.txt all to filename.txt (like I said, academics do the weirdest things...).

As always, when doing file name alterations like this you should be aware of the potential for namespace clashes -- if you did have two papers titled "Title of $Paper" and ".Title of Paper" and based file names on the output of your regexes, the second could overwrite the first if it was saved/moved to the same directory (one of the reasons for using [supposedly unique] DOIs).
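
One way to catch such clashes before anything is overwritten is to group titles by their cleaned-up name and flag any group with more than one member. The sketch below assumes the two filters discussed in this thread are applied in sequence; `sanitize`, `find_clashes`, and the sample titles are all hypothetical.

```python
import re
from collections import defaultdict

def sanitize(title: str) -> str:
    """Hypothetical combined filter: drop the avoid-list characters, then leading periods."""
    cleaned = re.sub(r"""['"*/:<>?\\|]""", "", title)
    return re.sub(r"^\.+", "", cleaned)

def find_clashes(titles):
    """Map each sanitized name to the original titles producing it; keep only clashes."""
    groups = defaultdict(list)
    for title in titles:
        groups[sanitize(title)].append(title)
    return {name: originals for name, originals in groups.items() if len(originals) > 1}

clashes = find_clashes(["Title of Paper?", ".Title of Paper", "Another Title"])
print(clashes)
```

What to do with a clash (append a counter, fall back to the DOI, ask the user) is a separate decision; this only reports which titles would collide.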


OK thanks again very much
I am reviewing special regex characters (thanks to you)