Multibyte Emojis and Substring “Characters”

Using KM 9.0.6

The "Get Substring" action is not multibyte emoji aware. Some emoji (e.g. :white_check_mark:) match as a single character, many others match as 2 characters, and some of the newer ones with combining forms (commonly used for skin tone variants like :ok_hand:t2::ok_hand:t4::ok_hand:t6:) require 4 characters to match.

For example, with the single-emoji string value ":ok_hand:t4:", asking for a substring of the first character returns nothing, asking for the first 2 characters returns ":ok_hand:" (the generic form of the OK hand gesture), and you have to ask for the first 4 characters to get back ":ok_hand:t4:".
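The 1 / 2 / 4 pattern is what you would see if "characters" were being counted as UTF-16 code units. That is only a guess about how KM stores strings, not something confirmed here. A minimal Python sketch of that assumption, where `utf16_units` and `first_units` are hypothetical helper names and ":ok_hand:t4:" is taken to be U+1F44C followed by the U+1F3FD skin tone modifier:

```python
# Sketch only: simulates substring-by-UTF-16-code-unit, which is an
# assumption about the behavior described above, not KM's actual code.

CHECK = "\u2705"                       # white check mark, inside the BMP
OK_HAND = "\U0001F44C"                 # generic OK hand, outside the BMP
OK_HAND_T4 = OK_HAND + "\U0001F3FD"    # OK hand + medium skin tone modifier


def utf16_units(s: str) -> int:
    """Number of 16-bit code units needed to store s in UTF-16."""
    return len(s.encode("utf-16-le")) // 2


def first_units(s: str, n: int) -> str:
    """First n UTF-16 code units of s, decoded back to text.

    A cut that lands in the middle of a surrogate pair cannot be decoded
    cleanly, so errors="replace" is used instead of raising.
    """
    raw = s.encode("utf-16-le")[: 2 * n]
    return raw.decode("utf-16-le", errors="replace")


for name, text in [("check", CHECK), ("ok", OK_HAND), ("ok+t4", OK_HAND_T4)]:
    print(name, "code points:", len(text), "UTF-16 units:", utf16_units(text))

# 1 unit is half of a surrogate pair, 2 units is the base emoji,
# 4 units is the base emoji plus the skin tone modifier.
print(first_units(OK_HAND_T4, 2) == OK_HAND)
print(first_units(OK_HAND_T4, 4) == OK_HAND_T4)
```

Under this assumption, a one-unit substring is half of a surrogate pair and has no valid text to show, which would fit the empty result for the first character.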

I realize this may well be a very difficult bug to fix (Unicode...), but it definitely feels like a bug. No matter how complex the internal string encoding for Unicode is, ":ok_hand:t4:" conceptually is one character.

Regular expression matching is better but still falls down on the complex combining-form emoji. Ideally I would expect . to match any emoji as a single character. It does match most of them, but it still fails on something like ":ok_hand:t4:", matching only the first 2 of its 4 "characters" and returning ":ok_hand:" if you save a substring match.
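The regex result looks like what you get when . matches one Unicode code point rather than one user-perceived character. Python's `re` module works that way, so it can be used to illustrate the idea; whether KM's regex engine does the same internally is an assumption. The second pattern is a hypothetical workaround that lets an optional skin tone modifier (U+1F3FB to U+1F3FF) ride along with the preceding character:

```python
import re

OK_HAND = "\U0001F44C"                 # generic OK hand
OK_HAND_T4 = OK_HAND + "\U0001F3FD"    # OK hand + medium skin tone modifier

# "." consumes exactly one code point, so the modifier is left behind.
dot_match = re.match(".", OK_HAND_T4).group()
print(dot_match == OK_HAND)

# Hypothetical workaround: one code point, optionally followed by a
# skin tone modifier in the range U+1F3FB..U+1F3FF.
EMOJI_WITH_TONE = re.compile(".[\U0001F3FB-\U0001F3FF]?")
tone_match = EMOJI_WITH_TONE.match(OK_HAND_T4).group()
print(tone_match == OK_HAND_T4)
```

This only covers skin tone modifiers, not other multi-code-point emoji such as ZWJ sequences or flags. ICU regular expressions document `\X` as matching a whole grapheme cluster, which might be worth trying in KM if its engine supports it; that has not been tested here.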

I've attached an example macro below to illustrate this.

Emoji Substring Test.kmmacros (2.5 KB)
