Unicode Changed When Importing Characters Through os.environ to Python Code

martin · January 3, 2021, 1:52am

Hello,

I want to convert the Hebrew letter שׁ into Unicode.

If I type the Hebrew letter into python code:

#!/usr/bin/python3
str = 'שׁ'
print(str.encode('ascii', 'backslashreplace'))

I get the desired result: b'\\ufb2a'.

However, if I set a variable to this letter in KM, import its value into a Python variable, and then do the conversion, it becomes: b'\\u05e9\\u05c1'.

#!/usr/bin/python3
import os
str = os.environ['KMVAR_Local__Var']
print(str.encode('ascii', 'backslashreplace'))

I can see what is going on: ש and the dot are separated in this process, but I don't know why.

Is there another way (using Python or any other language) in KM to get the desired result other than having to type the letter myself into the Python code? (My desired workflow is to copy a word from elsewhere and then let KM process my clipboard, converting the word in the clipboard into Unicode. But I don't want b'\\ufb2a' to be changed to b'\\u05e9\\u05c1'.)

Thanks!

martin · January 3, 2021, 2:41am

It is interesting to note that in the following quote

I want to convert the Hebrew letter שׁ into Unicode.
If I type the Hebrew letter into python code:

#!/usr/bin/python3
str = 'שׁ'
print(str.encode('ascii', 'backslashreplace'))

the first שׁ is \ufb2a, and the second שׁ is \u05e9\u05c1.
The latter was directly copied from my Python code. On my computer, the result is \ufb2a, not \u05e9\u05c1. But after pasting the code to this forum, it becomes \u05e9\u05c1.

ComplexPoint · January 3, 2021, 2:02pm

Am I failing to reproduce that here ?

For example, from:

I am getting:

martin · January 3, 2021, 2:29pm

Hello @ComplexPoint,

You need to add encode('ascii', 'backslashreplace'):

print(str.encode('ascii', 'backslashreplace'))

ComplexPoint · January 3, 2021, 2:33pm

Why ? Tell me more ?

The ASCII code doesn't, of course, include code points for Unicode characters ...

Could you summarise, for an old man, the difficulty that you are bumping into ?

martin · January 3, 2021, 2:46pm

Hi @ComplexPoint,

I'm sorry. I can't help. I got it from somewhere else online.
There is a description of the encode method here:

I've tried other error codes, but none gets me the result I wanted like 'backslashreplace' does.

Could you summarise, for an old man, the difficulty that you are bumping into ?

As I described in the OP, if I type שׁ into the Python code, I get b'\\ufb2a', but if I type it into the KM variable field and then import its value into a Python variable, I get b'\\u05e9\\u05c1'. In other words, its Unicode value has changed. I hope there is a way to prevent that.

KevinCoates · January 3, 2021, 3:30pm

I don't know if this will work, but perhaps try:

Click on the cogwheel icon at the top right of the "Set Variable to Text" action

Select "Process Nothing"

And then try the macro?

martin · January 3, 2021, 3:40pm

Hello @KevinCoates, thanks for the suggestion. I have tried it. No, it did not work. I still get the same result.
Somehow, either Keyboard Maestro decomposed שׁ into two parts: ש + the dot, or the decomposition happened during the import via os.environ.

martin · January 3, 2021, 4:03pm

I'm making progress.
This time, I avoid using a KM variable. Instead, I copy שׁ to the System Clipboard and then make Python read directly from the System Clipboard. It works as desired.

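#!/usr/bin/python3
import subprocess

def getClipboardData():
    # Run pbpaste and return the clipboard contents as raw bytes.
    # subprocess.run reads stdout while it waits, so a large clipboard
    # cannot fill the pipe and block the way wait() before read() can.
    result = subprocess.run(['pbpaste'], stdout=subprocess.PIPE, check=True)
    return result.stdout

S1 = getClipboardData()
S2 = S1.decode('utf-8')  # assumes pbpaste emits UTF-8
print(S2.encode('ascii', 'backslashreplace'))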

Output:

Does it suggest that Keyboard Maestro engine decomposed שׁ into ש + the dot?

martin · January 7, 2021, 6:59pm

I found out Safari and Chrome/Firefox handle this differently.

In Safari, שׁ will be decomposed to "\u05e9\u05c1", whereas in Chrome/Firefox, it remains "\ufb2a". This may shed some light on the issue. What Keyboard Maestro does may be due to what Apple has designed.

@peternlewis Any ideas?

martin · January 7, 2021, 7:02pm

I typed the previous reply in Chrome. It remains "\ufb2a".

Now, I'm typing the שׁ in Safari. I expect it to be decomposed to "\u05e9\u05c1".

Edit: Interestingly, no, it did not. But it did happen to Canvas by instructure.com. Now I'm more puzzled.

martin · January 7, 2021, 7:46pm

I can confirm that Safari does the twist in https://stackoverflow.com/ by changing the Unicode of שׁ from "\ufb2a" to "\u05e9\u05c1" , but Chrome does not.

peternlewis · January 8, 2021, 6:02am

From a practical point of view the two forms represent the same character presumably, in precomposed and decomposed forms.

Keyboard Maestro does not precompose or decompose strings anywhere in its code (with one exception for string equality purposes when evaluating menu titles).

So basically, whatever you get is the result of what is happening in other parts of the system or in whatever APIs Keyboard Maestro is using.

str.encode('ascii', 'backslashreplace')

means encode the string into ASCII, replacing any character that can't be encoded with a backslash escape such as \uXXXX, so it is fine for inspecting Unicode characters and will reveal the different forms.
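As a rough illustration of that call, here is a minimal sketch that prints the escaped form of each string together with the official name of every code point in it. The helper name show is made up for this example; only the standard-library unicodedata module is assumed.

```python
import unicodedata

def show(text):
    """Return the backslash-escaped form of text and the name of each code point."""
    escaped = text.encode("ascii", "backslashreplace").decode("ascii")
    # Fall back to a placeholder for code points that have no name.
    names = [unicodedata.name(ch, "<unnamed>") for ch in text]
    return escaped, names

# The two forms discussed in this thread, written as escapes so that
# nothing along the way can silently change them.
for sample in ("\ufb2a", "\u05e9\u05c1"):
    escaped, names = show(sample)
    print(escaped, names)
```

Writing the test strings as escapes rather than pasting the letter avoids the problem seen earlier in the thread, where pasting changed the code points.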

Keyboard Maestro does not have any options for decomposing or precomposing strings.

I didn't see any simple way to do it in Python, but it looks like it is possible.
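For what it is worth, a minimal sketch with Python's standard-library unicodedata module might look like the following. Decomposing is straightforward with NFD. Going the other way is less obvious: as far as I can tell, U+FB2A is on Unicode's composition exclusion list, so NFC also yields the two-code-point form rather than putting it back together. The helper recompose_shin_dot is a hypothetical workaround for this one letter, not a general solution.

```python
import unicodedata

PRECOMPOSED = "\ufb2a"       # HEBREW LETTER SHIN WITH SHIN DOT
DECOMPOSED = "\u05e9\u05c1"  # HEBREW LETTER SHIN + HEBREW POINT SHIN DOT

def decompose(text):
    """Canonical decomposition (NFD): splits U+FB2A into U+05E9 U+05C1."""
    return unicodedata.normalize("NFD", text)

def recompose_shin_dot(text):
    """Hypothetical helper: map the two-code-point form back to U+FB2A.

    Only handles a shin dot that immediately follows the shin; a mark
    sitting between them (for example a dagesh) is left untouched.
    """
    return text.replace(DECOMPOSED, PRECOMPOSED)

print(decompose(PRECOMPOSED) == DECOMPOSED)                      # NFD splits the letter
print(unicodedata.normalize("NFC", PRECOMPOSED) == DECOMPOSED)   # NFC does not rebuild it
print(recompose_shin_dot(DECOMPOSED) == PRECOMPOSED)             # explicit replacement does
```

In practice it is probably simpler to pick one form, normalize everything to it, and compare in that form than to try to restore the precomposed letter.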

martin · January 8, 2021, 3:57pm

Thanks, @peternlewis.
I do grading on Canvas by instructure.com. It seems that students' answers are accepted as correct only when the Unicode code points of the Hebrew words match exactly. When a student's answer seems to match the answer key but is marked incorrect, I need to compare the code points to find out why. If the code points are changed along the way, I can't do an accurate comparison.

Making Python read directly from the System Clipboard has served my needs. It looks like Apple's problem. I will ask students to avoid Safari.
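The kind of comparison described above could be sketched roughly as follows. The function names code_points and compare_answers are made up for this example, and the sample strings are the two forms from this thread, not real student data. The idea is to report both whether two answers are identical code point for code point and whether they are equal after canonical normalization.

```python
import unicodedata

def code_points(text):
    """List the code points of text as U+XXXX, for side-by-side inspection."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)

def compare_answers(student, key):
    """Return (raw_equal, canonically_equal) for two strings."""
    raw_equal = student == key
    canonically_equal = (
        unicodedata.normalize("NFD", student)
        == unicodedata.normalize("NFD", key)
    )
    return raw_equal, canonically_equal

student = "\ufb2a"     # precomposed form
key = "\u05e9\u05c1"   # decomposed form
print(code_points(student), "|", code_points(key))
print(compare_answers(student, key))
```

A result of (False, True) would mean the answer looks the same and is canonically the same text, but was stored with different code points, which is the situation described in this thread.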

martin · March 5, 2021, 5:23pm

My issue was not clear to me when I ran into the problems described above. It has become clearer after some tests.

Copy from and Paste to:

Firefox:
Firefox to Firefox: NO normalization
Other apps to Firefox: NO normalization
Firefox to other apps: normalization

Even if we use AppleScript, JXA, or Python to directly read the SystemClipboard that contains the text copied from Firefox, the text is still normalized. Since copying and pasting from Firefox to Firefox does not involve normalization, Firefox probably does not normalize the text during the copy process. I have no idea when the normalization happens.

Safari (macOS, not iOS):
Safari to Safari: normalization
Other apps to Safari: normalization
Safari to other apps: NO normalization

For Safari (macOS), the normalization also happens at least on Canvas by instructure.com. In the fill-in-the-blank questions of Classic Quizzes, when students type Hebrew words and hit "submit", the input is normalized, but the answer key is not. In New Quizzes, however, both the input and the answer key are normalized. It's a mystery to me.

Chrome:
Chrome to Chrome: NO normalization
Other apps to Chrome: NO normalization (except text copied from Firefox, which arrives normalized)
Chrome to other apps: NO normalization (except when pasting into Safari, which normalizes)

I believe other Chromium-based browsers should work the same way as Chrome. But I only tested on Brave Browser.

Conclusion: Firefox and Safari behave in opposite ways. Chrome behaves normally and consistently (except when it is overridden by Firefox or Safari).

Unless developers for Safari and Firefox make changes, I guess we have to live with it.

As far as Keyboard Maestro is concerned:
When Keyboard Maestro uses Python to get the value of a variable via os.environ (KMVAR), the string is normalized in the process.
But if Python reads directly from the SystemClipboard, the text is NOT normalized.

When Keyboard Maestro uses JXA to get the value of a variable or to read directly from the SystemClipboard, the text is NOT normalized.
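To check which form a string is in when it reaches the script, something like the sketch below could be run on each path (environment variable, clipboard, and so on). It assumes Python 3.8 or later for unicodedata.is_normalized, and the helpers describe and read_km_variable are hypothetical names used only for this example.

```python
import os
import unicodedata

def describe(text):
    """Return the escaped code points and whether text is already in NFD."""
    escaped = text.encode("ascii", "backslashreplace").decode("ascii")
    return escaped, unicodedata.is_normalized("NFD", text)

def read_km_variable(name, environ=os.environ):
    """Read a Keyboard Maestro variable from the environment.

    Returns an empty string when the variable is not set, for example
    when the script is run outside Keyboard Maestro.
    """
    return environ.get("KMVAR_" + name, "")

value = read_km_variable("Local__Var")
print(describe(value))
```

If the second element of the result is True for a string that was typed as the single precomposed letter, the text was decomposed somewhere before the script saw it.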
