I do linguistics and work extensively with text in non-Latin scripts, chiefly Cyrillic. I'm hoping that other users can shed light on what KM is doing internally in terms of character set encoding, because it doesn't seem to handle Cyrillic well.
I have a sqlite3 database of Russian terms, encoded in UTF-8. My KM action takes a word off the macOS clipboard, puts it into a KM variable, and launches a Perl script that queries the db and returns a numeric parameter from it. Although the action correctly generates the query from the KM variable, the query text appears to be in an encoding that the sqlite3 engine doesn't recognize.
For example, this is a proper query:
SELECT rank FROM corpus WHERE word LIKE 'удовлетворительный'
but it returns no rows because KM is doing something to the encoding of the Russian term that I can't seem to account for. I suspect this is an encoding issue in the KM variable because simply re-typing the Russian term inside the single quotes and executing it directly in sqlite3 works as expected. In any case, here's the script:
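#!/usr/bin/perl
use DBI;    # provides DBI->connect; the DSN below selects the SQLite driver
use DBD::SQLite;
use feature 'unicode_strings';
# Path to the SQLite database file
my $db_path = "/Users/alan/Documents/dev/RussianNationalCorpus";
my $dbh = DBI->connect("dbi:SQLite:dbname=$db_path","","");
# KM passes the variable ru_word to the script as the environment variable KMVAR_ru_word
my $query_word = $ENV{KMVAR_ru_word};
my $query = "SELECT rank,word,lemma FROM corpus WHERE word = '$query_word'";
my $sth = $dbh->prepare($query);
my $rv = $sth->execute();
my $rank;
# Keep the rank from the last matching row
while (my @row = $sth->fetchrow_array()) {
    $rank = $row[0];
}
print $rank; print "\n";
$dbh->disconnect();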
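For comparison, here is a minimal Python sketch of the same lookup, using a bound parameter rather than interpolating the word into the SQL string. It is not my actual script: the function name lookup_rank and the in-memory table with a made-up rank are stand-ins for illustration. Only the corpus table and its rank, word and lemma columns come from the query above.

```python
import sqlite3

def lookup_rank(conn, word):
    """Return the rank stored for word, or None if no row matches."""
    # The ? placeholder lets sqlite3 bind the word as a value,
    # so no quoting inside the SQL text is needed.
    row = conn.execute(
        "SELECT rank FROM corpus WHERE word = ?", (word,)
    ).fetchone()
    return row[0] if row else None

# Hypothetical stand-in for the real database: one row with a made-up rank.
WORD = "удовлетворительный"
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE corpus (rank INTEGER, word TEXT, lemma TEXT)")
conn.execute("INSERT INTO corpus VALUES (?, ?, ?)", (1, WORD, WORD))

print(lookup_rank(conn, WORD))
```

Binding the value rules out quoting problems in the SQL text, but it would not by itself change which characters reach the database.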
Playing around with encoding the variable into UTF-8 inside the script did not make any difference. Thinking it's something strange about how Perl is handling the encoding, I rewrote the script in Python. Same thing. Some queries work, others do not. Any query with the letter "й" never works. I've solved the problem by just piping the macOS pasteboard to another tool, bypassing (I think) KM's variable and clipboard handling. This always works:
#!/usr/local/bin/zsh
pbpaste | xargs /Users/alan/Desktop/GetRNCRank
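The fact that "й" is the letter that always fails makes me wonder about Unicode normalization rather than the encoding itself. "й" can be stored either as the single code point U+0439 or as "и" (U+0438) followed by a combining breve (U+0306). The two forms look identical but compare as different strings. I have not confirmed that this is what happens to the KM variable. The sketch below, with a hypothetical helper code_points, is one way to check what a string actually contains and to force it into the composed form (NFC) before querying.

```python
import unicodedata

def code_points(text):
    """List the code points in text as U+XXXX strings."""
    return [f"U+{ord(ch):04X}" for ch in text]

composed = "\u0439"          # й as one precomposed code point
decomposed = "\u0438\u0306"  # и followed by a combining breve

print(code_points(composed))
print(code_points(decomposed))
print(composed == decomposed)  # compares the two forms code point by code point

# Normalizing to NFC turns the decomposed form into the composed one.
normalized = unicodedata.normalize("NFC", decomposed)
print(normalized == composed)
```

Running code_points on the value of KMVAR_ru_word and on the same word as stored in the database would show whether the two differ in this way.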
So, in summary, it seems that KM is doing something to the encoding of non-Latin characters behind the scenes. Anyone else encounter this? Workarounds?