How can I divide the Chinese part and the English part of the clipboard into two parts?

minjie_he · March 26, 2017, 1:24pm

I do this to make Anki cards. Currently I copy the Chinese part first and then copy the English part, as described in this topic:
How can I get the second to last item in clipboard？ - Questions & Suggestions - Keyboard Maestro Discourse

Then I use KM to copy them to Anki.

Because I do this a lot every day, I think there must be some way to copy just once and then use KM to divide the Chinese and English parts and save them to two variables or two clipboard items.

So is there any way to process the clipboard like this? I think regular expressions could do it, but the details are hard for me.

The raw material looks like this:

That would probably cost you about fifty bucks...

那大概要花掉你50美元左右。

or this

Why can't you spend a few bucks on a coat?...

你为什么不能花点儿钱买件外套呢？

How could I copy just once and use KM to divide the text into two parts?

Thanks for the help I have received in recent days!

cfriend · March 26, 2017, 5:01pm

I’m not sure how regex feels about Chinese characters, but if it handles them easily, someone will probably have a fairly simple solution for you soon.

JMichaelTX · March 26, 2017, 8:46pm

Here is a very simple macro that is based on the relative position of the English and Chinese parts. A better solution would check for the start of Chinese characters. I think this can be done, but don't have time right now to figure this out.

Please give this a try and let us know if it works for you:

## Macro Library Parse String into English and Chinese Parts


#### DOWNLOAD:
<a class="attachment" href="/uploads/default/original/2X/e/e45a338c94b24f54b865feb7c0d6a9a7eaf0fdd1.kmmacros">Parse String into English and Chinese Parts.kmmacros</a> (3.5 KB)

---

### Release Notes

**HOW TO USE**

* Assign a trigger to this macro
* Select the text, starting with English through the end of Chinese part.
* Trigger this macro

**METHOD**

This method assumes that the English part and the Chinese part are separated by one or more end-of-line markers, with the English on the first line(s) and the Chinese on the lines below.
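
The position-based idea can be sketched in plain JavaScript. This is only an illustration, not the macro's actual actions (those are in the attachment and screenshot). It assumes the English occupies a single line, and the helper name `splitOnFirstLineBreak` is hypothetical.

```javascript
// Hypothetical helper: split a selection into [english, chinese], assuming
// the English text is on the first line and the Chinese follows on later lines.
const splitOnFirstLineBreak = text => {
    const trimmed = text.trim();
    // Find the first run of end-of-line markers (\r\n, \n, or \r).
    const m = /[\r\n]+/.exec(trimmed);
    if (m === null) return [trimmed, ''];
    return [
        trimmed.slice(0, m.index).trim(),
        trimmed.slice(m.index + m[0].length).trim()
    ];
};

const sample =
    "Why can't you spend a few bucks on a coat?...\n\n你为什么不能花点儿钱买件外套呢？";
console.log(splitOnFirstLineBreak(sample));
```

Because it relies only on position, this sketch puts whatever is on the first line into the first part, whether or not it is English.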

---

<img src="/uploads/default/original/2X/1/1bc6924bd2b7a6d5db87b396d8378cbf62982923.png" width="467" height="1000">

ComplexPoint · March 27, 2017, 12:47am

I think there should be a way to do this with JS.

// isCJK :: String -> Bool
function isCJK(s) {
    var c = s.charCodeAt(0);
    return c >= 0x4E00 && c <= 0x9FFF;
};

( Full script for a KM Execute JavaScript for Automation action below )

If we copy a string like:

Thank you very much 三块肉给你妈吃

into the clipboard, we should be able to get the two parts into separate KM variables with something like this:

Split clipboard on boundary between Roman and Chinese.kmmacros (23.1 KB)

JS ES5 script for the Execute JavaScript for Automation action.

(Note that this version assumes that the English comes first, and is followed by the Chinese - the script would need to be adjusted for the reverse case)

'use strict';

var _slicedToArray = function () {
    function sliceIterator(arr, i) {
        var _arr = [];
        var _n = true;
        var _d = false;
        var _e = undefined;
        try {
            for (var _i = arr[Symbol.iterator](), _s; !(_n = (_s = _i.next())
                    .done); _n = true) {
                _arr.push(_s.value);
                if (i && _arr.length === i) break;
            }
        } catch (err) {
            _d = true;
            _e = err;
        } finally {
            try {
                if (!_n && _i["return"]) _i["return"]();
            } finally {
                if (_d) throw _e;
            }
        }
        return _arr;
    }
    return function (arr, i) {
        if (Array.isArray(arr)) {
            return arr;
        } else if (Symbol.iterator in Object(arr)) {
            return sliceIterator(arr, i);
        } else {
            throw new TypeError("Invalid attempt to destructure non-iterable instance");
        }
    };
}();

(function () {
    'use strict';

    // SPLITTING ON BOUNDARY BETWEEN ROMAN LETTERS (罗马字) AND HAN CHARACTERS (汉字) ---

    // isCJK :: String -> Bool
    function isCJK(s) {
        var c = s.charCodeAt(0);
        return c >= 0x4E00 && c <= 0x9FFF;
    };

    // romanHanSplit :: String -> [String]
    var romanHanSplit = function romanHanSplit(s) {
        return map(function (x) {
            return x.join('');
        }, splitBy(function (a, b) {
            return !isCJK(a) && isCJK(b);
        }, stringChars(s)));
    };

    // GENERIC FUNCTIONS ------------------------------------------------------

    // head :: [a] -> a
    var head = function head(xs) {
        return xs.length ? xs[0] : undefined;
    };

    // map :: (a -> b) -> [a] -> [b]
    var map = function map(f, xs) {
        return xs.map(f);
    };

    // show :: a -> String
    var show = function show(x) {
        return JSON.stringify(x);
    }; //, null, 2);


    // Splitting not on a delimiter, but whenever the relationship between
    // two consecutive items matches a supplied predicate function

    // splitBy :: (a -> a -> Bool) -> [a] -> [[a]]
    var splitBy = function splitBy(f, xs) {
        if (xs.length < 2) return [xs];
        var h = head(xs),
            lstParts = xs.slice(1)
            .reduce(function (_ref, x) {
                var _ref2 = _slicedToArray(_ref, 3),
                    acc = _ref2[0],
                    active = _ref2[1],
                    prev = _ref2[2];

                return f(prev, x) ? [acc.concat([active]), [x], x] : [acc, active.concat(x), x];
            }, [
                [],
                [h], h
            ]);
        return lstParts[0].concat([lstParts[1]]);
    };

    // stringChars :: String -> [Char]
    var stringChars = function stringChars(s) {
        return s.split('');
    };

    // TEST -------------------------------------------------------------------

    var kme = Application("Keyboard Maestro Engine");

    var lstParts = romanHanSplit(
        kme.getvariable('romanPlusCJK')
    );

    kme.setvariable('romanPart', {
        to: lstParts.length > 0 ? lstParts[0] : ''
    });

    kme.setvariable('cjkPart', {
        to: lstParts.length > 1 ? lstParts[1] : ''
    });

    return show(lstParts);
})();

ComplexPoint · March 27, 2017, 12:56am

On Sierra you can use ES6 JavaScript, which yields a slightly shorter and cleaner script:

(() => {
    'use strict';

    // SPLITTING ON BOUNDARY BETWEEN ROMAN LETTERS (罗马字) AND HAN CHARACTERS (汉字) ---

    // isCJK :: String -> Bool
    const isCJK = s => {
        const c = s.charCodeAt(0);
        return (c >= 0x4E00 && c <= 0x9FFF);
    };

    // romanHanSplit :: String -> [String]
    const romanHanSplit = s => map(
        x => x.join(''),
        splitBy(
            (a, b) => !isCJK(a) && isCJK(b),
            stringChars(s)
        )
    );


    // GENERIC FUNCTIONS ------------------------------------------------------

    // head :: [a] -> a
    const head = xs => xs.length ? xs[0] : undefined;

    // map :: (a -> b) -> [a] -> [b]
    const map = (f, xs) => xs.map(f);

    // show :: a -> String
    const show = x => JSON.stringify(x); //, null, 2);


    // Splitting not on a delimiter, but whenever the relationship between
    // two consecutive items matches a supplied predicate function

    // splitBy :: (a -> a -> Bool) -> [a] -> [[a]]
    const splitBy = (f, xs) => {
        if (xs.length < 2) return [xs];
        const
            h = head(xs),
            lstParts = xs.slice(1)
            .reduce(([acc, active, prev], x) =>
                f(prev, x) ? (
                    [acc.concat([active]), [x], x]
                ) : [acc, active.concat(x), x], [
                    [],
                    [h],
                    h
                ]);
        return lstParts[0].concat([lstParts[1]]);
    };

    // stringChars :: String -> [Char]
    const stringChars = s => s.split('');
    

    // TEST -------------------------------------------------------------------

    const lstParts = romanHanSplit("Thank you very much 三块肉给你妈吃");

    const kme = Application("Keyboard Maestro Engine");

    kme.setvariable('romanPart', {
        to: lstParts.length > 0 ? lstParts[0] : ''
    });

    kme.setvariable('cjkPart', {
        to: lstParts.length > 1 ? lstParts[1] : ''
    });

    return show(lstParts);
})();
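
One caveat with the scripts above: `splitBy` breaks at every transition from a non-CJK character to a CJK character, so a Chinese sentence that contains digits or punctuation (such as `50` or `，`) produces more than two parts, and `lstParts[1]` then holds only the first fragment. A minimal sketch of one way around this, assuming the English always comes first, is to split once at the first character in the U+4E00 to U+9FFF range. The name `splitAtFirstCJK` is hypothetical.

```javascript
// isCJK :: String -> Bool (same range test as in the scripts above)
const isCJK = s => {
    const c = s.charCodeAt(0);
    return c >= 0x4E00 && c <= 0x9FFF;
};

// splitAtFirstCJK :: String -> [String, String]
// Hypothetical helper: everything before the first CJK character,
// and everything from that character onward, both trimmed.
const splitAtFirstCJK = s => {
    const i = s.split('').findIndex(isCJK);
    return i === -1
        ? [s.trim(), '']
        : [s.slice(0, i).trim(), s.slice(i).trim()];
};

console.log(splitAtFirstCJK(
    "That would probably cost you about fifty bucks...\n\n那大概要花掉你50美元左右。"
));
```

In a KM Execute JavaScript for Automation action, the two results could be stored with `kme.setvariable` in the same way as above.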

minjie_he · March 27, 2017, 1:05am

In my tests, it sometimes works and sometimes does not.

minjie_he · March 27, 2017, 1:21am

Seems great.
Since what I want is to copy the two parts into two input boxes,

this step is unnecessary.

Can you remove this display from the script? I guess that may work.

gglick · March 27, 2017, 1:48am

Here's my suggestion for this kind of macro. This solution should hopefully also capture any half-width digits that are present in the Chinese as per your first example sentence.
Divide Clipboard into EN and CH and Paste.kmmacros (2.8 KB)

JMichaelTX · March 27, 2017, 2:27am

I much prefer your solution over mine.

Now that you have shown me how to detect Chinese (Han) characters, I might suggest this RegEx that would allow numbers, and selected symbols, at the end:

(\p{Han}+[\p{Han}\d\. \-。]*)

This would work with this text:

For this test case, see regex101: build, test, and debug regex
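
For anyone trying the same pattern in JavaScript rather than in a KM regex action, here is a rough sketch. It assumes an engine that supports Unicode property escapes, where the script property is written `\p{Script=Han}` and the `u` flag is required. The helper name `chinesePart` is hypothetical.

```javascript
// Same pattern as above, in JavaScript syntax: a run of Han characters,
// followed by any mix of Han characters, digits, '.', space, '-' and '。'.
const hanRun = /(\p{Script=Han}+[\p{Script=Han}\d. \-。]*)/u;

// Hypothetical helper: return the first such run, trimmed, or '' if none.
const chinesePart = text => {
    const m = hanRun.exec(text);
    return m === null ? '' : m[1].trim();
};

console.log(chinesePart(
    "fifty bucks...\n\n那大概要花掉你50美元左右 100-200。 and some english"
)); // 那大概要花掉你50美元左右 100-200。

console.log(chinesePart(
    "refused to meet her.\n\n他本应坚持己见，拒绝与她会面。"
)); // 他本应坚持己见
```

The second call stops at the full-width comma because `，` is not in the character class; any punctuation expected inside the Chinese text has to be listed there explicitly.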

minjie_he · March 27, 2017, 5:11am

There is a bug with text like this:

He should have stuck to his guns and refused to meet her.

他本应坚持己见，拒绝与她会面。

This is what I get.

Something is missing from the Chinese part.

minjie_he · March 27, 2017, 5:15am

This also does not work for this situation.

He should have stuck to his guns and refused to meet her.

他本应坚持己见，拒绝与她会面。

gglick · March 27, 2017, 5:19am

The regex was failing to account for that comma character. I've corrected it by modifying @JMichaelTX's superior regex (thanks for that, by the way!) and it now works in my tests. Give this a shot and see how it goes.

Divide Clipboard into EN and CH and Paste.kmmacros (2.8 KB)

peternlewis · March 27, 2017, 5:46am

Rather than trying to allow some set of acceptable characters in the Chinese part, why not just break it at the first Han character?

\A\R*(?s:(.*?)\R*(\p{Han}.*?)\R*)\z

The \R* should get rid of all the extraneous space, but probably using the Trim Whitespace filter on the variables would be easier, as well as make the regex simpler, just:

(?s:(.*?)(\p{Han}.*))
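
A minimal JavaScript sketch of the same idea, for comparison. It assumes an engine with the `s` and `u` regex flags (ES2018): the `s` flag plays the role of `(?s:...)`, and JavaScript writes the script property as `\p{Script=Han}`. The helper name `splitEnglishChinese` is hypothetical, and `trim()` stands in for the Trim Whitespace filter.

```javascript
// Capture 1: everything before the first Han character (lazy).
// Capture 2: the first Han character and everything after it.
const firstHanSplit = /(.*?)(\p{Script=Han}.*)/su;

// Hypothetical helper: returns [english, chinese], both trimmed,
// or null when the text contains no Han character.
const splitEnglishChinese = text => {
    const m = firstHanSplit.exec(text);
    return m === null ? null : [m[1].trim(), m[2].trim()];
};

console.log(splitEnglishChinese(
    "He should have stuck to his guns and refused to meet her.\n\n他本应坚持己见，拒绝与她会面。"
));
```

Since the second capture runs to the end of the text, punctuation and digits inside the Chinese part need no special handling, but anything that follows the Chinese is captured with it.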

JMichaelTX · March 27, 2017, 6:11am

Peter, I don't think your RegEx would allow digits at the end of the Chinese text, or it would include any English after it.

Try testing this string:

That would probably cost you about fifty bucks...

那大概要花掉你50美元左右 100-200。 and some english
and some more english on a new line.

gglick · March 27, 2017, 6:34am

@JMichaelTX is right; when I tested your suggested regexes in BBEdit against the test string

He should have stuck to his guns and refused to meet her.

他本应坚持己见，拒绝与她会面。

The first one matched everything, and the second one only matched up to 3 Chinese characters. However, you did give me a great idea for how to significantly simplify the regex involved here, which seems to work well assuming the text is always in the form shown by @minjie_he, with the English and Chinese on their own lines:

Divide Clipboard into EN and CH and Paste.kmmacros (2.8 KB)

Now it should only look for single lines that start with either an English or Chinese character and grab the rest of the line, with no more need to worry about accounting for special characters or edge cases. As a bonus, it should also now no longer matter whether the English or Chinese line comes first. Thanks for the inspiration!
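
The line-based approach might look roughly like this in JavaScript. The macro's own regexes are in the attachment and not shown in the thread, so both patterns below are assumptions, as is the helper name `splitByLine`.

```javascript
// Assumed patterns: a whole line that starts with a Latin letter, and a
// whole line that starts with a Han character. The `m` flag makes ^ and $
// match at line boundaries; `u` enables \p{Script=Han}.
const englishLine = /^[A-Za-z].*$/m;
const chineseLine = /^\p{Script=Han}.*$/mu;

// Hypothetical helper: pick out one English line and one Chinese line,
// regardless of which comes first.
const splitByLine = text => {
    const en = englishLine.exec(text);
    const zh = chineseLine.exec(text);
    return {
        english: en ? en[0].trim() : '',
        chinese: zh ? zh[0].trim() : ''
    };
};

console.log(splitByLine("你为什么不能花点儿钱买件外套呢？\n\nWhy can't you spend a few bucks on a coat?..."));
```

Each search only looks at the first character of a line, so a line that begins with a digit or a quotation mark would be missed by both patterns in this sketch.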

UPDATE: I should have thought to test your first regex in a different tool than BBEdit before posting; testing it at regex101.com shows that it expertly captures both the English and Chinese in a single search, which is why it matches everything. My apologies, and thanks for the lesson!

peternlewis · March 27, 2017, 6:45am

Was the intention not to split it at the first Han character, and so everything before that is the English text and everything after that is the Chinese text?

The second one was, again, sigh, mangled by the forum so that the ** characters were removed.

(?s:(.*?)(\p{Han}.*))

That will capture everything before the first Han character into the first capture, and everything from then on into the second capture. If that is not what is desired then I clearly misread something somewhere.

After that, Filter and trim white space on both variables.

gglick · March 27, 2017, 6:58am

As far as I can tell that is indeed what is desired, and now that I see your intended regex I can tell that it would work very nicely. I think I still ever-so-slightly prefer performing separate regex searches for lines that start with English or Chinese characters, simply because it then no longer matters whether Chinese or English comes first, but I very much appreciate the lesson in regex simplification here regardless.

minjie_he · March 27, 2017, 7:07am

Works well for most situations! Thanks a lot!
