Convert HTML to Markdown in buffer for pasting?

I frequently copy web content and need to convert the elements that HTML has in common with Markdown, so as to preserve as much of the formatting (and especially the HTML links) as possible.

Is there a current macro available to do this for quick copy/paste?

One approach might be to make use of this site:

http://heckyesmarkdown.com

(with or without some degree of automation)

Or, you could install pandoc

https://pandoc.org/installing.html

and use it to convert HTML to Markdown

https://pandoc.org/demos.html
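
As a rough sketch of that second route, assuming pandoc is already installed, the conversion could be driven from a short Node script. The function names here are placeholders rather than part of any existing macro; only the -f and -t options are taken from the pandoc manual.

```javascript
const { execFileSync } = require('child_process');

// pandocArgs :: String -> String -> [String]
// Command line arguments for a simple format-to-format conversion.
const pandocArgs = (from, to) => ['-f', from, '-t', to];

// htmlToMarkdown :: FilePath -> String -> String
// Pipes the HTML to pandoc on stdin and returns whatever pandoc writes to stdout.
// Throws if pandoc is not found at pandocPath or exits with an error.
const htmlToMarkdown = (pandocPath, html) =>
    execFileSync(pandocPath, pandocArgs('html', 'markdown'), {
        input: html,
        encoding: 'utf8'
    });
```

It might then be called as htmlToMarkdown('/usr/local/bin/pandoc', '<a href="https://pandoc.org">pandoc</a>'), where the path is only an example and depends on how pandoc was installed.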

I wonder if using a ruby script might be better?

and there's also a Node module:

FWIW, basic pandoc use, which could be fine-tuned with some of the command line options in http://pandoc.org/MANUAL.html, might look something like this:

Paste any clipboard HTML as Markdown (using pandoc).kmmacros (25.2 KB)

JS Source

(() => {
    'use strict';

    ObjC.import('AppKit');

    const main = () =>
        bindLR(
            elem(
                'public.html',
                ObjC.deepUnwrap(
                    $.NSPasteboard.generalPasteboard
                    .pasteboardItems.js[0].types
                )
            ) ? (
                Right(
                    ObjC.deepUnwrap(
                        $.NSString.alloc.initWithDataEncoding(
                            $.NSPasteboard.generalPasteboard
                            .dataForType('public.html'),
                            $.NSUTF8StringEncoding

                        )
                    )
                )
            ) : Left('No HTML in clipboard'),
            strHTML => {
                const
                    fp = filePath(
                        Application('Keyboard Maestro Engine')
                        .getvariable('pandocPath')
                    );
                return doesFileExist(fp) ? (
                    Right(strHTML)
                ) : Left('pandoc not found at: ' + fp);
            }
        );

    // GENERIC FUNCTIONS --------------------------------------

    // https://github.com/RobTrew/prelude-jxa

    // Left :: a -> Either a b
    const Left = x => ({
        type: 'Either',
        Left: x
    });

    // Right :: b -> Either a b
    const Right = x => ({
        type: 'Either',
        Right: x
    });

    // bindLR (>>=) :: Either a -> (a -> Either b) -> Either b
    const bindLR = (m, mf) =>
        undefined !== m.Left ? (
            m
        ) : mf(m.Right);

    // doesFileExist :: FilePath -> IO Bool
    const doesFileExist = strPath => {
        const ref = Ref();
        return $.NSFileManager.defaultManager
            .fileExistsAtPathIsDirectory(
                $(strPath)
                .stringByStandardizingPath, ref
            ) && 1 !== ref[0];
    };

    // elem :: Eq a => a -> [a] -> Bool
    const elem = (x, xs) => xs.includes(x);

    // filePath :: String -> FilePath
    const filePath = s =>
        ObjC.unwrap(ObjC.wrap(s)
            .stringByStandardizingPath);

    // MAIN ---
    return main();
})();

That is really nice @ComplexPoint. I have to say that pandoc seems to be about the most robust and easy to install option.

One thing I found when I used this is that it doesn't seem to convert things to Markdown quite right. Take the message you posted earlier as an example. When I copy it and then run your macro, it generates the following pasted output:

::: {.cooked style="word-wrap: break-word; line-height: 1.4; overflow: hidden; color: rgb(225, 222, 214); font-family: Helvetica, Arial, sans-serif; font-size: 14px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(32, 33, 39); text-decoration-style: initial; text-decoration-color: initial;"}
One approach might be to make use of this site:

[http://heckyesmarkdown.com](http://heckyesmarkdown.com/){.onebox}

(with or without some degree of automation)

Or, you could install pandoc

<https://pandoc.org/installing.html>

and use it to convert HTML to Markdown

[https://pandoc.org/demos.html [1]{.badge .badge-notification .clicks
title="1 click"
style="display: inline-block; font-weight: normal; white-space: nowrap; border-radius: 10px; padding: 3px 5px; min-width: 8px; vertical-align: middle; color: rgb(189, 177, 159); font-size: 0.7579em; line-height: 1; text-align: center; background-color: rgb(37, 38, 45); top: -1px; position: relative; border: none;"}](https://pandoc.org/demos.html){.onebox}
:::

::: {.section .post-menu-area .clearfix style="display: block; margin: 20px 0px; position: relative; color: rgb(225, 222, 214); font-family: Helvetica, Arial, sans-serif; font-size: 14px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(32, 33, 39); text-decoration-style: initial; text-decoration-color: initial;"}
::: {.actions style="text-align: right; float: right; display: inline-block;"}
:::
:::

\

This just doesn't seem to follow markdown conventions.

I did try markdown_strict and markdown_mmd, but while these remove the colons, they don't seem to do anything else (echo "$KMVAR_pandocInput" | "$KMVAR_pandocPath" -f html -t markdown_strict --strip-comments):

One approach might be to make use of this site:

<a href="http://heckyesmarkdown.com/" class="onebox">http://heckyesmarkdown.com</a>

(with or without some degree of automation)

Or, you could install pandoc

<a href="https://pandoc.org/installing.html" class="uri onebox">https://pandoc.org/installing.html</a>

and use it to convert HTML to Markdown

<a href="https://pandoc.org/demos.html" class="onebox">https://pandoc.org/demos.html<span> </span><span class="badge badge-notification clicks" title="1 click" style="display: inline-block; font-weight: normal; white-space: nowrap; border-radius: 10px; padding: 3px 5px; min-width: 8px; vertical-align: middle; color: rgb(189, 177, 159); font-size: 0.7579em; line-height: 1; text-align: center; background-color: rgb(37, 38, 45); top: -1px; position: relative; border: none;">1</span></a>

It seems that the conversion simply isn't working at all with the changed Markdown type.

Yes – Pandoc does err on the side of trying to preserve too much meta-information, and it's probably not the right tool if you are copying from fairly complex and highly formatted HTML pages like those generated by the Discourse forum software.

(I find it useful enough for paragraphs of articles, where I just want to get any links straight into Markdown format)
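
If the attribute noise is the main problem, one thing that might be worth trying is switching off the pandoc extensions that carry it. The extension names below are from the pandoc manual; the helper names are made up for illustration, and this is untested against Discourse pages, so treat it as a sketch only.

```javascript
// Extensions which (per the pandoc manual) allow raw HTML, ::: divs,
// [text]{.class} spans and {.class} link attributes in Markdown output.
const noisyExtensions = [
    'raw_html',
    'native_divs',
    'native_spans',
    'fenced_divs',
    'bracketed_spans',
    'link_attributes'
];

// formatWithout :: String -> [String] -> String
// Builds a pandoc format string with the given extensions disabled,
// e.g. 'markdown_strict' and ['raw_html'] give 'markdown_strict-raw_html'.
const formatWithout = (base, exts) =>
    base + exts.map(x => '-' + x).join('');

// plainMarkdownArgs :: [String]
// Arguments that could replace '-f html -t markdown_strict --strip-comments'.
const plainMarkdownArgs = [
    '-f', 'html',
    '-t', formatWithout('markdown', noisyExtensions),
    '--strip-comments'
];
```

The resulting -t value would go in the shell script action in place of markdown_strict. With markdown_strict itself, disabling raw_html alone may be enough, since the raw <a> tags in that output depend on that extension.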

I'm just going out now, but in the meanwhile, it might be worth trying the basic framework of the macro above, but:

  1. Disabling the shell script action which calls pandoc
  2. Adding an alternative shell script action next to it which calls something else – for example one of those Ruby gems.
  3. Changing the file path checking - away from a check for pandoc, to either nothing, or a check for something like the ruby gem.

This (EU) evening, if nothing better has yet come up, I might experiment with domchristie's turndown, to see how easily it can be 'browserified' and used in a Keyboard Maestro macro.

Here is a slightly different approach using Dom Christie's Turndown (a Node module, which uses the DOM of a running browser to convert HTML -> MD).

In Keyboard Maestro, we can use a slightly adjusted version of it in either a Chrome Javascript or Safari Javascript action, and it requires the corresponding browser to be running.

Chrome seems to run it significantly faster.

(EDIT: That may just have been because I had JS debugging enabled in Safari.)

(Note also that Safari use requires Develop > Allow JavaScript from Apple Events to be enabled in the Safari menu system)

Behaviour:

The macro is intended to:

  1. paste HTML or RTF as Markdown, and to
  2. paste any plain UTF8 text in the clipboard with its format unchanged.
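
The order of preference behind that behaviour can be sketched in plain JavaScript, outside JXA. The names preferredTypes and chooseClipType are illustrative only; the macro itself reads the type list from NSPasteboard.

```javascript
// Clipboard types in order of preference: HTML first, then RTF, then plain text.
const preferredTypes = [
    'public.html',
    'public.rtf',
    'public.utf8-plain-text'
];

// chooseClipType :: [String] -> Either String String
// Returns the first preferred type present in the clipboard's type list,
// or a Left message if none of them is there.
const chooseClipType = types => {
    const found = preferredTypes.find(uti => types.includes(uti));
    return undefined !== found ? ({
        type: 'Either',
        Right: found
    }) : ({
        type: 'Either',
        Left: 'No HTML, RTF or UTF8 text in clipboard'
    });
};
```

A clipboard offering both RTF and plain text would therefore be treated as RTF, and one holding only an image would yield the Left message.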

Options

Turndown provides some Markdown options. This draft of the Execute JS action uses just two of them at the top of the code, and others can be added.

const main = () =>
    TurndownService({
        // https://www.npmjs.com/package/turndown#options
        headingStyle: 'atx',
        bulletListMarker: '-'
    }).turndown(
        document.kmvar.clipHTML
    );
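
For example, a few more of the options listed in the Turndown README could be added to the same object. The values chosen below are only illustrative, and markdownFromHTML is a hypothetical wrapper that takes the TurndownService constructor as an argument, so that the sketch does not depend on how Turndown is loaded.

```javascript
// Option names from the Turndown README; the values are illustrative choices.
const turndownOptions = {
    headingStyle: 'atx',       // '# Heading' rather than underlined headings
    bulletListMarker: '-',
    codeBlockStyle: 'fenced',  // fenced rather than indented code blocks
    emDelimiter: '*',
    linkStyle: 'inlined'       // [text](url) rather than reference links
};

// markdownFromHTML :: TurndownService -> String -> String
// Builds a converter with the options above and applies it to the HTML.
const markdownFromHTML = (TurndownService, html) =>
    new TurndownService(turndownOptions).turndown(html);
```

In the browser action, the second argument would be document.kmvar.clipHTML, as in the draft above.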

Paste as Markdown (using Turndown thru browser).kmmacros (70.2 KB)

JS Source – JXA extraction of clipboard text for Turndown

(() => {
    'use strict';

    ObjC.import('AppKit');

    const main = () => {
        const
            ts = ObjC.deepUnwrap(
                $.NSPasteboard.generalPasteboard
                .pasteboardItems.js[0].types
            );
        return elem(
            'public.html', ts
        ) ? (
            Right(clipFromUTI('public.html'))
        ) : elem(
            'public.rtf', ts
        ) ? (
            htmlFromRTFLR([
                'doctype', 'html', 'body', 'xml',
                'style', 'p', 'font', 'head', 'span'
            ], clipFromUTI('public.rtf'))
        ) : elem(
            'public.utf8-plain-text', ts
        ) ? (
            Right(`<pre>${clipFromUTI('public.utf8-plain-text')}</pre>`)
        ) : Left('No HTML, RTF or UTF8 text in clipboard');
    };

    // GENERIC FUNCTIONS ----------------------------

    // https://github.com/RobTrew/prelude-jxa

    // Left :: a -> Either a b
    const Left = x => ({
        type: 'Either',
        Left: x
    });

    // Right :: b -> Either a b
    const Right = x => ({
        type: 'Either',
        Right: x
    });

    // bindLR (>>=) :: Either a -> (a -> Either b) -> Either b
    const bindLR = (m, mf) =>
        undefined !== m.Left ? (
            m
        ) : mf(m.Right);

    // elem :: Eq a => a -> [a] -> Bool
    const elem = (x, xs) => xs.includes(x);

    // CLIPBOARD

    // clipFromUTI :: String -> String
    const clipFromUTI = strUTI =>
        ObjC.deepUnwrap(
            $.NSString.alloc.initWithDataEncoding(
                $.NSPasteboard.generalPasteboard
                .dataForType(strUTI),
                $.NSUTF8StringEncoding

            )
        );

    // RTF -> HTML

    // htmlFromRTFLR :: [String] -> String -> Either String String
    const htmlFromRTFLR = (exceptTags, strRTF) => {
        const
            as = $.NSAttributedString.alloc
            .initWithRTFDocumentAttributes($(strRTF)
                .dataUsingEncoding($.NSUTF8StringEncoding), 0
            );
        return bindLR(
            typeof as
            .dataFromRangeDocumentAttributesError !== 'function' ? (
                Left('String could not be parsed as RTF')
            ) : Right(as),

            // Function bound if Right value obtained above:
            rtfAS => {
                let error = $();
                const htmlData = rtfAS
                    .dataFromRangeDocumentAttributesError({
                            'location': 0,
                            'length': rtfAS.length
                        }, {
                            DocumentType: 'NSHTML',
                            ExcludedElements: exceptTags
                        },
                        error
                    );
                return Boolean(ObjC.unwrap(htmlData) && !error.code) ? Right(
                    ObjC.unwrap($.NSString.alloc.initWithDataEncoding(
                        htmlData,
                        $.NSUTF8StringEncoding
                    ))
                ) : Left(ObjC.unwrap(error.localizedDescription));
            }
        );
    };

    // MAIN ---
    return main();
})();

Updated the macro above – generalizing it:

  • from paste HTML as MD, to
  • Paste as MD.

In other words, it now aims to:

  1. Paste either HTML or RTF as Markdown, and
  2. paste plain UTF8 text unchanged.

Wow, very nice continued work. I think I like the pandoc approach better sans browser running, but I'm not sure where to go with it. I'll give this other new approach a go too since it is powerful to know how to use the browser like you have done in there.

sans browser running

Yes, intuitively, I do share that preference for fewer dependencies.

On the other hand, as I quickly noticed when testing what happens when no browser is running, if we are copying web content, then the system probably does have either Safari or Chrome running somewhere : -)

Well, I would normally agree, but sometimes I copy from clients like chat clients that also use HTML rendering but do not have a Chrome or Safari engine running directly. Humm.

Understood.

(Of course, the macro doesn’t require that you are copying from Safari or Chrome – simply that there is a copy of one of them running somewhere in the background)

Just made an extension for Firefox and Chrome to do just that (paste as Markdown): http://markitdown.medusis.com