Solved: Options in automating batch conversion of Word.doc to PDFs

Hello all,

We’re currently looking at having to upload thousands of files into a local server site (Sharepoint), for sharing amongst colleagues… In my situation, I ‘only’ have a few hundred that fall under my domain… :slight_smile:

I was full of confidence that this would be a quick one-two using Automator, but from what I’ve managed to find out - MS Word on the Mac & Automator are no longer on speaking terms, with the option to Convert Format of Word Document now seemingly missing from Automator.

Similarly, there are thousands of apps in the MAS to batch-convert from PDF to Word, but not so many for the other way around [the one I found supports .txt files, not .doc/.docx files…]

What I just want to check, is the following:

KM could help speed things up by either
a.) running a series of steps in a macro that would do all the clicking for you, through the various windows/sub-menus of opening in Word, saving/exporting as PDF [quicker - but still time consuming, relatively speaking, since the actual File is still being opened, and then saved as a PDF again];

OR

b.) Go with AppleScript(??) or something similar (Command line/Terminal) to have all of that run in the background, so to speak, without anything opening directly? [KM then simply acts as the mechanism to invoke the script]?

Is that about the gist of it? Or am I missing something else that might be of use inside KM?

Would appreciate someone confirming if that’s it - in which case (assuming the latter), I’ll duly head off to the ether, to see if I can find something that has already been pieced together (that actually works - seems the few readily available worked in Mavericks, but not later than that.)

This is a fairly difficult task. I’ve done such things before (such as when converting from AppleWorks to Pages).

Basically controlling the UI needs to be done carefully, with lots of pauses for the longest possible cases. Given that you can then generally sit back and watch while it does it, the fact that it is not particularly fast is usually an acceptable consequence.

Appreciate the reply, Peter, pleased I wasn’t missing something obvious.

I had managed to set up a macro that is invoked using BTT (or even a trigger word spoken through Dragon), which does what is needed.

That said, I’m happy to report that I managed to figure something out.
This page saw Daniel Grau pop up a droplet originally written by John Welch - it didn’t work initially, but I managed to tweak it slightly, and it’s working as it should now.

[It was initially telling me the droplet was corrupted. I “opened the package contents” and played around with the script, and had to tweak it to open MS Word for Mac, as opposed to MS Word over in my Virtual machine. Saved it as an application, dropped it in the Finder toolbar, and when I drag/drop a Word file on it, it saves that file as a PDF in the same location, without any intervention required on my side. NOTE: The first time I opened it, I had to grant access to Word to the specific folder in Finder (not sure why). I’m running El Cap 10.11.5 and Word for Mac 2016 15.23.2 (160624).]

Here be the script, in case anyone else would find this useful:

on open of theFiles
   tell application "/Applications/Microsoft Word.app" to set theOldDefaultPath to get default file path file path type documents path
   repeat with x in theFiles
      set theDoc to contents of x
      tell application "Finder"
         set theFilePath to container of theDoc as text
         set theFilename to (name of theDoc) & ".pdf"
      end tell
      tell application "/Applications/Microsoft Word.app"
         set default file path file path type documents path path theFilePath
         open theDoc
         set theActiveDoc to the active document
         save as theActiveDoc file format format PDF file name theFilename
         close front document of application "/Applications/Microsoft Word.app"
      end tell
   end repeat
   tell application "/Applications/Microsoft Word.app" to set default file path file path type documents path path theOldDefaultPath
end open

Please note that I know next to nothing about AppleScript. Kindly be careful about using the above - it works my side without any issues, but I cannot guarantee the same that side!

By the way - not made clear above, this also works with multiple files selected in Finder, and dropped onto the Droplet.

Hey Neal,

First let's clean up that script just a little.

--------------------------------------------------------------------------------
# Task: A Droplet to Convert Word Documents to .PDF format.
# dMod: 2016/07/05 22:00
--------------------------------------------------------------------------------
on open theFileList
   
   tell application id "com.microsoft.Word"
      set oldDefaultPath to get default file path file path type documents path
   end tell
   
   repeat with theFile in theFileList
      
      tell application "Finder"
         set theFileParentPath to theFile's container as text
         set theFileName to get theFile's name
         set theFileNameExtension to theFile's name extension
      end tell
      
      set AppleScript's text item delimiters to ("." & theFileNameExtension)
      set newFileName to (text item 1 of theFileName) & ".pdf"
      
      tell application id "com.microsoft.Word"
         set default file path file path type documents path path theFileParentPath
         open theFile
         
         tell active document
            save as file format format PDF file name newFileName
            close
         end tell
         
      end tell
      
   end repeat
   
   tell application id "com.microsoft.Word"
      set default file path file path type documents path path oldDefaultPath
   end tell
   
end open
--------------------------------------------------------------------------------
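A side note on how the new file name is derived: the script splits the name on "." plus the extension and keeps the first piece, so a name such as my.document.doc would be cut to my.pdf, because ".doc" also matches the start of ".document". The sketch below is in Python rather than AppleScript and is purely illustrative; it shows the alternative of stripping only the final extension. The function name pdf_name_for is made up for this example.

```python
import os

def pdf_name_for(file_name, new_extension=".pdf"):
    """Replace only the last extension of file_name with new_extension."""
    # splitext splits at the final dot, so earlier dots in the name survive.
    stem, _old_extension = os.path.splitext(file_name)
    return stem + new_extension
```

Names with a single dot behave the same either way; the difference only shows up when the extension text occurs earlier in the name.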

NOTE: I've changed the hard-coded path to Word's bundle id:

application id "com.microsoft.Word"

Neal – this may work for you – but I don't have PC-Word in a virtual machine to test with.

Try running this from the Script Editor.app to find out:

tell application id "com.microsoft.Word"
   path to it
end tell

Next let's provide a downloadable droplet:

Word Document to .PDF Converter v1.01.zip (63.4 KB)

NOTE: This can be opened using the Script Editor.app.

Finally – since I really dislike using droplets – here's how to run the droplet on the selection in the Finder:

--------------------------------------------------------------------------------
set theDroplet to "~/test_directory/KM_TEST/Word Document to .PDF Converter.app"
tell application "System Events"
   set theDroplet to (POSIX path of disk item theDroplet) as «class furl»
end tell

tell application "Finder"
   set finderSelectionList to selection as alias list
   if length of finderSelectionList = 0 then error "No files were selected in the Finder!"
   open finderSelectionList using application file theDroplet
end tell
--------------------------------------------------------------------------------

One advantage of using a droplet this way is the job is handed-off to the droplet application – Keyboard Maestro is relieved of the load and will be somewhat more responsive to other tasks.

-Chris

Really appreciate the above Chris - will give this a bash. Very useful to see all the various options of going about doing this, makes things far easier to learn from!

Yeah, Chris really knows how to help others learn. He’s awesome!

Chris - one more silly question!

I’ve implemented the Script, and it works a treat.
About that last bit - am I correct in presuming that I save the droplet script in /Folder Wherever - and then, create a KM Macro with a suitable trigger, and have it ‘run’ that last AppleScript?

In other words, select the Word files over in Finder, fire off the trigger, which then runs that script and calls/hands-over to the Droplet, to do its thing?

I’ve done that, and it works - just want to check that this is what you had in mind, as opposed to my missing something more obvious… I guess, a further alternative would be implementing it as a Service - but that would actually be more cumbersome, given I’d have to wade through all the already existing Services’ options available…

Thanks again for the help!

Hey Neal,

Yes.

Right.

No, you understood me.

If you give your droplet a properly unique name you can even call it by name:

tell application "Finder" to set finderSelectionList to selection as alias list
if length of finderSelectionList = 0 then error "No files were selected in the Finder!"

tell application "Word Document to .PDF Converter"
   open finderSelectionList
end tell

NOTE: I needed to make slight changes to the droplet to accommodate this and have accordingly changed the droplet file above.

-Chris

Fantastic - thanks Chris. This is going to save me hours of time!

Hello, Chris,

I have been searching for a solution to my problem for quite a long time. I need to convert thousands of MS Word .doc files to .docx format (the real XML format, not just adding “x” to the .doc extension). I’m hoping that you will have pity on me and provide me with an altered version of your “Word Document to .PDF Converter” (i.e., “Word Doc to Docx Converter”) that will accomplish the task. Thank you so very much, in advance, for your time and assistance!

This works for me:

In Chris’ script change line 20 to…

set newFileName to (text item 1 of theFileName) & ".docx"

and line 27 to…

save as file format format document file name newFileName

When you open the created docx Word will show “Compatibility Mode”. I don’t know why it does that, I’m not an MS Word expert.

But it is definitely an (XML) docx. To check, open it with BBEdit and you’ll see something like this:

Thanks for the help, Tom!

I’m no MS Word expert, either, but from what I’ve been able to discern, your suggested adjustments to Chris’ script merely change the extension to “.docx,” which is the same as manually adding an “x” to “.doc” in the Finder. In both scenarios, Word opens the file in “Compatibility Mode” because the file is not truly formatted as full XML. Word just thinks that you want it to be in XML format and attempts to fill in the blanks so that people with earlier, incompatible versions of MS Word can also open/read it. In contrast, when I manually “Save As” in Word and choose “Word Document (.docx),” the file gets converted to full XML format (without losing any formatting/tables/images/lists, etc.) and “Compatibility Mode” does not appear.

So, I suppose that what I need the script to do is impersonate the actions of a human being who is manually converting each file, starting from the Finder:

  1. Open each “Word 97-2004” .doc file.
  2. “Save As” (i.e., convert) each .doc file to “Word Document (.docx).”
  3. CRITICAL: Do not lose formatting/tables/images/lists, etc. upon conversion.
  4. Immediately close each file after saving/converting it.
  5. Repeat the process until all files have been converted.

No! I also thought it was still a doc after the first try, but it’s not the case. That’s why I mentioned the BBEdit test. Look at the screenshot: this is a docx :slightly_smiling_face:

The script calls Word and makes it save the file as format document (line 27; in the original script it’s format PDF), which should be the regular Word document format (docx). To just rename the files you don’t need the AppleScript. But renaming doc to docx makes no sense at all.

PS:

If you don’t trust BBEdit open it with any hex editor:


Old Word doc has this:
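The hex-editor check can also be scripted. The sketch below (Python, for illustration only) relies on two well-known signatures: old binary .doc files are OLE2 compound files that start with the bytes D0 CF 11 E0 A1 B1 1A E1, while a .docx is a ZIP archive that contains word/document.xml. The function names are hypothetical.

```python
import io
import zipfile

# Signature at the start of every OLE2 compound file (the old binary .doc container).
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")

def sniff_word_bytes(data):
    """Return 'doc', 'docx' or 'unknown' for the raw bytes of a file."""
    if data.startswith(OLE2_MAGIC):
        # OLE2 container; other legacy Office files (.xls, .ppt) share this signature.
        return "doc"
    stream = io.BytesIO(data)
    if zipfile.is_zipfile(stream):
        with zipfile.ZipFile(stream) as archive:
            # A real .docx keeps its main body in this archive member.
            if "word/document.xml" in archive.namelist():
                return "docx"
    return "unknown"

def sniff_word_file(path):
    """Convenience wrapper: read the file at path and classify its contents."""
    with open(path, "rb") as handle:
        return sniff_word_bytes(handle.read())
```

A .doc that was merely renamed to .docx would still come back as "doc" here, which is the distinction being discussed.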

Found it:

add without maintain compatibility to line 27. So the complete line 27 is:

save as file format format document file name newFileName without maintain compatibility

Then it doesn’t show the “Compatibility Mode” when you open the docx.

OK, here is the whole script saved as droplet. Just drop your doc files on it.

doc to docx.app.zip (72.9 KB)

It does all of your 5 points with the following limitations:

(1) It accepts only doc files. You can’t drop folders or non-doc files.

(3) The conversion is done by MS Word. That means the script has no influence on the conversion process.
In the past I’ve already seen complicated doc documents that looked a bit different as docx documents. If you have such documents you have to check the result visually. But this is an MS Word problem, and I think you won’t find any better docx converter than MS Word itself.

Looks like Tom cracked it. :smile:

Message me if any further assistance is needed.

-Chris

Thanks for asking, your script is doing fine and it behaves well :wink:

The confusing thing was Word’s Maintain compatibility setting.
When saving a doc as docx

  • in the normal GUI (Word Mac 2011) it defaults to Off

  • whereas via AppleScript it defaults to On
    (can be disabled with without maintain compatibility).

If it was enabled during conversion the document opens with the Compatibility message, which could give you the impression it’s still a binary doc. (Which isn’t the case.)

I have tried to run Tom’s app (thanks a million, BTW), and it does work as desired, maintaining formatting and everything else in the new .docx files that get created. However, I soon discovered that the process is very slow when I attempt to batch-convert 1,000+ files (which is what I need to do, since I have over 54,000 old .doc files). It opens and converts files at a rate of about 1 per 10 seconds. Plus, Word pauses the conversion process whenever it runs into a file that contains Macros or problematic encoding, thereby preventing me from being able to let it run when I go to bed. I have to be physically present in order to click away each “Macros Warning” panel or “Select the encoding that makes your document readable” panel that appears. Is there any way to get around having to “open” each file while still maintaining the critical “no loss of formatting” results? (For instance, in BBEdit, I can batch search-and-replace in 100,000 files without the files actually opening.)

Say Thanks to Chris (@ccstone). I just changed two lines in the script.

I have over 54,000 old .doc files

What!? :bomb::astonished:

I have to be physically present in order to click away each “Macros Warning” panel or “Select the encoding that makes your document readable” panel that appears. Is there any way to get around having to “open” each file while still maintaining the critical “no loss of formatting” results?

Well, you can try it with textutil (command line tool). Type man textutil in the Terminal.

It seems to support doc and docx, but I tend to think that the conversion will be far less accurate than with Word. (I haven’t tried it.)
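A rough sketch of what a textutil batch run could look like, driven from Python. It only uses textutil's -convert and -output options; the function names, the choice to write each .docx next to its source, and the assumption that a non-zero exit status means the file was rejected are all illustrative. The fidelity caveat still applies, so it would be wise to compare a handful of results against Word's own output before trusting it on a large archive.

```python
import subprocess
from pathlib import Path

def textutil_command(source, target_format="docx"):
    """Build the textutil call that converts one file, writing next to the source."""
    source = Path(source)
    target = source.with_suffix("." + target_format)
    return ["textutil", "-convert", target_format, "-output", str(target), str(source)]

def convert_folder(folder):
    """Convert every .doc file below folder; return the files that failed."""
    failed = []
    for source in sorted(Path(folder).rglob("*.doc")):
        # Assumption: textutil signals a failed conversion with a non-zero exit status.
        result = subprocess.run(textutil_command(source), capture_output=True, text=True)
        if result.returncode != 0:
            failed.append(source)
    return failed
```

Because textutil never opens Word, there are no macro or encoding dialogs to dismiss, so a call such as convert_folder("/path/to/old-docs") (a placeholder path) could run unattended; the list it returns gives the files to retry by hand in Word.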