Solved: Options in automating batch conversion of Word .doc files to PDFs

Really appreciate the above Chris - will give this a bash. Very useful to see all the various options of going about doing this, makes things far easier to learn from!

Yeah, Chris really knows how to help others learn. He’s awesome!

Chris - one more silly question!

I've implemented the Script, and it works a treat.
About that last bit - am I correct in presuming that I save the droplet script in /Folder Wherever - and then, create a KM Macro with a suitable trigger, and have it 'run' that last AppleScript?

In other words, select the Word files over in Finder, fire off the trigger, which then runs that script and calls/hands-over to the Droplet, to do its thing?

I've done that, and it works - just want to check that this is what you had in mind, as opposed to my missing something more obvious... I guess, a further alternative would be implementing it as a Service - but that would actually be more cumbersome, given I'd have to wade through all the already existing Services' options available...

Thanks again for the help!

Hey Neal,

Yes.

Right.

No, you understood me.

If you give your droplet a properly unique name you can even call it by name:

tell application "Finder" to set finderSelectionList to selection as alias list
if length of finderSelectionList = 0 then error "No files were selected in the Finder!"

tell application "Word Document to .PDF Converter"
   open finderSelectionList
end tell

NOTE: I needed to make slight changes to the droplet to accommodate this and have accordingly changed the droplet file above.

-Chris

Fantastic - thanks Chris. This is going to save me hours of time!

Hello, Chris,

I have been searching for a solution to my problem for quite a long time. I need to convert thousands of MS Word .doc files to .docx format (the real XML format, not just adding “x” to the .doc extension). I’m hoping that you will have pity on me and provide me with an altered version of your “Word Document to .PDF Converter” (i.e., “Word Doc to Docx Converter”) that will accomplish the task. Thank you so very much, in advance, for your time and assistance!

This works for me:

In Chris’ script change line 20 to…

set newFileName to (text item 1 of theFileName) & ".docx"

and line 27 to…

save as file format format document file name newFileName

When you open the created docx Word will show “Compatibility Mode”. I don’t know why it does that, I’m not an MS Word expert.

But it is definitely an (XML) docx. To check, open it with BBEdit and you’ll see something like this:

Thanks for the help, Tom!

I’m no MS Word expert, either, but from what I’ve been able to discern, your suggested adjustments to Chris’ script merely change the extension to “.docx,” which is the same as manually adding an “x” to “.doc” in the Finder. In both scenarios, Word opens the file in “Compatibility Mode” because the file is not truly formatted as full XML. Word just thinks that you want it to be in XML format and attempts to fill in the blanks so that people with earlier, incompatible versions of MS Word can also open/read it. In contrast, when I manually “Save As” in Word and choose “Word Document (.docx),” the file gets converted to full XML format (without losing any formatting/tables/images/lists, etc.) and “Compatibility Mode” does not appear.

So, I suppose that what I need the script to do is impersonate the actions of a human being who is manually converting each file in Word:

  1. Open each “Word 97-2004” .doc file.
  2. “Save As” (i.e., convert) each .doc file to “Word Document (.docx).”
  3. CRITICAL: Do not lose formatting/tables/images/lists, etc. upon conversion.
  4. Immediately close each file after saving/converting it.
  5. Repeat the process until all files have been converted.

No! I also thought it was still a doc after the first try, but it’s not the case. That’s why I mentioned the BBEdit test. Look at the screenshot: this is a docx :slightly_smiling_face:

The script calls Word and makes it save the file as format document (line 27; in the original script it’s format PDF), which should be the regular Word document format (docx). To just rename the files you don’t need the AppleScript. But renaming doc to docx makes no sense at all.

PS:

If you don’t trust BBEdit, open it with any hex editor:


Old Word doc has this:
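
For reference, the two formats can also be told apart by their first bytes without a hex editor. A .docx is a ZIP container, so it starts with the bytes 50 4B 03 04 ("PK"), while a legacy binary .doc is an OLE2 compound file starting with D0 CF 11 E0 A1 B1 1A E1. The following is only a minimal sketch: the handler name wordFileKind is made up for illustration, and it assumes the stock /usr/bin/xxd tool is present. Note that any ZIP archive matches the first test, so "docx" here really means "ZIP container, consistent with docx".

```applescript
-- Hypothetical helper (not part of Chris's or Tom's scripts):
-- reads the first 8 bytes of a file as hex and classifies them.
on wordFileKind(theFile)
	-- xxd -p prints plain hex, -l 8 limits the dump to 8 bytes
	set hexBytes to do shell script "/usr/bin/xxd -p -l 8 " & quoted form of POSIX path of theFile
	if hexBytes starts with "504b0304" then
		return "ZIP container (consistent with .docx)"
	else if hexBytes is "d0cf11e0a1b11ae1" then
		return "OLE2 compound file (legacy binary .doc)"
	else
		return "unknown signature: " & hexBytes
	end if
end wordFileKind

-- Usage: pick a file and show the verdict.
set theFile to choose file with prompt "Pick a Word file to inspect:"
display dialog wordFileKind(theFile)
```

A file that was merely renamed from .doc to .docx would still report the OLE2 signature, which is what separates a rename from a real conversion.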

Found it:

Add “without maintain compatibility” to line 27. So the complete line 27 is:

save as file format format document file name newFileName without maintain compatibility

Then it doesn’t show the “Compatibility Mode” when you open the docx.
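
For readers who don't have Chris's original script at hand, here is a rough sketch of what a doc-to-docx droplet built around that save line could look like. This is not Chris's or Tom's actual script: the overall structure, the variable names other than newFileName, and the way the new path is built are assumptions for illustration. It assumes Word for Mac's AppleScript dictionary as used in this thread (save as with file format format document and the maintain compatibility parameter).

```applescript
-- Sketch of a droplet: save as an application, then drop .doc files on it.
on open droppedItems
	repeat with anItem in droppedItems
		set theFile to contents of anItem
		set sourcePath to theFile as text -- HFS path of the dropped file
		-- Only handle files whose name ends in ".doc"; skip everything else.
		if sourcePath ends with ".doc" then
			-- Strip the 4-character ".doc" suffix and append ".docx".
			set newFileName to (text 1 thru -5 of sourcePath) & ".docx"
			tell application "Microsoft Word"
				open theFile
				save as active document file name newFileName file format format document without maintain compatibility
				-- Close before the next file is opened, discarding nothing:
				-- the converted copy has already been written by save as.
				close active document saving no
			end tell
		end if
	end repeat
end open
```

The new file is written next to the original, so an existing .docx with the same name may be replaced; trying it on copies first is the safer route.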

OK, here is the whole script saved as droplet. Just drop your doc files on it.

doc to docx.app.zip (72.9 KB)

It does all of your 5 points with the following limitations:

(1) It accepts only doc files. You can’t drop folders or non-doc files.

(3) The conversion is done by MS Word. That means the script has no influence on the conversion process.
In the past I’ve seen complicated doc documents that looked a bit different as docx documents. If you have such documents, you have to check the result visually. But this is an MS Word problem, and I think you won’t find any better docx converter than MS Word itself.

Looks like Tom cracked it. :smile:

Message me if any further assistance is needed.

-Chris

Thanks for asking, your script is doing fine and it behaves well :wink:

The confusing thing was Word’s Maintain compatibility setting.
When saving a doc as docx

  • in the normal GUI (Word Mac 2011) it defaults to Off

  • whereas via AppleScript it defaults to On
    (can be disabled with without maintain compatibility).

If it was enabled during conversion the document opens with the Compatibility message, which could give you the impression it’s still a binary doc. (Which isn’t the case.)

I have tried to run Tom’s app (thanks a million, BTW), and it does work as desired, maintaining formatting and everything else in the new .docx files that get created. However, I soon discovered that the process is very slow when I attempt to batch-convert 1,000+ files (which is what I need to do, since I have over 54,000 old .doc files). It opens and converts files at a rate of about 1 per 10 seconds. Plus, Word pauses the conversion process whenever it runs into a file that contains Macros or problematic encoding, thereby preventing me from being able to let it run when I go to bed. I have to be physically present in order to click away each “Macros Warning” panel or “Select the encoding that makes your document readable” panel that appears. Is there any way to get around having to “open” each file while still maintaining the critical “no loss of formatting” results? (For instance, in BBEdit, I can batch search-and-replace in 100,000 files without the files actually opening.)

Say Thanks to Chris (@ccstone). I just changed two lines in the script.

I have over 54,000 old .doc files

What!? :bomb::astonished:

I have to be physically present in order to click away each "Macros Warning" panel or "Select the encoding that makes your document readable" panel that appears. Is there any way to get around having to "open" each file while still maintaining the critical "no loss of formatting" results?

Well, you can try it with textutil (command line tool). Type man textutil in the Terminal.

It seems to support doc and docx, but I tend to think that the conversion will be far less accurate than with Word. (I haven’t tried it.)
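
For anyone who wants to try textutil on a single file before deciding, a minimal sketch follows. It wraps the command in AppleScript to stay consistent with the rest of the thread; the variable names are arbitrary. textutil's -convert docx option writes a .docx next to the source file, and since the conversion is done by Apple's text system rather than by Word, complex formatting may well not survive, so run it on a copy.

```applescript
-- Sketch: convert one chosen .doc to .docx with textutil instead of Word.
set sourceFile to choose file with prompt "Pick a .doc file (preferably a copy):"
set sourcePath to POSIX path of sourceFile
-- Equivalent to typing in Terminal: textutil -convert docx /path/to/file.doc
do shell script "/usr/bin/textutil -convert docx " & quoted form of sourcePath
```

Comparing the result side by side with a Word-converted version of the same file is a quick way to judge whether the fidelity is acceptable.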

OK, this is none of my business, but I have to ask:
Do you really need to convert all 54,000 files now? Are you, or someone, going to read them in the near future? Why not put them in a folder that Spotlight can index, and then convert when needed? Please feel free to ignore these questions.

OK, having said that, have you considered:
Pandoc -- a universal document converter

It is designed to work in batch mode. I have NOT used it on this number of files, but based on other reports, I think it is worth considering/testing.

textutil:

I’ve tried it now. Forget it.

pandoc:

Gives me this message:

Pandoc can convert from DOCX, but not from DOC.
Try using Word to save your DOC file as DOCX, and convert that with pandoc.

If that pops up, then Word has a problem that you can’t just ignore. Even if you found a GUI-less, silent way to convert without showing that message, you would end up with converted but faulty documents the next morning.

And there’s little chance that any script or other automated process could select the correct encoding for you.

If you still don't have a solution, and haven't already done so, then you might do an Internet search of "batch convert doc to pdf".
I got many hits, some (many) are for Windows, but if you're using SharePoint then it suggests that someone on your team is using Windows. It might be easier to do this conversion in Windows than on a Mac. I read somewhere that Adobe Acrobat Pro Win could do this type of batch conversion, from .doc to .pdf.

First of all, major thanks to Chris!

Yep; I'll now explain. I have over 54,000 .doc files in which I need to batch search-and-replace certain sections/paragraphs of text before I can legally publish the content of each of those 54,000 files on my Web site. I have two options:

  1. I already have a server-level program that can batch search-and-replace in .docx (XML) files. However, before I can make use of that program, I first need to convert my 54,000 .doc files to .docx.

  2. [PREFERRED] I would love to be able to batch search-and-replace directly in the 54,000 old, original .doc files so that I don't even have to bother with first converting them to .docx. I do, in fact, already have an Automator workflow that individually opens each specified .doc file, searches-and-replaces (using wildcards) as required, saves the files, and then closes the files. The deal-breaker, however, is that it results in MS Word taking the following steps:

A. opens all specified .doc files at once
B. then it begins to edit the files one-by-one
C. then it begins to save the files one-by-one
D. then it begins to close the files one-by-one

It will not begin Step B, Step C, or Step D until the previous step has been completely finished for all specified files. So, if I attempt to batch search-and-replace in -- for example -- 500 files, Step A results in MS Word systematically opening 500 file windows before it will even begin to do the actual work of searching-and-replacing in any one of the 500 files. As you might imagine, attempting to open 500 windows soon causes MS Word to hang and become completely unresponsive, necessitating a force-quit. This workflow would be absolutely perfect if I could alter it in such a way that MS Word would open, search-and-replace, save, and close 1 file at a time.
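
One way to get that one-file-at-a-time behaviour is to drive Word from an AppleScript loop instead of Automator's batch actions, so each document is opened, edited, saved, and closed before the next one is touched. The sketch below shows only the loop structure: the handler name processOneFile is invented for illustration, and the search-and-replace step is left as a comment because the existing Automator workflow's replace logic is not shown in this thread.

```applescript
-- Sketch of a droplet that never has more than one document open.
on open droppedItems
	repeat with anItem in droppedItems
		-- Each call finishes completely before the next file is opened.
		processOneFile(contents of anItem)
	end repeat
end open

-- Hypothetical handler: open, edit, save, and close a single document.
on processOneFile(theFile)
	tell application "Microsoft Word"
		open theFile
		-- The wildcard search-and-replace step would go here,
		-- acting on the active document.
		close active document saving yes -- save changes, then close
	end tell
end processOneFile
```

Because only one window exists at any moment, the 500-open-windows hang should not arise, although Word will still stop on any macro or encoding prompt it raises for an individual file.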