Solved: Options in automating batch conversion of Word.doc to PDFs

OK, this is none of my business, but I have to ask:
Do you really need to convert all 54,000 files now? Are you, or someone, going to read them in the near future? Why not put them in a folder that Spotlight can index, and then convert when needed? Please feel free to ignore these questions.

OK, having said that, have you considered:
Pandoc -- a universal document converter

It is designed to work in batch mode. I have NOT used it on this number of files, but based on other reports, I think it is worth considering/testing.


textutil:

I’ve tried it now. Forget it.

pandoc:

Gives me this message:

Pandoc can convert from DOCX, but not from DOC.
Try using Word to save your DOC file as DOCX, and convert that with pandoc.
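
Since pandoc reads DOCX but not DOC, it can help to check which container each file really uses before queuing it, because a file extension is not always accurate. The following is a minimal Python sketch, not something from this thread: the function names are made up for illustration, and the check only looks at the first bytes of the file. Legacy .doc files are OLE compound files and .docx files are ZIP archives, but other formats share those containers, so a match means "plausibly Word", not "certainly Word".

```python
from pathlib import Path

OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # OLE compound file (legacy .doc, but also .xls/.ppt)
ZIP_MAGIC = b"PK\x03\x04"                       # ZIP local file header (.docx, but also any other zip)


def sniff_container(path):
    """Return 'ole', 'zip', or 'unknown' based on the first bytes of the file."""
    with open(path, "rb") as handle:
        head = handle.read(8)
    if head.startswith(OLE_MAGIC):
        return "ole"
    if head.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"


def group_by_container(folder):
    """Group the .doc/.docx files in a folder by the container they actually use."""
    groups = {"ole": [], "zip": [], "unknown": []}
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and path.suffix.lower() in {".doc", ".docx"}:
            groups[sniff_container(path)].append(path.name)
    return groups
```

Files in the "zip" group are candidates for pandoc; files in the "ole" group would first need to go through Word (or another converter), as the message says.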

If that pops up, then Word has a problem which you can’t just ignore. Even if you found a GUI-less, silent way to convert without showing that message, you would end up with converted but faulty documents the next morning.

And there’s little chance that any script or other automated process could select the correct encoding for you.

If you still don't have a solution, and haven't already done so, then you might do an Internet search of "batch convert doc to pdf".
I got many hits. Many are for Windows, but if you're using SharePoint, that suggests someone on your team is using Windows. It might be easier to do this conversion in Windows than on a Mac. I read somewhere that Adobe Acrobat Pro for Windows could do this type of batch conversion, from .doc to .pdf.

First of all, major thanks to Chris!

Yep; I'll now explain. I have over 54,000 .doc files in which I need to batch search-and-replace certain sections/paragraphs of text before I can legally publish the content of each of those 54,000 files on my Web site. I have two options:

  1. I already have a server-level program that can batch search-and-replace in .docx (XML) files. However, before I can make use of that program, I first need to convert my 54,000 .doc files to .docx.

  2. [PREFERRED] I would love to be able to batch search-and-replace directly in the 54,000 old, original .doc files so that I don't even have to bother with first converting them to .docx. I do, in fact, already have an Automator workflow that individually opens each specified .doc file, searches-and-replaces (using wildcards) as required, saves the files, and then closes the files. The deal-breaker, however, is that it results in MS Word taking the following steps:

A. opens all specified .doc files at once
B. then it begins to edit the files one-by-one
C. then it begins to save the files one-by-one
D. then it begins to close the files one-by-one

It will not begin Step B, Step C, or Step D until the previous step has been completely finished for all specified files. So, if I attempt to batch search-and-replace in -- for example -- 500 files, Step A results in MS Word systematically opening 500 file windows before it will even begin to do the actual work of searching-and-replacing in any one of the 500 files. As you might imagine, attempting to open 500 windows soon causes MS Word to hang and become completely unresponsive, necessitating a force-quit. This workflow would be absolutely perfect if I could alter it in such a way that MS Word would open, search-and-replace, save, and close 1 file at a time.
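
The order described above (open, search-and-replace, save, and close one file before touching the next) can be sketched outside Word for the .docx route in option 1. This is only an illustration in Python under stated assumptions: it is not the server-level program or the Automator workflow mentioned above, the function names are hypothetical, and it treats `word/document.xml` as UTF-8 text. Word often splits a phrase across several XML runs, so a pattern may fail to match text that looks contiguous on screen, and the replacement text must already be XML-safe.

```python
import os
import re
import tempfile
import zipfile
from pathlib import Path

DOCUMENT_PART = "word/document.xml"  # main body part inside a .docx package


def replace_in_docx(path, pattern, replacement):
    """Rewrite one .docx in place and return the number of substitutions made."""
    path = Path(path)
    count = 0
    # Write the edited copy to a temporary file in the same folder first.
    with tempfile.NamedTemporaryFile(suffix=".tmp", delete=False, dir=path.parent) as tmp:
        tmp_path = Path(tmp.name)
    with zipfile.ZipFile(path) as src, zipfile.ZipFile(tmp_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == DOCUMENT_PART:
                text, count = re.subn(pattern, replacement, data.decode("utf-8"))
                data = text.encode("utf-8")
            dst.writestr(item, data)  # every other part is copied unchanged
    os.replace(tmp_path, path)  # swap the edited copy in for the original
    return count


def replace_in_folder(folder, pattern, replacement):
    """Process the .docx files one at a time; each is finished before the next starts."""
    results = {}
    for path in sorted(Path(folder).glob("*.docx")):
        results[path.name] = replace_in_docx(path, pattern, replacement)
    return results
```

Because only one file is open at any moment, the number of files in the folder does not affect how much is held open at once, which is the property the Automator workflow lacks.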

Have you considered/tried writing a simple Word VBA macro that cycles through the .doc files in a given folder one at a time, doing the search and replace, and then saving as PDF?

Via VBA, you can tell Word to auto-confirm and auto-accept changes. If you don't want to keep Word running for too long, then add a timer (or a limit on the number of files) to cap each run.
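
The idea of capping each run by file count or elapsed time can be sketched independently of VBA. The Python below is a hedged illustration only: `run_limited` and its parameters are names invented here, and `work` stands in for whatever processes a single file.

```python
import time


def run_limited(items, work, max_files=None, max_seconds=None, clock=time.monotonic):
    """Process items (a list) in order until a limit is hit; return the unprocessed rest."""
    started = clock()
    done = 0
    for index, item in enumerate(items):
        # Limits are checked before each file, so a file is never cut off midway.
        if max_files is not None and done >= max_files:
            return list(items[index:])
        if max_seconds is not None and clock() - started >= max_seconds:
            return list(items[index:])
        work(item)
        done += 1
    return []
```

The returned remainder can be fed to the next run, so a large job becomes a series of short sessions.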

I've never created or used a VBA macro, but I will sure look into figuring out how to do it! Fingers crossed . . . .

BTW, does it have to save as PDF or can it be DOCX?

The VBA macro can save in any format Word supports.
I thought your objective was to produce PDFs, so why not do it all in one process?

VBA is pretty easy. I'd suggest this:

(1) Do a google search on each of the following:

  • "Word VBA batch update"
  • "Word VBA convert .doc to PDF"
  • "Word VBA Find and Replace"

Most of the hits will be for Windows Word VBA, but it won't matter in most cases.

(2) Record a Word VBA macro doing the steps manually

  • You can then edit this macro to fine tune it.

It was the OP (@Cassady ) who wanted the PDF (June).
@calbear asked for docx (September).

Maybe better to split the thread?

Thanks for the clarification, Tom.

No need to split -- the main issue is dealing with large numbers of .doc files. Whether the output is PDF or .docx is easily handled, at least by Word.

I've been researching and testing the last couple of days, trying my best to create what you envisioned. Unfortunately, I just can't get anything to work as required. Maybe it's because I'm using a Mac and most instructions are for Windows. Or, maybe it's because I'm using a different version (2011) of Word for Mac that conflicts with the Mac instructions that I have found. What's become quite clear to me is that I'm in way over my head. :frowning:

I just did a google search of:

Which quickly led me to this:

Find & ReplaceAll on a batch of documents in the same folder

It was written for Word Windows, so you will need to make some adjustments, particularly the folder paths and file open/save. Test it using a Test Folder with 2 or 3 .doc files (copied from your source).

This macro is on this great web site:
Word Macros and Visual Basic® for Applications FAQ

There are lots of tips on how to use VBA.

If you get stuck, try a google on "Word 2011 Mac VBA file open", for example.
Replace "file open" with whatever task you need help with.
If you still can't make it work, post your Word VBA macro here, and I'll try to help.
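
The advice above to test on a small copied sample can be scripted so the originals are never touched. This is a minimal Python sketch; `copy_sample` and the folder arguments are hypothetical names, not part of the macro or the linked site.

```python
import shutil
from pathlib import Path


def copy_sample(source_folder, test_folder, count=3, pattern="*.doc"):
    """Copy the first few matching files into a test folder and return the new paths."""
    source = Path(source_folder)
    target = Path(test_folder)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(source.glob(pattern))[:count]:
        destination = target / path.name
        shutil.copy2(path, destination)  # copies content and timestamps; the source stays put
        copied.append(destination)
    return copied
```

Pointing the macro at the test folder then limits any damage from a wrong setting to the copies.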

I still can’t get it. I’ve been at my desk literally all day working on this yet again. VBA is completely alien to me, and I can’t get any of the instructions that I have found to work. I don’t even know for sure when/where/how to activate this particular VBA, so “testing” it has proven very difficult. Anyway, following is the code that I have thus far. I changed only the initial “PathToUse” declaration. After that, I could not identify anything else to change.


Option Explicit

Public Sub BatchReplaceAll()

Dim FirstLoop As Boolean
Dim myFile As String
Dim PathToUse As String
Dim myDoc As Document
Dim Response As Long

PathToUse = "/Users/Jason/Desktop/macro-testing"

'Error handler to handle error generated whenever
'the FindReplace dialog is closed

On Error Resume Next

'Close all open documents before beginning

Documents.Close SaveChanges:=wdPromptToSaveChanges

'Boolean expression to test whether first loop
'This is used so that the FindReplace dialog will
'only be displayed for the first document

FirstLoop = True

'Set the directory and type of file to batch process

myFile = Dir$(PathToUse & "*.doc")

While myFile <> ""

   'Open document
   Set myDoc = Documents.Open(PathToUse & myFile)

   If FirstLoop Then

      'Display dialog on first loop only

      Dialogs(wdDialogEditReplace).Show

      FirstLoop = False

      Response = MsgBox("Do you want to process " & _
      "the rest of the files in this folder", vbYesNo)
      If Response = vbNo Then Exit Sub

   Else

      'On subsequent loops (files), a ReplaceAll is
      'executed with the original settings and without
      'displaying the dialog box again

      With Dialogs(wdDialogEditReplace)
         .ReplaceAll = 1
         .Execute
      End With

   End If

   'Close the modified document after saving changes

   myDoc.Close SaveChanges:=wdSaveChanges

   'Next file in folder

   myFile = Dir$()

Wend

End Sub
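
One thing worth checking in the macro above: PathToUse has no trailing separator, so the expressions PathToUse & "*.doc" and PathToUse & myFile glue the folder name directly onto what follows. Because of On Error Resume Next, a failure at that point would produce no message. The short Python sketch below only reproduces the string concatenation to show the result; the helper names are invented for illustration, and whether Word 2011 for Mac accepts this path style at all is a separate question that is not settled here.

```python
def dir_pattern(path_to_use, wildcard="*.doc"):
    """Mimic the macro's concatenation: PathToUse & "*.doc"."""
    return path_to_use + wildcard


def with_trailing_separator(path_to_use, separator="/"):
    """Append the separator only when the folder path does not already end with it."""
    return path_to_use if path_to_use.endswith(separator) else path_to_use + separator


print(dir_pattern("/Users/Jason/Desktop/macro-testing"))
# → /Users/Jason/Desktop/macro-testing*.doc

print(dir_pattern(with_trailing_separator("/Users/Jason/Desktop/macro-testing")))
# → /Users/Jason/Desktop/macro-testing/*.doc
```

The first result names files in the Desktop folder whose names begin with "macro-testing", not the files inside the macro-testing folder.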

@calbear, sorry you're having so much trouble. If you can't make any sense of VBA, then you'll need to get help from someone who does.
With just a little knowledge of VBA, it should be possible to modify the VBA macros available on the internet.

Sorry, but I don't have time right now to take on a project like this.

There are a number of tech support, tech help, programming services web sites available. You might try:

I haven't used it recently, but I did years ago, and it was pretty good.
I believe their basic service is free, at least for a short period (30 days).
Most likely you will be able to find a VBA expert there to help you.

Good luck.

I think I've almost got a solution for batch-converting .doc to .docx, 1-by-1. I now have a script (which I saved as a "service" for the Finder) that works like a charm for converting batches of .doc to .pdf, 1-by-1, without losing any formatting. Here's that script:

property theList : {"doc", "docx"}
on run {input, parameters}
    set output to {}
    tell application "Microsoft Word" to set theOldDefaultPath to get default file path file path type documents path
    repeat with x in input
        try
            set theDoc to contents of x
            tell application "Finder"
                set theFilePath to container of theDoc as text
 
                set ext to name extension of theDoc
                if ext is in theList then
                    set theName to name of theDoc
                    copy length of theName to l
                    copy length of ext to exl
 
                    set n to l - exl - 1
                    copy characters 1 through n of theName as string to theFilename
 
                    set theFilename to theFilename & ".pdf"
 
                    tell application "Microsoft Word"
                        set default file path file path type documents path path theFilePath
                        open theDoc
                        set theActiveDoc to the active document
                        save as theActiveDoc file format format PDF file name theFilename
                        copy (POSIX path of (theFilePath & theFilename as string)) to end of output
                        close theActiveDoc
                    end tell
                end if
            end tell
        end try
    end repeat
    tell application "Microsoft Word" to set default file path file path type documents path path theOldDefaultPath
    return output
end run

However, when I try to edit it for .docx conversion, I get this error:

The action “Run AppleScript” encountered an error.

Here's my edited code for conversion to .docx:

property theList : {"doc"}
on run {input, parameters}
    set output to {}
    tell application "Microsoft Word" to set theOldDefaultPath to get default file path file path type documents path
    repeat with x in input
        try
            set theDoc to contents of x
            tell application "Finder"
                set theFilePath to container of theDoc as text

                set ext to name extension of theDoc
                if ext is in theList then
                    set theName to name of theDoc
                    copy length of theName to l
                    copy length of ext to exl

                    set n to l - exl - 1
                    copy characters 1 through n of theName as string to theFilename

                    set theFilename to theFilename & ".docx"

                    tell application "Microsoft Word"
                        set default file path file path type documents path path theFilePath
                        open theDoc
                        set theActiveDoc to the active document
                        save as theActiveDoc file format format DOCX file name theFilename
                        copy (POSIX path of (theFilePath & theFilename as string)) to end of output
                        close theActiveDoc
                    end tell
                end if
            end tell
        end try
    end repeat
    tell application "Microsoft Word" to set default file path file path type documents path path theOldDefaultPath
    return output
end run

I see no reason why it wouldn't work if I can just figure out what, exactly, the "error" is and fix it. :slight_smile: I don't know how to go about researching the error because there is no error #/name/ID.

Looks like you have a number of duplicate words in this statement.

I think you are going in circles now :wink: As far as I can tell your script does the same as Chris’ script from above, which I packed as doc>docx version in the applet.

I showed you the correct output format in the earlier post:

save as file format format document file name theFilename without maintain compatibility

(You have written “file format format DOCX” which doesn’t exist.)

If you want to save it with Compatibility remove the without maintain compatibility part.

To find the available formats and other options open Word’s AppleScript dictionary with Script Editor.

The critical difference is that the new version converts files 1-by-1, rather than first opening all of the files at once before even beginning to convert the first file (which, given the size of my files, would crash MS Word after about 50 windows).

So, just to clarify, the following script works perfectly to convert batches of hundreds/thousands of .doc files to .docx, without losing any formatting, graphics, tables, etc. I saved it as a "service" (doc2docx1by1) whereby I highlight all files in the Finder that I wish to convert and then select the "doc2docx1by1" service from the Finder's contextual menu.

-- Only files with an extension in this list are converted
property theList : {"doc"}
on run {input, parameters}
    set output to {}
    -- Remember Word's default documents folder so it can be restored at the end
    tell application "Microsoft Word" to set theOldDefaultPath to get default file path file path type documents path
    repeat with x in input
        -- Any error on a single file is swallowed silently so the batch keeps going
        try
            set theDoc to contents of x
            tell application "Finder"
                set theFilePath to container of theDoc as text

                set ext to name extension of theDoc
                if ext is in theList then
                    -- Build the new name: the old name minus its extension, plus ".docx"
                    set theName to name of theDoc
                    copy length of theName to l
                    copy length of ext to exl

                    set n to l - exl - 1
                    copy characters 1 through n of theName as string to theFilename

                    set theFilename to theFilename & ".docx"

                    tell application "Microsoft Word"
                        -- Point Word's default folder at the original's folder so the copy lands beside it
                        set default file path file path type documents path path theFilePath
                        -- Open, save as .docx, and close this file before the loop moves on
                        open theDoc
                        set theActiveDoc to the active document
                        save as theActiveDoc file format format document file name theFilename without maintain compatibility
                        copy (POSIX path of (theFilePath & theFilename as string)) to end of output
                        close theActiveDoc
                    end tell
                end if
            end tell
        end try
    end repeat
    -- Restore Word's original default documents folder
    tell application "Microsoft Word" to set default file path file path type documents path path theOldDefaultPath
    return output
end run
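
The file-name step in the script above (drop the extension, append ".docx") means each source file maps to a target beside it with the same base name. Before a long run, it may be worth listing which targets already exist, since the thread does not say what Word does in that case. The following Python sketch is an assumption-laden illustration, not part of the service: `plan_conversions` is a made-up name, and it only inspects file names without converting anything.

```python
from pathlib import Path


def plan_conversions(folder, source_ext=".doc", target_ext=".docx"):
    """Pair each source file with its target name; separate those whose target exists."""
    planned, skipped = [], []
    for source in sorted(Path(folder).iterdir()):
        # Compare extensions case-insensitively, so "REPORT.DOC" is also picked up.
        if not source.is_file() or source.suffix.lower() != source_ext:
            continue
        target = source.with_suffix(target_ext)
        if target.exists():
            skipped.append((source.name, target.name))
        else:
            planned.append((source.name, target.name))
    return planned, skipped
```

Anything in the skipped list is a file whose converted copy would collide with an existing one, so it can be reviewed by hand before the batch starts.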

The applet opens one file after the other here:
It opens 1 file, saves it as docx, closes it, then proceeds to the next one.

But I’m glad if you have found a version that suits you better :wink:

without losing any formatting, graphics, tables, etc.

Of course. With either script it’s Word that does the conversion.

Does anyone here know if it is even possible to create an “always running until manually stopped” AppleScript that will perform a COMMAND-PERIOD action on any MS Word file that remains open for more than 15 seconds? I am asking this because I am running batch operations on thousands of .doc/.docx files and MS Word “stalls out” fairly often on certain files. Pressing COMMAND-PERIOD immediately fixes the problem and enables the batch operation to proceed again.
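
The watchdog idea in the question above (act when the same document has stayed open for more than 15 seconds, and keep running until stopped by hand) can be sketched as a small polling loop. This Python version is a hedged illustration and has not been tried against Word: `get_open_name` is a hypothetical callback that would have to report the name of the currently open document (or None), and the keystroke is sent through System Events via `osascript`, which needs accessibility permission on macOS. Whether a Command-Period sent this way unblocks Word the same way a real key press does is an assumption.

```python
import subprocess
import time


class StallDetector:
    """Track how long the same document name has been reported as open."""

    def __init__(self, limit_seconds=15.0):
        self.limit = limit_seconds
        self.current = None
        self.since = None

    def observe(self, doc_name, now):
        """Return True when doc_name has stayed open longer than the limit."""
        if doc_name != self.current:
            self.current, self.since = doc_name, now  # a different document: restart the timer
            return False
        if doc_name is None:
            return False  # nothing is open, so nothing can be stalled
        if now - self.since > self.limit:
            self.since = now  # restart the timer so the action is not repeated on every poll
            return True
        return False


def press_command_period():
    """Ask System Events to type Command-Period (macOS only, needs accessibility permission)."""
    script = 'tell application "System Events" to keystroke "." using command down'
    subprocess.run(["osascript", "-e", script], check=False)


def watch(get_open_name, on_stall=press_command_period, limit_seconds=15.0, poll_seconds=1.0):
    """Poll until interrupted (for example with Ctrl-C), calling on_stall for stalled files."""
    detector = StallDetector(limit_seconds)
    while True:
        if detector.observe(get_open_name(), time.monotonic()):
            on_stall()
        time.sleep(poll_seconds)
```

Keeping the timing logic in `StallDetector` separate from the keystroke makes it possible to check the 15-second rule with made-up timestamps before letting the loop send anything to Word.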