Scrubbing Word documents (don't be afraid to get a little dirty)

by Andrew Savikas

Related link: http://ask.slashdot.org/article.pl?sid=05/08/09/170209&tid=215&tid=95&tid=4



This post from Slashdot caught my eye because it hit close to home. (The post asked for tips on cleaning up and standardizing Word files for eventual posting to the Web.) I've been cleaning up Word files for some time now, and it's never the same job twice.

The poster asked for advice, but rather than reply directly on Slashdot and get lost in the crowd, I thought I'd post some tips from the trenches right here. When I say "you", I mean "the guy who asked Slashdot".

  • If you're working with Word 2003 for Windows, consider doing the bulk of the cleanup with XSLT. As Microsoft rolls out its new XML-based file format, you'll be ahead of the game.
    You can very quickly strip out a ton of cruft with just a few XSLT templates.


    For example, the following dozen or so lines of XSLT remove all "direct" formatting from a Word document. (Direct formatting means things like using the "B" button to apply bold to some text, rather than using a character style like "Strong".) This stylesheet (written by Evan Lenz, co-author of the excellent book Word 2003 XML) is taken directly from Hack 97 in Word Hacks:

    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">

      <xsl:template match="@*|node()">
        <xsl:copy>
          <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
      </xsl:template>

      <xsl:template match="w:p/w:pPr/*[not(self::w:pStyle)]"/>

      <xsl:template match="w:r/w:rPr/*[not(self::w:rStyle)]"/>

    </xsl:stylesheet>

    For more on how to do stuff like this, see chapter 10 in Word Hacks.

  • Even if you're not using Word 2003 on Windows, consider using XSLT for a lot of the cleanup. Save the document as HTML, then run it through Tidy, and you'll have a starting point for using XSLT.


  • Also, Word VBA macros are a natural choice for much of the cleanup you described. While VBA is far from the darling of Slashdot readers, it is fairly simple to learn, and the ability to record macros and then edit them is a Good Thing.

    As I'm sure you know, no two manuscripts are the same, so having a broad selection of short, single-purpose utility macros can be a huge time saver. You can even run them on a batch of files from the command prompt, which is very useful. I take a look at a set of files, note which of the two dozen or so utility macros need to be run, then fire up a Ruby script to run them for me.

    In fact, here's a simplified version (no usage message, etc.) of the Ruby script I use. Enjoy:

    # batchmacro.rb
    # Born 3/14/2005
    # Andrew Savikas, O'Reilly Media, Inc.

    # Use win32ole package to drive Word, getoptlong to parse options
    require 'win32ole'
    require 'getoptlong'

    $macros_to_run = ""
    $close_word = false

    # Process command-line options
    # --macro takes a whitespace-separated list of macro names
    opts = GetoptLong.new(
      [ "--macro", "-m", GetoptLong::REQUIRED_ARGUMENT ]
    )

    # Process the parsed options
    opts.each do |opt, arg|
      if opt == "--macro" then
        $macros_to_run = arg
      end
    end

    # Nothing to do if no files were named
    if ARGV.size == 0 then
      exit
    end

    # Get current instance of Word, or launch new one if needed
    begin
      wrd = WIN32OLE.connect('Word.Application')
    rescue WIN32OLERuntimeError
      puts("no instance of Word running ... launching new one. Please wait ...")
      wrd = WIN32OLE.new('Word.Application')
      $close_word = true
    end

    wrd.Visible = true

    # Everything else is a document on which to run macro
    ARGV.each do |file|
      doc = wrd.Documents.Open(File.expand_path(file))
      $macros_to_run.split.each do |macro|
        puts("Running #{macro} on #{file} ...")
        wrd.Run(macro)
      end
      doc.Save()
      doc.Close()
    end

    # Only quit Word if this script was the one that launched it
    if $close_word then
      puts "Closing Word ..."
      wrd.Quit()
    end
    puts "Done."



  • Several Slashdotters suggested using regexps to help clean things up. If you're using Word 2000 or later on Windows, you can use Perl-style regexps right from a Word macro. Of course, if Perl's more your style, go ahead and use Perl from within Word, just like you can do in Emacs or Vim.
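
Whichever language you pick, the regexp approach usually boils down to a list of pattern/replacement pairs applied in order. Here's a minimal sketch in Ruby, assuming the text is already out of Word as UTF-8. The rules themselves and the `scrub` name are illustrative assumptions, not a recommended set; you'd swap in whatever your own manuscripts call for.

```ruby
# Hypothetical cleanup rules: [pattern, replacement], applied top to bottom.
CLEANUP_RULES = [
  [/[ \t]+$/, ""],            # strip trailing spaces and tabs on every line
  [/[ \t]{2,}/, " "],         # collapse runs of spaces or tabs to one space
  [/\n{3,}/, "\n\n"],         # squeeze 3+ newlines down to one blank line
  [/[\u201C\u201D]/, '"'],    # curly double quotes to straight quotes
  [/[\u2018\u2019]/, "'"]     # curly single quotes to straight quotes
]

# Run every rule over the text in order and return the cleaned-up copy.
def scrub(text)
  CLEANUP_RULES.reduce(text) do |result, (pattern, replacement)|
    result.gsub(pattern, replacement)
  end
end
```

Keeping the rules in a plain array makes it easy to add, drop, or reorder them per batch of files, much like picking which utility macros to run.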


In practice, the wide variety in the content and quality of the Word manuscripts processed will likely call for all of these tools. Before most O'Reilly manuscripts (yes, most of them are written in Word -- at the authors' request) hit the shelves, they've been poked, prodded and processed by VBA, Ruby, Perl, sed, and sometimes OpenOffice. The bigger the toolbox, the quicker the job.
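
As one example of the Ruby side of that toolbox, the direct-formatting cleanup done by the XSLT stylesheet above can also be sketched with REXML. This is a rough equivalent under some assumptions, not the method from Word Hacks: it assumes the document was saved as Word 2003 XML, that the `rexml` library is available to your Ruby, and the names `RULES`, `wordml_name`, `strip_element`, and `strip_direct_formatting` are made up for illustration.

```ruby
require 'rexml/document'

WORDML_NS = "http://schemas.microsoft.com/office/word/2003/wordml"

# [parent, properties element] => the one child (the style reference) to keep.
# These mirror the two empty templates in the stylesheet.
RULES = {
  ["p", "pPr"] => "pStyle",
  ["r", "rPr"] => "rStyle"
}

# Local name of a WordML element, or nil for anything else.
def wordml_name(el)
  el.name if el && el.namespace == WORDML_NS
end

# Walk the tree; inside w:p/w:pPr and w:r/w:rPr, drop every child element
# except the style reference. Everything else is left untouched.
def strip_element(el, parent = nil)
  keep = RULES[[wordml_name(parent), wordml_name(el)]]
  el.elements.to_a.each do |child|
    if keep.nil?
      strip_element(child, el)
    elsif wordml_name(child) != keep
      el.delete_element(child)
    end
  end
end

# Parse a WordML string, strip direct formatting, and return the new XML.
def strip_direct_formatting(xml)
  doc = REXML::Document.new(xml)
  strip_element(doc.root)
  doc.to_s
end
```

Like the stylesheet, this only touches properties that sit directly under a paragraph or run, so the formatting stored in the style definitions themselves is preserved.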



How do you get those Word docs in shape?