

The Making of Effective awk Programming

by Arnold Robbins, author of Effective awk Programming, 3rd Edition

Editor's note: Arnold Robbins has been an O'Reilly author for more than eight years, authoring or coauthoring some of its best-selling, most enduring titles, including the sixth edition of Learning the vi Editor, the third edition of Unix in a Nutshell: System V Edition, the second edition of sed & awk, and the second edition of Learning the Korn Shell.

In all that time, he's learned a thing or two about the O'Reilly book production process. But when he was ready to update Effective awk Programming he wanted to use the Texinfo markup language. O'Reilly production prefers authors use the DocBook markup language. Arnold compromised by agreeing to manage the conversion process for the final production of his book. In this article, he chronicles the challenges he faced translating his book from Texinfo to DocBook. The breadth of technical detail and the extensive code examples he provides here offer unique insight into one author's experience working with O'Reilly's book production department to create a book.


O'Reilly & Associates published the third edition of Effective awk Programming in May 2001. The book provides thorough coverage of the awk programming language, as standardized by the IEEE POSIX standard for portable operating system applications. This standard is based on Unix and its utilities. Effective awk Programming also doubles as the user's guide for GNU awk (known as gawk), explaining the extensions and features that are unique to gawk. It includes a wealth of sample programs and library functions that demonstrate good awk programming style.

gawk is the standard version of awk on GNU/Linux and most BSD-based systems. It is also popular on commercial Unix and Windows systems because it has a number of useful extensions, and because it can handle large data sets (records with hundreds or thousands of fields, arrays with thousands of elements) that often cause other implementations to give up. The third edition of Effective awk Programming describes the current version of gawk, 3.1.

The GNU project uses the Texinfo markup language for all of its documentation. Texinfo is a pleasant markup language in which to work. It is semantically driven: you mark up what something is, not how to print it; it allows easy nesting of different constructs; it is not as painful to type as HTML or DocBook XML; and it provides for translation into multiple output formats.

Printed documents may be generated directly from Texinfo input files by using TeX. The Texinfo distribution includes the file texinfo.tex, which is a set of TeX macros that directly implement the Texinfo language, and scripts for running TeX. Other output formats are generated by the makeinfo program, which is a rather large and complicated C program that knows how to produce GNU Info, HTML, and these days, DocBook XML.

The use of Texinfo for Effective awk Programming presented a problem for O'Reilly. Their production process prefers the use of DocBook markup (particularly the XML variant) since it may be used to produce both printed and browsable versions of the same book. (Browsable versions are necessary for the CD-ROM editions of their books, as well as for the Safari Bookshelf.) Furthermore, O'Reilly has a series design used for all their books: the TeX output from texinfo.tex, while reasonable enough, doesn't look anything like an O'Reilly book.

By the time of the initial discussions with O'Reilly, I had produced four O'Reilly books in DocBook SGML, so I was quite comfortable with it. And as the author of gawk.texi, I was also very comfortable with Texinfo. Therefore, because both O'Reilly and I were committed to getting Effective awk Programming published, I promised to manage the conversion from Texinfo into DocBook for the final book production.

I reasoned that since makeinfo could already produce HTML, and since HTML and DocBook are conceptually similar, it shouldn't be that hard to modify the code to generate DocBook. I had worked with the makeinfo source code in the past, so I wasn't scared, even if I was a bit naive.

Delaying the conversion to DocBook until the end had two other related, significant advantages. First, I was able to use the Texinfo version for the technical review, incorporating all the changes from the review into the documentation that would eventually ship with gawk. And second, O'Reilly agreed to do their copy editing on a paper copy of the Texinfo version of the manuscript. I then entered the copy edits into the Texinfo source file, again allowing the distributed version to benefit from O'Reilly's considerable editorial expertise.

(At this point I'd like to pause and acknowledge the significant contributions made by Chuck Toporek, my editor. His comments helped to enormously improve the organization and presentation of the material in the book. Mary Sheehan's copy edits were also very valuable. I learned a lot about good writing during the work on this book.)

Furthermore, Chuck and the rest of the people at O'Reilly bent over backwards to make sure that they complied with the GNU Free Documentation License (FDL), under which the book is published. The final DocBook XML source for the book is available from the O'Reilly Web site. The Texinfo version, of course, is part of the gawk distribution.

Converting to DocBook

Fortunately, I didn't have to write the DocBook changes for makeinfo from scratch. Philippe Martin had done the bulk of this already, and I was able to obtain his patches to the makeinfo source code. His code did the vast majority of what I needed.

Philippe's version generated DocBook SGML. At the time, O'Reilly was moving away from SGML, towards the XML version of DocBook. The differences boiled down mostly to using lowercase for tags, always providing a full closing tag (<emph>whatever</emph> versus <emph>whatever</>), using the trailing-slash version of tags that don't enclose objects (such as <xref linkend="..."/>), and fully quoting all the parameters inside of tags (<colspec colnum="1"/> vs. <colspec colnum=1>).

Also, Philippe's code often generated a single DocBook tag for multiple different Texinfo commands, when in fact DocBook has tags that correspond to the original Texinfo commands. For example, it might produce <literal> for both @command{} and @file{}. This needed to be fixed, so that the generated output would contain separate <command> and <filename> tags. In other words, as much as possible, it was necessary to preserve the semantic-based nature of the Texinfo markup in the generated DocBook.
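
The one-to-one mapping can be pictured as a small lookup table. The sketch below is only an illustration written in awk; it is not the makeinfo C code. The function name texi_inline_to_docbook and the table are hypothetical, and only the three pairs named in this article (@command{}, @file{}, and @var{}) are included. It assumes the braces are not nested.

```shell
#!/bin/sh
# Illustration only: translate a few inline Texinfo commands to DocBook tags.
# The function name and table are hypothetical; this is not the makeinfo code.
texi_inline_to_docbook() {
    awk '
    BEGIN {
        # Texinfo command -> DocBook element, one distinct tag per command
        tag["command"] = "command"
        tag["file"]    = "filename"
        tag["var"]     = "replaceable"
    }
    {
        out  = ""
        rest = $0
        # Find each @name{text}; nested braces are not handled
        while (match(rest, /@[a-z]+\{[^{}]*\}/)) {
            start = RSTART; len = RLENGTH
            item  = substr(rest, start + 1, len - 2)   # "name{text"
            brace = index(item, "{")
            name  = substr(item, 1, brace - 1)
            text  = substr(item, brace + 1)
            if (name in tag)
                piece = "<" tag[name] ">" text "</" tag[name] ">"
            else
                piece = substr(rest, start, len)       # unknown: leave as-is
            out  = out substr(rest, 1, start - 1) piece
            rest = substr(rest, start + len)
        }
        print out rest
    }
    '
}

printf '%s\n' 'Run @command{gawk} on @file{data.txt}.' | texi_inline_to_docbook
```

With the sample line, each Texinfo command comes out as its own element (<command> and <filename>) rather than a shared <literal>, which is the property the real conversion needed to preserve.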

This work was straightforward, and over a week or two, I did the bulk of it, getting makeinfo to the point where it produced a basic DocBook XML version of gawk.texi on which I could do further post-processing.

The current release of Texinfo includes Philippe's original changes, as well as my improvements. Philippe has gone further with the development, and besides DocBook XML, makeinfo can produce a variant of XML that uses a Texinfo DTD that is similar to the DocBook XML DTD. Indeed, most of the reformatting fixes described below are no longer needed with the current version. For further details, see the Texinfo distribution.

Making Usable DocBook

Generating technically correct DocBook markup was just the beginning of the process. While the file might go through an XML parser without any problems, it would still need to be readable, so that O'Reilly's production editors could work with it directly. It also needed to adhere to O'Reilly's markup conventions, such as the id="..." parameter in <chapter> and section tags, and in <xref> tags for cross references. There was still a ways to go.

General Cleanups

First, the makeinfo output needed lots of simple cleanups. Some of these related to anomalies in the output, others to removing Texinfo-specific output features which were better expressed using different fonts in DocBook. The first script, fixup.awk, evolved to handle many of these. This section presents the most interesting of the changes that had to be made.

makeinfo generated some boiler-plate material at the front of the file that wasn't necessary for O'Reilly's DocBook tools. It looks like this:

<!-- This is /home/arnold/ORA/db/gawk.sgml, produced by makeinfo
version 4.0 from gawk.texi.   --><para>
<!DOCTYPE book PUBLIC "-//Davenport//DTD DocBook V3.0//EN">
<title>The GNU Awk User's Guide</title>


Notice that the <para> and </para> tags are misplaced. This early version of makeinfo was over-zealous about wrapping things in paragraph tags. The first part of fixup.awk strips off this leading junk. The first rule looks for the first <chapter> tag and sets a flag when it sees one. The second rule checks the flag: until the flag has been set, the next statement discards the current line and moves on to the next line of input:

#! /bin/gawk -f

# strip leading gunk from file
/<chapter/        { chapter_seen = 1 }
! chapter_seen    { next }
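
To see the two rules in action, here is a small, hedged demonstration. It wraps them in a shell function (the name strip_front_matter and the sample input are made up for illustration) and adds a final { print } rule. The fragment above does not show such a rule, but a complete filter needs one in some form.

```shell
#!/bin/sh
# Demonstration only: the function name and sample input are hypothetical.
strip_front_matter() {
    awk '
    /<chapter/      { chapter_seen = 1 }   # first <chapter> tag sets the flag
    ! chapter_seen  { next }               # before that, discard the line
                    { print }              # assumed: pass everything else through
    '
}

strip_front_matter <<'EOF'
<!-- produced by makeinfo --><para>
<title>The GNU Awk User's Guide</title>
<chapter id="sample">
<title>Sample Chapter</title>
EOF
```

Only the <chapter> line and what follows it survive. Because the flag is set before it is tested, the <chapter> line itself is kept.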

The next bit removes trailing white space (space and TAB characters) and removes leading white space inside lists and examples. The first rule uses the sub() function to remove trailing white space wherever it occurs. (This is needed only because I find such white space gets in the way when editing.)

The in_term variable indicates being inside the terms of a variable list. Inside list item bodies or examples, the stripspaces variable is true (non-zero), so the sub() function removes the leading spaces. The closing tags decrement stripspaces again, so it returns to false (zero) once every open list item and example has been closed:

# strip trailing white space
/[ \t]+$/         { sub(/[ \t]+$/, "") }

# strip leading spaces inside lists
/<listitem>/      { stripspaces++ ; in_term = 0 }
/<\/listitem>/    { stripspaces-- }

# fix up examples
/<screen>/        { in_screen++ ; stripspaces++ }

stripspaces != 0  { sub(/^ +/, "") }

/<\/screen>/      { in_screen-- ; stripspaces-- }
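
As a quick check of how these rules interact, the following sketch runs them over a few made-up lines. The wrapper function strip_whitespace, the sample input, and the final { print } rule are assumptions added for the demonstration; the rules themselves are copied from the fragment above.

```shell
#!/bin/sh
# Demonstration only: the function name and sample input are hypothetical.
strip_whitespace() {
    awk '
    /[ \t]+$/         { sub(/[ \t]+$/, "") }            # trailing white space
    /<listitem>/      { stripspaces++ ; in_term = 0 }
    /<\/listitem>/    { stripspaces-- }
    /<screen>/        { in_screen++ ; stripspaces++ }
    stripspaces != 0  { sub(/^ +/, "") }                # leading spaces
    /<\/screen>/      { in_screen-- ; stripspaces-- }
                      { print }                         # assumed final rule
    '
}

printf '%s\n' '  outside   ' '<screen>' '    inside  ' '</screen>' '  after' |
    strip_whitespace
```

Trailing white space is removed on every line, while leading spaces are removed only between <screen> and </screen>. The lines outside the example keep their indentation.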

The Texinfo command @var{} is used to describe something that is variable, such as a value that the user supplies. It corresponds to the DocBook <replaceable> tag. In an O'Reilly book, <replaceable> items get printed in a Constant Width Italic font. This is entirely appropriate in most contexts, such as within examples, or in lists where items represent a combination of a command and its parameters.

However, O'Reilly conventions indicate that variable items should be in regular italics when used in prose discussion. For example:

<!-- Correctly marked up DocBook XML -->
<literal>ls -l</literal> <replaceable>file</replaceable>
The <command>ls</command> with the <option>-l</option> gives
extra information about <emphasis>file</emphasis>.

The generated DocBook used <replaceable> everywhere. This next bit of code makes the context-sensitive transformation for us:
