

The Making of Effective awk Programming

The First Pass

Top-level section headings within a chapter are often referred to in publishing as "A-level headings," or just "A heads" for short. Similarly, the next-level section headings are "B heads," "C heads," and so on. The variables ah, bh, ch, and dh represent the heading levels; at each level, the variables for the levels below it must be reset to zero. The variable tab holds the current table number within a chapter, and chnum tracks the current chapter. Thus, this first rule sets all the variables to zero, extracts the current id, and computes a new one:

Pass == 1 && /^<chapter/ {
    ah = bh = ch = dh = tab = 0
    oldid = gensub(/<[a-z]+[0-9]? +id="([^"]+)">/, "\\1", 1, $0)
    curchap = sprintf("ch-%d", ++chnum)
    newtag = sprintf("%s-%s", book, curchap)
    tags[oldid] = newtag
}

The next few rules are similar, and handle chapter-level items that aren't actually chapters:

Pass == 1 && /^<preface/ {
    ah = bh = ch = dh = tab = 0
    oldid = gensub(/<[a-z]+[0-9]? +id="([^"]+)">/, "\\1", 1, $0)
    curchap = "ch-0"
    newtag = sprintf("%s-%s", book, curchap)
    tags[oldid] = newtag
}

Pass == 1 && /^<appendix/ {
    ah = bh = ch = dh = tab = 0
    oldid = gensub(/<[a-z]+[0-9]? +id="([^"]+)">/, "\\1", 1, $0)
    applet = substr("abcdefghijklmnopqrstuvwxyz", ++appnum, 1)
    curchap = sprintf("ap-%s", applet)
    newtag = sprintf("%s-%s", book, curchap)
    tags[oldid] = newtag
}

Pass == 1 && /^<glossary/ {
    ah = bh = ch = dh = tab = 0
    oldid = gensub(/<[a-z]+[0-9]? +id="([^"]+)">/, "\\1", 1, $0)
    curchap = "glossary"
    newtag = sprintf("%s-%s", book, curchap)
    tags[oldid] = newtag
}

Next comes code that deals with section tags. The first rule handles a special case. Two of the appendixes in Effective awk Programming are the GNU General Public License (GPL), which covers the gawk source code, and the GNU Free Documentation License (FDL), which covers the book itself. The sections in these appendixes don't have ids, nor do they need them. The first rule skips them.

The second rule does much of the real work. It extracts the old id, and then it extracts the level of the section (1, 2, 3, etc.). Based on the level, it resets the lower-level heading variables and sets up the new id.

The third rule handles tables. Table numbers increase monotonically through the whole chapter and have two-digit numbers:

Pass == 1 && /<sect[1-4]>/ { next }     # skip licenses

Pass == 1 && /^<sect[1-4]/ {
    oldid = gensub(/<[a-z]+[0-9]? +id="([^"]+)">/, "\\1", 1, $0)
    level = substr($1, 6, 1) + 0    # get level
    if (level == 1) {
        sectnum = ++ah
        bh = ch = dh = 0
    } else if (level == 2) {
        sectnum = ah "." (++bh)
        ch = dh = 0
    } else if (level == 3) {
        sectnum = ah "." bh "." (++ch)
        dh = 0
    } else {
        sectnum = ah "." bh "." ch "." (++dh)
    }
    newtag = sprintf("%s-%s-sect-%s", book, curchap, sectnum)
    tags[oldid] = newtag
}

Pass == 1 && /^<table/ {
    oldid = gensub(/<[a-z]+[0-9]? +id="([^"]+)">/, "\\1", 1, $0)
    newtag = sprintf("%s-%s-tab-%02d", book, curchap, ++tab)
    tags[oldid] = newtag
}

The Second Pass

By using -v Debug=1 on the gawk command line, I could debug the code that gathered the old ids and built the new ones. When debugging is enabled, the program simply skips the second pass, reading through the file and doing nothing. More debugging code appears in the END rule, below:

Pass == 2 && Debug { next }

If not debugging, this next rule replaces the old ids in the various tags with the new ones:

Pass == 2 && /^<(chapter|preface|appendix|glossary|sect[1-4]|table)/ {
    oldid = gensub(/<[a-z]+[0-9]? +id="([^"]+)">/, "\\1", 1, $0)
    tagtype = gensub(/<(chapter|preface|appendix|glossary|sect[1-4]|table).*/, "\\1", 1, $0)
    printf "<%s id=\"%s\">\n", tagtype, tags[oldid]
    next
}

The following rule updates cross references. Cross-reference tags contain a linkend="..." clause pointing to the id of the place they reference. Since I knew that linkend= only appeared in cross references, that was all I had to look for. The while loop handles multiple cross references in a single line. The loop body works by splitting apart the line into three pieces: the part before the linkend=, the linkend clause itself, and the rest of the line after it. It then builds up the output line by concatenating the preceding text with the new linkend clause:

Pass == 2 && /linkend=/ {
    str = $0
    out = ""
    while (match(str, /linkend="[^"]+"/)) {
        before = substr(str, 1, RSTART - 1)
        xreftag = substr(str, RSTART, RLENGTH)
        after = substr(str, RSTART + RLENGTH)
        oldid = gensub(/linkend="([^"]+)"/, "\\1", 1, xreftag)
        if (! (oldid in tags)) {
            printf("warning: xref to %s not in tags!\n", oldid) > "/dev/stderr"
            tags[oldid] = "UNKNOWNLINK"
        }
        out = out before "linkend=\"" tags[oldid] "\""
        str = after
    }
    if (str)
        out = out str
    print out
    next
}

Finally, the last rule is the catch-all that prints out lines that don't need updating:

Pass == 2 { print }

The END Rule

The END rule does simple cleanup. The abnormal variable is true if the wrong number of arguments was provided. The if statement tests it and exits immediately if it's true, avoiding execution of the rest of the rule.

It turns out that the rest of the rule isn't that involved. It simply dumps the table mapping of the old ids to the new ones if debugging is turned on:

END {
    if (abnormal)
        exit 1
    if (Debug) {
        for (i in tags)
            printf "%s -> %s\n", i, tags[i]
    }
}

Production and Post-Production

Once the new ids were in place, that was it. Since the O'Reilly DocBook tools work on separate per-chapter files, all that remained was to split the large file up into separate files, and then print them. I verified that everything went through their tools with no problems, and submitted the files to Production.

Production went quite quickly. A large part of this was due to the fact that copy editing had already been done on the Texinfo version. Usually copy editing is done as part of the production cycle.

O'Reilly published the book, and I released gawk 3.1.0 at about the same time. The gawk.texi shipped with gawk included all of O'Reilly's editorial input.

It would seem that all ended happily. Alas, this was mostly true, but one non-trivial problem remained.

A major aspect of book production done after the author submits his files is indexing. While gawk.texi contained a number of index entries, most of which I had provided, this served only as an initial basis upon which to build. Indexing is a separate art, requiring training and experience to do well, and I make no pretensions that I'm good at it.

Nancy Crumpton, a professional indexer, turned my amateur index into a real one. Also, during final production, there were the few, inevitable changes made to the text to fix gaffes in English grammar or to improve the book's layout.

I was thus left with a quandary. While the vast majority of O'Reilly's editorial input had been used to improve the Texinfo version of the book, there were now a number of new changes that existed only in the DocBook version. I really wanted to have those included in the Texinfo version as well.

The solution involved one more script and a fair amount of manual work. The following script, desgml.awk, removes DocBook markup from a file, leaving just the text. The BEGIN block sets up a table of translations from DocBook entities to simple textual equivalents. (Some of these entities are specific to Effective awk Programming.) The specials array handles tags that must be special-cased (as opposed to entities):

#! /bin/awk -f
BEGIN {
    entities["darkcorner"]  = "(d.c.)"
    entities["TeX"] = "TeX"
    entities["LaTeX"]   = "LaTeX"
    entities["BIBTeX"]  = "BIBTeX"
    entities["vellip"]  = "\n\t.\n\t.\n\t.\n"
    entities["hellip"]  = "..."
    entities["lowbar"]  = "_"
    entities["frac18"]  = "1/8"
    entities["frac38"]  = "3/8"
    # > 300 entities removed for brevity ...

    specials["<?lb?>"] = specials["<?lb>"] = " "
    specials["<keycap>"] = " "

    RS = "<[^>]+>"
    entity = "&[^;&]+;"
}

As in many of the other scripts seen so far, this one uses RS as a regular expression that matches tags, with the variable entity holding the regular expression that matches an entity.

The single rule processes records, looking for entities to replace. The first part handles the simple case where there are no entities (match() returns zero). In such a case, all that's necessary is to check the tag for special cases:

{
    if (match($0, entity) == 0) {   # no entities in this record
        printf "%s", $0
        special_case()              # handle the terminating tag
        next
    }

The next part handles replacing entities, again using a loop to pull the line apart around the text of the entity. If the entity exists in the table, it's replaced. Otherwise it's used as-is, minus the & and ; characters:

    # have a match
    text = $0
    out = ""
    do {
        front = substr(text, 1, RSTART - 1)
        object = substr(text, RSTART + 1, RLENGTH - 2)  # strip & and ;
        rest = substr(text, RSTART + RLENGTH)
        if (object in entities)
            replace = entities[object]
        else
            replace = object
        out = out front replace
        text = rest
    } while (match(text, entity) != 0)
    if (length(text) > 0)
        out = out text
    printf("%s", out)
    special_case()                  # handle the terminating tag
}

The special_case() function translates any special tags into white space and handles cross references, replacing them with an upper-case version of the id:

function special_case(  rt, ref)
{
    # a few special cases
    rt = tolower(RT)
    if (rt in specials) {
        printf "%s", specials[rt]
    } else if (rt ~ /<xref/) {
        ref = gensub(/<xref +linkend="([^"]*)".*>/, "\\1", 1, rt)
        ref = toupper(ref)
        printf "%s", ref
    }
}

I ran both my original XML files and O'Reilly's final XML files through the desgml.awk script to produce text versions of each chapter. I then used diff to produce a context-diff of the chapters, and went through each diff looking for indexing and wording changes. Each such change I then added back into gawk.texi. This process occurred over the course of several weeks, as it was tedious and time-consuming.

However, the end result is that gawk.texi is now once again the "master version" of the documentation, and whenever work starts on the fourth edition of Effective awk Programming, I expect to be able to generate new DocBook XML files that still contain all the work that O'Reilly contributed.

Conclusion and Acknowledgements

Translating something the size of a whole book from Texinfo to DocBook was certainly a challenge. Using gawk made the cleanup work fairly straightforward, so I was able to concentrate on revising the contents of the book without worrying too much about the production. Furthermore, the use of Texinfo did not impede the book's production since O'Reilly received DocBook XML files that went through their tool suite, and the distributed version of the documentation benefited enormously from their input.

I would like to thank Philippe Martin for his original DocBook changes and Karl Berry, Texinfo's maintainer, for his help and support. Many thanks go to Chuck Toporek and the O'Reilly production staff. Working with them on Effective awk Programming really was a pleasure. Thanks to Nelson H.F. Beebe, Karl Berry, Len Muellner, and Jim Meyering as well as O'Reilly folk Betsy Waliszewski, Bruce Stewart, and Tara McGoldrick for reviewing preliminary drafts of this article.
