The Making of Effective awk Programming

Fixing Index Entries

The next task was to work on the index entries. The original gawk.texi file already had a number of index entries that I had placed there. makeinfo translated them into DocBook <indexterm> entries, but they still needed some work. For example, occasionally additional material appeared on the same line as the closing </indexterm> tag. More importantly, special characters in the text of an index entry, such as < and >, were not turned into &lt; and &gt; in the generated DocBook. Also, O'Reilly's convention is not to have any font changes in the contents of an index entry. The fixindex.awk script dealt with all of these. The first part handles splitting off any trailing text:

#! /bin/gawk -f

# <indexterm> always comes at the beginning of a line.
# 1. If there's anything after the </indexterm>, insert a newline
# 2. Remove markup in the indexed items

/<indexterm>/   {
    if (match($0, /<\/indexterm>./)) {
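        # "</indexterm>" is 12 characters, so it occupies positions
        # RSTART through RSTART + 11; keep everything through the tag
        # in front, and everything after it in rest.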
        front = substr($0, 1, RSTART + 11)
        rest = substr($0, RSTART + RLENGTH - 1)
    } else {
        front = $0
        rest = ""
    }

If the text of the index entry has font changes in it, the next part extracts the contents of the entry, removes the font changes, and then puts the tags back in:

    if (match(front, /<(literal|command|filename)>/)) {
        text = gensub(/<indexterm>(.+)<\/indexterm>/, "\\1", 1, front)
        gsub(/<\/?(literal|command|filename)>/, "", text)
        front = "<indexterm>" text "</indexterm>"
    }

Looking at this now, sometime later, I see that the removal and restoration of the <indexterm> tags isn't necessary. Nevertheless, I leave it here to show the code as I wrote it then.
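For comparison, here is a minimal sketch of that simpler version, applying the gsub() directly to front. This is my reconstruction, not code from the published script:

    # Simpler: strip the font tags in place; the surrounding
    # <indexterm> tags need not be removed and restored.
    if (match(front, /<(literal|command|filename)>/))
        gsub(/<\/?(literal|command|filename)>/, "", front)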

The rest of the rule deals with index entries for the <, <=, >, and >= operators, converting them into the appropriate DocBook entities. Finally, it prints the modified line and any trailing material that may have been present, and then gets the next input line with next. The final rule simply prints lines that aren't indexing lines:

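    # In each pattern, the leading ">" is the closing ">" of the
    # surrounding tag (such as <primary>), anchoring the match to
    # the start of the entry's text.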
    gsub(/><=/, ">\\&lt;=", front)
    gsub(/>< /, ">\\&lt; ", front)
    gsub(/>>=/, ">\\&gt;=", front)
    gsub(/>> /, ">\\&gt; ", front)
    print front
    if (rest)
        print rest
    next
}


{ print }
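To see the combined effect, here is a made-up sample line run through the script (the entry text and the trailing junk are invented for illustration):

$ echo '<indexterm><primary><command>gawk</command></primary></indexterm>junk' | gawk -f fixindex.awk
<indexterm><primary>gawk</primary></indexterm>
junk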

Fixing Options

As you may have noticed, the scripts have been progressing from larger-scope fixes to smaller-scope fixes. This next script deals with a fine-grained, typographical detail.

In the Italic font O'Reilly uses to represent options, the correct character to use for a hyphen or dash is the en-dash, discussed earlier. This is represented by the DocBook &ndash; entity. Furthermore, gawk's long options start with two dashes, not one. In both the Italic font in the text and in the Roman font in the index, the two dashes run together when printed, making them difficult to distinguish.

This next script solves both problems. It converts plain dash characters to &ndash;, and inserts an &thinsp; between each pair of en-dashes. The &thinsp; entity produces a very small amount of horizontal space; its job is to supply just such a tiny bit of separation between characters. The script works by setting RS to a regular expression matching the text of interest, modifying the matched text that gawk saves in RT, and then printing the record and the new text back out. (Both RS as a regular expression and the RT variable are gawk extensions.)

The <primary> and <secondary> tags only appear inside <indexterm> tags. The <option> tags delimit options in the book's main text:

#! /bin/gawk -f

BEGIN {
    RS = "<(primary|secondary|option)>-(-|[A-Za-z])+"
}

{
    if (RT != "") {
        new = RT
        new = gensub(/--/, "\\&ndash;\\&thinsp;\\&ndash;", "g", new)
        new = gensub(/-/, "\\&ndash;", "g", new)
    } else
        new = ""
    printf("%s%s", $0, new)
}
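Here is a quick, made-up example of the script at work (I'm assuming it's saved as fixoptions.awk; the name is my invention):

$ echo 'Use the <option>--posix</option> option.' | gawk -f fixoptions.awk
Use the <option>&ndash;&thinsp;&ndash;posix</option> option.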

Manual Work

After going through all the above scripts, the book was almost ready for prime time. All my scripts had produced a DocBook XML document that was quite close to what I would have produced had I been entering the text directly in DocBook. It took considerably less effort than if I had tried to convert the text from Texinfo to DocBook using either the sed stream editor or manual editor commands (the colon prompt within vim).

Nevertheless, my Notes file lists a fair number of manual changes that I had to make, things that weren't amenable to scripting. Most of these, though, could be tackled using the vim command line. (Believe me, if I could have fixed these with a script too, I would have. But sometimes there are things that a program just isn't quite smart enough to handle.)

After all of these changes, I was at the final stage. In fact, this was during the technical review stage, and for a brief while before submitting the book to O'Reilly's Production department, I was making edits in parallel in both the Texinfo and the DocBook versions of the book. The main reason was to avoid redoing all the manual edits: it was easier to make a few incremental changes in parallel than to edit only the Texinfo file, regenerate the DocBook, and then reapply every manual change.

Fixing Identifiers

One final transformation was needed before submitting the book to Production. O'Reilly has a standard convention for naming chapters, sections, tables, and figures within the id="..." clause of the appropriate tags. For example, <sect2 id="eap3-ch-3-sect-2.1">. These same identifiers are used in <xref> tags for cross references.

However, makeinfo produced identifiers based on the original names of the @node lines in the gawk.texi file. For example, <sect1 id="How20To20Contribute">. (Here, each space in the original node name is replaced by 20, the hexadecimal value of the space character.) I needed to transform these generated identifiers into ones that followed the O'Reilly convention.

The following script, redoids.awk (re-do ids), does this job. It makes two passes over the input. The first pass extracts the existing ids from chapter, section, and table tags. It maintains the appropriate chapter and section level counts and, using them, generates the correct new id for each item. The first pass builds up a table (an associative array) mapping the old ids to the new ones.

The second pass goes through the file, actually making the substitutions of new id for old. It can't be done all in one pass since there are cross references, both forwards and backwards, scattered throughout the text.

Setting Up Two Passes

The BEGIN block checks that exactly one argument was given, and prints an error message if not. It then sets some global variables, namely, the book name and IGNORECASE, which causes gawk to ignore case when doing regular expression matching:

#! /bin/gawk -f

BEGIN {
    if (ARGC != 2) {
        print("usage: redoids file > newfile\n") > "/dev/stderr"
        abnormal = 1
        exit 1
    }

    book = "eap3"
    IGNORECASE = 1

This next part actually sets up two passes over the input. It first initializes Pass to 1. Next, it adds a variable assignment, Pass=2, to ARGV, and then the input filename, and increments ARGC.

The upshot is that gawk reads through the file twice, with the variable Pass being set appropriately each time through. The code for the two passes then distinguishes which pass is which by testing Pass:

    # set up two passes
    Pass = 1
    ARGV[ARGC++] = "Pass=2"
    ARGV[ARGC++] = ARGV[1]
}
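
The rest of the script's rules tell the passes apart by testing Pass. As a rough sketch of that pattern (the rule bodies here are placeholders, not the actual code from redoids.awk):

# Pass 1: build the mapping from old ids to new ones.
Pass == 1 && /id="/ {
    # extract the old id, update the chapter/section counters,
    # and record the new name: e.g., newid[old] = generated
}

# Pass 2: substitute new ids for old, including in <xref> tags.
Pass == 2 {
    # replace each old id appearing in $0 with newid[old]
    print
}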
