The Making of Effective awk Programming

O'Reilly books use a Constant Width Bold font to indicate user input in examples and a plain Constant Width font for computer output. Texinfo uses only plain Constant Width, distinguishing computer output with a leading glyph, in this case -|. (TeX output uses a similar, but nicer-looking, symbol.) Error messages are prefixed with a different glyph that comes out in the DocBook file as error-->. This next bit removes these glyphs. It also supplies the <userinput> tags for any line whose first character is either $ or > (the latter appears in the DocBook file as the entity &gt;). These represent the Bourne shell primary and secondary prompts, respectively, which are used in printed examples of interactive use:

in_screen != 0 {
    gsub(/-\| */, "")
    gsub(/error--> /, "")
    if (/^(\$|&gt;) /)
        $0 = gensub(/ (.+)/, " <userinput>\\1</userinput>", "g")
}

The gensub() ("general substitution") function is a gawk extension. The first argument is the regular expression to match. The second is the replacement text. The third is either a number indicating which match of the text to replace, or "g", meaning that the change should be done globally (on all matches). The fourth argument, if present, is the value of the original text. When not supplied, the current input record ($0) is used. The return value is the new text after the substitution has taken place.

Here the replacement text includes \\1, which means "use the text matched by the part of the regular expression enclosed in the first set of parentheses." What this ends up doing is enclosing the command entered by the user in <userinput> tags, leaving the rest of the line alone.

Texinfo doesn't have sidebars, which are blocks of text set off to the side for separate, isolated discussion of issues. They are typically used for more in-depth discussion of particular items or for longer examples. In gawk.texi, I got around the lack of sidebars by using regular sections and adding the words "Advanced Notes" to the section title. This next bit of code looks for sections that have the words "Advanced Notes" in their titles and converts them into sidebars. While it's at it, it removes all inline font changes from the contents between <title> and </title> tags, since such font changes are against O'Reilly conventions:

# deal with Advanced Notes, turn them into sidebars
/^<sect/  { save_sect = $0 ; next }

/<title>/ {
             if (/Advanced Notes/) {
                 print "<sidebar>"
                 sub(/Advanced Notes: /, "")
                 in_sidebar = 1
             } else if (save_sect) {
                 print save_sect
                 save_sect = ""
             }

             # remove font changes from titles
             if (match($0, /<title>.+<\/title>/)) {
                 before = substr($0, 1, RSTART - 1)
                 text = substr($0, RSTART + 7, RLENGTH - 15)
                 after = substr($0, RSTART + RLENGTH)
                 gsub(/<[^>]+>/, "", text)
                 print before "<title>" text "</title>" after
                 next
             }
}

/<\/sect/ {
             if (in_sidebar) {
                 print "</sidebar>"
                 in_sidebar = 0
                 next
             }
}

There are three different kinds of dashes used in typography. "Em-dashes" are the width of the letter "m." "En-dashes" are the width of the letter "n"; they are shorter than em-dashes. Plain dashes, or hyphens, are the shortest of all. The makeinfo output represents an em-dash as two dashes. This last chunk turns them into the &mdash; DocBook entity. The change is not done inside examples (! in_screen). The very last rule simply prints the (possibly modified) input record to the output:

/([a-z]|(<\/[a-z]+>))--[a-z]/ && ! in_screen {
    $0 = gensub(/([a-z]|(<\/[a-z]+>))--([a-z])/, "\\1\\&mdash;\\3", "g", $0)
}

{ print }

As mentioned earlier, the first DocBook version of makeinfo generated lots of unnecessary <para> tags. The output had numerous empty paragraphs, and removing them by hand was just too painful. The following simple script, rmpara.awk, strips out empty paragraphs.

This script works by taking advantage of gawk's ability to specify a regular expression as the record separator. Here, records are separated by the markup for empty paragraphs. By setting the output record separator to the null string (ORS = ""), a print statement prints the preceding part of the file.

#! /usr/local/bin/gawk -f
BEGIN {
    RS = "<para>[ \t\n]+</para>\n*"
    ORS = ""
}

And since we're working with paragraph tags, the following small rule puts <para> tags inside lists and index entries on their own lines. This makes the DocBook file easier to work with. The final rule simply prints the record, which is all text in the file up to an empty paragraph:

/(indexterm|variablelist)><para>/ {
    sub(/<para>/, "\n&")
}

{ print }

Fixing Tables

A significant problem, requiring a separate script, had to do with the formatting of tables. The Texinfo @multitable ... @end multitable construct translates pretty directly into a DocBook <table>. However, the formatting of the output, while fine for machine processing, was essentially impossible for a human to work with directly. For example:

<table> <title></title> <tgroup cols="2"><colspec colwidth="31*">
<colspec colwidth="49*"> <tbody> <row>
<entry><literal>[:alnum:]</literal> </entry> 
<entry> Alphanumeric characters.  </entry> </row><row> <entry>
<literal>[:alpha:]</literal> </entry> <entry> Alphabetic characters.  
</entry> </row><row> <entry><literal>[:blank:]</literal> 
</entry> <entry> Space and tab characters.  </entry> </row><row> 
<entry> <literal>[:cntrl:]</literal> </entry> <entry> Control 
characters.  </entry> </row></tbody> </tgroup> </table>

Each row in a table should be separate, and each entry (column) in a row should have its own line (or lines). For this, I wrote the next script, fixtable.awk. It is similar to the rmpara.awk script, in that it uses a regular expression for RS. This time the regular expression matches DocBook tags. Thus the record is all text up to a tag, and the record separator is the tag itself plus any trailing white space.

The associative array tab (for "table") contains all the table-related tags that should be on their own lines. The <colspec> tag contains parameters, thus it does not have the closing > character in it:

#! /bin/gawk -f

BEGIN {
    RS = "<[^>]+> *"
    tab["<table>"] = 1
    tab["<colspec"] = 1
    tab["<tbody>"] = 1
    tab["<tgroup"] = 1
    tab["</tgroup>"] = 1
    tab["</tbody>"] = 1
    tab["<row>"] = 1
    tab["</row>"] = 1
}

gawk sets the variable RT (record terminator) to the actual text that matched the RS regular expression. Any trailing white space in RT is saved in the variable white, and then removed from RT. This is necessary in case the tag in RT isn't one for tables. Then the white space has to be put back into the output to preserve the original file's contents:

    # remove trailing white space
    # gensub returns the original string if the re doesn't match
    if (RT ~ / +$/)
        white = gensub(/.*>( +$)/, "\\1", 1, RT)
    else
        white = ""
    sub(/ +$/, "", RT)

This next part does the work. It splits RT around white space. (This is necessary for the <colspec> tag.) If the tag is in the table, we print the preceding record, a newline, and then the whole tag on its own line. <entry> tags are printed on their own lines. Finally, any other tags are printed together with the preceding record, without intervening newlines, and with the original trailing white space:

    split(RT, a, " ")
    if (a[1] in tab)
        printf("%s\n%s\n", $0, RT)
    else if (a[1] == "<entry>")
        printf("%s\n%s", $0, RT)
    else
        printf("%s%s", $0, RT white)

The result of running this script on the above input is:


<table> <title></title>
<tgroup cols="2">

<colspec colwidth="31*">

<colspec colwidth="49*">

<tbody>
<row>
<entry><literal>[:alnum:]</literal> </entry>
<entry>Alphanumeric
characters.  </entry>
</row>

<row>
<entry><literal>[:alpha:]</literal> </entry>
<entry>Alphabetic
characters.  </entry>
</row>

<row>
<entry><literal>[:blank:]</literal> </entry>
<entry>Space and
tab characters.  </entry>
</row>

<row>
<entry><literal>[:cntrl:]</literal> </entry>
<entry>Control
characters.  </entry>
</row>

</tbody>
</tgroup>
</table>
Although there are still extra newlines, at least now the table is readable, and further manual cleaning up isn't difficult.
