The Mind Behind gawk: An Interview with Arnold Robbins, the Author of Effective awk Programming, 3rd Edition
by Chuck Toporek
Toporek: Before we get rolling, tell us a bit about yourself. What was your first experience with a Unix system like?
Robbins: I am just over 40, with a wonderful wife and four beautiful children--three girls and a boy. I was born and raised in Atlanta, Georgia. I attended college at Yeshiva University in New York, and spent two years studying in a yeshiva--an Orthodox Jewish seminary--in Jerusalem. I returned to Atlanta and did my master's in computer science at Georgia Tech. I've worked in universities and startup companies, mostly developing software, but one of my university stints was teaching continuing education classes in Unix, C, and TCP/IP networking. These days I work on O'Reilly books and do contract programming, and I also work on gawk as time permits.
As for my first experience with Unix, I was initially exposed to it as a senior in college on a PDP-11 running a commercial version of V6 Unix--Interactive's IS/1, if I remember the name correctly. From then on I was hooked. Unix was so far in advance of everything else available at the time. In grad school we used 4.1 BSD, and I did some contract work for Southern Bell using USG Unix 4.0--the equivalent of System IV, which was never released outside the Bell system. Since then, I've had a lot of experience with straight BSD (v4.2, v4.3), straight System V (Releases 1 to 4), and the various commercial mixtures: DEC Ultrix, SunOS, Solaris, HP-UX, and IBM's AIX. These days I use GNU/Linux (Red Hat 7.x). One of these days, I'd love to play with Plan 9 from Bell Labs.
Toporek: What led you to use awk?
Robbins: In late 1987 I picked up the gray book on awk by Aho, Kernighan, and Weinberger [The AWK Programming Language]. I knew that awk was powerful, but until then it was only minimally, very minimally, documented. I already knew that anything written by Brian Kernighan is worth reading, usually more than once. I'd always wanted to learn awk, and with a book available, here was my opportunity. I was working as a system administrator at the time, and the language really met the needs I had for the text processing kinds of things that were my day-to-day tasks.
Toporek: The release of Effective awk Programming, 3rd Edition, marks your sixth book published by O'Reilly. What first brought you to O'Reilly and what's your experience been like thus far?
Robbins: Circa 1995, I had been working on gawk and its documentation for a while, and I was trying to find a way to get the documentation published. I had read the first edition of sed & awk, and it had a number of, shall we say, inaccuracies. I had also gone to graduate school with and knew Eugene Spafford (coauthor of Practical UNIX & Internet Security, 2nd Edition), and asked him for his contact at O'Reilly, with the idea that maybe they would split the book into two, with the gawk documentation simply replacing the awk half of sed & awk. Needless to say, O'Reilly wasn't interested in this idea, but the editor I talked with asked me to mark up a copy of sed & awk with what needed changing. I agreed to do this in exchange for a set of the O'Reilly books on X. I marked up the book, got my X books, and I thought that was the end of the story.
Sometime later, Gigi Estabrook, the then-new update editor, contacted me about revising sed & awk, since the original author didn't want to do it. Although I didn't expect to make a lot of money from it, I agreed to do it, since I knew that O'Reilly sold a fair number of those books, and it was a disservice to the reader community to have it be inaccurate. (I was surprised and pleased when that first royalty check came in.)
In 1997, when we made the decision to move to Israel, I talked with Gigi about doing more books, since I was planning to take a laptop and knew I wouldn't be working right away. At that point, I signed contracts to update Learning the vi Editor and Unix in a Nutshell. The pocket references followed.
Overall, obviously, my experiences have been positive, and in general doing each book gets easier. I'm looking forward to working with O'Reilly for a long time to come.
Toporek: So what led you to take on a development role with gawk, the GNU implementation of awk?
Robbins: After reading the original AWK Programming Language book, I went looking to see if the GNU project had a version, since I like playing with source code. I found that their version barely implemented the original version of awk. Being interested in programming language implementation, and also single (and therefore with lots of spare time), I asked the FSF if I could work on it to bring it up to compatibility with the awk described in the book. It turned out that David Trueman had already volunteered to do that, so the FSF pointed me to him. He and I worked together for a number of years. Although he was definitely the project lead, we complemented each other pretty well in terms of work on gawk. Around 1994, he bowed out, being too busy to work on it, so I took over the lead.
It has been both a lot of work and a lot of fun. I have corresponded with many, many people, and it is very satisfying to see my software shipped with millions of GNU/Linux and *BSD systems. I like to joke that I single-handedly support millions of customers, in my spare time. The truth, though, is that I have a team of portability testers for different Unix systems, and port coordinators for non-Unix systems, whose help and feedback have made a huge and positive difference in the quality of the program. And the email feedback about gawk and its documentation is almost always positive, which is also very gratifying.
It has also led me to valuable friendships with Brian Kernighan of Bell Labs, who maintains The One True awk, and with Michael Brennan, the author of mawk. I value both of these friendships very much, and Michael was kind enough to write the foreword to Effective awk Programming, 3rd Edition.
Toporek: Obviously, you think it's important for people to program effectively with awk and gawk, but what's so special about them that should drive people to use them?
Robbins: The simplicity of the data-driven programming paradigm is the most attractive thing about awk. You can focus on what you want to do, with awk doing most of the housekeeping work.
The "no arbitrary limits" coding principle for GNU programs is also important and valuable these days. Where many vendor versions of awk fall over and die when there are too many fields in a record, or too many bytes in a record (for any value of "too many"), gawk will just keep on plugging away.
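[Editor's illustration, not part of the interview.] The data-driven style described above can be sketched in a few lines. This is a minimal example that assumes whitespace-separated input with a name in the first field and a number in the second; the shell function name sum_by_key is made up for illustration. awk reads each record and splits it into fields on its own, so the program only states a pattern and what to do when it matches.

```sh
# Hypothetical helper: total the second column for each key in the first column.
# Reads the files named as arguments, or standard input if there are none.
sum_by_key() {
    awk '
        # Pattern-action rule: runs for every record with at least two fields.
        NF >= 2 { total[$1] += $2 }

        # END rule: runs once, after the last record has been read.
        END { for (key in total) print key, total[key] }
    ' "$@" | sort    # "for (key in total)" has no defined order, so sort the result
}
```

For example, `printf 'alice 3\nbob 5\nalice 4\n' | sum_by_key` prints "alice 7" and then "bob 5". Opening the input, looping over lines, and splitting fields are all left to awk, and the rules use only standard awk features.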
Toporek: Why would someone use awk or gawk over, say, Perl or Python? Can you give an example where awk or gawk can do the job more effectively?
Robbins: Your question is good tinder for a flame war. I don't claim to be a Perl or Python expert, and I do appreciate the power and flexibility of those languages. I am a firm believer that one should use the right tool for the job, and if awk isn't the right tool, then you shouldn't contort yourself to use it.
All that said, let's answer your question. There are several reasons, which I'll highlight in the following list:
Availability: A version of new awk is available on every commercial Unix system and all the free ones. A carefully written awk program can be used with no problems on virtually every Unix system in the world. By contrast, you have to download and install Perl and Python.
Learnability: awk is a small language, which makes it very easy to learn; you can come up to speed with it pretty quickly, at least if you have some programming background.
Simplicity: For the class of problems where awk is the right tool, the programs are generally simple, compact, and readable, not to mention straightforward to write.
Speed: I have had people tell me that gawk is faster than Perl 5 on programs that do the same thing. mawk is usually an order of magnitude faster than gawk, so I bet it would really give Perl a run for its money. (I will admit, though, that I haven't done any hard tests, so the Perl and Python fans out there can feel free to take this with a grain of salt. Just don't email me any flames, please.)
Elegance: I once heard a nice definition of this term: "Power cloaked in simplicity." For its class of problems, awk is elegant.
Toporek: Can you tell us a bit about how this book has evolved over the years? Wasn't it once published by SSC (Specialized Systems Consultants)?
Robbins: I started working on a 90-page draft manual on gawk from the FSF around 1989. It has grown and evolved since then. Around 1995, I had had a lot of positive feedback on the book and I was looking to get it published. At the time, no one wanted to touch a book that they didn't have exclusive rights to. However, I had a relationship with SSC, and Phil Hughes, SSC's president, read the book and agreed to publish it as a "this needs to be done for the benefit of the world" kind of thing. Fortunately, SSC recovered its costs and even made some money on its editions of the book, and so did I.
In late 1999 or early 2000, SSC decided to get out of the book business. By then, I had a nice working relationship with O'Reilly, and O'Reilly generously agreed to pick up the existing stock of the SSC edition and sell it through their Web site and other channels. By then, O'Reilly's view of freely available documentation had changed and they were interested in working with me to publish the third edition. This was great, as it gave me an excuse to really focus on gawk development for a while and finally get gawk 3.1 out the door.
Toporek: Since Effective awk Programming was originally written in Texinfo, what sort of challenges did you face as author, as well as maintainer of the documentation, to bring this latest edition to print?
Robbins: O'Reilly agreed to do the editorial work (content editing, book organization, and copyediting) on paper from the Texinfo version of the book. That way all that work would be donated to the version that ships with gawk. I am deeply grateful for this; the improvements were legion. I took the marked-up hard copy and applied all the changes, and only then began work on the conversion to DocBook XML, which is one of O'Reilly's preferred formats for book production.
I didn't start from scratch, thank heaven. The GNU makeinfo program turns Texinfo into either Info or HTML. Phillipe Martin had modified makeinfo to generate DocBook SGML output. Starting with his modifications, I did further work to make the output more XML-compatible. I then gradually wrote about six awk scripts to post-process the generated DocBook so that it was more readable, and in line with the conventions expected by O'Reilly's production department. This was something of an iterative process.
Once I got a version of the book that would go through O'Reilly's DocBook tools, I printed the whole thing out and read through it again, so that I could fine-tune the DocBook.
For more information about DocBook, don't miss O'Reilly's DocBook: The Definitive Guide.
Toporek: That's a lot of work. About how much time do you think it took for you to generate usable DocBook XML files from the original Texinfo?
Robbins: Well, I had started work on developing the conversion tools while the book was still being edited and reviewed. Most of the work came at the end. I'd say it took about four weeks, with the last two being almost full-time on the process. For a while I was making corrections in both the Texinfo and the DocBook versions in parallel. Fortunately, all the work paid off; the production process for this book was the smoothest yet.
Toporek: This book is also published under the FSF's Free Documentation License. For those who aren't familiar with the FDL, tell us what this means to you, not only as the maintainer of the documentation for gawk, but also as an author who makes part of his living from print publishing.
Robbins: The FDL requires the publisher of any document licensed under it to make the source code for the document available. This explicitly means the document must be in a form that an end user can easily modify, such as TeX, Texinfo, troff, or DocBook, and not in something difficult to work with, such as PostScript or a proprietary word processor format. This license does for documentation what the GPL does for software. End users can always get hold of the electronic version of a document and modify it to suit their needs. Most, if not all, publishers suffer book piracy (not to mention competition), just as software vendors do. Thus, this sort of license is anathema to them. (Why should I publish this book if Joe OtherPublisher down the block can do it too?)
With respect to this book being published under the FDL, I'm wearing two hats. Ideologically, it's important for the Free Software Movement to see publishers willing to take a risk at publishing something that isn't exclusively their book. For publishers, O'Reilly is setting an example that maybe publishing free documentation (free in the FSF sense) isn't so risky after all.
For me as an author, it's a chance to see my book reach a wider audience and make a reasonable income from it at the same time. Without stepping too much on Eric Raymond's toes, I'll just say that Free Software and Free Documentation for their own sake are important and my primary motivators. But with a family and a mortgage, the income is nothing to sneeze at. At the same time, it helps set an example for authors of free software documentation that "they can have their cake and eat it too." If publishers will pay royalties on free documentation, a potential author doesn't have to be scared into writing a proprietary book.
I know of another publisher that does more or less all of its business from free documentation. Alas, that publisher wouldn't even send me a complimentary copy of my own book. They shall, of course, remain nameless here. By contrast, O'Reilly not only contributed significant editorial resources to Effective awk Programming, 3rd Edition, but they also donate a percentage of net sales income directly to the FSF.
Toporek: What are some of the new features in gawk 3.1 that you think deserve the most attention?
Robbins: Since it's been five years between releases, there are a bunch of them. Here's a brief list; the Online Sample Chapters at the O'Reilly Web site describe these in a little more detail. The full story, of course, is in the book.
Internationalization facilities at the awk language level.
The C code for gawk itself has been internationalized. gawk is the first GNU program to ship with a Hebrew translation of its messages. So far, that's the only translation I have. (I didn't do the translation, though. It was done by Eli Zaretskii, who has done a lot of work with the DJGPP tools.)
Two-way communications with subprocesses via the |& operator.
Profiling of awk code.
Array sorting and bit-manipulation functions.
Adding dynamically loaded built-in functions.
There are a host of other, smaller extensions as well.
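[Editor's illustration, not part of the interview.] Three of the features listed above can be tried in a short sketch: array sorting, the bit-manipulation functions, and two-way communication through the |& operator. It assumes that gawk 3.1 or later is installed as `gawk` and that a standard sort utility is on the PATH; the shell function name gawk31_demo is made up for illustration.

```sh
# Hypothetical demo of a few gawk 3.1 extensions; these do not work in other awks.
gawk31_demo() {
    gawk 'BEGIN {
        # asort() sorts the values of an array and renumbers them 1..n.
        split("pear apple fig", fruit, " ")
        n = asort(fruit)
        for (i = 1; i <= n; i++)
            print fruit[i]

        # Bit-manipulation functions work on integer values.
        print and(12, 10), or(12, 10), lshift(1, 4)

        # |& opens a two-way pipe to a coprocess, here the sort utility.
        cmd = "sort -n"
        print 30 |& cmd
        print 4 |& cmd
        close(cmd, "to")    # close only the write end so sort sees end of input
        while ((cmd |& getline line) > 0)
            print line
        close(cmd)
    }'
}
```

Calling `gawk31_demo` prints the sorted names (apple, fig, pear), then the line "8 14 16", then the numbers 4 and 30 as returned by the coprocess. Closing only the "to" end of the pipe matters here: sort produces no output until its input ends, so reading back first would block.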
Toporek: If you had only 50 words to use to convince someone to use awk and gawk, what would you say to them?
Robbins: Hmm. I guess I'd say: Use the awk language if you have a modest-size text-manipulation task that you want to solve without a lot of fuss or learning curve, and where you don't need every last ounce of performance. Use gawk if your vendor awk dies under stress, or if you need its features. If you're hooked on awk but want speed, use mawk.
Toporek: Actually, that's 66 words, but who's counting? Thanks for taking time out of your busy schedule to chat with us about awk and your book. I'm sure we'll be seeing more from you in the future.
Robbins: You're welcome. Doing the interview was fun. I hope the oreilly.com Web site readers get something out of it too.