

Perl Enters the Oxford English Dictionary

by Howard Wen

Perl figures heavily in the production of a historic institution--the Oxford English Dictionary. Oxford University Press, the publishing arm of the University of Oxford and the publisher of the dictionary, has used Perl for a number of years. Its use of Perl has ranged from preparing texts for typesetting to producing language parsers for both English and foreign languages. "We have found it to be a programming language that is both powerful and approachable," says Dan Barker, one of the Oxford English Dictionary's main programmers in the U.K.

In the dictionary's North American offices, Perl has been put to use by a neophyte. "I am not a programmer--I am a dictionary editor. Nonetheless, I was able to learn enough Perl in a short time to develop extremely useful tools," says Jesse Sheidlower, the Principal Editor of the Oxford English Dictionary's North American editorial unit in Connecticut. This is a new office, which Sheidlower is in the process of putting together, so he doesn't have certain resources at the moment--including programmers on staff.

Jesse Sheidlower, Oxford English Dictionary editor,
with some of his favorite reference books.

The dictionary staff in the United Kingdom uses Perl in-house to serve SGML-compliant tagged text that parses cleanly against a DTD and is also suitable for conversion to HTML. "Although the dictionary text is held in tagged electronic form, for various reasons it is not fully SGML-compliant," says Barker, "and the markup was designed a number of years ago for editorial use, not with Web publishing in mind."

In very broad terms, he and his staff have found Perl works best when dealing with the contents of elements, while an SGML-aware tool is often better at handling the elements themselves.

"We needed routines to get us to simplified SGML markup, and we now have a suite of applications (both Perl and non-Perl) which perform this role," he says. Ironically, Perl's default "ignorance" of SGML (how much Perl is aware of SGML without the SGML libraries) was helpful in achieving this goal. "We did use an SGML-aware conversion tool in conjunction with Perl," Barker says. "But for processing the text at the points where it was not in a state to parse against a DTD, it was essential to be able to treat it as a text stream, not defined markup."

For Sheidlower, Perl's ease of use is what enabled him to get major tasks accomplished, and he counts that ease of use, rather than any particular module, as the most indispensable part of Perl for his work. "The Oxford English Dictionary is a tremendous text-based document, and Perl is perfectly suited for dealing with text. It's an ideal match," he says. "Perl has no problem processing everything we need to handle, and it's easy enough to use that we can get things done quickly without having to spend months developing things. Really, it's perfect."

The dictionary and all of its production apparatus are based on SGML. Part of this production process is a reading program, in which volunteer participants around the world read through a variety of texts and send the editors interesting uses of words that they suggest for possible inclusion in the dictionary.

The database for this reading program is SGML-based, and everything that the volunteers read needs to be properly coded. "This can be a real drag, since we have hundreds of tags and thousands of character-entity references to handle the many different aspects of a word and its history, bibliography, etc. Still, it's necessary," Sheidlower says.


In the past, the Oxford English Dictionary editorial staff would have paid keyboarders to type in all this text and mark it up properly, but this was a very slow and expensive process. More recently, the staff set up a basic template in Microsoft Word to help their volunteers type in their information. However, as Sheidlower says, the volunteers still had to deal with eye-glazing SGML tags, and still had to learn hundreds of pages of instructions in order to use the template effectively. Many volunteers were unwilling to do this.

To solve this problem, Sheidlower wrote a program in Perl to run the reading program on a Web page. All the basic bibliographical fields can be filled in, and the volunteer reader can enter quotes in a very simple-to-use form. When the form is completed, the program generates a properly coded SGML file. "Thus our readers are given a comfortable way to enter their material, and we are guaranteed perfectly formatted files with no typos in the tags," Sheidlower says.
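
The form-to-SGML step might look roughly like the sketch below. It is written in Python for illustration and is not Sheidlower's Perl program, whose source the article does not include. The field names, the `<citation>` wrapper, and the small `ENTITY_MAP` are assumptions made up for this example.

```python
# Hypothetical entity table; the real database uses thousands of
# character-entity references, per the article.
ENTITY_MAP = {"&": "&amp;", "<": "&lt;", ">": "&gt;", "é": "&eacute;"}

# Hypothetical bibliographical fields, in output order.
FIELDS = ("author", "title", "date", "quote")


def escape(value: str) -> str:
    """Replace special characters one at a time, so nothing is escaped twice."""
    return "".join(ENTITY_MAP.get(ch, ch) for ch in value)


def form_to_sgml(form: dict) -> str:
    """Turn submitted form fields into a tagged record with matched tags."""
    lines = ["<citation>"]
    for field in FIELDS:
        value = form.get(field, "").strip()
        if value:  # skip fields the reader left blank
            lines.append(f"<{field}>{escape(value)}</{field}>")
    lines.append("</citation>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(form_to_sgml({"author": "A & B", "quote": "x < y"}))
```

Since the program writes every tag itself, a reader's typing can never produce a misspelled or unclosed tag, which is the guarantee Sheidlower points to.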

Overall, Perl has been more than just a time-saver for him: "If I had to hire an outside programmer at great expense, this wouldn't have gotten done. But [with Perl] I could do it myself, saving vast amounts of time, money, and hassle, and enabling us to focus on the goal of analyzing text, instead of debugging programs. And that has made all the difference."

Read the New York Times article on Jesse Sheidlower, an editor who has "spent his career tracking with equanimity the ceaseless mutation of the American language, often into zones that its stuffier defenders have scorned." (The NY Times requires a one-time registration.)
