Standards Update: The Voice Browser

by David Sims

Ed note: My thanks to Jim Larson of Intel, who worked with me on this article and helped me understand the various markup languages. Some of this article is drawn from Jim's presentation.

A voice browser is a software application that works with various markup languages to interpret voice input and generate voice output.

There are several reasons to create a voice browser. First, many more people have access to telephones and wireless phones than to desktop browsers, so a voice browser would enable these people to access Web data. Second, we expect to witness a dramatic rise in the number of devices that will connect to the Net from places other than the desktop. In addition to phones, these will include palm-sized devices, remote-control devices, pagers, handheld devices, and devices embedded into household appliances. Another way of saying that is, "from input devices other than a keyboard."

The voice browser effort aims to build a way to access Web content (marked up in XML) with voice commands -- which will come in handy where hands-free access is needed, such as on wireless phones and in automobiles.

A W3C (World Wide Web Consortium) working group on the voice browser has formed and is working within the consortium's user interface area. The group includes members from 21 companies who have a vested interest in the final product, including AT&T, British Telecommunications, France Telecom, General Magic, Lucent, Intel, Philips, IBM, Motorola, and Nokia.

W3C on Voice

For more information, see the W3C's Voice Browser Working Group's page.

Possible applications for a voice browser include

  • accessing business or public information over the phone or through a kiosk,
  • controlling smart appliances and security systems in a home or work environment via a remote-control device containing a microphone and speaker,
  • controlling and navigating a PDA (personal digital assistant), and
  • sending and receiving e-mail over the phone.

Five Markup Languages

The Voice Browser committee is actually working on five markup languages:

Grammar Markup Language

GML works close to the input side of the system, where the first sense is made of what the user is saying. It tells the speech recognition system which words to listen for, so that the recognizer can discard meaningless filler (such as "um," "ah," and "y'know").
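A grammar for a simple ordering task might look something like the sketch below. The element and attribute names here are invented for illustration; the working group had not yet published its actual syntax.

```xml
<!-- Hypothetical grammar sketch: element and attribute names
     are invented for illustration, not from a published draft. -->
<grammar name="coffee-order">
  <!-- The words the recognizer should listen for. -->
  <rule name="drink">
    <choice>
      <item>coffee</item>
      <item>espresso</item>
      <item>latte</item>
    </choice>
  </rule>
  <!-- A phrase pattern built from the rule above. -->
  <rule name="order">
    <item>I would like a</item>
    <ruleref name="drink"/>
  </rule>
</grammar>
```

Anything the user says that falls outside these rules -- the "um"s and "ah"s -- can simply be ignored by the recognizer.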

Natural Language Semantics Markup Language

The NLSML relies on natural language processing techniques to extract pieces of information from spoken sentences. It is used to represent the meaning of spoken sentences, including the objects to which pronouns refer. A hallmark of sophisticated natural language systems is robustness: the ability to understand (or attempt to understand) pieces of information in an utterance even when the system can't understand the whole.
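A semantic result for a partly understood utterance might be represented along these lines. This is a hypothetical sketch; all element names are invented for illustration.

```xml
<!-- Hypothetical semantic result: names invented for illustration. -->
<result>
  <utterance>um, I'd like to fly to Boston on Friday</utterance>
  <interpretation confidence="0.85">
    <destination>Boston</destination>
    <date>Friday</date>
    <!-- The departure city wasn't understood. A robust system
         still returns the pieces it did recognize and can ask
         a follow-up question for the rest. -->
  </interpretation>
</result>
```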


Figure 1. Speech into the voice browser device is converted into text and then parsed for meaning. The Grammar Markup Language helps make sense of the utterings. The Dialog Markup Language and Dialog Manager work to move the user through a script of choices toward a resolution, which could be giving out information or performing a transaction.

Dialog Markup Language

This is the front end of the system. The DML works with the Dialog Manager to guide the conversation between the system and the user: it prompts the user for input, makes enough sense of the response to determine what to do next, and moves the dialog to that next step.

The Dialog Manager and DML can work in two modes:

  • directed, where the system poses questions that the user answers, and
  • mixed initiative, where the system and user can both ask questions.

Intelligent voice systems should also include a "barge-in" capability that allows the user to interrupt the system while it is speaking.
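In the directed mode, a dialog might be marked up along these lines. The element names and the barge-in attribute are invented for illustration, since the group's drafts were not yet published.

```xml
<!-- Hypothetical directed dialog: the system asks, the user answers.
     All names are invented for illustration. -->
<dialog id="flight-booking" bargein="true">
  <field name="destination" grammar="cities">
    <prompt>Where would you like to fly?</prompt>
    <filled>
      <prompt>What day would you like to leave?</prompt>
    </filled>
    <nomatch>
      <prompt>Sorry, I didn't catch that. Please say a city name.</prompt>
    </nomatch>
  </field>
</dialog>
```

With barge-in enabled, a user who already knows the choices could answer before the prompt finishes playing.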


Figure 2. The Dialog Manager drives the conversation and puts out content in text form. The Text-to-Speech Markup Language adds tags for inflection or other audio cues before a wave generator creates the message for output speakers.

Text-to-Speech Markup Language

The Text-to-Speech Markup Language (TTSML) is the final link in the outgoing chain, taking the XML-based text and working with a speech synthesizer to generate the system's voice.

A core function of the TTSML is prosodic control: things like timing, pitch, pauses, rate of speech, and emphasis. Some of the tags being considered would mark parts of sentences for tonal inflection. For example, a <question> tag would instruct the synthesizer to generate an upswing at the end of the sentence, while an <exclaim> tag would create a different type of inflection.
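Marked-up text headed for the synthesizer might look like this sketch, using the <question> and <exclaim> tags described above. The enclosing element and the pause markup are invented for illustration.

```xml
<!-- Sketch using the <question> and <exclaim> tags discussed above;
     the <speech> wrapper and <pause> tag are invented for illustration. -->
<speech>
  <question>Would you like to hear your account balance</question>
  <pause length="500ms"/>
  <exclaim>You have a new message waiting</exclaim>
</speech>
```

The wave generator then renders this tagged text as audio for the output speakers, as shown in Figure 2.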

Multi-Modal Markup Language

The MMML works with the dialog manager as a sort of traffic cop, handling input in many forms from many types of devices. Ideally, a dialog manager that, for example, walks a user through ordering airline tickets should be able to deal gracefully with input from a voice, a keyboard, a telephone keypad, or a multiple-choice touch-screen display.
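One way to picture this is a single logical field bound to several input channels, along these lines. This is a hypothetical sketch; all element and attribute names are invented for illustration.

```xml
<!-- Hypothetical multi-modal binding: one field, several input
     channels. All names are invented for illustration. -->
<field name="departure-city">
  <input mode="voice" grammar="cities"/>
  <input mode="keypad" map="city-codes"/>
  <input mode="touchscreen" menu="city-list"/>
</field>
```

Whichever channel fills the field first, the dialog manager can carry on with the same script.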

Next Steps

The working group has posted a requirements draft on its site. At a meeting in January, the group reviewed the feedback on the requirements that it had gathered thus far. Group members are now developing drafts of each of the markup languages.

The working group plans to meet again in May and then publish working drafts of the markup languages. It will continue to gather input and revise the markup languages for up to a year, after which it will issue a last call for comments before publishing the final recommendations.

For more information, visit the World Wide Web Consortium's site.
