ONJava.com -- The Independent Source for Enterprise Java

Simple XML Parsing with SAX and DOM

by Philipp K. Janert

XML has arrived. Configuration files, application file formats, even database access layers make use of XML-based documents. Fortunately, several high-quality implementations of the standard APIs for handling XML are available. Unfortunately, these APIs are large and therefore present a formidable hurdle for the beginner.

In this article, I would like to offer an accessible introduction to the two most widely used APIs: SAX and DOM. For each API, I will show a sample application that reads an XML document and turns it into a set of Java objects representing the data in the document, a process known as XML "unmarshalling."

First, a word on style. For instructional purposes, I have kept the code as simple as possible. In order to focus on the basic usage of SAX and DOM, I completely omitted error handling and handling of XML namespaces, among other things. Furthermore, the code has not been tuned for flexibility or elegance; it may be dull, but hopefully it is also obvious.

The 60-Second XML Skinny

For those completely new to XML, I would like to review the most important terms and concepts used with XML data.

Each XML document starts with a prologue, followed by the actual document content. The prologue begins with an XML declaration, such as:

<?xml version="1.0" standalone="yes" ?>

The declaration must be at the very beginning of the document -- not even whitespace may precede it! It is followed by the document type declaration, which in the present case only names the root element (catalog), but in a real-world application would also provide a link to a constraint as provided by a Document Type Definition (DTD) or XML Schema document:

<!DOCTYPE catalog>

This concludes the prologue. The following body of the XML document is made up of elements, which take the role of (and look like) familiar HTML tags. Every element has a name, and may have an arbitrary number of attributes:

<catalog version="1.0">...</catalog>

Here catalog is the name of the element, which has one attribute named version with the value 1.0. In contrast to HTML, XML element names are case sensitive, and every element must be closed with a matching closing tag. Note that there must be no space between the opening angle bracket and the element name. If the element contains neither text nor other elements, the closing tag may be merged with the start tag (a so-called empty-element tag):

<catalog version="1.0" />
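
To make the terms "element name" and "attribute" concrete, here is a minimal sketch that parses the empty-element form shown above and prints what the parser sees. It uses the JAXP DOM parser bundled with the JDK only as a convenient way to inspect the result; the class name ElementDemo and the helper parseRoot are hypothetical names chosen for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical demo class; the name is not taken from the article.
class ElementDemo {

    // Parses an XML string and returns its root element.
    // Any parsing problem is rethrown unchecked to keep the sketch short.
    static Element parseRoot( String xml ) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc =
                builder.parse( new InputSource( new StringReader( xml ) ) );
            return doc.getDocumentElement();
        } catch( Exception e ) {
            throw new RuntimeException( e );
        }
    }

    public static void main( String[] args ) {
        Element root = parseRoot( "<catalog version=\"1.0\" />" );

        System.out.println( root.getTagName() );              // catalog
        System.out.println( root.getAttribute( "version" ) ); // 1.0
        System.out.println( root.hasChildNodes() );           // false
    }
}
```

Passing the long form, <catalog version="1.0"></catalog>, to parseRoot should give the same three results, since both spellings describe an element without content.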

An element may include either text, or other elements, or a combination of both. Text may include entity references, similar to those in HTML. In short, an entity reference is a placeholder for another piece of data. They are often used to include special characters, such as angle brackets: &lt; for < and &gt; for >. Entity references consist of an ampersand, followed by the entity name and a semicolon:

&lt;

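The following sketch illustrates what "placeholder" means in practice: the parser replaces each predefined entity reference with the character it stands for before handing the text to the application. The class name EntityDemo and the helper rootText are hypothetical, and the JAXP DOM parser is used here only as a convenient way to look at the parsed text.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class; the name is not taken from the article.
class EntityDemo {

    // Returns the text content of the root element of the given XML string.
    static String rootText( String xml ) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc =
                builder.parse( new InputSource( new StringReader( xml ) ) );
            return doc.getDocumentElement().getTextContent();
        } catch( Exception e ) {
            throw new RuntimeException( e );
        }
    }

    public static void main( String[] args ) {
        // The document text contains entity references, not raw brackets.
        String xml = "<title>1 &lt; 2 &amp; 3 &gt; 2</title>";

        System.out.println( rootText( xml ) );   // prints: 1 < 2 & 3 > 2
    }
}
```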

XML elements have to be properly nested; in particular, the opening and closing tags of different elements must not overlap. In other words, an element's opening and end tags must reside in the same parent. This establishes a clear parent/child relationship among all elements of an XML document. Finally, the outermost element (the one following the prologue) is called the root element.

An element name may be qualified by an XML namespace prefix, yielding a qualified name, or qName. The prefix is a short name bound to a namespace, which is itself identified by a Uniform Resource Identifier (URI); the prefix is followed by a colon and the local name:

prefix:localName

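As a small illustration of how the parts of a qualified name relate, the sketch below parses a namespace-qualified element and prints its qName, prefix, local name, and namespace URI. The prefix lib and the URI http://example.com/library are made up for this example, as is the class name NamespaceDemo. Note that the JAXP factory is not namespace aware unless asked to be.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical demo class; the name is not taken from the article.
class NamespaceDemo {

    // Parses an XML string with namespace processing switched on
    // and returns the root element.
    static Element parseRoot( String xml ) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware( true );   // off by default
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc =
                builder.parse( new InputSource( new StringReader( xml ) ) );
            return doc.getDocumentElement();
        } catch( Exception e ) {
            throw new RuntimeException( e );
        }
    }

    public static void main( String[] args ) {
        // The xmlns:lib attribute binds the prefix "lib" to a (made-up) URI.
        Element root = parseRoot(
            "<lib:catalog xmlns:lib=\"http://example.com/library\" />" );

        System.out.println( root.getNodeName() );      // lib:catalog
        System.out.println( root.getPrefix() );        // lib
        System.out.println( root.getLocalName() );     // catalog
        System.out.println( root.getNamespaceURI() );  // http://example.com/library
    }
}
```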

A document following these rules is syntactically well-formed. This is to be distinguished from its validity, which refers to adherence to the constraints laid out in the DTD or XML Schema document. Note that for a document that does not specify any constraints (such as the example document below), the concept of validity does not apply.
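
The rules for well-formedness listed so far can be tried out with a few lines of code. The sketch below assumes that a non-validating parser reports every violation of well-formedness as a fatal error, which JAXP surfaces as a SAXException; the class name WellFormedCheck and the method isWellFormed are hypothetical. It checks well-formedness only and says nothing about validity.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; the name is not taken from the article.
class WellFormedCheck {

    // Returns true if the parser accepts the string as well-formed XML.
    static boolean isWellFormed( String xml ) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them.
            builder.setErrorHandler( new DefaultHandler() );
            builder.parse( new InputSource( new StringReader( xml ) ) );
            return true;
        } catch( SAXException e ) {
            return false;   // the parser rejected the document
        } catch( Exception e ) {
            throw new RuntimeException( e );
        }
    }

    public static void main( String[] args ) {
        // Properly nested elements
        System.out.println( isWellFormed( "<a><b></b></a>" ) );     // true
        // Overlapping opening and closing tags
        System.out.println( isWellFormed( "<a><b></a></b>" ) );     // false
        // Element names are case sensitive
        System.out.println( isWellFormed( "<a></A>" ) );            // false
        // Whitespace before the XML declaration
        System.out.println(
            isWellFormed( " <?xml version=\"1.0\"?><a/>" ) );       // false
    }
}
```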

The XML Document and Data Objects

The document to read describes the catalog of a library. The catalog may contain an arbitrary number of books and magazines. Each book has a title and exactly one author. Each magazine has a name and may contain an arbitrary number of articles. Finally, each article has a headline and a starting page.

<?xml version="1.0"?>

<catalog library="somewhere">

  <book>
    <author>Author 1</author>
    <title>Title 1</title>
  </book>

  <book>
    <author>Author 2</author>
    <title>His One Book</title>
  </book>

  <magazine>
    <name>Mag Title 1</name>

    <article page="5">
      <headline>Some Headline</headline>
    </article>

    <article page="9">
      <headline>Another Headline</headline>
    </article>
  </magazine>

  <book>
    <author>Author 2</author>
    <title>His Other Book</title>
  </book>

  <magazine>
    <name>Mag Title 2</name>

    <article page="17">
      <headline>Second Headline</headline>
    </article>
  </magazine>

</catalog>

Note that the starting page is encoded as an attribute of the article element. This is done primarily to demonstrate the use of attributes, although it can be argued that this design decision is actually semantically justified, since the starting page of an article is information about the article, but not part of the article itself.

In the example text, the following elements (called "complex elements" for the purpose of this article) may contain other elements:

  • <catalog>
  • <book>
  • <magazine>
  • <article>

The "simple" elements are those that contain only text:

  • <author>
  • <title>
  • <name>
  • <headline>

There are no elements that contain both text and child elements simultaneously.
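
The distinction between "complex" and "simple" elements can also be expressed in code. The sketch below treats an element as complex if it has at least one child element; this is this article's working definition, not an XML term. The class name ElementKinds and its methods are hypothetical, and the JAXP DOM parser is used only to obtain something to inspect.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; the name is not taken from the article.
class ElementKinds {

    // True if the element has at least one child element.
    static boolean isComplex( Element e ) {
        NodeList children = e.getChildNodes();
        for( int i=0; i<children.getLength(); i++ ){
            if( children.item(i).getNodeType() == Node.ELEMENT_NODE ) {
                return true;
            }
        }
        return false;
    }

    // Parses an XML string and returns its root element.
    static Element parseRoot( String xml ) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse( new InputSource( new StringReader( xml ) ) )
                          .getDocumentElement();
        } catch( Exception e ) {
            throw new RuntimeException( e );
        }
    }

    public static void main( String[] args ) {
        Element magazine = parseRoot(
            "<magazine><name>Mag Title 1</name>" +
            "<article page=\"5\"><headline>Some Headline</headline></article>" +
            "</magazine>" );
        Element name = (Element) magazine.getElementsByTagName( "name" ).item(0);

        System.out.println( isComplex( magazine ) );  // true
        System.out.println( isComplex( name ) );      // false
    }
}
```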

The complex elements are represented in the application code by classes, whereas the simple elements are java.lang.String member variables of these classes. Since the sole purpose of these classes is to bundle the data read from the document, their interface has been kept minimal: they can be instantiated, their data members can be set, and finally, they override the toString() method, so as to allow access to the data inside.

import java.util.Vector;

// Root data object: holds all books and magazines found in the catalog.
class Catalog {
    private Vector books;
    private Vector magazines;

    public Catalog() {
        books = new Vector();
        magazines = new Vector();
    }

    public void addBook( Book rhs ) {
        books.addElement( rhs );
    }
    public void addMagazine( Magazine rhs ) {
        magazines.addElement( rhs );
    }

    // Lists all books first, then all magazines, one entry per line.
    public String toString() {
        String newline = System.getProperty( "line.separator" );
        StringBuffer buf = new StringBuffer();

        buf.append( "--- Books ---" ).append( newline );
        for( int i=0; i<books.size(); i++ ){
            buf.append( books.elementAt(i) ).append( newline );
        }

        buf.append( "--- Magazines ---" ).append( newline );
        for( int i=0; i<magazines.size(); i++ ){
            buf.append( magazines.elementAt(i) ).append( newline );
        }

        return buf.toString();
    }
}

// --------------------------------------------------------------

// A book has exactly one author and one title.
class Book {
    private String author;
    private String title;

    public Book() {}

    public void setAuthor( String rhs ) { author = rhs; }
    public void setTitle(  String rhs ) { title  = rhs; }

    public String toString() {
        return "Book: Author='" + author + "' Title='" + title + "'";
    }
}

// --------------------------------------------------------------

// A magazine has a name and an arbitrary number of articles.
class Magazine {
    private String name;
    private Vector articles;

    public Magazine() {
        articles = new Vector();
    }

    public void setName( String rhs ) { name = rhs; }

    public void addArticle( Article a ) {
        articles.addElement( a );
    }

    // Appends the description of each article to the magazine's name.
    public String toString() {
        StringBuffer buf = new StringBuffer( "Magazine: Name='" + name + "' ");
        for( int i=0; i<articles.size(); i++ ){
            buf.append( articles.elementAt(i).toString() );
        }
        return buf.toString();
    }
}

// --------------------------------------------------------------

// An article has a headline and a starting page (kept as a String).
class Article {
    private String headline;
    private String page;

    public Article() {}

    public void setHeadline( String rhs ) { headline = rhs; }
    public void setPage(     String rhs ) { page     = rhs; }

    public String toString() {
        return "Article: Headline='" + headline + "' on page='" + page + "' ";
    }
}

The classes have not been declared public and therefore have package visibility. Because Java allows at most one public top-level class per source file, this means that all of them can be defined in the same source file. (To remove possible confusion: the variable name rhs used in the setter methods stands for right-hand-side -- a very convenient naming convention for assignments!)
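
To see how the data objects are meant to be used before any parser enters the picture, the following sketch fills in part of the first magazine of the example document by hand and prints it. The Magazine and Article classes are repeated in condensed form only so that the sketch compiles on its own; the driver class DataObjectDemo and its method firstMagazine are hypothetical names.

```java
import java.util.Vector;

// Hypothetical driver class; not part of the article's code.
class DataObjectDemo {

    // Builds the first magazine of the example document by hand,
    // with only its first article.
    static Magazine firstMagazine() {
        Magazine mag = new Magazine();
        mag.setName( "Mag Title 1" );

        Article a = new Article();
        a.setHeadline( "Some Headline" );
        a.setPage( "5" );          // the page attribute is kept as a String
        mag.addArticle( a );

        return mag;
    }

    public static void main( String[] args ) {
        // println calls toString() on the Magazine, which in turn
        // calls toString() on each Article it contains.
        System.out.println( firstMagazine() );
    }
}

// Condensed copies of the classes shown above.
class Article {
    private String headline;
    private String page;

    public void setHeadline( String rhs ) { headline = rhs; }
    public void setPage(     String rhs ) { page     = rhs; }

    public String toString() {
        return "Article: Headline='" + headline + "' on page='" + page + "' ";
    }
}

class Magazine {
    private String name;
    private Vector articles = new Vector();

    public void setName( String rhs ) { name = rhs; }
    public void addArticle( Article a ) { articles.addElement( a ); }

    public String toString() {
        StringBuffer buf = new StringBuffer( "Magazine: Name='" + name + "' " );
        for( int i=0; i<articles.size(); i++ ){
            buf.append( articles.elementAt(i).toString() );
        }
        return buf.toString();
    }
}
```

A parser-driven application would make the same setter and add calls, only in response to what it finds in the document rather than from hard-coded values.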
