XHTML: The Clean Code Solution

   What is XHTML?

• XHTML is a stricter, cleaner version of HTML.

• Why do we need cleaner code? Current browsers are bloated with code to handle sloppy or proprietary HTML. As more devices read web content (handhelds, set-top boxes, pagers), we'll need more compact browsers.

• XHTML documents are all lowercase.

• All tags, including empty elements, must be closed.

XML continues to be a hot topic among web developers. Why? Because it delivers a standardized markup that separates display and layout code from content and structure, making the creation, maintenance, and parsing of documents much easier for all involved.

But that's just one example of how a strict, standardized markup standard can make programming easier. As we watch the growing trend of portable web-enabled devices, we realize they require only small subsets of the bloated HTML code we are sending to desktop browsers, and multiple output formats are what XML and standardized markup languages were designed for. Getting to that point, however, will require some work.

Whether your site has 10 pages or 10,000, it's likely that the HTML code is a mix of standard HTML and browser-specific, proprietary markup. If you've been thinking about making the transition to XML, or even just standardizing your HTML code, the W3C, the Web's standards body, has provided the solution: XHTML (Extensible Hypertext Markup Language), the latest version of HTML.

XML + HTML = XHTML (sort of)

Let's take a quick look at how these markup languages fit together.

The W3C (World Wide Web Consortium) has taken the logical step of expressing the HTML 4.0 standard in XML instead of using the more complicated SGML.

The minute details aren't important for the average web coder; the main difference is found in the document type definitions (DTDs) used by HTML and XHTML. A DTD, according to the W3C, is "a collection of declarations that, as a collection, defines the legal structure, elements, and attributes that are available for use in a document that complies to the DTD."

In other words, it's a definition of what is legal syntax in XHTML and what isn't. The DTD for XHTML is more restrictive than the DTD for HTML because XML is more restrictive than SGML.

The W3C gives two main reasons for recommending XHTML as the next step from HTML 4.0. First, XHTML, since it's an XML application, is designed to be extensible -- that's the "X" in all the acronyms. This means that new tags, or "elements" in the official W3C jargon, can be added without altering the entire DTD that the document is based on. Granted, if you add tags that aren't in the DTD, the document won't validate, but if you keep it well-formed, it will still parse.
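To make the "won't validate, but will still parse" point concrete, here is a minimal sketch using the XML parser in Python's standard library. The <recipe> element is made up for illustration and is not part of any XHTML DTD. The parser checks well-formedness only, not validity against a DTD.

```python
import xml.etree.ElementTree as ET

# <recipe> is a hypothetical element that no XHTML DTD defines.
markup = """<html>
  <head><title>Extensibility sketch</title></head>
  <body>
    <p>Ordinary XHTML paragraph.</p>
    <recipe>Toast: one slice of bread, two minutes.</recipe>
  </body>
</html>"""

# fromstring() raises ParseError only when the markup is not well-formed;
# it does not check the elements against a DTD.
root = ET.fromstring(markup)
tags = [element.tag for element in root.iter()]
print(tags)
```

The unknown element shows up in the parsed tree like any other, which is all that "well-formed" promises. A validating tool would still reject the document.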

Work is already underway on XHTML 1.1, which is designed to accommodate extensions through existing XHTML modules and techniques for developing new modules. These modules will permit the combination of existing and new feature sets when developing content and when designing new client software, so developers can choose among subsets of XHTML, and don't need to support the entire language.

The second reason is a follow-on to that: XHTML is designed for portability. Desktop web browsers have become behemoths of code bloat. You name it, there's code in the newest browsers to do it. But according to some estimates cited by the W3C, by 2002, 75 percent of web document viewing will be through non-desktop devices like palm computers, televisions, toasters, and other alternative platforms, not through browsers on PCs. Your web-enabled toaster won't need or want to accommodate the same subset of XHTML that your PC browser does. Through a new client and document profiling mechanism, servers, proxies, and clients will be able to perform content transformation so that eventually it will be possible to develop XHTML-conforming content that is usable by any XHTML-conforming client. The server, the client, or a proxy service will decide on the subset of XHTML that is received.

But much of that is still down the road in a number of XHTML 1.1 draft specifications still being written. The only spec that has been made a recommendation by the W3C is XHTML 1.0.

Some more specific reasons for moving to XHTML 1.0 follow from the fact that XHTML documents are XML conforming: they are readily viewed, edited, and validated with standard XML tools, and they can use applications (such as scripts and applets) that rely upon either the HTML Document Object Model or the XML Document Object Model (DOM).

To help you see the main differences between HTML and XHTML, we've included a number of examples in the following section, "Differences." You'll see that most of the variances are simply stricter definitions of common HTML tags.

There are, however, some new features, which we cover in the third section, "What's New."

XHTML Part 2: Differences

First the basics. Here are a few simple changes to how you currently code. As you'll see in these examples, it's nothing dramatic, just some rules to remember about tags and attributes.

All HTML must be in lowercase

Since XML is case-sensitive, all HTML element and attribute names must be in lowercase. You can no longer get away with what many of us used to do to improve the readability of code -- entering the element and attribute names in uppercase and the values in lowercase, or other coding styles.

Incorrect: <BODY BGCOLOR="#ffffff">
Correct:   <body bgcolor="#ffffff">

Fortunately, most good HTML editors give you the option of inserting your HTML code in uppercase or lowercase, and many even convert the case of existing tags. An exception to this rule is that user-defined attribute values can be any case you want. For example, the #ffffff hex color above can also legally be written as #FFFFFF.
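If you want to see the case-sensitivity for yourself, the following sketch feeds a few snippets to the XML parser in Python's standard library. The helper name is_well_formed is my own invention. Note that the parser only checks well-formedness: an all-uppercase <BODY> element is well-formed XML, but it is a different element from the <body> that the XHTML DTD defines.

```python
import xml.etree.ElementTree as ET

def is_well_formed(markup: str) -> bool:
    """Return True if the XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

# Opening and closing tags must match exactly, including case.
mixed_case = is_well_formed('<BODY BGCOLOR="#ffffff"></body>')
lower_case = is_well_formed('<body bgcolor="#FFFFFF"></body>')
print(mixed_case, lower_case)  # False True

# To an XML parser, BODY and body are simply two different names.
upper_tag = ET.fromstring("<BODY></BODY>").tag
print(upper_tag == "body")  # False
```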

All attribute values must be quoted

This one is pretty straightforward -- no more <table border=0>. You now need to quote every attribute, even if it's numeric:

Incorrect: <table border=0>...
Correct:   <table border="0">...

This one will be particularly annoying to some Perl coders I know, who for years have been writing:

print "<table border=0>\n";
instead of
print "<table border=\"0\">\n";
or, even better,
print qq{<table border="0">\n};

It's also frustrating that some HTML editing programs that claim not to change your code do remove quotes around numeric attribute values.
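The same chore comes up in any language that prints markup. As a rough sketch in Python, the standard library's quoteattr function can do the quoting for you. The open_tag helper below is hypothetical and exists only for this illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

def open_tag(name: str, **attrs: object) -> str:
    """Build an opening tag with every attribute value quoted."""
    parts = [name] + [f"{key}={quoteattr(str(value))}" for key, value in attrs.items()]
    return "<" + " ".join(parts) + ">"

tag = open_tag("table", border=0, width="100%")
print(tag)  # <table border="0" width="100%">

# The quoted form parses as XML; the unquoted form is rejected.
border = ET.fromstring(tag + "</table>").get("border")
try:
    ET.fromstring("<table border=0></table>")
    unquoted_ok = True
except ET.ParseError:
    unquoted_ok = False
print(border, unquoted_ok)  # 0 False
```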

All non-empty elements must be terminated

Remember when the <p> tag was used to separate paragraphs? Well that was never the intended use for that tag. But many HTML coders, including myself, used it that way. Some web developers still preach against the </p> tag at the end of a paragraph. That's a whole can of worms I'm going to avoid.

What I've learned is that the <p> tag is designed to mark the beginning and end of a paragraph. That makes it a "non-empty" tag since it contains the paragraph text. I still occasionally use it by itself, especially on pages that don't use a style sheet. But in XHTML that's a big no-no.

Incorrect:

Paragraph 1<p>
Paragraph 2<p>

Correct:

<p>Paragraph 1</p>
<p>Paragraph 2</p>

In addition to the <p> element, this also applies to list elements which are often left unterminated: <li></li>, <dt></dt>, and <dd></dd>.

Elements must nest, not overlap

HTML doesn't care whether you overlap elements. For example, if you have a bold tag at the end of a paragraph, it works pretty much the same whether you close the </b> first or the </p> first. With XML and XHTML, you need to close the most recently opened tag first, then the others in succession:

Incorrect: <p>here is a bolded <b>word.</p></b>
Correct:   <p>here is a bolded <b>word.</b></p>
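Both the termination rule and the nesting rule are well-formedness rules, so any XML parser will catch violations. Here is a small sketch with Python's standard library; the is_well_formed helper and the wrapping <div> are my own additions for the demonstration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(markup: str) -> bool:
    """Return True if the XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

results = {
    # <p> used as a separator and never closed
    "unterminated": is_well_formed("<div>Paragraph 1<p>Paragraph 2<p></div>"),
    "terminated": is_well_formed("<div><p>Paragraph 1</p><p>Paragraph 2</p></div>"),
    # </p> closes before the <b> opened inside it
    "overlapping": is_well_formed("<p>here is a bolded <b>word.</p></b>"),
    "nested": is_well_formed("<p>here is a bolded <b>word.</b></p>"),
}
for label, ok in results.items():
    print(label, ok)
```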

Required elements

These little tidbits are pretty obvious, and I imagine that most of us are doing them already.

XHTML Part 3: What's New

In order for certain HTML elements to be considered valid in XML 1.0 and XHTML 1.0, they need to be written differently. These are the new requirements.

All documents must have a doctype declaration

Ever wonder what these odd lines of code at the top of many HTML documents are?

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">

They are doctype declarations. One should appear on every XHTML page, immediately before the <html> tag. The purpose of the doctype declaration is to declare that the document adheres to a specified DTD.

Your XHTML documents must reference one of the three XHTML DTDs: Strict, Transitional, or Frameset. The XHTML 1.0 DTDs correspond to the three HTML 4 DTDs, reformulated in XML.

The strict doctype declaration--

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

--is used when you're doing all of your formatting in Cascading Style Sheets (CSS). In other words, you aren't using <font> and <table> tags to control how the browser displays your documents.

The transitional doctype declaration--

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

--is used when you need to use presentational markup in your document. Most of us will be using the transitional DTD for quite some time, because we don't want to limit our audience to users with browsers that support CSS.

The frameset declaration--

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-frameset.dtd">

--is used when your documents have frames.

Since the DTD defines what's legal and what isn't, you can validate your document against the definition. There are many programs to validate documents; one of the (hopefully) more reliable is the W3C's own validator. It has a number of options, but one of the simplest ways to use it is to put a link to http://validator.w3.org/check/referer on your web page. Clicking that link will validate your page. Somewhat ironically, because of the way this page is published, it won't validate.

The root element of the document must be <html> and must designate the XHTML 1.0 namespace

This means you can't have anything before the opening <html> tag (except the doctype declaration described above and an optional XML processing instruction) and you can't have anything after the closing </html> tag.

But the difference here is that you need to include a new namespace attribute, xmlns, in the opening <html> tag. The namespace attribute defines which namespace the document uses--

<html xmlns="http://www.w3.org/1999/xhtml">

--(this is the namespace URI given in the XHTML 1.0 Recommendation; earlier working drafts used URIs under http://www.w3.org/TR/xhtml1).

What's a namespace you ask? The W3C replies: "An XML namespace is a collection of names, identified by a URI [uniform resource identifier] reference, which are used in XML documents as element types and attribute names." In other words, the XHTML namespace is the list of tags used by XHTML, and while these are also identified in the DTD, the namespace is used to ensure that names used by one DTD don't conflict with user-defined names or those defined in other DTDs.

Namespaces are new to XML, and the idea behind them is that different types of documents will have different, and often multiple, namespaces. For example, a <title> element within the XHTML namespace can only refer to the document title. But another namespace, let's call it "book," might use <title> to refer to the title of the book.

In XML there's a method for combining multiple namespaces in a single document which allows you to have two <title> tags, for example, and not run into problems. The W3C is still working on bringing this functionality into compliance with strict XHTML.
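Here is a rough sketch of how two <title> elements can coexist, using Python's standard XML parser. The XHTML namespace URI is the one from the XHTML 1.0 Recommendation. The "book" namespace URI and its prefix are hypothetical, invented for this example. The document is well-formed, but as noted above it would not validate against the strict XHTML DTD.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
BOOK_NS = "http://example.com/ns/book"  # hypothetical namespace for illustration

document = f"""<html xmlns="{XHTML_NS}" xmlns:book="{BOOK_NS}">
  <head><title>Document title</title></head>
  <body><book:title>Book title</book:title></body>
</html>"""

root = ET.fromstring(document)

# ElementTree spells a qualified name as {namespace-uri}localname.
html_title = root.find(f"{{{XHTML_NS}}}head/{{{XHTML_NS}}}title")
book_title = root.find(f".//{{{BOOK_NS}}}title")

print(html_title.tag, "->", html_title.text)
print(book_title.tag, "->", book_title.text)
```

Because each name carries its namespace URI, the parser never confuses the two titles even though they share a local name.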

If that isn't confusing enough, the W3C is working on replacing DTDs in XML with something called XML Schemas. This process is just in the working draft stage, so DTDs will be with us for a while, but it's worth noting that it's on the horizon.

Processing Instructions

In the last section we casually mentioned the XML processing instruction (PI), and it's worth giving it a more formal introduction. The PI is optionally the first item in any XML document. It looks like this: <?xml version="1.0" encoding="UTF-8"?> In this example, it does two things. It tells you (and any programs parsing the document) what version of XML the document is based on, and it declares the character encoding that the document is using. The PI is rendered in some HTML browsers, so you may want to leave it off if you can, and you can if the document uses only the default character encodings, UTF-8 or UTF-16.

Empty elements must be terminated

An empty element, as you might guess, doesn't contain anything. So while a <p></p> tag contains a paragraph, and a <b></b> tag contains text to be bolded, a <br> tag is empty. In other words, it has no real beginning and end. Other tags like this are <hr> and <img src="image.gif">.

In XHTML, these tags need to be terminated. To do this, you might think that you just add a closing </br> to the opening <br>. While this is valid in XML, it doesn't render properly in all browsers. Instead, XHTML recommends the use of a modified empty element: <br />. This is also sometimes called a self-terminating element or a terminated empty tag. Note the space between the element name and the slash. This helps to make the XHTML cross-browser compatible.

Incorrect: <br>
Correct:   <br />

Incorrect: <hr>
Correct:   <hr />

Incorrect: <img src="image.gif">
Correct:   <img src="image.gif" />

This also applies to most form elements.
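As a quick check of the three spellings, this sketch runs each one through Python's standard XML parser. The accepts helper is my own name. It also shows that, to an XML tool, <br></br> and <br /> are the same thing: the serializer in this library happens to write the short form with the space.

```python
import xml.etree.ElementTree as ET

def accepts(markup: str) -> bool:
    """Return True if the XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

bare = accepts("<p>one<br>two</p>")         # unterminated empty element
short = accepts("<p>one<br />two</p>")      # terminated empty tag
paired = accepts("<p>one<br></br>two</p>")  # start tag plus end tag
print(bare, short, paired)  # False True True

# Parsing the paired form and writing it back out yields the short form.
round_trip = ET.tostring(ET.fromstring("<p>one<br></br>two</p>"), encoding="unicode")
print(round_trip)  # <p>one<br />two</p>
```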

Attribute value pairs cannot be minimized

An attribute is said to be minimized when there is only one value for it. For example, in the form element <option>:

<option value="somevalue" selected>

the attribute "selected" has been minimized. Its mere existence in the element indicates to the HTML browser that the option should be displayed as selected. In XHTML, this isn't allowed. Instead, XHTML wants you to write minimized attributes as if they had values:

<option value="somevalue" selected="selected">

The same rule applies to <input type="radio">, <input type="checkbox">, and <dl> elements among others:

Incorrect: <input type="radio" ... checked>
Correct:   <input type="radio" ... checked="checked" />

Incorrect: <input type="checkbox" ... checked>
Correct:   <input type="checkbox" ... checked="checked" />

Incorrect: <dl compact>
Correct:   <dl compact="compact">
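A minimized attribute is not just invalid XHTML; it isn't even well-formed XML, so a parser gives up on it immediately. The sketch below, again using Python's standard library, shows the rejection and what the expanded form parses to. The "One" label text is filler for the example.

```python
import xml.etree.ElementTree as ET

minimized = '<option value="somevalue" selected>One</option>'
expanded = '<option value="somevalue" selected="selected">One</option>'

# The minimized form has an attribute name with no value, so parsing fails.
try:
    ET.fromstring(minimized)
    minimized_ok = True
except ET.ParseError as error:
    minimized_ok = False
    print("rejected:", error)

# The expanded form parses, and the attribute is an ordinary name/value pair.
option = ET.fromstring(expanded)
print(option.attrib)
```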

<script> and <style> elements must be marked as CDATA sections

If you are putting <script></script> or <style></style> code anywhere in your document, you need to wrap the content of those elements in a CDATA section. If you don't, characters like < in your code will be treated as the beginning of an element. The main purpose for CDATA sections is to ignore characters that would otherwise be regarded as markup. The only delimiter that is recognized in a CDATA is the "]]>" string which ends the CDATA section.

This does not apply if you are using <script language="JavaScript" src="/sourcecode.js"></script> to pull your script code off the server, or if you're using <link href="/stylesheet.css" /> to load an external CSS file. This only applies when you're putting code inside these elements:


Incorrect:

<script language="JavaScript">
document.write("<h2>Table of Factorials</h2>");
for(i = 1, fact = 1; i < 10; i++, fact *= i) {
      document.write(i + "! = " + fact);
      document.write("<br />");
}
// Code courtesy of JavaScript the Definitive Guide
</script>

Correct:

<script language="JavaScript">
<!-- <![CDATA[
document.write("<h2>Table of Factorials</h2>");
for(i = 1, fact = 1; i < 10; i++, fact *= i) {
      document.write(i + "! = " + fact);
      document.write("<br />");
}
// Code courtesy of JavaScript the Definitive Guide
// ]]> -->
</script>

Note how we had to comment out the CDATA wrappers to avoid JavaScript errors (thanks to James Eberhardt for pointing this out). This is well-formed and validates as strict XHTML, but there is one caveat: the XML spec allows parsers to strip comments and their contents completely, so if you are relying on a script to run after parsing, you may want to double-check the behavior of your parser. As you may have determined by now, the easy alternative to using the CDATA wrapper is to use external script and style sheet documents.
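To see both the problem and the caveat, here is a sketch that runs three variants of a script element through Python's standard XML parser, which by default discards comments. The one-line loop is a shortened stand-in for the factorial example, not the original code.

```python
import xml.etree.ElementTree as ET

code = "for (i = 1; i < 10; i++) { document.write(i); }"

bare = f"<script>{code}</script>"
wrapped = f"<script><![CDATA[{code}]]></script>"
commented = f"<script><!-- <![CDATA[\n{code}\n// ]]> --></script>"

# Without CDATA, the "<" in "i < 10" looks like the start of a tag.
try:
    ET.fromstring(bare)
    bare_ok = True
except ET.ParseError:
    bare_ok = False

# Inside a CDATA section the code comes through untouched.
wrapped_text = ET.fromstring(wrapped).text

# Inside an XML comment, this parser throws the script away entirely.
commented_text = ET.fromstring(commented).text

print(bare_ok)
print(wrapped_text == code)
print(repr(commented_text))
```

The last result is the caveat in action: the comment-wrapped version is well-formed, but a comment-stripping parser hands you an empty script element.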

Using Ampersands in Attribute Values

One minor but important item to close with. When an attribute value contains an ampersand, it must be expressed as the character entity reference &amp;. For example, in a URL that refers to a CGI script that takes parameters:

<a href="http://oreillynet.com/
<a href="http://oreillynet.com/
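As an illustration of the rule, the following sketch escapes a query string with Python's standard library before placing it in an href. The URL and its parameters are hypothetical, made up for this example, and are not the ones from the examples above.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical CGI URL with two parameters.
url = "http://example.com/cgi-bin/search?q=xhtml&page=2"

raw_link = f'<a href="{url}">search</a>'
escaped_link = f'<a href="{escape(url)}">search</a>'
print(escaped_link)

# A bare ampersand starts an entity reference, so the raw link is rejected.
try:
    ET.fromstring(raw_link)
    raw_ok = True
except ET.ParseError:
    raw_ok = False

# The parser turns &amp; back into a plain ampersand in the attribute value.
href = ET.fromstring(escaped_link).get("href")
print(raw_ok, href == url)  # False True
```

The escaping affects only the markup. Any program that reads the attribute through a parser, browsers included, sees the original URL.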

Converting HTML to XHTML

These aren't all the differences, but I think they cover most of the commonly encountered ones. More details about conversion can be found in the XHTML 1.0 Specification. Of particular interest is the HTML Compatibility Guidelines section of that document, which points out some of the fine details of converting from HTML to XHTML.

To help you convert documents from HTML to XHTML, the W3C suggests using HTML Tidy, by Dave Raggett. HTML Tidy is a free utility originally designed to clean up HTML markup errors and reformat HTML code for legibility. It has been adapted by the author to do a wide range of things to any HTML file, including converting from HTML to XHTML.

I've written a simple Web front-end to HTML Tidy, specifically for processing individual HTML documents that are accessible via the Web. If you need to do batch conversion you'll need to retrieve and install your own copy.

Last thought

You can also see this article as a standalone, XHTML-compliant document, which should validate.

Editor's Note: This is a revised version of an article that Wiggin originally published in Web Review on July 16, 1999.

Peter Wiggin is the manager of web development for the O'Reilly Network.
