XHTML: The Clean Code Solution

XHTML Part 3: What's New

In order for certain HTML elements to be considered valid in XML 1.0 and XHTML 1.0, they need to be written differently. These are the new requirements.

All documents must have a doctype declaration

Ever wonder what these odd lines of code at the top of many HTML documents are?

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">

They are doctype declarations. One must appear on every XHTML page, directly before the <html> tag. The purpose of the doctype declaration is to declare that the document adheres to a specified DTD (document type definition).

Your XHTML documents must reference one of the three XHTML DTDs: Strict, Transitional, or Frameset. The XHTML DTDs are currently approximations of the HTML 4.0 DTDs. Since XHTML is still a W3C working draft, it may be modified before XHTML becomes a W3C recommendation.

The strict doctype declaration--

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "">

--is used when you're doing all of your formatting in Cascading Style Sheets (CSS). In other words, you aren't using <font> and <table> tags to control how the browser displays your documents.

The transitional doctype declaration--

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "">

--is used when you need to use presentational markup in your document. Most of us will be using the transitional DTD for quite some time, because we don't want to limit our audience to users with browsers that support CSS.

The frameset declaration--

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN" "">

--is used when your documents have frames.

Since the DTD defines what's legal and what isn't, you can validate your document against the definition. There are many programs that validate documents; one of the (hopefully) more reliable is the W3C's own validator. It has a number of options, but one of the simplest ways to use it is to put a link to it on your web page. Clicking that link will validate your page. Somewhat ironically, because of the way this page is published, it won't validate.

The root element of the document must be <html> and must designate the XHTML 1.0 namespace

This means you can't have anything before the opening <html> tag (except the doctype declaration described above and an optional XML processing instruction) and you can't have anything after the closing </html> tag.

But the difference here is that you need to include a new namespace attribute xmlns in the opening HTML tag. The namespace attribute defines which namespace the document uses--

<html xmlns="http://www.w3.org/TR/xhtml1">

--(that's the letters XHTML and the number 1).

What's a namespace you ask? The W3C replies: "An XML namespace is a collection of names, identified by a URI [uniform resource identifier] reference, which are used in XML documents as element types and attribute names." In other words, the XHTML namespace is the list of tags used by XHTML, and while these are also identified in the DTD, the namespace is used to ensure that names used by one DTD don't conflict with user-defined names or those defined in other DTDs.
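As a rough illustration of what the xmlns attribute does, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The sample document is hypothetical, and the namespace URI simply follows this article's example. An XML parser reports every element name qualified by the namespace it belongs to:

```python
import xml.etree.ElementTree as ET

# Namespace URI following this article's example (an assumption of the sketch).
XHTML_NS = "http://www.w3.org/TR/xhtml1"

# A tiny, hypothetical document that declares the namespace on <html>.
doc = (
    '<html xmlns="' + XHTML_NS + '">'
    "<head><title>Sample page</title></head>"
    "<body><p>Hello</p></body>"
    "</html>"
)

root = ET.fromstring(doc)

# ElementTree writes qualified names as "{namespace-uri}local-name".
print(root.tag)

# Child elements inherit the default namespace declared on <html>.
title = root.find("{%s}head/{%s}title" % (XHTML_NS, XHTML_NS))
print(title.text)
```

The point of the sketch is only that the parser never sees a bare "html" or "title": each name carries its namespace with it.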

Namespaces are new to XML, and the idea behind them is that different types of documents will have different, and often multiple, namespaces. For example, a <title> element within the XHTML namespace can only refer to the document title. But another namespace, let's call it "book," might use <title> to refer to the title of the book.

In XML there's a method for combining multiple namespaces in a single document which allows you to have two <title> tags, for example, and not run into problems. The W3C is still working on bringing this functionality into compliance with strict XHTML.
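To make the two-<title> idea concrete, here is a small sketch in Python, again using xml.etree.ElementTree. The "book" namespace URI and the document content are invented for illustration, and the sketch shows plain XML namespace handling, not anything XHTML validators are known to accept:

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/TR/xhtml1"   # follows this article's example
BOOK_NS = "http://example.com/ns/book"      # hypothetical namespace URI

# The "book:" prefix binds <book:title> to the second namespace.
doc = (
    '<html xmlns="' + XHTML_NS + '" xmlns:book="' + BOOK_NS + '">'
    "<head><title>Reading list</title></head>"
    "<body><book:title>Moby-Dick</book:title></body>"
    "</html>"
)

root = ET.fromstring(doc)

# Collect every element whose local name is "title", keyed by namespace.
titles = {}
for element in root.iter():
    namespace, _, local_name = element.tag[1:].partition("}")
    if local_name == "title":
        titles[namespace] = element.text

for namespace, text in titles.items():
    print(namespace, "->", text)
```

Both elements are called "title", but because each belongs to a different namespace the parser keeps them apart.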

If that isn't confusing enough, the W3C is working on replacing DTDs in XML with something called XML Schemas. This process is just in the working draft stage, so DTDs will be with us for a while, but it's worth noting that it's on the horizon.

Processing Instructions

In the last section we casually mentioned the XML processing instruction (PI), and it's worth giving it a more formal introduction. The PI is optionally the first item in any XML document. It looks like this:

<?xml version="1.0" encoding="UTF-8"?>

In this example, it does two things. It tells you (and any programs parsing the document) what version of XML the document is based on, and it declares the character encoding that the document is using. The PI is rendered as visible text in some HTML browsers, so you may want to leave it off if you can, and you can if the document uses only the default character encodings, UTF-8 or UTF-16.
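The effect of the encoding declaration can be seen with a short sketch in Python's xml.etree.ElementTree. The sample text is hypothetical; the sketch assumes a document saved as ISO-8859-1, which is not one of the default encodings, so the parser needs the processing instruction to decode it:

```python
import xml.etree.ElementTree as ET

# "café" encoded as ISO-8859-1: the byte 0xE9 is not valid UTF-8 on its own.
body = "<p>caf\u00e9</p>".encode("iso-8859-1")

# With the declaration, the parser knows how to decode the bytes.
declared = b'<?xml version="1.0" encoding="ISO-8859-1"?>' + body
text_with_declaration = ET.fromstring(declared).text

# Without it, the parser assumes UTF-8 and rejects the stray byte.
try:
    ET.fromstring(body)
    parsed_without_declaration = True
except ET.ParseError:
    parsed_without_declaration = False

print(text_with_declaration, parsed_without_declaration)
```

A document that really is UTF-8 would parse either way, which is why the PI can be omitted in that case.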

Empty elements must be terminated

An empty element, as you might guess, doesn't contain anything. So while a <p></p> tag contains a paragraph, and a <b></b> tag contains text to be bolded, a <br> tag is empty. In other words, it has no real beginning and end. Other tags like this are <hr> and <img src="image.gif">.

In XHTML, these tags need to be terminated. To do this, you might think that you just add a closing </br> to the opening <br>. While this is valid in XML, it doesn't render properly in all browsers. Instead, XHTML recommends the use of a modified empty element: <br />. This is also sometimes called a self-terminating element or a terminated empty tag. Note the space after the element text. This helps to make the XHTML cross-browser compatible.

HTML                       XHTML
<br>                       <br />
<hr>                       <hr />
<img src="image.gif">      <img src="image.gif" />

This also applies to most form elements.
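One way to see why termination matters is to hand the markup to an XML parser. The following is a minimal sketch in Python with xml.etree.ElementTree; the helper name is_well_formed and the sample strings are made up for illustration, and the check covers well-formedness only, not validity against a DTD:

```python
import xml.etree.ElementTree as ET

def is_well_formed(markup):
    """Return True if an XML parser accepts the markup (hypothetical helper)."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

html_style = "<p>line one<br>line two</p>"        # <br> is never closed
xhtml_style = "<p>line one<br />line two</p>"     # terminated empty element
long_form = "<p>line one<br></br>line two</p>"    # legal XML, poor browser support

for sample in (html_style, xhtml_style, long_form):
    print(is_well_formed(sample), sample)
```

To the parser, <br /> and <br></br> are the same empty element; the shorter form is preferred only because of how older browsers render it.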

Attribute value pairs cannot be minimized

An attribute is said to be minimized when it is written as a bare name with no value, which HTML allows when the attribute has only one possible value. For example, in the form element <option>--

<option value="somevalue" selected>

--the attribute "selected" has been minimized. Its mere existence in the element indicates to the HTML browser that the option should be displayed as selected. In XHTML, this isn't allowed. Instead, XHTML wants you to write minimized attributes as if they had values:

<option value="somevalue" selected="selected">

The same rule applies to <input type="radio">, <input type="checkbox">, and <dl> elements among others:

HTML                                 XHTML
<input type="radio" ... checked>     <input type="radio" ... checked="checked" />
<input type="checkbox" ... checked>  <input type="checkbox" ... checked="checked" />
<dl compact>                         <dl compact="compact">
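The rule is easy to confirm with an XML parser. This is a small sketch in Python using xml.etree.ElementTree; the option text is invented, and the check is for well-formedness only:

```python
import xml.etree.ElementTree as ET

minimized = '<option value="somevalue" selected>Some value</option>'
expanded = '<option value="somevalue" selected="selected">Some value</option>'

# A bare attribute name is not well-formed XML, so parsing fails.
try:
    ET.fromstring(minimized)
    minimized_ok = True
except ET.ParseError:
    minimized_ok = False

# Written out in full, the attribute is an ordinary name/value pair.
attributes = ET.fromstring(expanded).attrib

print(minimized_ok, attributes)
```

The repeated value looks redundant, but it gives the attribute the name="value" shape that every XML attribute must have.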

<script> and <style> elements must be marked as CDATA sections

If you are putting <script></script> or <style></style> code anywhere in your document, you need to wrap the content of those elements in a CDATA section. If you don't, characters like < in your code will be treated as the beginning of an element. The main purpose for CDATA sections is to ignore characters that would otherwise be regarded as markup. The only delimiter that is recognized in a CDATA is the "]]>" string which ends the CDATA section.

This does not apply if you are using <script language="JavaScript" src="/sourcecode.js"></script> to pull your script code off the server, or if you're using <link href="/stylesheet.css" /> to load an external CSS file. This only applies when you're putting code inside these elements:


<script language="JavaScript">
document.write("<h2>Table of Factorials</h2>");
for(i = 1, fact = 1; i < 10; i++, fact *= i) {
      document.write(i + "! = " + fact);
}
// Code courtesy of JavaScript the Definitive Guide
</script>

<script language="JavaScript">
<!-- <![CDATA[
document.write("<h2>Table of Factorials</h2>");
for(i = 1, fact = 1; i < 10; i++, fact *= i) {
      document.write(i + "! = " + fact);
      document.write("<br />");
}
// Code courtesy of JavaScript the Definitive Guide
// ]]> -->
</script>

Note how we had to comment out the CDATA wrappers to avoid JavaScript errors (thanks to James Eberhardt for pointing this out). This is well-formed and validates as strict XHTML, with one caveat: the XML spec allows parsers to strip comments and their contents completely. If you are relying on a script to run after parsing, you may want to double-check the behavior of your parser. As you may have determined by now, the easy alternative to using the CDATA wrapper is to use external script and style sheet documents.
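Both the need for CDATA and the comment caveat can be checked with a short Python sketch using xml.etree.ElementTree. The one-line script body is a simplified stand-in for the factorial example, and the behavior shown for comments is that of this particular parser, which happens to discard them by default:

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical script body containing a "<" character.
script_body = "for (i = 1; i < 10; i++) { document.write(i); }"

bare = "<script>" + script_body + "</script>"
wrapped = "<script><![CDATA[" + script_body + "]]></script>"
commented = "<script><!-- <![CDATA[\n" + script_body + "\n// ]]> --></script>"

# Unprotected, the "<" is read as the start of a tag and parsing fails.
try:
    ET.fromstring(bare)
    bare_ok = True
except ET.ParseError:
    bare_ok = False

# Inside a CDATA section the script text comes through untouched.
wrapped_text = ET.fromstring(wrapped).text

# Inside a comment, this parser drops the script along with the comment.
commented_text = ET.fromstring(commented).text

print(bare_ok, wrapped_text == script_body, repr(commented_text))
```

Other parsers may keep comments, so the last result is an illustration of the caveat, not a guarantee about every tool.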

Using Ampersands in Attribute Values

One minor but important item to close with. When an attribute value contains an ampersand, it must be expressed as the character entity reference &amp;. For example, in a URL that refers to a CGI script that takes parameters:

<a href="
<a href="
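As an illustration of the rule, here is a minimal Python sketch using xml.etree.ElementTree and xml.sax.saxutils.quoteattr from the standard library. The URL is hypothetical and stands in for any CGI script that takes more than one parameter:

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

# Hypothetical URL with two query parameters joined by an ampersand.
url = "http://example.com/cgi-bin/search.cgi?q=xhtml&page=2"

raw_link = '<a href="' + url + '">Search</a>'

# quoteattr() escapes the ampersand and adds the surrounding quotes.
escaped_link = "<a href=" + quoteattr(url) + ">Search</a>"

# The raw ampersand looks like the start of an entity reference and fails.
try:
    ET.fromstring(raw_link)
    raw_ok = True
except ET.ParseError:
    raw_ok = False

# The escaped form parses, and the parser hands back the original URL.
parsed_href = ET.fromstring(escaped_link).get("href")

print(escaped_link)
print(raw_ok, parsed_href == url)
```

The &amp; exists only in the source markup; anything that reads the attribute after parsing sees a plain ampersand again.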

Converting HTML to XHTML

These aren't all the differences, but I think they cover most of the ones you're likely to run into. More details about conversion can be found in the XHTML 1.0 Specification. Of particular interest is the HTML Compatibility Guidelines section of that document, which points out some of the fine details of converting from HTML to XHTML.

To help you convert documents from HTML to XHTML, the W3C suggests using HTML Tidy, by Dave Raggett. HTML Tidy is a free utility originally designed to clean up HTML markup errors and reformat HTML code for legibility. It has been adapted by the author to do a wide range of things to any HTML file, including converting from HTML to XHTML.

I've written a simple Web front-end to HTML Tidy, specifically for processing individual HTML documents that are accessible via the Web. If you need to do batch conversion you'll need to retrieve and install your own copy.

Last thought

You can also see this article as a standalone, XHTML-compliant document, which should validate.

Editor's Note: This is a revised version of an article that Wiggin originally published in Web Review on July 16, 1999.

Peter Wiggin is an independent software developer specializing in Web technologies.
