Re-starting the HTML Engine

by Kurt Cagle

A couple of weeks ago, the W3C made an announcement that caught a great number of people by surprise. After nearly a decade of inactivity, the HTML working group was being restarted, in order to handle the fairly significant amount of development that has occurred on top of the HTML standard since HTML 4.3 became the last formal HTML standard prior to the introduction of XHTML.

I have to admit to some qualms about seeing this. I don't doubt that it is needed - XHTML's adoption has been comparatively slow because of the legacy base of HTML out there, the introduction of AJAX has shifted the balance of power to imperative scripting, and there is a growing realization that the namespace issues dividing HTML and XHTML are beginning to tear the standards apart.

The question, however, comes back to the role that XML ends up playing in all of this. HTML has its own DOM and can in fact be treated as quasi-XML, but in most cases it demonstrably isn't well-formed XML. For those of us who have been pushing XHTML adoption in industry, this is going to be seen as a fairly major step backwards, as it has the potential to make browser developers decide that perhaps incorporating XHTML support isn't that big of an issue, and can be pushed off for a release or ten.


8 Comments

Jim
2007-03-25 13:49:54
> HTML 4.3 became the last formal HTML standard


There's no such thing as HTML 4.3. The last HTML specification was HTML 4.01.


> singleton attributes with no corresponding value


"Singleton" (minimised) attributes have a value. They *only* have a value - the difference between them and normal attributes is that they don't have a *name*.


> the lower case preferred notation of XHTML


Lowercase isn't "preferred", it's mandatory.


> It still becomes the responsibility of browser creators to update their DTDs and conformance engines


What on earth are you talking about? Browsers don't have anything remotely resembling "conformance engines".


> Namespaces should not necessarily be unique to XML - there's nothing in fact in the original 1.0 specification that limits them from being applied to an HTML5


The original specification is called "Namespaces in XML", continually refers to the documents being XML documents, and relies on the XML 1.0 specification for various definitions.


You could argue that the *design* doesn't preclude an HTML 5 application, but the *specification* does.

Kurt Cagle
2007-03-25 14:06:02
>> HTML 4.3 became the last formal HTML standard


> There's no such thing as HTML 4.3. The last HTML specification was HTML 4.01.


Yup - I've been in XHTML too long. Thanks for the correction.


> the lower case preferred notation of XHTML


The lower-case usage in XHTML is mandatory, in XML it is generally preferred over upper case. When I wrote this originally I had XML rather than XHTML there, then corrected the term for clarification but not the adjective.


> What on earth are you talking about? Browsers don't have anything remotely resembling "conformance engines".


Actually most browsers do have the ability to determine whether an HTML document is valid or not, most just don't expose that functionality to the user. I know directly that both Mozilla and IE7 do, and I believe that Opera probably does. I don't know about Safari or Konqueror.


> Namespaces should not necessarily be unique to XML - there's nothing in fact in the original 1.0 specification that limits them from being applied to an HTML5


I don't think there's any exclusionary language specifically targeting the HTML specification here. Yes, namespaces are XML artifacts, perhaps one of the key XML artifacts. I'll accept your assessment on the design vs. specification, but realistically what I see is that in order for there to be some form of reconciliation between HTML and XHTML, redefining namespaces so that they recognize non XML-structures becomes necessary. There's no real way around this that I can see.


Jim
2007-03-25 15:30:29
> Actually most browsers do have the ability to determine whether an HTML document is valid or not, most just don't expose that functionality to the user. I know directly that both Mozilla and IE7 do


It sounds like you are confusing validity with well-formedness. I know that Mozilla uses expat, which is a non-validating parser.
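
To make the distinction concrete, here is a minimal sketch in Python, whose standard `xml.etree.ElementTree` module is also built on expat. The two sample documents and the helper name `is_well_formed` are made up for illustration. A non-validating parser accepts any well-formed document, even one that breaks the XHTML content model, and only rejects markup that is not well-formed:

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def is_well_formed(text):
    """Return True if text parses as XML; no DTD or schema is consulted."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Well-formed, but not valid XHTML: <li> sits directly inside <body>
# and there is no <head> at all.
invalid_but_well_formed = (
    '<html xmlns="%s"><body><li>orphan item</li></body></html>' % XHTML_NS
)

# Typical HTML tag soup: <br> is never closed, so this is not well-formed XML.
tag_soup = "<html><body><p>unclosed<br></p></body></html>"

print(is_well_formed(invalid_but_well_formed))  # accepted by the parser
print(is_well_formed(tag_soup))                 # rejected by the parser
```

Checking validity would take a separate step against a DTD or schema, which this parser never performs.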

inquirydog
2007-03-26 17:47:38
I love xml and cringe at how messy real world html pages are (I have a lot of experience with this topic), but I feel that xml namespaces will hinder the public from really ever seriously using xhtml. On the couple of occasions I have written, for instance, an xforms document, I had to waste time figuring out what namespaces to declare. Furthermore, even once I figure out what to use I often have to fight with tools that expect slightly different versions.



While I would welcome umbrella namespaces, even one namespace is probably too much for the average population which can barely just type out "Fred's Homepage".


I suspect that most xml users would cringe to hear me suggest this, but I would really like to see default namespaces associated with (please don't cringe!) filename extensions. Yes, I know that this is completely inconsistent with the way xml works (ie- xml data need not even have a 1-1 relationship with a file), but it is really consistent with the way xml is used. And furthermore it is consistent with the way most files are used (ie- if the filename ends with .jpg, I know what type of data is in it).

Kurt Cagle
2007-03-26 19:02:55
Some years ago, I was at an XML conference, killing some time after having finished giving a presentation on XSLT, and wandered in to one session just in time to hear the brilliant Ken Holman ask a fairly packed audience "Now who here doesn't understand namespaces?" Just for grins, I raised my hand high and he laughed when I replied "... Been trying to understand them for years, and they just don't get any easier!"


Namespaces are HARD. They represent partitions of a document object model, their verbosity makes them hard to remember, the rules governing mixed namespace manipulation are confusing and cumbersome, and they are an endless source of confusion and error.


For all that, namespaces have a very definite, very important role, and any solution that attempts to eliminate namespaces inevitably ends up becoming unmanageable unless the domain involved is VERY small and the rules are very clear. It's one of the biggest uglinesses about XML, but from what I've seen, any other alternative in the end tends to differ only in the syntax used to represent that partitioning.


On the other hand, I think that there is a second factor coming into play here. Fewer and fewer sites are actually accepting "live" XML anymore - they use BBS notation, or rich text editors, or WYSIWYG applications in order to build their web pages. The ones that are building these applications, however, should be familiar with namespaces, because they represent to XML what class libraries or packages represent to the typical Java or C++ developer - a means of organizing coherent and connected information under a single interface, a means of differentiating between namespace collisions when necessary and a way of packaging this content for distribution.
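
As a small illustration of the collision point, assuming Python's standard `xml.etree.ElementTree` and a made-up compound document: XHTML and SVG both define an element called `title`, and the namespace is what keeps the two apart, much as a package name separates two classes with the same name.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"
SVG = "http://www.w3.org/2000/svg"

# Illustrative compound document mixing two vocabularies.
doc = """<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:svg="http://www.w3.org/2000/svg">
  <head><title>Page title</title></head>
  <body>
    <svg:svg><svg:title>Drawing title</svg:title></svg:svg>
  </body>
</html>"""

root = ET.fromstring(doc)

# ElementTree stores every tag as "{namespace-uri}local-name", so the two
# "title" elements never collide even though their local names match.
titles = {}
for el in root.iter():
    ns, _, local = el.tag[1:].partition("}")
    if local == "title":
        titles[ns] = el.text

print(titles[XHTML])  # the page's title
print(titles[SVG])    # the drawing's title
```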


That's why I'm always dubious about what I call the Aunt Millie argument: "HTML must be so simple that your Aunt Millie could write it". Nope ... unless Aunt Millie is a web designer and developer, she should have no reason to write naked HTML. She should use a WYSIWYG package that will let her do what she wants in a nice pretty fashion, and that TOOL should be completely conversant with namespaces. If it's not, that represents poor design and laziness on the part of tool designers.

inquirydog
2007-03-27 17:13:18
Well, I can see we are far apart on this one, but I love a good debate, so I might as well post again :)


I was not suggesting eliminating namespaces (at least in my previous post, although I perhaps could see myself doing so), but rather associating default namespaces with file extensions so that the user would not have to type out the declaration in certain cases. This gets around the Aunt Millie thing without losing namespaces. And for all the computer generated stuff, full namespaces suffice. For xslt, you still need to declare the resultant namespace, but the xslt gang are big boys and girls, they can handle it.
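
A rough sketch of how such a scheme might behave, in Python. Everything here is hypothetical: no standard maps extensions to namespaces, and the `DEFAULT_NAMESPACES` table and the `parse_with_default_namespace` helper are invented for illustration. Elements without a namespace pick up the default for the file's extension, while explicitly declared namespaces are left alone.

```python
import os
import xml.etree.ElementTree as ET

# Hypothetical table: no specification defines this mapping.
DEFAULT_NAMESPACES = {
    ".xhtml": "http://www.w3.org/1999/xhtml",
    ".svg": "http://www.w3.org/2000/svg",
}


def parse_with_default_namespace(filename, text):
    """Parse text, giving un-namespaced elements the extension's default."""
    root = ET.fromstring(text)
    ext = os.path.splitext(filename)[1].lower()
    default_ns = DEFAULT_NAMESPACES.get(ext)
    if default_ns is not None:
        for el in root.iter():
            # Tags that already carry a namespace look like "{uri}name".
            if not el.tag.startswith("{"):
                el.tag = "{%s}%s" % (default_ns, el.tag)
    return root


root = parse_with_default_namespace(
    "freds-homepage.xhtml",
    "<html><head><title>Fred's Homepage</title></head><body/></html>",
)
print(root.tag)
```

The author never types a declaration, yet downstream tools still see fully namespaced elements.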


The thing is, however, I secretly really do want to banish the over-engineered namespace. I mean, come on, how many different xml dialects of global importance have we seen in the last few years? Saying 'hundreds' would be generous. You might argue that in the future we would see millions more as more users start using xml, and we will only constrain ourselves without namespaces.... Yet in the real world the simple filespace has been more than adequate in differentiating every file type that has ever been made. I've never seen any confusion between jpg and doc, nor have I ever seen two individuals battle it out because they both wanted the ipx extension (now I suppose someone will find an example of this to prove me wrong). Of course there are probably millions of minor xml dialects that are used in different projects, much like there are probably millions of proprietary small binary formats with individual file extensions, which also haven't derailed the computer industry.


The other thing I disagree with is that xml can be complicated because eventually we all would like it to be generated by software. First of all, this never happens.... XSLT was supposed to be generated by a more human readable script, but these days pretty much everyone writes XSLT directly. A few years back, the "XML editor" was the hot thing, but these days one of the most popular xml editors is emacs (with nxml). I even fell for this one a couple of years ago when I set up SOAP, and thought that the complicated internals really didn't matter as long as the tools took care of everything. Guess what- it turned out to be a leaky abstraction, and I ended up spending hours debugging by staring at complicated ethereal data, or trying to figure out that the problem I had now was due to some strange wsdl issue from a while ago. We finally got rid of the whole thing for a manageable REST solution.


OK, I could go on, but that is all for now.



Kurt Cagle
2007-03-27 18:41:28
inquirydog,


I don't think we're that far apart, though let me respond to your comments.


The problem with filename extensions (which are of course themselves a form of namespace prefix - or suffix in this case) is that they favor the early adopters. For instance, suppose that I'm working on a document. In this case "doc" is a perfectly reasonable extension, right? However, Microsoft grabbed that one early, and if I put a doc extension on the end of my file, then unless I get deep into the bowels of the OS, chances are really good that Windows will read my text file as a(n extremely) corrupted Microsoft Word file.


The second problem with the file extension is that it in fact assumes that you're dealing with a file - but what if I have a stream of content generated from a process coming off the web? Chances are the "extension" in this case will be .asp or .php or .jsp or ... you get the idea. These COULD be generating html content (and in probably 90% of the cases would be), but they could also be generating everything else.


So this is where mime-types step in, right? Well, sort of. The problem with mime-types and content-types is that someone still needs to standardize on them, in most cases the IETF. It can take months or years (or decades) for such a mime-type to be registered, and until then you just end up hoping that someone else doesn't choose the same mime-type and make your software break.
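
To illustrate why the extension alone says so little about a generated stream, here is a minimal Python sketch. The URL and header value are invented, and `media_type` and `url_extension` are hypothetical helpers built on standard-library calls. The extension only names the script that produced the response; the declared Content-Type is what describes the payload.

```python
from email.message import EmailMessage
from posixpath import splitext
from urllib.parse import urlparse


def media_type(content_type_header):
    """Extract the bare media type from a Content-Type header value."""
    msg = EmailMessage()
    msg["Content-Type"] = content_type_header
    return msg.get_content_type()


def url_extension(url):
    """Return the extension of the path part of a URL, ignoring the query."""
    return splitext(urlparse(url).path)[1]


# Invented example: a PHP script that happens to emit XHTML.
url = "http://example.org/report.php?id=7"
header = "application/xhtml+xml; charset=utf-8"

print(url_extension(url))   # tells you which script ran
print(media_type(header))   # tells you what actually came back
```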


The problem I've found in general is that there are CONSTANT struggles over filename extensions, just like there are constant struggles over good domain names. JPG has no obvious competition because the letters do not have any immediate semantic associations. DOC, on the other hand, has a huge semantic association, and there are any number of other word processing vendors that would have LOVED to use DOC rather than ODF, ODT, SXW and so forth.


Now, while the web is "slowly" waking up to XML, XML has been used internally by companies for expressing objects for quite some time, and it makes sense to apply your own taxonomy to that namespace - after all, it's meant to solve a local problem. However, there are currently hundreds (if not thousands) of invoice specifications out there, most of which are subtly or in many cases dramatically different.


XML presents a real problem there as well, without the concept of namespace. If I create a suffix to my file that is unique to me, how will programs know what to do with a file of that suffix? If the name is uncommon, you can manually make a change to the operating system to override the "default" handler for that namespace, but this also means that there's no clean way to identify that the document, in addition to being an invoice, is also an XML instance.


Issues like this crop up all the time when dealing with differing interactive domains (which is what XML is ultimately all about). That's a big part of the reason why namespaces of some sort are necessary, and why I really don't see them disappearing from XML anytime soon.


On the "XML is machine generated" assertion, I'll still stand by what I say for the relevant population. A programmer is not Aunt Mildred (unless Aunt Mildred also programs Unix boxes in her spare time, of course). Chances are that if you're a sufficiently capable programmer to write your SOAP and WSDL documents by hand, then you know what you need to know about namespaces. Most Aunt Mildreds can't write HTML at all, save perhaps the very occasional <i> or <b> tags. They aren't trained to think in markup terms, they don't understand the rules, and while they may be domain experts, they aren't domain experts in XML or HTML ... and THEY will almost certainly use a WYSIWYG HTML editor to create content, most often not even really realizing what they're doing.


Personally, I tend to think that this argument is typically used by lazy programmers (not putting you in this category, mind you - you've obviously thought through these issues) who find working with XML in any form beyond tags in strings to be too confusing and demanding. They seem to not understand why:


doc = "<html><body><h1>"+myText;
doc += "</h1><p>" + "Here's some <b>text</b></p>";
doc += "<div>"+myFunc(text)+"</div>";
doc += "</body></html>";


is such an incredibly bad way of dealing with either HTML or XML markup.
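
For contrast, a rough sketch of the tree-building alternative, here in Python with the standard `xml.etree.ElementTree` module. The `build_page` helper and its arguments are invented for illustration. Because the markup is assembled as a tree and serialized once, the tags cannot end up unbalanced, and special characters in the inserted text are escaped by the serializer instead of being pasted into the markup raw.

```python
import xml.etree.ElementTree as ET


def build_page(heading, body_text):
    """Build the same page as a tree, then serialize it in one step."""
    html = ET.Element("html")
    body = ET.SubElement(html, "body")
    ET.SubElement(body, "h1").text = heading
    p = ET.SubElement(body, "p")
    p.text = "Here's some "
    ET.SubElement(p, "b").text = "text"
    ET.SubElement(body, "div").text = body_text
    return ET.tostring(html, encoding="unicode")


# "&" and "<" would corrupt the string-concatenated version; here the
# serializer escapes them.
page = build_page("Tom & Jerry", "1 < 2")
print(page)
```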


Now, does that mean I'm a purist with regard to namespaces? No - they CAN be a pain in the butt to deal with, especially when working with compound documents, they make things like XPath a real nightmare, and they do contribute to the slower adoption rate of XML compared to other languages. I'm not saying they are panaceas. I just think that most other solutions are worse.
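
The XPath pain is easy to reproduce. A minimal sketch, assuming Python's `xml.etree.ElementTree` and its limited XPath support, with a made-up document: once a default namespace is declared, the path that looks right silently matches nothing, and the query has to bind a prefix of its own to the same URI.

```python
import xml.etree.ElementTree as ET

doc = ('<html xmlns="http://www.w3.org/1999/xhtml">'
       '<body><p>Hello</p></body></html>')
root = ET.fromstring(doc)

# Looks right, finds nothing: "body" here means an element in NO namespace.
print(root.find("body"))

# The query must map a prefix ("x" is arbitrary) to the document's URI,
# even though the document itself never uses a prefix.
ns = {"x": "http://www.w3.org/1999/xhtml"}
print(root.find("x:body/x:p", ns).text)
```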


The reality is that XML is hard - declarative programming in general is harder than imperative, because it involves the ability to work with abstractions at a level where most programmers normally don't venture. A good IDE can help (and can turn namespaces from limitations into assets) but even some of the best are only just reaching a level of sophistication that most programmers have expected of their imperative IDEs for years.

francis
2008-03-13 11:34:59
12% html engine start after that stop