What's the attraction to URLs in content?

by Uche Ogbuji

Rarely do I review XML design without seeing something like:


<spam>
<link>http://example.com</link>
</spam>


Putting URLs in element content seems to come naturally to people, despite the age-old convention from HTML:


<p>
<a href="http://example.com"/>
</p>


I've always disliked this, as I prefer to have URLs and IDs in attributes. I used to think that putting URLs in content was a manifestation of database-refugee XML, but I see it a lot even in carefully crafted formats.

19 Comments

Lance Lavandowska
2007-09-03 18:12:59
I tend to think of attributes as values that describe or enhance the content of the element. Thus the href attribute is the reference which provides further information for the content of the anchor element. And attributes shouldn't be used for values which may need to be double-decoded or any other such silliness: ASCII text only.
Josh
2007-09-03 18:18:14
I think id should have been an attribute since it doesn't add much of any semantic value (plus you can only have one of them per feed|entry).


The other side of attributes is the crappy way they handle namespaces. Unless you want to say <atom:feed atom:id=""> you can't be sure you're using an Atom id. "id" itself has issues with xml:id as an attribute, so you don't really want to use that for naming your id, and an id element just makes a lot of sense in that respect. It feels kludgy, but it is likely the best choice at the end of the day.


Also, it's important to point out that id is not a URL, it's a URI. There are slightly different semantics from URLs ...

Uche
2007-09-03 18:42:41
@Lance. Just remember that XML is very internationalized, and I think the idea of only having ASCII text in attributes is a huge step backwards. This, for example, would not permit IRIs. As for escaping, one just has to be aware of the limitations (and normalization) of the CDATA attribute type and all is well. See the article I referenced in the weblog post. Domain complexity of the data is more important than its lexical form when deciding between attributes and element content.
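
The normalization and non-ASCII points above can be checked with a small sketch. It assumes Python's standard xml.etree.ElementTree module (built on expat); the spam/link names reuse the post's example, and the attribute values are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the same two-line string stored once in an
# attribute and once in element content.
doc = '<spam link="line one\nline two"><link>line one\nline two</link></spam>'
root = ET.fromstring(doc)

# A literal newline in a CDATA attribute is normalized to a space by the parser.
attr_value = root.get("link")

# Element content keeps the newline as written.
text_value = root.find("link").text

# A character reference survives attribute-value normalization.
escaped = ET.fromstring('<spam link="line one&#10;line two"/>').get("link")

# Non-ASCII characters, as found in IRIs, are fine in attribute values.
iri = ET.fromstring('<spam link="http://example.com/caf\u00e9"/>').get("link")

print(repr(attr_value))
print(repr(text_value))
print(repr(escaped))
print(iri)
```

The only loss in the attribute form is literal whitespace, which matters little for IRIs since they do not contain raw line breaks or tabs.
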
Uche
2007-09-03 18:46:02
@Josh. Yes, my point is precisely that Atom id should have been an attribute of feed or entry. There is *no* namespace confusion: an unprefixed attribute is not in a namespace, and by convention takes on the semantics of its owner element. It is only when you qualify an attribute (the spec calls these "global attributes") that issues of where the semantics come from arise. So atom:feed/@id is perfectly clear and atom:feed/@atom:id would be IMHO baroque design.


And there is also *no* confusion with xml:id. xml:id is a global attribute with separate well-defined semantics. My hypothetical atom:feed/@id would be entirely locally defined in the Atom spec, and there would be no confusion whatsoever by non-broken processors. atom:feed/@xml:id does not work because an xml:id, according to the spec, must be an NCName, and IRIs are not valid NCNames. Note: I did always think this was perhaps a harsh limitation for xml:id (I would have gone with QName), but it does underscore the limited purpose of xml:id as a *document-scope* ID mechanism, and not a universal one.


Finally, if you want to be nit-picky, an Atom ID is an IRI, which is a superset of URI, which is a superset of URL. Every URL is a URI and an IRI. And yes, my recommendation holds that all forms of IRIs should go in attributes, unless there is good reason to have them otherwise.
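
The namespace claim above is easy to verify. The following is a minimal sketch assuming Python's standard xml.etree.ElementTree; note that the id attribute on feed is the hypothetical design under discussion, not part of the actual Atom format, and the tag: values are invented.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"

# Hypothetical feed carrying both an unprefixed id attribute and a
# namespace-qualified ("global") atom:id attribute.
doc = (
    '<atom:feed xmlns:atom="http://www.w3.org/2005/Atom" '
    'id="tag:example.com,2007:feed" '
    'atom:id="tag:example.com,2007:qualified"/>'
)
root = ET.fromstring(doc)

# The element is in the Atom namespace.
print(root.tag)

# The unprefixed attribute has no namespace; only the prefixed one does.
attribute_names = sorted(root.attrib)
print(attribute_names)

# A default namespace declaration does not apply to attributes either.
defaulted = ET.fromstring(
    '<feed xmlns="http://www.w3.org/2005/Atom" id="tag:example.com,2007:feed"/>'
)
default_names = list(defaulted.attrib)
print(defaulted.tag, default_names)
```

In both documents the plain id attribute is reported without any namespace, so its meaning has to come from its owner element, which is the convention described above.
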

Thomas Broyer
2007-09-03 23:32:11
@Uche. Josh is right in the sense that the AtomPub WG chose to use an atom:id element to avoid confusion with xml:id attributes and the like (e.g. HTML's @id): atom:id is not an ID (or xs:ID if you prefer) to reference an element inside an Atom document.


As an aside, @title in HTML is not "@hint" at all: when you extract a list of links or images from an HTML document, @title is very useful to have an out-of-context "description" for the link or image (which a link's text content wouldn't have in most cases). In other cases, such as when @title is used on form elements, then yes it could have been named @hint (and there are proposals for a @hint attribute on such elements in the HTML-WG/WHATWG)

Asbjørn Ulsberg
2007-09-04 00:41:03
I think you'll find the core of the attribute versus element discussion here. You'll also find some interesting opinions here. I think the final consensus was that since we all agreed that atom:id should be a URI, the ID couldn't be used as an XML ID or a DTD ID, so to avoid conflict, we chose an element.


It was also important that people didn't expect the URI they saw in examples to be dereferenceable (it isn't a URL), so having the ID in an element makes that point clearer too, in my opinion.

Lance Lavandowska
2007-09-04 04:06:53
Uche, you said "having ASCII text in attributes is a huge step backwards. This, for example, would not permit IRIs" which is exactly the point of why they shouldn't be in attributes.


Ideally there shouldn't be such a problem; practically there is. I've seen any number of XML documents broken because the people writing them (or writing the software that writes them) didn't properly treat their attribute data. 'ASCII' was really the wrong word; the international version of 'plain text' was what I was reaching for.

Lance Lavandowska
2007-09-04 04:09:02
"my recommendation holds that all forms of IRIs should go in attributes" - I haven't seen an explanation of your recommendation other than "I prefer it this way".
Uche
2007-09-04 06:09:58
@Thomas. OK, your point is clearer than the way Josh put it. As I gather now, it's not that there was a mistaken idea that there is a namespace problem or any strictly technical problem with atom:feed/@id, but that the WG wanted to avoid *conceptual* confusion that people might have because it would not have been (by declaration or convention) a DTD ID type. I can understand this, and maybe it's a special case to the preference for IRIs and IDs in attributes (honestly, I rather assumed this was the reason, which is why I made no fuss when I discovered atom:feed/atom:id), but IMO it's too great a loss to say that one should never use an ID called "id" for anything that is not semantically similar enough to a DTD ID. I prefer that tools and people just *learn* that atom:feed/@id is not a DTD ID, which is not that hard. People have to learn all sorts of things about Atom semantics to use it correctly, and this would give them no more pause than anything else.


But as I say in my linked article, all element vs. attribute conventions are just guidelines from which people may choose to differ, as long as they are really thinking about the design considerations.


As for html:a/@title, I say it's a hint because it's a hint for the *user*, and this is how it's used in almost every case (e.g. a browser tooltip). Even the use case you made up, of having a useful description when links are isolated from document context, serves as a hint. To me a title is a rich content construct, and should be expected, for example, to have markup of its own, which is why I think Atom got it right in how they modeled titles.

Uche
2007-09-04 06:19:36
@Asbjørn. Thanks. These links prove that the matter was very well thought out, and even if I disagree with the conclusion, careful consideration is all I can ask, of course.


As for the "don't make users think it's dereferenceable" point, I disagree. First of all, I've already seen, and had to correct, such confusion even in Atom's present form. Also, as I said to Thomas, it would be just a point of Atom semantics people need to learn, like all other matters.

Uche
2007-09-04 06:23:18
@Lance, ummm. IRIs *are* plain text (they're just not ASCII, thank goodness), so now you've really lost me. As for my reasoning, I didn't intend to reproduce in this weblog posting my full article. Please see the article, linked from the bottom of my weblog post. The discussion you want is in the section "Principle of readability", but please don't read that section isolated from the rest of the article.
Asbjørn Ulsberg
2007-09-04 06:31:57
Uche, I think the confusion over dereferenceable IDs is mitigated a bit by stuffing it in an element instead of an attribute. URI-like strings in attributes are dereferenceable almost "by default"; at least most developers seem to have this knee-jerk reaction.


I'm not claiming full "victory" here; URIs (including IRIs) will always be thought to be URLs by people who don't know the difference, but by not adding to the confusion by having it in an attribute, I'd say we've at least made a couple of developers think twice about it.

Josh Peters
2007-09-04 10:53:13
@Asbjørn: "URI-like strings in attributes are dereferenceable almost 'by default'; at least most developers seem to have this knee-jerk reaction."


Which is perfectly natural given the way most of us discovered URIs: HTML. Anytime I see "http://" I assume what follows is locatable. If I really want to say that something isn't available in a web browser then I'll use a "tag:"

Peter
2007-09-04 12:42:22
Attributes are kind of dumb anyway. What's the difference between the value of an element and the value of an attribute? The designers of XML should have left out attributes in the first place... who needs a second way to delimit a value, after all? It can't be about markup minimization - that is left off the table by those designers as their last goal, also a mistake. Lack of markup minimization is one of the big reasons people reject using XML and invent all kinds of other junk like microformats.
Taylor
2007-09-04 17:23:12
One reason the id thing comes up might be the special use of id= in XML and HTML. I even find myself doing just what you've shown, because I feel like id= is for somebody else...the XML compiler or something, it's not available for my use. With Atom we are indicating the universally unique ID of a feed element, not the id of an XML node...


elementid= ??? no, that stinks, so the reasoning goes, so we use ...and there you have it.


Here's another strange practice to comment on, the param/value pattern in XML...at Google: http://www.google.com/help/blogsearch/pinging_API.html


My personal pet peeve being the ... pattern. Somehow the concept of different URLs for different pages is well understood, but for methods we end up with a "uni" end point that handles everything.

Uche
2007-09-04 21:15:38
@Josh, quite right, tag URIs are almost always a better option for universal IDs.


@Peter, Ooh, the age-old "attributes shoulda been canned" flame-war. I've been in those wars enough and I'll leave that alone now. For me, I'm very happy XML retained attributes. And nice 2-for-1 flame war mentioning the lack of minimization in XML. Really, you had a 3-for-1 flame war with your trashing of microformats, but I actually agree with you on that one ;-)


@Taylor, I disagree that because many people use "id" as an ID type attribute, we can't use "id" for something else. Whatever their other many faults, namespaces mean never having to hold an NCName sacred :-) . Agreed about the other peeves you mentioned. Especially the old <property><name>age</name><value>21</value></property>. I call it "overweight CSV" and it's positively ghastly.
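
To make the "overweight CSV" complaint concrete, here is a minimal sketch that folds name/value property elements into plain attributes. It assumes Python's standard xml.etree.ElementTree; the person wrapper element and the city property are invented for illustration, and the approach only works when every property name is a valid XML attribute name and occurs once.

```python
import xml.etree.ElementTree as ET

# Hypothetical input in the name/value style quoted above.
verbose = (
    "<person>"
    "<property><name>age</name><value>21</value></property>"
    "<property><name>city</name><value>Boulder</value></property>"
    "</person>"
)

# Fold each name/value pair into an attribute on a new element.
compact = ET.Element("person")
for prop in ET.fromstring(verbose).findall("property"):
    compact.set(prop.findtext("name"), prop.findtext("value"))

# Serialize the attribute-based form for comparison with the input.
print(ET.tostring(compact, encoding="unicode"))
```

The attribute form carries the same two facts in a fraction of the markup, and the property names become part of the vocabulary instead of being buried in content.
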

Asbjørn Ulsberg
2007-09-05 01:54:45
I completely agree that "tag" URIs are much better for identifying things than an HTTP URI. However, since atom:id is a URI and not limited to "tag" URIs, you will find non-dereferenceable HTTP URIs as Atom IDs. I actually fought a bit over this, wanting to shrink the number of allowable URI schemes, but that battle was lost.


Regarding attributes in XML, you should read a bit about Erik Naggum's Enamel, which is extremely elegant and concise, but unfortunately not developed as an international standard with the support that, e.g., XML has.

Norman Walsh
2007-10-03 13:24:21
I'm coming late to the party, but I'll just say for the record that I think content was the right place for the atom:id. The ID is part of, an intrinsic part of, the content of the atom entry. If anything, the fact that it's a URI is incidental. Of course, making it a URI is also the right thing.
Uche
2007-10-04 06:22:12
@Norm, is atom:id really intrinsic content? I have accepted the other defenses, but I have a lot of trouble with this one. Just as a simple, empirical test, if it were truly intrinsic content I'd expect RSS readers and such to show the entry and feed IDs prominently, but they don't. The *links* for entry and feed are more prominent in most user software, and these are in attributes. To me intrinsic content is something the end user cares about.