Disrule! special character entities without DTDs

by Rick Jelliffe

A really common problem facing people moving over from SGML to XML (and yes, there are industries such as aerospace that are still thoroughly SGML!) and from XML DTDs to XML Schemas (including RELAX NG, Schematron, XSD) is the unwillingness to forgo entity references for special characters. ISO defines entities for a whole lot of special characters: &copy; and so on.

In XML, you can define entities for special characters up in the internal subset of the prolog and use them in element values or attribute values. Or you can have entity declarations in external parameter entities that form part of the DTD. So if you get rid of external DTDs, you also get unresolved entity references in your information set.
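
To make the difference concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree parser. The element name `doc`, the attribute `owner`, and the helper `parse_or_error` are made up for illustration. It parses one document that declares `copy` in the internal subset and one that has no declaration at all. With this particular (expat-based, non-validating) parser, the undeclared reference is reported as a parse error; other parsers may handle it differently.

```python
import xml.etree.ElementTree as ET

# Entity declared in the internal subset of the prolog: the parser can
# resolve &copy; in both element content and attribute values.
WITH_SUBSET = (
    '<!DOCTYPE doc [<!ENTITY copy "&#169;">]>'
    '<doc owner="&copy; Example">&copy; 2006</doc>'
)

# The same document with the declaration removed (hypothetical example).
WITHOUT_SUBSET = '<doc owner="&copy; Example">&copy; 2006</doc>'


def parse_or_error(xml_text):
    """Return the parsed root element, or the ParseError if parsing fails."""
    try:
        return ET.fromstring(xml_text)
    except ET.ParseError as err:
        return err


resolved = parse_or_error(WITH_SUBSET)
unresolved = parse_or_error(WITHOUT_SUBSET)

print(resolved.text)              # element content with the entity expanded
print(resolved.get("owner"))      # attribute value with the entity expanded
print(type(unresolved).__name__)  # the undeclared reference fails to parse
```

The only thing that changes between the two inputs is the DOCTYPE line, which is exactly what goes missing when a document set drops its DTDs.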

6 Comments

Alex Brown
2006-05-31 01:18:13
> ISO DSDL Part 7


Part 8, actually. Martin is also giving a presentation on DSRL at Extreme Markup 2006 which should be stimulating ...

Stéphane Bortzmeyer
2006-06-02 01:32:16
Most ISO documents are not public. As far as standardization goes, this is certainly one of the worst organizations.


1) Will DSRL be public?
2) Why ISO and not a more open organization like OASIS, W3C or IETF?

Rick Jelliffe
2006-06-02 06:09:09
You can find the most recent drafts of ISO DSDL standards at www.dsdl.org, which is a site that SC 34 WG 1 maintains for public communication. It has a public maillist that I encourage you to follow, for DSDL-related technologies. There will be a new draft for DSRL sometime soon, when Martin Bryan gets back home from Korea and finishes polishing. He is also working on an open source implementation, by the way. I hope he will make a public beta available in the next month or so, but he is the one in charge of his timetable.


SC 34 has asked for ISO Schematron to be available free, and we are awaiting confirmation. I am here in Hong Kong, supposedly working on the ISO Schematron open source version, but instead I am writing this!


So I think ISO JTC1 SC34 WG1 actually is much better than OASIS or W3C in this regard: we have public drafts, public comments, open source software, and we are trying to make the final drafts free too.

Rick Jelliffe
2006-06-02 06:38:23
On the question, why ISO rather than OASIS or W3C?


Short answer: W3C has not come up with a solution in 10 years and is not working on one AFAIK. But people need a solution, and publishing and SGML-related systems are our scope at the ISO working group on document languages.


Long answer: Standards bodies (IEEE, ISO, IEC, W3C, IETF, OASIS, Ecma, etc) all have different areas of interest and different procedures that make them practical and useful in different circumstances. The head of Ecma was telling me over drinks the other night that they regularly put out rival standards for optical disks, because the technologies are obsolete in 9 months. So Ecma standards are for fast-moving technologies.


The ISO process is really slow; it is difficult for anything to take less than a year at fastest, and often it will take multiple years. But in some standards bodies you can stack the committees with friends; in ISO there is one vote per nation (well, registered nations) which makes it very difficult for one company to dominate even if it has a community of friends helping. This is why I think MS and Sun and IBM find ISO distasteful: they usually cannot dominate it.


A lot of anti-standards sentiment, and anti-ISO sentiment, appears to come out of big business. Sometimes they cast the standards bodies as the establishment, rather than the radicals they are. A favourite technique is to water down the term "standard" to mean almost nothing.


ISO SC34 WG1 is highly aligned with the publishing industry: Martin Bryan was once a typesetter who actually worked with Fournier's metal type IIRC, for example. By contrast, W3C has very little interest in publishing (with the notable exception of Liam and the XSL-FO people.) So it would be surprising for W3C to do anything fundamental that addresses the needs of publishers, such as the issue of how to have entities for special characters at the same time as XML Schemas. But not surprising for SC34 to do something in this area, since it relates to publishing and since no-one else wants to provide a solution. I hope there is no view that SC34 is somehow trying to do something that the W3C had any prospect of doing! On the contrary, we want to defer to industry consortia like W3C as much as possible. (For example, we stopped our pipeline work in order to let the W3C get their approach developed.)

David Carlisle
2006-06-05 04:50:15
Is this a new feature not yet in the draft standard?
The description of entity mapping at
http://www.dsdl.org/dsdl-8.pdf
doesn't seem to have this.
It (and the tutorial) has mappings from characters/entities to entity names in other sets, but the input still needs to be well formed, so it is mapping _declared_ entity references to differently declared entity references, rather than mapping undeclared entity refs. Especially as it's XSLT that implements the mapping in the reference implementation?


Anthony B. Coates
2006-07-21 00:49:59
For cases when the HTML range of entities is sufficient, you could try the 'xmlchar' approach that Zarella Rendon & I created some years ago.


http://xmlchar.sourceforge.net/


Cheers, Tony.