XML parsing, state machines and UTF-32

by Michael Day

Hello, my name is Michael Day and I'm here to blog about XML, CSS, web standards, declarative programming, Unicode and other topics of interest to XML.com readers. Since a lengthy biography of me is not one of these topics, I shall limit myself to one sentence: I am the founder of YesLogic and the designer of Prince, an XML + CSS formatter and a great way of getting web content onto paper.

Now that we've got that out of the way, I would like to get straight into talking about XML parsing and Unicode encodings. In Prince we use libxml2 for all of our XML and HTML parsing needs, and have been very happy with it. However, it's always interesting to see new approaches to XML parsing that may offer greater speed or convenience than existing methods.

5 Comments

Devon Young
2007-03-07 21:41:23
I've never even heard of UTF-32 anywhere. Although, a quick Google returns about 371,000 results. Yikes! The question I have in mind suddenly is, WHY would anyone need UTF-32? I'll have to go see what it's good for. Personally, I probably won't ever use it.

Kurt Cagle
2007-03-07 22:15:41
Michael,


Welcome to XML.com and I hope to follow your columns with great pleasure.

bryan rasmussen
2007-03-09 02:37:28
I don't get the problem really, from the spec:


"All XML processors MUST accept the UTF-8 and UTF-16 encodings of Unicode 3.1"


It is true that most support more than that, but basically most serious projects have rules about exchanging XML in UTF-8 so as to increase interoperability (in my experience, at any rate).


So my suggestion is: support UTF-8 and UTF-16


Of course, also from the spec: "Although an XML processor is required to read only entities in the UTF-8 and UTF-16 encodings, it is recognized that other encodings are used around the world, and it may be desired for XML processors to read entities that use them." My suggestion is that, outside of publishing (which I suppose you have a lot to do with), entities don't seem to be used that much anymore. Unfortunate, but people seem to want the ability to refer to the external resource via markup.


David Roussel
2007-03-19 07:01:09
So this started off interesting, parsing XML with a state machine, but then you stopped on an artificial barrier.


Any more ruminations on this subject would be interesting.


Esp. using Raven to generate the state machine. Then you could write the state machine once and compile Java, C, and Ruby versions of it.

Michael Day
2007-03-21 17:52:27
Hi David,


I will have more to say on this topic at a later date. I first wanted to answer the question "can you generate an XML parser by applying an XSLT transform to a description of the XML grammar expressed in XML?"; it seems that the answer is "Yes, but XSLT really isn't very convenient for this sort of thing" so I've gone back to the drawing board for the time being.


By the way, what is Raven?