Once again: no excuses to ignore i18n in XML

by Uche Ogbuji

Related link: http://www.javareport.com/article.asp?id=9797



I think the most pervasive problem in XML adoption is ignorance and even wilful sabotage of the international foundation on which XML is built. In several recent incidents, both in my consulting work and in my OSS/community work, I have come across systems that ignore or break XML's Unicode character model.

I've almost grown tired of saying it, but it is worth saying until I've worked through my very last nerve: the single most important aspect of XML is its character model. Ditch XML and use something else before you mess with that. A tremendous amount of damage is done by people who can't see past the pointy brackets as the point of XML.

Yes, Unicode is hard. There is nothing to be done about this. We have a myriad of languages, writing systems and local conventions, and they complicate just about everything. That's our wacky, wondrous world for you. Nevertheless, as a software professional in this age, there is no excuse not to buckle down and learn the rigors of i18n. I don't mean to be a pedant about this: I know a lot less about i18n than I wish I did, and I fall short of good i18n in much of my code. However, I respect the problem, and I strive to work on my skills in the area and on my discipline in applying them in software development.

If you use XML in your work, please read "The skew.org XML Tutorial. A reintroduction to XML with an emphasis on character encoding", by Mike Brown (a truly brilliant article). You might also want to check out my article "Proper XML Output in Python". Even if you're not a Python programmer, you might find some use in its discussion of common character problems when generating XML.
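As a small illustration of the kind of character problem that comes up when generating XML, here is a minimal Python sketch. It is not taken from either article; the helper name `make_xml_document` is hypothetical, and it assumes you are building a tiny document by hand rather than through an XML library. The idea is to escape markup characters first, then encode to the declared encoding, letting Python emit numeric character references for anything that encoding cannot represent.

```python
from xml.sax.saxutils import escape


def make_xml_document(text, encoding="utf-8"):
    """Wrap `text` in a one-element XML document and return it as bytes.

    Hypothetical helper for illustration only.
    """
    # escape() replaces &, < and > so the text cannot be mistaken for markup.
    body = escape(text)
    doc = '<?xml version="1.0" encoding="%s"?>\n<p>%s</p>' % (encoding, body)
    # The declared encoding and the actual bytes must agree. Characters the
    # target encoding cannot represent become &#NNN; character references.
    return doc.encode(encoding, "xmlcharrefreplace")


if __name__ == "__main__":
    print(make_xml_document("caf\u00e9 & <tea>", "utf-8"))
    print(make_xml_document("caf\u00e9 & <tea>", "us-ascii"))
```

Either output should parse back to the same text in a conforming XML processor, which is the point: the character model survives even when the byte encoding changes.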

2 Comments

tcowan
2004-08-18 12:31:48
Unicode is easy, bad unicode support is bothersome
In my experience making it all work is hard because many Java servlet engines assume iso-8?? something standard, and finding good editors that will save and edit text in UTF-8 is sometimes a bother, but unicode itself is just a set of symbols represented by numbers, with perfectly clear documentation designating which character sets occupy which rows. Assuming OS's and tools were all written correctly, Unicode should be about as simple as ASCII.


Taylor

uche
2004-08-18 13:05:38
Unicode is easy, bad unicode support is bothersome
Sure, the abstract concept of Unicode is easy, but Unicode is more than just that. Unicode includes the transfer formats (standard encodings), which is generally where it gets hairy. Unfortunately, you can't just ignore the transfer formats when dealing with XML, because that is how the data gets to the XML processor.
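To make the point about transfer formats concrete, here is a rough Python sketch (the function name `encoded_forms` is just something made up for illustration). One abstract character, the euro sign U+20AC, becomes different byte sequences depending on the encoding, and decoding those bytes with the wrong assumption quietly yields garbage rather than an error.

```python
def encoded_forms(ch):
    """Return the bytes for `ch` in a few Unicode transfer formats.

    Hypothetical helper for illustration only.
    """
    return {name: ch.encode(name) for name in ("utf-8", "utf-16-be", "utf-32-be")}


if __name__ == "__main__":
    euro = "\u20ac"  # EURO SIGN, a single code point
    for name, data in encoded_forms(euro).items():
        print(name, len(data), "bytes:", data.hex())

    # Misreading UTF-8 bytes as ISO-8859-1 raises no error; it just
    # produces three unrelated characters instead of one euro sign.
    wrong = euro.encode("utf-8").decode("iso-8859-1")
    print(len(wrong), repr(wrong))
```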


Even with the brilliant minds behind Java, they made many mistakes in their implementations of Unicode-related technologies. I'm more familiar with the case of Python, where lessons from Java and Perl were kept in mind, and some of the biggest brains on the planet hammered out solid Unicode facilities. Even with all these stars aligned, things get rough in patches.


I think all these facts, as well as my plentiful experience working with Unicode, both on my own and with other developers, prove that Unicode is not as easy as a two-sentence description would imply.


And there is simply no comparing Unicode with ASCII. Unicode is necessarily much more complex.


--Uche