There Ain't No Such Thing As Plain Text

by Uche Ogbuji

Related link:…

Alternate (non Google-groups) link

When Paul Prescod holds forth on a subject, wise developers pay attention. His recent thread on byte versus character strings in language design is, IMHO, required reading for language users as well as designers.

The immediate context is Paul's advice to the designers of Prothon, a Python-derived language, to get the distinction between character strings and byte strings right from the start, and to enshrine other good practices such as making encoding and locale first-class constructs. The lessons, however, are limited to neither Prothon nor Python, and are expressed clearly enough for users of other languages to follow.

Paul references Joel Spolsky's important article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)". If you haven't read it, do so right away. The rest of the thread is filled with important insights and expansions.

I think I've learned these hard lessons in many of the same battles as Paul. I can confirm that you really do pay if you are guided by the accidental convenience of speaking a language whose character repertoire happens to fit into a single byte. Don't cut corners when it comes to computer representations of text.
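To make the byte/character distinction concrete, here is a minimal sketch in present-day Python 3, where `str` is a character string and `bytes` is a byte string. It is my own illustration, not code from Paul's thread, and the sample word and variable names are arbitrary. It shows that one character string has several byte representations, that decoding with the wrong encoding silently produces garbage, and that pure ASCII text hides the whole problem.

```python
# A character string: a sequence of Unicode code points, no encoding attached.
text = "caf\u00e9"  # "café", 4 characters

# The same text as bytes depends entirely on the encoding you choose.
utf8_bytes = text.encode("utf-8")      # 5 bytes: the é takes two
latin1_bytes = text.encode("latin-1")  # 4 bytes: the é takes one

# Decoding with the wrong encoding raises no error here; it just yields
# different characters (the classic "Ã©" mojibake).
mojibake = utf8_bytes.decode("latin-1")

# Pure ASCII text encodes identically under both encodings, which is the
# "accidental convenience" that lets the bug hide until real data arrives.
ascii_text = "cafe"
ascii_same = ascii_text.encode("utf-8") == ascii_text.encode("latin-1")

print(len(text), len(utf8_bytes), len(latin1_bytes))
print(repr(mojibake), ascii_same)
```

The point of the sketch is that bytes are never "just text": some encoding is always involved, whether or not the program names it.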

Have you been keeping up on your character-fu?