Text encodings: If we know the problem won't go away, why can't we deal with it?

by Rick Jelliffe

I was reading the Ant (the make system) documentation today, and in the section on the copy task I came across this horrible note:
Important Encoding Note: The reason that binary files get corrupted when filtered is that filtering involves reading in the file using a Reader class. This has an encoding specifying how files are encoded. There are a number of different types of encoding - UTF-8, UTF-16, Cp1252, ISO-8859-1, US-ASCII and (lots of) others. On Windows the default character encoding is Cp1252, on Unix it is usually UTF-8. For both of these encodings there are illegal byte sequences (more in UTF-8 than for Cp1252).

How the Reader class deals with these illegal sequences is up to the implementation of the character decoder. The current Sun Java implementation is to map them to legal characters. Previous Sun Java implementations (1.3 and lower) threw a MalformedInputException. IBM Java 1.4 also throws this exception. It is the mapping of the characters that causes the corruption.

On Unix, where the default is normally UTF-8, this is a big problem, as it is easy to edit a file to contain non-US-ASCII characters from ISO-8859-1, for example the Danish oe character. When this is copied (with filtering) by Ant, the character gets converted to a question mark (or some such thing).

There is not much that Ant can do. It cannot figure out which files are binary - a UTF-8 version of Korean will have lots of bytes with the top bit set. It is not informed about illegal character sequences by current Sun Java implementations.

One trick for filtering files containing only US-ASCII is to use the ISO-8859-1 encoding. This does not seem to contain illegal character sequences, and the lower 7 bits are US-ASCII. Another trick is to change the LANG environment variable from something like "us.utf8" to "us".
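The corruption the note describes, and the reason the ISO-8859-1 trick works, can be sketched in a few lines (in Python rather than Ant's Java, for brevity):

```python
# Bytes that are legal ISO-8859-1 but illegal UTF-8 get replaced by a
# lenient decoder; Latin-1 has no illegal byte sequences at all.
data = "Søren".encode("iso-8859-1")   # 0xF8 is the Danish o-slash

# Decoding those bytes as UTF-8 hits an illegal sequence; a lenient
# decoder substitutes U+FFFD (often displayed as '?'):
corrupted = data.decode("utf-8", errors="replace")

# ISO-8859-1 maps every byte 0x00-0xFF to some character, so any
# binary data survives a decode/encode round trip unchanged:
roundtrip = data.decode("iso-8859-1").encode("iso-8859-1")

print(corrupted)          # 'S\ufffdren'
print(roundtrip == data)  # True
```

This is why filtering a binary file through a UTF-8 Reader is destructive, while a Latin-1 Reader merely misinterprets the bytes without losing them.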

Now, let's put aside the question of why anyone would copy using text operations rather than binary operations. The larger question is why on earth, in 2007 and ten years after XML came out, we are still using text files that don't label their encoding.

Let me put it another way: if you make up or maintain a public text format, and you don't provide a mechanism for clearly stating the encoding, then, on the face of it, you are incompetent. If you make up or maintain a public text format, it is not someone else's job to figure out the messy encoding details, it is your job.

If avoiding the issue is the wrong approach, what is the right approach? One right approach is to adopt the Unicode character encodings (UTF-8, UTF-16) as the only allowed formats. (This is what RELAX NG compact syntax does, for example.)

Another right-ish approach would be for every text format to adopt explicit labelling. The disadvantage, however, is that, as with HTML's <meta> element, it is unsatisfactory to have to parse deep into the document in order to be able to parse the document. And recognition software that understands the conventions of every format is impossible.

However, it is possible to generalize XML's encoding header into a delimiter-independent form that any format can adopt. My 2003 suggestion for XTEXT gives the details. I don't see any disadvantages to XTEXT: in the post-XML world, programmers have moved from being puzzled by encoding labels to understanding that they are a valuable part of the furniture.
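A delimiter-independent label can be recognized without knowing anything about the host format's comment syntax. The sketch below assumes an XTEXT-style first line carrying an `encoding="..."` label (the exact grammar is in the 2003 XTEXT proposal; the function name here is hypothetical):

```python
import re

# Search the first line for an encoding label, ignoring whatever
# comment delimiters the host format wraps around it.
LABEL = re.compile(rb'encoding\s*=\s*"?([A-Za-z0-9._-]+)"?')

def sniff_encoding(raw, default="utf-8"):
    first_line = raw.split(b"\n", 1)[0]
    m = LABEL.search(first_line)
    return m.group(1).decode("ascii") if m else default

print(sniff_encoding(b'#xtext encoding="ISO-8859-1"\nSome text\n'))
# ISO-8859-1
```

Because the pattern ignores the surrounding delimiters, the same library works for `#`-comment formats, `//`-comment formats, and anything else, which is the point of a generalized header.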

An XTEXT-aware Ant (or default readers that recognize XTEXT conventions) would allow the problem to go away incrementally, as developers and maintainers adopt it. But the trouble is some mix of a lack of leadership among the people developing or maintaining text formats: they don't see themselves as part of a larger community of text users, I guess, or don't believe there is any advantage in participating in a larger community. I suspect this is ultimately because the developers of text formats are people who think in terms of ASCII, or who have no contact with use-cases where different character sets are possible. The problem is pushed downstream. Not only incompetent but lazy?

Am I being too harsh? I hope so. In particular, in this day and age of international standards, the burden for fixing this has shifted from the developers to user-community representatives: it is something that governments and non-ASCII-locale standards bodies need to consider.

When I say "You are incompetent", an entirely satisfactory rejoinder back at me is to say "Yes I am: I can only respond to demand from people who are affected by this issue, and the standards and procurement processes are the place for that demand to be manifested!"

But buck-passing won't fix anything. If we know the problem won't go away, why can't we (we consumers or we developers) deal with it?


Michael Foord
2007-10-08 05:19:59
Nitpick of the day: "people who are effected by this issue" should be 'affected'.

Unless this is a subtle continuation of today's XKCD: http://xkcd.com/326/

This comment is fully compliant with Skitt's law: http://en.wikipedia.org/wiki/Skitt's_law

Fazal Majid
2007-10-08 05:25:15
Another way would be to specify that the file defaults to 7-bit ASCII (or 8-bit ISO 8859-1) unless it starts with a Unicode BOM, in which case the encoding is specified by the BOM. That's more or less how Windows does it.
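The default-unless-BOM scheme Fazal describes amounts to a few lines of sniffing. A minimal sketch (the `detect` function name and the choice of fallback are assumptions; a full version would also need the UTF-32 BOMs, which must be checked before the UTF-16 ones since they share a prefix):

```python
# Known byte-order marks, longest-prefix ambiguities aside.
BOMS = [
    (b"\xef\xbb\xbf", "utf-8-sig"),
    (b"\xff\xfe", "utf-16-le"),
    (b"\xfe\xff", "utf-16-be"),
]

def detect(raw, fallback="ascii"):
    for bom, name in BOMS:
        if raw.startswith(bom):
            return name
    return fallback

print(detect(b"\xef\xbb\xbfhello"))  # utf-8-sig
print(detect(b"hello"))              # ascii
```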
Steve Loughran
2007-10-08 05:29:57

the reason our ant task has the note is that as well as binary copies, you can do a copy with a transform in the process, something like expanding properties, fixing CRLF mappings etc. The note is there to stop all the "ant screws up files on a copy" bug reports... users need to selectively copy+filter those text files they want to transform, and copy binary stuff without the filters.

For XML, life is simpler: go use XSLT or an XML pipeline, and stop treating it as bytes or as chars where the charset needs to be known in advance.

Steve Loughran, Ant team.

Ps. Thank you for reading the documentation. Not enough people do. For reference, we get many more complaints about Ant not picking up user+group file permissions on Unix than we do for encoding problems.

Rick Jelliffe
2007-10-08 05:55:59
Fazal: If you have US-ASCII, you may as well have UTF-8, since US-ASCII is a CES subset of it. And 8859-1 is a non-starter nowadays, because it is just asking for trouble.

But the first right approach I mentioned was indeed limiting new text formats to Unicode formats: UTF-16 with a BOM or UTF-8 if no BOM.

When you say "That is more or less how Windows does it", what do you mean?

Steve: Yes, I certainly accept that data being transformed on a single PC or cluster is much more likely to be all in a single encoding at the moment. But it is a mistake to think that there won't be more documents in different encodings floating around. Europe is only the start.

At some stage it becomes more important to have reliable data interchange rather than hackery. One reason XML has succeeded in areas it is a poor fit for otherwise is that it does at least get the encoding issue workably right.

Joe Grossberg
2007-10-08 08:51:56
Why aren't encodings labeled?

Because of the awful maxim:

"Be liberal with what you accept and strict with what you produce"

It's ignored by the nitwits who leave out encoding.

Subsequently, it's followed diligently by the coders who use that output.

And, nowhere along the line, are the people creating those text files informed that they're doing something wrong. At least with (X)HTML and CSS, you can point them to the W3C's validator.

Damian Cugley
2007-10-08 09:20:46
For what it's worth, Python already supports xtext, almost: the format

#xtext encoding=UTF-8

falls into the class of patterns recognized by Python interpreters as per PEP 263. I don't think it will be recognized with quotes around the UTF-8, unless you do something repetitious like this:

#xtext encoding="UTF-8" -*-coding: UTF-8-*-

I imagine they would be open to allowing quotes in the pattern in a future release.
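The quoting quirk is easy to check against the regular expression PEP 263 itself gives (`coding[:=]\s*([-\w.]+)`, required on the first or second line). The unquoted xtext form matches only because "encoding=" happens to end in "coding=":

```python
import re

# The declaration pattern from PEP 263.
CODING = re.compile(r"coding[:=]\s*([-\w.]+)")

# Unquoted xtext form: matches, because "encoding=" contains "coding=".
m = CODING.search("#xtext encoding=UTF-8")
print(m.group(1))   # UTF-8

# Quoted form: no match, because '"' is outside the [-\w.] class.
print(CODING.search('#xtext encoding="UTF-8"'))  # None
```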

2007-10-08 09:59:47
Yes indeed. The fact that the file is read with a text encoding rather than used as binary is quite strange and seems to be a matter of not knowing how to get the software's environment to behave.

I did not know about xtext, although I like the idea. I have been sitting on a proposal around the ubiquitous #! prelude for conveying/confirming character encoding too. I figured I'd illustrate it in my nfoWare writings around handling simple text format and as a lead-up to ways that XML takes this farther.

But passing files through a reader that makes text-decoding assumptions for what may actually be raw binary is simply clueless. And I agree that we should do something about it.

2007-10-08 10:15:12
I don't understand Fazal's comment either. Windows recognizes both UTF-16 and UTF-8 by using a BOM on both. (This leads to some interesting effects that allow Raymond Chen and others to make humorous posts from time to time.) The BOM on text apparently convinced ODF-advocating parties-unnamed that Microsoft was putting secret binary codes on the front of its early Word XML effort.

If you don't convince Windows libraries and applications that they are seeing UTF-8 or UTF-16 in this way, they are programmed to assume that the default Windows (aka ANSI) single-/double-byte encoding applies. The assumption of 8859-1 only happens when that is the setting on the particular client. It doesn't help at all for interchange of the same text file between machines with different ANSI defaults.

[I am religious about using UTF-8 and saying so in the top line, usually a comment in appropriate language. But to get Windows systems, including web servers, to know I've done that, I often run the files through notepad to get the BOM stuck on there. I suppose I could build a tiny utility to BOM-de-nom-BOM (sung with a Beach Boys lilt) such files.]
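The tiny utility imagined above is genuinely tiny; a sketch (the `add_bom` name is made up for illustration):

```python
# Prepend a UTF-8 BOM so Windows tooling recognizes the file as UTF-8,
# without disturbing files that already have one.
BOM = b"\xef\xbb\xbf"

def add_bom(raw):
    return raw if raw.startswith(BOM) else BOM + raw

print(add_bom("café\n".encode("utf-8"))[:3])  # b'\xef\xbb\xbf'
```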

@Michael: Rick gets his choice on use of effected. It avoids the alternate sense of affected as "taken on" as in "putting on airs." Effected in the sense of impacted (a better choice) works here, and I wonder if Rick is quoting someone.

2007-10-08 10:41:44
"Am I being too harsh? I hope so."

This is not a good tactic. It is not as if everyone using UTF-8 would have 0 new problems, you see? It is always a trade-off, and if you follow a strict line that says "use this route" while problems WILL arise as a CONSEQUENCE of it, people will just get pissy about it.

I for myself know exactly that I want to know ALL benefits
AND drawbacks, and THEN make a decision on that.

PS: Standards are overrated. They *should* make life easier, but some standards don't. I don't want to mention specific examples endorsed by big financial consortia, but let's just state that standards should make life easier, and if they don't, it's time to abolish that specific standard without mercy. (I have strong doubts that an established standard with a well-funded committee would be able to reform itself from the inside...)

Dave Hinton
2007-10-08 12:38:03
#!/usr/bin/perl -wT

XTEXT does not play well with Unix script shebangs.

Rick Jelliffe
2007-10-08 22:41:24
Damian: The link about PEP is very heartening. I hereby name the people involved as "competents"! :-)

It is probably a sign of Python's spread and maturity that its developers realize the advantages of this.

She: ?? You don't have to like every standard in order to recognize that something needs to be done. Of course, the more that a technology is dominated by ASCII-only developers and single-locale-project developers, the less chance that this issue will be addressed.

The basic issue is that APIs do not read or save encoding information in file metadata. Consequently the whole chain of text-based processing falls apart as soon as there is a need for reliability and multi-locale participation. XTEXT or similar systems of explicit in-band labelling do not prevent mistakes. But with such information, conservative systems can be built that work (XML proves this); without it, reliable systems cannot be built (HTML proves this).

Dave: There could undoubtedly be improvements made to XTEXT to cover more cases: if the first line starts with potential delimiters and is less than a certain size and does not have an XTEXT declaration then try the second line. That sort of thing.
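The shebang fallback described here is straightforward to sketch (the `sniff` name, the label pattern, and the line-length cap are all assumptions, not part of the XTEXT proposal as published):

```python
import re

LABEL = re.compile(rb'encoding\s*=\s*"?([A-Za-z0-9._-]+)"?')

def sniff(raw, max_first_line=120):
    # If the first line is a #! line (short, no label), fall back to
    # looking for the declaration on the second line.
    lines = raw.split(b"\n", 2)
    m = LABEL.search(lines[0])
    if (m is None and lines[0].startswith(b"#!")
            and len(lines[0]) <= max_first_line and len(lines) > 1):
        m = LABEL.search(lines[1])
    return m.group(1).decode("ascii") if m else None

print(sniff(b'#!/usr/bin/perl -wT\n#xtext encoding="UTF-8"\n'))  # UTF-8
```

This keeps the Unix kernel's view of the file (line one is the interpreter) compatible with the labelling convention (line two carries the encoding).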

But relying on each different text format to have its own encoding declaration system is too much work to expect of developers: if there were standard XTEXT libraries for reading and writing, and a format developer could just say "we use that", it would have a much better chance of wide reach. But XTEXT is only a suggested mechanism; that there is labelling (and that it is required and reliable) is more important than any particular mechanism IMHO.

Paul Prescod
2007-10-09 01:42:22
While we are at it, why not have a declaration in the XTEXT header for the actual file format? That is also valuable metadata. In general, it should be possible to put arbitrary metadata in the XTEXT line and the refo/refc stuff could be moved into a separate standard as a particular example of that metadata.
Rick Jelliffe
2007-10-09 02:10:46
Paul: Or the full MIME header? I tend to think that the issue of how to go reliably from bytes to characters should be a separate layer from other issues, but if you can come up with a syntax it would be interesting.
2007-10-10 16:05:48
@Paul and @Rick:
I was thinking of handling both with #!, but it depends on cooperation of applications, whether or not the shell has a part in it. There is also the complication on *nix that the #! line must identify a specific application, and then there is a chance to specify encoding and format (and version!) or whatever else.

My trick on length is simply to treat additional #! as continuations, but you can't do that unless the application (which must digest the #-line anyhow) is prepared to handle the #! for application-specific information.

A hack, no question about it. I'd even put these on the front of persisted XML files, but they'd still have to be fed to an application that knew to strip these off as pseudo processing instructions before any XML. (My use case is some processor that accepts a text format or an XML format and is forgiving about the #! for other reasons.)

Hmm, hack specification by comment thread. Interesting.