Reaping what you sow: How a standard for Java would have made it better today

by Rick Jelliffe

Three programmers gathered at the next cubicle to mine yesterday, clucking and snorting as is their wont. I looked over to ask what was going on. "A bug in Java," they said. The problem was with ZIP files, specifically some differences between ZIP files made by different methods.

They had some files with non-breaking spaces (U+00A0) in the file name. Not something that I would do myself, but the number of people who want to use non-ASCII characters in their filenames is surely now much greater than the number of people just content with ASCII-only names. Aha, so file this under internationalization (I18n)!

The problem was, it seems, that WinZIP stored the filenames using the system default encoding. But Java would read the filename using UTF-8. So sometimes ZIP file parts would have the non-breaking space, and other times the same file saved by a different route would have 0xFF at that position. Now this is the kind of behaviour and problem that you would have expected a decade ago, but I was surprised it still occurred.
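To make the mismatch concrete, here is a minimal sketch of what happens at the byte level. It is an illustration only, not what WinZIP or Sun's code actually does: the class and method names are hypothetical, and ISO-8859-1 stands in for "some single-byte system default encoding". The IBM437 line runs only if that charset happens to be installed in the JVM.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; none of these names come from the JDK or from WinZIP.
class FilenameEncodingMismatch {

    /** Decodes raw filename bytes the way a UTF-8-only reader would. */
    static String decodeAsUtf8(byte[] rawName) {
        // This String constructor replaces malformed UTF-8 input with U+FFFD instead of throwing.
        return new String(rawName, StandardCharsets.UTF_8);
    }

    /** Encodes a filename the way a writer using a legacy single-byte charset would. */
    static byte[] encodeLegacy(String name, Charset legacyCharset) {
        return name.getBytes(legacyCharset);
    }

    /** Formats each UTF-16 code unit as U+XXXX so that invisible characters show up. */
    static String codeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            sb.append(String.format("U+%04X ", (int) c));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // The byte mentioned in the post: 0xFF sitting where the non-breaking space should be.
        byte[] raw = { 'a', (byte) 0xFF, 'b' };
        System.out.println("read as UTF-8:  " + codeUnits(decodeAsUtf8(raw)));

        // IBM437 is an optional charset, so check for it before using it.
        if (Charset.isSupported("IBM437")) {
            String viaCp437 = new String(raw, Charset.forName("IBM437"));
            System.out.println("read as IBM437: " + codeUnits(viaCp437));
        }

        // The same failure with a stand-in legacy charset that every JVM supports.
        byte[] latin1 = encodeLegacy("a\u00A0b", StandardCharsets.ISO_8859_1);
        System.out.println("Latin-1 bytes read as UTF-8: " + codeUnits(decodeAsUtf8(latin1)));
    }
}
```

The point of the sketch is only that a lone high byte written by a single-byte encoding is not valid UTF-8, so a UTF-8-only reader cannot recover the original character.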

Checking through Sun's bug database, we find that this bug (or its clone) is actually the second most requested (2008-13-28). The engineer who evaluates the problem gives the excuse that Sun decided to use UTF-8 for JAR files (which use ZIP) and seems a little surprised to discover that ZIP files may actually be created by other systems too.

Looking at the bug report, we also find it was first reported 07-JUN-1999. Almost nine years ago. The bug report says it is only reported up to Java 1.4.2; however, I cannot see anything in Java 1.6 that addresses it.

So what has happened? Several things:

  • Apache put out a zip implementation as part of Ant that supports different encodings. So people who needed it can use that.

  • Since September 2006 the ZIP spec has formally included a bit to state that the file name is stored using UTF-8.

  • It seems other manufacturers have increasingly used UTF-8.
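For readers wondering what "supports different encodings" looks like in code, here is a small sketch. It is not Ant's implementation. It uses the `java.util.zip` constructors that accept a `Charset`, which were only added in Java 7, well after this post was written. The class and helper names are hypothetical, and ISO-8859-1 stands in for whatever legacy encoding the writing tool used.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical sketch class; requires Java 8 or later as written.
class ZipNameCharsetSketch {

    /** Builds a one-entry ZIP in memory, encoding the entry name with the given charset. */
    static byte[] writeZip(String entryName, Charset nameCharset) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(buffer, nameCharset)) {
            out.putNextEntry(new ZipEntry(entryName));
            out.write("hello".getBytes(StandardCharsets.US_ASCII));
            out.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The stream is closed (and the archive finished) before the bytes are taken.
        return buffer.toByteArray();
    }

    /** Returns the name of the first entry, decoding it with the given charset. */
    static String firstEntryName(byte[] zipBytes, Charset nameCharset) {
        try (ZipInputStream in =
                 new ZipInputStream(new ByteArrayInputStream(zipBytes), nameCharset)) {
            ZipEntry entry = in.getNextEntry();
            return entry == null ? null : entry.getName();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String name = "a\u00A0b.txt";                 // file name containing U+00A0
        Charset legacy = StandardCharsets.ISO_8859_1; // stand-in for a system default encoding
        byte[] zip = writeZip(name, legacy);
        // Reading with the same charset the writer used recovers the original name.
        System.out.println(name.equals(firstEntryName(zip, legacy)));
    }
}
```

The catch, of course, is that the reader has to know which encoding the writer used, which is exactly the information that the UTF-8 bit in the ZIP spec is meant to carry. The JDK documentation says the charset argument is ignored for entries that have that bit set.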

So for almost 10 years the Java version of ZIP has been broken for internationalization purposes; the fix seems to be caught in limbo (are they waiting for non-UTF-8 encodings to go away, perhaps?), and so people are forced to go to other implementations. WORA undermined! Indeed, this seems another example where Java is simply too large for Sun to maintain adequately.

But what about this angle: the current ZIP spec has an appendix on file names and encoding. It says:
The ZIP format has historically supported only the original IBM PC character
encoding set, commonly referred to as IBM Code Page 437.

Which means that Sun's policy of merely writing UTF-8 is now going against what the ZIP spec says.
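Putting the appendix together with the UTF-8 bit mentioned earlier, a conforming reader would choose the name encoding per entry. The sketch below assumes the layout in the published ZIP application note as I understand it: a local file header starts with the signature 0x04034b50, the two-byte general purpose bit flag sits at offset 6, and bit 11 is the UTF-8 indicator. The class and method names are hypothetical, and the charset is returned as a name string because IBM437 is not guaranteed to be installed in every JVM.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch class; it inspects only the first local file header of an archive.
class ZipUtf8FlagSketch {

    static final int LOCAL_FILE_HEADER_SIGNATURE = 0x04034b50;
    static final int UTF8_NAME_FLAG = 1 << 11; // general purpose bit 11

    /** Reads the general purpose bit flag from the first local file header. */
    static int generalPurposeFlags(byte[] zipBytes) {
        if (zipBytes.length < 8) {
            throw new IllegalArgumentException("too short to hold a local file header");
        }
        // All multi-byte fields in ZIP headers are little-endian.
        ByteBuffer header = ByteBuffer.wrap(zipBytes).order(ByteOrder.LITTLE_ENDIAN);
        if (header.getInt(0) != LOCAL_FILE_HEADER_SIGNATURE) {
            throw new IllegalArgumentException("does not start with a local file header");
        }
        return header.getShort(6) & 0xFFFF;
    }

    /** Names the charset the entry name is declared to use: UTF-8 if bit 11 is set, else CP437. */
    static String declaredNameCharset(int flags) {
        return (flags & UTF8_NAME_FLAG) != 0 ? "UTF-8" : "IBM437";
    }

    public static void main(String[] args) {
        // Hand-built 8-byte header prefix: signature, version needed (20), flags = 0x0800.
        byte[] prefix = { 0x50, 0x4B, 0x03, 0x04, 20, 0, 0x00, 0x08 };
        System.out.println(declaredNameCharset(generalPurposeFlags(prefix)));
    }
}
```

Note what this does not cover: archives whose names are in some third encoding (the "system default" case above) are neither CP437 nor flagged as UTF-8, so a reader can only guess or be told.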

Software maintenance and juggling issues on a budget are not easy. However, I think it is more than plausible that had Sun gone ahead and submitted Java to ISO for standardization a decade ago, this issue would have been fixed long ago, because ISO National Bodies give very high precedence to issues such as internationalization, accessibility, modularity, and conformance. So the lack of proper encoding support in the ZipEntry API would undoubtedly have come to the fore in the very first round: Japan never lets this kind of thing slip, for example.

By exactly the same token, if the ZIP format had been put through as a standard, proper encoding support would undoubtedly have been raised as part of the first review. Standardizing either would have been good enough to have a technical fix agreed on, published and pressure applied for a fix ahead of the demands of corporate featuritis. But standardizing both would still be best.

After Sun backed off last time, leaving so many people who had participated feeling burnt, it is hard to see that standards people won't be deeply suspicious of them. And Sun people may not be keen to submit even to a "bullshit process" based on pragmatism and incrementalism. But Java would clearly, IMHO, be in a much better position today if it had been standardized. And so would ZIP.

Standardization as a kind of audit

What standardization of a living technology gives stakeholder companies is more than just bragging rights and ammunition to shoot their rivals with and to confuse procurement people with, tempting as those things may be. It also gives an objective audit program dictated not from the corporate POV but from (to a greater or lesser extent, depending on interest) the market and relatively disinterested third parties. Any long-term software project gets encrusted in the personal politics and idiosyncrasies of the development team, and needs a circuit-breaker. This is a view of standardization as a kind of major technical audit, particularly of the documentation but also of areas that are becoming more market-critical: standards use and compliance, openness, responsiveness, accessibility, internationalization, integratability, testability, and so on.

These are all things that established technologies need. Now of course you can get audits in each of these areas by hiring experts. That is good, but you don't get the breadth or provable transparency that National Body participation can bring. And expert opinions still have to be evaluated in the context of the power relationships of the company, the very same relationships that allowed the problem to arise (these might be as simple as CJK requirements not having an adequate champion or I18n not being a profit center that can demand changes). And you can get benefits from using boutique standards bodies in which vendors or their representatives can have voting rights: W3C, Ecma, OASIS, and so on. That is good too, but it is open to domination by one side or the other.

Which leaves the ISO family (e.g. ISO/IEC JTC1) as being effective forums for this kind of audit. People who think that ISO standardization is always a pushover should consider the current OOXML debate: you have MS and friends on one hand and IBM and friends on the other both pushing as hard as they can, and yet as I write neither can establish clear dominance. And these are the largest players in the world. Whether DIS 29500 mark II passes or fails it will be because national bodies decided on technical issues, not pack alliances, as far as I can tell. I am sure that neither MS nor IBM is feeling comfortable at the moment: and this is the strength of the ISO kind of procedure, regardless of the outcome.

We have all had enough experience of open source to be aware of its strengths and weaknesses now. Making something open source does not automatically mean that bugs and so on will be fixed. No silver bullet. As I wrote in this blog a couple of years ago in "Sun should open source Swing":
it is not enough to Open Source something: the mechanism for speedy response to bug fixes and releases is crucial too.

And neither will auditing a technology by making it a standard. Nothing is automatic. But error-full systems emerge from single-strategy maintenance regimes, and the dinosaur systems such as Java and Office are full of examples of this. The ISO standardization process has many qualities to commend it to large companies as a tool for shaking things up and circuit-breaking. And we still need an ISO standard for ZIP too.


Steve Loughran
2008-03-27 06:24:52
...but if ZIP were an ISO standard, what about the extensions to do unix filesystem permissions? Which Ant's task can create, even if our implementation can't set them due to Java's weak support for file permissions. It was to handle permissions and very large zip files that we in the Ant team did our own implementation, not the encoding issue.

IMO, we don't necessarily need ISO specs, we just need good specifications with test suites that anybody can run. All specs need a way to evolve, too.


ps: remember that the top25 list shows top voted open bugs. Anything closed as WONTFIX is not there...there are worse things, like the fact that whoever maintains the class hasn't actually read the HTTP specification, hence Apache's HttpClient project.

Rick Jelliffe
2008-03-27 07:07:58
Steve: Oh, I am not proposing standardization as a magic bullet!

But it is useful where there are different developer parties who don't have much interest in (or forum for) talking, but users who do have an interest in level-playing-field interoperability. And it is useful when there is a dominant player, and it would be good for all concerned for the technology to have an external audit. And it is useful when there may be open source manpower available, but their synergistic impact will be lost without an agreed specification that other efforts can buy into as well. And it is useful where there was a technology associated with a niche or platform use, but which now has an importance wrt other standards and uses and platforms.

ZIP seems to be at that stage. (What do you think?) Some coordination of the best extensions, some rationalization of compression methods, some management of encryption and signing issues seems useful. At a certain stage of a technology, the developing stakeholders realize that it is in their interest to prevent fragmentation, and that co-operation through a neutral, non-dominatable, voluntary forum with independent review in a non-antagonistic process would be productive.

Think of the enormous impact and success of XML here: the use of standards has directly led to a situation where non-interoperability at the delimiter level is a thing of the past.

Extensible technologies often suffer from a flowering of extensions, but unless this is followed by a consolidation effort, the benefits of the extensions are localized and often fall into disuse. Standards groups can be good for making sure that technological improvements don't fall by the wayside.

It is a myth that open source people are antagonistic or unused to standards: they write code using standard languages and standard APIs and with standard media formats every day. But some open source people (just like closed source developers) can hold back the success and spread of their projects by ignoring the standardization angle IMHO.

It is not that standardization addresses every problem. But it does provide a particular slice of the cake that otherwise frequently remains unaddressed. The encoding problem is one.

(I guess file permissions for standard platforms such as POSIX/Linux would fit in there too. When a free technology becomes a standard, such as Linux ABI, then it adds an imperative for other standard technologies to consider what support is necessary for it. So standards can be a way for open technologies to boost each other. If there was an ISO standard for Ant, for example, we could have used it in ISO DSDL for our schema language framework: we discussed it even!)

Rick Jelliffe
2008-03-27 07:19:09
Steve: (#2 on Http)

And as a person who runs a 100% Java company, we couldn't have operated without Apache.

We may grumble that Apache code is written for servers not desktop applications (no conditional invocations of thread.yield() to help GUI responsiveness for example.) And we may grumble that fixes take too long to get distributed for some projects. But it is a wonderful effort. Without it, we just couldn't have adopted Java.

If Java had been standardized, then the Http issue would have come up. And it would undoubtedly have to be resolved in favour of the IETF and W3C standards, but if Sun had particular profiles or extensions that could be justified, they would get reasonable consideration. And if Sun had ultimately said "We cannot accept this standard version of Java" then at least everyone would know clearly that Sun chooses not to conform in this area, as would be their right. However, then it would give customers a bigger footing to say "We only use products that conform to the standard" which then gives a commercial impetus for Sun to conform.

(And if no-one actually cared, that would also be a sign that the conformance was unnecessary and that real life had passed the IETF standard by, which would be instructive too. In which case the relevant ISO committee could start dealing with the relevant IETF or W3C people to figure out a productive next step. )

Steve Loughran
2008-03-27 07:20:44
I think my opinion of standards is up on my submission to Waterfall 2006: 'Standards: The Waterfall at Work':

Either you have spec-first design -which gave us the OSI communications architecture, or you have defacto standardisation of what is in the field. TCP/IP being a good example. Not only does it work, the standardisation gives interoperability, which is a valid reason for using it.

Regarding Ant, standardising it would stop us being able to make changes; we'd be less agile. Instead we have a test suite (and Apache Gump) to keep us honest. We don't gain anything by improving interop because anyone is already free to reuse our code and our test suite. It's the test suites that define Ant, not the code.

Rick Jelliffe
2008-03-27 07:36:12
Steve: (#3 waterfall)

That conference site is one of my favorites! A constant scream.

I am not sure how the waterfall issue would apply to a standard for Ant though: Ant already exists, so a spec could not be "specify-standardize-implement". And for other systems who want to accept Ant files, why is having a standard any different from any other preexisting source of inspiration, from the waterfall perspective?

For example, an Ant standard would have a good schema. It could be done using ISO Schematron, which is a powerful schema language based on XPaths that allows extensibility and openness in all sorts of ways. (In fact, my company Topologi has a (Java) Ant task for Schematron that we would love to contribute to Ant under any open source license: what is the mechanism for this?) The schema then acts as a test for developers.

At SC 34 we have a really active schema group (WG1) which is largely predicated on the notion of test-first development of document systems: that the appropriate schema provides a really workable and practical test. In the case of Schematron, it even has a notion of phases built in, so that you can have a schema that supports incremental development, rollback of functionality (from refactoring) and other useful non-waterfall things.

I think you have the wrong idea on standards though (though your comment may be frequently correct): that standards necessarily are complete, non-extensible and unalterable. That they get in the way and are always developed ahead of implementations. In fact, ISO standards are all dated, and one can supersede another. Even radical changes are possible. And standards can be extensible, fixing the branches but leaving the leaves free for experimentation. (Indeed, it is one of my hobbyhorses that all standards need to provide support for plurality at the next layer, and not arbitrarily restrict things.)

2008-03-27 09:01:35
Nice appraisal. I want to raise my arm and second the need to standardize the Zip format. The OOXML spec is meticulous in what it expects to be used and not used in the Zip specification (ODF is notoriously sloppy about this, plus it immortalizes an awful hack to create the equivalent of a magic number for ODF files).

(I've been making Jar files using 7zip and I guess I got away with it because I am using only the standard 95 printable characters that look like UTF-8 too, and not all that many of those.)

Steve Loughran
2008-03-27 09:50:21

Ant schema? Well, DTDs don't handle the same element name having different behaviours in different places, and (thankfully) we avoid XSD like the plague. One issue with Ant is that you can add new elements dynamically, so the schema would have to be dynamically generated at a specific point in time.

Regarding a Schematron task, it's usually best to keep it close to your code, and build and test with the latest stable Ant release.
1. Antunit is your testing friend; go search for it.
2. you can get an entry under the external tasks lists, just submit your patch to external.xml in our source tree.
3. do test against SVN head and complain early if something broke.

Cay Horstmann
2008-04-04 06:59:08
I am a little surprised to see OOXML displayed as a model for the ISO process. Isn't this the 6,000+ page standard that few people read in their entirety, and that had hundreds, if not thousands, of technical issues whose proposed resolutions by a single vendor were summarily approved without discussion, in a "fast track" process? Isn't this the standard that stubbornly refuses to build upon other well-established ISO standards, preferring to go with muddled messiness in the interest of backwards compatibility? Isn't this the standard that contains quite a few features that say "if flag X is set, imitate the behavior of legacy product Y which we decline to specify"?

Now it may be that the ZIP file specification in OOXML is a rare gem in a pile of steaming ____, but then you might want to do your squeamish readers the favor of pulling out those details.

Rick Jelliffe
2008-04-04 07:14:06
Cay: I don't know why it is so difficult for anti-OOXML people to admit that *some* good things have come out of the standardization process. Even getting an acknowledgment that there might be *some* good parts to DIS 29500 was like pulling teeth.

Standardization is good for some things. In particular it is good, when it works, for making one side allow or accede to the requirements of their competitors. It can be as simple as getting both sides to move away from their mutual NIH (not invented here) syndromes.

For example, if the process worked properly for Java (had it become a standard), the result of the GUI wars would have been a position where there was a standard graphics base that everyone had to support (AWT plus minimal Swing, for example), then a modularity system that allowed non-WORA graphics plugins, but managed: SWT, J++, full Swing, and so on. And MS mightn't have split off to make C#, and thence the CLR, and then .NET would be quite a different animal. Speculation, of course, but not an impossible outcome in theory.

Ravi Luthra
2008-04-04 07:20:44
Why ISO, there are a billion standards organizations:

And Java has been standardized:

It's funny that people think ISO is the only organization that can make something a standard.

Rick Jelliffe
2008-04-06 23:16:29
Ravi: But not all standards bodies are equal. (Nor should they be.)

I tend to make a three-tier distinction with standards bodies (excluding national ones):

* Standards bodies which are to some extent fake: the process or organization was established to allow some external input by a stakeholder in order to maintain effective control both of the technology and the process. I would put JCP in this class.

* Boutique standards bodies: membership based, which allows control of working groups by a cartel. Ecma, W3C, and OASIS are in this class.

* International standards bodies: membership based on nationality, which makes domination either by an individual or a cartel too difficult. ISO is in this class.

One of the practical differences is that as you get towards the ISO level, the need for inclusiveness increases. Let's say company A has a technology, and companies B and C have variants they would like which don't fit in with A's plans. In the first kind of organization, you will only get what company A allows: B and C may be out in the cold. In the boutique organization, A and B, or A and C, or B and C will form an alliance and get what they want, excluding the third company's requests. In the international model, none of what A, B and C want is guaranteed, but they each will (if things go properly) get a fair hearing and one's requirements will not be excluded because of another's competitive positioning.