We still need an ISO standard for ZIP

by Rick Jelliffe

It is extraordinary that we have no standard, ISO or otherwise, for the ZIP format, when it is the basis for modern packaging: JAR, WAR, EAR, SCORM, ODF, Open XML, etc.

I have found that I was wrong to think that DIS 29500 (Open XML) includes a ZIP specification. What it has is a quite detailed profile (more than 20 pages) that requires deflate compression and disables all the advanced features of ZIP; however, it falls well short of being an actual ZIP specification. So Open XML and ODF (which has only two paragraphs on this) both ultimately reference the PKWARE definition of ZIP. Sigh...
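To make the idea of a restrictive profile concrete, here is a minimal sketch in Python of what checking an archive against one might look like. The two rules used (only stored or deflate entries, no encrypted entries) are an illustrative assumption and not the actual DIS 29500 profile, and the names `profile_violations` and `build_sample` are hypothetical.

```python
import io
import zipfile

# Assumed, simplified profile: entries may only be stored or deflated,
# and may not be encrypted. This is NOT the real Open XML rule set.
ALLOWED_METHODS = {zipfile.ZIP_STORED, zipfile.ZIP_DEFLATED}
ENCRYPTED_FLAG = 0x1  # bit 0 of the general purpose bit flag


def profile_violations(data):
    """Return a list of human-readable problems found in a ZIP archive."""
    problems = []
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.compress_type not in ALLOWED_METHODS:
                problems.append(
                    f"{info.filename}: compression method {info.compress_type}"
                )
            if info.flag_bits & ENCRYPTED_FLAG:
                problems.append(f"{info.filename}: encrypted entry")
    return problems


def build_sample(method):
    """Build a one-entry archive in memory using the given compression method."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=method) as archive:
        archive.writestr("content.xml", "<doc/>")
    return buffer.getvalue()


if __name__ == "__main__":
    print(profile_violations(build_sample(zipfile.ZIP_DEFLATED)))
    print(profile_violations(build_sample(zipfile.ZIP_BZIP2)))
```

A real profile would need many more checks than this; the point is only that a profile narrows an existing format and still depends on some other document to define that format.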

Last year, I was tasked by SC34 to investigate an ISO standard for ZIP, so I will have to start looking into it again. I am interested in finding out what people think about how much of ZIP should be standardized (if we can indeed get any of it standardized): is a minimal Open XML-style ZIP the way to go, or is something more full-featured better? Should the goal be implementability and outgoing compatibility (be conservative in what you send), in which case something like the Open XML subset is appropriate, or incoming compatibility (be generous in what you receive), in which case an ISO ZIP would try to allow as much variability as possible?


Preston L. Bannister
2007-08-08 21:01:01
How much of a "need" is there, really? What changes if we have a standard?

Standards are especially of use in avoiding incompatible implementations, and help promote a common evolution. Given the long existence and wide use of the open source Info-Zip implementation, pretty much anyone who needs an implementation already has one. Given the need for compatibility, you can only standardize what is already there.

Can't see a lot of value in this effort.

Rick Jelliffe
2007-08-09 02:09:45
Preston: There are a couple of general reasons for standards for established formats. One is that it forces issues of internationalization, accessibility, IP and documentation quality into the open. Another is that some organizations (especially some govt organizations) have a requirement to specify standards in their documentation.

In the case of ZIP, when I looked at it last year the development effort had forked incompatibly, with PKWARE allocating some extensions for one use, while a rival developer allocated the same extensions for another use. (My memory is hazy now: I think it was in the area of signing parts.) Where there is a split spec, a standard can be useful either to tell the public the common subset that is adopted by all the major players, or to establish which of the forks is the right one. It can also be a useful forum for getting diverging developers to sit at the same table.

Rick Jelliffe
2007-08-09 02:13:42
Preston #2: There is also the issue that ISO requires (for standards developed in it) that specifications which are normatively referenced should come from some legit standards body (or have special permission.) A normative reference is one to an external specification without which you cannot implement the standard in consideration.

Gary McGath
2007-08-09 07:20:17
I've linked to this from my own blog and noted a few issues.

In reply to Preston Bannister, standardization by implementation is a bad idea. Quirks and defects can be hidden in an implementation, even when it's open source. With the increasing use of Zip as a component of documents, it's important that the format be able to outlive a particular set of code.

I think a Zip specification should be open-ended to future compression schemes, while specifying those which are widely used.

Preston L. Bannister
2007-08-10 08:56:51
Just to note the obvious, a standard without implementation is still a matter of interpretation. In effect a standard is a theory - unless put into practice, you have no idea whether the theory is workable, complete or accurate. Quirks and defects can remain hidden in a standard (Ada comes to mind - from first hand experience). Reference implementations are always a valuable sanity check on any standard.

If the goal is to document existing practice (to satisfy a particular external requirement) then a standard is harmless. The ZIP standard is a case where there is little or no need for evolution.

There are other domains where the existence of an external standard may be helpful in ensuring compatible evolution (JavaScript / ECMAScript for one). ZIP is just a container for compressed objects with a directory. There is little or no need to evolve the ZIP format. (There is some need to allow for objects over 2 gigabytes, otherwise not much.) Additional semantics can be obtained by injecting special objects into the container (as is done for the JAR format).
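As a rough illustration of that last point, the Python sketch below adds a JAR-style `META-INF/MANIFEST.MF` entry to an otherwise ordinary archive. The helper names `build_package` and `read_manifest` are hypothetical, and the manifest content is only a sample.

```python
import io
import zipfile

MANIFEST_NAME = "META-INF/MANIFEST.MF"  # the entry name JAR uses for its manifest


def build_package(manifest_text, payload):
    """Write the manifest first, then the payload entries, into an in-memory ZIP."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(MANIFEST_NAME, manifest_text)
        for name, data in payload.items():
            archive.writestr(name, data)
    return buffer.getvalue()


def read_manifest(data):
    """Return the manifest text, or None if the archive carries no manifest."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        if MANIFEST_NAME not in archive.namelist():
            return None
        return archive.read(MANIFEST_NAME).decode("utf-8")


if __name__ == "__main__":
    package = build_package("Manifest-Version: 1.0\n", {"hello.txt": b"hi"})
    print(read_manifest(package))
```

A tool that knows nothing about the manifest still sees a normal archive with one extra file in it, which is why this kind of extension does not require any change to the container format.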

Adding more compression schemes to the ZIP format is dubious. Added attributes might be safely ignorable by existing code, but new compression schemes would be incompatible. In effect, you would have a new format, and a new format should go by another name.

In the end that is precisely my point. The ZIP format is valuable in that it is sufficient, easily used, and stable. Any standard should not change this. In this case, change is bad.

Rick Jelliffe
2007-08-10 09:15:05
Preston: Funnily enough, just today I came across an example where it would have been much better if ZIP had been an ISO standard for a couple years.

When a standard is reviewed, national bodies check whether it is good enough for them: so good internationalization/localization is high on the list of requirements.

According to the PKWARE site's changelog, they only formally added the methods for signifying that filenames used UTF-8 encoding about 12 months ago. This is very late in the day, late enough that it seems that it has complicated the specification of Open XML, for example, by requiring Open XML to have unnecessary mappings.

Preston L. Bannister
2007-08-14 05:00:20
As the ZIP format dates back to 1989, in this context "a couple years" might be more like 15 years. Given the passage of time, the number of deployed ZIP implementations, and the huge number of ZIP data files - there just is not a lot of room for changing things.

Yes, as you note, it would have been better if the ZIP format specified character encodings for file names (and file contents for that matter). Odd that this issue went unaddressed for so very long - until you think through the use cases. In the case of ZIP use embedded within another standard (JAR, WAR, EAR, SCORM, ODF, Open XML, etc.), with a little care - storing all filenames as UTF-8, for example - any gaps in the current ZIP standard become irrelevant.
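For anyone who wants to see what is at stake, here is a small Python sketch that reports, per entry, whether the UTF-8 filename indicator (bit 11 of the general purpose bit flag in the PKWARE notes) is set. The function names are hypothetical, and Python's `zipfile` module is used only as a convenient reader and writer; how other tools set this bit is not something the sketch can tell you.

```python
import io
import zipfile

UTF8_NAME_FLAG = 0x800  # bit 11 of the general purpose bit flag


def names_with_encoding_flag(data):
    """Map each entry name to True if its UTF-8 filename flag is set."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return {
            info.filename: bool(info.flag_bits & UTF8_NAME_FLAG)
            for info in archive.infolist()
        }


def build_sample():
    """Build an in-memory archive with one ASCII and one non-ASCII entry name."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("plain.txt", b"a")
        # Escapes spell an e-acute, so the name cannot be encoded as ASCII.
        archive.writestr("r\u00e9sum\u00e9.txt", b"b")
    return buffer.getvalue()


if __name__ == "__main__":
    for name, flagged in names_with_encoding_flag(build_sample()).items():
        print(name, flagged)
```

An entry without the flag leaves the reader guessing at the encoding of the name, which is exactly the situation a containing standard has to rule out with its own conventions.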

Again, we end up with fairly minimal value for an ISO ZIP standard.