Can a file be ODF and Open XML at the same time? (and HTML? and a Java servlet? and a PDF archive?)

by Rick Jelliffe

We are used to thinking in terms of formats as rivals (either X or Y) or adjuncts (X for this use; and Y for that use) but what if there is an entirely different way of approaching office file formats? What if we learn from the success of Apple's Fat Binary system and progress to a ZIP-based system where different standards can co-exist and share media files in the same package?

For background on this, see my 1999 paper How to Promote Organic Plurality on the WWW, which introduces three ideas (data kidnap, workflow kidnap and data lockout) as more useful specific concepts than the usual data lock-in (which, being an over-broad concept, tends to generate over-broad solutions). The basic idea is that technology needs to be layered so that each layer can allow a multiplicity of alternatives, rather than monolithic solutions. Think of the Internet stack and all the RFCs giving alternatives. The paper was written as part of thinking about XML Schemas at the time of its development, and I think our later experience with XSD has entirely borne out its conclusions: XSD has succeeded where it is modular (e.g. data types) and had trouble where it is monolithic.

Exploiting ZIP

ODF, Open XML and Java Web Applications (.WAR files) are all based on ZIP archives. Change the extension to .zip, and you can poke about inside with just COTS ZIP utilities. So at the moment, it seems that we could actually, say, save a simple word processing document as HTML, ODF and Open XML, then merge them into a single file after paying attention to various path and metadata issues and removing duplicated media files. If we give that file the .odt extension, it will open as ODF; .docx, it will open as Open XML; .war, it can be installed as a servlet serving HTML pages.
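To make the merging step concrete, here is a minimal sketch of how it might work, not a full implementation: duplicate entries (shared media files, say) are written once, and all the manifest and content-type adjustments are left out. The function name and the first-archive-wins policy are my own assumptions.

```python
import zipfile

def merge_packages(first, second, out):
    # Copy every entry from both packages into one ZIP, skipping names
    # already written (e.g. duplicated media files shared by the formats).
    seen = set()
    with zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for src in (first, second):
            with zipfile.ZipFile(src) as z:
                for name in z.namelist():
                    if name in seen:
                        continue
                    seen.add(name)
                    dst.writestr(name, z.read(name))
```

In practice the two formats mostly keep their parts under different paths (ODF's content.xml versus Open XML's word/), so collisions are largely confined to deliberately shared resources.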

Does this give us a file that is three times the size of the single-format file? Probably not: media files (which can easily dwarf the text components) will be shared, and there will also be fewer unique strings to compress, since a unique string from the original document's text will appear in each of the three formats. Also, the modern formats allow embedded XML for forms or spreadsheet data sources, which can also be shared. So I don't see it as impractical from the size point of view.

The kinds of adjustments that would need to be made include adding the appropriate MIME-based content type information to the ODF manifest metadata and the Open Packaging Conventions content types file. But the result would be a single file that could, with the appropriate extension change, be read by any of the other systems.

And, more interestingly, if we made up a new extension (.superXML?), an application could open up the file and then select whichever format it was happiest with. For example, an application might only cope with HTML and ODF, and so would choose one of those. Or an application might decide to open the file using whatever was the native format of the application that created the document: for example, if the document was created by Open Office, the receiving application might decide that ODF matches the feature set of Open Office better than Open XML does, and so import using that.
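A receiving application could sniff the package for each format's well-known entry point and then pick from its own preference list. The sketch below is hypothetical: the ODF and OPC marker files are those formats' usual entry points, but "index.html" is purely an assumption about where an HTML rendition might live.

```python
import zipfile

# Well-known entry points used as format markers (index.html is assumed).
MARKERS = {
    "odf": "content.xml",
    "openxml": "[Content_Types].xml",
    "html": "index.html",
}

def available_formats(package):
    # Report which renditions the combined package appears to carry.
    with zipfile.ZipFile(package) as z:
        names = set(z.namelist())
    return {fmt for fmt, marker in MARKERS.items() if marker in names}

def pick_format(package, preferences):
    # Return the first format, in preference order, that the package has.
    have = available_formats(package)
    for fmt in preferences:
        if fmt in have:
            return fmt
    return None
```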

A different road to harmonization

With this kind of framework in place, the road to harmonization becomes clear: it is no longer a question of "Which format do we choose as the round hole, and which formats have to become square pegs?" but rather "What modules do they have in common now? What modules can be split out of one to help the other?" So by supporting plurality and modularity, we can actually find the points of similarity and quarantine the differences to ever-smaller alternative fragments.

Let's give a practical example. Font matching is the feature where an application opens a file and, upon discovering that some font needed by the document is missing, tries to find a near match. Various mechanisms can be used, but the most basic matching criterion is of course whether the font contains the characters for the language (I mean "script", of course) being used. It is no good using a Russian font for a Thai document.

ODF betrays its pre-Unicode and UNIX roots here: it uses a non-Unicode-based system that matches on the locale character set of the original document (or of the font). So it will say "This font has an ISO 8859-1 mapping table, therefore we will look for another font with an ISO 8859-1 mapping table." This is pretty crappy in theory, actually, because Unicode extends so many of the locale-based character sets, but ultimately OK, because these things are only optional hints and the more hints the better.

Open XML uses the more modern Open Font Format standard, ISO/IEC 14496-22, for font matching, which allows matching both by Unicode block and by major script family. The Open Font Format comes from OpenType, which in turn is a container for both Adobe PostScript fonts and Microsoft TrueType fonts: in fact, it is another example of this kind of containment mechanism.

Interestingly, it is this use of IS 14496-22 that shows one of the problems with ISO DIS 29500 (i.e. Open XML). You may remember that anti-Open XML people have raised the issue of bitmasks in Open XML, with the lunatic fringe going as far as saying that Open XML was riddled with bitmasks and that these were impossible to validate or manipulate in XSLT; and me then rushing to Schematron's defence and showing how it was entirely possible, if not trivial, in Schematron and XSLT. Well, the main place that bitmasks are found in Open XML is actually in the font/sig element that is used for font matching, and the bitmasks are the values specified by ISO 14496-22. There is no reason that I can see for an application to tease apart the bitmask numbers, certainly not to add 96 separate attributes for something that humans will not be interested in, because the numbers are just magic numbers that come from the original font and are matched against the prospective substitute fonts. In the same way, you don't want to have separate values for R, G and B, because a combined RGB value is more convenient for manipulation. (So the problem with DIS 29500 is not that it uses bitmasks in this element, but that it only gives a vague reference to the standard that the bitmasks are based on, when it should have a clear normative reference. I don't think anyone else has picked up the implicit ISO Open Font reference; hooray for me. Yet again this requires just an editorial fix rather than a technical fix.)
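To see why the magic numbers are convenient as-is, here is an illustrative sketch of bitmask-style font matching in the spirit of the OpenType OS/2 Unicode range fields from ISO/IEC 14496-22. The bit positions below follow my reading of the OS/2 table layout, but treat them as assumptions for illustration rather than normative values.

```python
# Script coverage bits in the style of the OS/2 ulUnicodeRange fields
# (bit positions are illustrative assumptions, not normative values).
BASIC_LATIN = 1 << 0
CYRILLIC    = 1 << 9
THAI        = 1 << 24

def covers_needed_scripts(candidate_sig, needed_sig):
    # A substitute font qualifies only if every script bit the document
    # needs is also set in the candidate font's signature.
    return (candidate_sig & needed_sig) == needed_sig
```

So a Russian font (Basic Latin and Cyrillic bits set) is rejected for a Thai document with a single AND operation, without ever teasing the mask apart into 96 separate attributes.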

So what should be done? Should Unicode people say to ODF, "You need to replace your antique system with something better"? Should Linux people say to Open XML, "You need to replace your cross-platform system with something that handles Linux-only legacy fonts better"?

With a common system based on plurality, we can say "Well, why not modularize both out as separate resources in the ZIP archive, so that each application has more resources to use?" Now Open XML is probably ahead of ODF here, because it splits the document into many different files in the archive and is already divided into multiple namespaces. So it would be great if ODF adopted the same kind of modularity too. Then an ODF application could, if it chooses, look in the Open XML font tables for better information. And a Linux system using the Open XML format could include information that helps with legacy documents on Linux.

Practical issues that need to be addressed to get to plurality

The overarching idea is not so much that each document will have a grab-bag selection of different formats, but that each document will have at least one complete version in a standard format *plus* any alternative and additional information from other formats that the application can provide, so that a receiving application can choose the best modules it can, and so that information interchange becomes less dependent on the limits of one particular standard.

I have mentioned before that no serious application suite can afford to ignore any common standard format. So in a couple of years' time I am sure we will see Open XML and ODF import/export as part of the base packages for all the suites. Indeed, governments and power buyers should demand this from vendors for the distros they buy. (I suspect this will indeed become a purchasing requirement: see the European Open Documents Exchange Formats workshop in Feb 2007, where (p.12) representatives from public administrations requested over and over again that industry take steps to overcome interoperability problems between ISO 26300 (ODF) and Office Open XML and to implement both standards in their products. The writing is on the wall.) But my idea goes beyond mere transformation to a model of enabling selective augmentation.

Now even though it seems we can probably make an archive with these different formats today, the difficulty is with writing them. Applications currently won't update the format parts they don't understand, of course. So if you update an ODF document that also has an embedded Open XML version, the Open XML version will be out of sync. This is an area for standards, and in particular for the maintenance of OPC and ODF: should the extra parts be removed, and how do we signal that in markup?
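One possible answer to the removal question can be sketched as code. This is just one hedged policy, not anything the standards currently define: an application that rewrites a document keeps only the parts it understands (identified here, by assumption, by path prefix), rather than leaving stale alternate renditions behind.

```python
import zipfile

def save_understood_parts(src, dst, understood_prefixes):
    # Copy across only the parts this application can keep in sync;
    # alternate renditions it cannot update are dropped, not left stale.
    with zipfile.ZipFile(src) as zin, zipfile.ZipFile(dst, "w") as zout:
        for name in zin.namelist():
            if any(name.startswith(p) for p in understood_prefixes):
                zout.writestr(name, zin.read(name))
```

The opposite policy, keeping the stale parts but flagging them as out of date in markup, is the alternative the standards bodies would need to weigh.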

Adopting the multi-format approach then has feedthrough for other formats, such as PDF. PDF would need to be unshelled, so that the various pages and resources were exposed as different files in the ZIP archive.

Of course, adopting this approach would not preclude different formats from cross-pollinating and converging where possible. But sometimes there are differences that cannot be reconciled, and supporting plurality means that no solutions are gratuitously ruled out by bureaucratic dictates for single standards. (Obviously I think the "Highlander" principle expressed here (p.6) is in danger of being terribly simplistic and impractical, unless the one true format itself allows plurality at subsequent layers.)

Already Open XML has some capability for allowing alternative chunks within a file, and ODF of course allows foreign elements so you could poke some alternative or extra information in there. But my view is that this is something that needs to be engineered at the standards level, with vendor buy-in, to push competition between standards bodies and their stakeholders one level up the protocol stack. Every level is a victory, and I think this is a race where we need to win one step at a time. The hare and the tortoise.

What steps might this involve? Well, for a start I think that most of the Open Packaging Conventions (OPC) should be adopted. There could be an on-ramp made for it, to allow current ISO ODF documents to fit in. The big difference is that ODF uses direct references to entities in the package, while Open XML uses OPC, which uses indirect references. So the idea would be an identifier resolution system where ODF applications first treat the reference as a local relative URL, then if that fails look it up in the OPC package, then if that fails treat it as an external URL (of course, delimiters will provide extra hints to speed this up). Furthermore, Open Office rewrites the identifiers used to GUIDs rather than human-readable names, so it would be nice to mirror SGML's PUBLIC/SYSTEM identifier distinction here: SGML got it right.
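The three-step resolution order could look something like the following sketch. The data structures (a set of package entry names, a dict of OPC relationship targets) are simplifications I've assumed for illustration.

```python
from urllib.parse import urlparse

def resolve(ref, package_entries, opc_relationships):
    # 1. Treat the reference as a local relative URL into the package.
    if ref in package_entries:
        return ("package", ref)
    # 2. Fall back to an OPC-style indirect lookup.
    if ref in opc_relationships:
        return ("package", opc_relationships[ref])
    # 3. Finally, treat it as an external URL if it has a scheme.
    if urlparse(ref).scheme:
        return ("external", ref)
    return (None, ref)
```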

But I don't think these issues are insurmountable. The question we need to ask is not "How do we enforce monolithic technologies?" but "How do we take the sting out of multiplicity?" It is not a question of trying to have the cake and eat it too, but rather that it is foolish and unworkable to merely throw half the cake out. Oh, that is getting far too aphoristic.

The pluralistic approach of this .superXML format also makes it easy to address issues such as equations, bibliographic citations and metadata, where the needs of laymen are entirely different from the needs of professionals. The primary standard formats can adopt simple, layman-oriented structures (Dublin Core, etc.) while encouraging specialist formats with higher quality requirements.


M. David Peterson
2007-07-29 01:36:04
> And, more interestingly, if we made up a new extension (.superXML?)

.lfz (Lingua Franca Zip)

M. David Peterson
2007-07-29 01:52:06

Rick Jelliffe
2007-07-29 02:23:35
.wtf (World Text Format)
M. David Peterson
2007-07-29 06:15:45
> .wtf (World Text Format)

Might get confused with "what the f**k!", but in a weird sort of twisted way they would kind of fit nicely together.

To throw another into the ring (and building directly off of yours),

.wdf (World Document Format)?

M. David Peterson
2007-07-29 06:19:59
A few more,

> .wdif (World Document Interchange Format)
> .gdif (Global Document Interchange Format)

I like the connection to the term "diff" as it fits well with what you have outlined above in regards to handling the various differences between the various doc-types, so it would provide a nice mental bridge in regards to what it actually represents.

2007-07-29 06:21:15
Rick, this is potentially one of the most constructive posts you have made on this issue.

Do you see large vendors being willing to change their existing software and formats so that something like this could work? In particular, do you see Redmond being willing to really try to express "the full richness of existing documents" in a vendor-neutral way? Because this proposal would stand or fall on how committed Microsoft is to making their data files equally usable by any vendor.

As you probably know, most independent ODF supporters are really after vendor-independent and openly-specified, freely-implementable document formats without any "IP" issues. Larger vendors would probably agree with that as well, as long as they could see a way to remain profitable (e.g., if everyone turned to plain UTF-8 text and used vi or emacs (for text) plus bc or dc (for calculations) they might not be very happy).

At the standards level, where the discussion seems to be focused right now, is there any support for pausing the process and forging a unified document format (layered and namespaced as necessary) that both ODF and OOXML supporters can get behind?

Bruce D'Arcus
2007-07-29 06:44:24
Rick: you are again, I think, giving undue prominence to Gary's position. While suggesting that people come to their own conclusions, the fact that you link to four different manifestations of this position -- and then justify it by in fact accepting Gary's position at face value (rather than, for example, considering that it may be Gary who has the backwards black-and-white view) -- is less than balanced.

Back to your suggestions, though, I think clearly if there were an effort to harmonize ODF and OOXML going forward, this is the way to attack it: start at the most basic (the zip archive), move up to the manifest, and finally the actual content, style, etc. pieces.

Rick Jelliffe
2007-07-29 07:03:39

.pax (Plural Archive for XML)

Rick Jelliffe
2007-07-29 07:23:43
Bruce: Yes, I probably should have kept the material in the last section (it did not go out in the main feed, it is in the extended entry only, by the way) for a different blog, and picked a different example so as not to sidetrack people.

But it is there for a purpose, apart from its intrinsic and sensational interest: to show that the lack of pluralism creates false adversaries, not only just between MS-users and Open Source promoters, but also even within ODF development, on the same side. This is a problem for all sides.

It is no use pretending that Edwards has not said these things, and I think I went out of my way to provide extra context and multiple sentences of warning not to rush to judgment. I am not sorry, however, that I only mentioned some possible cases rather than exhausting the possibilities, which would descend into nasty conjecture: I think we should assume that readers are bright enough to know that there are lots of other potential cases.

(Update) I have put the middle paragraph into small text, to show its position in the argument as a note rather than the thrust.

Rick Jelliffe
2007-07-29 07:35:54
W^L+: I have no idea what Redmond thinks. I think all vendors have two interests at their bottom line: they need to have an interoperability story (i.e. why they are harmless) and they need to have an extra features story (i.e. why they are preferable.) An approach like the one I am suggesting caters to that bottom line better than what we have now.

MS is further down this line than other vendors, because of OPC. I think they see lots of advantages (as do I) in OPC becoming a standard, because we have been farting around without an adequate packaging mechanism for standards documents for too long. We are missing the layer below XML, because the Web doesn't need it and XML is a Web technology. However documents as files still need it.

So I think actually it may be less of a hard sell to them than to other vendors, because they have a low barrier to entry. They already have ODF and XHTML export, for example. And you can see from Open XML Part 5 that they are paying attention to extensibility and future-proofing issues. But it is the kind of thing that I think governments should be pushing for.

I would tend to see this kind of idea as something that should be done after ISO Open XML is ratified, when we can start to be serious about looking at "harmonization". The big advantage is that it is an incremental technology: for MS it is retrofitting ODF and HTML capability into OPC, for ODF it is adding OPC support into their ZIP format. So there is no need for any players to start a brand new technical cycle because of it: it merely rearranges the current parts in a more satisfactory way.

Patrick Durusau
2007-07-29 07:40:53
Well, I can say this: You have finally tempted me into posting a comment to a blog entry on ODF/OOXML. ;-)

What if we carry the idea of abstraction just a bit further?

Suppose we defined a target file format that had elements that contained multiple namespaced GIs? For example, . Any application can "save" a file in its "native" XML format but can also "save as" a format that can be more robustly shared by different applications?

That would solve the content-updating issue, although it would obviously require a robust application layer to deliver the document that your application understands.

Granted, there are other areas, such as styles, where OpenDocument and OpenXML differ, such as "section" properties being supplied by stylesheets in the former and recorded in the content at the end of a "section" in the latter, but surely we are collectively bright enough to signal that to an application that wishes to read the file. If reading it as OpenDocument, you need the stylesheets. If reading it as OpenXML, the styles need to be at the end of the content.

For some purposes, such as internal office interchange, I may not need or care about other formats. For public documents or ones that need to travel about, then I could choose the interchange format that works seamlessly between applications. Or someone who has a copy of a single format document could convert it into a more interchangeable document.

I think the time has come to recognize that interchange comes at a cost and that one size doesn't fit all situations.

If we could craft, as both OpenDocument and OpenXML evolve, a standard abstraction that enables applications to deliver the format desired by another application (as well as creating a super format with that mapping), then the overhead of that "extra" ability to interchange would be borne by those who really need it.

Make no mistake, I think lossless interchange is an absolute necessity in many situations, but I am less certain that every application has to support it. I can easily imagine a public application site that allows a user to toss a file in the super format at it and the user is returned a file in a requested format. All of which are XML formats.

Think about a word processing application that isn't burdened with all the conversion routines for various formats. Perhaps this could be a step toward slimmer word processors?

And we should be mindful of the pending presence of UOF and the recent entry of PDF into ISO space. Such an abstraction could set the stage for folding these formats in as well.

Assuming funding to pursue such a vision, supported by the major players and the respective standards organizations, I think a combination of your suggestions on the Zip file format and a "superXML" format that when requested can express any number of defined formats, would attract widespread support. It would certainly benefit consumers, developers, governments and others whose interests are widely claimed by all sides in the current debate but that aren't being served all that well.

Rick Jelliffe
2007-07-29 07:44:02
Bruce (para 2): Yes, I think it is not unthinkable as a way forward.

I wouldn't think that it should stop low-hanging-fruit convergence (e.g. ODF adding harmless hints or features from Open XML that help conversion), because that helps conversions.

And I wouldn't think it should stop the development of metamodels such as Patrick Durusau has mooted, to allow better mappings of the difficult parts.

But where there are legitimate and difficult-to-reconcile alternative approaches to the same technology, and where one application has a different set of features than a standard technology provides, this kind of ZIP idea may have some value.

Rick Jelliffe
2007-07-29 08:00:59
Patrick: Ha I got you! :-)

Yes. I guess the idea would be that the Save As... dialog box would include a checkbox list of all the conversions supported by the application:
[ ] Open XML
perhaps with the native format always checked.

But it would be possible, of course, to have intermediate applications that take a file and add in the extra information even if the sending application did not use it. So we could take a file that only has HTML and run it through FOP and produce PDF pages for example.

> And we should be mindful of the pending presence of UOF and the
> recent entry of PDF into ISO space. Such an abstraction could
> set the stage for folding these formats in as well.

And it would allow an easier way out of any coming XPS versus PDF bunfight too. Instead of it being either or, a file could have XPS and PDF (if split into pages). Not because vendors would particularly find this attractive, but so that users can converge on their mass solutions without preventing niche alternatives.

It would also allow a way forward to some vendors who may find neither Open XML nor ODF attractive, because they don't want to give up control of their native formats. I am thinking of Adobe FrameMaker and UOF here, for example.

The drawback here is that this applies much more to word processors than spreadsheets, since spreadsheets typically don't have the media files and so on. But even with spreadsheets it is not unusable: you could have the formulas in the Open XML formula language and their equivalents in OpenFormula and in whatever format came in from Gnumeric, for example.

Rick Jelliffe
2007-07-29 09:15:43
W^L+ #2: Another answer to your question of whether Redmond would buy into this. I do not imagine that they will be very keen to engage with the standards process much if Open XML fails at ISO due to non-editorial issues. Why should they waste their time?
Bruce D'Arcus
2007-07-29 13:02:53
I'm curious why you think, Rick, that OPC is so demonstrably superior to ODF's manifest? Is it simply that it is a bit more generic?

I think the manifest really ought to be a place where we can have one standard. ODF has one. Adobe is involved in some effort that seems to adapt but extend the ODF work (MARS). And now we have OPC. And as far as I'm concerned, they're not THAT different.

Actually, ODF is about to get a new manifest along with the new metadata stuff. Because we base that on RDF, the manifest will also be RDF-based. It gives us the extensibility we want to provide (extension developers, for example, can add extra metadata they may need), without having to worry about breaking compatibility. The primary addition we've made is a mechanism to bind a stable URI to in-document content node ids and files. This is conceptually not all that different than what I see in OPC; it's just that the unique IDs are in fact URIs. Among other things, in the RDF context that allows further statements to be bound to those URIs.

Bruce D'Arcus
2007-07-29 13:21:03
Also, if OOXML fails at ISO, I might be wrong, but I have a feeling MS won't be able to be so casual as to walk away from standards work altogether and not try to learn from the failure.

I think it would be a fairly petty and misguided response to simply conclude that the failure is not in large part MS's own doing. They tried to dump a massive spec that stepped on a lot of pre-existing standards (not just ODF) into a fast-track process with as little change as possible. That kind of inflexibility not only results in a worse spec, but pisses people off. Who else in the world but MS would think they can actually get away with this? Arrogance does not go well with standards work.

Rick Jelliffe
2007-07-29 18:50:30
Bruce: Indirection is definitely superior, if not essential, for large documents. Without an indirection mechanism you need to store your data in databases pretty fast. One reason SGML succeeded for the kinds of big documents it did succeed in was that it moved the bar so that you only needed to install complicated databases for super-complex or large systems: much more could be done with just file structures.

SGML entities were based on several kinds of indirection, and these were really essential for successful creation of larger systems. When XML came along as "SGML for the Web" many of these mechanisms were redundant because the delivery mechanism of the WWW was assumed. However, when we are in the world of files not the web, the usefulness of entities returns. XML DTDs had simplified entities, but when we moved to XML Schemas, these disappeared. What OPC does, in part, is just re-invent the entity reference mechanism, in an XML element syntax.

There is some discussion of this in the XIndirect note at W3C. See also OASIS XML Catalogs for some of the doors that indirection opens for maintenance.

OPC has several advantages:

1) Indirection. Take the example of a catalog where a logo is repeated 1,000 times. Using an indirection mechanism, if we want to change the logo we only need to change one entry in one file, with no searching involved, because we know where to look. This increases maintainability substantially. If you want to change the layout of, say, the ZIP directories where your files are stored, it becomes a trivial matter, because you don't need to know every element that can point to a file; they are all in one spot.

2) Chunking. It encourages documents to be split into small chunks. This is better for reducing the effect of file corruption, and better for data access: for example, all the style information in one XML part, and each separate worksheet or table in its own part. This allows faster access and less object creation for clients, and makes it easier for multiple processes to work on the same document.

Chunking also benefits programmers. Replacing one stylesheet with another becomes a ZIP file operation, not an XML operation. And it reduces the amount of things that a programmer needs to understand, because they can approach the chunks assuming that all the information on a topic is in that chunk: they are spared the mental toil of having to search through a big file with lots of extraneous elements.

3) Relative indirection. In OPC, each file that has references has its own _rels file with the indirection lists. This makes it easier to cut and paste some information with all its associated resources in some cases, provides name scoping to remove the chance of name clashes between files, and so on.

(Aside) Apart from this, OPC has the advantage of actually specifying ZIP. (I don't know how ODF ever managed to get accepted when it does not follow ISO's rules for references with its ZIP reference.)
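The maintainability win from indirection (point 1 above) can be sketched with a toy relationship table; the relationship id and file paths here are made up for illustration:

```python
# The relationship table is the only place that maps ids to package paths.
relationships = {"rId1": "media/logo-2006.png"}

# The document body repeats the *indirect* reference a thousand times...
body = ["<img ref='rId1'/>"] * 1000

def target_of(rel_id):
    # Resolve an indirect reference through the relationship table.
    return relationships[rel_id]

# ...so swapping the logo is one change in one known place, no searching:
relationships["rId1"] = "media/logo-2007.png"
```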

RDF? Why aren't you using ISO Topic Maps, since that is the SC34 technology?

Rick Jelliffe
2007-07-29 19:04:22
Bruce: I get frustrated that almost every conversation with ODF people about how to move constructively forward ends up degenerating into a rave about why MS is at fault in any situation. It is just not productive.

It is that kind of hostility that makes it impossible for MS to participate in ODF. In the ODF world you should have been trying your hardest to get MS onside and find out whatever levels of cooperation were possible: did you even consider OPC for example? If you didn't even consider it, where on earth are your heads at?

On the specific point you raise. You would know that they submitted a 2000 page document to ECMA and that the ECMA process resulted in the document expanding to 6000 pages. Blame ECMA and openness for the size.

Also, there is no scope in the ISO procedure for them to make corrections on-the-fly. After they have submitted it, they can either withdraw and start again or they can wait until the end (the Ballot Resolution Meeting). Blame ISO procedures.

As for dumping, they made available drafts to SC34 people for comment months before the official start of the review period.

Which is not to say that the sun shines out of Microsoft's Record Separators (little SGML joke there).

MS takes about three years for an idea to get from acceptance to adoption in products, and that is within the ten-year technology adoption cycle for strategic technologies. So we need to be getting consensus now on feedback on Office 2007 to influence current Office 2010 requirements gathering, even for the limited directions that Open XML could be steered in over the next decade, and to have MS and the other vendors teed up for universal adoption of some format like this .pax format I suggest, where all stakeholders' requirements are catered to, in a 2013 (!) timeframe. It really may be that slow.

Peter Sefton
2007-07-29 23:53:41
Rick, this is a great idea. There is actually a pretty safe interoperable subset of ODF and OOXML that we're exploring on the ICE content management project - we should produce an interop guide I guess. The key is to use styles to carry important structural information, both to aid in interop, but also to allow decent HTML export.

I have to bring up the issue of an HTML version, which you zip past in the first part of the document. It's pretty hard to get good HTML out of a word processor out of the box - partly because the formats are not structured the same as HTML and partly because HTML export code is pretty bad. See this series of recent attempts I made to produce a paper in HTML using three word processors.

But, if there was a template that one could use that provided safe interop, along the lines of the one used in our project, then it would be possible to make good HTML. In fact - the HTML might even provide an interoperable core to interchange with other formats.

Imagine a Venn diagram with ODF, OOXML and HTML as circles all overlapping a bit in the middle. The interoperable core represents a pretty decent set of functionality and formatting - more than enough for general word processing. As Tim Bray used to say, HTML would make a good word processing format, and we have found that if you provide useful templates that help users stay within the overlap, they're happy to work in that zone.

Rick Jelliffe
2007-07-30 01:03:04
Peter: I guess in one sense my idea is concerned with how to have co-existence even without mapping. So that implementations don't have to implement the whole of a rival's spec, just the parts that add value or improve functionality for example.

So mappings and subsets would come into play *after* the basic format was established in a sense, as ways for vendors and developers to "fatten" files. The support for plurality takes the sting out of subsetting and extension.

Bruce D'Arcus
2007-07-30 07:21:05
Rick: "Bruce: I get frustrated that almost every conversation with ODF people about how to move constructively forward ends up degenerating into a rave about why MS is at fault in any situation. It is just not productive."

I feel exactly the same way, and it cuts both ways. You opened the door by repeatedly raising the threat that failure at ISO will lead to some kind of catastrophe where MS turns entirely away from standards, like some petty child we must placate.

But enough. I'm happy if we can just focus on the facts of the technology issues :-)

Bruce D'Arcus
2007-07-30 07:26:49
On: "RDF? Why aren't you using ISO Topic Maps, since that is the SC34 technology?"

Ask Patrick ;-)

2007-08-02 16:17:20
Hey Rick. I like this idea a lot, and I agree about the problem of needing a scheme to deal with dependencies and synchronization of the related parts. I recommend OPC as a basis for developing in that direction for both ODF and OOXML and whatever else.

I like .pax (also .odfx), and note that pax was at one point a replacement and upgrade for tar. I don't know what happened to it.

Marcus Groeber
2007-08-03 01:14:21

One specific note on compressibility of such an all-in-one format...

You write that multi-format files would be smaller than the sum of their parts partly because there will be fewer unique strings to compress: a unique string in the original document's text will appear in each of the three formats.

I believe this would not work with the current way .zip compression works, because by default all files are compressed independently (and thus can be extracted without regard to the rest of the package contents). This means that multiple XML files sharing the same content at different points in the .zip filesystem would not be optimized by the compressor.

"Solid" archiving (where the entire filesystem is compressed as if it were one file, such as .tar.gz tends to do) may eliminate that to some extent, as long as similar files are sorted close enough together to share substantial dictionaries, but overall this may require some explicit consideration beyond just specifying "any" type of compressor to work optimally.

Certainly nothing unsolvable if the politics of this can be addressed - and something that should not distract from a great idea.

Perhaps a .zip-based .pdf container would be a good next step?

Rick Jelliffe
2007-08-03 01:47:16
Marcus: Oh, good catch thanks!

I guess a smart zipper could use the LZ77 compression trees from one data-intensive file on the equivalent file in a different format. That may help deflate.

Marcus Groeber
2007-08-04 01:52:09
"I guess a smart zipper could use the LZ77 compression trees from one data-intensive file on the equivalent file in a different format. That may help deflate."

True. My only concern here would be that this would probably break compatibility with "standard" zip as we know it today - to my knowledge this format does not contain any mechanism for smartly pre-loading dictionaries.

The closest would probably be double-zipping the constituent files, once with "store" compression, and then wrapping the result into a "deflate" envelope, so there is only one adaptive dictionary used across the entire file.
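That double-zipping trick can be sketched with Python's zipfile module (the file names and sample content are hypothetical): the constituent files go into an inner archive with "store" compression, and the whole inner archive is then deflated as a single entry of an outer envelope, so one adaptive dictionary spans all the near-duplicate parts.

```python
import io
import zipfile

# Hypothetical near-duplicate parts of a multi-format package.
parts = {
    "content.xml": "<text:p>shared document text</text:p>" * 100,
    "document.xml": "<w:t>shared document text</w:t>" * 100,
}

# Inner archive: "store" compression, so the parts stay uncompressed
# and their redundancy remains visible to the outer compressor.
inner = io.BytesIO()
with zipfile.ZipFile(inner, "w", zipfile.ZIP_STORED) as zf:
    for name, data in parts.items():
        zf.writestr(name, data)

# Outer envelope: one deflate stream over the entire inner archive.
outer = io.BytesIO()
with zipfile.ZipFile(outer, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("package.zip", inner.getvalue())

print(len(inner.getvalue()), len(outer.getvalue()))
```

The cost, of course, is that a consumer must unwrap two layers, so this is no longer a package that a stock ODF or OOXML reader can open directly.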

Rick Jelliffe
2007-08-04 02:20:24
Marcus: ZIP allows you to use the precomputed compression trees or build your own. Often it is not worthwhile, or is too difficult, so it is possible that some unZIP programs are not adequately tested with DIY trees, but ZIP itself certainly allows it.
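As an illustration of the priming idea at the zlib level - and this is my assumption about the closest widely-available analogue, since a plain .zip entry records no dictionary, so both ends would have to agree on it out of band - a compressor can be seeded with a preset dictionary taken from one format's text before deflating the equivalent text in another format:

```python
import zlib

# Hypothetical equivalent fragments of the same document in two formats.
odf_text = "<text:p>Some shared paragraph of the document.</text:p>"
ooxml_text = "<w:p><w:t>Some shared paragraph of the document.</w:t></w:p>"

# Prime the compressor with the ODF rendition as a preset dictionary
# (zlib's zdict parameter), then deflate the OOXML rendition against it.
comp = zlib.compressobj(zdict=odf_text.encode())
primed = comp.compress(ooxml_text.encode()) + comp.flush()

# For comparison: the same text deflated with no dictionary.
plain = zlib.compress(ooxml_text.encode())
print(len(primed), len(plain))
```

The primed stream is smaller because the shared sentence becomes a single back-reference into the dictionary; the decompressor must be given the same dictionary to reconstruct the text.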
Jesper Lund Stocholm
2007-08-11 14:44:07

"Interestingly, it is this use of IS 14496-22 that has shown one of the problems with ISO DIS 29500 (i.e. Open XML). You may remember that anti-Open XML people have raised the issue of bitmasks in Open XML, with the lunatic fringe going as far as saying that Open XML was riddled with bitmasks and that these were impossible to validate or manipulate in XSLT; and me then rushing to Schematron's defence and showing how it was entirely possible, if not trivial, in Schematron and XSLT."

Can you give a reference to where you "rushed to Schematron's defense"? I have been trying to make head or tail of this problem, as it has popped up here in Denmark as well.

Thanks :o)

Rick Jelliffe
2007-08-11 15:27:18
Jesper: The fallacy that bitmasks are impossible (or difficult) with Schematron (or even XSD regex) went into the Groklaw discussion pages:

and in my response to "Josh" in this Blog

and also in the Wikipedia talk pages, where I managed to get the wording corrected from, IIRC, there being "many" bitmasks to there being "some" bitmasks.

And also as "bitfields" (<sig> at least is not used as a mask) in

The connection of the bitmasks in the particular case of the element (which is the one often quoted) was in this post
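For readers trying to make sense of the bitmask argument: XPath 1.0 has no bitwise operators, so a Schematron assertion has to test an individual bit with plain div/mod arithmetic. A sketch of that arithmetic in Python (the sample mask value is hypothetical):

```python
def bit_set(value, bit):
    """Test one bit using only integer division and modulo, the same
    arithmetic an XPath 1.0 test like floor($v div 4) mod 2 = 1 uses."""
    return (value // (2 ** bit)) % 2 == 1

# A hypothetical bitmask attribute value, already converted from its
# hex string form (hex parsing is a separate, also solvable, step).
mask = int("0x2A", 16)  # 0b101010
print([bit_set(mask, b) for b in range(6)])
```

A Schematron assert simply embeds that expression, e.g. `<assert test="floor($v div 4) mod 2 = 1">...</assert>`, so the claim that bitmasks defeat XSLT-based validation does not hold up.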

And on a lighter note