Markup's Dirty Little Secret
by Rick Jelliffe
This means that even if a document is saved as XML that completely captures all the page and style settings, and even if the receiving system has the same generic fonts and even the same complement of "muffin borders" and other art available on both sender and receiver, a document moved from vendor A's application on platform A cannot be expected to open with line-for-line, or (for multi-page documents) even page-for-page, fidelity in vendor B's application, or even in vendor A's application on platform B. Even with good matching, a word here or there will break or hyphenate differently, a line will break differently over a page, and so on.
This is particularly noticeable in short measures, table cells above all. Unless every cell in the table is wider than its content, with no multi-line cells, there is every chance that lines will break differently.
Note that this will happen regardless of whether you are using ODF or Open XML: it is not so much a limitation of the XML representation as the fact that the applications have different code inside them. If you want exact fidelity, the current state of the art is that you pretty much have to open the document in the same application (and on the same platform) it was created in.
What can you do to minimize this?
Well, for a start you need to set your expectations appropriately: an HTML page looks different on every browser and OS, and depending on the window size too. Do you really need exact line and page fidelity? The strong lesson of HTML is that it is better to have presentation-independent design, allowing flexibility, in order to gain retargetability.
Strategies for coping with these issues have dominated SGML/looseleaf publishing systems: it is not simple. One thing is to wean yourself off page and line dependencies: use section numbers and IDs to refer to things, not page numbers. Never hard-code page numbers or line numbers; use references and variables instead.
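The same principle is familiar from TeX-family systems. As a minimal LaTeX sketch (the label name is illustrative), refer to targets symbolically and let the formatter resolve the numbers at composition time, so that repagination can never make a reference stale:

```latex
% Hard-coded reference -- breaks as soon as pagination shifts:
%   "See the table on page 42."

% Symbolic reference -- the formatter fills in the numbers:
\section{Fidelity strategies}\label{sec:strategies}

See Section~\ref{sec:strategies} (page~\pageref{sec:strategies})
for the full discussion.
```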
In your typesetting specs, make your widow/orphan control move paragraphs over the page readily (if you expect there will be additions), so that there is plenty of whitespace at the bottom of the previous page and typing a few extra words here and there will not cause repagination. Done well, this also makes the occasional forced page break more workable.
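In TeX terms, a sketch of this kind of slack-friendly setup might look like the following (the specific penalty values are conventional choices, not a prescription):

```latex
% Forbid widows and orphans outright, so paragraphs move to the
% next page whole rather than splitting a line off:
\widowpenalty=10000   % no lone last line at the top of a page
\clubpenalty=10000    % no lone first line at the bottom of a page

% Allow uneven page bottoms, i.e. leave the resulting whitespace
% as slack instead of stretching vertical space to fill the page:
\raggedbottom
```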
There are mixed strategies: send a PDF as well as the XML document, and use the PDF as much as possible until editing is needed. Or send HTML as well, to discourage page-centrism.
Another strategy is to clearly separate out those pages that must not break, and treat them as artwork included from external documents. Pages containing examples of forms, in particular, are better handled as graphics when included in general documents.
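In LaTeX this "page as artwork" approach is straightforward with the standard graphicx package; the filename here is a hypothetical placeholder:

```latex
\usepackage{graphicx}  % standard package for including external graphics

% A form whose layout must not reflow, included as a fixed image
% on its own page; its internals can never repaginate:
\begin{figure}[p]
  \centering
  \includegraphics[width=\textwidth]{order-form.pdf}
  \caption{Order form (fixed artwork)}
\end{figure}
```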
And there are procedures to adopt as well. For example, if someone sends you a document and you open it in application B, first go through all the tables and resize the text so that it breaks the same way as in the PDF. Of course, this relies on your document using styles: but if you don't use styles you are probably messed up anyway, because there are many ways to do the same thing, and they may produce different results. For example, on some systems a bold space is wider than a plain roman space!
I remember that WordPerfect had a (patented) feature that would adjust font sizes and table borders for optimal layout. This is exactly the kind of thing that would be needed if we want better guaranteed fidelity at the line and page level between applications.
Is infidelity ever forgivable?
So remember, there are three kinds of fidelity: fidelity because the document carries all the information used by the producing and receiving applications; fidelity because the applications have the same resources available to them; and fidelity because the producing and receiving applications have the same algorithms and defaults. When looking at the various claims (Len Bullard mentions Spy versus Spy) made by MS on Open XML and "fidelity", and by ODF people on "interoperability", we need to interpret them in the hard light of the Dirty Little Secret.
Governments and procurement projects need to be quite clear that whenever they insist on page fidelity, they are probably in fact locking themselves into one vendor's tools, in which case it becomes a debate on features, quality, price, training, etc. In a limited sense, everything *except* interchangeability.
Different spellings of "algorithms" too!
This is a great analysis. I would quibble over your definition of fidelity as being only on the technical side: it needs to account for what fidelity is in the eyes of the users, the authors and readers and collaborators. But the problem of presentation-preserving interchange is daunting and I'm happy to see your emphasis on it. There needs to be a tie-back to the HTML example and how there is a contextual spectrum of practical fidelities. (And even in the HTML case there are browser wars about who does it "right"!)
Hmmm. Again an area where TeX may be of interest. It was written to give (basically) identical calculations/algorithms on different machines, tests (trip/trap) are integral to creating a version that may be called TeX, the basic (Computer Modern) fonts are controlled for stable metrics, TeX itself is now frozen for future development, etc., etc.
Jimbo: Quite a few of the SGML-generation typesetting systems were based on TeX. I knew of one company in Japan that overlaid a LISP system (I think Common LISP) on top of TeX (Interleaf and ISO DSSSL were also LISP systems on top of a page model), and I think Sebastian Rahtz maintains an XML/SGML-on-TeX system. The feeling is that TeX provides a good basis for extending the typesetting capabilities, but an inadequate basis for higher-level control and glue between XML/SGML and TeX for large reference publishing, hence the need for an intermediate interpreted system like LISP.