Markup's Dirty Little Secret

by Rick Jelliffe

Different applications on different systems use different fonts (or fonts with the same family name but different metrics), different hyphenation algorithms, different hyphenation defaults and dictionaries, different-sized spaces, different line-breaking algorithms, different widow/orphan/keep-together rules, and different co-ordinate space measures.

This means that even if a document is saved as XML that completely captures all the page and style settings, and even if the receiving system has the same generic fonts and even the same complement of "muffin borders" and other art available on both sender and receiver, a document moved from vendor A's application on platform A cannot be expected to open with line-for-line or (for multi-page documents) even page-for-page fidelity in vendor B's application, or even in vendor A's application on platform B. Even with good matching, a word here or there will break or hyphenate differently, a line will break differently over a page, and so on.

This is particularly noticeable on short measures, table cells in particular. Unless every cell in a table is wider than its content, with no multi-line cells, there is every chance that lines will break differently.

Note that this will happen regardless of whether you are using ODF or Open XML: the cause is not so much limits of the XML representation as the different code inside the applications. If you want exact fidelity, the current state of the art is that you pretty much have to open the document in the same application (and on the same platform) it was created in.

What can you do to minimize this?



Well, for a start, you need to set your expectations appropriately: an HTML page looks different on every browser and OS, and with every window size too. Do you really need exact line and page fidelity? The strong lesson of the HTML experience is that presentation-independent design, which allows flexibility, is better, because it buys you retargetability.

Strategies for coping with these issues have dominated SGML/looseleaf publishing systems: it is not simple. One thing is to wean yourself off page and line dependencies: refer to things by section numbers and IDs, not page numbers. Never hard-code page numbers or line numbers; use references and variables instead.
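For example, in a DocBook-flavoured vocabulary (a sketch only; the element and attribute names will vary with your schema), a reference resolves by ID, and the formatter supplies whatever section or page number applies after each reflow:

    <section xml:id="limits-of-fidelity">
      <title>The Limits of Fidelity</title>
      ...
    </section>

    <!-- Elsewhere: the formatter resolves the number at layout time -->
    <para>The constraints are discussed in
    <xref linkend="limits-of-fidelity"/>, not "on page 37", which may
    be wrong after the next reflow.</para>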

In your typesetting specs, make your widow/orphan control move paragraphs over the page readily (if you expect there will be additions), so that there is plenty of whitespace at the bottom of the previous page and typing a few extra words here and there will not cause repagination. If you do this well, it also makes the occasional forced page break more workable.
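In ODF, for instance, this kind of policy lives in the paragraph style; a minimal sketch, with the style name and values purely illustrative:

    <!-- Generous widow/orphan minima push whole paragraphs over the
         page, leaving slack at the bottom of the previous page so a
         few extra words do not repaginate the document. -->
    <style:style style:name="Loose_Body" style:family="paragraph">
      <style:paragraph-properties fo:widows="3" fo:orphans="3"/>
    </style:style>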

There are mixed strategies: send a PDF as well as the XML document, and use the PDF as much as possible until editing is needed. Or send HTML as well, to discourage page-centrism.

Another strategy is to clearly separate out those pages that must not break, and treat them as artwork included from external documents. Pages containing examples of forms, in particular, are better handled as graphics when included in general documents.
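In ODF that might look roughly like the following (the name, sizes, and path are illustrative):

    <!-- The form page is frozen as artwork, so its layout cannot
         drift when the surrounding document reflows. -->
    <draw:frame draw:name="ExpenseForm" text:anchor-type="page"
                svg:width="170mm" svg:height="240mm">
      <draw:image xlink:href="Pictures/expense-form.png"/>
    </draw:frame>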

And there are procedures to adopt as well. For example, if someone sends you a document and you open it in application B, first go through all the tables and resize the text so that it breaks the same as in the PDF. Of course, this relies on your document using styles; but if you don't use styles you are probably in trouble anyway, because there are many ways to do the same thing and they may produce different results: on some systems, for example, a bold space is bigger than a plain roman space!
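Styles are what make that resizing tractable: one edit to a named definition re-breaks every cell that uses it. In ODF, for example (the style name and size are again only illustrative):

    <style:style style:name="Cell_Text" style:family="paragraph">
      <style:text-properties fo:font-size="9pt"/>
    </style:style>

    <!-- Cells carry only the style name; shrink Cell_Text once and
         every such cell follows, instead of hunting down direct
         formatting cell by cell. -->
    <text:p text:style-name="Cell_Text">Subtotal (ex. tax)</text:p>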

I remember that WordPerfect had a (patented) feature that would adjust font sizes and table borders for optimal layout. This is exactly the kind of thing that would be needed if we want better guaranteed fidelity at the line and page level between applications.

Is infidelity ever forgivable?



So remember, there are three kinds of fidelity: fidelity because the document has all the information used by the producing and receiving applications, fidelity because the applications have the same resources available to them, and fidelity because the producing and receiving applications have the same algorithms and defaults. When looking at the various claims (Len Bullard mentions Spy versus Spy) made by MS about Open XML and "fidelity", and by ODF people about "interoperability", we need to interpret them in the hard light of the Dirty Little Secret.

Governments and procurement projects need to be quite clear that whenever they insist on page fidelity, they are probably in fact locking themselves into one vendor's tools, in which case it becomes a debate on features, quality, price, training, etc. In a limited sense, everything *except* interchangeability.

4 Comments

spellcheck!
2007-04-21 06:24:02
differnt spellings of "algoriths" too!


(nice points)

orcmid
2007-04-21 08:52:42
This is a great analysis. I would quibble that your definition of fidelity is only on the technical side and needs to account for what fidelity is in the eyes of the users - authors and readers and collaborators. But the problem of presentation-preserving interchange is daunting and I'm happy to see your emphasis on it. There needs to be a tie-back to the HTML example and how there is a contextual spectrum of practical fidelities. (And even with the HTML case there are browser wars about who does it "right"!)


(I think PostScript and, presumably, XPS, have had to deal with this, and it may be via rigorous metric assumptions and font definitions, but I'm not sure. Perhaps there is something to learn from that. You're starting to have me yearn for ODA after all these years. And perhaps take another look at TeX as well.)


An even bigger issue has to do with these being fine details with somewhat chaotic consequences (like failure to rely on styles, or erratic use of styles). I'm not sure how more than a small proportion of end users are going to master this stuff and have the patience to attend to the details. (I'm thinking of your three kinds of document-productivity software; maybe we need a fourth for serious publication applications.)


Serious stuff, and I wish government IT organizations would quickly get up to speed on this before they get mousetrapped by legislative activities.

Jimbo
2007-04-24 10:24:06
Hmmm. Again an area where TeX may be of interest. It was written to give (basically) identical calculations/algorithms on different machines, tests (trip/trap) are integral to creating a version that may be called TeX, the basic (Computer Modern) fonts are controlled for stable metrics, TeX itself is now frozen for future development, etc., etc.


Meanwhile, the expense form created on a PC and opened on my Mac (in Microsoft Office) won't print on a single page. I have to ask for a PC printout and fill it out by hand.


However, ask which tool a "normal" user would like, LaTeX (TeX-style markup) vs Word ...

Rick Jelliffe
2007-04-25 04:52:49
Jimbo: Quite a few of the SGML-generation typesetting systems were based on TeX. I knew of one company in Japan that overlaid a LISP system (I think Common LISP) on top of TeX (Interleaf and ISO DSSSL were also LISP systems on top of a page model), and I think Sebastian Rahtz maintains an XML/SGML-on-TeX system. The feeling is that TeX provides a good basis for extending the typesetting capabilities, but an inadequate basis for the higher-level control and glue between XML/SGML and TeX needed for large reference publishing, hence the need for an intermediate interpreted system like LISP.


Back to the topic: in one sense, TeX is just another formatting system with its own breaking rules and defaults. So a document typeset by TeX will, unless it has unusual properties, break differently from the same document set by Word or Open Office.


One solution might indeed be for everyone to adopt TeX. Not in the sense of full TeX, but for users to pressure vendors to adopt some common breaking algorithm or hinting mechanisms, to at least minimize the problem. An expert mentioned to me the other day that "when users hear 'interoperability' they think of exact reproduction", but this has never been the promise of markup languages. We like to think "WYSIWYG is dead", but the zombie still has quite a bit of animation in the non-technical community. The danger of the blanket marketing statements (ODF's "ODF can handle all the interchange you need" and Microsoft's "Open XML can give perfect fidelity") is that they are made in a context which accepts that different applications *must* break differently, which is *not* what hopeful adopters may expect.
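Short of a shared algorithm, some hinting is possible today with plain Unicode in the markup itself: soft hyphens (U+00AD) declare the acceptable break points, and no-break spaces (U+00A0) forbid the bad ones, so applications with different dictionaries are at least choosing from the same set of breaks. A sketch, using ODF-style text markup:

    <!-- Soft hyphens constrain where any engine may break the word;
         the no-break space keeps "Part IV" on one line. -->
    <text:p>The in&#xAD;ter&#xAD;op&#xAD;er&#xAD;a&#xAD;bil&#xAD;i&#xAD;ty
    clause applies to Part&#xA0;IV only.</text:p>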


Another approach might be to specify a standard hyphenation test suite: for example, a set of the most common 3,000 English words and their breakpoints, so that the major applications can confirm that they break in a similar way. Of course, that only fixes part of the problem: it is probably the strange words that would break differently anyway, but in conjunction with settings like the length and treatment of stems and so on, it may be of some use. It would certainly provide some extra information for open source developers, without forcing proprietary vendors to open their source. Actually, this would more likely be part of a standard hyphenation dictionary rather than a test case, I suppose.
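An entry in such a dictionary might look something like this; the format is entirely hypothetical, just to show the shape of the data:

    <!-- Hypothetical shared hyphenation data: words with their
         breakpoints, plus stem settings applications could test
         themselves against. -->
    <hyphenation-dictionary xml:lang="en">
      <word break="al-go-rithm">algorithm</word>
      <word break="hy-phen-ation">hyphenation</word>
      <settings min-stem-before="2" min-stem-after="3"/>
    </hyphenation-dictionary>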