Why EMUs?

by Rick Jelliffe

Ecma 376 Office Open XML's DrawingML uses an odd measure called the EMU, short for English Metric Unit. There are 360000 EMUs per cm, 914400 EMUs per inch.

The reason for this may become clearer if I note that, using the Adobe "big point" of 72 points per inch (rather than the old 72.27), there are 12700 EMUs per point. Err, maybe not...

What about this then: 360000 and 914400 are divisible by 2, 3, 4, 5, 6, and multiples?

Still no idea? Well, representing numbers in computers is fraught with errors whenever you need fractions, or multiplication or division by numbers that are not 2^n. That can even include multiplying by 0.5. Computer scientists spent a lot of their early time investigating various techniques to overcome these problems, in a branch of mathematics (or is it engineering?) called numerical methods.

These errors are small by themselves, but when you have, for example, long sequences of calculations, such as a graphics object where each segment is positioned using the result of the last, the accumulated error can grow. In publishing, misalignment can have a serious effect when there is some kind of multi-color printing: you can get registration errors.
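By way of illustration, here is a throwaway Python sketch (mine, not anything from the spec) of how such errors accumulate: positioning 1000 segments of 0.1 inch each, each from the end of the last.

```python
# Hypothetical illustration of accumulated floating-point error:
# lay out 1000 segments of 0.1 inch, each positioned from the last.
x = 0.0
for _ in range(1000):
    x += 0.1          # 0.1 has no exact binary representation
print(x)              # not exactly 100.0 -- the error has accumulated
print(x == 100.0)     # False
```

Each individual rounding error is around 10^-17, but after a thousand dependent additions the total is visibly wrong.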

One way to circumvent the problem is to move to integer (whole-number) arithmetic: you find some convenient small measure that can be multiplied up so that you don't need to use floating-point numbers. When you do divide, you throw away the remainder, because it is below the precision you are supporting; but because the data is frequently aligned to grid positions (1/2 inch, etc.) there will be no loss of precision from data capture (what the user sees) to the internal representation. Now, armed with this perspective, let's imagine a set of criteria for a typesetting system or vector graphics system:

* use a small unit to allow implementation in integer arithmetic
* this unit should allow exact whole divisions (no remainder) of the common measures of modern English-speaking countries' typesetting: the cm, the inch, and the point. So a half inch, 10.5 points, or a third of a cm are all exact (within the bounds of the system)
* the unit should be small enough to allow non-"English" measurements with, say, 0.01% precision (or do I mean inaccuracy?): the continental diderot or the Japanese Q system for example
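The criteria above can be sketched in a few lines of Python (the helper name and interface here are my own invention, not anything from the spec): capture exact rational user measures as whole EMUs, truncating any remainder below the supported precision.

```python
# Hypothetical helpers: capture user-entered measures as exact integer EMUs.
EMU_PER_INCH = 914400
EMU_PER_CM = 360000
EMU_PER_PT = 12700

def to_emu(value_num, value_den, unit):
    """Convert an exact rational measure (num/den) to whole EMUs,
    discarding any remainder below the supported precision."""
    per_unit = {"in": EMU_PER_INCH, "cm": EMU_PER_CM, "pt": EMU_PER_PT}[unit]
    return value_num * per_unit // value_den

print(to_emu(1, 2, "in"))   # half an inch -> 457200, no loss
print(to_emu(1, 3, "cm"))   # a third of a cm -> 120000, no loss
print(to_emu(21, 2, "pt"))  # 10.5 points -> 133350, no loss
```

All subsequent arithmetic on these values stays in integers, so it is exact and machine-independent.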

If you take these kinds of criteria and work through the numbers, you get something like the EMU. They are used by Ecma 376's DrawingML for "high precision coordinates" in certain places. The rest of the time, people can use locale-dependent measures.
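For instance, a quick throwaway check (mine) that the divisibility claim holds, so the common fractional measures stay exact integers:

```python
# 914400 and 360000 divide cleanly by the small factors that matter,
# so common fractions of inches, cms, and points stay exact integers.
for d in (2, 3, 4, 5, 6, 8, 10, 12, 16, 20):
    assert 914400 % d == 0 and 360000 % d == 0

print(914400 // 2)      # half an inch     -> 457200 EMUs
print(360000 // 3)      # a third of a cm  -> 120000 EMUs
print(12700 * 21 // 2)  # 10.5 points      -> 133350 EMUs
```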

So if the EMU is a reasonable technical approach, is it a reasonable measure to appear in a standard? To my mind, this falls in exactly the same bucket as SpreadsheetML's use of numeric indexes, though there are accuracy issues as well as performance issues. I think it comes down to the purpose of the standard: when the purpose of the standard is to allow high-quality typesetting and graphics and to reflect the triggering application, I think exact numbers such as EMUs may win. However, when the purpose is to allow data interchange and human readability and writability, then using SI and locale-dependent measurements will win.

The EMU issue is also an interesting one from a standardization viewpoint: there is a kind of premise that supporting a standard (obviously the specific application-independent alternative is SVG-in-ODF in this case, but this applies to systems supporting Open XML too) involves adding functionality or adjusting superficial details (names of elements and attributes, use of property elements rather than attributes, and so on): this is, I think, the view that underlies Tim Bray's comment (from memory) "how many ways do we need to say some text is bold or italic?" However, there are other changes that go to implementation: converting to and from SVG (as it is) presumably entails forgoing exact import and export of data in the "high precision coordinate" system. The difference would be minimal, a rare pixel here or there, I'd expect.

Like the data indexes, I don't particularly know why Open XML couldn't support both the common notations as well as the optimized one. Best of both worlds. But EMUs are a rational solution to a particular set of design criteria, it seems to me; and the name "English Metric Units", which has caused alarm, seems less alarming when understood as just a descriptive name and not a reference to something external.


Josh Peters
2007-04-16 08:39:16
Why is it that so many XML dialects tend to reinvent the wheel? If SVG doesn't provide an accurate enough metric for a "high precision coordinate system" why not mix the elements of a namespace that does provide them into the output?

I'm currently unsure (because I'm not too connected when it comes to various published namespaces and schemas) of other unit-providing namespaces but there have to be some out there somewhere that could have been used in OOXML.

Oh well, I guess there are now even more ways to skin a cat.

Rick Jelliffe
2007-04-16 09:56:03
Josh: One of the welcome things about Ecma 376 (whether it makes it to an ISO standard or not) is that it does provide a really clear and detailed list of ideas and features for ODF to consider. In the short term, this exposes weaknesses in both Open XML (where there are features that people don't like) and ODF (where there are features that ODF doesn't handle well, or handles differently). But in the medium term, it is a different proposition. Now I don't exactly think Open XML is Bill Gates's love letter to the anti-trust regulators, but it is certainly some kind of backflip if you compare it to Gates' comments of 10 years ago (which I see have recently been bandied about again, as if they were current).

I think there will be three markets for office systems. One market will be the policy-driven market: this will be public organizations. The other will be the features/platform market: this will be corporate. Finally there will be the bundling market, such as for home users. Each will have a different dynamic.

Orcmid
2007-04-16 16:42:40
Oh, did you mean "date indexes" in the last paragraph? I suppose one move toward convergence would be to add time-point (date-time) and time-interval data types to the types handled by formulas, as is the case in ODF (although those were evidently introduced without consideration of how spreadsheet formulas and any implicit conversions with numerics would work).

I think it is important to recognize that Excel, and all spreadsheets conformable with it, does not have a data type for dates; it only has cell-presentation formats for dates and times. The cell values are all numeric types in this case, and rational types at that, with clock time as a fractional day. So it would be useful if ECMA-376 specified a minimum precision of calculation in interchange, something I haven't found. Nevertheless, supposing that there is a trivial transformation from the existing Excel practice to a formal date type (or a different ordinal-date mapping) is incorrect, because we don't know when an existing Excel numeric value is ultimately going to be involved in the determination of an ordinal-date number.
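[The fractional-day convention can be sketched in Python. This is my own hypothetical illustration, not the spec's wording, and it assumes the common 1900 date system while glossing over the well-known leap-year quirk for early serials.]

```python
from datetime import datetime, timedelta

# Hypothetical sketch of the spreadsheet convention: a cell value is just
# a number, with the fractional part as clock time (1 s = 1/86400 day).
# Epoch here assumes the common 1900 date system, ignoring the historical
# leap-year quirk for serials before 1900-03-01.
EPOCH = datetime(1899, 12, 30)

def serial_to_datetime(serial):
    return EPOCH + timedelta(days=serial)

def datetime_to_serial(dt):
    delta = dt - EPOCH
    return delta.days + delta.seconds / 86400

noon = datetime(2007, 4, 16, 12, 0, 0)
s = datetime_to_serial(noon)
print(s % 1)                          # 0.5 -> noon is half a day
print(serial_to_datetime(s) == noon)  # round-trips exactly: True
```

Note that 0.5 is a power of two, so noon round-trips exactly; a time like one second (1/86400) does not, which is exactly the precision question raised above.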

So the indictment for not supporting a particular ISO interchange format for dates is not apt, and ordinal dates have nothing to do with the Gregorian Calendar (which I have never seen anyone site a standard specification for either, by the way). It is the case that an ordinal-date system is used for the internal representation and the values of cells. And there is a problem about the conversion of one of the ordinal days to a Gregorian date. People who want to avoid that can use the alternative Macintosh ordinal numbers, which start after that date. (Just kidding, although that is an option.)

Still, I agree that moving toward convergence would be valuable, maybe when the ODFhideen and the OOXen are able to sit down at a table and actually work together on behalf of all of us. A lot is going to depend on how well the OpenFormula folk play the conformable-with-Excel (hence Spreadsheet ML formulas) card.

Back to your initial question: I think the proper engineering term is 0.01% *tolerance.*

Abstract measures to arbitrary precision are a new thing in the world and we have to be careful about cleaving from physical reality -- pixels on a display, marks on paper, registration of color presses and so on.

I haven't looked lately. Have you checked how PostScript handles this? Ditto for the XML Paper Specification, I imagine. I know the last time I looked at PostScript and Xerox Interpress specifications, there was a lot of attention to this problem, as well as working through geometric transformation issues too. Hmm, I wonder what TROFF and Knuth's TeX (especially Metafont) do here as well. Hmm, fonts ...

Orcmid
2007-04-16 17:07:54
"However, when the purpose is to allow data interchange and human/read and writability"

There's something off about this. I don't think humans will use EMUs any more than they will directly operate with 1/86400 of an ordinal day (1 second in the Spreadsheet ML ordinal-day representation).

I think we need to distinguish what the interface and the presentation provide from what the interchange format is. It will be clumsy at times, but these are not external representations. Consider locales not on the Gregorian calendar and, I suppose, somewhere or somewhen that has the daytime broken into different units. OK, that's a stretch, I think. I also trust that we won't rearrange the solar calendar anytime soon.

But consider different formats for writing numerals and numbers in an internationalized world. This stuff should not appear at the numerical-representation level, it seems to me.

Ordinal dates have a certain calendar agnosticism that may turn out to be the safe way.

EMUs are not so agnostic, but there we have this small problem about controlling the precision where there are material consequences. (I don't expect astronomers and astrogators with a completely different notion of time and position precision to depend on Spreadsheet XML or ODF Calc ever.)

Lighter moment: I just copped to having a problem with sound-the-same words when I type (the example was between site and sight on Flickr). I see that I have to include "cite" in that specific case, because I meant that with regard to finding a standard specification for the Gregorian calendar.

Rick Jelliffe
2007-04-16 23:30:22
Orcmid: Since ISO 8601 dates are lexically distinguishable from numbers (having multiple "-" and ":" separators, for example), I think it would be trivial to serialize and parse dates in the data format, regardless of whether Excel supports date types itself. I am not suggesting that the underlying application should change; indeed, for Open XML that would be quite cart-before-the-horse.

Similarly, inches, points, and cm can be lexically distinguished (by a suffix), so I don't see why DrawingML couldn't at least allow other measures for import without requiring any change to the Office applications or loss of precision.
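[A hypothetical sketch of such a suffix-distinguished reader, with made-up names; exact rational parsing means the conversion to integer EMUs loses nothing for the common grid-aligned values.]

```python
import re
from fractions import Fraction

# Hypothetical reader: accept a suffixed measure ("2.5cm", "0.5in",
# "10.5pt") or a bare integer EMU count, storing everything as exact
# integer EMUs.
EMU = {"in": 914400, "cm": 360000, "pt": 12700}

def parse_measure(text):
    m = re.fullmatch(r"([0-9.]+)(in|cm|pt)?", text)
    value, unit = m.group(1), m.group(2)
    if unit is None:
        return int(value)                    # already in EMUs
    return int(Fraction(value) * EMU[unit])  # exact: Fraction("2.5") = 5/2

print(parse_measure("0.5in"))   # 457200
print(parse_measure("2.5cm"))   # 900000
print(parse_measure("10.5pt"))  # 133350
print(parse_measure("914400"))  # 914400
```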

It's certainly a "rough edge" rather than a "showstopper". You never know when humans will be involved, so readability (which does not equal understandability!) is always at least a minor consideration for markup, IMHO.

Orcmid
2007-04-18 11:06:01
@Rick [In what social protocol did that convention start? I like it, though.]

OK, I can live with rough edges, especially since we are starting a long journey here, not legislating utopia now.

"I think there will be three markets for office systems. One market will be the policy-driven market: this will be public organizations. The other will be the features/platform market: this will be corporate. Finally there will be the bundling market, such as for home users. Each will have a different dynamic."

This is intriguing. I just watched Malcolm Gladwell's TED 2004 video, and I keep thinking Howard Moskowitz's insight applies here and could really enliven productivity applications.

I'd hope for great porosity (and sponginess?) and *interop* to make it work, because we'd still want interchange among those regimes to work. Standards for formats, up-down-custom-level smoothness, and so on strike me as having tremendous value in that vision. (ODF could figure out how to embrace OPC in the future rather than coming up with its own, for example.) I do think it would look quite different than the current fragmenting of Microsoft Office into fairly arbitrary packagings for pricing purposes.

Something got me thinking, and I wanted to share it back while I keep working out how to blog about it.

Rick Jelliffe
2007-04-21 03:07:42
Orcmid: I know that Patrick Durusau, the editor of ISO ODF and one of the good guys, has been going over the OpenXML spec recently with a view to seeing what the substance or nature of the differences between OpenXML and ODF are. (I'm going to blog about interoperability soon, because there is one important point that people who want to adopt ODF or allow OpenXML should be aware of, one that regularly eludes mention: Patrick mentioned it to me recently, actually.)

But ODF is a moving target: ODF 1.1 is coming soon, ODF 1.2 is in the pipeline (and is already capable of representing many more of the features of any office document processor), and it is not unlikely that OpenXML will trigger an ODF 1.3 (which might, if all the parts are added together and if it repeats type documentation like Open XML does, take 6000 pages!) which will be really strong. Of course, implementations will lag behind the OASIS spec, and the ISO rebranding is quite likely to lag behind as well. This represents a challenge for adopters of "standard" ODF: which one do they choose, if they really want interoperability? As I understand it, there will be some kind of graceful degradation, so that people who only need the feature set of ODF 1.0 and have ODF 1.0 systems can accept ODF 1.3 with no problems.

Jimbo
2007-04-24 08:19:17
Another way to skin a cat:

TeX uses a nice (easy to figure) 65536 = 2^16 "scaled points" per point. Rounding errors inbound from other dimensions are negligible, and the math after that is integer and machine-independent.
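[In the same spirit as the EMU sketches above, a throwaway illustration of TeX-style fixed point; the helper name is made up.]

```python
# TeX-style fixed point: dimensions are integer multiples of the scaled
# point, 1 pt = 2^16 sp, so arithmetic is integer and machine-independent.
SP_PER_PT = 65536

def pt_to_sp(num, den):
    """Exact rational points (num/den) to scaled points, truncating."""
    return num * SP_PER_PT // den

print(pt_to_sp(1, 1))   # 1 pt    -> 65536 sp
print(pt_to_sp(21, 2))  # 10.5 pt -> 688128 sp
print(pt_to_sp(1, 3))   # 1/3 pt  -> 21845 sp (remainder discarded)
```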

Rick Jelliffe
2007-04-27 00:36:06
Jimbo: Yes, it is a decision each developer makes, not a moral issue, and if the grain is fine enough, user interface systems can apply heuristics to reconstruct the original units ("this is XXX scaled points = 3.49999cm, therefore that must have been 3.5cm"). The more you get away from type and into the world of colour and registration and calculated paths, the more benefit comes from exact numbers. So it would be odd if typesetting properties were specified in terms of EMUs or scaled points; not appropriate or relevant to end users. But it is more appropriate if a graphics program exposes its co-ordinate system in generated markup. (Whether it should also be able to generate and accept standard units as well is another issue.)