Native Language Markup Issues and Open XML

by Rick Jelliffe

Just when I thought I had escaped, I had a request yesterday from Microsoft to join a call with a journalist from ZDNET Asia about a blog post, "An open document standard for China". Preparing for this gave me a good chance to review the use of Native Language Markup in Open XML: the area is quite arcane, so it is a good topic for a blog (good because you probably won't get the information elsewhere, and good because your feedback can help if I have missed something). I have included some asides and personal background, probably not even of interest to my mother, in small print that you can skip.


The Peter Junge blog basically warms up Rob Weir's Swiss cheese (hmm, something wrong with that phrase): impossible in its thrust (a single file format that can cope with all cases?), alarmist in general ("This kind of legacy is full of pitfalls for the open source developer."), over-reaching in its analogies (see my Power plugs and low-hanging fruit), too strong in its conclusions (look at how "may" and "might" are used to say "will") and misleading in its use of details (what does footnoteLayoutLikeWW8 (Emulate Word 6.x/95/97 Footnote Placement, etc.) have to do with open-source developers in particular, especially since the spec gives the advice that "Typically, applications shall not perform this compatibility"? It is a flag, not a requirement, for goodness sake.)


Native Language Markup



Native Language Markup is the use, in markup, of names and symbols from the user's native language. This implies the use of the user's native script (characters). It is different from "natural language" because names in markup may still have artificial limitations (such as no spaces or apostrophes) or use contracted forms that would not appear in natural language.

Native Language Markup was a term I developed in the early 1990s, when Allette Systems gave me a project to figure out why SGML was not popular in Asian countries. I came back with various items and collected them into the ERCS (Extended Reference Concrete Syntax): these included things like allowing native characters in tag names (SGML had large character-set limitations for names then), hexadecimal numeric character references, the ability to reference any character by its Unicode number, and an initial set of the characters in Unicode that were suitable for use in markup. These were endorsed by a standards-related expert group, the CJK DOCP group, and when XML development started, were adopted into XML. This was recognized by a kind comment of Gavin Nicol in the 1999 Journal of Markup Theory and Practice: "The importance of native language markup, and the role the SGML declaration plays in an SGML system, are fairly well understood these days, partly due to the tireless efforts of Rick Jelliffe on the ERCS, and partly due to a lot of work done on HTML I18N (Internationalization)." Now, of course, I am not saying that I invented the idea that words you can read are more useful than words you cannot read! ERCS was a set of concrete technical proposals, and Native Language Markup is a name for the issue. Anyway, the bottom line is that this is a subject that I think is really important.


Native language markup has proved itself. Murata Makoto demonstrated at a conference last year how Japanese government XML was using it, and China's UOF format uses it too. It is not just an issue of translation: many languages have terms which do not have a satisfactory English equivalent. Nor is transliteration a useful approach: many languages require a romanization system with accents or tone marks to be useful. A technology that does not allow non-ASCII characters imposes a burden on non-ASCII users and limits acceptance to the highly educated and foreign-literate.
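
As a quick illustration (the element names below are my own invented example, not from any real schema), an off-the-shelf XML parser accepts non-ASCII names with no special configuration:

    # Minimal sketch: XML 1.0 permits non-ASCII element names, so native
    # language markup parses out of the box. The names here are invented.
    import xml.etree.ElementTree as ET

    doc = '<?xml version="1.0" encoding="UTF-8"?><公文><标题>年度报告</标题></公文>'
    root = ET.fromstring(doc.encode("utf-8"))  # bytes plus in-band encoding declaration

    print(root.tag)                # 公文
    print(root.find("标题").text)  # 年度报告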

However, native language markup becomes inappropriate whenever there is a cross-over between language groups. Most Australians stop learning new characters at about the age of 5 or 6; Chinese-language markup is not easy for us! So for international standards with fixed schemas, there is no practical alternative to adopting ASCII and English wording.

ISO/IEC JTC1 SC34 has recognized this: as part of the ISO/IEC 19757 Document Schema Definition Languages (DSDL) standard, there is a technology spearheaded by the UK's Martin Bryan called the Document Schema Renaming Language (DSRL, or "Dis-rule"). This is a convenient language (Martin has an XSLT implementation) that allows conversion of the markup in documents (or schemas, potentially) to and from different languages (as well as other uses). Non-ASCII-using nations looking at adopting ISO standards should look at whether they should also adopt a DSRL mapping into native forms. Developers could then work on a document using native language markup and convert it to the ISO standard form before shipping, for example; or, within a country, the localized form could be used internally and translated to the international form for exchange. My belief is that DSRL should become a standard part of the XML processing chain, because it addresses all sorts of versioning and localization issues.
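
To make the renaming idea concrete, here is a toy sketch in Python. It is emphatically not DSRL itself (DSRL has its own XML vocabulary, and Martin's implementation is XSLT); the element names and the mapping are invented purely for illustration:

    # Toy illustration of the renaming idea behind DSRL (not DSRL syntax):
    # map localized element names to the standard English forms before shipping.
    import xml.etree.ElementTree as ET

    RENAME_MAP = {"公文": "document", "标题": "title"}  # invented mapping

    def rename(elem, mapping):
        """Recursively replace element names according to the mapping."""
        elem.tag = mapping.get(elem.tag, elem.tag)
        for child in elem:
            rename(child, mapping)

    root = ET.fromstring("<公文><标题>年度报告</标题></公文>")
    rename(root, RENAME_MAP)
    print(ET.tostring(root, encoding="unicode"))
    # <document><title>年度报告</title></document>

The same table, applied in reverse, takes a shipped document back into the localized form for local editing.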

The evolution of standards



Character sets have posed a big problem for standards makers.


  • In the 1960s/1970s generation of technologies, 7-bit character sets were used: the ASCII/EBCDIC generation. Technology standards rooted in the 60s had to cope with 7-bit data transmission.


  • In the 1970s/1980s generation, communications systems moved from 7-bit to 8-bit clean systems. Typically with this generation, and under the influence of the C programming language, systems adopted a byte mentality rather than a character mentality: a string was a sequence of bytes. Standards from this period naturally followed. However, because international data exchange was not important, the standards from this time paid no attention to identifying which character encoding was in use.


  • In the 1980s/1990s generation, attempts were made to extend the existing systems to cope with extra characters. This involved adding or overloading character-escape mechanisms to allow references to characters in the local character set, or variable-width character encodings which are ASCII-compatible for single bytes but allow multiple bytes for non-ASCII characters: UTF-8, Big5 and Shift-JIS are examples of standards that reflect this. These fitted into the constraints of 7-bit and 8-bit clean systems. However, the standards infrastructure was aimed at localization, not internationalization: the advent of the PC initially retarded the reach of the internet, but with the advent of the WWW there was suddenly a world-wide data incompatibility problem, because the standards and systems did not adequately provide a way for resources to say which character set was used. An example of this was HTML forms: for a long time, there was no definition of which character encoding should be used when sending form data. In the standards world of the time, there was a real split between the internationalists, who said that everyone should adopt Unicode, and the nationalists, who said that every country should adopt locally-optimized formats.


  • The 1990s/2000s generation is the XML generation. It has been recognized that internationalization needs to be pervasive (standards should have first-class support), systematic (based on Unicode), and friendly (allow people to be conservative in what they send but generous in what they receive). We defined XML in terms of Unicode characters, but allowed the user to use any encoding they wished: this was safe for data, because XML allows character references in terms of Unicode character numbers, and because the XML encoding header provided an effective in-band way to make sure that the character encoding of a document could be maintained (see the short sketch after this list). This effectively satisfied the requirements of both the nationalists and the internationalists. Another example of this approach is the XML approach to URLs: in this case we deliberately went against the standard URL syntax to allow non-ASCII characters in system identifiers and namespace names, because native language markup is more important than compliance with that standard (or, at least, because conversion to ASCII-only transfer syntaxes should be a library function, not a document-writer's job). Sometimes the existing standards are sub-optimal and have to be ignored, even by other standards!


  • One of the last links in the chain for documents came with the much-delayed release of the IRI specification. This officially standardized the system that XML had adopted (and that the address bar of browsers naturally had been using) of extending URLs to allow any characters. Protocols that used URLs would still use percent-encoded ASCII, and address bars would still display any character, but with IRIs it becomes easier for the standards world to specify exactly what is needed. Nevertheless, the terminology "IRI" has not become common yet, with the result that people often say URL when they actually mean IRI, and with the subsequent result that standards drafters sometimes write URL when they mean IRI.
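
To make the character-reference and encoding-header points in the XML item above concrete, here is a minimal sketch (the two characters are my own example): the same text can travel as raw UTF-8 bytes under an in-band encoding declaration, or as pure ASCII numeric character references to the Unicode code points; a parser that supports Big5 or Shift-JIS could equally accept those encodings when they are declared in the header.

    # Two ways of carrying the same characters; both parse to identical text.
    import xml.etree.ElementTree as ET

    utf8_doc    = '<?xml version="1.0" encoding="UTF-8"?><p>中文</p>'.encode("utf-8")
    charref_doc = b'<p>&#x4E2D;&#x6587;</p>'   # ASCII only, referencing the code points

    print(ET.fromstring(utf8_doc).text)      # 中文
    print(ET.fromstring(charref_doc).text)   # 中文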



Native Language Markup in Open XML



The Open XML schemas use ASCII and English wording. Anything else would be rejected at ISO, of course. Data values, for content and attribute values, allow non-English characters. Typically this is formalized so that things that may appear on user interfaces (such as style names) have both a print name and an internal identifier: this allows documents to be localized as far as their user-interface information goes, but international as far as their internal identifiers go, which is good for off-shore document processing, for example.

Formulas in spreadsheets are an interesting area. In order to be user-friendly, the function names of course need to be meaningful to users. However, a standard cannot contain every language variant (I am told that Word 2007 has over 100 different localized versions). So Open XML takes the view that this is the application's responsibility: the Spanish version of a spreadsheet can present a formula to the user using Spanish words, for example, but the markup is generated with the common form. (This is a respectable option: along the same lines, international standards usually handle dates in ISO 8601 format rather than localized forms.)
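
Here is a toy sketch of that division of labour; the Spanish display names (SUMA, PROMEDIO) are my own assumption for illustration, and a real application would use a proper formula tokenizer rather than a regular expression. The point is that the stored markup always carries the common form, and only the user interface maps names back and forth:

    # Sketch: localize function names for display, store the common form in markup.
    import re

    DISPLAY_ES = {"SUM": "SUMA", "AVERAGE": "PROMEDIO"}   # assumed display names
    STORED     = {v: k for k, v in DISPLAY_ES.items()}

    def map_functions(formula, names):
        """Swap function names that appear before '('; toy tokenizing only."""
        return re.sub(r"[A-Z]+(?=\()", lambda m: names.get(m.group(0), m.group(0)), formula)

    stored = "SUM(A1:A10)+AVERAGE(B1:B10)"      # what goes into the file
    shown  = map_functions(stored, DISPLAY_ES)  # SUMA(A1:A10)+PROMEDIO(B1:B10)
    print(shown)
    print(map_functions(shown, STORED))         # round-trips back to the stored form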

IRIs



One area that deserves special attention, because it is so intricate, is the availability of IRIs in Open XML. This is an issue that has received a bit of attention, and it was the issue in Peter Junge's blog that triggered this post. The bottom line is that in the current draft text you can use any character for a relative IRI inside the package or to your file system (relative references), but for external references the current spec says the markup should use URL syntax.

Note that this does not mean that a URL on a user interface cannot use Chinese characters. Nor does it mean that Chinese characters cannot be percent encoded into a URL. This is an issue of Native Language Markup, at the software developer level.
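
To see what "percent encoded" means in practice, here is a minimal sketch (the host name and path are invented): the non-ASCII characters are encoded as UTF-8 and then percent-escaped, giving an ASCII-only URL that corresponds to the original IRI, and the mapping is reversible.

    # Sketch: mapping an IRI path with Chinese characters to percent-encoded URL syntax.
    from urllib.parse import quote, unquote

    iri_path = "/文档/报告.xml"              # invented path
    url_path = quote(iri_path)               # UTF-8 encode, then percent-escape ('/' stays safe)
    print("http://example.org" + url_path)   # http://example.org/%E6%96%87%E6%A1%A3/%E6%8A%A5%E5%91%8A.xml
    print(unquote(url_path))                 # back to /文档/报告.xml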

I suspect this is just a drafting error, and I am pretty certain that JIS (Japanese Industrial Standards) at least will call for its correction in the final text. It is a strong enough requirement to force a "conditional yes" vote. The current datatypes use anyURI.

There are more details on Open Packaging Convention below.

Let's look at what Peter Junge's blog said:
Another standard that Microsoft does not support, is the RFC 3987 specification, which defines UTF-8 capable Internet addresses. Consequently, OOXML does not support the use of Chinese characters within a Web address.


It is a textbook example of what is wrong with so much of the anti-Open XML material.

Let's look at the first sentence. Now, RFC 3987 is the IRI spec (which was co-authored by Michel Suignard of Microsoft). If you look at DIS 29500, Part 1, Annex A, Resolving Unicode Strings to Part Names, it has clauses such as "Creating an IRI from a Unicode string" and "Creating a URL from an IRI". If you look at Part 1 Section 8.2.1, that annex is invoked. So the simple statement that Microsoft does not support it is incorrect. (The explanation of IRIs is technically pretty garbled, but it is not easy to express in a single phrase.) The second sentence is incorrect too: you can have Chinese characters in a web address, as long as they use URL syntax and are percent-encoded.

Now there may be some way to weasel-word this, so that what it really says is:
Another standard that (DIS 29500) Microsoft does not support (in one case), is the RFC 3987 specification, which defines UTF-8 capable Internet addresses (internationalized WWW resource identifiers which map to ASCII-based standard URLs using percent-encoded UTF-8). Consequently, (draft) OOXML does not support the (indirect) use of Chinese characters within an (external) Web address (in markup).


But an ordinary reader of a blog simply is not technically equipped to understand this. How anyone could write this if they had ever read the draft is beyond me. I mean that seriously. If you are writing a blog and making comments about IRIs, how simple is it to download Part 2 of the spec, open it up in Acrobat or your PDF reader, and search for IRI?

Chinese Native Language Markup



I think the main trouble with Peter Junge's blog comes from a misunderstanding of the ISO process and the position of voluntary standards. I don't think he knows what a standard is.

When he says "I hope China will not support OOXML in its ISO voting, but force Microsoft to consider talks for one harmonized office document standard for the whole world", it sounds nice and tough, but the ISO process is not geared to that kind of win/lose approach. In the ISO situation, when you find an error that can be fixed (such as this IRI mistake), you don't throw the whole thing out: you point out the problem, propose a fix, and work together. Just because Open XML gets added to the library of voluntary standards at ISO, it does not mean that the Chinese national body is thereby forced to adopt Open XML in preference to UOF in any circumstance. Chinese businesses will be sending and receiving documents from overseas in formats outside the control of the standards bodies, and governments have little interest in making arbitrary restrictions on world trade nowadays; it is better for that data to be in a standard format than a non-standard one. Non-Chinese countries are not going to adopt UOF, but they will still produce and receive documents: I am sure that the UOF people are entirely aware of this.

What the current generation of document standards (ODF, Open XML, UOF) does is expose all the different functionalities required. This is a great pre-requisite for getting Chinese and other requirements publicized.

Now the area of East Asian native language markup is one that is particularly important to me. I started off in SGML while working in Japan, and I had a lot of contact with really wonderful East Asian experts because of my involvement in ERCS and CJK DOCP, and because I ran the "Chinese XML Now!" project at Academia Sinica, Taipei in 1999/2000. This was a project (academic/practical, *not* political!!!) to try to work through issues relating to XML and Chinese. (The Chinese XML Now page is now old, and I hope there are much better sites now, but it did have a few million real hits as far as I could work out.) Schematron, now an ISO standard, came out of this work, because I wanted to develop a schema language that did not depend on tokenized grammar rules, since lay Chinese speakers understand their language in terms of characters, not words per se.


One part of this project was for me to represent Academia Sinica (*not* Taiwan) at various non-national-level standards groups. One outcome was that in the XML Schema Working Group (representing Academia Sinica) I championed and suggested the name for anyURI: to allow better native language markup than URIs; the IRI standard was not available then.


ASCC's reason for hiring me was a little shocking: my boss, a really incisive and surprising man, told me that Westerners on standards organizations do not listen to Asians (from Asia articulating Asian-only requirements), and so they wanted me to advocate for them (and for Chinese language requirements in general) because a white person would be more acceptable.


Now this is not so much a claim of personal racism at all: it is partly due to language barriers, partly due to time-zone and travel problems, partly due to the difficulty that people from respect-based cultures have in contention-based committee systems, the problem that people from seniority-based cultures have in expert committees, the problem that people from face-based cultures have in ad hoc discussions, and also the difficulty of getting up to speed with issues and procedures as a newcomer.


In the ISO SC34 committee on Document Description and Processing Languages, there has been an effort to schedule meetings in Asia (Korea last year), issues have to be tabled six weeks before meetings in order to prevent surprises and give people a chance to translate and discuss, and in general votes on important issues are not taken on the same day they are proposed, in order to allow consultation with national technical committees in different time zones. But nevertheless, learning how to operate effectively in committees dominated by Western-style relationships is a real difficulty. (From my recent trip to India, this clearly doesn't apply to Indians! So I mean "Asian" in the Australian sense of mainly East and South East Asians, not in the UK sense of "Indic".)


These kinds of things are good, yet there will, in my opinion, always be a difficulty there. Of course, Westerners will learn how to interact with Asians better, and Asians will learn how to participate with Westerners better. But it is up to the nations that use a particular language or script to work out their requirements and communicate them effectively. UOF is at minimum a good exemplar of this. The Japanese kinsoku rules are perhaps another example.


Open Packaging Conventions, ZIP and IRIs



Open XML Part 2, Open Packaging Conventions, sets out all the details of packaging in Open XML: the profile of ZIP to use, the part-referencing system, digital signatures and so on. It is the part that deals with URLs, IRIs and so on.

Now here it gets a little complicated. The ZIP technology is not standardized. What difference does that make in practice? It is a reasonable question. The difference is that in a standard, you get proactive in areas such as internationalization and accessibility. Proprietary formats often leave internationalization issues unspecified for as long as possible: it sits at the bottom of the list of work items. Now the difficulty with both internationalization and accessibility is that you cannot just add them on casually, as an afterthought. They can be quite disruptive, and consequently take time to get buy-in.

Just as the RFC for IRIs didn't actually come out until January 2005, the ZIP specification didn't specifically sort out using UTF-8 for filenames until 2006-09-29, according to the release notes. That is only one month before Ecma 376 was released. This shows the dilemma for standards: now that we are a year later, how should we trade off the need for Native Language Markup on the one hand (now that the ZIP spec has a way to support it) against the need for interoperability with actual ZIP libraries in the real world (and therefore fit in with major platforms and not require users to upgrade or switch libraries unnecessarily)? Document standards are full of these kinds of trade-offs, and reasonable people may differ.

So even though I said earlier that I think this is probably a drafting slip, there is also the real issue of compatibility with ZIP implementations to consider. It may have been the pragmatic choice in 2005, or whenever the OPC work was done, but probably not now.

My own opinion is that first I would like some objective evidence: how do libzip, Java and .NET handle UTF-8 and other non-ASCII part names in ZIP archives? (I believe Java is cool with them; I haven't looked at the rest.) But unless there is some major lack of support, then I think it is a no-brainer to fix it.
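
As one extra data point (not the libraries I listed), here is a quick check with Python's zipfile module, assuming a version recent enough to support the ZIP UTF-8 name flag: non-ASCII entry names round-trip, because the library sets that flag when a name is not plain ASCII.

    # Quick check: write and read back a ZIP entry with a non-ASCII (invented) name.
    import io, zipfile

    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("文档/报告.xml", "<报告/>")

    with zipfile.ZipFile(buf) as zf:
        print(zf.namelist())                             # ['文档/报告.xml']
        print(zf.read("文档/报告.xml").decode("utf-8"))  # <报告/>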

So when we look at the treatment of IRIs in Open XML, it needs to be against the background that URLs don't allow non-ASCII characters directly and that (at the time of drafting) ZIP did not have an adequate system either. So the choice of translating the IRI in markup to a URL-like percent-encoded syntax for filenames was reasonable.

The difficulty with that is that it leaves it up to the user interface to translate back into the non-ASCII characters. Now that does work (e.g. web browsers) so don't imagine it is a showstopper, but it does add an inconvenience.

Now if I were to predict where a possible problem might be, it is that there may be ZIP libraries which are still in the 70s/80s stage mentioned above: allowing arbitrary bytes in file names and not saying what encoding is used. The effect of this is that if a ZIP library saves filenames using the locale character set rather than UTF-8, the filenames will be garbled on systems with other locale character sets.
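
Here is a small demonstration of that failure mode (the file name is invented): if a writer stores a name in its locale encoding, say GBK, without flagging it, a reader that guesses a different encoding produces mojibake, and nothing in the archive says which guess is right.

    # Sketch of the garbling: locale-encoded bytes, decoded under another assumption.
    name = "报告.xml"                 # invented file name
    stored = name.encode("gbk")       # what a locale-bound writer might put in the archive
    print(stored.decode("cp437"))     # a reader assuming CP437 shows gibberish
    print(stored.decode("gbk"))       # only the original locale's guess recovers 报告.xml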

So I think the OPC needs improvement in this area. Hurrah for the standards process! OPC should allow IRIs for external references as part of acceptance of the standard. But I think OPC should also support UTF-8 part names in ZIP packages rather than requiring percent-encoding only: I don't know whether this should be arranged as part of the current ballot process, in maintenance, or in a subsequent version.

5 Comments


2007-08-10 03:50:42
ZIP does support UTF-8. And ZIP is not ISO standardised. ECMA re-specified ZIP as excluding UTF-8, and there was no need to do that.


ECMA 376 faces so many problems, it should be withdrawn and retabled. Retabled in portions. Why not first standardise just the container. Then the word markup etc.


And then we need full disclosure of all the parts that are not standardised.


It must be possible to share as much with other formats as possible.

Rick Jelliffe
2007-08-10 09:23:09
Anonymous: ZIP did not officially support UTF-8 in time for the Ecma draft, according to the changelog. A profile is not a re-specification.


What is your objective evidence that Ecma 376 faces many problems? Not hearsay and edge cases?


And which parts do you mean? The deprecated parts?

Andre
2007-08-12 05:40:59
Rick,


you claimed that National chapters do not need to investigate issues that only affect Asian languages. That contradicts ISO policies! A key slogan is "Global Relevance". These CJK problems exclude Australian competitors to serve CJK markets.


See:
http://www.noooxml.org/global-relevance
and the WTO TBT Agreement.


//Andre


--- What is your objective evidence that Ecma 376 faces many problems?


I think it's consensus in the standards community. ECMA messed it up.
http://www.noooxml.org/arguments
Most of these are real confirmed bugs. It is technically irrelevant to talk about potential problems of other standards.

Rick Jelliffe
2007-08-12 07:58:50
Andre: I think you have misunderstood what "Global Relevance" means in that ISO Bulletin (written to "stimulate discussion", by the way). It means that you should not have local variations on standards (weather permitting).


If we lived in a world where every national body had enough experts to go through the nation's issues, and enough experts left over with enough expertise in CJK internationalization that they felt confident enough to have an opinion over the national bodies of China, Korea and Japan, then by all means they can go ahead.


Now if the Chinese, Japanese or Korean national bodies want to know more details of the bug in Word 95 for spacing full-width characters, then I would certainly support them in any meetings: they have the primary interest and therefore the responsibility to decide whether more details of an old bug advance the usefulness of DIS 29500 in any way. But I certainly have absolutely no faith in the expertise of people who have never even worked in CJK text processing and markup.


(Indeed, it may be that "Global Relevance" is actually a reason against having details of the autoSpaceLikeWord95 fullwidth spacing problem: if, say, it is only relevant to Japan. Presumably it is relevant to Korea too. However, the other aspect is of course that at the time of Word 95 there were seven different forks of Word for different script-regions. I am not certain which fork had this issue.)

Rick Jelliffe
2007-08-12 08:04:12
Andre: I meant objective evidence. Lots of people saying the same thing is not the same as evidence, especially lots of parrots. Where is the evidence of people actually having difficulties implementing DIS 29500? We now have multiple implementations, so if there were all these problems they would have come out. Apple implemented SpreadsheetML from the draft (and the normal reverse engineering, I would hope), for example.


Now editorial issues are myriad, but a lot of work will be done on that, following the Ballot Resolution Meeting.