Metrics for XML Projects #1: Element and Attribute Count

by Rick Jelliffe

Evidence-based management needs comprehensible information; metrics are distilled facts: not a bad fit.

This is the first in a series of blogs, each giving a metric that can be useful in many areas of XML project management: from verifying the suitability of adopting a particular schema, to making sure that only work and capabilities arising from business requirements are being carried out, to estimating the price variation that a schema change may entail.

Everyone using XML already uses a metric: well-formedness! Validity is also a metric. (I am simplifying away the difference between a metric and a measure in these blogs: pedants please lower your hackles!) But the existing metrics for XML on the Web are either concerned with communications and information theory, or are based on programming complexity measures, or are a little polluted by voodoo ideology about good structures and bad structures; I don't buy into the latter, at least not at the current state of knowledge. But there is a need for a good set of metrics for XML project management and scoping, and to inform XML schema governance, so I thought people might be interested in some of the metrics I have been developing and using.

They all address different, but to me vitally important, aspects of XML projects, and most are, I hope, common sense. Of course, you can make up your own metrics as well: but I think it is good to at least have a basic vocabulary of XML metrics to use or adapt or decry as appropriate.

Element and Attribute Count

This most basic and coarse metric asks the question "How many distinct element and attribute names are there?"
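As a rough sketch, such a count can be taken straight from an instance document with Python's standard library (the function name here is illustrative, and this counts the names actually used in one document rather than those declared in a schema, which would need a schema parser):

```python
import xml.etree.ElementTree as ET

def name_counts(path):
    """Return (element name count, attribute name count) for the
    distinct names actually used in one XML document."""
    elems, attrs = set(), set()
    # iterparse streams the document, so large files need not fit in memory
    for _, el in ET.iterparse(path):
        elems.add(el.tag)
        attrs.update(el.attrib)
    return len(elems), len(attrs)
```

Counting distinct names (rather than occurrences) is what makes this a vocabulary-size metric: a million paragraphs still contribute only one element name.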


2006-05-11 08:14:19
Every time I produce the schema documentation for the release of a database, I provide management with a count of the tables and fields. Our cert gal asked how to interpret that. I told her, "roughly, and don't get into reification fallacies". What she should really ask about is the number of keyed types. She asked about the growth rate. I told her it has settled at approximately 200 per release. "Isn't that decreasing?" she asks. "It used to be about 400 per release, but our coverage is very good now," I reply. She said she expected it to drop further. I told her it isn't likely to do that very fast, because if she looks at the actual values, they are in the system tables where we are handling local variations, and in the seldom-sold features that we now implement for local agencies.

IOW, the dynamism is in the exceptions now but they never quit coming.

Any database that integrates around a set of common types (eg, names, vehicles, properties and locations) can grow quite large. If monolithic, it is a sales and cert nightmare. If modular, it is a pretty stable system.

So I would want to know how many of the related tables are related simply by product bundling, or by mixed namespaces that are created because the published artifact (document, report, etc.) actually requires that.

Rick Jelliffe
2006-05-11 08:57:59
Yes Len, I agree they never quit coming. When developing the Document Complexity Metric, I sampled hundreds of documents using different DTDs. What I found, for any particular group of documents using the same DTD, was that every particular document only had about 70% of the elements (for medium sized DTDs). So sampling one or two documents was not enough to determine structure well, and in fact even sampling large numbers of documents was not enough to completely cover the number of elements.

This has several impacts: the need to be suspicious of "fixed" schemas based on limited samples of documents, the need to have a change process instead, the need for tools to analyse document sets, the need for metrics for the tools to express useful things, and the need for a rejection mechanism whereby rare elements (or elements only used because of tag abuse) can be dealt with.
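A minimal sketch of that kind of corpus analysis, assuming documents small enough for Python's standard library parser (this measures each document's coverage against the union of names seen across the whole sample; to measure against a DTD's declared vocabulary instead, substitute the declared element list for the union):

```python
import xml.etree.ElementTree as ET

def element_coverage(paths):
    """For each document in a corpus, report what fraction of the
    corpus-wide element vocabulary it actually uses, plus the
    vocabulary size itself."""
    per_doc = []
    for p in paths:
        per_doc.append({el.tag for _, el in ET.iterparse(p)})
    vocab = set().union(*per_doc)  # every element name seen anywhere
    return [len(names) / len(vocab) for names in per_doc], len(vocab)
```

If the per-document fractions cluster around 0.7, as in the sampling described above, then no single document (and possibly no feasible sample) exhibits the full vocabulary.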

As is my point here, this is a particular problem when using the kitchen sink standard schemas, which are always too big. They were designed to be subsetted.

2006-05-12 08:31:40
Agreed. We learned about this in the CALS wars.

As you know, it depends on the approach taken to the results of the sampling. If one tries to cover every contingency, it becomes the DTD/Schema from Hell. If one attempts to subset too early, it becomes the DTD that spawns competitors (which ain't all bad). If one attempts to abstract away the differences, it becomes an abstract data dictionary and too many non-local types get pushed into it. Too much modeling leads to analysis paralysis.

There is no single answer, but the rule of thumb that simpler is better seldom fails. If one acknowledges that agreements are local, precisely defines the locale, and avoids the temptation to use recruitment as a means to pre-fix the market ("We MUST get buy-in first!"), thus incurring ever-expanding mission creep, it usually goes a bit better. My best thought is to limit the use cases and the ambition. Evolution proceeds mostly by co-opting bits from near neighbors and adapting one's own uses to these.

Reciprocity and a willingness to adapt mean living longer and getting more done. Better to pick one piece and do it well.

2006-05-14 08:09:47
GMX/V - new LISA OSCAR Draft Standard for word and character counts and the general exchange of metrics within an XML vocabulary:

Hi Everyone,

LISA OSCAR's latest standard, GMX/V (Global Information management Metrics eXchange - Volume), has been approved and is going through its final public comment phase. GMX/V tackles the issue of word and character counts and how to exchange localization volume information via an XML vocabulary. GMX/V finally provides a verifiable industry standard for word and character counts. GMX/V mandates XLIFF as the canonical form for word and character counts.

GMX/V can be viewed at the following location:

Localization tool providers have been consulted and have contributed to this standard. We would appreciate your views and comments on GMX/V.



Rick Jelliffe
2006-05-14 09:16:17
Thanks for the heads-up on that, AZ. Lay readers may not be aware of the extent to which automated translation systems are used, in particular the success and penetration of translation memory systems. I am not at all surprised that a forward-thinking consortium like LISA, bringing together lean-and-hungry competitors whose cream largely comes from quality improvement, should be a leader here too.

GMX/V has a name only a tech writer could love, but it seems completely serviceable for any industry needing a simple metrics reporting framework which supports phases (e.g., workflows, processes, etc.) Well done!

2006-05-14 12:51:49
Many thanks for your feedback, Rick. I especially liked your description of GMX/V as a name only a tech writer could love. I will add this quote to my presentations on the subject!