The Exception Question: Schematron and Document Engineering

by Rick Jelliffe

Let's imagine that we are transitioning into a "Document Engineering" style of architecture, so that we can model our entire business using old but not-as-outmoded-as-we-first-thought Data Flow Diagrams. At each data flow we need to ask the Exception Question: does an exceptional document need human intervention, or can it be dealt with automatically? Indeed, the expected answer to this question is probably what distinguishes the document community from the database community: the docheads would expect exceptions to be dealt with by humans who can monitor, fix and reset the production flow at all stages; the dataheads would expect exceptions to be dealt with by automated processes, since human involvement sits at the input/output periphery of their systems.

Obviously, the most "exceptional" kind of document is the invalid-against-a-schema document. However, Schematron allows a much milder (or tougher, depending on how one looks at it) bar: the presence or absence of any arbitrary pattern in the document can mark it as exceptional. (Schematron not only defines valid/invalid, it also allows complex dynamic diagnostic messages, and it allows various flags to be set by assertions that fail.)
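
To make that concrete, here is a minimal sketch of such a pattern (the invoice, total and purchase-order-ref names are invented for the example): the assertion states a domain rule rather than a grammar rule, carries a flag that later stages can test, and points to a diagnostic that builds a human-readable message out of the document itself.

    <sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt">
      <sch:pattern id="invoice-checks">
        <sch:rule context="invoice">
          <!-- A domain rule, not a grammar rule: big invoices need a purchase order -->
          <sch:assert test="number(total) &lt;= 10000 or purchase-order-ref"
                      flag="needs-human-review"
                      diagnostics="po-missing">
            An invoice over 10,000 must name a purchase order.
          </sch:assert>
        </sch:rule>
      </sch:pattern>
      <sch:diagnostics>
        <!-- A dynamic diagnostic, assembled from the document itself -->
        <sch:diagnostic id="po-missing">
          This invoice totals <sch:value-of select="total"/> and names no
          purchase order; route it to accounts for manual approval.
        </sch:diagnostic>
      </sch:diagnostics>
    </sch:schema>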

So the Exception Question then becomes a criterion for evaluating schema or constraint languages: when exceptional documents are to be sent to humans for intervention, does schema language A provide clear enough information to be usable by those humans? Similarly, when exceptional documents are to be sent to software (services) for intervention, does schema language A provide clear enough information to be usable by that software? Looked at in those terms, grammar-based systems do not shine. Grammar-based systems excel in all-or-nothing Great-Wall-of-China exclusion uses, but then leave the users (systems and humans) at the mercy of the validator-developer, who of course has absolutely no idea of the problem domain of the schema, for whatever kinds of feedback and information are possible. XSD is perhaps a little more organized in this regard than the other schema languages, because it defines a specific list of outcomes that can be found in the notional Post-Schema-Validation Infoset (PSVI) after validation.
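
Schematron's own answer to that question is SVRL, the validation report vocabulary sketched in an annex of ISO Schematron: a validator can emit it as plain XML, so a downstream service needs nothing more than the flag to decide where to route the document, while a human gets the assertion text and the diagnostic. Continuing the invented invoice sketch above, a failed assertion surfaces roughly like this:

    <svrl:schematron-output xmlns:svrl="http://purl.oclc.org/dsdl/svrl">
      <svrl:active-pattern id="invoice-checks"/>
      <svrl:fired-rule context="invoice"/>
      <svrl:failed-assert test="number(total) &lt;= 10000 or purchase-order-ref"
                          location="/invoice[1]" flag="needs-human-review">
        <svrl:diagnostic-reference diagnostic="po-missing">
          This invoice totals 12500 and names no purchase order; route it to
          accounts for manual approval.
        </svrl:diagnostic-reference>
        <svrl:text>An invoice over 10,000 must name a purchase order.</svrl:text>
      </svrl:failed-assert>
    </svrl:schematron-output>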

But the trouble is that, whether for humans or systems, the more that problems are diagnosed in generic terms (i.e. in terms of the markup) rather than in domain terms (i.e. in terms of the intended patterns, or dare I say semantics), the less chance the diagnostic can serve any practical purpose for downstream systems. Notoriously this is true for systems which "hide the markup" from the user: the grammatical errors are unavailable and incomprehensible to the users. Grammars have shown themselves over the last 20 years to make programmers more productive but to stupefy end-users: the traceability issue I raised this week on XML-DEV in response to one of Roger Costello's excellent fishing expeditions is another head of the same Hydra.

4 Comments

len
2007-07-08 03:41:09
The problem for the dataheads is the blithe assumption, in cutting and pasting from the spec requirements, that the system as required matches the system as practiced, such that the document received is actually mappable to the system built. The problem for the document heads is the blithe assumption that anyone is actually reading the documents and comparing them to the real processes or products until and unless a damning exception occurs that can't be ignored or covered up.


Are people buying iPhones to improve their information or their processes? Will they speed up their access to more trivia or will they write the cure for cancer?


Until you look at the use for the content, it makes no difference that the system supporting it can find mistakes in more detail or more quickly. In fact, as Jobs shows, it is better that they don't look too closely until it fails; then they can send it back and receive a refurbished model. Maybe the best idea is to accept the document as good until a catastrophe occurs, then ship it back to the source with a diagnostic that says "Fix this!" and let the source figure it out.

Rick Jelliffe
2007-07-08 06:30:52
Len: That is a very interesting point. The Document Engineering approach views systems as a set of defined, system-neutral interfaces (documents) between processes, and sees validation at those interfaces as an essential way of verifying system functionality (though, of course, simple schema-style validation is hardly the only choice here: where there is a protocol, extended kinds of multi-document validators are possible.) But the process-centric or functional view still holds sway and still has value: must we (organizations) choose one or the other?
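
By extended multi-document validation I mean checks along the lines of this little sketch (the file layout and element names are invented for the purpose, and the xslt query binding is assumed so that document() is available), where a reference in one interface document is required to resolve to a target in another:

    <sch:pattern xmlns:sch="http://purl.oclc.org/dsdl/schematron" id="cross-document-checks">
      <sch:rule context="invoice/purchase-order-ref">
        <!-- Follow the reference out to a second document and check it resolves -->
        <sch:assert test="document(concat('purchase-orders/', ., '.xml'))/purchase-order">
          Every purchase-order reference in an invoice should resolve to an
          actual purchase-order document.
        </sch:assert>
      </sch:rule>
    </sch:pattern>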


As far as the iPhone being an example of anything, that is something for Americans to figure out. We don't have them here.

len
2007-07-09 07:15:30
They're coming on the next boat. I was just given my first decent laptop and a cellphone so I've a long ways to go to catch up. I am enjoying the laptop. Wow! I can take work home. Whadda world...


I'm unconvinced that system interface neutrality is not a contradiction in terms. The fact of a 'system' implies local organization or 'meta' organization, which implies rules or constraints that are external to the interface. Therefore, BizTalk/danceSing-to-same-sheet-of-music, where entrances and exits may be ad hoc but there is a goal to the ensemble. Those goals are documented somewhere, but not in the interfaces. So exactly what does 'system' mean? One might compare this to the design of a database where the tables are normalized and have named columns that make sense to the database designer, but when the screen designer goes to work, he/she discovers the tables don't make semantic sense to them and the data dictionary isn't a source of clarity.


Schema-based tests are clearly not enough. Co-occurrence constraints are more effective, the assertion being that, given that goal set and some time-organized or conditional use of the interfaces, co-located constraints emerge and can be identified. I don't think one can work without a process view and actually create anything resembling a 'system'.
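
Something like this Schematron sketch is the kind of co-occurrence rule I have in mind (the order, delivery and shipping-address names are made up):

    <sch:pattern xmlns:sch="http://purl.oclc.org/dsdl/schematron" id="order-co-occurrence">
      <sch:rule context="order">
        <!-- Two fields that only make sense in certain combinations -->
        <sch:assert test="not(delivery = 'pickup') or not(shipping-address)">
          An order collected in person should not also carry a shipping address.
        </sch:assert>
      </sch:rule>
    </sch:pattern>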

Dave S.
2007-07-17 16:55:32
The conclusion I've come to after years of observation is that there is no such thing as data.


Everything is an executable. Every Word file. Every Postscript file. Every database. Sometimes a person is part of a hybrid processor, but handling data is indistinguishable from an OS handling programs - the appearance of difference is a bias on the part of the observer.


So, exception handling at the processor level, the application level, the compiler level, or an end user trying to send an e-mail is almost indistinguishable except for that bias.


The converse is also true - that it's all data. Applications are data for an OS, an OS is just data to the processor.


Either way - everything looks the same.


The better the planning and the more control over the system the more predictable the outcome. Less planning and control, less predictability.


If Schematron works, terrific. One way or the other - Good luck to us all.