A general theory of markup: attack of the fuzzies

by Rick Jelliffe

How delightful it is to be confounded, when something that shouldn't work does work well, and when we are forced to admit the world is not boring, predictable and under control. The success of XML is such a thing, as was the success of HTML. I have come up with a new little theory about the success; maybe it is not new, I forget so much: undoubtedly it is obvious to someone else and has been raised and pooh-poohed before.

Let's imagine we take a database (a set of facts), a set of human comments on those facts, and a set of metadata about the whole thing. We'll call that our data. Now let's categorise the relation between one information item and another as:

  • intimate

  • strong

  • moderately strong

  • connected

  • similar

Such categorization is in addition to the labels on the data. Even though these relations are very general, they are enough for a human to clarify a lot of semantics based on the labels. They are not nearly as clear as "has a" or "is a" relations, or grouping as bags or sets, or labeling as about or description or topic. Much fuzzier. Nothing like what goes on with relational data or RDF.

But those categories are just what XML markup provides. An attribute suggests an intimate relationship. A child is strongly related to its parent. A successor element is moderately strongly related to its predecessor. A referenced or linked-to element is connected to the element that references it. Two information items with the same name have some kind of semantic similarity, which increases as their shared context grows. A programmer uses these to glean the meaning of the markup.
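The mapping above can be made concrete with a small sketch. The document and its names below are hypothetical illustrations, using only Python's standard-library `xml.etree.ElementTree`, of how each of the five relations can be read straight off an XML instance:

```python
import xml.etree.ElementTree as ET

doc = """
<library>
  <book id="b1" lang="en">          <!-- attributes: intimate to their element -->
    <title>Fuzzy Markup</title>     <!-- child: strongly related to its parent -->
    <author>R. Jelliffe</author>    <!-- successor sibling: moderately strong -->
    <related ref="b2"/>             <!-- reference: connected -->
  </book>
  <book id="b2">
    <title>More Markup</title>      <!-- same name as the other title: similar -->
  </book>
</library>
"""

root = ET.fromstring(doc)
b1 = root.find("book[@id='b1']")

# intimate: attributes live on the element itself
print(b1.attrib)                     # {'id': 'b1', 'lang': 'en'}

# strong: the children of b1
print([c.tag for c in b1])           # ['title', 'author', 'related']

# moderately strong: the successor sibling of <title>
kids = list(b1)
print(kids[kids.index(b1.find('title')) + 1].tag)   # 'author'

# connected: follow the ref attribute to the element it points at
target = root.find(f"book[@id='{b1.find('related').get('ref')}']")
print(target.find('title').text)     # 'More Markup'

# similar: information items sharing the same name across the document
print(len(root.findall('.//title')))   # 2
```

None of these relations is declared anywhere; they fall out of the syntax itself, which is the point of the paragraph above.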

Now the relations I suggest may not be perfect; you could improve on them undoubtedly. And they may not be reliable guides, in that there's nowt as queer as folk and folk make schemas. And the categories do not necessarily correspond to neat or orthogonal logical or linguistic categories. They don't need to. But surely we can reverse engineer something about how humans work from their artifacts. Are there some kind of quasi-linguistic properties that make XML successful, apart from the obvious reasons of representational power, internationalization power and corporate power?

Of course, there are other ways to look at it: elements/attributes as noun/verb, as substance/accident, and so on. Those analogies can also be at work.


2006-05-25 08:36:32
They are very local, deal with a small subset of information most of the time, deal with time over short horizons, and trust their neighbors to mind their own business (meaning, handle what is on their desk). They are very comfortable with rules of thumb and hazy assumptions as long as the bet they are making on them is affordable and not a zero-sum game. They tolerate disasters at a distance. Instances are 'facts on the ground'. DTDs and schemas are theories at a distance. The two assumptions that make the web scale are that links can snap and that less control is required as long as local control is possible. IOW, it runs loose and we deal with the mistakes as they happen. BTW, if your car was designed that way, you would still be riding a horse, because the horse has millions of years of testing behind its design.

On the other hand, the human dilemmas of the web technology being a caveat emptor system are only now being realized by the larger numbers of human users even if perfectly predictable by the observers of their behaviors. As a result, a linguistic layer of law is being imposed on it to make up for the shortfalls of its own design.

Quasi linguistic: a name is a name is a name. Trust and verify as necessary. Humans are extremely noise tolerant because their design rules are simple:

1. Fight or flight.
2. Hesitate when uncertain.
3. Breed young, fast and often.

The first is survival of the individual. The second restrains damage. The third perpetuates the species if the first two fail.

The rest is time scale, numbers, and habitable environment.

XML works about like that too.

Rick Jelliffe
2006-05-25 10:35:43
I missed out on the breeding part, let alone fast or often!
2006-05-25 12:37:06
Then You Die in one generation of play.

... seems like a pity to waste the very advanced genetic progress, though.

One could quibble that XML does as well as it does precisely because it doesn't impose anything but a syntax, and that all of the connection-strength relationships (is containment really a relationship, or just a side effect of syntax and text linearity?) are derived intensionally. Now one asks whether the relationships you list have a deeper derivation by being ubiquitous among humans. I don't know how one tests for that, and I can only speculate as to their behavioral or cognitive stimuli. Maybe digging into Seymour Papert's work will provide clues.

2006-05-26 08:34:24
BTW, Rick, consider the following:

1. Your categories are force vectors: the coupling strength is possibly based on distance (angular distance from the topic (name) in semantic space) and rate of change (which could be a motion property).

2. If an instance is expressed by the AND relationship, in a vector space, that is a tensor.

3. If the instance is a single viewpoint (data and functions), that is an objective system. If the instance is data only, and therefore information with multiple interpretations or viewpoints, that is a subjective system.

Until interpreted, the XML instance is operationally in superposition to the semantic space. Quantum entanglement effects (e.g., given a term such as pet-fish, the most frequently selected value is 'guppy', but this relationship is a tensor and affected by where in space and time, as a set of scalar motions, it is instanced) affect the interpretation.

This may be a reasonable model for human thinking or intelligence in general. XML is more amenable to the way intelligence works than objects as long as interpretations are in motion with respect to each other and the observables (data).

Quantum logic as a formalism for relating observables is one approach. Possibly augmented by theories of Evolutionary Stable Systems (ESS) and lattice notations, one might make a better or at least dynamic semantic web. Interval analysis might play a role.

The problem with the semantic web is the formal ontologies. As we learned the hard way with DTDs, they are just testable theories, and if they aren't testable, they are useless except as doorstops.
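The idea that a schema is a testable theory can be sketched very simply. This is a hypothetical illustration, not any real schema language: a "theory" is just a predicate over instances, and an instance either confirms it or falsifies it.

```python
import xml.etree.ElementTree as ET

def theory_holds(instance: ET.Element) -> bool:
    """A toy 'schema as theory': every book has a title, and ids are unique."""
    books = instance.findall('.//book')
    has_titles = all(b.find('title') is not None for b in books)
    ids = [b.get('id') for b in books]
    return has_titles and len(ids) == len(set(ids))

# A fact on the ground that confirms the theory...
good = ET.fromstring("<library><book id='a'><title>T</title></book></library>")
# ...and one that falsifies it (a book with no title).
bad = ET.fromstring("<library><book id='a'/></library>")

print(theory_holds(good))   # True
print(theory_holds(bad))    # False: the instance falsifies the theory
```

An untestable ontology, by contrast, offers no such predicate, which is what makes it a doorstop.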