PRESTO: URLs as XPaths to views of information (+ schemas for URLs?)

by Rick Jelliffe

In the markup world, the jargon is that inline markup means the tags that delimit ranges of text in a document (e.g., Plain Old XML), while out-of-line markup is where the structures and labels are in one place but the subjects of those structures and labels are in another place (e.g., XLinks). Of course, you can have XPaths which drill down to some piece or bundle of information with inline markup, but where there is out-of-line markup there is potentially another XPath that can drill down through the out-of-line markup and end up labelling the same information.

What may not be obvious is that a web system that uses PRESTO is in effect using URLs that act like XPaths on virtual out-of-line markup. "Virtual" because no actual tree is (necessarily) ever explicated: notionally, PRESTO uses resolver rewriting.

That good markup practice is to mark up the information directly, without fluff and tricks and in as pleasant a way as possible, is universally acknowledged; and that there are many kinds of information structure where the markup cannot be a neat model of the data, such that all elements represent objects of the same analytical importance, is also widely known and regretted. (Think of the distinction in XSD between the components (the objects of the schemas) and the tags used for each component, for example. Or the *Pr containers in OOXML.)

A PRESTO URL should give the view in terms of the (conceptual) components, not the specific tags used if the resource is stored as an XML document. And not necessarily every tag, certainly. But every concept (every significant concept) should have a URL, even if there is no representation available or only a pretty crappy one.

So if in PRESTO a URL represents a kind of XPath to a virtual out-of-line markup view of some data, then it is possible to have a virtual schema for that virtual markup: in effect, you could have a schema for the URL. For example, given the virtual schema (as RELAX NG compact syntax here):

element address {
  element tent { text },
  element oasis { text },
  element wadi { text },
  element desert { text }
}

which would allow PRESTO URLs like

http://www.eg.com/address
http://www.eg.com/address/tent
http://www.eg.com/address/oasis
http://www.eg.com/address/wadi
http://www.eg.com/address/desert


In PRESTO, these should be available regardless of how the data is stored, because the idea is to model the user's conceptions. (And if an exact match is not available, to provide the best fit. This certainly creates a task allocation between front-end and back-end systems that may not be workable for some organizations or tasks. No sweat.)
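
As a rough illustration of what "a schema for the URL" could mean in practice, here is a minimal Python sketch that checks a URL's path against the address schema above. The nested dictionary is only a hypothetical stand-in for the RELAX NG schema, and the names ADDRESS_SCHEMA and is_valid_presto_url are invented for this example; nothing here is part of PRESTO itself.

```python
from urllib.parse import urlparse

# Hypothetical stand-in for the virtual schema: each key is a concept
# name, and each value holds that concept's child concepts.
ADDRESS_SCHEMA = {
    "address": {
        "tent": {},
        "oasis": {},
        "wadi": {},
        "desert": {},
    }
}


def is_valid_presto_url(url, schema=ADDRESS_SCHEMA):
    """Return True if every path step of the URL names a concept in the schema."""
    steps = [s for s in urlparse(url).path.split("/") if s]
    if not steps:
        return False
    node = schema
    for step in steps:
        if step not in node:
            return False
        node = node[step]  # descend into the child concepts
    return True


if __name__ == "__main__":
    for url in ("http://www.eg.com/address/oasis",
                "http://www.eg.com/address/igloo"):
        print(url, is_valid_presto_url(url))
```

The check says nothing about how the data is stored: it only tests that the URL is a legal path through the conceptual view, which is the point of the virtual schema.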

But what about cardinality? Here is a schema more typical of literature:


element law {
  element title { text },
  element part {
    element title { text },
    ( element p { text } |
      element list {
        element item { text }+
      }
    )*
  }*
}


The XPath for accessing a particular part's title would be /law/part[2]/title, so the PRESTO URLs would need some kind of convention.

In PRESTO we *might* have URLs for

http://www.eg.com/law/
http://www.eg.com/law/title
http://www.eg.com/law/part
http://www.eg.com/law/part2/title
http://www.eg.com/law/part2/p3
http://www.eg.com/law/part2/list4
http://www.eg.com/law/part2/list3/item4


Now, I am not sure I understand the issues well enough to say which system for indexing is absolutely best. But I think the advantage of a form like http://www.eg.com/law/part2/title over one that gives the index a step of its own (say, http://www.eg.com/law/part/2/title) is that it is probably a more common case that your system is interested in /law/part[2]/title rather than all titles of parts, /law/part/title. But it is a matter of the particular use case and the consequent virtual schema.
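
To make one possible convention concrete, here is a small Python sketch that turns URLs of the "part2" kind into the corresponding XPath. It assumes that a trailing number on a step is a 1-based position and that concept names never end in a digit; both are assumptions of this sketch, not rules of PRESTO, and the function name presto_url_to_xpath is made up for illustration.

```python
import re
from urllib.parse import urlparse

# A step is a concept name optionally followed by a 1-based position,
# e.g. "part2". Assumption: names themselves never end in a digit.
_STEP = re.compile(r"([A-Za-z_][A-Za-z_.-]*)(\d+)?")


def presto_url_to_xpath(url):
    """Translate a PRESTO-style URL path into an XPath over the virtual markup."""
    steps = [s for s in urlparse(url).path.split("/") if s]
    parts = []
    for step in steps:
        m = _STEP.fullmatch(step)
        if m is None:
            raise ValueError(f"unrecognised step: {step!r}")
        name, index = m.group(1), m.group(2)
        # "part2" becomes "part[2]"; a bare "title" stays as it is.
        parts.append(f"{name}[{index}]" if index else name)
    return "/" + "/".join(parts)


if __name__ == "__main__":
    print(presto_url_to_xpath("http://www.eg.com/law/part2/title"))
    print(presto_url_to_xpath("http://www.eg.com/law/part2/list3/item4"))
```

Under this reading, a step without a number (as in http://www.eg.com/law/part) simply selects every matching node, just as the unindexed XPath step would.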

(Another possibility is just to bite the bullet and allow XPath syntax directly in the URLs, with appropriate percent escaping. For example, http://www.eg.com/l/law/part%5B2%5D/title. Is this reinventing XPointer? Well, in a way, except that in XPointer you are locating a file and then drilling down according to the actual markup; in PRESTO the information is merely hierarchically accessible, and you are using the use case's concepts to zero in on the information.)
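
The percent escaping itself is mechanical. As a sketch, Python's standard urllib.parse can do the round trip; the helper names xpath_to_url and url_to_xpath are invented here, and the base URL is just the example host used in this post.

```python
from urllib.parse import quote, unquote, urlparse


def xpath_to_url(base, xpath):
    """Append an XPath to a base URL, percent-escaping what a URL path cannot carry."""
    # quote() leaves "/" alone by default but escapes "[" and "]".
    return base.rstrip("/") + quote(xpath)


def url_to_xpath(url):
    """Recover the XPath from the path component of such a URL."""
    return unquote(urlparse(url).path)


if __name__ == "__main__":
    url = xpath_to_url("http://www.eg.com", "/law/part[2]/title")
    print(url)                # http://www.eg.com/law/part%5B2%5D/title
    print(url_to_xpath(url))  # /law/part[2]/title
```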

5 Comments

Danny
2008-03-13 02:33:49
Looks mighty promising. XPointer always seemed a good idea, but a bit cumbersome in practice (Annotea managed to make RDF/XML look even worse than usual :-)


Somewhat related is the MS Project Astoria URI syntax. While I have certain doubts about the overall modelling (the URIs seem to point to data constructs rather than first-class resources, if you see what I mean), it's a neat use of URIs.


What I'd personally like to see is something like this used as sugar for SPARQL query URIs.


CLOSMADEUC
2008-03-13 04:29:35
Fine with Java and J2EE (huge projects).


Does it fit with PHP (small projects)?


Etienne

Rick Jelliffe
2008-03-13 04:38:07
Etienne: It happens at the resolver level, not the page service level. So it is neutral as far as PHP, J2EE and so on. We have PRESTO facades with .NET/IIS and with Java/Tomcat/Tuckey and are pretty happy with it so far.


It definitely seems to be a leap for programmers to catch on, and how far you take it depends on the discipline/monomania of the developers too. For example, developers often seem to have a very clear mental demarcation between what can be named and what should be provided by a parameter. I have noticed that some developers find that
/aaa/bbb/ccc?format=xml
makes more sense than
/aaa/bbb/ccc/application/xml
though I think the latter follows the PRESTO idea more, because it identifies a resource which could itself be drilled into.

Jeni Tennison
2008-03-13 13:20:42
Rick,


Interesting post. I started writing a reply here, but it got too long so I blogged instead, at http://www.jenitennison.com/blog/node/80.


Jeni

Rick Jelliffe
2008-05-29 01:47:58
Jeni brings up a good point in her blog commentary.


It is the well-known problem that positional indexing is fragile, and that the public "number" for something may actually be a name and not a simple sequential number at all.


That would be a good reason to favour /law/part2/title over /law/part[2]/title, though perhaps /law/part[id='part5A']/title would be better.


There are certainly many tradeoffs.