OAI-ORE Compound Documents Drafts Published
by Erik Wilde
Last week, the Open Archives Initiative (OAI) published a set of beta-stage recommendations for compound documents, called Object Reuse and Exchange (ORE). This set of specifications has been published as version 0.9 and has been released for public review and comments (ironically, the press release is a PDF blob).
The problem of compound documents (how to specify that a set of URI-identified resources together form one compound resource) has been around for a while, and never has been solved properly. There are various proposals from different application areas, such as XLink (not quite for compound documents, but it could be used for this purpose as well), METS (using and extending XLink), and DIDL. I am certainly missing some other technologies here, please let me know what they are. The problem is that none of these languages ever caught on, mostly because none of them tried to be general. XLink focused on navigation, METS on libraries, and DIDL on multimedia.
However, it would be good to have a general and simple language for compound documents. If designed well, it could even be easily extended to be used for application-specific scenarios such as those covered by XLink, METS, and DIDL.
The problem is, OAI-ORE will not be it. Instead of designing a simple data model and a simple language for it, they settled for RDF. None of the documents contains any explanation as to why RDF was chosen over a simpler XML-based model. There even is a document that talks about how to implement OAI-ORE in Atom, and all it does is show how to embed RDF into Atom. This means that for processing such an Atom feed you need an Atom toolkit as well as an RDF toolkit. As a side note: the terms in the Atom categories are URIs, which does not really follow Atom's idea of terms as strings.
Generally, it is disappointing to see that a problem as important and manageable as compound documents, which still is an open problem looking for a good solution, has been approached on the wrong level. It is of course possible to come up with an RDF-based solution for that problem, but this unnecessarily introduces technology layers which for this particular problem are not required.
This means that the quest for a general and XML-based format for compound document descriptions is still on, and OAI-ORE is not a real contender in this race. Well, maybe it still could be one if the abstract data model also got a representation in plain XML. Unfortunately, the model is not as abstract as its name implies; it is a rather concrete definition of an RDF vocabulary, which will make it quite a bit harder to come up with a good and isomorphic XML representation. The effort might be worth it, however, because the installed base of XML is significantly bigger than that of RDF.
This is exactly what multipart/related has been doing forever.
"The problem of compound documents (how to specify that a set of URI-identified resources together form one compound resource) has been around for a while, and never has been solved properly."
"Generally, it is disappointing to see that a problem as important and manageable as compound documents, which still is an open problem looking for a good solution, has been approached on the wrong level."
Could you elaborate on what the "proper" "good" solution, and "correct" level of approach to the problem of compound document(/object) description might be? And why and for what purposes? That would be much more interesting and helpful than a mere summary dismissal of existing solutions for being either not general enough, or not XML based.
BTW, METS is pretty friggin' general, and a major weakness IMO is precisely its reliance on a few overloaded XML syntactic structures (element sequencing/nesting and id/idref) for representation of the relationships among the components of the digital object a METS instance aims to describe. It's not at all apparent to me that (setting aside circumstantial issues such as install bases) a general XML-based solution would be either necessary or sufficient. I'd like to hear more from your perspective.
Thanks, Terry, for your comments.
What I was referring to as a goal that I think would be worth pursuing would be a simple and universal data model for compound documents, expressed in plain XML. The reasons why I am saying this are the following:
- Other approaches don't even have an abstract data model (XLink does not have one, and METS builds on XLink). ORE's abstract data model does not look all that abstract; it is the description of an RDF vocabulary.
- Being plain XML, all required to process such a format would be XML tools. If anybody wants to handle the data in RDF, they are welcome to do so, and there even might be an additional syntax provided mapping the abstract data model to RDF, but the normative format should be plain XML.
- METS' problem in my view is that it relies on XLink (which in itself is underspecified and not that successful), and in its behavior section pretty much allows anything, which makes it hard to build interoperable implementations without bilateral agreements. What I like about METS is the structural map, but its hierarchical approach may not be the best way to do it. I would favor a simple list-based approach.
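To make the list-based idea a bit more concrete, here is a minimal sketch of what such a plain-XML description might look like and how little tooling it would need. The `compound` and `part` element names, the `uri` attribute, and the example.org URIs are all hypothetical, made up for illustration; they are not taken from METS, XLink, ORE, or any other specification.

```python
import xml.etree.ElementTree as ET

# Hypothetical vocabulary (an assumption, not an existing format):
# one compound resource described as a flat list of URI-identified parts.
DOC = """\
<compound uri="http://example.org/article/42">
  <part uri="http://example.org/article/42/text.html"/>
  <part uri="http://example.org/article/42/figure1.png"/>
  <part uri="http://example.org/article/42/data.csv"/>
</compound>
"""


def parts_of(xml_text):
    """Return (compound URI, list of part URIs) from the hypothetical format."""
    root = ET.fromstring(xml_text)
    return root.get("uri"), [part.get("uri") for part in root.findall("part")]


compound_uri, part_uris = parts_of(DOC)
print(compound_uri)
for uri in part_uris:
    print("  ", uri)
```

Nothing beyond a stock XML parser is involved here, which is the point of asking for plain XML as the normative format.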
Having a thorough discussion on how to define a compound document format in blog comments probably is a hard thing to do. Here are the key issues I wanted to highlight in my post:
- It would be good to have a universal format for compound documents. There currently is no widely accepted format for this.
- OAI-ORE is another attempt to define such a format.
- The major flaw I see with OAI-ORE is that it unnecessarily introduces heavyweight technology layers, instead of settling for plain XML.
Why do we need XML-specific compound document formats?
Brian: The format is not XML-specific, it can describe any compound document as long as its components are identified by URI. But any compound document description format needs some representation, and XML is a good candidate for reasonably simply structured data such as compound document descriptions.
Having had a look at the "ORE Resource Map Implementation in RDFa", I'd like to respond to some of the points you've raised.
"Well, maybe it still could be one if the abstract data model also got a representation in plain XML"
There is an XML format, that being the one expressed using XHTML and RDFa. They have been eminently sensible in not creating a new XML grammar, but instead have chosen an existing grammar that can be viewed by web browsers and has been annotated with RDF vocabularies. They have therefore made the Resource Map both human-readable and machine-processable.
"It is of course possible to come up with an RDF-based solution for that problem, but this unnecessarily introduces technology layers which for this particular problem are not required."
The use of RDF and more specifically RDF/XML does not require the use of RDF parsers, and the like, to process the information. As RDF/XML or XHTML+RDFa, a Resource Map can be processed using XSLT to retrieve the resources. The presence of the RDF vocabulary enhances rather than detracts from the information, and after all a format that identifies resources via URIs seems an ideal fit for RDF. If one wanted to load the RDF into a triple-store and query aggregations using SPARQL then that is additional value at no extra expense because the metadata is already there and waiting to be used.
I prefer to see people reuse and combine existing technologies before they attempt to add to the frightening heap of domain specific mark-up languages and data formats.
"There is an XML format, that being the one expressed using XHTML and RDFa."
RDFa is not an XML format, it is a format for embedding RDF into (X)HTML. It is thus an RDF syntax, simply one that can be more easily embedded into (X)HTML than RDF/XML. RDFa is good, but it is still RDF.
"I prefer to see people reuse and combined existing technologies before they attempt to add to the frightening heap of domain specific mark-up languages and data formats."
If you are developing a new language, you have to come up with a representation for it. You could do it based on plain text, on XML or on RDF, but in either case, you are creating a vocabulary (a set of terms and rules for using them) in that language.
"The use of RDF and more specifically RDF/XML does not require the use of RDF parsers, and the like, to process the information. As RDF/XML or XHTML+RDFa, a Resource Map can be processed using XSLT to retrieve the resources."
It is a common misconception that RDF can be appropriately processed using XSLT. The analogy I am usually using is XML and Perl. You can process XML (which is text-based) using Perl, but answering non-trivial questions turns out to be surprisingly hard, if you want to do it robustly (think of CDATA sections and entities, for example). In the end, you end up implementing an XML parser in your Perl program.
The same is true for RDF and XSLT. Of course you can process RDF/XML with XSLT, but answering non-trivial questions will require the reconstruction of the RDF data in your XSLT code, so in the end you either implement RDF yourself, or you get a toolkit for it.
The important question is: Does the OAI-ORE vocabulary need RDF as its foundation? If not, it would be better to use XML, because it is simpler to use and more widely established. The academic community has a tendency to do everything in RDF these days because it is en vogue and "everybody does it anyway". But if you look past academia and compare the popularity of XML and RDF, I think it becomes obvious that if both an XML and an RDF syntax can do the same job equally well, it is better to choose the simpler foundation.
I have been following the OAI-ORE effort with great interest. I was excited to see the Atom serialization, but disappointed when it turned out to be simply Atom stuffed w/ RDF. Do you think Atom itself (w/o the RDF) could handle this task (in the interest of not creating a new syntax)? Seems to me it quite likely could, although I think a few extensions for dealing w/ nested feeds would be necessary (those would be quite useful in any case). My fear is that OAI-ORE will turn out to be, like METS, a "library-only" approach.
Discussions on the OAI-ORE mailing list have focussed primarily on the RDF/SemWeb aspects of the spec which is far more cognitive baggage than I need when I want to simply move an aggregation (an electronic journal, say) from one system into another.
"RDFa is not an XML format, it is an format for embedding RDF into (X)HTML."
I still see XHTML+RDFa as an XML format, and as such, the data it contains can be extracted for the purposes of processing the aggregation.
"It is a common misconception that RDF can be appropriately processed using XSLT."
It depends what you mean by 'appropriately'. I would use XSLT to extract/transform RDF information from an XML document, or for that matter to extract/transform information in an RDF graph expressed as RDF/XML.
"The same is true for RDF and XSLT. Of course you can process RDF/XML with XSLT, but answering non-trivial questions will require the reconstruction of the RDF data in your XSLT code, so in the end you either implement RDF yourself, or you get a toolkit for it."
What sort of non-trivial questions are you needing to answer when it comes to dealing with a resource map that identifies a collection of related resources that you want to process? There's no requirement for inferencing or anything complicated like that.
I support the use of XML as much as anybody, but I also see that RDF has an important role in the extension of existing XML languages, by annotation, using standardised vocabularies that make machine processing easier. If someone comes up with an XML format for identifying related resources, all well and good. If they use known vocabularies like Dublin Core Metadata Terms, even better; and if they sensibly extend an existing XML grammar, even better still, because the learning curve is easier.
I annotate all my XSLT code with a block of RDF that captures useful metadata about the code. I could have dreamt-up my own XML metadata format, but why would I when Dublin Core is out-there and understood. I can run transforms over the collection of XSLT for documentation purposes and I don't need RDF tools because I don't have any 'non-trivial questions'.
XML is very very good and so for that matter (IMO) is RDF. They have their own domains but they are not mutually exclusive. Whether RDF/XML enriches RDF is a moot point but RDF can enrich XML.
"I still see XHTML+RDFa as an XML format, and as such, the data it contains can be extracted for the purposes of processing the aggregation."
It is of course your right to see RDFa as an XML format, and on the surface, this is true. However, the actual data model backing the data being encoded in RDFa is RDF, not XML.
"It depends what you mean by 'appropriately'. I would use XSLT to extract/transform RDF information from an XML document, or for that matter to extract/transform information in an RDF graph expressed as RDF/XML."
Like I said, you can of course process RDF (in its RDF/XML or RDFa syntaxes) in XSLT, but trust me, for anything but the most trivial things (let's define "trivial" as finding one triple matching a certain pattern), you'll have to do quite a bit of serious XSLT programming if you want to reliably process the RDF data. This is especially true for RDF/XML, which allows the exact same RDF triples to be serialized in many syntactic variations.
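A small sketch of the syntax-variation problem, with Python's standard ElementTree standing in for an XSLT match pattern (the effect at the XML tree level is the same). Both inputs are RDF/XML spellings of one and the same triple; the example.org URI and the title value are made up for illustration.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Variant A: the property is written as a child element.
VARIANT_A = """\
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://example.org/doc">
    <dc:title>Example</dc:title>
  </rdf:Description>
</rdf:RDF>
"""

# Variant B: the same triple, written as a property attribute.
VARIANT_B = """\
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://example.org/doc" dc:title="Example"/>
</rdf:RDF>
"""


def naive_titles(xml_text):
    """Tree-level query that only knows the child-element spelling."""
    root = ET.fromstring(xml_text)
    return [e.text for e in root.findall("rdf:Description/dc:title", NS)]


titles_a = naive_titles(VARIANT_A)  # finds the title
titles_b = naive_titles(VARIANT_B)  # finds nothing, although the triple is the same
print(titles_a)
print(titles_b)
```

A robust tree-level query has to anticipate this and every other permitted spelling (typed node elements, nested descriptions, and so on), which is where the normalization code comes from.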
"What sort of non-trivial questions are you needing to answer when it comes to dealing with a resource map that identifies a collection of related resources that you want to process. There's no requirement for inferencing or anything complicated like that."
Well, inference would probably be close-to-impossible to solve in XSLT. But anything beyond "find me a triple that matches this pattern" already is pretty hard to do. Let me challenge you to write a simple piece of XSLT that, given some arbitrary RDF/XML input (not just your favorite way of serializing RDF), finds all resources which are described by at least two triples matching given patterns. In terms of the RDF data model, this still would be a trivial thing to ask for. In terms of implementing this in XSLT, I wish you good luck.
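For comparison, here is roughly how small that question is once the data is available as triples rather than as an XML tree. This is only a sketch under assumptions: triples are plain Python tuples, `None` in a pattern acts as a wildcard, and "at least two triples matching given patterns" is read as "at least two of the subject's triples match any of the patterns". The data and the `dc:`-prefixed predicate strings are made up.

```python
from collections import defaultdict

# Triples as (subject, predicate, object) tuples; illustrative data only.
TRIPLES = {
    ("http://example.org/a", "dc:creator", "Alice"),
    ("http://example.org/a", "dc:format", "text/html"),
    ("http://example.org/b", "dc:creator", "Alice"),
    ("http://example.org/b", "dc:format", "image/png"),
}


def matches(triple, pattern):
    """True if every non-wildcard position of the pattern equals the triple's."""
    return all(p is None or p == t for t, p in zip(triple, pattern))


def subjects_matching_at_least(triples, patterns, minimum=2):
    """Subjects described by at least `minimum` triples matching any pattern."""
    counts = defaultdict(int)
    for triple in triples:
        if any(matches(triple, pattern) for pattern in patterns):
            counts[triple[0]] += 1
    return sorted(s for s, n in counts.items() if n >= minimum)


PATTERNS = [(None, "dc:creator", "Alice"), (None, "dc:format", "text/html")]
result = subjects_matching_at_least(TRIPLES, PATTERNS)
print(result)
```

The hard part in XSLT is not this logic but getting from arbitrary RDF/XML to the set of triples in the first place.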
"I support the use of XML as much as anybody, but I also see that RDF has an important role in the extension of existing XML languages, by annotation, using standardised vocabularies that make machine processing easier."
I am wondering why RDF always has this almost magic capability attached to it saying that it allows you to do things that you could not do before. RDF represents vocabularies. XML represents vocabularies. RDF vocabularies are based on another data model and use other schema languages, but apart from that, if you have a vocabulary and want to represent it, you can do so in both RDF and plain XML. Like I said, it has become en vogue to choose RDF by default, but that does not necessarily mean this is the right thing to do.
Consider syndication: RSS first was XML, then went proto-RDF, then went RDF. However, it was always based on a restricted syntax subset of RDF/XML, because it was clear that allowing arbitrary RDF/XML would make RSS processing much too complicated. Atom then went back to XML by saying "there is no need for RDF here", and I think this was the right thing to do.
"I have been following the OAI-ORE effort with great interest. I was excited to see the Atom serialization, but disappointed when it turned out to be simply Atom stuffed w/ RDF."
Same here. I found it kind of weird to find that. If you simply stuff RDF into Atom, why even describe it? There's nothing special to it. The only thing worth describing is the categories, but I think their current use does not really look the way it probably should in Atom.
"Do you think Atom itself (w/o the RDF) could handle this task (in the interest of not creating a new syntax)? Seems to me it quite likely could, although I think a few extensions for dealing w/ nested feeds would be necessary (those would be quite useful in any case)."
That's a really interesting question. In theory, it could be done. Incidentally, I have been looking at nested feeds for the past months, not for compound documents, though, but for packaging feed metadata into feeds.
I don't think, though, that Atom would be the best foundation for a compound document format; I guess you would end up "misusing" Atom quite a bit. Another thing to consider is Atom's focus on time as the primary key (I have written about this on http://dret.typepad.com/dretblog/2007/12/timeless-atom.html ).
"Discussions on the OAI-ORE mailing list have focussed primarily on the RDF/SemWeb aspects of the spec which is far more cognitive baggage than I need when I want to simply move an aggregation (an electronic journal, say) from one system into another."
That's right to the point! RDF often is used by people surrounding themselves with other RDF people, so in the end everybody agrees that RDF is appropriate because all others see it the same way. However, these decisions tend to happen in circles which probably do not contain a representative sample of the intended *users* of a new data format. Users often simply want to get their job done, rather than being able to "participate in the giant global graph of linked data." If the latter is implied in the former, all is well; if not, there is a problem.
As some shameless self-advertisement here: Look out for the July 2008 issue of CACM, where we have written (in a piece called "XML Fever") about this pattern of how XML and RDF sometimes are selected for political and social reasons, rather than technical issues.
I should begin my post with a confession of my lack of impartiality on the subject of OAI-ORE. I have dedicated a good part of the last two years trying to get this spec right. All of my following comments should be read in that context and, also, with the understanding that I consider criticism important in these efforts and certainly don’t imply a wholesale rejection of your comments. Nevertheless, I do strongly disagree with many of your criticisms.
One of the joys of working in the web standards world is placing one’s self in the cross hairs of divided camps of orthodoxy. There are a number of these divided camps but the split between the RDF and XML purists is particularly difficult to navigate. As someone who eschews orthodoxy and prefers to find a path that accommodates the positive aspects of various technologies this is not the first time that the work I am involved in has been criticized as “playing on the wrong side”. Needless to say, your comments from the XML perspective (or the RDF is wrong perspective) are matched by criticisms from the RDF community who say we’ve polluted our RDF model with this Atom stuff.
I am indeed mystified at your criticism of the RDF foundations of the ORE model. I have a few points in response:
1. Simple aggregations are indeed hierarchical such as a one level deep container that consists of an image in multiple formats, a scholarly article with multiple pages, etc. More complex aggregations can be a nested hierarchy - for example, a book with chapters, each containing pages. But these trees ultimately become graphs in many real life cases - a scholarly paper may cite the same resource in multiple places, a document may reuse the same image, etc. Obviously these are very simple examples, but these graphs appear again and again in both simple and complex examples. Since all trees are graphs, we sought a model that was graph-based - accommodating both the simple and more complex cases.
2. XML is primarily a tree-based, hierarchical model. Yes, it is possible to serialize graphs in XML - witness RDF/XML (which I find a bit of an abomination). But, if aggregations are best characterized as graphs, and I believe they are, we need an underlying graph-based model.
3. RDF is at its heart a nice, simple graph-based model. Sure, it’s been overloaded, and complexified, with all sorts of “ai-lite” and “inferring knowledge” noise that has alienated many, like me, who want simplicity. But, the core triple-based abstraction is, not surprisingly, quite useful for modeling all sorts of things, including information objects. And yes, the more advanced stuff such as inferencing and SPARQL querying provides a nice path for increased functionality in the long term.
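The tree-versus-graph point can be illustrated with a tiny sketch. The resource names and the pair-based representation below are assumptions made for illustration only, not ORE constructs: as soon as one part has two containers (the reused image), the structure is no longer a strict tree.

```python
from collections import defaultdict

# (container, part) pairs; names are made up for illustration.
AGGREGATES = [
    ("book", "chapter1"),
    ("book", "chapter2"),
    ("chapter1", "figure.png"),
    ("chapter2", "figure.png"),  # the same image reused in a second chapter
]


def parents_by_part(edges):
    """Map each part to the set of containers that aggregate it."""
    parents = defaultdict(set)
    for container, part in edges:
        parents[part].add(container)
    return parents


def shared_parts(edges):
    """Parts with more than one container; any such part breaks strict tree shape."""
    return sorted(p for p, cs in parents_by_part(edges).items() if len(cs) > 1)


shared = shared_parts(AGGREGATES)
print(shared)
```

A purely nested serialization would have to either duplicate `figure.png` under both chapters or fall back on some cross-referencing mechanism.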
Based on these points, we in OAI-ORE have tried our best to plot a path that accommodates the XML and RDF communities.
* If you see the world largely through nested trees serialized with lots of angle brackets, and you want to model an aggregation that is mainly tree structured, you can ignore the whole RDF model and use ORE to express an aggregation such as that shown in http://www.openarchives.org/ore/0.9/atom-examples/atom_dlib_mini.atom. It has no RDF in it, is well-formed XML, and passes with no complaints through the Atom validator. And, with the GRDDL XSLT at http://www.openarchives.org/ore/atom/atom-grddl.xsl , it can be transformed to RDF/XML encoding triples that conform to the ORE data model. By the way, I just don’t understand why you characterize this as RDF “embedded” in Atom. Atom is a pretty loosely written standard that has been leveraged by many (including Google) to do something beyond syndication. We in ORE are doing nothing else.
* If you see the world largely through triples expressed in N3, turtle, etc. you can express all sorts of interesting aggregations and, to satisfy the XML crowd, still encode them in atom using the rules in http://www.openarchives.org/ore/0.9/atom.
In closing, I am left confused by what you mean by “a general and simple language for compound documents” and by your claim that we’ve not managed to do this. In my mind (again admittedly prejudiced) we’ve accomplished just that - it’s just atom! But even more, we’ve accomplished a general and simple and -->extensible<-- language for compound documents by providing a bi-directional entry point between this simple atom and the richer RDF data model.
If instead of this you propose a whole new native XML schema for expressing aggregations, I just don’t agree. That would require a whole duplication of the infrastructure that already exists in the atom and rdf communities.
Lastly, one of the comments to your post from Peter Keane states his fear that OAI-ORE is a “library-only” approach. Our decision to use atom was precisely in response to our perception that we needed to move to a more general application context. As we argue throughout the specifications and user guides, aggregations occur across the web world (not just in digital libraries). The integration of our work into the mainstream web architecture and leveraging of atom hopefully ensures that it is available across all those applications.
So, excuse me while I have my cake and eat it too! Meanwhile, I’ll await your response.
Good to hear from you, and thanks for commenting! I'll respond in more detail next week. Just a short reply regarding your tree/graph argument, which of course is at the heart of the debate around XML vs. RDF.
I am not pro XML in all cases, but I think when designing a language, you need to have a data model, and then have a hard look at it. If it is a tree or can be treeified reasonably, pick XML. If it is a graph where you cannot find a reasonable treeified version, then pick RDF.
Why I am saying this: The OAI-ORE "abstract model" in my eyes is not really an abstract model. It uses RDF and as such it is an RDF implementation of an (underlying and only implicit) abstract model. To conclude that RDF is appropriate as a language is not a surprise, because the abstract model *is based on* RDF. In my eyes, the question would be: Is it possible to create a reasonable XML syntax for the implicit abstract model. If yes, the XML syntax should be preferred. I hope I never said that RDF was not possible as a syntax, of course it is well-suited for anything that is a graph; I just think it should only be chosen in cases where XML cannot do the job.
Carl & Erik-
I think that this is a *really* useful conversation, and one which I have tried, unsuccessfully, to get started on the OAI-ORE mailing list. I would chalk that up to my inability to express my viewpoint effectively as much as anything.
Carl, you said:
"One of the joys of working in the web standards world is placing one’s self in the cross hairs of divided camps of orthodoxy. There are a number of these divided camps but the split between the RDF and XML purists is particularly difficult to navigate. As someone who eschews orthodoxy and prefers to find a path that accommodates the positive aspects of various technologies this is not the first time that the work I am involved in has been criticized as “playing on the wrong side”. Needless to say, your comments from the XML perspective (or the RDF is wrong perspective) are matched by criticisms from the RDF community who say we’ve polluted our RDF model with this Atom stuff."
From my point of view, there are not two camps. I can argue for both the RDF point of view OR the pure tree-based XML point of view and I think both can provide effective solutions to a set of problems. The problem that OAI-ORE attempts to solve (or at least my understanding of it -- I am not 100% clear on it, and that may be part of the problem) can be addressed quite effectively with a graph-based, RDFish approach, and the triples-based abstraction seems well-suited to that. The XML/Atom approach could also be quite effective -- you mention Google's work in this area, and they are undoubtedly leading the way (some of the most important folks behind the Atom spec now work at Google, and Google is taking a very "open" approach to things).
Where OAI-ORE gets into trouble (in my opinion) is in mixing the two. If one goes back and looks at the deliberations of the Atom & AtomPub working groups, you will see that a painstaking and difficult process and open discussion led them to the design decisions they made. Had they accepted the RDF'ers wishes, Atom would look much more like RSS 1.0 and would be a much different spec. Maybe better, maybe worse, maybe even unnecessary, since RSS 1.0 got a lot of things right. But they rejected that approach, and Atom looks like a "better" RSS 2.0. And it is a superb and extensible spec. It is maddening in places (author is required, focus on timestamp, etc.) but superb nonetheless. And its uses will be influenced by that minimalist approach. "Beefing up" Atom w/ RDF semantics strikes me sort of like adding catchy melodies to a Philip Glass piece or flowery descriptions to a Raymond Carver short story. The minimalism is the point, really. I *do* think that the sort of incremental, well-considered extensions to Atom that the folks at Google and elsewhere are throwing out there for folks to talk over may get us our best solution for the aggregated resource issues/problems that I regularly face.
OAI-ORE will likely find its place as well, and I am sure it'll be a superb spec. But the world does not necessarily need Atom++. I think we really need RDF/SemWeb--. That is, a clear, simple, graph-based model with a limited, extensible serialization format (XHTML/RDFa, RSS 1.0 come to mind) that scales down to the simple case as readily and easily as it scales up to the most complicated case.
I hope that does not sound overly critical/harsh. I would not even have the nerve to voice my opinions except that I am a librarian/programmer working in higher ed. and would dearly love to see a good spec, because I know I would put it to good use.
For an 'abstract' data model, I agree that we _do_ rely heavily on RDF, and it could easily be expressed in a more technology neutral fashion. For example, GraphML is another XML based graph serialisation which does not make the assumptions that RDF does, and I'm sure that there are many others. However we're playing the web game, and the web world that expresses relationships between resources identified by URIs uses RDF to do it.
A graph approach permits easy cross-referencing between any and all objects in the model, which in my opinion is a crucial aspect in today's web-enabled, mashed-up, linked data world. A hierarchical system, which we would get in a strict XML view, doesn't accurately reflect the much more complex real world of aggregated, nested, cross-referenced and re-aggregated resources.
RDF is unfortunately closely associated with the semantic web. If people don't buy into the semweb ideal (and I don't), then they likely throw out the proverbial RDF baby (a mistake I'm guilty of in the past). I suspect that the one global graph of absolutely everything view has engendered the "only academics like rdf" view in the original comment?
We could, although we all know the reasons why we shouldn't, invent a brand new XML serialisation for ORE that closely models the current ADM. Even just a skeleton framework would be sufficient with slots for any arbitrary XML that communities need to describe their compound objects. However look at METS and DIDL -- in not playing the web game, they've been relegated to the closed gardens of their respective standards' worlds, a fate that we hope ORE will avoid.
The format-du-jour seems to be Atom, *and* it seems to map reasonably well to the sorts of things we want to describe. Although some of the mappings are a bit strange, they are all more constrained definitions on top of the base Atom specification. For example, did you know that atom:subtitle "conveys a __human-readable description__ or subtitle for a feed"[2, emphasis added]? Not really what most people would think of when they saw the term 'subtitle'. As we've not drastically re-purposed any of the base atom elements, when you put an Atom resource map into a regular Atom feed reader, you get perfectly understandable results. Better results, one might argue, than putting in a Google API atom with its arbitrary extensions, as all the critical information we want to convey is directly in the correct atom element.
The 'properly' abstract data model (eg with no reference to enabling technology) seems to me to be something like:
Resources are 'things' identified by URIs. There exists a type of Resource called an Aggregation, which is a collection of 1 or more other Resources. These Resources are thus known as Aggregated Resources. Aggregations are described by one or more Resource Maps, which have exactly one serialization each. All of these resources can have 0 or more Agents (another type of Resource) associated with them, with roles such as a creator or contributor. All of the resources can also have metadata associated with them, such as time of creation, title, category or classification information, rights statements and so forth. There is a final class of Resource called a Proxy, which represents an Aggregated Resource as it appears within an Aggregation, but this is optional.
The described Resources are related to each other in a graph, in order to permit nesting of Aggregations, the provenance of Aggregated Resources, references between Resources and to avoid unnecessarily repeating information (for example metadata about common Agents).
And that's it. The rest of the data model is simply expanding on the above.
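As a rough sketch, the summary above could be written down in a technology-neutral way along these lines. The class and field names (including `proxy_for` and `proxy_in`) are illustrative choices made for this sketch, not normative definitions from the ORE documents, and the example.org URIs are made up.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Resource:
    """A 'thing' identified by a URI, with optional metadata and agents."""
    uri: str
    metadata: Dict[str, str] = field(default_factory=dict)
    agents: List[Tuple[str, "Agent"]] = field(default_factory=list)  # (role, agent)


@dataclass
class Agent(Resource):
    """A Resource acting in a role such as creator or contributor."""


@dataclass
class Aggregation(Resource):
    """A collection of one or more other Resources."""
    aggregated: List[Resource] = field(default_factory=list)

    def is_valid(self) -> bool:
        # The summary asks for at least one Aggregated Resource.
        return len(self.aggregated) >= 1


@dataclass
class ResourceMap(Resource):
    """Describes exactly one Aggregation (serialization not modelled here)."""
    describes: Optional[Aggregation] = None


@dataclass
class Proxy(Resource):
    """Optional: an Aggregated Resource as it appears within one Aggregation."""
    proxy_for: Optional[Resource] = None
    proxy_in: Optional[Aggregation] = None


alice = Agent(uri="http://example.org/people/alice")
page1 = Resource(uri="http://example.org/article/page1")
page2 = Resource(uri="http://example.org/article/page2")
article = Aggregation(
    uri="http://example.org/article",
    metadata={"title": "An example article"},
    agents=[("creator", alice)],
    aggregated=[page1, page2],
)
rem = ResourceMap(uri="http://example.org/article.rem", describes=article)
print(article.is_valid(), [r.uri for r in rem.describes.aggregated])
```

Because the objects refer to each other by reference, the same Agent or Resource can appear in several Aggregations without being repeated, which is the graph aspect described above.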
An alternative 'proper' XML serialization might be along the lines of the following:
By the way, if you like the above, you're 99% of the way to liking RDF/XML! You can replace ore:uri and ore:ref with rdf:about and rdf:resource, add in a few extraneous wrapping elements, put an rdf:RDF element around it and ta da, it's perfectly valid RDF/XML.
If you don't like the above (be honest!), then I'd be extremely interested in seeing how you *would* serialize an ORE resource map in XML, as the above seems a very straightforward construction to me. As hierarchical as possible, and as simple as possible in terms of not inventing new elements.
-- Rob Sanderson [*]
[*] Obligatory Disclaimer: I'm an editor for the ORE spec, and hence extremely biased.
"For an 'abstract' data model, I agree that we _do_ rely heavily on RDF, and it could easily be expressed in a more technology neutral fashion. For example, GraphML is another XML based graph serialisation which does not make the assumptions that RDF does, and I'm sure that there are many others. However we're playing the web game, and the web world that expresses relationships between resources identified by URIs uses RDF to do it."
i entirely disagree. if you indeed played the web game, you would define your data model in some model suiting your needs, and then you would come up with a good representation using a web technology, such as XML (http://www.oreillynet.com/xml/blog/2008/05/bad_xml.html has more about this).
what you are playing is the "semantic web" game, which is a pretty different game. it is your choice to do this, but to say that "web" and "semantic web" are the same is just not true. on the contrary, the term "semantic web" mostly is very clever marketing for something that does not connect with the web all that much (except that URIs are used for identification). this is similar to "web services" (in their SOAP flavor), which also simply use the web as a transport infrastructure and then do their own thing, still claiming to somehow be part of the "web game". i think both terms ("semantic web" and "web services") are more about clever branding than anything else.
thanks a lot for the "executive summary" of the data model, i really liked it! i think this or something very similar to it should be the central part of the "abstract data model" document, because it is the abstract data model.
about the question whether i like RDF/XML or not: it can be seen as a straightforward serialization of a graph. what i think makes it really problematic is the myriad of possible serializations of the exact same set of triples that RDF/XML allows because of various language constructs. this makes it very hard to process anything in RDF/XML using XML tools, without implementing quite a lot of normalization code. on the other hand, a well-designed XML syntax might look eerily similar to RDF/XML, but if it is well-designed, then there are far fewer ways in which the exact same data can be serialized, making the data much easier to process using XML tools.
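As a small illustration of the normalization problem, the sketch below feeds two RDF/XML documents that encode the same single triple to a plain XML parser. The example.org URI and the function names are assumptions made up for this example. One document uses a property element, the other a property attribute; a naive XML lookup finds the title in only one of them.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"

# Same triple, written with a property element ...
FORM_A = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                     xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://example.org/agg">
    <dc:title>Example</dc:title>
  </rdf:Description>
</rdf:RDF>"""

# ... and with a property attribute.
FORM_B = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                     xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://example.org/agg" dc:title="Example"/>
</rdf:RDF>"""


def naive_titles(xml_text):
    """Look only for dc:title child elements, as a simple XML tool would."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter(f"{{{DC}}}title")]


def normalized_titles(xml_text):
    """Also accept dc:title written as an attribute (one extra case of many)."""
    root = ET.fromstring(xml_text)
    titles = [el.text for el in root.iter(f"{{{DC}}}title")]
    for el in root.iter():
        value = el.get(f"{{{DC}}}title")
        if value is not None:
            titles.append(value)
    return titles


print(naive_titles(FORM_A), naive_titles(FORM_B))
print(normalized_titles(FORM_A), normalized_titles(FORM_B))
```

This handles just two of the forms RDF/XML permits; typed node elements, nested descriptions and the other abbreviations would each need their own case, which is the normalization code referred to above.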
"The format-du-jour seems to be Atom, *and* it seems to map reasonably well to the sorts of things we want to describe."
Forgive me for saying so, since I really appreciate the hard work you've all done on OAI-ORE, but this is exactly the problem. The same attitude, held by the SOA'ers that HTTP was the "protocol-du-jour", is what got us into the heavyweight WS* mess. You need to play by the rules of Atom, or it is beside the point (why not RSS 1.0, which seems much more appropriate?). There is a very deliberate approach at work there, and if one looks at discussions surrounding Atom you will see exactly the sorts of things that OAI-ORE is trying to do rejected out of hand. Not, to be clear, that they might be implemented later, but rather not in keeping with the goals of Atom. This is not to say that Atom cannot be used in ways far different from its blog-centric origins. Here's the best description, from Tim Bray:
"Suppose you think of your data as a list of, well, anything: stock prices or workflow steps or cake ingredients or sports statistics. Atom might be for you. Suppose the things in the list ought to have human-readable labels and have to carry a timestamp and might be re-aggregated into other lists. Atom is almost certainly what you need. And for a data format that didn’t exist a year ago, there’s a whole great big butt-load of software that understands it." Tim Bray, "Don't Invent XML Languages"
Lists and "lists of lists". That's what Atom is good at. Graphs, not so much. If the ORE use cases can be decomposed into lists & lists of lists, Atom could be a fine fit. But as a "cover" for RDF, there'll always be dissonance.
"Better results, one might argue, than putting in a Google API atom with its arbitrary extensions..."
Actually, the Google extensions are quite well thought out and will certainly influence the standardization of extensions for Atom. While they may have jumped the gun a bit w/ some when GData was first released, their current approach is to write an Internet Draft (see http://googledataapis.blogspot.com/2008/06/atompub-multipart-media-creation.html) and let the community discuss and refine it. They've already begun incorporating the community-developed OAuth as an alternative authorization mechanism. There is *real* engagement with real standards communities. I am surprised/disappointed to see the library community take a seemingly superficial approach to the Atom standard. I've not seen any engagement by the OAI-ORE technical committee on the relevant Atom working group mailing lists, which is a bit of a surprise, since the Atom folks seem to be ready and willing to help, advise, critique, etc.
"Tim Bray: 'Don't Invent XML Languages'"
http://www.tbray.org/ongoing/When/200x/2006/01/09/On-XML-Language-Design is one of the things i have all my students read. on the other hand, it does not say "don't invent", it says "carefully check whether you really need to do so". this is quite a difference.
i think that the scenario of compound documents is one of the really important building blocks of the web, but so far no proposal has been successful (this is basically what my original blog post said). the abstract model of compound documents could be quite simple, and as rob showed, it actually is quite simple.
i think rather than saying "rdf is en vogue" and "atom is en vogue" and trying to somehow use both and produce a weird mix that probably makes nobody entirely happy, this would be a good opportunity to actually create a simple and well-designed new XML language.
i don't think the use cases underlying atom's design are a very good match for compound documents. you can tweak atom to somehow do it, but you have to work around atom's peculiarities (such as ordering by time and required authors), and you don't gain all that much (because atom at least in its current form does not give you a lot of support for treating collections as an interconnected graph of resources). and by throwing RDF into that to solve the graph problem, you eliminate the capability to use plain XML tools on the representation.
i am almost certainly repeating myself here, but i really don't want to discredit the overall need for a compound document format or the abstract OAI-ORE model, but i think the language design is not all that great, and definitely (as suggested by peter) could have been improved by a more open approach.
back in november 2007, when i first heard of OAI-ORE and saw the first documents and RDF all over the place, i asked about what caused that important design decision to create a semantic web standard rather than a web standard; i quickly got the distinct feeling that questions of this kind were not very welcome. RDF was just what it had to be, out of general principle.
so far, i still find the arguments as to why RDF was chosen not convincing. they usually are something like "that's what people are doing these days", and i really think that this is a misconception. there is an (arguably big) community of semantic web users and advocates, but they are just a tiny fraction when compared to the overall population of people using web standards.
if you think you want to cater for the semantic web community, define an XML language, and define an RDF vocabulary in your schema language of choice, so that the data model has representations in both XML and RDF. but make the XML model the normative model for data exchange, so that all processing can always be done in XML.
As a short update to this thread, and because it nicely illustrates that Atom might not be a perfect or natural fit for representing compound documents:
carl: i still owe you a more detailed reply than the one i gave last week... my apologies for the delay!
as always in any debate involving XML and RDF, there is a lot of finger pointing, and the usual accusations of fundamentalism. my main concern was and is, and i think this has nothing to do with XML vs. RDF, that any specification should have a technology-agnostic data model, and i am pretty confident that such a thing is implicit in the OAI-ORE specification, but unfortunately it is not made explicit. what is made explicit is a technology-specific representation of the data model, which already implies a choice of technology (without ever making it clear why this choice has been made). it is this approach of mixing the data model and the technology for representing it which i think is most problematic in OAI-ORE.
in our recent "XML Fever" article http://www.oreillynet.com/xml/blog/2008/06/xml_fever.html we look at exactly those problems: how it happens that technologies are somehow chosen because of perceived advantages, rather than making a clear distinction between the application semantics and the associated data model, and a representation for that data model.
i really think that OAI-ORE is not doing itself a favor with the current representation for the data model. and i really think that the Web needs a format for compound documents. OAI-ORE's data model could be it, but in its current form, it is a solution for the semantic web, and even semantic web folks are not entirely happy (as you write) because of the technology mix.
but let me address your comments:
1. i agree: the model has to support non-tree graphs (DAGs, i guess).
2. i agree in principle. but if you use RDF/XML (which really is a bad design for graph representation in XML) as an example why you don't want to create your own graph serialization in XML, and then use RDF, which in the end for most users will mean they have to process RDF/XML, i really don't see the benefit of all of this. the benefit only comes into play if you assume that most users will use semantic web tools (in which case they will never have to deal with RDF/XML), but then you are playing in the semantic web space.
3. i agree that RDF lite (RDF without all the fancy AI stuff) can be seen as a nice and simple graph model. but even with RDF lite, things can get pretty complicated in RDF/XML because there are so many possible serializations of the same RDF lite data. my view of this is that even though RDF lite is a possibility, this does not translate into something lightweight on the XML layer. the only exception would be to restrict RDF/XML drastically and to only allow certain serializations and disallow others; but this would be very problematic because then the serialization produced by some tool might be perfectly valid RDF/XML, but might not satisfy your RDF/XML lite subset.
looking at the examples you are pointing to, i still cannot figure out how i can get access to the complete data model describing a compound document with the average non-RDF toolbox of web technologies. one minor remark regarding the use of atom: using URIs as @term is valid atom, but i would suggest to only put the terms in there (without the scheme prefix), and to encode the scheme only in the @scheme.
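The suggestion about `atom:category` can be sketched as follows. This is an illustration only: the scheme URI `http://example.org/terms/` and the helper names are hypothetical and not taken from the ORE documents. The idea is to put the bare term in `@term` and the vocabulary URI in `@scheme`, instead of a full URI in `@term`.

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"


def split_term(term_uri, scheme):
    """Strip the scheme prefix from a full term URI, if it is present."""
    if term_uri.startswith(scheme):
        return term_uri[len(scheme):]
    return term_uri


def make_category(term_uri, scheme):
    """Build an atom:category with a bare term and a separate scheme."""
    el = ET.Element(f"{{{ATOM}}}category")
    el.set("term", split_term(term_uri, scheme))
    el.set("scheme", scheme)
    return el


# Hypothetical vocabulary URI, for illustration only
SCHEME = "http://example.org/terms/"
category = make_category("http://example.org/terms/Aggregation", SCHEME)
print(ET.tostring(category, encoding="unicode"))
```

A consumer that only understands plain Atom then sees a short string in `@term`, while a consumer that knows the vocabulary can still rebuild the full URI from `@scheme` plus `@term`.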
I think discussions like these are beneficial. But, I do disagree with your points about RDF/S.
1.) RDF/XML is a serialization format. One doesn't need to use it to represent one's ORE instance. One could use N3, N-Triples, Turtle, TriX, RDFa, etc. as serialized representations of an ORE model. Those included in the specification are just examples.
2.) RDFS is at its heart a data modeling tool; there are fairly useful mappings between it and other modeling tools such as UML/XMI. Thus, with these tools' generic mapping capabilities between such representations, round-tripping across various technological implementations of the ORE model expressed in RDFS becomes possible. Note that this also includes round-tripping into W3C Schema and alternative representations of the model in specific XML specifications.
I've been through one other abstract data model specification process, the Data Documentation Initiative's DDI 3.0 (abstract model definition in UML and technical implementation in W3C Schema), and consider the approach taken by the ORE architects to be the most sensible and sound approach. By using RDF to express the Abstract Data Model, they provide a clear delineation between the definition of a standard abstract model and its technical implementation in basic serialized RDF, Atom, and (place your favorite format here).
I would recommend that exploring a definition in UML Classes mapped from the RDFS would be an interesting and beneficial practice, as it would express the mapping needed by software developers to walk the shortest path from the specification to actual working software implementations. I think this would not be difficult with existing modeling tools available today and reflects how the current choice of representation in RDFS of the abstract model is the most appropriate choice for accomplishing that which is "Data Modeling".
Developer and Systems Manager : MIT Libraries
Developer and Committer : DSpace
mark: thanks a lot for your comment! here are my thoughts:
1) when designing for loose coupling, emphasis should be on designing a good representation that people can work with easily (parsing it and generating it). if you have a good service with a great data model backing it but people have a hard time adjusting to the representations they are supposed to parse and generate, the service will not be used as much as it could be.
as an example, i had a job to process some data coming out of a system that used a UML-based model for geographic data. the XML used a schema that was generated from the UML diagrams. while it technically contained all the bits needed to fully understand the data, it took a huge amount of time to basically write a library that was able to reconstruct the UML-level data from the very poorly designed (in fact, not designed at all, just generated by using some 1-click feature in some modeling tool) XML representation. to me, in these scenarios the emphasis is not on supporting XML users, but on supporting UML-based applications, and XML is just a dump format from some UML-based software. this works fine for UML-based pieces of software supporting the same "XML dump format", but makes your life really difficult as somebody with a different programming environment.
2) this is exactly what i am talking about. you can always map models using generic mappings, there is no doubt about that. but have you ever tried to work with these generated models in a native way? it gets very difficult, because instead of designing a vocabulary in the language that works for everybody (assuming that is why in the end you are interested in XML), you design it in some other model and then just generate the vocabulary. working with XML generated from UML models is about as unpleasant as working with XML generated from RDFS models: technically speaking, it is possible (as you point out), but you'll need surprisingly complex libraries to make it work robustly, and this is only for parsing data. try generating XML data for a vocabulary generated through some generic mapping from UML or RDFS: it is really hard. not impossible, but really hard.
i think i did it already but i'll once again point to http://www.oreillynet.com/xml/blog/2008/05/bad_xml.html and would like to point out that this translates pretty well to UML and RDFS (even though it's about OOXML, at least partly). for complex application scenarios, it might make sense to declare XML as inappropriately simplistic and settle for a data model that is located at a higher level, treating XML as a dump format for that level; for a data model describing compound documents, i don't think this is necessary.
FWIW, I have put a more extended set of thoughts on the whole OAI-ORE/Atom thing at http://blogs.law.harvard.edu/pkeane/2008/06/26/oai-ore-atom/
this one just came in from firstname.lastname@example.org:
AtomTriples: Embedding RDF Statements in Atom
Mark Nottingham and Dave Beckett (eds), IETF Internet Draft
A version -00 Internet Draft for "AtomTriples: Embedding RDF Statements in Atom" has been published through the IETF process. The specification describes AtomTriples, a set of Atom (RFC 4287) extension elements for embedding RDF statements in Atom documents (both element and feed), as well as declaring how they can be derived from existing content.