Semantic Web... of what data?

by Bob DuCharme

I love RDF. It's great for implementing linking technology because it lets you specify typed relationships between addressable resources. It even lets you specify attributes of the relationships themselves. It's also great for many new classes of database applications, because of the unstructured way it lets you accumulate property values and the ease with which distributed collections can be aggregated.
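
To make those three points concrete, here is a minimal sketch in plain Python, with no RDF library involved. It treats each fact as a (subject, predicate, object) triple. Every URI and property name in it (the example.org namespace, "cites", "addedBy", and so on) is a made-up placeholder, not a real vocabulary, and describing a link as its own resource is only a rough stand-in for RDF reification.

```python
# Each fact is a (subject, predicate, object) triple.
# All URIs and property names are hypothetical placeholders.
EX = "http://example.org/"

site_a = {
    (EX + "doc1", EX + "cites", EX + "doc2"),        # a typed link between resources
    (EX + "doc1", EX + "title", "Linking with RDF"),
}
site_b = {
    (EX + "doc2", EX + "title", "Typed relationships"),
    (EX + "doc1", EX + "cites", EX + "doc2"),        # same fact, published elsewhere
}

# An attribute of the relationship itself: describe the link as a resource
# (roughly the idea behind RDF reification), then hang a property on it.
link = EX + "link1"
about_the_link = {
    (link, EX + "subject", EX + "doc1"),
    (link, EX + "predicate", EX + "cites"),
    (link, EX + "object", EX + "doc2"),
    (link, EX + "addedBy", "example-editor"),
}

# Aggregating distributed collections is a set union; duplicate facts collapse.
merged = site_a | site_b | about_the_link


def objects(graph, subject, predicate):
    """Return every object asserted for a subject/predicate pair, sorted."""
    return sorted(o for s, p, o in graph if s == subject and p == predicate)


print(len(merged))
print(objects(merged, EX + "doc2", EX + "title"))
print(objects(merged, link, EX + "addedBy"))
```

Neither collection had to know about the other, and neither needed a schema agreed on in advance. A real application would use an RDF toolkit rather than bare tuples, but the merge step would look much the same.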

Discussions of RDF often mention the Semantic Web as well. The first mention I can find of the two of them together is in a February 1999 Society for Technical Communication article by Tim Berners-Lee. Five and a half years later, we see lots of Semantic Web talk, tools, and FOAF files, but outside of the FOAF files, where's the data upon which this web will be created? Where are these machine-readable facts that will be linked into this Semantic Web? (I'm talking about a Semantic World Wide Web here. If you're using Semantic Web tools to manage RDF data on your company intranet or on your personal local storage, I'm happy to hear RDF success stories, but the Semantic Web vision I've always heard about described connections between widely dispersed data contributed by systems that were unaware of each other. In other words, one big Semantic Web, not a collection of smaller, unconnected Semantic Webs. This requires data to be available on the public internet.)

RSS files as we know them can't play much of a role in any web, because their data is too transient. Yes, use cases exist for the value of transient data, such as looking up movie times, but a format designed to notify one system about new resources available on another system isn't the best way to do this, and people aren't doing it anyway. With very few exceptions, such as Monkeyfist and the Center for Science in the Public Interest, sites don't even archive their RSS files. If a feed holds ten items, then after the next ten appear, all the data currently in that feed will be lost.
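
The ten-item arithmetic is easy to simulate. The sketch below is not how any real feed generator works; it just models a feed as a fixed-size window and compares it with a hypothetical archive that keeps everything. The item ids and titles are invented for illustration.

```python
from collections import deque

FEED_SIZE = 10  # the ten-item feed from the example above

feed = deque(maxlen=FEED_SIZE)  # what the site publishes right now
archive = {}                    # what an archiving site would keep, keyed by item id


def publish(item_id, title):
    """Add an item to the live feed; once full, the oldest item drops off."""
    feed.append((item_id, title))
    archive.setdefault(item_id, title)


for n in range(1, 21):          # twenty hypothetical items
    publish(f"item-{n}", f"Entry {n}")

live_ids = [item_id for item_id, _ in feed]
print(live_ids[0], live_ids[-1])  # only the newest ten remain in the feed
print(len(archive))               # the archive still holds all twenty
```

After the second batch of ten, nothing from the first batch is reachable through the feed itself. Only a site that keeps something like the archive can still serve that data.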

But enough of my complaining. I decided to really look for publicly available RDF and to accumulate a list. When I saw that the domain name wasn't taken, I couldn't resist grabbing it. With some help from the rdf-interest mailing list, some Google tricks, and a wiki page, I've accumulated an initial list. I try to spend some time each day searching for new entries, and I hope to see more suggestions added to the wiki page.

The site includes an RSS feed to notify people about new entries, and you can download all of its entries as a single RDF file. While entries that point directly to RDF files are distinguished from the rest, most entries point to HTML files and directory listings that include links to multiple RDF files. In many cases, the RDF files are zipped or gzipped, making them a little less useful for a live Semantic Web, but any large collection of publicly available RDF helps.

FOAF files tend to be small, and a list of individual FOAF files on the site would be redundant with other lists out there, so the site points to the existing lists instead of to individual files. I'm mostly interested in collections of RDF that weigh in at 90K or greater. (If we're interested in the semantic content of these RDF files, then RSS files bulked up to that size by CDATA sections don't really count. When you tell an XML parser "don't treat any of this as structural markup," which is what CDATA delimiters do, then that section has little if any semantic value in the context of that document.)
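
The CDATA point can be checked with any XML parser. Here is a small sketch using Python's standard xml.etree.ElementTree. The element names ("item", "escaped", "structured") and the example.org link are invented for the demonstration and don't come from any RSS specification.

```python
import xml.etree.ElementTree as ET

# The same content twice: once hidden inside CDATA, once as real markup.
doc = """<item>
  <escaped><![CDATA[<b>Semantic</b> <a href="http://example.org/">Web</a>]]></escaped>
  <structured><b>Semantic</b> <a href="http://example.org/">Web</a></structured>
</item>"""

item = ET.fromstring(doc)
escaped = item.find("escaped")
structured = item.find("structured")

# Inside CDATA the parser sees one opaque string and no child elements.
print(len(escaped))
print(escaped.text)

# Without CDATA the same characters are structure the parser can navigate.
print([child.tag for child in structured])
print(structured.find("a").get("href"))
```

The link target in the first element is still there for a human reader or for a second parsing pass, but as far as this document's own structure goes, it is just characters.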

Perhaps it's a bit bombastic to assert that the September 2004 RDF Semantic Web is little more than talk, tools, and FOAF files, but I don't see a whole lot of data outside of those FOAF files that can be used by those tools. I'd love to be proven wrong. Show me the data! I want URLs. Add some to the wiki, and I'll move them to the collection. Then, hopefully, the list of resources will grow large enough that people will easily find plenty of machine-readable data to work with as they build a real Semantic Web.


2004-10-01 00:24:25
No semantic web without shared ontology?
Even if there were collections, what use would they be? Just because I have a bunch of statements and you have a bunch of statements does not mean that a computer program can reconcile them. Your collection's truth criteria may be different from mine. Your categorization may be different. My data may be old.
2004-10-01 05:12:40
No semantic web without shared ontology?
There are plenty of difficulties with the advanced use cases of the semantic web, like the truth criteria you describe. I think (hope) there will be enough low-hanging fruit to implement more mundane applications, like looking up movie times to match against one's personal calendar. If the three movie chains that have movie theaters in my town use three ontologies to describe movie times, I'd have to code more to write this little app than I would if they all used a shared ontology, but it's not insurmountable.

For now, I see collections like the RDF versions of the CIA World Factbook and the Stanford TAP knowledge base, which are large collections of simple, straightforward facts, as the most significant contributions toward the possibility of RDF-based apps to aggregate information to create new knowledge, even if it's just slice-and-dice aggregations of subsets of that knowledge.
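
As a rough idea of what "code more" means in the movie-times case, here is a sketch in plain Python. Everything in it is hypothetical: the three chains' property names stand in for three different ontologies, and the showtimes and the calendar gap are invented.

```python
# Hypothetical showtime records from three chains, each using its own
# property names (stand-ins for three different ontologies).
chain_a = [{"a:film": "Movie X", "a:startsAt": "19:30"}]
chain_b = [{"b:title": "Movie X", "b:showtime": "21:00"}]
chain_c = [{"c:movieName": "Movie Y", "c:begins": "20:15"}]

# The extra code: one mapping per vocabulary onto the names the app uses.
MAPPINGS = {
    "a": {"a:film": "title", "a:startsAt": "time"},
    "b": {"b:title": "title", "b:showtime": "time"},
    "c": {"c:movieName": "title", "c:begins": "time"},
}


def normalize(records, mapping):
    """Rename each record's known properties to the app's own names."""
    return [
        {mapping[key]: value for key, value in record.items() if key in mapping}
        for record in records
    ]


showtimes = (
    normalize(chain_a, MAPPINGS["a"])
    + normalize(chain_b, MAPPINGS["b"])
    + normalize(chain_c, MAPPINGS["c"])
)

# Hypothetical free slot in a personal calendar. Zero-padded HH:MM strings
# compare correctly as plain text.
free_from, free_until = "19:00", "20:30"
matches = [s for s in showtimes if free_from <= s["time"] <= free_until]
print(matches)
```

With a shared ontology the MAPPINGS table would disappear. Without one, it grows by one entry per source, which is tedious but manageable for facts this simple.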

2004-10-02 04:49:43
No semantic web without shared ontology?
What use are collections of data? The same questions and problems exist irrespective of the data model or format. Having a shared framework is a very good start to being able to make use of the data.

With the RDF model there's a chance of expressing the truth criteria and categorization, along with rules regarding age. There's a spec-based route to doing as much or as little as needed for creating shared definitions (i.e. schemas, ontologies). Processing is immediately available through the inference logic of RDF/OWL using general-purpose tools; alternatively, special-purpose processing can be layered on the framework. Reconciliation might still be difficult, but at least it's possible.

2004-10-02 14:06:03
Feeds and transience
We might see more persistently stored data in feeds once Atom (both the protocol and the format) becomes more widespread (cf. AtomWiki et al.).

(I do know that this all relies on speculation, yes.)