Platform Independent

The Semantic Web: It's Whom You Know


Even though we experienced plenty of information overload before the Web, hearts were racing throughout the mid-1990s simply because the Web was making so many resources available. Now, of course, our affections have gotten a bit bruised and we've confronted the Web with the need to help us sort and organize those resources as well.

Hence the concept of the Semantic Web. Using XML and other recently developed technologies, authors and designers formally tag text and objects so that automated agents can offload some of our information overload.

If done right, the Semantic Web would be accompanied by widespread knowledge management. This means, for instance, that if you started a Web search with the goal of simplifying your programming process, you would be directed to various design concepts such as real-world modeling, and thence to specific techniques such as inheritance and subclassing. A rather abstract overview of knowledge management can be found in the February 2002 issue of Communications of the ACM, under the term "ontology."

I used to have high hopes for knowledge management; in one article I even suggested (admittedly tongue in cheek) that library science would become the next hot job category. And there's no reason it wouldn't, if the Semantic Web really depended on the formal organization of information. Why hasn't this happened? And why are there so few Web pages in XML or applications that handle them? Will the SOAP/Schema/XSLT/RDF syndicate succeed in transporting us safely across the ocean of information?

I would like it to be so, but now I wonder. As our long-tossed boat approaches the semantic shore, I can sense danger hidden in the vegetation.

Problems With Current Notions

The more you delve into formalizing a semantic system, the more complex it gets. Order recedes like the proverbially chased rainbow.

We may be able to design a system that distinguishes a "child" or "class" as a computer concept, so you don't come across Web pages for Montessori schools while performing your search for programming techniques. But even in the field of programming there are many "children."

A radio button is the child of a generic button, in the sense that you create a subclass of the generic button class in order to design your radio button. But the radio button is also the child of the form that contains it in the graphical design. I'm sure that the erudite knowledge managers have found plans for handling such complexity, but I don't know whether search-engine makers can carry out those plans or whether users know how to derive benefit from the complex search engines that would result.
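To make the ambiguity concrete, here is a minimal Python sketch of the two senses of "child." The class names (Button, RadioButton, Form) are hypothetical and stand in for whatever a real GUI toolkit would provide.

```python
class Button:
    """A generic button (hypothetical class for illustration)."""
    def __init__(self, label):
        self.label = label


class RadioButton(Button):
    """A child of Button in the inheritance sense."""
    def __init__(self, label, selected=False):
        super().__init__(label)
        self.selected = selected


class Form:
    """A container whose children are the widgets placed inside it."""
    def __init__(self):
        self.children = []

    def add(self, widget):
        self.children.append(widget)
        return widget


form = Form()
radio = form.add(RadioButton("Subscribe"))

# "Child" by inheritance: RadioButton is a subclass of Button.
print(issubclass(RadioButton, Button))  # True
# "Child" by containment: the same object sits inside the form.
print(radio in form.children)           # True
# The two relations are independent: a Form is not a kind of Button.
print(issubclass(Form, Button))         # False
```

A search engine that indexed "child of" would have to keep these two relations apart, and a user would have to know which one to ask for.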

And even though there's some benefit from knowledge management, what does our current concept of the Semantic Web have to offer? A few keywords attached to a document can help me decide if it's relevant to my search, but what do I get from all the other complicated tagging we're expected to do? I sense that most people will do an informal cost/benefit analysis and just utter a semantically significant "No."

I like reading plays, but what good does it do me if I'm able to search for an instance of "stage direction"? How often do I say, "I want to refer to paragraphs 4 through 7 of the second subsection of this document"? That's nice to do for fair-use quotation, but seriously trying to assemble documents out of subsections of other documents will invoke all the dangers of quoting someone out of context. (Observers have expressed concern over this problem for many years, even back when Ted Nelson was talking about Xanadu.)
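For readers who have not seen such tagging, the following sketch shows roughly what a search for stage directions might look like. The tag names (scene, stagedir, speech) and the fragment of dialogue are invented for illustration and do not come from any particular markup standard; the parsing uses Python's standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET

# A made-up fragment of a play, tagged by hand.
play = """
<scene>
  <stagedir>Enter two messengers.</stagedir>
  <speech speaker="FIRST MESSENGER">The ships have landed.</speech>
  <stagedir>Exit.</stagedir>
</scene>
"""

root = ET.fromstring(play)

# Collect the text of every element tagged as a stage direction.
directions = [element.text for element in root.iter("stagedir")]
print(directions)  # ['Enter two messengers.', 'Exit.']
```

The query is trivial once the tags exist; the cost lies in someone marking up every line of every play.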

Some of us remember from high school what "semantics" used to mean and how it differed from "syntax." (I shouldn't depend on what I learned in high school to write technical articles, but hey, it was a pretty good high school.)

"Syntax" explained nitpicking stuff like why one should properly say "it's whom you know" instead of "it's who you know." In contrast, "semantics" applied to more interesting issues such as the presence of different verbs in Romance languages for knowing information and knowing people, so that the quip "it's not what you know but whom you know" would be not so notable in one of those languages.

What would the Semantic Web really entail to be successful? It would consist of reducing semantics to syntax. The complexities of whatever or whomever you want to know would become formalized in tags. The most subtle areas of knowledge would become subject to the syntactic experience of parsing and tree structure.

I don't think it can happen. That's why censorware, for instance (programs that claim to recognize Web pages with undesirable sexual or political content), has been, is, and always will be execrable.

In short, I think semantic tagging and Web services are useful for certain business applications and other areas where interactions can be formalized, but they aren't going to create a completely new way of using digital information.

I will not retreat, however, into the simplistic canard that says, "Technology can't solve problems." I believe that technology can solve problems--or rather, that it can be part of a solution. We should look for ways that technology can augment natural human activities. Let's try going in an entirely different direction and see whether some other kind of organization holds promise.

Popularity Ranking and Collaborative Filtering

Google has been widely praised for its PageRank technique of ordering pages by the number of links other people make to them:

In essence, Google interprets a link from page A to page B as a vote, by page A, for page B. But, Google looks at more than the sheer volume of votes, or links, a page receives; it also analyzes the page that casts the vote. Votes cast by pages that are themselves "important" weigh more heavily and help to make other pages "important."
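The voting idea in that description can be sketched in a few lines of Python. This is a simplified textbook-style iteration, not Google's actual implementation; the damping factor of 0.85, the iteration count, and the six-page link graph are all assumptions chosen for illustration.

```python
def rank_pages(links, damping=0.85, iterations=50):
    """Score pages iteratively; `links` maps each page to the pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    scores = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_scores = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its vote among the pages it links to.
                share = damping * scores[page] / len(targets)
                for target in targets:
                    new_scores[target] += share
            else:
                # A page with no outgoing links spreads its vote evenly.
                share = damping * scores[page] / n
                for target in pages:
                    new_scores[target] += share
        scores = new_scores
    return scores


# Hypothetical web: A and B each receive exactly one link, but A's
# voter (C) is itself linked from E and F, while B's voter (D) is not
# linked from anywhere.
links = {"E": ["C"], "F": ["C"], "C": ["A"], "D": ["B"]}
scores = rank_pages(links)
print(scores["A"] > scores["B"])  # True
```

A and B have the same number of incoming links, yet A ranks higher because the page voting for it is itself more "important."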

The Teoma team has recently claimed they are extending this popularity ranking even further by determining (through knowledge-management techniques) which pages are especially relevant for ranking other pages.

Subject-Specific Popularity ranks a site based on the number of same-subject pages that reference it, not just general popularity, to determine a site's level of authority.

The designers of these systems have cleverly recognized and programmatically adapted an old strategy for recognizing intellectual contributions. Academics have long been judged by the number of citations of their works that appear in respected journals. It's quite reasonable to assume that if a lot of people find a work worth mentioning, it's a significant work.

This strategy does not replace knowledge management, of course. They work well together. (That is, semantic tagging can tell you whether something's in your ballpark and popularity ranking can indicate how fast it was hit.)

But popularity ranking is still a very crude start toward assigning value to Web pages. Even if would-be sophisticates around the country think that the NPR show All Things Considered is a significant source of news, that doesn't mean I have to think so. In any field you choose, you can probably find someone who is widely acknowledged as an authority but who you think is a flabby thinker or barks up the wrong tree--too loudly.

There are many other reasons popularity ranking can be skewed, too: someone who happens to be first to raise a topic is likely to be noticed just for that accomplishment; someone who addresses a narrow technical audience may not get adequate recognition for his or her contributions; someone who's a lone nut may be referenced a lot just because no one else expresses the same ideas, etc.

So collaborative filtering appeals to me even more than popularity ranking. A collaborative filtering system lets you rate things (movies, books, politicians--anything) and compares your ratings to other people's ratings. If someone has movie preferences similar to yours, you are likely to agree on the next movie to come along.

Hidden in collaborative filtering is some subtle knowledge management. The system can't say, "You'll like A because you and she both liked B" unless the system knows that A and B are two instances of the same class.
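A minimal sketch of the idea in Python follows. It is one simple way to do collaborative filtering, not the method of any particular product: the similarity measure (based on the average gap between two people's ratings), the 1-to-5 rating scale, and the names and ratings are all assumptions made for illustration. Note that the code quietly assumes every key is a movie, which is the hidden knowledge management just described.

```python
def similarity(mine, theirs):
    """Return a value in (0, 1]: 1.0 means identical ratings on shared items."""
    shared = set(mine) & set(theirs)
    if not shared:
        return 0.0
    average_gap = sum(abs(mine[i] - theirs[i]) for i in shared) / len(shared)
    return 1.0 / (1.0 + average_gap)


def predict(ratings, person, item):
    """Guess `person`'s rating of `item` from like-minded people's ratings."""
    weighted_sum = 0.0
    total_weight = 0.0
    for other, their_ratings in ratings.items():
        if other == person or item not in their_ratings:
            continue
        weight = similarity(ratings[person], their_ratings)
        weighted_sum += weight * their_ratings[item]
        total_weight += weight
    # None means nobody comparable has rated the item yet.
    return weighted_sum / total_weight if total_weight else None


# Hypothetical ratings on a 1-to-5 scale.
ratings = {
    "you": {"Movie B": 5, "Movie C": 1},
    "she": {"Movie B": 5, "Movie C": 1, "Movie A": 5},
    "he":  {"Movie B": 1, "Movie C": 5, "Movie A": 1},
}

guess = predict(ratings, "you", "Movie A")
print(round(guess, 2))  # 4.33
```

Because your ratings match hers and clash with his, her opinion of Movie A counts five times as much as his, and the prediction lands near her rating.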

The Next Step Toward a Semantic Web?

While considering the successes and failures of a technology, it often helps to step back and look at how individuals solve information problems on their own, informally and with minimal technological support.

When I want to educate myself regarding a topic, my first step is to find a place where interested people congregate (it could be a mailing list, if I am doing my research in virtual mode) or a collection of useful documents. When I find people who impress me with their insights or who simply intrigue me with their points of view, I spend more time reading what they have to say and ask them for pointers to new material.

This technique uses affinity between individuals, as collaborative filtering does, but the individuals are actively seeking affinity rather than passively waiting for it to emerge from a collaborative filtering system.

Furthermore, once I know someone's interested in a topic, I tend to forward that person URLs or mail messages that I think will be valuable. This again is an active filtering system. It helps extend a past interest to future topics.

Most subtle, perhaps, is the way I discover new topics of importance by following what interests the people I respect. For instance, if I learn from someone's views in a particular software area and find that he's becoming obsessed over some piece of hardware, I decide that it's time to look into that hardware. I don't simply screen out this new information because it's different from the software area that we've always talked about.

How can we take this informal "whither thou goest, I too shall go" system and use computers to make it more efficient or comprehensive? Improvements might be something as simple as enhancements to current mail filtering programs.

A mailer could check whose postings you mark as valuable (or save in special folders), and automatically elevate new postings by those people so that your likelihood of seeing each one is proportional to your interest in the person posting. Such a system should be tuned somehow so that long-term posters don't have an unfair advantage in relation to new posters with interesting perspectives to offer.
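One possible shape for such a mailer is sketched below in Python. Everything here is hypothetical: the function names, the message format, and the addresses are invented, and the tuning is just one option. The sketch scores each sender by the fraction of postings you marked as valuable rather than by the raw count, and blends in a small assumed prior so that a handful of postings from a newcomer is neither ignored nor trusted blindly.

```python
def sender_interest(valuable, seen, prior_rate=0.2, prior_weight=2):
    """Estimate interest in a sender as a smoothed fraction of valued postings.

    prior_rate and prior_weight are assumed tuning knobs: they act like
    `prior_weight` imaginary postings, of which `prior_rate` were valued.
    """
    return (valuable + prior_rate * prior_weight) / (seen + prior_weight)


def order_inbox(messages, history):
    """Sort messages so that postings from the most valued senders come first."""
    def score(message):
        valuable, seen = history.get(message["sender"], (0, 0))
        return sender_interest(valuable, seen)
    return sorted(messages, key=score, reverse=True)


# Hypothetical history: sender -> (postings marked valuable, postings seen).
history = {
    "veteran@example.org": (30, 100),
    "newcomer@example.org": (3, 4),
}
inbox = [
    {"sender": "stranger@example.org", "subject": "Hello"},
    {"sender": "veteran@example.org", "subject": "Weekly notes"},
    {"sender": "newcomer@example.org", "subject": "A fresh angle"},
]

ordered = [message["sender"] for message in order_inbox(inbox, history)]
print(ordered)
```

With these numbers the newcomer, whose few postings you mostly valued, rises above the veteran despite the veteran's much larger raw count of valued postings.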

People could maintain favorite bookmarks on various subjects and open these lists of bookmarks up to other people who flatter them with their interest. Technology could help by making it easier to rate and categorize documents. (So knowledge management comes back into the picture.)

Technology could also identify new topics that are interesting to interesting people. Currently, you might realize that someone's on to something potentially exciting because he keeps mentioning it on a weblog or because he explicitly tells you to take a look at it. Perhaps the notification could be aided by an automated agent that told you, "So-and-so mentioned OS X four times this month."
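The agent imagined here could start out as little more than a counter. The Python sketch below assumes postings are available as (author, date, text) records and uses plain case-insensitive substring matching; the function names, the threshold of three mentions, and the sample postings are all invented for illustration.

```python
from collections import Counter
from datetime import date


def mentions_in_month(posts, topic, year, month):
    """Count mentions of `topic` per author among posts dated in the given month."""
    counts = Counter()
    for author, posted_on, text in posts:
        if (posted_on.year, posted_on.month) == (year, month):
            counts[author] += text.lower().count(topic.lower())
    return counts


def notifications(posts, topic, year, month, threshold=3):
    """Report authors who mentioned `topic` at least `threshold` times."""
    counts = mentions_in_month(posts, topic, year, month)
    return [
        f"{author} mentioned {topic} {count} times this month."
        for author, count in counts.items()
        if count >= threshold
    ]


# Hypothetical weblog postings.
posts = [
    ("So-and-so", date(2002, 3, 30), "OS X rumors."),
    ("So-and-so", date(2002, 4, 2), "Trying OS X on my laptop."),
    ("Someone else", date(2002, 4, 5), "Nothing new here."),
    ("So-and-so", date(2002, 4, 9), "OS X again: the terminal in OS X is solid."),
    ("So-and-so", date(2002, 4, 20), "More OS X notes."),
]

print(notifications(posts, "OS X", 2002, 4))
# ['So-and-so mentioned OS X 4 times this month.']
```

A real agent would need something smarter than substring matching to notice a topic nobody told it to watch for, which is where knowledge management would come back in.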

The potential for this new kind of Semantic Web calls for active exploration, not idle speculation. I do not know at this point what the successful technology is; I do not know who will design and promote it. But this much I know: I'd rather spend time filing my email and organizing my bookmarks than going back over 200 of my old articles and putting angle brackets around every keyword.

Andy Oram is an editor for O'Reilly Media, specializing in Linux and free software books, and a member of Computer Professionals for Social Responsibility. His web site is
