The Internet Archive made headlines back in November with the release of the Wayback Machine, a Web interface to the Archive's five-year, 100-terabyte collection of Web pages. The archive is the result of the efforts of its director, Brewster Kahle, to capture the ephemeral pages of the Web and store them in a publicly accessible library. In addition to the other millions of web pages you can find in the Wayback Machine, it has direct pointers to some of the pioneer sites from the early days of the Web, including the NCSA What's New page, The Trojan Room Coffee Pot, and Feed magazine.
How big is 100 terabytes? Kahle, who serves as archive director and president of Alexa Internet, a wholly-owned subsidiary of Amazon.com, says it's about five times as large as the Library of Congress, with its 20 million books.
"What we have on the Web is phenomenal," Kahle says. "There are more than 10 million people's voices evidenced on the Web. It's the people's medium, the opportunity for people to publish about anything -- the great, the noble, the absolute picayune, and the profane."
The existence of such an archive suggests all kinds of possibilities for research and scholarship, but in Kahle's vision, all of the streams of research commingle into a single purpose: "The idea is to build a library of everything, and the opportunity is to build a great library that offers universal access to all of human knowledge. That may sound laughable, but I'd suggest that the Internet is going exactly in that direction, so if we shoot directly for it, we should be able to get to universal access to human knowledge."
If the goal sounds lofty, the Wayback Machine itself may be the crudest imaginable tool for data-mining a 100-terabyte database. At the Archive's Web site, simply enter a URL and the Wayback Machine gives you a list of dates for which the site is available.
Clicking on an old site is like time travel. I visited a December 1996 issue of Web Review (webreview.com) and found a cover story on "Christmas Cookies," an article dismissing privacy concerns about the new-fangled Web technology. A report from Internet World featured the hottest and most promising technology of the day: Push.
But that report, and the other articles I looked at in the Wayback Machine, were truncated; links to subsequent pages and many graphics were missing. Kahle concedes the Web interface does not show the full glory of the archive, but he says it wasn't meant to. "This is a browsing interface, a wow-isn't-this-cool interface ... It's a first step, but it's technically rather interesting because it's such a huge collection."
While the Wayback Machine has received plenty of press, we were interested in going deeper into the technical workings of this audacious project. We sat down with Kahle (who previously worked at the late supercomputer maker Thinking Machines and founded WAIS, Inc.) at the Archive's offices in San Francisco's Presidio.
Consider the hardware: a computer system with close to 400 parallel processors, 100 terabytes of disk space, hundreds of gigs of RAM, all for under a half-million dollars. As you'll read in this interview, the folks at the Archive have turned clusters of PCs into a single parallel computer running the biggest database in existence -- and wrote their own operating system, P2, which allows programmers with no expertise in parallel systems to program the system.
Richard Koman: So how much stuff do you have here?
Brewster Kahle: In the Wayback Machine, currently there are 10 billion Web pages, collected over five years. That amounts to 100 terabytes, which is 100 million megabytes. So if a book is a megabyte, which is about what it is, and the Library of Congress has 20 million books, that's 20 terabytes. This is 100 terabytes. At that size, this is the largest database ever built. It's larger than Walmart's, American Express', the IRS. It's the largest database ever built. And it's receiving queries -- because every page request when people are surfing around is a query to this database -- at the rate of 200 queries per second. It's a fairly fast database engine. And it's built on commodity PCs, so we can do this cost-effectively. It's just using clusters of Linux machines and FreeBSD machines.
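As a rough check of the arithmetic in this answer, the sketch below redoes the size comparison in Python. It uses only the figures Kahle quotes, with decimal units (1 TB = 1,000,000 MB) and his estimate of one megabyte per book. The average size per page is a derived number, not one stated in the interview, and it assumes the 100 terabytes is spread evenly over the 10 billion pages.

```python
# Back-of-the-envelope check of the sizes quoted in the interview.
MB_PER_TB = 1_000_000          # decimal units: 100 TB is 100 million MB

archive_tb = 100               # size of the Web collection
pages = 10_000_000_000         # pages collected over five years
loc_books = 20_000_000         # books in the Library of Congress
mb_per_book = 1                # Kahle's rough estimate for one book

archive_mb = archive_tb * MB_PER_TB
loc_tb = loc_books * mb_per_book / MB_PER_TB
ratio = archive_tb / loc_tb

# Derived, not quoted: average kilobytes stored per page.
avg_page_kb = archive_mb * 1000 / pages

print(f"archive: {archive_mb:,} MB")
print(f"Library of Congress: {loc_tb:.0f} TB")
print(f"archive / library: {ratio:.0f}x")
print(f"average page: {avg_page_kb:.0f} KB")
```

The ratio of five matches the "five times as large as the Library of Congress" comparison made earlier in the article.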
Koman: How many machines?
Kahle: Three hundred, we may be up to 400 machines now. When we first came out, we didn't architect it for the load we wound up with, so we had to throw another 20 to 30 machines at serving the index.
Koman: You just throw more PCs at the problem?
Kahle: You can build amazing systems out of these bricks that cost only a couple hundred dollars each, and you just throw more bricks at the problem to give it more computer power, more RAM, more disk, more network bandwidth, whatever it is you need. So we build massive database systems by striping the index over tens of machines. And it's a very cost-effective system.
Koman: What kind of performance do you get?
Kahle: We're getting exceptional performance. Basically to build a 100TB database costs -- in hardware costs -- less than $400,000, including all the network equipment, all the redundancy, all the backup systems. We've had to do it based on necessity, because there's not a lot of money in the library trade. Where the Library of Congress has a budget of $450 million a year, you can be sure we don't.
Koman: How does it work technically?
Kahle: How the archive works is just with stacks and stacks of computers running Solaris on x86, FreeBSD, and Linux, all of which have serious flaws, so we need to use different operating systems for different functions. The crawling machines are running Solaris; there's a dozen or possibly more.
Koman: What are the crawlers written in?
Kahle: Combinations of C and Perl. Almost everything we can, we do in Perl -- for ease of portability, maintainability, flexibility. Because there's so much horsepower we don't really require a tight system. The crawlers record pages into 100MB files in a standard archive file format, and then store it on one of the storage machines. Those are just normal PCs with four IDE hard drives, and it just writes along until it's filled up and then it goes to the next one. It goes through a couple of these machines a day: hundreds of gigabytes a day. The total gathering speed when everything is moving is about 10 terabytes a month, or half a Library of Congress a month.
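The "write along until it's filled up, then go to the next one" pattern can be sketched in a few lines. This is a minimal illustration in Python, not the Archive's crawler code (which Kahle says is C and Perl). The class name, the file naming, and the one-line record header are all hypothetical; the interview does not describe the archive file format beyond its 100MB size.

```python
import os
import tempfile


class RollingArchiveWriter:
    """Append records to numbered files, rolling over at a size cap."""

    def __init__(self, directory, max_bytes=100 * 1000 * 1000):
        self.directory = directory
        self.max_bytes = max_bytes      # 100 MB by default, as in the interview
        self.file_number = 0
        self.written = 0
        self.handle = None

    def _path(self):
        # Hypothetical naming scheme for the archive files.
        return os.path.join(self.directory, f"crawl-{self.file_number:05d}.dat")

    def write(self, url, body):
        # Hypothetical record layout: "URL LENGTH\n", raw bytes, "\n".
        record = f"{url} {len(body)}\n".encode("utf-8") + body + b"\n"
        # Roll over if this record would push the current file past the cap.
        if self.handle is not None and self.written + len(record) > self.max_bytes:
            self.close()
            self.file_number += 1
        if self.handle is None:
            self.handle = open(self._path(), "wb")
            self.written = 0
        offset = self.written
        self.handle.write(record)
        self.written += len(record)
        return self._path(), offset     # where the record landed

    def close(self):
        if self.handle is not None:
            self.handle.close()
            self.handle = None


# Tiny demonstration with a 100-byte cap so the rollover is visible.
with tempfile.TemporaryDirectory() as tmp:
    writer = RollingArchiveWriter(tmp, max_bytes=100)
    locations = [
        writer.write(f"http://example.com/{i}", b"x" * 20) for i in range(5)
    ]
    writer.close()
    files = sorted(os.listdir(tmp))

print(files)
print([offset for _, offset in locations])
```

Returning the file and offset for each record is a design choice made here because a later indexing step needs some way to find a page again. At the quoted rate of 10 terabytes a month, a real system would fill roughly 333 gigabytes a day, which is consistent with "hundreds of gigabytes a day."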
Then they're indexed onto another set of machines -- there's a whole hierarchical indexing structure for the Wayback Machine, and that is kept up to date on an hourly basis. So when people come to the Wayback Machine, there's a load balancer that goes and distributes those queries to 12 or 20 machines that operate the front end, and those query another dozen or so machines that hold a striped version of the index, and that index allows the queries to answer what pages are available for any particular URL. So if you were to click on one of those pages, it goes back to that index machine, finds out where it is in all the hundreds of machines, retrieves that document, changing the links in it so that it points back to the path, and then hands it back to the user. And it does that at a couple hundred per second.
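The lookup path described here (pick an index machine, list the available dates, locate the stored copy, rewrite its links) can be sketched as follows. This is a single-process Python illustration under stated assumptions: the checksum-based striping, the class and function names, the storage locations, and the "/archive" URL prefix are all hypothetical. The interview says only that the index is striped over about a dozen machines.

```python
import re
import zlib

NUM_STRIPES = 12  # stands in for the "dozen or so" index machines


def stripe_for(url, num_stripes=NUM_STRIPES):
    """Map a URL to one index stripe with a stable checksum (assumed scheme)."""
    return zlib.crc32(url.encode("utf-8")) % num_stripes


class StripedIndex:
    """Each stripe maps url -> {capture date -> storage location}."""

    def __init__(self, num_stripes=NUM_STRIPES):
        self.stripes = [{} for _ in range(num_stripes)]

    def _stripe(self, url):
        return self.stripes[stripe_for(url, len(self.stripes))]

    def add(self, url, date, location):
        self._stripe(url).setdefault(url, {})[date] = location

    def dates(self, url):
        """Answer "what captures are available for this URL?"."""
        return sorted(self._stripe(url).get(url, {}))

    def locate(self, url, date):
        """Answer "which machine and file hold this capture?"."""
        return self._stripe(url)[url][date]


def rewrite_links(html, date, prefix="/archive"):
    """Rewrite absolute http links so they point back into the archive."""
    return re.sub(
        r'href="(http://[^"]+)"',
        lambda m: 'href="%s/%s/%s"' % (prefix, date, m.group(1)),
        html,
    )


index = StripedIndex()
# Hypothetical locations: (storage machine, archive file, byte offset).
index.add("http://webreview.com/", "19961220", ("storage-017", "crawl-00042.dat", 1234))
index.add("http://webreview.com/", "19970115", ("storage-101", "crawl-00007.dat", 88))

available = index.dates("http://webreview.com/")
where = index.locate("http://webreview.com/", "19961220")
page = rewrite_links('<a href="http://webreview.com/next.html">next</a>', "19961220")

print(available)
print(where)
print(page)
```

Because the stripe is computed from the URL alone, any front-end machine can work out which index machine to ask without consulting a central directory, which is one plausible reason to stripe this way.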
What's amazing to me is the fact that the hardware is free. For doing things even in the hundreds of terabytes, it costs in the hundreds of thousands of dollars. When you talk to most people in IT departments, they spend a couple hundred thousand dollars just on a CPU, much less a terabyte of disk storage. You buy from EMC a terabyte for maybe $300,000. That's just the storage for 1 TB. We can buy 100 TBs with 250 CPUs to work on it, all on a high-speed switch with redundancy built in. Something has changed by using these modern constructs that are heavily used at Google, Hotmail, here, Transmeta. There's a whole sector of companies that are more cost-constrained than say, banks, that just buy Oracle and Sun and EMC.
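The cost comparison can be worked through with the figures quoted in the interview. The sketch below treats the $400,000 total as an upper bound for the whole 100-terabyte system and Kahle's "maybe $300,000" as the price of one terabyte of EMC storage; both are approximate numbers from the conversation, not vendor pricing.

```python
# Cost per terabyte, using only figures quoted in the interview.
archive_cost = 400_000        # dollars, upper bound for the whole 100 TB system
archive_tb = 100
emc_cost_per_tb = 300_000     # dollars, "maybe $300,000" for one terabyte

commodity_cost_per_tb = archive_cost / archive_tb
price_ratio = emc_cost_per_tb / commodity_cost_per_tb

# A Library of Congress worth of books at one megabyte each is 20 TB.
library_cost = 20 * commodity_cost_per_tb

print(f"commodity cluster: ${commodity_cost_per_tb:,.0f} per TB")
print(f"EMC storage costs about {price_ratio:.0f}x more per TB")
print(f"20 TB on the cluster: ${library_cost:,.0f}")
```

At roughly $4,000 per terabyte, 20 terabytes comes to $80,000, which is the figure Kahle gives later for storing the Library of Congress.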
Koman: You mentioned Perl, Linux, and FreeBSD. Do you use exclusively open source software?
Kahle: We use as much open source software as we can; we make as much of our software as we can open because we're a library. The idea is to help people make sense of the Net and we leverage all the open tools. Alexa put up a television archive called tvarchive.org, which is television news from around the world from Sept. 11 to Sept. 18. Twenty channels in Chinese, Russian, Japanese, Iraqi. Iraqi television is really interesting. So in three weeks, Alexa took all these recordings from tape, massaged them, put them online, and converted them into several different formats. The only way to do this is to cross-cluster hundreds of commodity Linux boxes and use freeware tools, all of which barely work.
Koman: This all takes a lot of brain cells; you have to have some smart people working on this.
Kahle: Yes, this is not for the faint of heart. If you're going to run 100TB databases and support hundreds of queries per second, it's going to take good folks. But on the other hand, there are good folk doing a whole lot less than that. The archive is a real vindication that you can do new and different things with these open tools. Because these open tools are available to use in ways different from those for which they were originally designed, it makes striving for the biggest collection of information ever possible.
Koman: Does the fact you can do this at this scale suggest new possibilities for the private sector, that businesses can operate on a scale not previously imagined?
Kahle: Having the capital cost of equipment drop to effectively zero allows you to think bigger. You start thinking about the whole thing. For instance, the gutsy maneuver of saying "let's index it all," which was the breakthrough of Altavista. Altavista in 1995 was an astonishing achievement, not because of the hardware -- yes, that was interesting and important from a technical perspective -- but because of the mindset. "Let's go index every document in the world." And once you have that sort of mindset, you can get really far.
So if all books are 20 TBs, and 20 TBs are $80,000, that's the Library of Congress. Then something big has changed. All music? It's tiny. It looks like there're only one million records that have been produced over the last century. That's tiny. All movies? All theatrical releases have been estimated at 100,000, and most of those from India. If you take all the rest of ephemeral films, that's on the order of a couple hundred thousand. It's just not that big. It allows you to start thinking about the whole thing.
It will change also the relationships of corporations to their IT departments. IT spends a lot of money on this stuff; they spend millions. And if they really understood that it doesn't have to cost millions, it could cost hundreds of thousands of dollars, and they could hire a few smart people rather than large numbers of people to maintain all this equipment, we might be able to make some big steps forward. It would open it up to smaller companies to do bigger things. Where people used to think that warehouses full of mainframes was an asset, that may not be the case.
Koman: How do you mine all this stuff?
Kahle: That's where the fun begins. Datamining these materials is great fun. What Alexa does in its free toolbar is create a related-links service, and it does it based on the collaborative filtering of "other people who went to this page went to these other pages." We use the link structure of the Net and the usage trails from the Alexa users to be able to compute this. And all of these techniques require tens if not hundreds of machines to be able to data process.
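The "other people who went to this page went to these other pages" idea can be illustrated with simple co-visit counting. This is a minimal Python sketch, not Alexa's algorithm: it uses only usage trails (the interview says link structure is used as well), and the function name, the sample trails, and the ranking by raw count are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from itertools import permutations


def related_links(trails, top_n=3):
    """For each page, rank the other pages that share the most trails with it."""
    covisits = defaultdict(Counter)
    for trail in trails:
        # Count each pair once per trail, in a deterministic order.
        for page, other in permutations(sorted(set(trail)), 2):
            covisits[page][other] += 1
    return {
        page: [other for other, _ in counts.most_common(top_n)]
        for page, counts in covisits.items()
    }


# Hypothetical usage trails: the pages each user visited in one session.
trails = [
    ["news.example", "weather.example", "sports.example"],
    ["news.example", "weather.example"],
    ["news.example", "recipes.example"],
]

related = related_links(trails)
print(related["news.example"])
print(related["recipes.example"])
```

Every trail can be counted independently and the counts added together afterwards, which is why a computation like this spreads naturally over tens or hundreds of machines.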
Because there are only a couple hundred gigabytes for every processor and the processor and RAM are very closely tied to the disks, you can operate this cluster as a large parallel computer. It's very inexpensive to do. We program the computer using a technology called P2, which we'll be putting out as open source for other people to be able to operate parallel clusters of Linux or FreeBSD or Solaris boxes.
Koman: What is P2?
Kahle: P2 is a Perl script that takes commands and runs them on remote boxes, splits up data to be able to run on them, and then brings back and correlates the data.
Koman: It's an operating system for a parallel cluster?
Kahle: But it sits on top. You can take people who know how to do shell scripts or Perl scripts on normal Unix boxes and within two weeks, they can be world-class parallel data miners. That's a huge step past the problems we've had with parallel computing, where you had to learn a whole new methodology. This is: no new methodology, no rocket science, no magic. And it's only because it's straightforward that we've been able to leverage normal programmers' expertise to be able to run programs on hundreds of machines.
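The pattern Kahle describes for P2 (split up the data, run the same command on many boxes, bring back and correlate the results) can be sketched on a single machine. The Python below is an illustration of that pattern only, not P2 itself, which is a Perl script whose interface is not described in the interview. A thread pool stands in for the remote machines, and all function names and the sample job are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlsplit


def split(records, num_workers):
    """Deal the records out into num_workers roughly equal chunks."""
    return [records[i::num_workers] for i in range(num_workers)]


def count_hosts(chunk):
    """The per-machine job: count archived pages per host in one chunk."""
    return Counter(urlsplit(url).netloc for url in chunk)


def run_everywhere(job, records, num_workers=4):
    """Split the data, run the job on every chunk, then merge the results."""
    chunks = split(records, num_workers)
    # A thread pool stands in for the remote boxes in this sketch.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(job, chunks))
    total = Counter()
    for partial in partials:
        total.update(partial)   # the "correlate" step: add up partial counts
    return total


urls = [
    "http://webreview.com/1996/12/cookies.html",
    "http://webreview.com/1996/12/push.html",
    "http://example.com/index.html",
    "http://webreview.com/index.html",
    "http://example.com/about.html",
]

totals = run_everywhere(count_hosts, urls)
print(totals.most_common())
```

The per-chunk job is an ordinary function that knows nothing about the cluster, which mirrors the point made above: the programmer writes a normal script, and the layer on top handles distribution.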
Koman: It sounds quite simple.
Kahle: We've been at it for years. The first company I worked in was Thinking Machines. And we blew it. We built the fastest computer in the world that very few people could program. It required people to think in a new way. What a horrible thing to have to do to be able to attract customers. The idea is to be able to think the same and be able to do more. I think we've cracked the parallel computer problem for a very large set of problems, which is fundamentally data-mining and database-type operations.
Koman: So will people looking for more than the Wayback Machine be able to mine the Archive?
Kahle: The idea is to try to allow people to use a Web interface -- clunky, but you can step through it -- but then it would show you the command that's going to be run across the cluster. But if you say, "Yeah, that's kind of what I want, but instead of this I want to be able to go in and put in my own Perl script," then we'll allow people to do it.
We're going to try to expose what we do internally, but first put an easy interface to at least get something done, and then an easy path from novice to expert. But you'll need to know things like Perl. And then our challenge will be how to manage, say, 10 to 20 programs running at the same time over the data sets and not have people clobber each other. Kind of timesharing, but at the hundreds-of-computers level.
Koman: You have several other collections besides the Web. The ephemeral films and the television archives are not content from the Web, but content you're putting on the Web.
Kahle: We've put 1,000 films up online for people to download and use in any way that they want. What we really want is for people to make their own movies. But these, they're pretty wild films; education films, government films, propaganda films, industrial films. They're all available for download in MPEG2, which is DVD-quality, for people to do anything they want. People have made some really terrific films, and some of them are on the site as well. I really recommend "The ABCs of Happiness" and "The Consequence of War." Awesome films.
Koman: You wouldn't think with 100 terabytes of stuff already that you would need to encourage the creation of more content.
Kahle: We're trying to show how people can do it themselves. We're trying to encourage everyone to take their old content that's not online and put it online. A professor at UC Berkeley said that students use the Web as the resource of first resort, which is a huge change. But that's a little dangerous if the Web doesn't have the good stuff on it, and many people complain it doesn't. Instead of trying to whip students to go back to the physical library, let's put the good stuff on the Net. Otherwise, we could have a whole generation learning from ephemeral content collections, as opposed to learning from the books of the ancients. And a lot of materials are not there yet.
Koman: Are you working with the great libraries on digitization?
Kahle: Yes, we're working with the Library of Congress on some of these Web collections and starting to work with them on digitizing different parts of their print collections. The Prelinger Archives is digitizing films. We're working with different researchers on automatic transcription of the television materials, so we can get that to be a referenceable resource. These are the sort of things we have to get to, and get to very soon. Every year that passes, we have more and more students using not the best we have to offer and that is a tragedy. We are the establishment. We should be making tools that allow children and students to have access to it all. And we're letting them down so far.
Koman: What about the question of rights? I just wrote about Lawrence Lessig's book on intellectual property. Surely the publishers and the television networks and the record companies aren't willing to let you keep a copy of all of their stuff?
Kahle: All we collect for the Web archive are sites that are publicly accessible for free, and if there's any indication from the site owner that they don't want it in the archive, we take it out. If there's a robot exclusion, it's removed from the Wayback Machine. Over the years, people would notice these things in their logs and would say, what are you doing? And we'd explain what we're doing -- building this archive and donating a copy to the Library of Congress, etc., etc., and 90% of the time they say, "Oh, that's cool, you're crazy, but go ahead." About 10% of the time, they'd say, "I don't want any part of it," and we instruct them on how to use a robot exclusion and they're taken out of history. That seems to work for everybody at this point. People are really excited about this future that we're building together.
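The robot exclusion check mentioned here can be illustrated with Python's standard `urllib.robotparser` module. This is a sketch of how a crawler or archive might honor a site's robots.txt, not the Archive's own code. The helper name is hypothetical, and the user-agent string `ia_archiver` is an assumption that does not come from the interview.

```python
from urllib.robotparser import RobotFileParser


def allowed_in_archive(robots_txt, url, agent="ia_archiver"):
    """Return True if the site's robots.txt permits this agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# A site owner who wants no part of the archive excludes the crawler entirely.
opted_out = """User-agent: ia_archiver
Disallow: /
"""

excluded_site_ok = allowed_in_archive(opted_out, "http://example.com/index.html")
open_site_ok = allowed_in_archive("", "http://example.com/index.html")

print(excluded_site_ok)
print(open_site_ok)
```

An empty robots.txt places no restrictions, so the second site remains eligible for collection.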
Koman: The dot-bomb hasn't disillusioned you at all?
Kahle: I never predicted the capital market in the first place. I don't know where that came from, but wow, there was a lot of money there for a while. But I love the era of dreams. I loved it when people were trying to make services whose only constraint was to be popular. They didn't have to make money, they just had to do something people liked. It was amazing the ideas… I'm glad they're captured in some way, because it's those dreams when the medium is new before you realize all its faults and foibles, and the Internet is going to disappoint, it's going to be good at a few things and not good at everything else, but at least those dreams are something we should try to live up to the next time. As we refine technologies and come up with the next thing, let's see if we can live up to a few more of those dreams, not just the making a million dollars, but having the ability to get your words out, to reinvent government, whatever it is. If it doesn't happen this time, let's remember it, so the next time, let's give it another good shot.
Am I disillusioned? No. Is it depressing to see a lot of my friends out of work? Yes! But the goal of universal access to human knowledge is in many ways an original goal of the Net. It's a tremendous goal. It makes me want to jump out of bed in the morning and try to get this thing done. People working on digital divide issues want to join in, advocates for children's literacy programs want to join in. It's not about driving slick cars, it's about using this technology for the betterment of education and people. I'll take that any day over random stock option grants.
Richard Koman is a freelance writer and editor based in Sonoma County, California. He works on SiliconValleyWatcher, ZDNet blogs, and is a regular contributor to the O'Reilly Network.
Copyright © 2009 O'Reilly Media, Inc.