Querying the Web Library

by Piroz Mohseni

Editor's note: This is the introductory column for a series that will explore methods of retrieving data from back-end applications, using emerging technologies like REBOL and the Network Query Language (NQL). In this first column, Piroz Mohseni defines the problem space that the series will address.

Consider how the URL has changed over the past few years.

A few years ago, each URL contained a collection of static pages.

The dynamic aspect of a site was a function of how often the pages were updated, which for news sites was quite often. Given that model, search engines were the ideal tool for finding relevant information in a large pool of unstructured data (in other words, web pages). The Web was a large library without a card catalog. Search engines provided an automatic text-based index and retrieval mechanism that was not comprehensive but was useful.

The library model changed very quickly, and with that change web applications emerged. URLs no longer contained static pages, but were a source of dynamic data produced by sophisticated (and sometimes not very sophisticated) Web applications.

For example, what you'll find at is not a collection of HTML pages, but dynamic market data retrieved from back-end systems and made available to the browser. Catalog and pricing information for many e-commerce sites is not stored in static pages, but is often generated dynamically from a variety of back-end systems.

And, of course, web applications are a two-way street: you can not only retrieve data, you can add to or update existing data through the browser.

Searching a Dynamic Library

A regular search engine cannot search these dynamic pages. The Web library is still growing as more legacy systems and applications are becoming "web-enabled." The massive amount of data stored within these systems is exposed to web channels (URLs) through a Web application.

If we still want to view the entire Web as a giant library, how can we now interact with these various applications? How do we aggregate data from various sites when some of that data is only exposed through an application instead of a static page? We know how to interact with relational databases via SQL (Structured Query Language). What is the SQL for the Web?

The problem of web-enabling a database is relatively well understood and there are many tools that automate this process. But how can an application systematically interact with the various web front-ends that now stand in front of the traditional relational databases? How can a transaction be completed across multiple Web applications and data aggregated from various web sites?

Grabbing Dynamic Data

This is a key issue facing web developers today. Having enabled our systems on the Internet and within the enterprise, we find there is no consistent or standard manner to search and retrieve this information.

For a simple page, accessing the data is as easy as typing the URL. For content stored in databases that appears dynamically in the browser, we have to be more creative. The least common denominator for all web-based data is that it is viewable (or at least retrievable) by a browser. And from an application (or user) perspective, it should not matter where or how the data is stored or exposed.

Access to some data requires more sophisticated interactions such as authentication and going through a number of pages. For example, an e-commerce site may require you to log in, select a product, and enter your shipping address before it can calculate and display the shipping cost. Regardless of the steps, the data is eventually retrieved by the browser. So the key element in tackling the problem is to simulate a browser inside of an application that needs data from the Web library.
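As a rough illustration of simulating a browser inside an application, the sketch below uses Python's standard library to keep cookies across a sequence of form submissions, which is what lets a login carry over to later pages. It is only a sketch under stated assumptions: the site paths (/login, /cart/add, /checkout/shipping), the form field names, and the helper names are all hypothetical, and a real site would need its own steps and error handling.

```python
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar


def make_browser_session():
    """Return an opener that remembers cookies between requests, as a browser does."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def form_request(url, fields):
    """Build a POST request carrying URL-encoded form fields."""
    body = urllib.parse.urlencode(fields).encode("ascii")
    return urllib.request.Request(url, data=body, method="POST")


def shipping_cost_page(opener, base_url, user, password, product_id, postal_code):
    """Log in, select a product, enter an address, and return the final page's HTML.

    Every path and field name here is hypothetical; substitute the real site's forms.
    """
    steps = [
        ("/login", {"user": user, "password": password}),
        ("/cart/add", {"product": product_id}),
        ("/checkout/shipping", {"postal_code": postal_code}),
    ]
    html = ""
    for path, fields in steps:
        request = form_request(base_url + path, fields)
        # The opener sends back any cookies set by earlier steps.
        with opener.open(request, timeout=10) as response:
            html = response.read().decode("utf-8", errors="replace")
    return html
```

The application never renders anything; it simply replays the requests a browser would make and keeps the last response, which still has to be parsed to pull out the shipping cost itself.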

This approach is not new. Most languages now support HTTP, which is used to send requests to Web servers as a browser does. Many tools are built around the same idea. The proposed solution, however, warrants further attention. The nature of the data is that it is unstructured. XML is supposed to fix that, but we are not there yet. With presentation and data mixed together in the form of HTML pages, the data retrieval problem often parallels screen-scraping techniques used to extract information from mainframe applications. The dynamic nature of Web content and data also makes scripting languages more attractive for these types of applications than traditional "compiled" languages.
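To make the screen-scraping parallel concrete, here is a minimal sketch in Python that separates data from presentation by collecting the text of table cells with a given class attribute. The sample page, the class names "name" and "price", and the helper names are assumptions made up for illustration; real pages are rarely this tidy, and the extraction breaks whenever the site changes its layout.

```python
from html.parser import HTMLParser


class CellScraper(HTMLParser):
    """Collect the text of every <td> whose class attribute equals css_class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.values = []
        self._capturing = False

    def handle_starttag(self, tag, attrs):
        if tag == "td" and dict(attrs).get("class") == self.css_class:
            self._capturing = True
            self.values.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._capturing = False

    def handle_data(self, data):
        # Text may arrive in several pieces, so append to the current cell.
        if self._capturing:
            self.values[-1] += data


def scrape_cells(html, css_class):
    """Return the stripped text of all matching cells, in document order."""
    scraper = CellScraper(css_class)
    scraper.feed(html)
    scraper.close()
    return [value.strip() for value in scraper.values]


# Hypothetical catalog page with presentation and data mixed together.
page = """
<table>
  <tr><td class="name">Widget</td><td class="price">$9.99</td></tr>
  <tr><td class="name">Gadget</td><td class="price">$24.50</td></tr>
</table>
"""

print(scrape_cells(page, "price"))  # prints ['$9.99', '$24.50']
```

The short edit-and-rerun cycle this kind of code needs when a page layout shifts is one reason scripting languages suit the job.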

This column will focus on solutions for the general problem described above. We will take a look at various ways to access and manipulate the Web library. We'll focus on techniques based on established scripting languages such as Perl and new ones like REBOL and NQL. We'll discuss how Java can tackle this problem space and how XML can bring some order into this massive library.

A library is only as good as the ease with which its collection can be searched and accessed. There is no doubt that the Web Library is a tremendous collection of information and data needed by various applications. Without a standard card catalog, it is up to the applications to select the right data access strategy. Comments and feedback on this column, suggestions for what you'd like to see covered, and your own experiences are always welcome.

Piroz Mohseni is president of Bita Technologies, focusing on business improvement through the effective use of technology. His areas of interest include enterprise Java, XML, and e-commerce applications. Contact him at