Whither (wither) the Relational Database
by William Grosso
My friend Richard, who runs
the Business Intelligence SIG for SDForum,
is putting together a slightly larger event; it's a panel discussion
on relational databases.
He's got a lot of interesting ideas in his head, and they got me to wondering: is the role of the relational database going to change dramatically in the next 10 years?
Right now, relational databases are used in almost every enterprise application. That's a slight exaggeration, of course, but lots and lots of applications use them. And they're convenient, and cheap, enough that they're surreptitiously included in lots of client applications as well.
The key benefits that relational databases bring are persistence, transactions, reliability, and indexing. The price you pay is that your data has to be shoehorned into the relational model. Sometimes the shoehorning is gentle; sometimes it's an act of violence. And you've either got to write the queries yourself, or use some sort of object-persistence tool. And you've got all the overhead of the RDBMS system itself.
I like using databases enough that, on occasion, I've advocated using them everywhere. For example, this snippet from 1997 on the original wiki.
I also think the use of embedded databases is going to skyrocket. Every time I look at them, I start drooling. Not because I have a compelling new application that requires them, but because I think they make a lot of currently-complex tasks a little easier. It's going to require a mindset-change on the part of a lot of developers (using an RDBMS for persistent data is a lot different from using an embedded database within an application for non-persistent data), and it's emphatically a step away from OO onto a much more declarative path, but I think the potential is mind-blowing.
What I was doing there was looking at Moore's law, looking at how relational databases simplify my life in the enterprise application universe, and thinking "oooh. Moore's law says that I can get me one of them database things in every process. Cool!"
But here's what Moore's law really says:
You can get you one of them indexing and persistence things in every process. And, as time goes on, you'll be able to spend more and more cpu cycles on the indexing and persistence thing.
E.g. Moore's law gives me permission to use indexing and persistence engines in every process, but it doesn't insist that I use relational databases. And I'm starting to think I don't really want to, for five reasons:
If you're reading this and thinking "all he's saying is that lucene is useful for indexing xml fragments," you're halfway there. And If you're an XML lunatic who then says "Hey! Wow! And the world, in its entirety, is entirely composed of XML (or possibly RDF) fragments," then you've gone way too far. What I know is that the world is mostly made up of semi-structured data and I know that database schemas often evolve at a ferocious rate because, when we impose more structure, we often get it wrong.
And so now what I'm wondering is if I was completely off base in 1997. That is, I'm wondering if Moore's law really says that relational databases are going to become vastly less important over time, because for most applications there's a less-structured (and less efficient) way to do things that's more convenient for the programmers.
In much the same way that we moved to "higher level" and "scripting" languages, I'm starting to think we're going to move backwards, towards "more primitive" indexing systems where we just toss all the documents into the indexer and then pull out things based on text search.
The embarassing thing about this little essay (it's too long to call it an entry) is that I think I might have just understood Perl for the first time.
What do you think? What's the role of the relational database in 2006?
Fields as disjoint sets
This is an interesting conjecture, and I'm not sure I agree. In the book example, you could easily have a column called "Sale Price," which would have the same type as the original price. However, it's not clear that that would make it much harder to search. Even searching for any price that is $19.99 would narrow down the records a lot.
Please let us know if you find any empirical studies.
Don't forget Smalltalk environments and OODB
When trying to aggregate content or extract meaning, lots of text is too much text.
Lots of direct data links can preserve and express meaning... beyond just indexing.
Are we really talking about relational, or just disk-based SQL?
Maybe I'm just one of those disgruntled purists out there, but it seems that every time some expert proclaims the demise of relational DBMSs, they narrow the focus of their complaint to traditional SQL systems. An RDBMS does not have to use SQL (in fact, SQL is not fully relational anyway), and an RDBMS does not have to limit itself to disk storage.
And of course, the only real complaint made every time is 'but performance is so bad with an RDBMS'. Performance is an implementation issue, not an issue of which logical querying/manipulation methods you use.
Note this quote:
"The key benefits that relational databases bring are persistence, transactions, reliability, and indexing."
Um... aren't we forgetting something? (oh, yeah... logic) In fact, not one of these benefits mentioned above are distinctly relational. The relational model is about how we can logically query, manipulate, and constrain your data. No other method comes anywhere close to giving us the expressive, declarative power over our data. Again, don't limit this to SQL's shortcomings. What we need is systems that are MORE relational, not less.
"The price you pay is that your data has to be shoehorned into the relational model."
As opposed to shoehorning data into a hierarchical (XML) or network (object-oriented) model? There is a reason the industry abandoned hierarchical and network database models in the 1970s: mainly because they were ad hoc, path-based, dependency-driven messes. And, the hidden cost of these systems is that every query is actually a proprietary piece of code, that doesn't translate well to multiple systems. SQL, bad as it is, at least allows multiple software systems (written in different languages) to talk to the same data seamlessly.
Not to mention that a truly relational language can handle hierarchies or network structures AS easily as it handles set-oriented data. There is nothing preventing the relational model from encompassing "complex" datatypes, domain inheritance, and number of other goodies that would go a long way toward lowering the drudgery of database work, while if anything allowing for many more performance benefits than traditional SQL systems.
For some more information on what a true relational DBMS could look like, look at some of the pioneers in the field: C.J. Date and Hugh Darwen (www.thethirdmanifesto.com).
Finally, the "disjoint set" comment... HUH!?! How on earth would "SELECT * where PRICE = $19.99" be the same as "SELECT * where AUTHOR = $19.99"? Or does the author refer to some mythical attribute wildcard query, where the system queries EACH column for a match? If so, this makes even less sense. HOw would querying all attributes be as fast as querying one attribute?
And finally, XML would be the worst way to handle some sorts of data; let's just look at an article by Joel Spolsky, with a couple comments on implementation that maybe can cast some light on this: http://www.joelonsoftware.com/articles/fog0000000319.html (look further down for "SELECT author FROM books")
We aren't talking about either disks or SQL. That's some other article, on a different web site.
Taking this point by point.
(1) I'm not an expert on databases; I'm just a guy wondering out loud. Please don't call me an expert or think of me as an expert on databases.
(2) I didn't mention SQL at all. I'm referring to the use of relational databases in practice, as I have seen them used. I have no idea why you dragged in either SQL or performance complaints.
(3) The key benefits are the ones I listed: the relational model of data is nice, because it lets us predict what the results of queries will be, but it feels secondary compared to, for example, atomic transactions.
(4) You'll note that I'm not actually attacking RDBMS performance. It's fine. What I'm saying is actually the opposite: we can move to a less performant (and more agile, in the programming sense) solution because Moore's law has won, and we don't need to structure our data quite so thoroughly for performance reasons. And yes, we might wind up "indexing on every column." For a large class of cases, it might not matter.
(5) You might have a point about shoehorning. But ... we're going to shoehorn into XML anyway (for interop). Do we need to go through the shoehorn pain twice?
(6) The whole "ad-hoc" query paragraph is confusing. I'm not advocating a return to network or hierarchical databases, or path-based querying. What I'm saying is something far simpler.
(7) I've read the third manifesto, and I'm familar with what you're talking about (by and large, non-existent software with idealized attributes that will solve problems nobody really cares to solve in practice).
(8) I was unclear in my example. By "[ANY COLUMN]" I meant, select over all columns. The article is exploring the idea that relational databases can be replaced by keyword-based text search in certain scenarios. And the bullet point simply says that since database columns are often disjoint sets, specifying the column is (often) logically redundant.
(9) I didn't claim that we should get rid of RDBMS, or even that they are a bad thing. What I claimed was that, with the advent of ever faster machines and high-quality text searching libraries, maybe the "structured data" model is unnecessary in a lot of cases. Maybe, with the advent of high-quality and readily available text searching solutions, and the movement of the world towards "document-centric" architectures, we don't need to use relational databases in quite so many scenarios.
(10) We're talking about moving the boundaries, and realizing that the class of applications for which a relational database (as implemented in the real world) makes sense has shrunk. That Joel can invent a hypothetical example where the (completely unmeasured) performance cost is prohibitive is (possibly) interesting but (almost certainly) irrelevant.
Are we really talking about relational, or just disk-based SQL?
Sorry for the tone of my hastily-written comment.
"(2) I didn't mention SQL at all. I'm referring to the use of relational databases in practice, as I have seen them used. I have no idea why you dragged in either SQL or performance complaints. Perhaps you were actually replying to a different entry?"
You're right... I was responding to what I thought you were saying, which I have seen in many places. Apologies, yes, but your only example was an SQL example.
"(3) The key benefits are the ones I listed: the relational model of data is nice, because it lets us predict what the results of queries will be, but it feels secondary compared to, for example, atomic transactions."
I have to disagree most strongly here. Any method of working with data can use transactions. I still think the logical benefits are far greater. (Meaning, with the relational method I can accomplish more with less code, and be more secure that the system will do what I want.)
"(4) You'll note that I'm not actually attacking RDBMS performance. It's fine. What I'm saying is actually the opposite: we can move to a less performant (and more agile, in the programming sense) solution because Moore's law has won, and we don't need to structure our data quite so thoroughly for performance reasons. And yes, we might wind up "indexing on every column." For
a large class of cases, it doesn't matter."
I see, and yes that is an interesting difference from what I usually hear in these cases. However, I don't structure my data for performance reasons. I structure it based on what I want it to do.
"(5) You might have a point about shoehorning. But ... we're going to shoehorn into XML anyway (for interop). Do we need to go through the shoehorn pain twice?".
I think that is a case-by-case thing, but I was under the impression there are all kinds of easy automated ways to get XML output from relational databases.
"(7) I've read the third manifesto, and I'm familar with what you're talking about (by and large, non-existent software with idealized attributes that will solve problems nobody really cares to solve in practice)."
Non-existent? In my limited circles I know of at least 3. Yes, this is a new area for database software in general, so unfortunately there is not a lot out there. But these systems WILL solve problems that I want solved, and the work I do is very business-oriented, non-abstract, non-research-oriented. I don't see why so many people dismiss this stuff out of hand.
"(8) I was unclear in my example. By "[ANY COLUMN]" I meant, select over all columns. This is a pretty obvious deduction, once you stop assuming I don't know what I'm talking about."
OK, OK... sorry again...
"I'm talking about keyword based text search, and saying that since database columns are often disjoint sets, specifying the column is (often) logically redundant."
So, given that we are worrying less about performance, we can search everything as text? Is that what you are getting at? If not, I still don't see why specifying the column is redundant, because PRICE would certainly be numeric, not text. Maybe I'm missing something. Please explain more.
"(9) I didn't claim that we should get rid of RDBMS, or even that they are a bad thing. What I claimed was that, with the advent of ever faster machines, maybe the "structured data" model is
unecessary in a lot of cases. Maybe, with the advent of high-quality and readily available text searching solutions, and the movement of the world towards "document-centric" architectures, we don't need to use relational databases in quite so many scenarios."
Fair enough. I don't necessarily want a relatiol DBMS inside everything. But, I can't think of a single large application I have worked on that would be better off without it.
"(10) We're talking about moving the boundaries, and realizing that the class of applications for which a relational database (as implemented in the real world) makes sense has shrunk. That Joel can invent a hypothetical example where the (completely unmeasured) performance cost is prohibitive is (possibly) interesting but (almost certainly) irrelevant."
Maybe so, but also maybe we can move some other boundaries. The idea of the relational database is pretty well solidified in the SQL/disk-based client-server concept, when that doesn't necessarily have to be the case. Nor are we stuck with any certain way of representing our data in a relational system. Isn't some of this worth exploring also?
Structured and unstructured data both have their place
Sometimes we are interested in structured data and sometimes we are not. Unstructured search is wonderful when it works, but it does not do everything.
For example, lets take your database table for books. I am currently working with a database with a table that has both the price and the cost of each item, so if I am interested in books that with a price that is less than $19.99, and I could do an unstructured search for less than $19.99, I would see both books that cost less than $19.99 as well as books that are priced less than $19.99.
This little example has introduced a couple of problems. Firstly there is an inequality, which when we get into it uncovers the need for data types, otherwise, for example, in a character comparison 100 is less than 20. Also as the example shows, inequalities can be very useful part of search.
Secondly if the search can distinguish between the price number and the cost number, it will return better results. Recently, I wrote some scripts in awk to collect and average performance numbers. I collect both the CPU time and the elapsed time for queries, and these scripts certainly depend on the positions of numbers in a line.
In practice we have a whole spectrum of data ranging from the unstructured, through the semi-structured to completely structured. The more we know about the structure of the data the easier it is to do sophisticated searches. Normally we do not throw away structure when searching.
For example, consider your calendar. Dates are always a pain to deal with, even when we know that they are dates. Trying to formulate an unstructured search on a date range would be so extremely painful that I cannot imagine trying to do it. On the other hand, there are certainly occasions where I would have liked to be able to do an unstructured search on structured data.
"Towards a Self-Tuning RISC-Style Database System"
This was a relevant paper presented at the VLDB conference in Cairo in 2000. Here's an online copy:
It's an interesting argument that today's "universal" database systems are unsupportably complex, that the flexibility is not required for most applications, and that we could replace most db's with "netapp" style boxes that can self-tune for the load you place on them (no explicit indexing, no explicit tuning) by reducing the instruction set to a (mathematically) well-understood subset. NB this isn't throwing away the relational model, but they do advocate throwing away SQL.
To paraphrase Dick the Butcher - "First thing we do, let's sack all the DBAs...". Obviously this won't work for everyone, but Chaudhuri and Weikum's point is about the 80/20 rule, and is well made. The article is worth a read. (there's lots more papers on vldb.org on the future of db technology for those interested)
"Towards a Self-Tuning RISC-Style Database System"
Wow. Great paper recommendation. Thanks.
Tight or loose: clean or messy
Some learn piano by sitting down and playing; others learn from teachers steeped in Western musical tradition. Each can play any piece of music but the traditional learner can communicate more precisely about the music with his/her peers, since he shares a detailed common musical vocabulary.
So also with the continuum of search that ranges from pure relational database search to free text search. If a problem is well understood (has a proven model) then the RDBMS approach will dominate. Otherwise text search allows testing of models which, once they become stable, can be "frozen" in RDBMS implementations. Also RDBMS can "loosen up" via fuzzy, probabilistic or text queries; text search can "tighten up" by implementing relational search based on a given model.
So part of the question is to what degree do we wish to communicate about the models (the meta-data)? And to/with whom (our friend Tom, other people, computers, etc.) do we wish to communicate?
Hiding in this discussion is the old memory vs CPU tradeoff. RDBMS use CPU early to organize the data into memory structures and to minimize CPU required for search; text searchers use CPU late to organize and search the data. Either technique will work; my belief is that they are mathematically equivalent from several perspectives. But should the cost of CPU and memory fall to zero (and this is a certainty), the CPU/memory criteria will not be useful for preferring either RDBMS or text search.
I would argue that the more critical difference is that any human(s) involved would prefer one or the other depending on their interaction with the system, i.e., their ability to construct correct models of the world and that, once a model is shown to be correct, then a relational implementation would be the preferred one.
So for accounting, RDBMS would be preferred, whereas for a criminal investigative system a text search system with intelligence (a hybrid, perhaps fuzzy-relational) would be better, since the underlying models are in flux and not so definite as in the case of the financial system. [Aside: in a crime any object can be used to perform any action. Systems that too tightly define the objects (people, places, things) involved in a crime (or other human activity) will be unable to characterize certain crimes.]
There is growing momentum in the Open Source Java world for RDBMS alternative s like Jisp, JDBMS, and Prevayler. These are all systems that allow objects to be persisted to disk, and in the case of JDBMS and Prevayler in a transactional manner.
The beauty of these systems is that they are extremely light-weight (talking kbs for the enitre package) and have pretty impressive performance.
I know that my first instinct when designing a new application is to start designing the database - I seem to simply presume that this is how it will be built. Perhaps in many cases (outside the enterprise, in particular) moving to alternative technlogies has benefits.
Wither the Relational Database?
Wm Grosso's observation that '...fields in structured data are often disjoint sets...' does not, so far as I know, match up to any references on database design, but a similar notion does have a role in the literature of data warehousing. A 'fact table' (or 'event table' as I named it years ago) has by design a very few, and in some cases just one, fact columns. The other columns identify the dimensions of the fact. This database structure gives a database architect the option to limit the domain of an fact column to values selected from disjoint sets. So although at fact column may contain author name, publication date, and other attributes, only the rows of price type would contain a dollar value that would match $19.95, or perhaps range between $0.0 and $19.99. A query that selects all rows with a fact column value in a specific range would then yield the identifiers of the books that have prices in that range.