Article:   Hierarchical SQL
Subject:   You might want to run the numbers ...
Date:   2004-08-07 10:14:24
From:   --CELKO--
In most hierarchical problems, more time is spent on queries than on insert/delete operations. If you need to do random insertions, then use the nested interval model; it combines the insertion speed of the adjacency list with the subtree-search speed of the nested sets, but trades this for more complex procedures.


The adjacency list model is as fast as possible on insertion; a new node is a single INSERT INTO statement. Deleting a node is the real problem, because you have to reconnect its subordinates back into the tree. This model also requires procedural code to get data out of it, both for hierarchical aggregations and for subtree searches.
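
As a rough sketch (the table and column names OrgChart, node_id, parent_id, and the key values are mine, for illustration only):

    CREATE TABLE OrgChart
    (node_id INTEGER NOT NULL PRIMARY KEY,
     parent_id INTEGER REFERENCES OrgChart (node_id));

    -- insertion: one statement attaches the new node to its parent
    INSERT INTO OrgChart (node_id, parent_id) VALUES (42, 7);

    -- deletion: first re-attach the subordinates to the deleted
    -- node's parent, then remove the node itself
    UPDATE OrgChart
       SET parent_id = (SELECT parent_id
                          FROM OrgChart
                         WHERE node_id = 42)
     WHERE parent_id = 42;

    DELETE FROM OrgChart WHERE node_id = 42;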


The nested sets model is meant for queries, but it updates much faster than people first think. The tree structure is in a table with three columns (node_id, lft, rgt), so a large tree fits into a few physical disk pages, or fits completely into a covering index.
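
A minimal skeleton of that table might look like this (the table name Tree is mine, for illustration):

    CREATE TABLE Tree
    (node_id INTEGER NOT NULL PRIMARY KEY,
     lft INTEGER NOT NULL UNIQUE,
     rgt INTEGER NOT NULL UNIQUE);

The (lft, rgt) pairs are the nested set numbers; the shape of the whole hierarchy is derived from them.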


If you have an ordered index (what Sybase and SQL Server call a "clustered" index), updates to the structure are done at table-scan speeds.
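
In SQL Server or Sybase terms that would be something like the statement below; this is only a sketch, and it assumes no other index on the table (such as the primary key) has already taken the clustered slot:

    CREATE CLUSTERED INDEX Tree_lft_rgt ON Tree (lft, rgt);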


I have tested nested sets on roughly 250,000 nodes at depths of 10 and 15 levels, and the numbers were quite good. The first test was to sum a numeric value over all subordinates at each node, with a grand total at the root. The second test was to randomly delete and insert 20,000 nodes, then re-organize the tree.
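
The first test amounts to the classic nested sets aggregation join; a sketch, assuming the Tree table above also carries a numeric column called amount (my name, for illustration):

    SELECT P.node_id,
           SUM(C.amount) AS subtree_total   -- includes the node itself
      FROM Tree AS P, Tree AS C
     WHERE C.lft BETWEEN P.lft AND P.rgt
     GROUP BY P.node_id;

The row for the root node in that result set is the grand total.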


Depth does not matter in the nested sets model; the cost depends on the size of each subtree, measured by (rgt - lft + 1). Compared to the adjacency list, it was over an order of magnitude faster on subtree and aggregation queries, and took a bit over twice as long on insertion, once the disk pages were in cache.
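
That is because fetching or aggregating a subtree is a single range predicate, no matter how deep the subtree goes; a sketch, with :root_lft and :root_rgt standing for the (lft, rgt) values of the chosen subtree root:

    SELECT C.node_id, C.lft, C.rgt
      FROM Tree AS C
     WHERE C.lft BETWEEN :root_lft AND :root_rgt;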


In all hierarchy models, the constraints required to maintain a hierarchy are a bit complicated and are based on graph theory. The current crop of programmers does not seem to know about graph theory or constraints, so they do not bother to write any, and then the data integrity gets trashed.
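
A couple of the simplest ones, written against the illustrative tables above:

    -- adjacency list: a node cannot be its own parent
    ALTER TABLE OrgChart
      ADD CONSTRAINT no_self_reference CHECK (node_id <> parent_id);

    -- nested sets: every (lft, rgt) pair must be a proper interval
    ALTER TABLE Tree
      ADD CONSTRAINT proper_interval CHECK (lft < rgt);

Ruling out cycles, gaps, and overlapping intervals takes more than that; standard SQL would let you write CHECK constraints with subqueries for it, but few products implement them, so triggers or procedural code are the usual fallback.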