Sunday, April 30, 2006

Mathematics Genealogy: Indexing

Exploring some of the entries in the Mathematics Genealogy project has led me learn about some interesting and unexpected connections, but it's also highlighted quite a number of limitations regarding accuracy and promptness of updates. Saying this is really just an indication that if you offer something good, then people will be looking for more!

One particular issue is that the total number of descendants requires a separate process to run as explained by the FAQ, which says:

Because of the time required to run the descendant counting program, it is only run once per week (early morning US Central Time on Sundays), while our data is updated nightly.

That surprises me somewhat as with around 100,000 people with not very many details stored per user and few relations, it's not a big complex database. The issue here is probably that it's a relational database and the advisor-student relationship is hierarchical, somewhat like a tree structure. However, it's not a tree because of having multiple parents (multiple advisors), but rather a directed graph, where the nodes represent the mathematicians and the edges correspond to the advisory relationship. [I'm taking definitions from MathWorld, an encyclopaedia that provides clear and nicely formatted explanations with diagrams]. Further, I think there is a fair chance that it would be more general than a simple directed graph from the scenario of the same supervisor supervising a candidate in more than one thesis - although it might sound unlikely today, it is quite plausible a few centuries ago, when a researcher could be at the forefront in a number of fields. I'd also expect it to be an oriented graph, in that supervision is expected to go in one direction, but it's not inconceivable that a student produces a thesis separately in two fields under two supervisors and then shares the knowledge back across.

Returning to the problem of counting, hierarchical relationships are easy to model in a relational database, but retrieving even summary counts may mean a lot of spidering through the hierarchy, which can be very slow. The key consideration is how to index the database. I'm not a database expert, but have seen this issue in the daily work I undertake as an administrator of WebLearn an e-learning system based on software called bodington, which is essentially a web database application. The system contains various resources, arranged hierarchically, in trees, so more specific than the genealogy case.

Jon Maber, the original developer, had started work on Bodington in the mid 90s and had thought about the issue of efficent queries about resources within a given branch; he reviewed approaches to indexing and decided to adopt the tree visitation model devised by Joe Coelko. Celko had given consideration to this graph theory problem and came up with SQL for Smarties: A Look at SQL Trees, an article that appeared in DBMS, March 1996 . Basically, each node or vertex has two indices - left and right - that are numbered according to a complete tour of all the nodes, visiting each twice. It means that selecting the number of descendants of a resource a simple SQL statement that subtracts one index from another at the given node. However, there is a trade-off in that every time you update the database you need to update the index, so if lots of changes are being made it can be a major performance issue.

Celko's solution may not be appropriate in this case, but it looks like the of approach that may lead to a suitable index that will allow real-time queries of how many descendants. The article was published more than 10 years ago, so I expect research has progressed a fair bit since then.

1 comment:

Anonymous said...

Dis you find any progress to Mr Celko's theory ?

I didn't...since XML shreding database capabilities are still on their begining.