by John P. Pratt
Sat 24 Nov 2012
The Easy-Tree Database is designed for small databases, where only a few simple reports are needed. Many reports can be produced with one requesting program, much more easily than by using a relational database.
There are many ways to organize data. When there is a lot of data to organize, perhaps listed on many tables, the entire group of data is called a database. Let's look at some basics of databases.
One of the first ways one thinks of to organize data is what is known as a "tree" structure. It is so named because a tree has one trunk, which divides into branches, which in turn divide into twigs, which end in leaves. Each leaf is on only one twig which is on only one branch which attaches to the main trunk. Many sets of data fit that structure very well. Data organized in this manner is often called a "tree", but the official name is "hierarchical database". The word hierarchy first referred to the organization of a religious priesthood into levels of authority, but it has come to mean any such structure, such a the president of a company being over several vice presidents who in turn each supervise many managers who in turn each supervise many workers.
Probably the most common usage to describe tree relationships is the "parent-child" concept. The idea usually is that at each level, one object (the "parent") can be "over" several that belong to it in some way (the "children"). It is common for each parent to have several children, although it is not required to have any. Again it is similar to one trunk have many branches, each of which has many twigs, each of which has many leaves. This article is organized in a tree structure with each main part (numbered 1. or 2. etc) can have many subparts (1.1, 1.2, etc.) Files in a computer are also usually stored in a simple tree, where each parent is called a folder or directory, and can contain several files and also other folders. But the key feature of the tree database is that each child has only one parent.
Note that in a tree, a parent-child relationship can be almost any one-to-many relationship. It often is some sort of superior-inferior relationship such as bigger, older, or supporting, all of which are somewhat implied by the words "parent-child". While that is often true, even in the case of a real tree where a branch is bigger, older, and supports the twigs, that need not be the case. For example, if it is a database of all the people bitten by one mosquito, then the mosquito would be the "parent" and all the bigger and stronger victims would be the "children" because there is only one mosquito and many victims. All that is important for the tree data structure is that the relationship be well-defined, and that it be a one-to-many relationship.
In this article, I've decided to use the parent-child terminology because it also brings with it a rich vocabulary with words like "ancestors" and "descendants" which immediately bring to mind the desired concept of having branched through many "generations". But please keep in mind that the words can describe things like the attributes of a person, such as his height, weight and age. We don't think usually of a person as the parent of his attributes, but for the purposes of the Easy-Tree Database, those are the words used to describe this one-to-many relationship.
Soon it was realized that many sets of data do not fit that simple tree structure. Books in libraries are not all easily categorized to be on one certain subject. I've looked for a book in bookstores that had been put in very different areas by the different stores. So library card catalogs were designed that could have several cards for one book, assigning it to be under several subjects. Moreover, there were card indexes for the titles and for the authors.
And speaking of parents and children, real children always have two parents, rather than just one as in a tree structure. And when one looks at most data, it is rare indeed that it all fits into a simple tree structure. In life, many-to-many relationships are very common. A college teacher has many students, but the students also have many teachers. Back when I was in elementary school, a student had only one teacher, and life was simple. It was a huge jolt to advance to junior high where suddenly we all had seven teachers. Life became much more complicated, just as do most databases describing the so-called real world.
In 1969 a "Network Database" was introduced to handle those problems. A child can have multiple parents in a network database. It was implemented in a few computer schemes, but was almost immediately replaced with what has now become the standard, the "relational database".
In 1970 the "Relational Database" was introduced, which has become nearly universally accepted for all data. Because of the many possible relationships that can exist between a set of objects to be organized, it was designed so that any number of "relations" could be handled. It is based on a set of tables that describe the objects. For a library, one table could list all of the titles of books by one author. Another table could be a list of all authors, with a short biography of each. Another could be a table of who published each book and what it cost. Another could match titles to subjects, and another tell where each book is located on the shelves. With those tables, a special table relating authors to subjects would technically not be necessary because that information could be retrieved by consulting both the author-title table and the title-subject table. But for speed of looking up data, a new table (called an index) could be constructed that would relate authors directly to subjects.
The idea of a relational database is that each table lists some relationship between part of the data, and then other tables (indexes) can add other relationships. Then a standard query language (SQL) is introduced which allows one to ask questions about the database. Even if there had been some obvious structure to the overall set of data (such as a tree or network), that is all removed so that all relationships are treated equally. Any tree, network, or other graph or structure can be created from these "atoms" of relationships, given enough time by SQL programmers (of which I've been one) by writing a special program for each case to construct the desired structure.
The place where the relational database has a chance to shine is where there is lots of data and one is seeking to discover new relationships in the data. If one is curious to know if there is any relationship between resting heart rate and age, one can query the database and draw graphs and see what the data implies. But if one already knows what the principal relationships are, and is not seeking deeper esoteric knowledge, then it is not at all clear that it was a good idea to destroy that obvious relationship in order to use a bunch of little tables. This brings us to the reasons for introducing the Easy-Tree Database.
What led to the development of the Easy-Tree Database (ETD) was that my son began a basketball league and wanted me to create a web page where all the players could follow their statistics. At the time I was an SQL programmer and had created many websites, so he thought, as I did, that it should be an easy task. When I began to think through the problem, I realized that it would really be a lot of work. It would entail writing a very special purpose application just for him. If it could be reused by others then perhaps it could be sold to make it worth doing. But SQL is not reusable code, unless of course, the buyers wanted exactly what I produced. That wasn't likely, so I figured I'd have to charge him about $5,000 for my time. Okay, so I knew better than to suggest that!
So should I have told him that it was just too hard? What happened was that I reviewed the problems of a relational database, then how in my own experience I'd seen at least one solution implemented, and then decided to create a simplified new database based on some of the principles I had learned so that what I would produce would be usable in a wide variety of situations. Let's look at each of those steps.
The worst problem with a relational database is that for many applications, the whole idea looks like a huge step in the wrong direction. All of the natural relationships are first destroyed, with the promise we can not only rebuild them, but also create any new relationships we desire. But was it really necessary to destroy the natural relationships to start with?
For example, consider data that has a natural tree structure. In fact, consider data describing a real apple tree. Putting all the structural data of a tree into a relational database is like cutting the tree into a million pieces, with every part severed from every other part. Each leaf is cut off and given an ID number, with a set of attributes of that leaf, like size, number of lobes, etc. Similarly, all apples, twigs and branches are cut off. Then a set of tables is included to show which leaves went on which twigs, with other tables showing which twigs go on which branches, etc. With the resultant set of many tables it becomes quite a task to reassemble the tree. The good news is that if you want to build a hut from this tree, you can write the computer code to reassemble all the pieces to build a hut, or a anything else, perhaps with the leaves all stuck directly to the trunk, and the apples forming a heart on the door, which is made of branches. But if all you want is to reassemble it back into a tree, and in fact, if you never really wanted to cut up anything but only wanted to understand the tree better, then it might have been better not to cut it all up to start with!
To see how extreme this "cut-up" mentality is, one of the rules is that the order of the rows in a table can have no importance. In other words, there can be no implied relationship in the order of the rows, such as the rows for each leaf being their order on the twig. If you want the order to be important, then you better add another column such as an identification (ID) number, and then sort on it. That puts the last nail in the coffin of any natural order to the data. If it is desired to sort the data on different columns, that is easy to do with an index without actually destroying the original order of the rows in the table.
Another problem is with SQL, the language used to talk to a relational database. It leads to extremely clumsy computer code. It is not elegant, nor does it lead to design patterns which can be reused in other applications. It is a "brute force" method of programming. Yes, it can accomplish the task, but then one has to start all over on the next task. No real progress (like reusable routines) is made. I just did an Internet search for "SQL libraries" and found only a few hits, but they were not actual routines one can use to save having to write the code again. All they did was attempt to simplify code production.
This led me to re-examine the natural "tree database" which is ready made to describe trees. Why was it rejected? Can't the problems of single parentage be overcome in order to take advantage of the fact that the overall shape of the tree is obvious at a glance. Especially if the user promises in blood up front that he only wants to learn about certain relationships.
For example, for my son's basketball league, all he wanted was to know about the stats for each player for each game and also overall stats for each player. But he had no desire to make charts of how scoring was related to the heights of the players, or if the colors of the jerseys affected the number of 3-point shots attempted. A relational database gives the flexibility of asking such questions, making it a great research tool, but the price of that flexibility is that it is just as difficult to ask for the simple stats he desired.
Here allow me to insert my personal experience/bias. I once worked with a team for a company which was rebelling against the relational database. The company's creator/founder had seen these weaknesses and realized how much had been lost when all obvious overall structural relationships in the data were severed. He designed a database which preserved certain "tree-like" structures, and it greatly sped up the creation of certain desired output. Essentially, the problem with network databases had been that they had been very slow, and this company found a way to improve the structure of a network database and increase its speed significantly. That my first introduction to databases.
Then I worked for another company which used a standard relational database. I dutifully learned SQL and saw for myself how incredibly clumsy it is. The worst part was that there was almost no reusable computer programs. Every time someone wanted to do something just a little different, everything had to be rewritten. So there are a lot of people out there writing special code for every customer. It is a great way to employ SQL programmers but a terrible way to make progress.
My proposed Easy-Tree Database is a hybrid between the relationship database and the tree database. It starts with a set of tables just as does the relational. That because that is often what we are given for data. If the data were already truly in a tree form then it could be used directly. But as in the case of the books in a library and most other data, it does not come pre-assembled into a tree. So it starts with a set of tables similar to a relational database.
But the next step is one that distinguishes the ETD. One searches for some sort of "natural tree structure" in the database. If there are exceptions, such as some multiple parents, they can often be handled separately by creating indexes and including them in the tree structure. An index relates different objects in the tree to each other without adding any new information except that relationship. But if some over-arching natural tree structure is evident, then the Easy-Tree Database is a possibility to use. If there is none, then the relational might be better.
Here we can add that in the ETD the parent-child relation is generalized so that a parent can have many types of children, each assigned to a different class. The relationship of the parent to each member of a class is the same, but can be very different for different classes. For example, in a database of all the things a person is responsible to care for, a "parent" might include "children" of his pets, furniture, cars, etc. If it is important to track different attributes of each of those different kinds of cared-for items, or in their relationship to the parent caretaker, then each is assigned to a different "class" in the database. Thus, cars can have makes, models, and colors for attributes, whereas pets might have food required as an attribute. The relationship of the parent to each class can also be different, such as playing with the pets, repairing the furniture and driving the cars. But all can be listed as children of the caretaker in the Easy-Tree Database.
One way to tell if there is a natural tree structure is to ask what kind of reports you are wanting output from the database. If the reports have subject titles, and simple lines of output in parent-child relationships, then that points to the tree structure. This can probably best explained by going through the basketball example step by step.
But if the data is amenable to a tree structure, then the next step is that the user produces one more "Master" table which provides instructions on how to assemble all of the data into one big tree structure. After that step, then all of the reports are generated by using that tree. The original tables are never used again, the opposite idea from the relational database which only uses those tables.
One more example might illustrate the comparison of the ETD to the relational database. If the relational database concept were used to actually physically organize all the books in the library, the division into little tables could be compared to keeping all the books in the boxes they came in. Those might be a box of books on random subjects donated, or perhaps a box of all the books by one author. If the librarian throws up his hands after seeing that there is no way to please all of the people all of the time with any one arrangement, he could leave all the books in the boxes, and make lots of card catalogs that would lead people eventually to find the right book in the right box. If looking for a few books on astronomy, it might take walking all over the library to dig out the books, but it would be possible, although very clumsy and time consuming. That is one (not the best) implementation of a relational database, with one table for each box and then lots of indexes.
But if the librarian realizes that the books on shelves will take up just the same space as in boxes, and could be arranged there in a way to allow most of the people to easily find what they seek most of the time, then why not do that? So in fact, that was done is real libraries, with most books being arranged in a tree structure by subject, with general subjects branching down to more and more detailed subjects. A decimal system is employed to allow new branches to grow as new subjects appear. In this manner, someone looking for books on a subject can easily find most of them all together in one place (or a few places).
We already discussed how the other problems are solved in a library for people not wishing to search by subject. Card indexes were made to find all the books by one author, or books by title. That comes much closer to pleasing all of the people all of the time. The Easy-Tree Database is essentially the same as the library system. It puts everything into one tree, chosen to satisfy the needs of the most common searches, and then provides indexes for the others.
The final result in my case was that a simple tree could indeed handle the basketball program and with code that could be reused for many other purposes. In particular, nearly all the web pages my son desired could be produced with the same reporting program, rather than writing a separate program for every page as would be necessary in SQL. So let's now look at the basketball example in detail to see exactly how it works.
How can the basketball data can be organized into a tree in a manner that facilitates one simple program to produce many of the desired reports? Here I have to note that in even asking this question, I'm breaking one of the fundamental rules I was taught at one place of employment. It was: "Never design a database with any one view of the data in mind." But that is exactly what I want to do, to make it very easy to produce certain reports, namely the only ones required, at the expense of making others much more difficult. It may be true that the customer will always change his mind and want more, but in the simple case at hand I'm not going to allow that! The maxim may turn out to be correct in the general case, but in the short run I got the job done, and my son (wisely) has never asked for more.
So here is a description of the data. Each year there are a group of teams. Each team has a group of players, and players can change what team they are on from year to year. Teams are paired up to each play several games per year at given times and places. Stats are kept on each team member for each game he played. Oh yes, there are a few odds and ends, like a Regular Season and Playoffs, a Player-of-the-Day, and the like. Doesn't sound too hard, does it?
The first thing we note is that there is indeed a natural tree structure to this data. In particular, each year has many teams. That is like each year is the parent of many children teams. Remember, all that parent-child really means is that it is a one-to-many relationship. In one year there are many teams. Similar, one one team there are many players. Another branch of the tree could be that each year also has many games and each game has exactly two teams. So those are also parent-child relationships. And again, going further, each team in a game has many players who played that day, and each of those players had many stats. Moreover, each player also has many more permanent attributes, such as height, weight, and age. Those many attributes can also be thought of as "children" of one player. So this is a natural for a tree-structure database
But what about the parts of the data which do not fit a one-to-many tree structure? For example, not only does one team have many players, but one player can be on many teams. And one game has two teams, but those two teams each play many games. So the relation of players to teams and also teams to games are both many-to-many, which implies a network structure. In fact, that was the reason the network structure was invented.
This is where the Easy-Tree Database insists on using a simple tree anyway! Instead of throwing up our hands and discarding the tree (or rather, chopping up the tree), let's see if we can't still use the tree anyway. Anything with multiple relationships can be listed exactly once in the tree, and then any other relationship can be expressed by referring simply to that unique position in the tree. For example, there can be one list of all players who have ever participated. That can have their personal permanent information, such as height, high school attended and year of graduation. In a similar manner, there could be only one list of all teams, and their attributes, such as names and jersey colors.
So how can the yearly team roster be handled in a tree format? It's easy. Under each year where the roster of teams for that year are to be listed, in the place where the team name would be, only the ID number is entered that refers to the name on the team list. Similarly, where the players on each team that year would be listed, only the ID number uniquely referring to each player is listed. Thus extra tables (indexes) can be put in the tree which supply the many-to-many relationships. This is the same as what is done in a relational database, and both would used an almost identical table. The point here is that the index table is easily included in the tree.
Now let's note one more incentive to stick to a simple tree, namely the ease of generating desired reports.
Another important point is to look at is what are the required reports. Would the natural tree structure facilitate creating those reports?
In this case the answer for at least some of the reports is yes! Suppose we had one reporting program that could produce output that looked like this structure, where actual title and entries would come from the data:
|Dad 1||Son 1||Son 2||Son 3|
|Dad 2||Son 1||Son 2||Son 3|
|Dad 3||Son 1||Son 2||Son 3|
|Dad 1||Son 1||Son 2||Son 3|
|Dad 2||Son 1||Son 2||Son 3|
|Dad 3||Son 1||Son 2||Son 3|
|Dad 1||Son 1||Son 2||Son 3|
|Dad 2||Son 1||Son 2||Son 3|
|Dad 3||Son 1||Son 2||Son 3|
Of course, the report program would handle cases other than three of each level, that number was just chosen for the example. That one table generating program could produce many different reports by simply indicating where to begin in the tree and what levels of descendants to include. For example, it can produce all of the following reports which are three of those requested by my son.
|Team||Player||Height||High School||Yr. Grad.|
|Al||6' 1"||Mt. View||'02|
Now note that the next table of very different data is in exactly the same format.
|Tue Mar 4, 2003|
|Thur Mar 6, 2003|
|Tue Mar 11, 2003|
And again, the game stats in the same format. (Player stat totals will be handled separately.)
Note also how much can be implied in the ordering of the data. The winning team can be listed first, so that no special report is needed to declare the winner in the Game Stats table. Similarly, the stats can be listed in the order of points scored. If other orderings are desired, the one report program could have the feature of sorting the rows on any column. Of course other reports such as a list of all players and a list of all teams are also easily produced with the same routine.
It is probably only necessary to read this section if you are actually planning to use this database for your own purpose. It contains a detailed description of how the basic data tables are created, and then a special Master Table is made showing the Tree-making program how to stitch all of the data together into one big tree.
So let us look at exactly what tables were produced in order to create the Easy-Tree. These table mostly reflect the method of gathering the data. For example, the table of game stats is added to after every game by simply adding rows with the stats of each player for each game.
Indented Format. The format for every table is that they are in a simple text format, with items separated by commas, semi-colons or other chosen punctuation mark (called the "delimiter"). The first line contains the column headers for the table. The names can either refer to new data found in this table or to the ID numbers from other tables.
Something very different from relational database tables is that an indented format is allowed. If the first item on a line is skipped, so that it starts with a delimiter, then the item above is implied.
Here is a list of the raw tables, which apply to all years. They are all in simple delimited (usually with comma) text files, with each beginning with an ID number (or string of characters) to that item:
There is also a set of tables produced for every year. If any of the above items had changed from year to year, such as placed to play, then those tables would have been included in the lists below. Here are the annual tables:
All that is left is that the reporting script is called with the needed parameters to point to the exact place in the tree to begin the report, along with information about skipping any generations, or providing totals/percentages for stats. One more thing needed is to handle reports not directly generated by simply traversing generations on the tree. The main example is a report of the ranking of all players for a given statistic, such as points scored, or games attended. I'll also defer writing that section because I may change the design of how that is handled.