Resolving Database Dependencies Using Topological Graph Sorting

In past projects at my former employer, I had to deal with a lot of historical grown (duplicated) code and databases. Some source code files contained of nearly 2,000 lines of code and contained a lot of duplicated stuff. The databases were in a similar situation. Much worse in this scenario was the violation of nearly all database normalisation forms. Many tables had duplicated information or non-atomic values. And we even did not know which table was still in use!

At this time, I was wondering how to get all database dependencies together without touching each file (the project had more than 100 source code files, a couple of shell scripts and some reports). But I never had the chance to think deeper about this problem since the project deadlines were hard and work was waiting for me. But I would like to make up for that.

A quick Scan

In an unstructured project, you have no change to give a clear statement about which tables are still in use. One way would be to check the data in each table and list their insert/update dates (if available). If the table is not touched a long time, chances are high that it is not in use anymore. Another approach is to scan all files for all table names (using linux’s grep command, for instance). You will definetely miss some tables or consider some that are in use. However this seems to be a good approach if you make backups!

Database Table Dependencies

Luckily, tables have foreign key constraints. Almost every RDMS system stores those information in any system databases. With some simple queries, you can figure out which table depends on another. MySQL for instance has the following query to get all referenced tables:

SELECT TABLE_NAME,
       REFERENCED_TABLE_NAME
FROM INFORMATION_SCHEMA.KEY_COLUMN_USAGE
WHERE REFERENCED_TABLE_SCHEMA = '*schema_name*'
  AND REFERENCED_TABLE_NAME = '*table_name*';

SELECT TABLE_NAME,

REFERENCED_TABLE_NAME

FROM INFORMATION_SCHEMA.KEY_COLUMN_USAGE

WHERE REFERENCED_TABLE_SCHEMA = '*schema_name*'

AND REFERENCED_TABLE_NAME = '*table_name*';

Using this query, all dependencies can retrieved as a list that is necessary for the next step: sorting!

Topological Sorting

Let’s assume we have figured out all tables that are not in use and want to drop them (or move them to anywhere else, for a while). Well, simply firing a drop table command will fail since there are foreign key (or other) dependencies. We need to figure out an order that shows which table can be droped (or moved) first since it has no dependencies. Droping those tables may result in deleting dependencies of other tables. If there is no circular dependency, we can drop all tables step-by-step until there is no table available.

Having a list of tables and their dependencies, we first create a directed acyclic graph. We consider the tables “A”, “B”, “C”, “D”, “E” and “F”. Further, the dependencies are defined as: {A, D} , {F, B}, {B, D}, {F, A} and {D, C} (the second table depends on the first). These definitions results in the following graph:

Directed, Acyclic Graph
Question 7, Chapter 4, CTCI

These information can be used to create a list of order using topological sorting algorithms. The following steps applied to all nodes in the graph are necessary in order to create the list:

if not visited before, add the node to the list of visited nodes
if node has no adjacent nodes left, add node to a stack
if node have adjacent nodes, go back to step 1

Step 3. ensures that there are no adjacent nodes left and we are at the end of our graph (in the above example, that would be node “C”). After adding “C” to the stack, the same is done for all other nodes. Notice that there is a stack used as the data structure. In the above diagram, node “C” is the last node that can be deleted since all other nodes depends on “C”. However the LIFO property of stacks ensures that we retrieve all values in reverse order, which means that “C” is retrieved as the last element.

The source code for topological sorting (from TopologicalSort class in PHPAlgorithms) is like:

public function sort(AbstractGraph $graph): Stack {
    if (!$graph instanceof DirectedGraph) {
        throw new InvalidGraphTypeException("topological sorting is only valid for directed graphs");
    }
    if ($graph->hasCycle()) {
        throw new InvalidGraphTypeException("the graph has a cycle. Topological sorting is only possible for directed acyclic graphs");
    }
    $allNodes = $graph->getNodes();
    $result = new Stack();
    $visited = new ArrayList();

    /*
     * starting with any node, it is first necessary to determine
     * whether the node is already visited. If not, a helper method is
     * called, which has the business logic.
     */
    /** @var Node $node */
    foreach ($allNodes as $node) {
        /*
         * skip node if it is already visited. Notice
         * that $visited and $result are passed by reference
         */
        if ($visited->containsValue($node)) continue;
        $this->_sort($node, $result, $visited);
    }
    return $result;
}

/**
 * @param Node      $node
 * @param Stack     $result
 * @param ArrayList $visited
 * @return void
 * @throws InvalidGraphTypeException
 * @throws \doganoo\PHPAlgorithms\common\Exception\InvalidKeyTypeException
 * @throws \doganoo\PHPAlgorithms\common\Exception\UnsupportedKeyTypeException
 */
private function _sort(Node $node, Stack &$result, ArrayList &$visited): void {
    $visited->add($node);
    /*
     * continue with the adjacent nodes of $node. The method calls itself
     * recursively until there are no nodes (the node is a leaf). Then,
     * the node is added to the stack (which is passed by reference)
     */
    /** @var Node $adjacent */
    foreach ($node->getAdjacents() as $adjacent) {
        //skip if already visited
        if ($visited->containsValue($adjacent)) continue;
        //recursive call with the adjacent node of $node
        $this->_sort($adjacent, $result, $visited);
    }
    //add node to result stack
    $result->push($node);
}

public function sort(AbstractGraph $graph): Stack {

if (!$graph instanceof DirectedGraph) {

throw new InvalidGraphTypeException("topological sorting is only valid for directed graphs");

}

if ($graph->hasCycle()) {

throw new InvalidGraphTypeException("the graph has a cycle. Topological sorting is only possible for directed acyclic graphs");

}

$allNodes = $graph->getNodes();

$result = new Stack();

$visited = new ArrayList();

* starting with any node, it is first necessary to determine

* whether the node is already visited. If not, a helper method is

* called, which has the business logic.

/** @var Node $node */

foreach ($allNodes as $node) {

* skip node if it is already visited. Notice

* that $visited and $result are passed by reference

if ($visited->containsValue($node)) continue;

$this->_sort($node, $result, $visited);

}

return $result;

}

/**

* @param Node $node

* @param Stack $result

* @param ArrayList $visited

* @return void

* @throws InvalidGraphTypeException

* @throws \doganoo\PHPAlgorithms\common\Exception\InvalidKeyTypeException

* @throws \doganoo\PHPAlgorithms\common\Exception\UnsupportedKeyTypeException

private function _sort(Node $node, Stack &$result, ArrayList &$visited): void {

$visited->add($node);

* continue with the adjacent nodes of $node. The method calls itself

* recursively until there are no nodes (the node is a leaf). Then,

* the node is added to the stack (which is passed by reference)

/** @var Node $adjacent */

foreach ($node->getAdjacents() as $adjacent) {

//skip if already visited

if ($visited->containsValue($adjacent)) continue;

//recursive call with the adjacent node of $node

$this->_sort($adjacent, $result, $visited);

}

//add node to result stack

$result->push($node);

}

Printing the stack in LIFO results in: “F”, “E”, “B”, “A”, “D”, “C”. Notice that this is not the only valid order. It is okay if you have a different one until it matches the above dependencies.

Conclusion

At the time working in the project stated above it was a kind of black magic to solve dependencies on a mathematical way. However, this was not as magical as expected at that time 😊

I love to adopt theoretical definitions to practical solutions. One of the most important things I have learned through algorithm and data structure problem solving questions is to find patterns and adopt. Question “Build Order” in Cracking The Coding Interview and this database dependency solving blog post is a very simple example for that. But it shows exactly how having strong skills in algorithms and data structures can help in your daily life.

Topological sorting is now a part of PHPAlgorithms, a library that summarises my so far learned algorithm and data structures. I am proud how the library grows 😊

The database dependency solver algorithm, whose core consists in fact of only Topological Sorting, is outsorced in a seperate project and ~~soon available on GitHub!~~ “Dependency-Solver” is now on GitHub.

Maybe, in a project in the future, I will face a similar problem like this – and can adapt!