Detect Data Islands of Nodes and Edges using SQL on SQL Server
SQL developers can query hierarchy data using recursive CTE expression on SQL Server databases easily. Although SQL recursive CTE queries are advanced techniques for database developers which is not supported by all database providers like SAP HANA, it seems to be an option to work with connected nodes just like graph data. For example, to separate nodes connected to each other by edges into groups like islands seperating connected ones from others, many SQL programmers get stuck in recursive SQL queries.
In this SQL tutorial, I want to share a solution for SQL database developers showing how to group nodes which are connected into islands. I use the island term as grouped nodes which are related to each other but totally separated from nodes in other islands.
Following image is showing a group of nodes and relation between nodes (fromNode connected to toNode).
This fromNode and toNode naming calls a similarity for hierarchy data. On the other hand, it seems to be impossible for me at least to solve this grouping nodes into islands problem with recursive CTE query.
As seen in below image, nodes with same color have a relation and there is a path (ignoring the arrow direction and assuming relation is bidirectional) from one node in island to an arbitrary node in the same island.
My sample data has four islands, now I want to build a SQL script that will identify these four islands including the nodes and edges within each data island.
Before you execute the given SQL code script, maybe I can tell you how it works for you to make you understand it easily.
First of all, I create a temporary table and store my sample data in it.
This is the ##Edges table where I insert data rows representing edges (relation of fromNodes and toNodes)
A second temp table I use for calculating groups in SQL database is ##RelatedNodes.
This table has nodes field where I add every node (fromNode, toNode) which are related to each other.
If we continue to read the SQL code, you will see I create a SQL cursor and loop through all edges of my sample data and try to place these nodes in one of the rows of ##RelatedNodes temp table.
Basically, ##RelatedNodes temporary table is a sandbox where I insert independent edge data as a separate row.
If there is a relation of the current cursor row nodes with only one of the rows of the temp table, then I concatenate the loop nodes to the related data group in ##RelatedNodes temp table
One of the cases that database programmers cannot handle is the situation where the current row of the cursor has a relation with two of the rows (two previously calculated islands).
In this case the current loop data (fromNode and toNode) is a bridge between two different nodes.
This case requires removal of two different islands by replacing them with a single island representation.
All these calculations are managed within the SQL cursor.
The number of relation of the loop data is calculated into @cnt variable.
If you place your pseudo-code within the loop or query ##RelatedNodes temp table to understand how SQL script code works, you will see there are repeating nodes in table's nodes field concatenated and separated by ","" comma character.'
I did not follow a path to remove these repeats since it will cause more complexity in SQL code and might cause the performance be worse.
And one last note for performance. Although I mentioned a few times that recursive SQL CTE query is not the way to solve this problem, running it once and grouping hierarhical data at the beginning of the script can gain performance. After the first detection of related nodes, you will see recursive CTE fails to combine some data. At this point the following structure can be executed to fetch additional relations, to detect bridges and to merge different data islands into a single group of nodes.
Here is the SQL codes that can be executed on SQL Server database
When above SQL code is executed on a SQL Server database, it will create sample data, calculate node relations in SQL cursor code and display grouped data as islands identified with GroupId in execution output
To summarize, I know executing SQL scripts with cursors is not a best practise for database developers. I have to admit that I will accept a solution using a set based operation instead of SQL cursors and looping through each data row to perform grouping and building relations between nodes. But I worked to solve this problem, seperating nodes and edges into data islands as I shared in the image at the beginning of this SQL tutorial but failed to resolve the issue without the use of SQL cursors.
Please feel free to inform me for additional solutions of this SQL problem. I will be pleased especially if there is a graph solution of this grouping and categorizing of nodes and edges.