SQL Query to Detect Data Islands and Gaps with Boundary Values
Detecting data islands and gaps, together with their boundaries, is a common problem for SQL database developers. In this SQL tutorial, I want to show how data islands can easily be identified by database programmers using the Row_Number() function together with the Min() and Max() functions with the Partition By clause.
What is a Data Gap and Data Island for SQL Developer
Let's start with a definition of data gaps.
What is a gap?
Assume that you have a table column with data type int to store integer values.
And the table data has values like 1,2,3,7,8,9,11,20,21 if we sort in ascending order.
Did you notice that some integer values are missing from the sorted numbers: 4, 5 and 6, then 10, and all numbers from 12 to 19 inclusive?
These missing integer values form the data gaps in our column values.
If SQL developers understand the definition of data gaps, it is easier to understand and identify data islands.
Again let's look at the same numbers in the sorted list: 1,2,3,7,8,9,11,20,21.
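We can simply define a data island as a group of consecutive values in a sorted list that is separated from its neighbors by data gaps. The first and the last island have a gap on one side only.
This definition may still sound complex, so let's walk through the sample.
1,2,3 form our first data island in the given sample data.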
7,8,9 integer values form the second data island.
11 itself is a data island created by a single integer value.
The last data island in our sample data list is formed of the integers 20 and 21.
I hope the definition of data gaps and data islands is now clearer in SQL programmers' minds.
Let's do the same logical thinking on some other sample data formed of date values.
I have the following sample data: '2018-01-01', '2018-01-02', '2018-01-04', '2018-01-07', '2018-01-08', '2018-01-09', '2018-01-13', '2018-01-14', '2018-01-16'.
The sample date data is sorted in ascending order again.
3rd of January is missing. So we have identified the first data gap.
'2018-01-05' and '2018-01-06' create another data gap in the ordered dates data.
'2018-01-10', '2018-01-11' and '2018-01-12' form the third data gap in our list.
The last data gap in the sample dates list is '2018-01-15'.
Now it is easier to fetch the data islands from the date column in our sample database table.
The 1st and 2nd of January form the first data island in the sample data.
The 4th of January is a single-date data island.
The 7th, 8th and 9th of January form a larger data island in the dates list.
The 13th and 14th of the month form another data island.
And the last data island is formed of only one date, the 16th of January.
Now let's create a SQL database table and populate it with sample data.
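A minimal sketch of such a table, assuming SQL Server. The table name NumberSeries and the column names id and val are hypothetical choices made for illustration; only the nine integer values come from the example above.

```sql
-- Hypothetical sample table; table and column names are assumptions.
CREATE TABLE NumberSeries (
    id  INT IDENTITY(1,1) PRIMARY KEY,  -- surrogate key, filled automatically
    val INT NOT NULL                    -- the integer values to analyse
);

-- Insert the sample integers in ascending order.
INSERT INTO NumberSeries (val)
VALUES (1), (2), (3), (7), (8), (9), (11), (20), (21);
```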
How to Identify Data Islands and Data Gaps in Integer Table Column
On our sample database table, execute the following SQL Select query.
Please note that I commented out the id column in the Select list, because its values would be the same as the Row_Number() function output in this case, and I did not want to confuse you.
In our query, database developers will use the Row_Number() function output.
You can see it in the starting_point column calculation.
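A sketch of what such a query could look like is given here. To keep it runnable on its own, the sample rows are supplied inline through a CTE named NumberSeries that stands in for the sample table; that name, the column names id and val, and the rn alias are assumptions, while starting_point is the column name used in this tutorial.

```sql
-- Inline stand-in for the sample table (hypothetical names id and val).
WITH NumberSeries (id, val) AS (
    SELECT id, val
    FROM (VALUES (1,1), (2,2), (3,3), (4,7), (5,8),
                 (6,9), (7,11), (8,20), (9,21)) AS v(id, val)
)
SELECT
    -- id,  -- equals the ROW_NUMBER() output for this data, so it is left out
    val,
    ROW_NUMBER() OVER (ORDER BY val) AS rn,
    -- value minus its position in the sorted list
    val - ROW_NUMBER() OVER (ORDER BY val) AS starting_point
FROM NumberSeries
ORDER BY val;
```

For the sample values, starting_point comes out as 0 for 1,2,3, as 3 for 7,8,9, as 4 for 11, and as 12 for 20 and 21.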
Let's check the output of the above SQL Select query execution and try to understand how we can use this result to identify data islands and their boundaries.
This output shows that the numbers of the same data island have the same value in the last column.
For example, 1,2,3, which are in the same data island, have 0 in the last column.
Similarly, 7,8,9 of the same data island have 3 in the last column, and so on.
This result follows from a very simple mathematical fact.
If the numbers are listed in sequence one after another without any gap, each value is 1 greater than the previous one.
The Row_Number() function value also increases by 1 from row to row.
So, as seen in the starting_point column calculation, if we subtract these two figures, the result is the same for all rows of an unbroken sequence. For 7, 8 and 9 the row numbers are 4, 5 and 6, so each difference is 3.
Let's go one step further, but first note that I name the last column groupid, because each of its values actually identifies a data island.
Execute the following SQL Select statement, where I used a SQL CTE expression just like a subselect statement and the STRING_AGG function to concatenate the values of the same data island as a comma-separated list.
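A hedged sketch of such a statement, assuming SQL Server 2017 or later for STRING_AGG. The sample values are supplied inline through a CTE named NumberSeries so the query runs on its own; the names NumberSeries, val, Grouped and island_values are assumptions, and groupid is the column name used in the text.

```sql
-- Inline stand-in for the sample table (hypothetical name and column).
WITH NumberSeries (val) AS (
    SELECT val
    FROM (VALUES (1), (2), (3), (7), (8), (9), (11), (20), (21)) AS v(val)
),
Grouped AS (
    -- value minus row number is constant inside each island
    SELECT val,
           val - ROW_NUMBER() OVER (ORDER BY val) AS groupid
    FROM NumberSeries
)
SELECT
    groupid,
    -- concatenate the members of each island in ascending order
    STRING_AGG(CAST(val AS VARCHAR(10)), ',')
        WITHIN GROUP (ORDER BY val) AS island_values
FROM Grouped
GROUP BY groupid
ORDER BY groupid;
```

With the sample data this yields four rows, one per island: 1,2,3 then 7,8,9 then 11 then 20,21.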
Another benefit of the first SQL Select statement is that it gives SQL developers the boundaries of each data island.
Let's see how we can explicitly show the boundary values, or limits, of each data island in our sample database table data.
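One way to sketch this is to aggregate per groupid with MIN() and MAX(). The inline NumberSeries CTE again stands in for the sample table; its name, the val column, and the lower_limit and upper_limit aliases are assumptions.

```sql
-- Inline stand-in for the sample table (hypothetical name and column).
WITH NumberSeries (val) AS (
    SELECT val
    FROM (VALUES (1), (2), (3), (7), (8), (9), (11), (20), (21)) AS v(val)
),
Grouped AS (
    SELECT val,
           val - ROW_NUMBER() OVER (ORDER BY val) AS groupid
    FROM NumberSeries
)
SELECT
    groupid,
    MIN(val) AS lower_limit,  -- first value of the island
    MAX(val) AS upper_limit   -- last value of the island
FROM Grouped
GROUP BY groupid
ORDER BY lower_limit;
```

For the sample data the boundaries are 1 to 3, 7 to 9, 11 to 11, and 20 to 21.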
The same output can be obtained by executing the following SQL query, where the MIN() and MAX() functions are used with the OVER (PARTITION BY ...) clause.
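A possible form of that query is sketched here. Window functions return one row per input row, so this sketch adds DISTINCT to collapse the result to one row per island; that choice, like the names NumberSeries, val, lower_limit and upper_limit, is an assumption.

```sql
-- Inline stand-in for the sample table (hypothetical name and column).
WITH NumberSeries (val) AS (
    SELECT val
    FROM (VALUES (1), (2), (3), (7), (8), (9), (11), (20), (21)) AS v(val)
),
Grouped AS (
    SELECT val,
           val - ROW_NUMBER() OVER (ORDER BY val) AS groupid
    FROM NumberSeries
)
SELECT DISTINCT
    groupid,
    MIN(val) OVER (PARTITION BY groupid) AS lower_limit,
    MAX(val) OVER (PARTITION BY groupid) AS upper_limit
FROM Grouped
ORDER BY lower_limit;
```

Leaving out DISTINCT and adding val to the Select list shows every value next to the limits of its own island.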
As seen in the above SQL query outputs, it is this easy to find the lower limit and upper limit of a data island.
Of course, we can merge all of the above SQL queries into a single one to get all features of a data island. Please execute the following database query to see the results.
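One way the merged query might look is sketched here, assuming SQL Server 2017 or later. It returns the group identifier, both limits, the number of members, and the member list for each island. The inline NumberSeries CTE and every alias except groupid are hypothetical names.

```sql
-- Inline stand-in for the sample table (hypothetical name and column).
WITH NumberSeries (val) AS (
    SELECT val
    FROM (VALUES (1), (2), (3), (7), (8), (9), (11), (20), (21)) AS v(val)
),
Grouped AS (
    SELECT val,
           val - ROW_NUMBER() OVER (ORDER BY val) AS groupid
    FROM NumberSeries
)
SELECT
    groupid,
    MIN(val) AS lower_limit,   -- island start
    MAX(val) AS upper_limit,   -- island end
    COUNT(*) AS island_size,   -- number of values in the island
    STRING_AGG(CAST(val AS VARCHAR(10)), ',')
        WITHIN GROUP (ORDER BY val) AS island_values
FROM Grouped
GROUP BY groupid
ORDER BY lower_limit;
```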
In the following days, I will add a similar section for database developers on finding data gaps in date values, so that SQL programmers can identify data islands and gaps in date column values.