Team LiB   Previous Section   Next Section

3.4 Reporting Region Boundaries

3.4.1 Problem

As in the previous recipe, you want to find regions in the data. However, now you want to report only the boundaries of the regions and not all members of the region. Reporting only region boundaries is useful when a data set is large and/or when the sizes of any regions are expected to be large.

3.4.2 Solution

To find region boundaries in our ProductionLine data, use the following query:

SELECT p1.ContainerId RegBeg, p2.ContainerId RegEnd
FROM ProductionLine p1, ProductionLine p2
WHERE (p1.ContainerId < p2.ContainerId) AND
   NOT EXISTS(SELECT * FROM ProductionLine p3 
      WHERE (p3.Purity!=100 AND 
      p3.ContainerId BETWEEN p1.ContainerId AND p2.ContainerId) 
      OR (p3.ContainerId=p1.ContainerId-1 AND p3.Purity=100) 
      OR (p3.ContainerId=p2.ContainerId+1 AND p3.Purity=100))

RegBeg      RegEnd      
----------- ----------- 
1           2
10          12

3.4.3 Discussion

The solution presented here is based on the idea first published by Rozenshtein, Abramovich, and Birger (Optimizing Transact-SQL, Fremont, SQL Forum Press:1995) and is still considered a classic.

Just as before, we need two tables to find neighboring rows. We name them p1 and p2 and produce a join between them. We then write a "less than" condition in the main query's WHERE clause to make sure that any p1 row represents an element with an index lower than the corresponding p2 row:

WHERE (p1.ContainerId < p2.ContainerId) AND

Then we write a subquery named p3, which makes use of a third instance of the ProductionLine table. For every candidate pair from the outermost query, we verify that there are no rows between candidates with a purity other than 100:

NOT EXISTS(SELECT * FROM ProductionLine p3 
      WHERE (p3.Purity!=100 AND 
      p3.ContainerId BETWEEN p1.ContainerId AND p2.ContainerId)

If there is even one row returned by this subquery, that means the region is broken (i.e., it is not a continuous region of purity=100), and the candidate pair should be discarded. However, this isn't quite enough to get us to our desired result, because the subquery does not eliminate smaller regions that are wholly contained within larger regions. For example, in the region between 10 and 12, we would also find regions 10-11 and 11-12. The solution is in two additional conditions at the end of the subquery that check on the lower and higher boundaries for possible neighbors that comply with the region requirement:

(p3.ContainerId=p1.ContainerId-1 AND p3.Purity=100) OR
(p3.ContainerId=p2.ContainerId+1 AND p3.Purity=100)

This final set of conditions ensures that only regions that cannot be extended anymore, those that are the largest, are reported.

    Team LiB   Previous Section   Next Section