3.4 Reporting Region Boundaries
3.4.1 Problem
As in the previous recipe, you want to
find regions in the data. However,
now you want to report only the boundaries of the regions and not all
members of the region. Reporting only region boundaries is useful
when a data set is large and/or when the sizes of any regions are
expected to be large.
3.4.2 Solution
To find region boundaries in our ProductionLine data, use the
following query:
SELECT p1.ContainerId RegBeg, p2.ContainerId RegEnd
FROM ProductionLine p1, ProductionLine p2
WHERE (p1.ContainerId < p2.ContainerId) AND
NOT EXISTS(SELECT * FROM ProductionLine p3
WHERE (p3.Purity!=100 AND
p3.ContainerId BETWEEN p1.ContainerId AND p2.ContainerId)
OR (p3.ContainerId=p1.ContainerId-1 AND p3.Purity=100)
OR (p3.ContainerId=p2.ContainerId+1 AND p3.Purity=100))
RegBeg RegEnd
----------- -----------
1 2
10 12
3.4.3 Discussion
The solution presented here is based on the idea first published by
Rozenshtein, Abramovich, and Birger (Optimizing
Transact-SQL, Fremont, SQL Forum Press:1995) and is still
considered a classic.
Just as before, we need two tables to find neighboring rows. We name
them p1 and p2 and produce a join between them. We then write a
"less than" condition in the main
query's WHERE clause to make sure that any p1 row
represents an element with an index lower than the corresponding p2
row:
WHERE (p1.ContainerId < p2.ContainerId) AND
Then we write a subquery named p3, which makes use of a third
instance of the ProductionLine table. For every candidate pair from
the outermost query, we verify that there are no rows between
candidates with a purity other than 100:
NOT EXISTS(SELECT * FROM ProductionLine p3
WHERE (p3.Purity!=100 AND
p3.ContainerId BETWEEN p1.ContainerId AND p2.ContainerId)
If there is even one row returned by this subquery, that means the
region is broken (i.e., it is not a continuous region of purity=100),
and the candidate pair should be discarded. However, this
isn't quite enough to get us to our desired result,
because the subquery does not eliminate smaller regions that are
wholly contained within larger regions. For example, in the region
between 10 and 12, we would also find regions 10-11 and 11-12. The
solution is in two additional conditions at the end of the subquery
that check on the lower and higher boundaries for possible neighbors
that comply with the region requirement:
(p3.ContainerId=p1.ContainerId-1 AND p3.Purity=100) OR
(p3.ContainerId=p2.ContainerId+1 AND p3.Purity=100)
This final set of conditions ensures that only regions that cannot be
extended anymore, those that are the largest, are reported.
|