3.7 Working with Sequences
3.7.1 Problem
You want to find any arithmetically increasing
sequence
in the data. In our example, you want to find any sequences in which
the purity level is arithmetically increasing (100, 101, 102, etc.).
Any such sequences that are three or more containers long indicate
that a production line is overheating.
3.7.2 Solution
Find sequences in the ProductionLine data using the following query:
SELECT
p1.ContainerId SeqBeg, p2.ContainerId SeqEnd,
p2.ContainerId-p1.ContainerId+1 SequenceSize
FROM ProductionLine p1, ProductionLine p2
WHERE
(p1.ContainerId < p2.ContainerId) AND
NOT EXISTS(SELECT * FROM ProductionLine p3
WHERE (p3.Purity-p3.ContainerId!=p1.Purity-p1.ContainerId AND
p3.ContainerId BETWEEN p1.ContainerId AND p2.ContainerId)
OR (p3.ContainerId=p1.ContainerId-1 AND
p3.Purity-p3.ContainerId=p1.Purity-p1.ContainerId)
OR (p3.ContainerId=p2.ContainerId+1 AND
p3.Purity-p3.ContainerId=p1.Purity-p1.ContainerId))
SeqBeg SeqEnd SequenceSize
----------- ----------- ------------
2 5 4
8 9 2
3.7.3 Discussion
This query uses a framework similar to that used to find regions. The
difference is that the subquery contains a WHERE clause condition to
identify sequences. To explain exactly how that works,
let's begin by looking at the raw data:
ContainerId Purity Diff
----------- ----------- -----------
1 100 99
2 100 98
3 101 98
4 102 98
5 103 98
6 100 94
7 103 96
8 108 100
9 109 100
10 100 90
11 100 89
12 100 88
Notice that the purity levels of containers 2-5 represent an
arithmetically increasing sequence. Notice also that the container
IDs represent an arithmetically increasing sequence. We can use the
fact that both sequences are monotonic to our advantage. It means
that, within any given sequence, the difference between the
ContainerId and Purity will be a constant. For example:
100 - 2 = 98
101 - 3 = 98
...
We put knowledge of this pattern to use in the subquery, which uses
the following WHERE condition to find any candidates that break the
sequence:
p3.Purity-p3.ContainerId!=p1.Purity-p1.ContainerId
If any row in a candidate sequence (p3) has the same difference as
the first row of the sequence (p1), it is a member of the sequence.
If not, the candidate pair (p1, p2) should be discarded.
The rest of the framework is exactly the same as for finding regions,
and you can easily extend it for sequences in the same ways that you
can when finding regions. For example, to find only sequences larger
than three rows, add the following WHERE condition:
p2.ContainerId-p1.ContainerId>=2 AND
For example:
SELECT
p1.ContainerId SeqBeg, p2.ContainerId SeqEnd,
p2.ContainerId-p1.ContainerId+1 SequenceSize
FROM ProductionLine p1, ProductionLine p2
WHERE
(p1.ContainerId < p2.ContainerId) AND
p2.ContainerId-p1.ContainerId>=2 AND
NOT EXISTS(SELECT * FROM ProductionLine p3
WHERE (p3.Purity-p3.ContainerId!=p1.Purity-p1.ContainerId AND
p3.ContainerId BETWEEN p1.ContainerId AND p2.ContainerId)
OR (p3.ContainerId=p1.ContainerId-1 AND
p3.Purity-p3.ContainerId=p1.Purity-p1.ContainerId)
OR (p3.ContainerId=p2.ContainerId+1 AND
p3.Purity-p3.ContainerId=p1.Purity-p1.ContainerId))
SeqBeg SeqEnd SequenceSize
----------- ----------- ------------
2 5 4
With this framework, you can use algorithms for regions and apply
them to sequences with minimal changes to the code. Usually, you just
have to add an additional condition such as the one we added in this
recipe.
|