Team LiB   Previous Section   Next Section

3.7 Working with Sequences

3.7.1 Problem

You want to find any arithmetically increasing sequence in the data. In our example, you want to find any sequences in which the purity level is arithmetically increasing (100, 101, 102, etc.). Any such sequences that are three or more containers long indicate that a production line is overheating.

3.7.2 Solution

Find sequences in the ProductionLine data using the following query:

SELECT 
   p1.ContainerId SeqBeg, p2.ContainerId SeqEnd,
   p2.ContainerId-p1.ContainerId+1 SequenceSize
FROM ProductionLine p1, ProductionLine p2
WHERE 
   (p1.ContainerId < p2.ContainerId) AND
   NOT EXISTS(SELECT * FROM ProductionLine p3 
      WHERE (p3.Purity-p3.ContainerId!=p1.Purity-p1.ContainerId AND 
      p3.ContainerId BETWEEN p1.ContainerId AND p2.ContainerId) 
      OR (p3.ContainerId=p1.ContainerId-1 AND 
         p3.Purity-p3.ContainerId=p1.Purity-p1.ContainerId) 
      OR (p3.ContainerId=p2.ContainerId+1 AND 
         p3.Purity-p3.ContainerId=p1.Purity-p1.ContainerId))

SeqBeg      SeqEnd      SequenceSize 
----------- ----------- ------------ 
2           5           4
8           9           2

3.7.3 Discussion

This query uses a framework similar to that used to find regions. The difference is that the subquery contains a WHERE clause condition to identify sequences. To explain exactly how that works, let's begin by looking at the raw data:

ContainerId Purity      Diff        
----------- ----------- ----------- 
1           100         99
2           100         98
3           101         98
4           102         98
5           103         98
6           100         94
7           103         96
8           108         100
9           109         100
10          100         90
11          100         89
12          100         88

Notice that the purity levels of containers 2-5 represent an arithmetically increasing sequence. Notice also that the container IDs represent an arithmetically increasing sequence. We can use the fact that both sequences are monotonic to our advantage. It means that, within any given sequence, the difference between the ContainerId and Purity will be a constant. For example:

100 - 2 = 98
101 - 3 = 98
...

We put knowledge of this pattern to use in the subquery, which uses the following WHERE condition to find any candidates that break the sequence:

p3.Purity-p3.ContainerId!=p1.Purity-p1.ContainerId

If any row in a candidate sequence (p3) has the same difference as the first row of the sequence (p1), it is a member of the sequence. If not, the candidate pair (p1, p2) should be discarded.

The rest of the framework is exactly the same as for finding regions, and you can easily extend it for sequences in the same ways that you can when finding regions. For example, to find only sequences larger than three rows, add the following WHERE condition:

p2.ContainerId-p1.ContainerId>=2 AND

For example:

SELECT 
   p1.ContainerId SeqBeg, p2.ContainerId SeqEnd,
   p2.ContainerId-p1.ContainerId+1 SequenceSize
FROM ProductionLine p1, ProductionLine p2
WHERE 
   (p1.ContainerId < p2.ContainerId) AND
   p2.ContainerId-p1.ContainerId>=2 AND
   NOT EXISTS(SELECT * FROM ProductionLine p3 
      WHERE (p3.Purity-p3.ContainerId!=p1.Purity-p1.ContainerId AND 
      p3.ContainerId BETWEEN p1.ContainerId AND p2.ContainerId) 
      OR (p3.ContainerId=p1.ContainerId-1 AND 
         p3.Purity-p3.ContainerId=p1.Purity-p1.ContainerId) 
      OR (p3.ContainerId=p2.ContainerId+1 AND 
         p3.Purity-p3.ContainerId=p1.Purity-p1.ContainerId))

SeqBeg      SeqEnd      SequenceSize 
----------- ----------- ------------ 
2           5           4

With this framework, you can use algorithms for regions and apply them to sequences with minimal changes to the code. Usually, you just have to add an additional condition such as the one we added in this recipe.

    Team LiB   Previous Section   Next Section