7.7 Folding Tables
7.7.1 Problem
You imported a wide, non-normalized table
with columns organized in an
array-like manner. You need to convert the data in that table to a
more normalized form. For each non-normalized row in the original
table, you need to generate several rows of normalized output. Taking
the ranking spreadsheet from the student's office,
you would prefer to have a list-like report, where the rank for each
student in a year would represent one row, so the columns named
Y2000, Y2001, and Y2002 are the ones that need to be flattened. Given
the data shown earlier in this chapter for the ranking spreadsheet,
you want to see output like this:
Name Year Rank
---------- ---- -----------
Anna 2000 1
Anna 2001 2
Anna 2002 3
Billy 2000 4
Billy 2001 5
Billy 2002 6
Joe 2000 7
Joe 2001 8
Joe 2002 9
...
7.7.2 Solution
The act of taking a row in table and converting it into two or more
narrower rows in an output table is referred to as
folding. There are at least three approaches
that you can take to fold data. An easy-to-understand approach is to
use a union query. In some cases, you may be able to fold the data by
joining the table to itself. A more general approach is to fold the
data using a Pivot table.
7.7.2.1 Using a union query to fold data
A very easy-to-understand approach to folding data
is to write one SELECT statement for each
column that you want to fold and then union all those statements
together. With respect to the bookstore example being used for this
recipe, you could use the following query to fold the three columns:
SELECT Name, '2000' Year, Y2000 Rank FROM StudentRankings
UNION
SELECT Name, '2001', Y2001 FROM StudentRankings
UNION
SELECT Name, '2002', Y2002 FROM StudentRankings
The first SELECT statement returns the data for the year 2000, the
second returns data for the year 2001, and the third returns data for
the year 2002. The column alias Rank is used in the first SELECT
statement to provide a name for the column containing the ranking
data.
This appears to be a simple and straightforward solution, so why
worry about any other approach? There are two reasons, actually, why
you might not want to use a UNION query like the one shown here. One
reason is that older versions of SQL servers didn't
allow a UNION query to be encapsulated into a view. The other reason
is performance. In the example shown here, three SELECT statements
are being executed, requiring the server to make three passes through
the table. For a small table, that might not present a problem, but
as the amount of data in the table increases, so will the performance
hit that you take by doing this.
7.7.2.2 Using a self-join to fold data
You can use a self-join to fold the data
if
the following statements are true:
You have at least as many rows in the table as you have columns to
fold
You have a numerical column, such as an Id column
The values in the numerical column begin with 1 and are guaranteed to
be sequential
We are folding three columns (Y2000, Y2001, and Y2002), and our
StudentRankings table has three rows, so the first point is covered.
The row Id number is sequential, starts with 1, and has no gaps, so
the second two points are also covered.
The following query, then, can be used to fold the three volume
columns into one:
SELECT t1.Name Name,
(CASE WHEN t2.Id=1 THEN '2000'
WHEN t2.Id=2 THEN '2001'
WHEN t2.Id=3 THEN '2002' END) Year,
(CASE WHEN t2.Id=1 THEN t1.Y2000
WHEN t2.Id=2 THEN t1.Y2001
WHEN t2.Id=3 THEN t1.Y2002 END) Volume
FROM StudentRankings t1, StudentRankings t2
WHERE t2.Id BETWEEN 1 AND 3
ORDER BY Name, Year
If you've never seen a query like this before, you
may need to stare at it for a while before it begins to make sense.
All the data comes back from the occurrence of the StudentRankings
table that has the alias name t1. The WHERE clause restricts the t2
table so that only three rows are returned. These three rows have Id
numbers 1, 2, and 3 and are used to control the results returned by
the CASE statements. The discussion section for this recipe explains
how this works in more detail.
7.7.2.3 Using the Pivot table to fold data
If you can't do a self-join with your data, you
can use a more general approach to
folding data that involves a Pivot table.
You need to have at least as many rows in the Pivot table as columns
that you are folding. Just be sure that your Pivot table is large
enough. Once the Pivot table has been created and populated, you can
use it to fold the data by joining it to your data table. The
following query, for example, folds the data in the StudentRankings
table:
SELECT Name Name,
(CASE WHEN i=1 THEN '2000'
WHEN i=2 THEN '2001'
WHEN i=3 THEN '2002' END) Year,
(CASE WHEN i=1 THEN Y2000
WHEN i=2 THEN Y2001
WHEN i=3 THEN Y2002 END) Volume
FROM StudentRankings, Pivot
WHERE i BETWEEN 1 AND 3
ORDER BY Name, Year
This query is essentially the same as the one shown earlier in the
section on self-joins. The only difference is that the t2 table, in
this case, is the Pivot table and not another instance of the
StudentRankings table.
7.7.3 Discussion
The table-folding techniques shown here can be quite useful when
working with data from legacy systems. These systems often have
tables or other data structures containing repeating data elements,
such as the three rank columns in our example. By folding the tables
containing data from such systems, you can restructure the data into
a more normalized form. The normalized data is then more easily used
in relational operations, such as a table-join or a group-by
summarization.
Typically, table folding is done in batch procedures
when data from an
external system is imported into a SQL database. Folding is fairly
efficient, since it can be done in one query. To implement the same
process in a classical programming language, at least two nested
loops would be needed. The first loop would need to step through the
entire table, while the second would need to iterate through the
columns in each row and generate a flattened result.
The self-join shown in this recipe is very similar to the join used
with the Pivot table. With the self-join solution, the second
occurrence of the table was limited to only the first three rows. It
effectively functioned as the Pivot table. Why, then, do we show both
solutions? The self-join is shown here partly for artistic reasons.
Some programmers simply do not like to have more tables than
absolutely necessary. It's also easier to write one
query than it is to create a Pivot table, populate it with data, and
then write a query. From a performance point of view, however, the
Pivot table query is better. If you are writing a batch procedure
that is going to be used often, the Pivot table solution is the one
to go with.
The logic behind the two join queries is quite simple. They both use
two tables. The information is taken from the first table. The second
table helps in folding the first. In general, if you join a table
with N rows and a table with M rows, the result has N * M rows. Each
row from the first table is matched with each row from the second
one. This behavior is exactly what we need to fold data. The second
table is used as a guiding table that extracts M (in our case, three)
columns from the first one.
The following data represents the intermediate result set produced by
the join operation in the queries:
Name Y2000 Y2001 Y2002 i
----------- --------- --------- --------- -----------
1 1 2 3 1
1 1 2 3 2
1 1 2 3 3
2 4 5 6 1
2 4 5 6 2
2 4 5 6 3
3 7 8 9 1
3 7 8 9 2
3 7 8 9 3
As you can see, each data row is repeated three times. This gives us
one row for each of the three columns that we need to fold. The i
column (which was t2.id in the self-join) distinguishes between the
three, otherwise duplicate, rows. The i column can then be used
within a CASE statement to return the rank column of a student for
each year.
There are actually two CASE statements in the query, and they are
almost the same. The first CASE statement uses the i value as a basis
for returning a year. The second uses the i value as a basis for
returning the student's rank
value for that year.
|