7.7 Folding Tables

7.7.1 Problem

You imported a wide, non-normalized table with columns organized in an array-like manner. You need to convert the data in that table to a more normalized form. For each non-normalized row in the original table, you need to generate several rows of normalized output. Taking the ranking spreadsheet from the student's office, you would prefer to have a list-like report, where the rank for each student in a year would represent one row, so the columns named Y2000, Y2001, and Y2002 are the ones that need to be flattened. Given the data shown earlier in this chapter for the ranking spreadsheet, you want to see output like this:

Name       Year Rank        
---------- ---- ----------- 
Anna       2000 1
Anna       2001 2
Anna       2002 3
Billy      2000 4
Billy      2001 5
Billy      2002 6
Joe        2000 7
Joe        2001 8
Joe        2002 9
...

7.7.2 Solution

The act of taking a row in table and converting it into two or more narrower rows in an output table is referred to as folding. There are at least three approaches that you can take to fold data. An easy-to-understand approach is to use a union query. In some cases, you may be able to fold the data by joining the table to itself. A more general approach is to fold the data using a Pivot table.

7.7.2.1 Using a union query to fold data

A very easy-to-understand approach to folding data is to write one SELECT statement for each column that you want to fold and then union all those statements together. With respect to the bookstore example being used for this recipe, you could use the following query to fold the three columns:

SELECT Name, '2000' Year, Y2000 Rank FROM StudentRankings
UNION
SELECT Name, '2001', Y2001 FROM StudentRankings
UNION
SELECT Name, '2002', Y2002 FROM StudentRankings

The first SELECT statement returns the data for the year 2000, the second returns data for the year 2001, and the third returns data for the year 2002. The column alias Rank is used in the first SELECT statement to provide a name for the column containing the ranking data.

This appears to be a simple and straightforward solution, so why worry about any other approach? There are two reasons, actually, why you might not want to use a UNION query like the one shown here. One reason is that older versions of SQL servers didn't allow a UNION query to be encapsulated into a view. The other reason is performance. In the example shown here, three SELECT statements are being executed, requiring the server to make three passes through the table. For a small table, that might not present a problem, but as the amount of data in the table increases, so will the performance hit that you take by doing this.

7.7.2.2 Using a self-join to fold data

You can use a self-join to fold the data if the following statements are true:

You have at least as many rows in the table as you have columns to fold
You have a numerical column, such as an Id column
The values in the numerical column begin with 1 and are guaranteed to be sequential

We are folding three columns (Y2000, Y2001, and Y2002), and our StudentRankings table has three rows, so the first point is covered. The row Id number is sequential, starts with 1, and has no gaps, so the second two points are also covered.

The following query, then, can be used to fold the three volume columns into one:

SELECT t1.Name Name,
   (CASE WHEN t2.Id=1 THEN '2000' 
      WHEN t2.Id=2 THEN '2001' 
      WHEN t2.Id=3 THEN '2002' END) Year,
   (CASE WHEN t2.Id=1 THEN t1.Y2000 
      WHEN t2.Id=2 THEN t1.Y2001 
      WHEN t2.Id=3 THEN t1.Y2002 END) Volume
FROM StudentRankings t1, StudentRankings t2
WHERE t2.Id BETWEEN 1 AND 3 
ORDER BY Name, Year

If you've never seen a query like this before, you may need to stare at it for a while before it begins to make sense. All the data comes back from the occurrence of the StudentRankings table that has the alias name t1. The WHERE clause restricts the t2 table so that only three rows are returned. These three rows have Id numbers 1, 2, and 3 and are used to control the results returned by the CASE statements. The discussion section for this recipe explains how this works in more detail.

7.7.2.3 Using the Pivot table to fold data

If you can't do a self-join with your data, you can use a more general approach to folding data that involves a Pivot table.

You need to have at least as many rows in the Pivot table as columns that you are folding. Just be sure that your Pivot table is large enough. Once the Pivot table has been created and populated, you can use it to fold the data by joining it to your data table. The following query, for example, folds the data in the StudentRankings table:

SELECT Name Name,
   (CASE WHEN i=1 THEN '2000' 
      WHEN i=2 THEN '2001' 
      WHEN i=3 THEN '2002' END) Year,
   (CASE WHEN i=1 THEN Y2000 
      WHEN i=2 THEN Y2001 
      WHEN i=3 THEN Y2002 END) Volume
FROM StudentRankings, Pivot
WHERE i BETWEEN 1 AND 3 
ORDER BY Name, Year

This query is essentially the same as the one shown earlier in the section on self-joins. The only difference is that the t2 table, in this case, is the Pivot table and not another instance of the StudentRankings table.

7.7.3 Discussion

The table-folding techniques shown here can be quite useful when working with data from legacy systems. These systems often have tables or other data structures containing repeating data elements, such as the three rank columns in our example. By folding the tables containing data from such systems, you can restructure the data into a more normalized form. The normalized data is then more easily used in relational operations, such as a table-join or a group-by summarization.

Typically, table folding is done in batch procedures when data from an external system is imported into a SQL database. Folding is fairly efficient, since it can be done in one query. To implement the same process in a classical programming language, at least two nested loops would be needed. The first loop would need to step through the entire table, while the second would need to iterate through the columns in each row and generate a flattened result.

The self-join shown in this recipe is very similar to the join used with the Pivot table. With the self-join solution, the second occurrence of the table was limited to only the first three rows. It effectively functioned as the Pivot table. Why, then, do we show both solutions? The self-join is shown here partly for artistic reasons. Some programmers simply do not like to have more tables than absolutely necessary. It's also easier to write one query than it is to create a Pivot table, populate it with data, and then write a query. From a performance point of view, however, the Pivot table query is better. If you are writing a batch procedure that is going to be used often, the Pivot table solution is the one to go with.

The logic behind the two join queries is quite simple. They both use two tables. The information is taken from the first table. The second table helps in folding the first. In general, if you join a table with N rows and a table with M rows, the result has N * M rows. Each row from the first table is matched with each row from the second one. This behavior is exactly what we need to fold data. The second table is used as a guiding table that extracts M (in our case, three) columns from the first one.

The following data represents the intermediate result set produced by the join operation in the queries:

Name        Y2000     Y2001     Y2002     i
----------- --------- --------- --------- -----------
1           1         2         3         1
1           1         2         3         2
1           1         2         3         3
2           4         5         6         1
2           4         5         6         2
2           4         5         6         3
3           7         8         9         1
3           7         8         9         2
3           7         8         9         3

As you can see, each data row is repeated three times. This gives us one row for each of the three columns that we need to fold. The i column (which was t2.id in the self-join) distinguishes between the three, otherwise duplicate, rows. The i column can then be used within a CASE statement to return the rank column of a student for each year.

There are actually two CASE statements in the query, and they are almost the same. The first CASE statement uses the i value as a basis for returning a year. The second uses the i value as a basis for returning the student's rank value for that year.