Team LiB   Previous Section   Next Section

7.12 Extracting Master Records from Tables

7.12.1 Problem

You have a wide, non-normalized table with string columns that are repeated several times in a row. You want to restructure the table into a narrower form, replacing the strings with key values that point to rows in a master table that contain the longer string values. You not only need to restructure your table, you need to create a new master table with the necessary string values.

Let's say that the dean's office is maintaining a list of Ph.D. students; however, the list is currently in a spreadsheet format like this:

CREATE TABLE ThesisOld (
   StudentId INTEGER,
   Member1 CHAR(255),
   Member2 CHAR(255),
   Member3 CHAR(255),
   Grade CHAR(2)
)

The spreadsheet is filled with repeated names of professors:

StudentId  Member1         Member2          Member3
---------- --------------- ---------------- ----------------
1          Baird, Frances  Anderson, Terry  Newton, Isaac
2          Newton, Isaac   Anderson, Terry  Smith, John
3          Newton, Isaac   Einstein, Albert Base, Samuel
4          Anderson, Terry Smith, John      Einstein, Albert

Your job is to normalize the ThesisOld data and automatically copy it into tables ThesisData and ProfessorMaster.

7.12.2 Solution

The process of creating and populating a new master table, and of restructuring the fact table, consists of these four steps:

  1. Create a new master table.

  2. Extract master record values from the current fact table, and use them to populate the new master table.

  3. Create a new fact table.

  4. Populate the foreign keys in the new fact table with pointers to the appropriate master records.

The following sections describe each step in detail.

7.12.2.1 Step 1: Creating the new master table

Create the new master table, which will be named ProfessorMaster, as follows:

CREATE TABLE ProfessorMaster(
   ProfessorId INTEGER IDENTITY,
   ProfessorName CHAR(255) UNIQUE
)

The two key items to note about this new master table is that it contains one column for the professor name and one integer primary key column. The value in the Id columns will eventually be used in the new fact table as a foreign key into this table. Since we want the data to be normalized, there will be one master record for each professor.

7.12.2.2 Step 2: Extracting master record values

A SELECT query can retrieve the names from the ThesisOld table. Each row in the ThesisOld table contains three professors' names so a table-folding technique must be used to get all three names into one column. The following UNION query represents one approach:

SELECT Member1 FROM ThesisOld
UNION
SELECT Member2 FROM ThesisOld
UNION
SELECT Member3 FROM ThesisOld

Since professors may have sit on more than one thesis defense each, something needs to be done to ensure that only one instance of each professor name is returned. Because the query shown here is a UNION query, duplicate elimination is taken care of for us. By definition, a UNION query eliminates duplicate rows from the result set.

For this recipe, we will use a UNION query to fold together the three professors' names from each row in the ThesisOld table. To get the results into the new ProfessorMaster table, we can use the following INSERT . . . SELECT . . . FROM statement:

INSERT INTO ProfessorMaster(ProfessorName)
SELECT Member1 FROM ThesisOld
UNION
SELECT Member2 FROM ThesisOld
UNION
SELECT Member3 FROM ThesisOld

After this statement is executed, the ProfessorMaster table will contain one row for each distinct professor name.

7.12.2.3 Step 3: Creating the new fact table

The ThesisOld table contains three professor name columns. The new fact table must contain three references to ProfessorMaster records. The primary key of the ProfessorMaster table is an integer, so the three pointers must be integers. The new fact table, which will be named ThesisData, can be created as follows:

CREATE TABLE ThesisData( 
   StudentId INTEGER,
   Member1 INTEGER,
   Member2 INTEGER,
   Member3 INTEGER,
   Grade CHAR(2)
)
7.12.2.4 Step 4: Populating the new fact table

The final step in the process is to populate the new fact table. Data must not only be taken from the old fact table, it also must be joined with the appropriate rows from the new master table. The following statement pulls all the data together and inserts it into the new fact table:



INSERT INTO ThesisData(StudentId, Member1, Member2, Member3, Grade )
   SELECT StudentId,

   (SELECT m.ProfessorId
      FROM ProfessorMaster m
      WHERE t.Member1 = m.ProfessorId),
   (SELECT m.ProfessorId
      FROM ProfessorMaster m
      WHERE t.Member2 = m.ProfessorName),
   (SELECT m.ProfessorId
      FROM ProfessorMaster m
      WHERE t.Member3 = m.ProfessorName),
   Grade
FROM ThesisOld AS t

This query may look complicated, but it's not as bad as it seems. The first column in the select list is the StudentId value from the ThesisOld table. This goes into the new table unchanged. The second, third, and fourth columns in the select list retrieve the ProfessorId from the ProfessorMaster record for the professor named in the ThesisOld table. This ProfessorId is stored as a foreign key in the new ThesisData table. The rest of the columns are simply copied.

7.12.3 Discussion

The procedure described in this recipe can be used when normalizing data imported from external sources. The human-readable spelling of names consumes a great deal of space, which makes changing those names difficult. The new structure centralizes the used strings into one master table.The integrity of the procedure described in this recipe depends on the assumption that no data in the ThesisOld table is being changed while the procedure is underway. If data in the ThesisOld table is being changed, then there is the possibility that some names may not get captured and stored in the new master table. This, in turn, will prevent the proper foreign-key references from being recorded in the new ThesisData table. The ultimate result is that data will be lost.

You can use the following query to verify that all professors names in the Member1 column of the ThesisOld table have been properly recorded in the new ProfessorMaster table:

SELECT DISTINCT StudentId, ProfessorName
FROM ThesisOld AS t
WHERE NOT EXISTS (
   SELECT * FROM ProfessorMaster AS m 
   WHERE m.ProfessorName=t.Member1)

Similar queries can be used to check other columns. If any results are returned, the Id values in the result set will identify those records that have been inserted during the time that the steps in this recipe were being performed. You can either drop all the data and repeat the procedure from scratch or manually fix the problems that each query identifies.

Once you are certain that all the data in the new master table and ThesisData table is correct and consistent, you can create a foreign-key constraint from the ThesisData table to the master tables. While not absolutely necessary, such a constraint will protect data integrity as future changes are made to the data in those tables.

    Team LiB   Previous Section   Next Section