7.12 Extracting Master Records from Tables
7.12.1 Problem
You have a wide, non-normalized table
with
string columns that are repeated several times in a row. You want to
restructure the table into a narrower form, replacing the strings
with key values that point to rows in a master table that contain the
longer string values. You not only need to restructure your table,
you need to create a new master table with the necessary string
values.
Let's say that the dean's office is
maintaining a list of Ph.D. students; however, the list is currently
in a spreadsheet format like this:
CREATE TABLE ThesisOld (
StudentId INTEGER,
Member1 CHAR(255),
Member2 CHAR(255),
Member3 CHAR(255),
Grade CHAR(2)
)
The spreadsheet is filled with repeated names of professors:
StudentId Member1 Member2 Member3
---------- --------------- ---------------- ----------------
1 Baird, Frances Anderson, Terry Newton, Isaac
2 Newton, Isaac Anderson, Terry Smith, John
3 Newton, Isaac Einstein, Albert Base, Samuel
4 Anderson, Terry Smith, John Einstein, Albert
Your job is to normalize the ThesisOld data and automatically copy it
into tables ThesisData and ProfessorMaster.
7.12.2 Solution
The process of creating and populating a new master table, and of
restructuring the fact table, consists of these four steps:
Create a new master table.
Extract master record values from the current fact table, and use
them to populate the new master table.
Create a new fact table.
Populate the foreign keys in the new fact table with pointers to the
appropriate master records.
The following sections describe each step in detail.
7.12.2.1 Step 1: Creating the new master table
Create the new master table, which will be named ProfessorMaster, as
follows:
CREATE TABLE ProfessorMaster(
ProfessorId INTEGER IDENTITY,
ProfessorName CHAR(255) UNIQUE
)
The two key items to note about this new master table is that it
contains one column for the professor name and one integer primary
key column. The value in the Id columns will eventually be used in
the new fact table as a foreign key into this table. Since we want
the data to be normalized, there will be one master record for each
professor.
7.12.2.2 Step 2: Extracting master record values
A SELECT query can retrieve the names from the ThesisOld table. Each
row in the ThesisOld table contains three
professors' names so a table-folding technique must
be used to get all three names into one column. The following UNION
query represents one approach:
SELECT Member1 FROM ThesisOld
UNION
SELECT Member2 FROM ThesisOld
UNION
SELECT Member3 FROM ThesisOld
Since professors may have sit on more than one thesis defense each,
something needs to be done to ensure that only one instance of each
professor name is returned. Because the query shown here is a UNION
query, duplicate elimination is taken care of for us. By definition,
a UNION query eliminates duplicate rows from the result set.
For this recipe, we will use a UNION query to fold together the three
professors' names from each row in the ThesisOld
table. To get the results into the new ProfessorMaster table, we can
use the following INSERT . . . SELECT . . . FROM statement:
INSERT INTO ProfessorMaster(ProfessorName)
SELECT Member1 FROM ThesisOld
UNION
SELECT Member2 FROM ThesisOld
UNION
SELECT Member3 FROM ThesisOld
After this statement is executed, the ProfessorMaster table will
contain one row for each distinct professor name.
7.12.2.3 Step 3: Creating the new fact table
The ThesisOld table contains three professor name columns. The new
fact table must contain three references to ProfessorMaster records.
The primary key of the ProfessorMaster table is an integer, so the
three pointers must be integers. The new fact table, which will be
named ThesisData, can be created as follows:
CREATE TABLE ThesisData(
StudentId INTEGER,
Member1 INTEGER,
Member2 INTEGER,
Member3 INTEGER,
Grade CHAR(2)
)
7.12.2.4 Step 4: Populating the new fact table
The final step in the process is to populate the new fact table. Data
must not only be taken from the old fact table, it also must be
joined with the appropriate rows from the new master table. The
following statement pulls all the data together and inserts it into
the new fact table:
INSERT INTO ThesisData(StudentId, Member1, Member2, Member3, Grade )
SELECT StudentId,
(SELECT m.ProfessorId
FROM ProfessorMaster m
WHERE t.Member1 = m.ProfessorId),
(SELECT m.ProfessorId
FROM ProfessorMaster m
WHERE t.Member2 = m.ProfessorName),
(SELECT m.ProfessorId
FROM ProfessorMaster m
WHERE t.Member3 = m.ProfessorName),
Grade
FROM ThesisOld AS t
This query may look complicated, but it's not as bad
as it seems. The first column in the select list is the StudentId
value from the ThesisOld table. This goes into the new table
unchanged. The second, third, and fourth columns in the select list
retrieve the ProfessorId from the ProfessorMaster record for the
professor named in the ThesisOld table. This ProfessorId is stored as
a foreign key in the new ThesisData table. The rest of the columns
are simply copied.
7.12.3 Discussion
The procedure described in this recipe can be used when normalizing
data imported from external sources. The human-readable spelling of
names consumes a great deal of space, which makes changing those
names difficult. The new structure centralizes the used strings into
one master table.The integrity of the procedure described in this
recipe depends on the assumption that no data in the ThesisOld table
is being changed while the procedure is underway. If data in the
ThesisOld table is being changed, then there is the possibility that
some names may not get captured and stored in the new master table.
This, in turn, will prevent the proper foreign-key references from
being recorded in the new ThesisData table. The ultimate result is
that data will be lost.
You can use the following query to verify that all professors names
in the Member1 column of the ThesisOld table have been properly
recorded in the new ProfessorMaster table:
SELECT DISTINCT StudentId, ProfessorName
FROM ThesisOld AS t
WHERE NOT EXISTS (
SELECT * FROM ProfessorMaster AS m
WHERE m.ProfessorName=t.Member1)
Similar queries can be used to check other columns. If any results
are returned, the Id values in the result set will identify those
records that have been inserted during the time that the steps in
this recipe were being performed. You can either drop all the data
and repeat the procedure from scratch or manually fix the problems
that each query identifies.
Once you are certain that all the data in the new master table and
ThesisData table is correct and consistent, you can create a
foreign-key constraint from the ThesisData table to the master
tables. While not absolutely necessary, such a constraint will
protect data integrity as future changes are made to the data in
those tables.
|