2.4 Comparing Two Sets for Equality
2.4.1 Problem
You want to compare two sets of
rows for equality. For example, you took
a snapshot of the Students table in October and another in November.
Now, you want to compare those two copies.
The snapshot for October:
CourseId StudentName Score TermPaper
-------------------- -------------- ------ -----------
ACCN101 Andrew 11.00 3
And for November:
CourseId StudentName Score TermPaper
-------------------- ------------- ------ -----------
ACCN101 Andrew 11.00 3
ACCN101 Andrew 11.00 3
ACCN101 Bert 13.40 1
2.4.2 Solution
One solution is to consider the two tables as two different sets. You
can then adapt the "Implementing Set
Difference" recipe to this problem by using it to
report rows in one table that are not also contained in the other.
The following query will do this for two tables named StudentsOct and
StudentsNov:
SELECT so.*, COUNT(*) DupeCount, 'StudentsOct' TableName
FROM StudentsOct so
GROUP BY so.CourseId, so.StudentName, so.Score, so.TermPaper
HAVING NOT EXISTS (
SELECT sn.*, COUNT(*)
FROM StudentsNov sn
GROUP BY sn.CourseId, sn.StudentName, sn.Score, sn.TermPaper
HAVING sn.CourseId=so.CourseId AND
sn.TermPaper=so.TermPaper AND
sn.StudentName=so.StudentName AND
COUNT(*) = COUNT(ALL so.CourseId))
UNION
SELECT sn.*, COUNT(*) DupeCount, 'StudentsNov' TableName
FROM StudentsNov sn
GROUP BY sn.CourseId, sn.StudentName, sn.Score, sn.TermPaper
HAVING NOT EXISTS (
SELECT so.*, COUNT(*)
FROM StudentsOct so
GROUP BY so.CourseId, so.StudentName, so.Score, so.TermPaper
HAVING so.CourseId=sn.CourseId AND
so.TermPaper=sn.TermPaper AND
so.StudentName=sn.StudentName AND
COUNT(*) = COUNT(ALL sn.CourseId))
2.4.3 Discussion
This is a somewhat complex query. However, it's
really the union of two very similar SELECT statements. Once you
understand the first, you'll understand the second.
The first SELECT statement in the UNION returns a list of rows in
StudentsOct that do not also exist in StudentsNov. The second SELECT
statement does the reverse — it returns a list of rows from
StudentsNov that are not also in StudentsOct. Both of these SELECT
statements represent difference operations. If the two tables are
equal, neither SELECT will return any rows.
Let's focus on the first part of the SELECT
statement:
SELECT so.*, COUNT(*) dupeCount, 'StudentsOct' tableName
FROM StudentsOct so
GROUP BY so.CourseId, so.StudentName, so.Score, so.TermPaper
This SELECT statement retrieves all columns from the StudentsOct
table, groups the data by those columns, and uses the aggregate
function COUNT to return the number of rows in each group. For
example:
CourseId StudentName Score TermPaper DupeCount TableName
------------ ----------- ------ ----------- ----------- -----------
ACCN101 Andrew 11.00 3 1 StudentsOct
...
It's important that the GROUP BY clause lists each
column in the StudentsOct table. In this example, we added an extra
TableName column so that we can easily see to which table any extra
rows belong. The purpose of the grouping operation is to deal with
duplicate rows. If the StudentsOct table has two identical rows, then
the StudentsNov table must also have two identical rows that match.
The only way to check for this is to group the data by all columns in
the table, count up the number of occurrences in each group, and then
compare those counts across the two tables.
The HAVING clause functions as a WHERE clause, but at the group
level. For each group of one table, the HAVING clause checks to be
sure that there is a corresponding group of records in the other. All
columns must match, and the row count for the groups must match as
well.
The two queries in the union are symmetrical. Together, they compute
the table differences from two different directions. The following
possible table data illustrates why this is necessary:
SELECT * FROM StudentsOct
CourseId StudentName Score TermPaper
-------------------- ------------- ------ -----------
ACCN101 Andrew 11.00 3
(1 row(s) affected)
SELECT * FROM StudentsNov
CourseId StudentName Score TermPaper
-------------------- -------------- ------ -----------
ACCN101 Andrew 11.00 3
ACCN101 Andrew 11.00 3
ACCN101 Bert 13.40 1
(3 row(s) affected)
Notice that the StudentsNov table contains all the rows from the
StudentsOct table and then some. The first union query, which reports
all rows found in StudentsOct but not in StudentsNov, will return a
row for Andrew because Andrew has one row in the first table that
occurs twice in the second table. That row will be detected because
the counts are different. However, the first union query will not
detect the row for Bert in the second table at all.
That's because the subquery only checks the second
table for rows that exist in the first table. The second union query,
however, turns that around and will detect that the row for Bert in
the StudentsNov table has no counterpart in the StudentsOct table.
For example:
SELECT sn.*, COUNT(*) DupeCount, 'StudentsNov' TableName
FROM StudentsNov sn
GROUP BY sn.CourseId, sn.StudentName, sn.Score, sn.TermPaper
HAVING NOT EXISTS (
SELECT so.*, COUNT(*)
FROM StudentsOct so
GROUP BY so.CourseId, so.StudentName, so.Score, so.TermPaper
HAVING so.CourseId=sn.CourseId AND
so.TermPaper=sn.TermPaper AND
so.StudentName=sn.StudentName AND
COUNT(*) = COUNT(ALL sn.CourseId))
CourseId StudentName Score TermPaper DupeCount TableName
--------- ----------- ------ ----------- ----------- -----------
ACCN101 Andrew 11.00 3 2 StudentsNov
ACCN101 Bert 13.40 1 1 StudentsNov
If neither query generates any results, then neither table has any
rows that are not also held by the other, so the two tables must be
equal.
If both tables have a primary key defined, each group will be
guaranteed to have only one row. In such a case, you can eliminate
the GROUP BY references to COUNT(*) and convert the HAVING clause to
a WHERE clause. Doing that for the query shown in this recipe yields
the following result:
SELECT so.*, 'StudentsOct' TableName
FROM StudentsOct so
WHERE NOT EXISTS (
SELECT sn.*
FROM StudentsNov sn
WHERE sn.CourseId=so.CourseId AND
sn.TermPaper=so.TermPaper AND
sn.StudentName=so.StudentName)
UNION
SELECT sn.*, 'StudentsNov' TableName
FROM StudentsNov sn
WHERE NOT EXISTS (
SELECT so.*
FROM StudentsOct so
WHERE so.CourseId=sn.CourseId AND
so.TermPaper=sn.TermPaper AND
so.StudentName=sn.StudentName)
If one or both tables do not have a primary key,
the only reliable way to compare the tables is to summarize the data,
count the number of duplicate occurrences, and ensure that the
duplicate counts match for both tables.
|
|