2.4 Comparing Two Sets for Equality

2.4.1 Problem

You want to compare two sets of rows for equality. For example, you took a snapshot of the Students table in October and another in November. Now, you want to compare those two copies.

The snapshot for October:

CourseId             StudentName    Score  TermPaper   
-------------------- -------------- ------ ----------- 
ACCN101              Andrew         11.00  3

And for November:

CourseId             StudentName   Score  TermPaper   
-------------------- ------------- ------ ----------- 
ACCN101              Andrew        11.00  3
ACCN101              Andrew        11.00  3
ACCN101              Bert          13.40  1

2.4.2 Solution

One solution is to consider the two tables as two different sets. You can then adapt the "Implementing Set Difference" recipe to this problem by using it to report rows in one table that are not also contained in the other. The following query will do this for two tables named StudentsOct and StudentsNov:

SELECT so.*, COUNT(*) DupeCount, 'StudentsOct' TableName
FROM StudentsOct so
GROUP BY so.CourseId, so.StudentName, so.Score, so.TermPaper
HAVING NOT EXISTS (
      SELECT sn.*, COUNT(*)
      FROM StudentsNov sn
      GROUP BY sn.CourseId, sn.StudentName, sn.Score, sn.TermPaper
      HAVING sn.CourseId=so.CourseId AND 
         sn.TermPaper=so.TermPaper AND 
         sn.StudentName=so.StudentName AND
         COUNT(*) = COUNT(ALL so.CourseId))
UNION
SELECT sn.*, COUNT(*) DupeCount, 'StudentsNov' TableName
FROM StudentsNov sn
GROUP BY sn.CourseId, sn.StudentName, sn.Score, sn.TermPaper
HAVING NOT EXISTS (
      SELECT so.*, COUNT(*)
      FROM StudentsOct so
      GROUP BY so.CourseId, so.StudentName, so.Score, so.TermPaper
      HAVING so.CourseId=sn.CourseId AND 
         so.TermPaper=sn.TermPaper AND 
         so.StudentName=sn.StudentName AND
         COUNT(*) = COUNT(ALL sn.CourseId))

2.4.3 Discussion

This is a somewhat complex query. However, it's really the union of two very similar SELECT statements. Once you understand the first, you'll understand the second. The first SELECT statement in the UNION returns a list of rows in StudentsOct that do not also exist in StudentsNov. The second SELECT statement does the reverse — it returns a list of rows from StudentsNov that are not also in StudentsOct. Both of these SELECT statements represent difference operations. If the two tables are equal, neither SELECT will return any rows.

Let's focus on the first part of the SELECT statement:

SELECT so.*, COUNT(*) dupeCount, 'StudentsOct' tableName
FROM StudentsOct so
GROUP BY so.CourseId, so.StudentName, so.Score, so.TermPaper

This SELECT statement retrieves all columns from the StudentsOct table, groups the data by those columns, and uses the aggregate function COUNT to return the number of rows in each group. For example:

CourseId     StudentName Score  TermPaper   DupeCount   TableName
------------ ----------- ------ ----------- ----------- -----------
ACCN101      Andrew       11.00  3           1          StudentsOct
...

It's important that the GROUP BY clause lists each column in the StudentsOct table. In this example, we added an extra TableName column so that we can easily see to which table any extra rows belong. The purpose of the grouping operation is to deal with duplicate rows. If the StudentsOct table has two identical rows, then the StudentsNov table must also have two identical rows that match. The only way to check for this is to group the data by all columns in the table, count up the number of occurrences in each group, and then compare those counts across the two tables.

The HAVING clause functions as a WHERE clause, but at the group level. For each group of one table, the HAVING clause checks to be sure that there is a corresponding group of records in the other. All columns must match, and the row count for the groups must match as well.

The two queries in the union are symmetrical. Together, they compute the table differences from two different directions. The following possible table data illustrates why this is necessary:

SELECT * FROM StudentsOct
CourseId             StudentName   Score  TermPaper   
-------------------- ------------- ------ ----------- 
ACCN101              Andrew        11.00  3

(1 row(s) affected)

SELECT * FROM StudentsNov

CourseId             StudentName    Score  TermPaper   
-------------------- -------------- ------ ----------- 
ACCN101              Andrew         11.00  3
ACCN101              Andrew         11.00  3
ACCN101              Bert           13.40  1
(3 row(s) affected)

Notice that the StudentsNov table contains all the rows from the StudentsOct table and then some. The first union query, which reports all rows found in StudentsOct but not in StudentsNov, will return a row for Andrew because Andrew has one row in the first table that occurs twice in the second table. That row will be detected because the counts are different. However, the first union query will not detect the row for Bert in the second table at all. That's because the subquery only checks the second table for rows that exist in the first table. The second union query, however, turns that around and will detect that the row for Bert in the StudentsNov table has no counterpart in the StudentsOct table. For example:

SELECT sn.*, COUNT(*) DupeCount, 'StudentsNov' TableName
FROM StudentsNov sn
GROUP BY sn.CourseId, sn.StudentName, sn.Score, sn.TermPaper
HAVING NOT EXISTS (
      SELECT so.*, COUNT(*)
      FROM StudentsOct so
      GROUP BY so.CourseId, so.StudentName, so.Score, so.TermPaper
      HAVING so.CourseId=sn.CourseId AND 
         so.TermPaper=sn.TermPaper AND 
         so.StudentName=sn.StudentName AND
         COUNT(*) = COUNT(ALL sn.CourseId))

CourseId  StudentName Score  TermPaper   DupeCount   TableName   
--------- ----------- ------ ----------- ----------- ----------- 
ACCN101   Andrew      11.00  3           2           StudentsNov
ACCN101   Bert        13.40  1           1           StudentsNov

If neither query generates any results, then neither table has any rows that are not also held by the other, so the two tables must be equal.

Table Differences When Primary Keys Are Involved

If both tables have a primary key defined, each group will be guaranteed to have only one row. In such a case, you can eliminate the GROUP BY references to COUNT(*) and convert the HAVING clause to a WHERE clause. Doing that for the query shown in this recipe yields the following result:

SELECT so.*, 'StudentsOct' TableName FROM StudentsOct so WHERE NOT EXISTS ( SELECT sn.* FROM StudentsNov sn WHERE sn.CourseId=so.CourseId AND sn.TermPaper=so.TermPaper AND sn.StudentName=so.StudentName) UNION SELECT sn.*, 'StudentsNov' TableName FROM StudentsNov sn WHERE NOT EXISTS ( SELECT so.* FROM StudentsOct so WHERE so.CourseId=sn.CourseId AND so.TermPaper=sn.TermPaper AND so.StudentName=sn.StudentName)

If one or both tables do not have a primary key, the only reliable way to compare the tables is to summarize the data, count the number of duplicate occurrences, and ensure that the duplicate counts match for both tables.