
7.2 Using Set Operations to Compare Two Tables

Developers, and even DBAs, occasionally need to compare the contents of two tables to determine whether the tables contain the same data. The need to do this is especially common in test environments, as developers may want to compare a set of data generated by a program under test with a set of "known good" data. Comparison of tables is also useful for automated testing purposes, when we have to compare actual results with a given set of expected results. SQL's set operations provide an interesting solution to this problem of comparing two tables.

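The following query uses both MINUS and UNION ALL to compare two tables for equality. Because MINUS removes duplicate rows from its result, the query is reliable only when each table has either a primary key or at least one unique index, so that no row can occur more than once.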

(SELECT * FROM CUSTOMER_KNOWN_GOOD
MINUS
SELECT * FROM CUSTOMER_TEST)
UNION ALL
(SELECT * FROM CUSTOMER_TEST
MINUS
SELECT * FROM CUSTOMER_KNOWN_GOOD);

Let's talk a bit about how this query works. We can look at it as the union of two compound queries. The parentheses ensure that both MINUS operations take place before the UNION ALL operation is performed. The result of the first MINUS query will be those rows in CUSTOMER_KNOWN_GOOD that are not also in CUSTOMER_TEST. The result of the second MINUS query will be those rows in CUSTOMER_TEST that are not also in CUSTOMER_KNOWN_GOOD. The UNION ALL operator simply combines these two result sets for convenience. If no rows are returned by this query, then we know that both tables contain identical rows. Any rows returned by this query represent differences between the CUSTOMER_TEST and CUSTOMER_KNOWN_GOOD tables.
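The behavior described above can be tried out with a minimal sketch. It assumes Python's built-in sqlite3 module as a stand-in for Oracle, so MINUS is written as EXCEPT, and each parenthesized compound query becomes a subquery because SQLite does not accept parentheses around the operands of a set operator. The table contents, including customer 4, are made up for illustration.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for Oracle.
# SQLite spells Oracle's MINUS operator as EXCEPT.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer_known_good (cust_nbr INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE customer_test       (cust_nbr INTEGER PRIMARY KEY, name TEXT NOT NULL);
    -- Hypothetical rows: customer 3 is missing from the test table,
    -- and customer 4 appears only in the test table.
    INSERT INTO customer_known_good VALUES (1, 'Sony'), (2, 'Samsung'), (3, 'Panasonic');
    INSERT INTO customer_test       VALUES (1, 'Sony'), (2, 'Samsung'), (4, 'Toshiba');
""")

# Rows only in KNOWN_GOOD, followed by rows only in TEST.
COMPARE_SQL = """
    SELECT * FROM (SELECT * FROM customer_known_good
                   EXCEPT
                   SELECT * FROM customer_test)
    UNION ALL
    SELECT * FROM (SELECT * FROM customer_test
                   EXCEPT
                   SELECT * FROM customer_known_good)
"""

differences = conn.execute(COMPARE_SQL).fetchall()
for row in differences:
    print(row)
```

An empty result would mean the two tables hold the same rows. Here the query reports the Panasonic row, which exists only in the known-good table, and the Toshiba row, which exists only in the test table.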

If one or both tables may contain duplicate rows, we must use a more general form of this query to test the two tables for equality. This more general form uses row counts to detect duplicates:

(SELECT C1.*, COUNT(*)
 FROM CUSTOMER_KNOWN_GOOD C1
 GROUP BY C1.CUST_NBR, C1.NAME...
MINUS
 SELECT C2.*, COUNT(*)
 FROM CUSTOMER_TEST C2
 GROUP BY C2.CUST_NBR, C2.NAME...)
UNION ALL
(SELECT C3.*,COUNT(*) 
 FROM CUSTOMER_TEST C3
 GROUP BY C3.CUST_NBR, C3.NAME...
MINUS
 SELECT C4.*, COUNT(*)
 FROM CUSTOMER_KNOWN_GOOD C4
 GROUP BY C4.CUST_NBR, C4.NAME...)

This query is getting complex! The GROUP BY clause (see Chapter 4) for each SELECT must list all columns for the table being selected. Any duplicate rows will be grouped together, and the count will reflect the number of duplicates. If the number of duplicates is the same in both tables, the MINUS operations will cancel those rows out. If any rows are different, or if any occurrence counts are different, the resulting rows will be reported by the query.
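To see why the counts matter, the following sketch models both forms of the comparison in plain Python rather than SQL. It is only an illustration of the logic, not of how the database evaluates the query: a set plays the role of MINUS, which discards duplicates, and collections.Counter plays the role of GROUP BY with COUNT(*). The rows and the helper name counted are hypothetical.

```python
from collections import Counter

# Hypothetical rows: the known-good data holds the Sony row twice,
# the test data holds it only once.
known_good = [(1, "Sony"), (1, "Sony"), (2, "Samsung")]
test = [(1, "Sony"), (2, "Samsung")]

# Short form: set difference in both directions. Duplicates collapse,
# so the extra Sony row goes unnoticed.
short_form = sorted((set(known_good) - set(test)) | (set(test) - set(known_good)))


def counted(rows):
    """Mimic GROUP BY on all columns plus COUNT(*): one (row..., count) tuple per distinct row."""
    return {(*row, n) for row, n in Counter(rows).items()}


# General form: compare the counted rows, so a differing count makes
# the row survive the difference in both directions.
general_form = (sorted(counted(known_good) - counted(test))
                + sorted(counted(test) - counted(known_good)))

print(short_form)    # no differences found
print(general_form)  # the Sony row is reported once per table, with its count
```

The short form finds nothing, while the general form reports the Sony row twice: once with a count of 2 from the known-good data and once with a count of 1 from the test data.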

Let's look at an example to illustrate how this query works. We'll start with the following tables and data:

DESC CUSTOMER_KNOWN_GOOD
 Name                         Null?    Type
 ---------------------------- -------- ----------------
 CUST_NBR                     NOT NULL NUMBER(5)
 NAME                         NOT NULL VARCHAR2(30)

SELECT * FROM CUSTOMER_KNOWN_GOOD;

   CUST_NBR NAME
----------- ------------------------------
          1 Sony
          1 Sony
          2 Samsung
          3 Panasonic
          3 Panasonic
          3 Panasonic

6 rows selected.

DESC CUSTOMER_TEST
 Name                         Null?    Type
 ---------------------------- -------- ----------------
 CUST_NBR                     NOT NULL NUMBER(5)
 NAME                         NOT NULL VARCHAR2(30)

SELECT * FROM CUSTOMER_TEST;

   CUST_NBR NAME
----------- ------------------------------
          1 Sony
          1 Sony
          2 Samsung
          2 Samsung
          3 Panasonic

As we can see, the CUSTOMER_KNOWN_GOOD and CUSTOMER_TEST tables have the same structure but different data. Also notice that neither table has a primary or unique key; both contain duplicate records. The following SQL will compare these two tables effectively:

(SELECT C1.*, COUNT(*)
FROM CUSTOMER_KNOWN_GOOD C1
GROUP BY C1.CUST_NBR, C1.NAME
MINUS
SELECT C2.*, COUNT(*)
FROM CUSTOMER_TEST C2
GROUP BY C2.CUST_NBR, C2.NAME)
UNION ALL
(SELECT C3.*, COUNT(*)
FROM CUSTOMER_TEST C3
GROUP BY C3.CUST_NBR, C3.NAME
MINUS
SELECT C4.*, COUNT(*)
FROM CUSTOMER_KNOWN_GOOD C4
GROUP BY C4.CUST_NBR, C4.NAME);

   CUST_NBR NAME                             COUNT(*)
----------- ------------------------------ ----------
          2 Samsung                                 1
          3 Panasonic                               3
          2 Samsung                                 2
          3 Panasonic                               1

The first two rows of the output come from CUSTOMER_KNOWN_GOOD and the last two from CUSTOMER_TEST. These results indicate that CUSTOMER_KNOWN_GOOD has one record for "Samsung", whereas CUSTOMER_TEST has two records for the same customer. Likewise, CUSTOMER_KNOWN_GOOD has three records for "Panasonic", whereas CUSTOMER_TEST has only one. Both tables have the same number of rows (two) for "Sony", and therefore "Sony" doesn't appear in the output.
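For readers without an Oracle instance at hand, the example can be reproduced with a short sketch using Python's built-in sqlite3 module. This is an approximation under stated assumptions: SQLite writes MINUS as EXCEPT, the parenthesized compound queries become subqueries, the columns are listed explicitly instead of using the alias.* form, and the column types are SQLite's rather than NUMBER(5) and VARCHAR2(30).

```python
import sqlite3

# Assumption: in-memory SQLite stands in for Oracle; EXCEPT replaces MINUS.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer_known_good (cust_nbr INTEGER NOT NULL, name TEXT NOT NULL);
    CREATE TABLE customer_test       (cust_nbr INTEGER NOT NULL, name TEXT NOT NULL);
    INSERT INTO customer_known_good VALUES
        (1, 'Sony'), (1, 'Sony'), (2, 'Samsung'),
        (3, 'Panasonic'), (3, 'Panasonic'), (3, 'Panasonic');
    INSERT INTO customer_test VALUES
        (1, 'Sony'), (1, 'Sony'), (2, 'Samsung'), (2, 'Samsung'), (3, 'Panasonic');
""")

# General form: group on every column and compare the occurrence counts.
COMPARE_SQL = """
    SELECT * FROM (SELECT cust_nbr, name, COUNT(*) AS cnt
                   FROM customer_known_good GROUP BY cust_nbr, name
                   EXCEPT
                   SELECT cust_nbr, name, COUNT(*)
                   FROM customer_test GROUP BY cust_nbr, name)
    UNION ALL
    SELECT * FROM (SELECT cust_nbr, name, COUNT(*) AS cnt
                   FROM customer_test GROUP BY cust_nbr, name
                   EXCEPT
                   SELECT cust_nbr, name, COUNT(*)
                   FROM customer_known_good GROUP BY cust_nbr, name)
"""

differences = conn.execute(COMPARE_SQL).fetchall()
for row in differences:
    print(row)
```

The query returns the same four rows as the Oracle output, although the order in which SQLite returns them is not guaranteed without an ORDER BY clause. The "Sony" rows cancel out because both tables hold two of them.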

Duplicate rows are not possible in tables that have a primary key or at least one unique index. Use the short form of the table comparison query for such tables.
