7.14 Working with Duplicates
7.14.1 Problem
You've just imported a new set of rows
into
your buffer tables. However, you forgot to delete the previous
import. Now you have a table with some duplicate rows. You need a set
of tools to eliminate these duplicates and to prevent them from
occurring in the future.
Let's say that you uploaded some data from your
sales agent to the Bookstore table; however, due to a mistake in the
code, the upload doubled some rows:
BookId Name Quantity Price Type
----------- ------------------------ ----------- ------- ----------
1 Software Engineering 5 15.00 Manual
2 Modern Operating Systems 7 20.00 Reference
3 Learn SQL 15 18.00 Textbook
4 Learn Advanced SQL 1 8.00 Textbook
5 JavaScript Tutorial 5 10.00 Textbook
6 Modern Operating Systems 7 20.00 Reference
7 Learn SQL 15 18.00 Textbook
7.14.2 Solution
We'll demonstrate a technique for dealing with
duplicates on the Bookstore table. Please note that the BookId column
is always unique and that it is used for identifying otherwise
duplicated rows. This will be useful when only one representative of
a duplicated set needs to be singled out.
The queries in this recipe show different ways to count duplicate
rows, report on duplicate rows, extract nonduplicated rows, and
eliminate duplicate rows.
7.14.2.1 Reduction
The term reduction refers to
the
elimination of duplicate rows. The following query returns all
nonduplicate rows from the Bookstore table:
SELECT DISTINCT * FROM Bookstore
Because the DISTINCT keyword was used, any two rows containing the
same set of values will be combined into one row in the
query's result set. Because this query includes the
primary-key column BookId, it's of no practical use
in our particular case; some books still appear twice.
7.14.2.2 Selective reduction
Selective reduction is reduction that
is performed only on a
subset of the columns in a table. The following query retrieves each
unique combination of name, price, and color from the Stock table:
SELECT DISTINCT Name, Type, Price FROM Bookstore
The table stores five different books:
Name Type Price
---------------------------------------- -------------------- ---------
JavaScript Tutorial Textbook 10.00
Learn Advanced SQL Textbook 8.00
Learn SQL Textbook 18.00
Modern Operating Systems Reference 20.00
Software Engineering Manual 15.00
This time, because we did not include the always unique primary key,
each book is listed only once.
7.14.2.3 Selecting duplicates
The following query shows how you
can
find duplicate name, price, and color combinations:
SELECT Name, Type, Price
FROM Bookstore
GROUP BY Name, Type, Price
HAVING count(*) > 1
The HAVING clause in this query ensures that if a given name, price,
and color combination is only represented once, it will not be listed
in the results from this query. Any such combinations that occur more
than once will be listed:
Name Type Price
---------------------------------------- -------------------- ---------
Learn SQL Textbook 18.00
Modern Operating Systems Reference 20.00
7.14.2.4 Counting occurrences
The following query produces a report showing
the number of occurrences of each
combination of values in a given set of columns. The results are
sorted in order of highest occurrence.
SELECT Name, Type, Price, count(*) Count
FROM Bookstore
GROUP BY Name, Type, Price
ORDER BY count(*) DESC
The query lists all combinations of name, price, and type. It also
lists the number of times each combination occurs. Combinations with
the greatest number of occurrences will be listed first:
Name Type Price Count
------------------------- -------------------- ------------ -----------
Learn SQL Textbook 18.00 2
Modern Operating Systems Reference 20.00 2
JavaScript Tutorial Textbook 10.00 1
Learn Advanced SQL Textbook 8.00 1
Software Engineering Manual 15.00 1
To see the same information, but only for cases where the occurrence
count is greater than 1, use the following variation on the query:
SELECT Name, Type, Price, count(*) Count
FROM Bookstore
GROUP BY Name, Type, Price
HAVING count(*) > 1
ORDER BY count(*) DESC
This second version of the query uses a HAVING clause to restrict the
results to only those combinations of name, price, and type that
occur multiple times:
Name Type Price Count
------------------------ -------------------- ------------ -----------
Learn SQL Textbook 18.00 2
Modern Operating Systems Reference 20.00 2
7.14.2.5 Selecting by number of occurrences
The following query selects duplicates over a subset of columns where
the occurrence count is at least 3:
SELECT Name, Type, Price, count(*) Count
FROM Bookstore
GROUP BY Name, Type, Price
HAVING count(*) >= 3
You can replace the constant 3 in this query with any other
occurrence count threshold of interest.
7.14.2.6 Selecting nonduplicates
The following query selects nonduplicated rows over a subset of
columns, ignoring the ones that are duplicated:
SELECT Name, Type, Price
FROM Bookstore
GROUP BY Name, Type, Price
HAVING count(*)= 1
In this example, rows representing id, name, and qty combinations
that occur more than once will not be returned in the result set:
Name Type Price
------------------------ -------------------- ------------
JavaScript Tutorial Textbook 10.00
Learn Advanced SQL Textbook 8.00
Software Engineering Manual 15.00
7.14.2.7 Selecting duplicates with an odd occurrence count
If you're interested in duplicates where the
occurrence count is an odd number other than 1, use a query such as
this:
SELECT Name, Type, Price, count(*) Count
FROM Bookstore
GROUP BY Name, Type, Price
HAVING count(*) % 2 = 1 AND count(*)>1
Since 1 is considered an odd number, but does not imply duplication,
the WHERE clause specifically excludes rows with an occurrence count
of 1.
7.14.2.8 Selecting duplicates with an even occurrence count
A similar query can be used to retrieve duplicates where the
occurrence count is an even number:
SELECT Name, Type, Price, count(*) Count
FROM Bookstore
GROUP BY Name, Type, Price
HAVING count(*) % 2 = 0
Note that there is no need to specifically exclude rows with an
occurrence count of 1, because 1 is not an even number.
7.14.2.9 Deleting duplicate rows
You may find yourself faced with a situation where you want to
arbitrarily delete duplicate data from a table. Data like this is
often the result of running an import procedure twice. As long as you
have at least one unique value, such as a primary key value, to
identify each row, you can arbitrarily delete duplicates using a
single DELETE statement. The statement in the following example
deletes duplicate rows, arbitrarily retaining one record for each
fruit and color combination:
DELETE
FROM Warehouse
WHERE BookId NOT IN (
SELECT MAX(BookId)
FROM Bookstore
GROUP BY Name, Type
HAVING COUNT(*) > 1)
The subquery in this example identifies the highest ID number value
for each name and color combination. This identifies the row that we
are arbitrarily going to keep. All other rows with that name and
color combination are deleted. The key here is that the ID is unique
for all rows in a name and type combination.
7.14.2.10 Preventing duplicates
To ensure that no duplicates can
be
inserted into a table, create a unique index using the UNIQUE INDEX
clause, or create a UNIQUE constraint while creating a table. For
example, the following statement creates an index named BookstoreInd
that does not allow any two rows in the Bookstore table to have the
same combination of values for the Name, Type, and Price columns:
CREATE UNIQUE INDEX BookstoreInd on Bookstore
(Name, Type, Price)
7.14.3 Discussion
SQL has built-in support for enforcing an occurrence count of 1 on a
table. This support can be in the form of a UNIQUE constraint or a
unique index on the table. To enforce an occurrence count other than
1, you'll need to use triggers.
When using the SELECT DISTINCT clause in a SELECT statement, you
should be very careful about performance. At first glance, SELECT
DISTINCT may seem like a good tool to use; in reality, such a
statement can be very expensive to execute. Using the DISTINCT
keyword in a SELECT statement causes a sorting operation to occur
when such query is run. The only way that SQL Server can identify
distinct values is to first sort all the results enough to group
duplicates together. Then only one value, the
distinct value, is returned from each set of
duplicates. Because of the performance impact of the sort, you should
avoid the use of DISTINCT unless it is absolutely necessary. The
following query shows a typical example of the DISTINCT keyword being
used unnecessarily. The query uses a subquery within an EXISTS
predicate to check for rows in table B that correspond to those in
table A:
SELECT * FROM A
WHERE EXISTS(
SELECT DISTINCT *
FROM B
WHERE B.id=A.id)
This code returns the correct results, but the use of DISTINCT in the
subquery hurts performance. This is because DISTINCT forces a sort to
occur for each execution of the subquery. Since the subquery is a
correlated subquery — executed for each row — the result is a
lot of sort activity.
The purpose of the EXISTS predicate is to test for the existence of
at least one row by executing a subquery. Since only one row is
needed to cause a TRUE result to be returned,
there's no point in using DISTINCT. It
doesn't matter if more than one row satisfies the
subquery: one is enough . By omitting DISTINCT, you allow the
optimizer to stop when that one row is found.
|