14.1 Know When to Use Specific Constructs

14.1 Know When to Use Specific Constructs

Depending on the circumstances, certain SQL constructs are preferable to others. For example, use of the EXISTS predicate is often preferable to IN. The same is not true for NOT EXISTS versus NOT IN. The next sections discuss the usage of such constructs.

14.1.1 EXISTS Is Preferable to DISTINCT

The DISTINCT keyword used in a SELECT clause eliminates duplicate rows in the result set. To eliminate those duplicates, Oracle performs a sort, and that sort requires time and disk space. Therefore, avoid using DISTINCT if you can tolerate having duplicate rows returned by a query. If you can't tolerate the duplicate rows, or your application can't handle them, use EXISTS in place of DISTINCT.

For example, assume you are trying to find the names of customers who have orders. Your query has to be based on two tables: CUSTOMER and CUST_ORDER. Using DISTINCT, your query would be written as follows:

SELECT DISTINCT C.CUST_NBR, C.NAME
FROM CUSTOMER C, CUST_ORDER O
WHERE C.CUST_NBR = O.CUST_NBR;

The corresponding execution plan for this query is as follows. Note the SORT operation, which is a result of DISTINCT being used.

Query Plan
-----------------------------------------
SELECT STATEMENT   Cost = 3056
  SORT UNIQUE
    MERGE JOIN
      INDEX FULL SCAN IND_ORD_CUST_NBR
      SORT JOIN
        TABLE ACCESS FULL CUSTOMER

To use EXISTS, the query needs to be rewritten as follows:

SELECT C.CUST_NBR, C.NAME
FROM CUSTOMER C
WHERE EXISTS (SELECT 1 FROM CUST_ORDER O WHERE C.CUST_NBR = O.CUST_NBR);

Here is the execution plan for the EXISTS version of the query. Look at the cost of this query versus the earlier DISTINCT query, and notice the performance improvement.

Query Plan
---------------------------------------
SELECT STATEMENT   Cost = 320
  FILTER
    TABLE ACCESS FULL CUSTOMER
    INDEX RANGE SCAN IND_ORD_CUST_NBR

The version of the query using EXISTS is less than one-ninth as costly as the version using DISTINCT. This is because the sort has been avoided.

14.1.2 EXISTS Versus IN

Many SQL books discuss the fact that NOT EXISTS performs better than NOT IN. We have found that with Oracle8i, the EXPLAIN PLAN generated by NOT EXISTS is exactly the same as that generated by NOT IN, and the performance for the two predicates is the same.

However, the comparison between EXISTS and IN is a different story. We've found that EXISTS often performs better than IN. Let's look at an example that demonstrates this. The following query uses IN to delete the orders for customers in region 5:

DELETE FROM CUST_ORDER
WHERE CUST_NBR IN
(SELECT CUST_NBR FROM CUSTOMER
WHERE REGION_ID = 5);

The execution plan for this query is as follows:

Query Plan
------------------------------------
DELETE STATEMENT   Cost = 3
  DELETE  CUST_ORDER
    HASH JOIN
      TABLE ACCESS FULL CUST_ORDER
      TABLE ACCESS FULL CUSTOMER

Now, let's look at that same query, written using EXISTS:

DELETE FROM CUST_ORDER
WHERE EXISTS
(SELECT CUST_NBR FROM CUSTOMER
WHERE CUST_ORDER.CUST_NBR = CUSTOMER.CUST_NBR
AND REGION_ID = 5);

The execution plan for the EXISTS version of the query is:

Query Plan
--------------------------------------------
DELETE STATEMENT   Cost = 1
  DELETE  CUST_ORDER
    FILTER
      TABLE ACCESS FULL CUST_ORDER
      TABLE ACCESS BY INDEX ROWID CUSTOMER
        INDEX UNIQUE SCAN CUSTOMER_PK

Notice the cost difference between the two queries. The IN version of the query has a cost of 3, while the EXISTS version of the query has a cost of only 1. When the EXISTS clause is used, the execution plan is driven by the outer table, whereas when the IN clause is used, the execution plan is driven by the table in the subquery. The EXISTS query will almost always be faster than the IN query, except for cases when the table in the subquery has very few rows as compared to the outer table.

14.1.3 WHERE Versus HAVING

We discussed the GROUP BY and HAVING clauses in Chapter 4. Sometimes, when writing a GROUP BY query, you have a condition that you can specify in either the WHERE clause or the HAVING clause. In situations where you have a choice, you'll always get better performance if you specify the condition in the WHERE clause. The reason is that it's less expensive to eliminate rows before they are summarized than it is to eliminate results after summarization.

Let's look at an example illustrating the advantage of WHERE over HAVING. Here's a query with the HAVING clause that reports the number of orders in the year 2000:

SELECT YEAR, COUNT(*)
FROM ORDERS
GROUP BY YEAR
HAVING YEAR = 2000;

      YEAR   COUNT(*)
---------- ----------
      2000        720

The execution plan for this query is as follows:

Query Plan
-------------------------------------------
SELECT STATEMENT   Cost = 6
  FILTER
    SORT GROUP BY
      INDEX FAST FULL SCAN ORDERS_PK

Now, look at that same query, but with the year restriction in the WHERE clause:

SELECT YEAR, COUNT(*)
FROM ORDERS
WHERE YEAR = 2000
GROUP BY YEAR;

      YEAR   COUNT(*)
---------- ----------
      2000        720

The execution plan for this version of the query is:

Query Plan
-------------------------------------
SELECT STATEMENT   Cost = 2
  SORT GROUP BY NOSORT
    INDEX FAST FULL SCAN ORDERS_PK

With the HAVING clause, the query performs the group operation first, and then filters the groups for the condition specified. The WHERE clause version of the query filters the rows before performing the group operation. The result of filtering with the WHERE clause is that there are fewer rows to summarize, and consequently the query performs better.

However, you should note that not all types of filtering can be achieved using the WHERE clause. Sometimes, you may need to summarize the data first, and then filter the summarized data based upon the summarized values. In such situations, you have to filter using the HAVING clause, because only the HAVING clause can "see" summarized values. Moreover, there are situations when you may need to use the WHERE clause and the HAVING clause together in a query to filter the results the way you want. For details, see Chapter 4.

14.1.4 UNION Versus UNION ALL

We discussed UNION and UNION ALL in Chapter 6. UNION ALL combines the results of two SELECT statements. UNION combines the results of two SELECT statements, and then returns only distinct rows from the combination; duplicates are eliminated. It is, therefore, obvious that to remove the duplicates, UNION performs one extra step than UNION ALL. This extra step is a sort, which is costly in terms of performance. Therefore, whenever your application can handle duplicates or you are certain that no duplicates will result, consider using UNION ALL instead of UNION.

Let's look an example to understand this issue better. The following query uses UNION to return a list of orders where the sale price exceeds $50.00 or where the customer is located in region 5:

SELECT ORDER_NBR, CUST_NBR FROM CUST_ORDER WHERE SALE_PRICE > 50
UNION
SELECT ORDER_NBR, CUST_NBR FROM CUST_ORDER
WHERE CUST_NBR IN 
(SELECT CUST_NBR FROM CUSTOMER WHERE REGION_ID = 5);

 ORDER_NBR   CUST_NBR
---------- ----------
      1000          1
      1001          1
      1002          5
      1003          4
      1004          4
      1005          8
      1006          1
      1007          5
      1008          5
      1009          1
      1011          1
      1012          1
      1015          5
      1017          4
      1019          4
      1021          8
      1023          1
      1025          5
      1027          5
      1029          1

20 rows selected.

The execution plan for this UNION query is:

Query Plan
-----------------------------------------------------------------------------------
SELECT STATEMENT   Cost = 8
  SORT UNIQUE
    UNION-ALL
      TABLE ACCESS FULL CUST_ORDER
      HASH JOIN
        TABLE ACCESS FULL CUSTOMER
        TABLE ACCESS FULL CUST_ORDER

The following query uses UNION ALL instead of UNION to get the same information:

SELECT ORDER_NBR, CUST_NBR FROM CUST_ORDER WHERE SALE_PRICE > 50
UNION ALL
SELECT ORDER_NBR, CUST_NBR FROM CUST_ORDER
WHERE CUST_NBR IN 
(SELECT CUST_NBR FROM CUSTOMER WHERE REGION_ID = 5);

 ORDER_NBR   CUST_NBR
---------- ----------
      1001          1
      1003          4
      1005          8
      1009          1
      1012          1
      1017          4
      1021          8
      1029          1
      1001          1
      1000          1
      1002          5
      1003          4
      1004          4
      1006          1
      1007          5
      1008          5
      1009          1
      1012          1
      1011          1
      1015          5
      1017          4
      1019          4
      1023          1
      1025          5
      1027          5
      1029          1

26 rows selected.

Note the duplicate rows in the output. However, note also that UNION ALL performs better than UNION, as you can see from the following execution plan:

Query Plan
-----------------------------------------------------------------------------------
SELECT STATEMENT   Cost = 4
  UNION-ALL
    TABLE ACCESS FULL CUST_ORDER
    HASH JOIN
      TABLE ACCESS FULL CUSTOMER
      TABLE ACCESS FULL CUST_ORDER

Compare this execution plan with its cost of 4 with the previous plan and its cost of 8. You can see that the extra operation (SORT UNIQUE) in the UNION makes it run slower than UNION ALL.