5.3 Correlated Subqueries

A subquery that references one or more columns from its containing SQL statement is called a correlated subquery. Unlike a noncorrelated subquery, which is executed exactly once before the containing statement executes, a correlated subquery is executed once for each candidate row in the intermediate result set of the containing query. For example, consider the following query, which locates all parts supplied by Acme Industries that have been purchased ten or more times since December 1, 2001:

SELECT p.part_nbr, p.name
FROM supplier s, part p
WHERE s.name = 'Acme Industries' 
  AND s.supplier_id = p.supplier_id
  AND 10 <= 
   (SELECT COUNT(*) 
    FROM cust_order co, line_item li
    WHERE li.part_nbr = p.part_nbr 
      AND li.order_nbr = co.order_nbr
      AND co.order_dt >= TO_DATE('01-DEC-2001','DD-MON-YYYY'));

The reference to p.part_nbr is what makes the subquery correlated; values for p.part_nbr must be supplied by the containing query before the subquery can execute. If there are 10,000 parts in the part table, but only 100 are supplied by Acme Industries, the subquery will be executed once for each of the 100 rows in the intermediate result set created by joining the part and supplier tables.^[3]

^[3] It is possible to ask for the subquery to be evaluated earlier in the execution plan using the PUSH_SUBQ hint; once again, we suggest you pick up a good book on Oracle tuning if you are interested in learning more.
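The Acme Industries query can be tried on a handful of rows. The following is only a sketch: it uses Python's built-in sqlite3 module as a stand-in for Oracle, cuts the tables down to the columns the query touches, lowers the threshold from 10 to 2, and replaces TO_DATE with an ISO date string. The sample rows, including the second supplier, are invented for illustration.

```python
import sqlite3

# In-memory SQLite stands in for Oracle; schema and rows are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE supplier (supplier_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE part (part_nbr TEXT PRIMARY KEY, name TEXT, supplier_id INTEGER);
    CREATE TABLE cust_order (order_nbr INTEGER PRIMARY KEY, order_dt TEXT);
    CREATE TABLE line_item (order_nbr INTEGER, part_nbr TEXT);

    INSERT INTO supplier VALUES (1, 'Acme Industries'), (2, 'Tilton Enterprises');
    INSERT INTO part VALUES ('A1', 'Gear', 1), ('A2', 'Bolt', 1), ('T1', 'Cog', 2);
    INSERT INTO cust_order VALUES
        (100, '2001-11-15'), (101, '2001-12-03'), (102, '2001-12-20');
    INSERT INTO line_item VALUES
        (100, 'A1'), (101, 'A1'), (102, 'A1'),  -- A1: two orders since December
        (101, 'A2'),                            -- A2: one order since December
        (101, 'T1'), (102, 'T1');               -- T1: not an Acme part
""")

rows = conn.execute("""
    SELECT p.part_nbr, p.name
    FROM supplier s, part p
    WHERE s.name = 'Acme Industries'
      AND s.supplier_id = p.supplier_id
      AND 2 <=
       (SELECT COUNT(*)
        FROM cust_order co, line_item li
        WHERE li.part_nbr = p.part_nbr       -- correlation to the outer row
          AND li.order_nbr = co.order_nbr
          AND co.order_dt >= '2001-12-01')
""").fetchall()

print(rows)  # [('A1', 'Gear')]
```

Part T1 also has two qualifying orders, but it never reaches the subquery's test in the result because the join to supplier has already excluded it.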

Correlated subqueries are often used to test whether relationships exist without regard to cardinality. We might, for example, want to find all parts that have shipped at least once in 2002. The EXISTS operator is used for these types of queries, as illustrated by the following query:

SELECT p.part_nbr, p.name, p.unit_cost
FROM part p
WHERE EXISTS 
 (SELECT 1 FROM line_item li, cust_order co
  WHERE li.part_nbr = p.part_nbr 
    AND li.order_nbr = co.order_nbr
    AND co.ship_dt >= TO_DATE('01-JAN-2002','DD-MON-YYYY'));

As long as the subquery returns one or more rows, the EXISTS condition is satisfied, no matter how many rows the subquery actually returns. Because EXISTS returns TRUE or FALSE depending only on whether the subquery returns any rows at all, the columns the subquery selects are irrelevant. The SELECT clause requires at least one expression, however, so it is common practice to use either the literal "1" or the wildcard "*".
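A quick way to convince ourselves that the subquery's select list does not matter is to run the same EXISTS query with both forms. This sketch again assumes SQLite (through Python's sqlite3 module) in place of Oracle, with invented rows and a hypothetical helper named parts_with_line_items.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE part (part_nbr TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE line_item (order_nbr INTEGER, part_nbr TEXT);
    INSERT INTO part VALUES ('A1', 'Gear'), ('A2', 'Bolt'), ('A3', 'Shim');
    -- A1 appears on three orders, A2 on one, A3 on none.
    INSERT INTO line_item VALUES (100, 'A1'), (101, 'A1'), (102, 'A1'), (101, 'A2');
""")

def parts_with_line_items(select_list):
    # select_list is spliced into the subquery's SELECT clause; it is only
    # ever one of the fixed strings used below, never user input.
    sql = f"""
        SELECT p.part_nbr
        FROM part p
        WHERE EXISTS
         (SELECT {select_list} FROM line_item li
          WHERE li.part_nbr = p.part_nbr)
        ORDER BY p.part_nbr
    """
    return [row[0] for row in conn.execute(sql)]

with_literal = parts_with_line_items("1")
with_star = parts_with_line_items("*")

print(with_literal)  # ['A1', 'A2']
print(with_star)     # ['A1', 'A2']
```

Part A1 is listed once even though it has three matching line items, and the two select lists produce identical results.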

Conversely, we can test whether a relationship does not exist:

UPDATE customer c 
SET c.inactive_ind = 'Y', c.inactive_dt = TRUNC(SYSDATE)
WHERE c.inactive_dt IS NULL 
  AND NOT EXISTS (SELECT 1 FROM cust_order co
    WHERE co.cust_nbr = c.cust_nbr 
      AND co.order_dt > TRUNC(SYSDATE) - 365);

This statement marks as inactive every active customer who hasn't placed an order in the past year. Such queries are commonly found in maintenance routines. For example, foreign key constraints might prevent child records from referring to a nonexistent parent, but it is still possible to have parent records without children, such as orders with no line items. If business rules prohibit this situation, we might run a utility each week that removes these records, as in:

DELETE FROM cust_order co
WHERE co.order_dt > TRUNC(SYSDATE) - 7
  AND co.cancelled_dt IS NULL
  AND NOT EXISTS 
   (SELECT 1 FROM line_item li 
    WHERE li.order_nbr = co.order_nbr);
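The cleanup can be rehearsed on sample data before it is pointed at a real schema. The sketch below is an approximation in SQLite via Python's sqlite3 module, not the Oracle statement itself: a fixed as_of date (a hypothetical variable) replaces TRUNC(SYSDATE) so the outcome is repeatable, SQLite's date() function does the seven-day arithmetic, and the table alias is dropped from the DELETE. All rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cust_order (order_nbr INTEGER PRIMARY KEY,
                             order_dt TEXT, cancelled_dt TEXT);
    CREATE TABLE line_item (order_nbr INTEGER, part_nbr TEXT);
    INSERT INTO cust_order VALUES
        (100, '2002-03-01', NULL),          -- empty, but older than a week: kept
        (101, '2002-03-12', NULL),          -- recent, has a line item: kept
        (102, '2002-03-13', NULL),          -- recent and empty: deleted
        (103, '2002-03-14', '2002-03-14');  -- recent and empty, but cancelled: kept
    INSERT INTO line_item VALUES (101, 'A1');
""")

as_of = "2002-03-15"  # fixed stand-in for TRUNC(SYSDATE)

cur = conn.execute("""
    DELETE FROM cust_order
    WHERE order_dt > date(?, '-7 days')
      AND cancelled_dt IS NULL
      AND NOT EXISTS
       (SELECT 1 FROM line_item li
        WHERE li.order_nbr = cust_order.order_nbr)
""", (as_of,))

deleted = cur.rowcount
remaining = [row[0] for row in
             conn.execute("SELECT order_nbr FROM cust_order ORDER BY order_nbr")]

print(deleted)    # 1
print(remaining)  # [100, 101, 103]
```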

A query that includes a correlated subquery using the EXISTS operator is referred to as a semi-join. A semi-join includes rows in table A for which corresponding data is found one or more times in table B. Thus, the size of the final result set is unaffected by the number of matches found in table B. Similar to the anti-join discussed earlier, the Oracle optimizer can employ multiple strategies for executing such queries, including a merge semi-join or a hash semi-join.
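The difference between a semi-join and an ordinary join shows up in the row counts. In this sketch (SQLite through Python's sqlite3 module standing in for Oracle, with invented rows), the same two tables are queried both ways; nothing here says which execution strategy either database chooses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE part (part_nbr TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE line_item (order_nbr INTEGER, part_nbr TEXT);
    INSERT INTO part VALUES ('A1', 'Gear'), ('A2', 'Bolt'), ('A3', 'Shim');
    INSERT INTO line_item VALUES (100, 'A1'), (101, 'A1'), (102, 'A1'), (101, 'A2');
""")

# Ordinary join: one output row per matching line item.
join_rows = [row[0] for row in conn.execute("""
    SELECT p.part_nbr
    FROM part p, line_item li
    WHERE li.part_nbr = p.part_nbr
    ORDER BY p.part_nbr
""")]

# Semi-join: one output row per part that has at least one match.
semi_rows = [row[0] for row in conn.execute("""
    SELECT p.part_nbr
    FROM part p
    WHERE EXISTS
     (SELECT 1 FROM line_item li
      WHERE li.part_nbr = p.part_nbr)
    ORDER BY p.part_nbr
""")]

print(join_rows)  # ['A1', 'A1', 'A1', 'A2']
print(semi_rows)  # ['A1', 'A2']
```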

While they are very often used together, the use of correlated subqueries does not require the EXISTS operator. If our database design includes denormalized columns, for example, we might run nightly routines to recalculate the denormalized data, as in:

UPDATE customer c 
SET (c.tot_orders, c.last_order_dt) = 
 (SELECT COUNT(*), MAX(co.order_dt) 
  FROM cust_order co
  WHERE co.cust_nbr = c.cust_nbr 
    AND co.cancelled_dt IS NULL);

Because a SET clause assigns values to columns in the table, the only operator allowed is "=". The subquery returns exactly one row (thanks to the aggregate functions), so the results may be safely assigned to the target columns. Rather than recalculating the totals for every customer each day, a more efficient method might be to update only those customers who placed orders today:

UPDATE customer c SET (c.tot_orders, c.last_order_dt) = 
 (SELECT c.tot_orders + COUNT(*), MAX(co.order_dt) 
  FROM cust_order co
  WHERE co.cust_nbr = c.cust_nbr 
    AND co.cancelled_dt IS NULL
    AND co.order_dt >= TRUNC(SYSDATE))
WHERE c.cust_nbr IN 
 (SELECT co.cust_nbr FROM cust_order co
  WHERE co.order_dt >= TRUNC(SYSDATE) 
    AND co.cancelled_dt IS NULL);

As the previous statement shows, data from the containing query can be used in a correlated subquery for purposes other than join conditions in the WHERE clause. In this example, the SELECT clause of the correlated subquery adds today's order count to the previous value of tot_orders in the customer table to arrive at the new value.
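To check that the incremental update arrives at the same numbers as a full recalculation, we can run it against a small data set and compare. This is a sketch under several assumptions: SQLite (3.15 or later, for the multicolumn SET) through Python's sqlite3 module stands in for Oracle, a fixed TODAY value replaces TRUNC(SYSDATE), the table alias on the UPDATE target is dropped, and the rows are invented.

```python
import sqlite3

TODAY = "2002-03-15"  # fixed stand-in for TRUNC(SYSDATE)

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (cust_nbr INTEGER PRIMARY KEY,
                           tot_orders INTEGER, last_order_dt TEXT);
    CREATE TABLE cust_order (order_nbr INTEGER PRIMARY KEY, cust_nbr INTEGER,
                             order_dt TEXT, cancelled_dt TEXT);
    -- Totals as they stood after last night's run.
    INSERT INTO customer VALUES (1, 2, '2002-03-10'), (2, 1, '2002-02-01');
    INSERT INTO cust_order VALUES
        (100, 1, '2002-03-01', NULL),
        (101, 1, '2002-03-10', NULL),
        (102, 2, '2002-02-01', NULL),
        (103, 1, '2002-03-15', NULL),          -- placed today
        (104, 1, '2002-03-15', NULL),          -- placed today
        (105, 2, '2002-03-15', '2002-03-15');  -- placed today, then cancelled
""")

# Incremental update: only customers with uncancelled orders today are touched.
conn.execute("""
    UPDATE customer SET (tot_orders, last_order_dt) =
     (SELECT customer.tot_orders + COUNT(*), MAX(co.order_dt)
      FROM cust_order co
      WHERE co.cust_nbr = customer.cust_nbr
        AND co.cancelled_dt IS NULL
        AND co.order_dt >= :today)
    WHERE cust_nbr IN
     (SELECT co.cust_nbr FROM cust_order co
      WHERE co.order_dt >= :today
        AND co.cancelled_dt IS NULL)
""", {"today": TODAY})

incremental = conn.execute(
    "SELECT cust_nbr, tot_orders, last_order_dt FROM customer ORDER BY cust_nbr"
).fetchall()

# Full recalculation from the order table, for comparison.
full = conn.execute("""
    SELECT cust_nbr, COUNT(*), MAX(order_dt)
    FROM cust_order
    WHERE cancelled_dt IS NULL
    GROUP BY cust_nbr
    ORDER BY cust_nbr
""").fetchall()

print(incremental)  # [(1, 4, '2002-03-15'), (2, 1, '2002-02-01')]
print(incremental == full)  # True
```

One design point worth keeping in mind: the incremental form is not repeatable. Running it twice on the same day would add today's orders to tot_orders twice, whereas the full recalculation can be rerun safely.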