5.2 Noncorrelated Subqueries

5.2 Noncorrelated Subqueries

Noncorrelated subqueries allow each row from the containing SQL statement to be compared to a set of values. Divide noncorrelated subqueries into the following three categories, depending on the number of rows and columns returned in their result set:

Single-row, single-column subqueries
Multiple-row, single-column subqueries
Multiple-column subqueries

Depending on the category, different sets of operators may be employed by the containing SQL statement to interact with the subquery.

5.2.1 Single-Row, Single-Column Subqueries

A subquery that returns a single row with a single column is treated like a scalar by the containing statement; not surprisingly, these types of subqueries are known as scalar subqueries. The subquery may appear on either side of a condition, and the usual comparison operators (=, <, >, !=, <=, >=) are employed. The following query illustrates the utility of single-row, single-column subqueries by finding all employees earning an above-average salary. The subquery returns the average salary, and the containing query then returns all employees who earn more than that amount.

SELECT lname 
FROM employee
WHERE salary > (SELECT AVG(salary) 
                FROM EMPLOYEE);

LNAME
--------------------
Brown
Smith
Blake
Isaacs
Jacobs
King
Fox
Anderson
Nichols
Iverson
Peters
Russell

As this query demonstrates, it can be perfectly reasonable for a subquery to reference the same tables as the containing query. In fact, subqueries are frequently used to isolate a subset of records within a table. For example, many applications include maintenance routines that clean up operational data, such as exception or load logs. Every week, a script might delete all but the latest day's activity. For example:

DELETE FROM load_log 
WHERE load_dt < (SELECT MAX(TRUNC(load_dt)) 
                 FROM load_log);

Noncorrelated subqueries are also commonly found outside the WHERE clause, as illustrated by the following query, which identifies the salesperson responsible for the most orders:

SELECT sales_emp_id, COUNT(*) 
FROM cust_order
GROUP BY sales_emp_id
HAVING COUNT(*) = (SELECT MAX(COUNT(*)) 
                   FROM cust_order 
                   GROUP BY sales_emp_id);

SALES_EMP_ID   COUNT(*)
------------ ----------
          30        121

This subquery calculates the number of orders attributable to each salesperson, and then applies the MAX function to return only the highest number of orders. The containing query performs the same aggregation as the subquery and then keeps only those salespeople whose total sales count matches the maximum value returned by the subquery. Interestingly, the containing query can return more than one row if multiple salespeople tie for the maximum sales count, while the subquery is guaranteed to return a single row and column. If it seems wasteful that the subquery and containing query both perform the same aggregation, it is; see Chapter 13 for more efficient ways to handle these types of queries.

So far, we have seen scalar subqueries in the WHERE and HAVING clauses of SELECT statements, along with the WHERE clause of a DELETE statement. Before we delve deeper into the different types of subqueries, let's explore where else subqueries can and can't be utilized in SQL statements:

The FROM clause may contain any type of noncorrelated subquery.
The SELECT and ORDER BY clauses may contain scalar subqueries.
The GROUP BY clause may not contain subqueries.
The START WITH and CONNECT BY clauses, used for querying hierarchical data, may contain subqueries and will be examined in detail in Chapter 8.

5.2.2 Multiple-Row Subqueries

Now that we know how to use single-row, single-column subqueries, let's explore how to use subqueries that return multiple rows. When a subquery returns more than one row, it is not possible to use only comparison operators, since a single value cannot be directly compared to a set of values. However, a single value can be compared to each value in a set. To accomplish this, the special keywords ANY and ALL may be used with comparison operators to determine if a value is equal to (or less than, greater than, etc.) any members of the set or all members of the set. Consider the following query:

SELECT fname, lname
FROM employee
WHERE dept_id = 3 AND salary >= ALL
 (SELECT salary
  FROM employee 
  WHERE dept_id = 3);

FNAME                LNAME
-------------------- --------------------
Mark                 Russell

The subquery returns the set of salaries for department 3, and the containing query checks each employee in the department to see if her salary is greater or equal to every salary returned by the subquery. Thus, this query retrieves the name of the highest paid person in department 3. While everyone except the lowest paid employee has a salary >= some of the salaries in the departement, only the highest paid employee has a salary >= all of the salaries in the department. If multiple employees tie for the highest salary in the department, multiple names will be returned.

Another way to phrase the previous query is to find the employee whose salary is not less than any other salary in the department. We can do this using the ANY operator:

SELECT fname, lname
FROM employee
WHERE dept_id = 3 AND NOT salary < ANY
 (SELECT salary
  FROM employee 
  WHERE dept_id = 3);

There are almost always multiple ways to phrase the same query. One of the challenges of writing SQL is striking the right balance between efficiency and readability. In this case, I might prefer using AND salary >= ALL over AND NOT salary < ANY because the first variation is easier to understand; however, the latter form might prove more efficient, since each evaluation of the subquery results requires from 1 to N comparisons when using ANY versus exactly N comparisons when using ALL.^[1]

^[1] If there are 100 people in the department, each of the 100 salaries needs to be compared to the entire set of 100. When using ANY, the comparison can be suspended as soon as a larger salary is identified in the set, whereas using ALL requires 100 comparisons to ensure that there are no smaller salaries in the set.

The next query uses the ANY operator to find all employees whose salary exceeds that of any top-level manager:

SELECT fname, lname
FROM employee
WHERE manager_emp_id IS NOT NULL
  AND salary > ANY
 (SELECT salary
  FROM employee
  WHERE manager_emp_id IS NULL);

FNAME                LNAME
-------------------- --------------------
Laura                Peters
Mark                 Russell

The subquery returns the set of salaries for all top-level managers, and the containing query returns the names of non-top-level managers whose salary exceeds any of the salaries returned by the subquery. Any time this query returns one or more rows, rest assured that top-level management will vote themselves a pay increase.

For the previous three queries, failure to include either the ANY or ALL operators will result in the following error:

ORA-01427: single-row subquery returns more than one row

The wording of this error message is a bit confusing. After all, how can a single-row subquery return multiple rows? What the error message is trying to convey is that a multiple-row subquery has been identified where only a single-row subquery is allowed. If we are not absolutely certain that our subquery will return exactly one row, we must include ANY or ALL to ensure our code doesn't fail in the future.

Along with ANY and ALL, we may also use the IN operator for working with multi-row subqueries. Using IN with a subquery is functionally equivalent to using = ANY, and returns TRUE if a match is found in the set returned by the subquery. The following query uses IN to postpone shipment of all orders containing parts which are not currently in stock:

UPDATE cust_order 
SET expected_ship_dt = TRUNC(SYSDATE) + 1
WHERE ship_dt IS NULL AND order_nbr IN
  (SELECT l.order_nbr 
   FROM line_item l, part p
   WHERE l.part_nbr = p.part_nbr AND p.inventory_qty = 0);

The subquery returns the set of orders requesting out-of-stock parts, and the containing UPDATE statement modifies the expected ship date of all orders in the set. We think you will agree that IN is more intuitive than = ANY, which is why IN is almost always used in such situations. Similarly, we can use NOT IN instead of using != ANY as demonstrated by the next query, which deletes all customers who haven't placed an order in the past five years:

DELETE FROM customer
WHERE cust_nbr NOT IN 
  (SELECT cust_nbr 
   FROM cust_order
   WHERE order_dt >= TRUNC(SYSDATE) -- (365 * 5));

The subquery returns the set of customers that have placed an order in the past five years, and the containing DELETE statement removes all customers that are not in the set returned by the subquery.

Finding members of one set that do not exist in another set is referred to as an anti-join. As the name implies, an anti-join is the opposite of a join; rows from table A are returned if the specified data is not found in table B. The Oracle optimizer can employ multiple strategies for executing such queries, including a merge anti-join or a hash anti-join.^[2]

^[2] Since this is not a tuning book, I will refrain from delving into the inner workings of the Oracle optimizer and how the optimizer can be influenced via hints. For more information, please see the Oracle SQL Tuning Pocket Reference by Mark Gurry (O'Reilly).

5.2.3 Multiple-Column Subqueries

While all of the previous examples compare a single column from the containing SQL statement to the result set returned by the subquery, it is also possible to issue a subquery against multiple columns. Consider the following UPDATE statement, which rolls up data from an operational table into an aggregate table:

UPDATE monthly_orders SET
  tot_orders = (SELECT COUNT(*) 
    FROM cust_order 
    WHERE order_dt >= TO_DATE('01-NOV-2001','DD-MON-YYYY') 
      AND order_dt < TO_DATE('01-DEC-2001','DD-MON-YYYY')
      AND cancelled_dt IS NULL),
  max_order_amt = (SELECT MAX(sale_price) 
    FROM cust_order 
    WHERE order_dt >= TO_DATE('01-NOV-2001','DD-MON-YYYY') 
      AND order_dt < TO_DATE('01-DEC-2001','DD-MON-YYYY')
      AND cancelled_dt IS NULL),
  min_order_amt = (SELECT MIN(sale_price) 
    FROM cust_order 
    WHERE order_dt >= TO_DATE('01-NOV-2001','DD-MON-YYYY') 
      AND order_dt < TO_DATE('01-DEC-2001','DD-MON-YYYY') 
      AND cancelled_dt IS NULL),
  tot_amt = (SELECT SUM(sale_price) 
    FROM cust_order 
    WHERE order_dt >= TO_DATE('01-NOV-2001','DD-MON-YYYY') 
      AND order_dt < TO_DATE('01-DEC-2001','DD-MON-YYYY')
      AND cancelled_dt IS NULL)
WHERE month = 11 and year = 2001;

The UPDATE statement modifies four columns in the monthly_orders table, and values for each of the four columns are calculated by aggregating data in the cust_order table. Looking closely, we see that the WHERE clauses for all four subqueries are identical; only the aggregation type differs in the four queries. The next query demonstrates how all four columns can be populated with a single trip through the cust_order table:

UPDATE monthly_orders 
SET (tot_orders, max_order_amt, min_order_amt, tot_amt) =
  (SELECT COUNT(*), MAX(sale_price), MIN(sale_price), SUM(sale_price)
   FROM cust_order
   WHERE order_dt >= TO_DATE('01-NOV-2001','DD-MON-YYYY') 
     AND order_dt < TO_DATE('01-DEC-2001','DD-MON-YYYY') 
     AND cancelled_dt IS NULL)
WHERE month = 11 and year = 2001;

The second statement achieves the same result more efficiently than the first by performing four aggregations during one trip through the cust_order table, rather than one aggregation during each of four separate trips.

Whereas the previous example demonstrates the use of a multiple-column subquery in the SET clause of an UPDATE statement, such subqueries may also be utilized in the WHERE clause of a SELECT, UPDATE, or DELETE statement. The next statement deletes all items from open orders that include discontinued parts:

DELETE FROM line_item
WHERE (order_nbr, part_nbr) IN
 (SELECT c.order_nbr, p.part_nbr
  FROM cust_order c, line_item li, part p
  WHERE c.ship_dt IS NULL AND c.cancelled_dt IS NULL
    AND c.order_nbr = li.order_nbr
    AND li.part_nbr = p.part_nbr
    AND p.status = 'DISCONTINUED');

Note the use of the IN operator in the WHERE clause. Two columns are listed together in parentheses prior to the IN keyword. Values in these two columns are compared to the set of two values returned by each row of the subquery. If a match is found, the row is removed from the line_item table.