Team LiB   Previous Section   Next Section

5.5 Subquery Case Study: The Top N Performers

Certain queries that are easily described in English have traditionally been difficult to formulate in SQL. One common example is the "Find the top five salespeople" query. The complexity stems from the fact that data from a table must first be aggregated, and then the aggregated values must be sorted and compared to one another in order to identify the top or bottom performers. In this section, you will see how subqueries may be used to answer such questions. At the end of the section, we introduce ranking functions, a new feature of Oracle SQL that was specifically designed for these types of queries.

5.5.1 A Look at the Data

Consider the problem of finding the top five sales people. Let's assume that we are basing our evaluation on the amount of revenue each salesperson brought in during the previous year. Our first task, then, would be to sum the dollar amount of all orders booked by each saleperson during the year in question. The following query does this for the year 2001:

SELECT e.lname employee, SUM(co.sale_price) total_sales
FROM cust_order co, employee e
WHERE co.order_dt >= TO_DATE('01-JAN-2001','DD-MON-YYYY')
  AND co.order_dt < TO_DATE('01-JAN-2002','DD-MON-YYYY')
  AND co.ship_dt IS NOT NULL AND co.cancelled_dt IS NULL
  AND co.sales_emp_id = e.emp_id
GROUP BY e.lname
ORDER BY 2 DESC;

EMPLOYEE             TOTAL_SALES
-------------------- -----------
Blake                    1927580
Houseman                 1814327
Russell                  1784596
Boorman                  1768813
Isaacs                   1761814
McGowan                  1761814
Anderson                 1757883
Evans                    1737093
Fletcher                 1735575
Dunn                     1723305
Jacobs                   1710831
Thomas                   1695124
Powers                   1688252
Walters                  1672522
Fox                      1645204
King                     1625456
Nichols                  1542152
Young                    1516776
Grossman                 1501039
Iverson                  1468316
Freeman                  1461898
Levitz                   1458053
Peters                   1443837
Jones                    1392648

It appears that Isaacs and McGowan have tied for fifth place, which, as you will see, adds an interesting wrinkle to the problem.

5.5.2 Your Assignment

It seems that the boss was so tickled with this year's sales that she has asked you, the IT manager, to see that each of the top five salespeople receive a bonus equal to 1% of their yearly sales. No problem, you say. You quickly throw together the following report using your favorite feature, the inline view, and send it off to the boss:

SELECT e.lname employee, top5_emp_orders.tot_sales total_sales,
  ROUND(top5_emp_orders.tot_sales * 0.01) bonus
FROM 
 (SELECT all_emp_orders.sales_emp_id emp_id, 
    all_emp_orders.tot_sales tot_sales
  FROM
   (SELECT sales_emp_id, SUM(sale_price) tot_sales
    FROM cust_order
    WHERE order_dt >= TO_DATE('01-JAN-2001','DD-MON-YYYY')
      AND order_dt < TO_DATE('01-JAN-2002','DD-MON-YYYY')
      AND ship_dt IS NOT NULL AND cancelled_dt IS NULL
    GROUP BY sales_emp_id
    ORDER BY 2 DESC
   ) all_emp_orders
  WHERE ROWNUM <= 5
 ) top5_emp_orders, employee e
WHERE top5_emp_orders.emp_id = e.emp_id;

EMPLOYEE             TOTAL_SALES      BONUS
-------------------- ----------- ----------
Blake                    1927580      19276
Houseman                 1814327      18143
Russell                  1784596      17846
Boorman                  1768813      17688
McGowan                  1761814      17618

The howl emitted by Isaacs can be heard for five square blocks. The boss, looking a bit harried, asks you to take another stab at it. Upon reviewing your query, the problem becomes immediately evident; the inline view aggregates the sales data and sorts the results, and the containing query grabs the first five sorted rows and discards the rest. Although it could easily have been McGowan, since there is no second sort column, Isaacs was arbitrarily omitted from the result set.

5.5.3 Second Attempt

You console yourself with the fact that you gave the boss exactly what she asked for: the top five salespeople. However, you realize that part of your job as IT manager is to give people what they need, not necessarily what they ask for, so you rephrase the boss's request as follows: give a bonus to all salespeople whose total sales ranked in the top five last year. This will require two steps: find the fifth highest sales total last year, and then find all salespeople whose total sales meet or exceed that figure.

SELECT e.lname employee, top5_emp_orders.tot_sales total_sales,
  ROUND(top5_emp_orders.tot_sales * 0.01) bonus
FROM employee e,
 (SELECT sales_emp_id, SUM(sale_price) tot_sales
  FROM cust_order
  WHERE order_dt >= TO_DATE('01-JAN-2001','DD-MON-YYYY')
    AND order_dt < TO_DATE('01-JAN-2002','DD-MON-YYYY')
    AND ship_dt IS NOT NULL AND cancelled_dt IS NULL
  GROUP BY sales_emp_id
  HAVING SUM(sale_price) IN
   (SELECT all_emp_orders.tot_sales
    FROM 
     (SELECT SUM(sale_price) tot_sales
      FROM cust_order
      WHERE order_dt >= TO_DATE('01-JAN-2001','DD-MON-YYYY')
        AND order_dt < TO_DATE('01-JAN-2002','DD-MON-YYYY')
        AND ship_dt IS NOT NULL AND cancelled_dt IS NULL
      GROUP BY sales_emp_id
      ORDER BY 1 DESC
     ) all_emp_orders
    WHERE ROWNUM <= 5)
 ) top5_emp_orders
WHERE top5_emp_orders.sales_emp_id = e.emp_id
ORDER BY 2 DESC;

EMPLOYEE             TOTAL_SALES      BONUS
-------------------- ----------- ----------
Blake                    1927580      19276
Houseman                 1814327      18143
Russell                  1784596      17846
Boorman                  1768813      17688
McGowan                  1761814      17618
Isaacs                   1761814      17618

Thus, there are actually six top five salespeople. The main difference between your first attempt and the second is the addition of the HAVING clause in the inline view. The subquery in the HAVING clause returns the five highest sales totals, and the inline view then returns all salespeople (potentially more than five) whose total sales exist in the set returned by the subquery.

While you are confident in your latest results, there are several aspects of the query that bother you:

  • The aggregation of sales data is performed twice.

  • The query will never contend for Most Elegant Query of the Year.

  • You could've sworn you read about a new feature for handling these types of queries. . . .

In fact, there is a new feature for performing ranking queries that is available in release 8.1.6 and later. That feature is the RANK function.

5.5.4 Final Answer

New in 8.1.6, the RANK function is specifically designed to help you write queries to answer questions like the one posed in this case study. Part of a set of analytic functions (all of which will be explored in Chapter 13), the RANK function may be used to assign a ranking to each element of a set. The RANK function understands that there may be ties in the set of values being ranked and leaves gaps in the ranking to compensate. The following query illustrates how rankings would be assigned to the entire set of salespeople; notice how the RANK function leaves a gap between the fifth and seventh rankings to compensate for the fact that two rows share the fifth spot in the ranking:

SELECT sales_emp_id, SUM(sale_price) tot_sales,
  RANK(  ) OVER (ORDER BY SUM(sale_price) DESC) sales_rank
FROM cust_order
WHERE order_dt >= TO_DATE('01-JAN-2001','DD-MON-YYYY')
  AND order_dt < TO_DATE('01-JAN-2002','DD-MON-YYYY')
  AND ship_dt IS NOT NULL AND cancelled_dt IS NULL
GROUP BY sales_emp_id;

SALES_EMP_ID  TOT_SALES SALES_RANK
------------ ---------- ----------
          11    1927580          1
          24    1814327          2
          34    1784596          3
          18    1768813          4
          25    1761814          5
          26    1761814          5
          30    1757883          7
          21    1737093          8
          19    1735575          9
          20    1723305         10
          27    1710831         11
          14    1695124         12
          15    1688252         13
          22    1672522         14
          29    1645204         15
          28    1625456         16
          31    1542152         17
          23    1516776         18
          13    1501039         19
          32    1468316         20
          12    1461898         21
          17    1458053         22
          33    1443837         23
          16    1392648         24

Leaving gaps in the rankings whenever ties are encountered is critical for properly handling these types of queries.[7] Table 5-1 shows the number of rows that would be returned for this data set for various top-N queries.

[7] If we do not wish to have gaps in the ranking, we can use the DENSE_RANK function intead.

Table 5-1. Rows returned for N = {1,2,3,...,9}

Top-N salespeople

Rows returned

1

1

2

2

3

3

4

4

5

6

6

6

7

7

8

8

9

9

As you can see, the result sets would be identical for both the "top five" and "top six" versions of this query for this particular data set.

By wrapping the previous RANK query in an inline view, we can retrieve the salespeople with a ranking of five or less and join the results to the employee table to generate the final result set:

SELECT e.lname employee, top5_emp_orders.tot_sales total_sales,
  ROUND(top5_emp_orders.tot_sales * 0.01) bonus
FROM
 (SELECT all_emp_orders.sales_emp_id emp_id, 
    all_emp_orders.tot_sales tot_sales
  FROM
   (SELECT sales_emp_id, SUM(sale_price) tot_sales,
      RANK(  ) OVER (ORDER BY SUM(sale_price) DESC) sales_rank
    FROM cust_order
    WHERE order_dt >= TO_DATE('01-JAN-2001','DD-MON-YYYY')
      AND order_dt < TO_DATE('01-JAN-2002','DD-MON-YYYY')
      AND ship_dt IS NOT NULL AND cancelled_dt IS NULL
    GROUP BY sales_emp_id
   ) all_emp_orders
  WHERE all_emp_orders.sales_rank <= 5
 ) top5_emp_orders, employee e
WHERE top5_emp_orders.emp_id = e.emp_id
ORDER BY 2 DESC;

EMPLOYEE             TOTAL_SALES      BONUS
-------------------- ----------- ----------
Blake                    1927580      19276
Houseman                 1814327      18143
Russell                  1784596      17846
Boorman                  1768813      17688
McGowan                  1761814      17618
Isaacs                   1761814      17618

If this query is familiar, that's because it's almost identical to the first attempt, except that the RANK function is used instead of the pseudocolumn ROWNUM to determine where to draw the line between the top five salespeople and the rest of the pack.

Now that you are happy with your query and confident in your results, you show your findings to your boss. "Nice work," she says. "Why don't you give yourself a bonus as well? In fact, you can have Isaacs's bonus, since he quit this morning." Salespeople can be so touchy.

    Team LiB   Previous Section   Next Section