13.2 Ranking FunctionsDetermining the performance of a particular business entity compared to its peers is central to a wide variety of business decisions. Examples include:
Prior to the release of Oracle8i, programmers could use the ORDER BY clause to sort a result set on one or more columns, but any further processing to calculate rankings or percentiles had to be performed using a procedural language. Beginning with Oracle8i, however, developers can take advantage of several new functions to either generate rankings for each row in a result set or to group rows into buckets for percentile calculations. 13.2.1 RANK, DENSE_RANK, and ROW_NUMBERThe RANK, DENSE_RANK, and ROW_NUMBER functions generate an integer value from 1 to N for each row, where N is less than or equal to the number of rows in the result set. The differences in the values returned by these functions revolves around how each one handles ties:
To illustrate the differences, we generate rankings for each customer according to their total yearly sales. Here is the query to generate the sales data for the year 2001: SELECT region_id, cust_nbr, SUM(tot_sales) cust_sales FROM orders WHERE year = 2001 GROUP BY region_id, cust_nbr ORDER BY region_id, cust_nbr; REGION_ID CUST_NBR CUST_SALES ---------- ---------- ---------- 5 1 1151162 5 2 1224992 5 3 1161286 5 4 1878275 5 5 1169926 6 6 1788836 6 7 971585 6 8 1141638 6 9 1208959 6 10 1196748 7 11 1190421 7 12 1182275 7 13 1310434 7 14 1929774 7 15 1255591 8 16 1068467 8 17 1944281 8 18 1253840 8 19 1174421 8 20 1412006 9 21 1020541 9 22 1036146 9 23 1224992 9 24 1224992 9 25 2232703 10 26 1808949 10 27 1322747 10 28 986964 10 29 903383 10 30 1216858 Notice that three of the customers (2, 23, and 24) have the same value for total sales ($1,224,992). In the next query, we will add three function calls to generate rankings for each customer across all regions, and we will order the result set by the ROW_NUMBER function to make the difference in rankings easier to observe: SELECT region_id, cust_nbr, SUM(tot_sales) cust_sales, RANK( ) OVER (ORDER BY SUM(tot_sales) DESC) sales_rank, DENSE_RANK( ) OVER (ORDER BY SUM(tot_sales) DESC) sales_dense_rank, ROW_NUMBER( ) OVER (ORDER BY SUM(tot_sales) DESC) sales_number FROM orders WHERE year = 2001 GROUP BY region_id, cust_nbr ORDER BY 6; REGION_ID CUST_NBR CUST_SALES SALES_RANK SALES_DENSE_RANK SALES_NUMBER ---------- ---------- ---------- ---------- ---------------- ------------ 9 25 2232703 1 1 1 8 17 1944281 2 2 2 7 14 1929774 3 3 3 5 4 1878275 4 4 4 10 26 1808949 5 5 5 6 6 1788836 6 6 6 8 20 1412006 7 7 7 10 27 1322747 8 8 8 7 13 1310434 9 9 9 7 15 1255591 10 10 10 8 18 1253840 11 11 11 5 2 1224992 12 12 12 9 23 1224992 12 12 13 9 24 1224992 12 12 14 10 30 1216858 15 13 15 6 9 1208959 16 14 16 6 10 1196748 17 15 17 7 11 1190421 18 16 18 7 12 1182275 19 17 19 8 19 1174421 20 18 20 5 5 1169926 21 19 21 5 3 1161286 22 20 22 5 1 1151162 23 21 23 6 8 1141638 24 22 24 8 16 1068467 25 23 25 9 22 1036146 26 24 26 9 21 1020541 27 25 27 10 28 986964 28 26 28 6 7 971585 29 27 29 10 29 903383 30 28 30 Don't be confused by the ORDER BY clause at the end of the query and the ORDER BY clauses within each function call; the functions use their ORDER BY clause internally to sort the results for the purpose of applying a ranking. Thus, each of the three functions applies its ranking algorithm to the sum of each customer's sales in descending order. The final ORDER BY clause specifies the results of the ROW_NUMBER function as the sort key for the final result set, but we could have picked any of the six columns as our sort key. Both the RANK and DENSE_RANK functions assign the rank of 12 to the 3 rows with total sales of $1,224,992, while the ROW_NUMBER function assigns the ranks 12, 13, and 14 to the same rows. The difference between the RANK and DENSE_RANK functions manifests itself in the ranking assigned to the next-lowest sales total; the RANK function leaves a gap in the ranking sequence and assigns a rank of 15 to customer number 30, while the DENSE_RANK function continues the sequence with a ranking of 13. Deciding which of the three functions to use depends on the desired outcome. If we want to identify the top 13 customers from this result set, we would use:
While the previous query generates rankings across the entire result set, it is also possible to generate independent sets of rankings across multiple partitions of the result set. The following query generates rankings for customer sales within each region rather than across all regions. Note the addition of the PARTITION BY clause: SELECT region_id, cust_nbr, SUM(tot_sales) cust_sales, RANK( ) OVER (PARTITION BY region_id ORDER BY SUM(tot_sales) DESC) sales_rank, DENSE_RANK( ) OVER (PARTITION BY region_id ORDER BY SUM(tot_sales) DESC) sales_dense_rank, ROW_NUMBER( ) OVER (PARTITION BY region_id ORDER BY SUM(tot_sales) DESC) sales_number FROM orders WHERE year = 2001 GROUP BY region_id, cust_nbr ORDER BY 1,6; REGION_ID CUST_NBR CUST_SALES SALES_RANK SALES_DENSE_RANK SALES_NUMBER ---------- ---------- ---------- ---------- ---------------- ------------ 5 4 1878275 1 1 1 5 2 1224992 2 2 2 5 5 1169926 3 3 3 5 3 1161286 4 4 4 5 1 1151162 5 5 5 6 6 1788836 1 1 1 6 9 1208959 2 2 2 6 10 1196748 3 3 3 6 8 1141638 4 4 4 6 7 971585 5 5 5 7 14 1929774 1 1 1 7 13 1310434 2 2 2 7 15 1255591 3 3 3 7 11 1190421 4 4 4 7 12 1182275 5 5 5 8 17 1944281 1 1 1 8 20 1412006 2 2 2 8 18 1253840 3 3 3 8 19 1174421 4 4 4 8 16 1068467 5 5 5 9 25 2232703 1 1 1 9 23 1224992 2 2 2 9 24 1224992 2 2 3 9 22 1036146 4 3 4 9 21 1020541 5 4 5 10 26 1808949 1 1 1 10 27 1322747 2 2 2 10 30 1216858 3 3 3 10 28 986964 4 4 4 10 29 903383 5 5 5 Each customer receives a ranking between one and five depending on their relation to other customers in the same region. Of the three customers with duplicate total sales, two of them are in region 9; as before, the RANK and DENSE_RANK functions generate identical rankings for both customers.
13.2.1.1 Handling NULLsAll ranking functions allow the caller to specify where in the ranking order NULL values should appear. This is accomplished by appending either NULLS FIRST or NULLS LAST after the ORDER BY clause of the function, as in: SELECT region_id, cust_nbr, SUM(tot_sales) cust_sales,
RANK( ) OVER (ORDER BY SUM(tot_sales) DESC NULLS LAST) sales_rank
FROM orders
WHERE year = 2001
GROUP BY region_id, cust_nbr;
If omitted, NULL values will either appear last in ascending rankings or first in descending rankings. 13.2.1.2 Top/Bottom-N queriesOne of the most common uses of a ranked data set is to identify the top-N or bottom-N performers. Since we can't call analytic functions from the WHERE or HAVING clauses, we are forced to generate the rankings for all the rows and then use an outer query to filter out the unwanted rankings. For example, the following query uses an inline view to identify the top-5 salespersons for 2001: SELECT s.name, sp.sp_sales total_sales FROM salesperson s, (SELECT salesperson_id, SUM(tot_sales) sp_sales, RANK( ) OVER (ORDER BY SUM(tot_sales) DESC) sales_rank FROM orders WHERE year = 2001 GROUP BY salesperson_id) sp WHERE sp.sales_rank <= 5 AND sp.salesperson_id = s.salesperson_id ORDER BY sp.sales_rank; NAME TOTAL_SALES ------------------------------------------------- ----------- Jeff Blake 1927580 Sam Houseman 1814327 Mark Russell 1784596 John Boorman 1768813 Carl Isaacs 1761814 13.2.1.3 FIRST/LASTWhile there is no function for returning only the top or bottom-N from a ranked result set, Oracle provides functionality for identifying the first (top 1) or last (bottom 1) records in a ranked set. This is useful for queries such as the following: "Find the regions with the best and worst total sales last year." Unlike the top-5 salespeople example from the previous section, this query needs an additional piece of information—the size of the result set—in order to answer the question. Oracle 9i provides the ability to answer such queries efficiently using functions that rank the result set based on a specified ordering, identify the row with the top or bottom ranking, and report on any column available in the result set. These functions are composed of three parts:
The following query uses the MIN aggregate function to find the regions that rank FIRST and LAST by total sales: SELECT MIN(region_id) KEEP (DENSE_RANK FIRST ORDER BY SUM(tot_sales) DESC) best_region, MIN(region_id) KEEP (DENSE_RANK LAST ORDER BY SUM(tot_sales) DESC) worst_region FROM orders WHERE year = 2001 GROUP BY region_id; BEST_REGION WORST_REGION ----------- ------------ 7 10 The use of the MIN function in the previous query is a bit confusing: it is only used if more than one region ties for either first or last place in the ranking. If there were a tie, the row with the minimum value for region_id would be chosen. To find out if a tie actually exists, we could call each function twice using MIN for the first and MAX for the second, and see if they return the same results: SELECT MIN(region_id) KEEP (DENSE_RANK FIRST ORDER BY SUM(tot_sales) DESC) min_best_region, MAX(region_id) KEEP (DENSE_RANK FIRST ORDER BY SUM(tot_sales) DESC) max_best_region, MIN(region_id) KEEP (DENSE_RANK LAST ORDER BY SUM(tot_sales) DESC) min_worst_region, MAX(region_id) KEEP (DENSE_RANK LAST ORDER BY SUM(tot_sales) DESC) max_worst_region FROM orders WHERE year = 2001 GROUP BY region_id; MIN_BEST_REGION MAX_BEST_REGION MIN_WORST_REGION MAX_WORST_REGION --------------- --------------- ---------------- ---------------- 7 7 10 10 In this case, there are no ties for either first or last place. Depending on the type of data you are working with, using an aggregate function as a tiebreaker can be somewhat arbitrary. 13.2.2 NTILEAnother way rankings are commonly used is to generate buckets into which sets of rankings are grouped. For example, we may want to find those customers whose total sales ranked in the top 25%. The following query uses the NTILE function to group the customers into four buckets (or quartiles): SELECT region_id, cust_nbr, SUM(tot_sales) cust_sales, NTILE(4) OVER (ORDER BY SUM(tot_sales) DESC) sales_quartile FROM orders WHERE year = 2001 GROUP BY region_id, cust_nbr ORDER BY 4,3 DESC; REGION_ID CUST_NBR CUST_SALES SALES_QUARTILE ---------- ---------- ---------- -------------- 9 25 2232703 1 8 17 1944281 1 7 14 1929774 1 5 4 1878275 1 10 26 1808949 1 6 6 1788836 1 8 20 1412006 1 10 27 1322747 1 7 13 1310434 2 7 15 1255591 2 8 18 1253840 2 5 2 1224992 2 9 23 1224992 2 9 24 1224992 2 10 30 1216858 2 6 9 1208959 2 6 10 1196748 3 7 11 1190421 3 7 12 1182275 3 8 19 1174421 3 5 5 1169926 3 5 3 1161286 3 5 1 1151162 3 6 8 1141638 4 8 16 1068467 4 9 22 1036146 4 9 21 1020541 4 10 28 986964 4 6 7 971585 4 10 29 903383 4 The sales_quartile column in this query specifies NTILE(4) in order to create four buckets. The NTILE function finds each row's place in the ranking, and then assigns each row to a bucket such that every bucket contains the same number of rows. If the number of rows is not evenly divisible by the number of buckets, then the extra rows are distributed so that the number of rows per bucket differs by one at most. In the previous example, there are four buckets allocated for 30 rows, with buckets one and two containg eight rows each, and buckets three and four containing seven rows each. This approach is referred to as equiheight buckets because each bucket contains (optimally) the same number of rows. Just like in the top-N query discussed earlier, we will need to wrap the query in an inline view if we want to filter on the NTILE result: SELECT r.name region, c.name customer, cs.cust_sales FROM customer c, region r, (SELECT region_id, cust_nbr, SUM(tot_sales) cust_sales, NTILE(4) OVER (ORDER BY SUM(tot_sales) DESC) sales_quartile FROM orders WHERE year = 2001 GROUP BY region_id, cust_nbr) cs WHERE cs.sales_quartile = 1 AND cs.cust_nbr = c.cust_nbr AND cs.region_id = r.region_id ORDER BY cs.cust_sales DESC; REGION CUSTOMER CUST_SALES -------------------- ------------------------------ ---------- NorthWest US Worcester Technologies 2232703 SouthWest US Evans Supply Corp. 1944281 SouthEast US Madden Industries 1929774 New England Flowtech Inc. 1878275 Central US Alpha Technologies 1808949 Mid-Atlantic Spartan Industries 1788836 SouthWest US Malden Labs 1412006 Central US Phillips Labs 1322747 The outer query filters on sales_quartile = 1, which removes all rows not in the top 25% of sales, and then joins the region and customer dimensions to generate the final results. 13.2.3 WIDTH_BUCKETSimilar to the NTILE function, the WIDTH_BUCKET function groups rows of the result set into buckets. Unlike NTILE, however, the WIDTH_BUCKET function attempts to create equiwidth buckets, meaning that the range of values is evenly distributed across the buckets. If your data were distributed across a bell curve, therefore, you could expect the buckets representing the low and high ranges of the bell curve to contain few records, whereas the buckets representing the middle ranges would contain many records. New in Oracle9i, WIDTH_BUCKET can operate on numeric or date types, and takes the following four parameters:
WIDTH_BUCKET uses the values of the second, third, and fourth parameters to generate N buckets containing comparable ranges. If the expression yields values that fall outside the range specified by the second and third parameters, the WIDTH_BUCKET function will generate two additional buckets, numbered 0 and N+1, into which the outliers are placed. If we want to work with the entire result set, we need to make sure our values for the second and third parameters completely enclose the range of values in the result set. However, if we only wish to work with a subset of the data, we can specify values for the second and third parameters that enclose the desired range, and any rows falling outside the range will be placeds into buckets 0 and N+1. To illustrate, we will use the NTILE example from earlier to generate three buckets for the total sales per customer: SELECT region_id, cust_nbr, SUM(tot_sales) cust_sales, WIDTH_BUCKET(SUM(tot_sales), 1, 3000000, 3) sales_buckets FROM orders WHERE year = 2001 GROUP BY region_id, cust_nbr ORDER BY 3; REGION_ID CUST_NBR CUST_SALES SALES_BUCKETS ---------- ---------- ---------- ------------- 10 29 903383 1 6 7 971585 1 10 28 986964 1 9 21 1020541 2 9 22 1036146 2 8 16 1068467 2 6 8 1141638 2 5 1 1151162 2 5 3 1161286 2 5 5 1169926 2 8 19 1174421 2 7 12 1182275 2 7 11 1190421 2 6 10 1196748 2 6 9 1208959 2 10 30 1216858 2 5 2 1224992 2 9 24 1224992 2 9 23 1224992 2 8 18 1253840 2 7 15 1255591 2 7 13 1310434 2 10 27 1322747 2 8 20 1412006 2 6 6 1788836 2 10 26 1808949 2 5 4 1878275 2 7 14 1929774 2 8 17 1944281 2 9 25 2232703 3 Based on these parameters, the WIDTH_BUCKET function generates three buckets; the first bucket starts at 1, and the third bucket has an upper range of 3,000,000. Since there are three buckets, the ranges for each bucket will be 1 to 1,000,000, 1,000,0001 to 2,000,000, and 2,000,0001 to 3,000,000. When the rows are placed in the appropriate bucket, we find that three rows fall into bucket #1, a single row falls in bucket #3, and the remaining 26 rows fall into the second bucket. The values 1 and 3,000,000 were chosen to guarantee that all rows in the result set would be placed into one of the three buckets. If we want to generate buckets only for rows that have aggregate sales between $1,000,000 and $2,000,000, the WIDTH_BUCKET function will place the remaining rows in the 0th and 4th buckets: SELECT region_id, cust_nbr, SUM(tot_sales) cust_sales, WIDTH_BUCKET(SUM(tot_sales), 1000000, 2000000, 3) sales_buckets FROM orders WHERE year = 2001 GROUP BY region_id, cust_nbr ORDER BY 3; REGION_ID CUST_NBR CUST_SALES SALES_BUCKETS ---------- ---------- ---------- ------------- 10 29 903383 0 6 7 971585 0 10 28 986964 0 9 21 1020541 1 9 22 1036146 1 8 16 1068467 1 6 8 1141638 1 5 1 1151162 1 5 3 1161286 1 5 5 1169926 1 8 19 1174421 1 7 12 1182275 1 7 11 1190421 1 6 10 1196748 1 6 9 1208959 1 10 30 1216858 1 5 2 1224992 1 9 24 1224992 1 9 23 1224992 1 8 18 1253840 1 7 15 1255591 1 7 13 1310434 1 10 27 1322747 1 8 20 1412006 2 6 6 1788836 3 10 26 1808949 3 5 4 1878275 3 7 14 1929774 3 8 17 1944281 3 9 25 2232703 4 Keep in mind that the WIDTH_BUCKET function does not remove rows from the result set that do not lie within the specified range; rather, they are placed into special buckets that your query can either utilize or ignore as needed. 13.2.4 CUME_DIST and PERCENT_RANKThe final two ranking functions, CUME_DIST and PERCENT_RANK, use the rank of a particular row to calculate additional information. The CUME_DIST function (short for Cumulative Distribution) calculates the ratio of the number of rows that have a lesser or equal ranking to the total number of rows in the partition. The PERCENT_RANK function calculates the ratio of a row's ranking to the number of rows in the partition using the formula: (RRP -- 1) / (NRP -- 1) where:
Both calculations utilize DENSE_RANK for their rankings and can be specified to be in ascending or descending order. The following query demonstrates the use of these two functions (both specifying descending order) with our customer yearly sales query: SELECT region_id, cust_nbr, SUM(tot_sales) cust_sales, CUME_DIST( ) OVER (ORDER BY SUM(tot_sales) DESC) sales_cume_dist, PERCENT_RANK( ) OVER (ORDER BY SUM(tot_sales) DESC) sales_percent_rank FROM orders WHERE year = 2001 GROUP BY region_id, cust_nbr ORDER BY 3 DESC; REGION_ID CUST_NBR CUST_SALES SALES_CUME_DIST SALES_PERCENT_RANK ---------- ---------- ---------- --------------- ------------------ 9 25 2232703 .033333333 0 8 17 1944281 .066666667 .034482759 7 14 1929774 .1 .068965517 5 4 1878275 .133333333 .103448276 10 26 1808949 .166666667 .137931034 6 6 1788836 .2 .172413793 8 20 1412006 .233333333 .206896552 10 27 1322747 .266666667 .24137931 7 13 1310434 .3 .275862069 7 15 1255591 .333333333 .310344828 8 18 1253840 .366666667 .344827586 5 2 1224992 .466666667 .379310345 9 23 1224992 .466666667 .379310345 9 24 1224992 .466666667 .379310345 10 30 1216858 .5 .482758621 6 9 1208959 .533333333 .517241379 6 10 1196748 .566666667 .551724138 7 11 1190421 .6 .586206897 7 12 1182275 .633333333 .620689655 8 19 1174421 .666666667 .655172414 5 5 1169926 .7 .689655172 5 3 1161286 .733333333 .724137931 5 1 1151162 .766666667 .75862069 6 8 1141638 .8 .793103448 8 16 1068467 .833333333 .827586207 9 22 1036146 .866666667 .862068966 9 21 1020541 .9 .896551724 10 28 986964 .933333333 .931034483 6 7 971585 .966666667 .965517241 10 29 903383 1 1 Let's walk through a couple of calculations for customer number 1 in the previous result set. With total sales of $1,151,162, customer number 1 ranks 23rd in the set of 30 customers in descending order of sales. Since there are a total of 30 rows, the CUME_DIST is equal to 23/30, or .766666667. The PERCENT_RANK function yields (23 - 1) / (30 - 1) = .75862069. It should come as no surprise that both functions return identical values for the rows that have identical sales totals, since the calculations are based on rank, which is identical for all three rows. 13.2.5 Hypothetical FunctionsFor some types of analysis, determining what might have happened is more revealing than knowing what really happened. With the Oracle9i release, Oracle provides special versions of RANK, DENSE_RANK, CUME_DIST, and PERCENT_RANK that allow rankings and distributions to be calculated for hypothetical data, allowing the user to see what would have happened if a specific value (or set of values) was included in a data set. In order to illustrate this concept, we will first rank our customers by total sales for 2001, and then we will see where a hypothetical sales figure would fall in the ranking. Here is the query that generates the rankings and distributions: SELECT cust_nbr, SUM(tot_sales) cust_sales, RANK( ) OVER (ORDER BY SUM(tot_sales) DESC) rank, DENSE_RANK( ) OVER (ORDER BY SUM(tot_sales) DESC) dense_rank, CUME_DIST( ) OVER (ORDER BY SUM(tot_sales) DESC) cume_dist, PERCENT_RANK( ) OVER (ORDER BY SUM(tot_sales) DESC) percent_rank FROM orders WHERE year = 2001 GROUP BY cust_nbr ORDER BY 3; CUST_NBR CUST_SALES RANK DENSE_RANK CUME_DIST PERCENT_RANK ---------- ---------- ---------- ---------- ---------- ------------ 25 2232703 1 1 .033333333 0 17 1944281 2 2 .066666667 .034482759 14 1929774 3 3 .1 .068965517 4 1878275 4 4 .133333333 .103448276 26 1808949 5 5 .166666667 .137931034 6 1788836 6 6 .2 .172413793 20 1412006 7 7 .233333333 .206896552 27 1322747 8 8 .266666667 .24137931 13 1310434 9 9 .3 .275862069 15 1255591 10 10 .333333333 .310344828 18 1253840 11 11 .366666667 .344827586 2 1224992 12 12 .466666667 .379310345 23 1224992 12 12 .466666667 .379310345 24 1224992 12 12 .466666667 .379310345 30 1216858 15 13 .5 .482758621 9 1208959 16 14 .533333333 .517241379 10 1196748 17 15 .566666667 .551724138 11 1190421 18 16 .6 .586206897 12 1182275 19 17 .633333333 .620689655 19 1174421 20 18 .666666667 .655172414 5 1169926 21 19 .7 .689655172 3 1161286 22 20 .733333333 .724137931 1 1151162 23 21 .766666667 .75862069 8 1141638 24 22 .8 .793103448 16 1068467 25 23 .833333333 .827586207 22 1036146 26 24 .866666667 .862068966 21 1020541 27 25 .9 .896551724 28 986964 28 26 .933333333 .931034483 7 971585 29 27 .966666667 .965517241 29 903383 30 28 1 1 Now let's see where a customer with an even million dollars of sales would have ranked: SELECT RANK(1000000) WITHIN GROUP (ORDER BY SUM(tot_sales) DESC) hyp_rank, DENSE_RANK(1000000) WITHIN GROUP (ORDER BY SUM(tot_sales) DESC) hyp_dense_rank, CUME_DIST(1000000) WITHIN GROUP (ORDER BY SUM(tot_sales) DESC) hyp_cume_dist, PERCENT_RANK(1000000) WITHIN GROUP (ORDER BY SUM(tot_sales) DESC) hyp_percent_rank FROM orders WHERE year = 2001 GROUP BY cust_nbr; HYP_RANK HYP_DENSE_RANK HYP_CUME_DIST HYP_PERCENT_RANK ---------- -------------- ------------- ---------------- 28 26 .903225806 .9 The WITHIN GROUP clause has the effect of injecting a fictitious row into the result set before determining the rankings. One possible use of this functionality would be to see how actual sales compare to sales targets. |