13.1 Analytic SQL OverviewThe types of queries issued by Decision Support Systems (DSS) differ from those issued against OLTP systems. Consider the following business queries:
Queries such as these are staples of DSS, and are used by managers, analysts, marketing executives, etc. to spot trends, identify outliers, uncover business opportunities, and predict future business performance. DSS systems typically sit atop data warehouses, in which large quantities of scrubbed, aggregated data provide fertile grounds for researching and formulating business decisions. While all of the previous queries can be easily expressed in English, they have historically been difficult to formulate using SQL for the following reasons:
While it is possible to generate the desired results using such SQL features as self joins, inline views, and user-defined functions, the resulting queries can be difficult to understand and might yield unacceptably long execution times. To illustrate the difficulty in formulating such queries, we will walk through the construction of this query: "Find all customers whose total orders last year exceeded 20% of the aggregate sales for their geographic region." For this and other examples in this chapter, we use a simple star schema consisting of a single fact table (called "orders") containing aggregated sales information across the following dimensions: region, salesperson, customer, and month. There are two main facets to this query, each requiring a different level of aggregation of the same data:
After these two intermediate result sets have been constructed, we need to compare each customer's total to the total for their region and see if it exceeds 20%. The final result set will show the customer names along with their total sales, region name, and the percentage of their region's sales. The query to aggregate sales by region looks as follows: SELECT o.region_id region_id, SUM(o.tot_sales) tot_sales FROM orders o WHERE o.year = 2001 GROUP BY o.region_id; REGION_ID TOT_SALES ---------- ---------- 5 6585641 6 6307766 7 6868495 8 6853015 9 6739374 10 6238901 The query to aggregate sales by customer would be: SELECT o.cust_nbr cust_nbr, o.region_id region_id, SUM(o.tot_sales) tot_sales FROM orders o WHERE o.year = 2001 GROUP BY o.cust_nbr, o.region_id; CUST_NBR REGION_ID TOT_SALES ---------- ---------- ---------- 1 5 1151162 2 5 1224992 3 5 1161286 4 5 1878275 5 5 1169926 6 6 1788836 7 6 971585 8 6 1141638 9 6 1208959 10 6 1196748 11 7 1190421 12 7 1182275 13 7 1310434 14 7 1929774 15 7 1255591 16 8 1068467 17 8 1944281 18 8 1253840 19 8 1174421 20 8 1412006 21 9 1020541 22 9 1036146 23 9 1224992 24 9 1224992 25 9 2232703 26 10 1808949 27 10 1322747 28 10 986964 29 10 903383 30 10 1216858 By placing each of the two queries in an inline view and joining them on region_id, we can locate those customers whose total sales exceeds 20% of their region, as in: SELECT cust_sales.cust_nbr cust_nbr, cust_sales.region_id region_id, cust_sales.tot_sales cust_sales, region_sales.tot_sales region_sales FROM (SELECT o.region_id region_id, SUM(o.tot_sales) tot_sales FROM orders o WHERE o.year = 2001 GROUP BY o.region_id) region_sales, (SELECT o.cust_nbr cust_nbr, o.region_id region_id, SUM(o.tot_sales) tot_sales FROM orders o WHERE o.year = 2001 GROUP BY o.cust_nbr, o.region_id) cust_sales WHERE cust_sales.region_id = region_sales.region_id AND cust_sales.tot_sales > (region_sales.tot_sales * .2); CUST_NBR REGION_ID CUST_SALES REGION_SALES ---------- ---------- ---------- ------------ 4 5 1878275 6585641 6 6 1788836 6307766 14 7 1929774 6868495 17 8 1944281 6853015 20 8 1412006 6853015 25 9 2232703 6739374 26 10 1808949 6238901 27 10 1322747 6238901 The final step is to join the region and customer dimensions in order to include the customer and region names in the result set: SELECT c.name cust_name, big_custs.cust_sales cust_sales, r.name region_name, 100 * ROUND(big_custs.cust_sales / big_custs.region_sales, 2) percent_of_region FROM region r, customer c, (SELECT cust_sales.cust_nbr cust_nbr, cust_sales.region_id region_id, cust_sales.tot_sales cust_sales, region_sales.tot_sales region_sales FROM (SELECT o.region_id region_id, SUM(o.tot_sales) tot_sales FROM orders o WHERE o.year = 2001 GROUP BY o.region_id) region_sales, (SELECT o.cust_nbr cust_nbr, o.region_id region_id, SUM(o.tot_sales) tot_sales FROM orders o WHERE o.year = 2001 GROUP BY o.cust_nbr, o.region_id) cust_sales WHERE cust_sales.region_id = region_sales.region_id AND cust_sales.tot_sales > (region_sales.tot_sales * .2)) big_custs WHERE big_custs.cust_nbr = c.cust_nbr AND big_custs.region_id = r.region_id; CUST_NAME CUST_SALES REGION_NAME PERCENT_OF_REGION ---------------------- ---------- -------------------- ----------------- Flowtech Inc. 1878275 New England 29 Spartan Industries 1788836 Mid-Atlantic 28 Madden Industries 1929774 SouthEast US 28 Evans Supply Corp. 1944281 SouthWest US 28 Malden Labs 1412006 SouthWest US 21 Worcester Technologies 2232703 NorthWest US 33 Alpha Technologies 1808949 Central US 29 Phillips Labs 1322747 Central US 21 Using nothing more exotic than inline views, we can construct a single query that generates the desired results. The solution, however, has the following shortcomings:
Let's see how we can both simplify the query and perform the same work in a single pass through the orders table using one of the new analytic functions. Rather than issuing two separate queries to aggregate sales per region and per customer, we will create a single query that aggregates sales over both region and customer. We can then call an analytic function that performs a second level of aggregation to generate total sales per region: SELECT o.region_id region_id, o.cust_nbr cust_nbr, SUM(o.tot_sales) tot_sales, SUM(SUM(o.tot_sales)) OVER (PARTITION BY o.region_id) region_sales FROM orders o WHERE o.year = 2001 GROUP BY o.region_id, o.cust_nbr; REGION_ID CUST_NBR TOT_SALES REGION_SALES ---------- ---------- ---------- ------------ 5 1 1151162 6584167 5 2 1223518 6584167 5 3 1161286 6584167 5 4 1878275 6584167 5 5 1169926 6584167 6 6 1788836 6307766 6 7 971585 6307766 6 8 1141638 6307766 6 9 1208959 6307766 6 10 1196748 6307766 7 11 1190421 6868495 7 12 1182275 6868495 7 13 1310434 6868495 7 14 1929774 6868495 7 15 1255591 6868495 8 16 1068467 6853015 8 17 1944281 6853015 8 18 1253840 6853015 8 19 1174421 6853015 8 20 1412006 6853015 9 21 1020541 6726929 9 22 1036146 6726929 9 23 1212547 6726929 9 24 1224992 6726929 9 25 2232703 6726929 10 26 1808949 6238901 10 27 1322747 6238901 10 28 986964 6238901 10 29 903383 6238901 10 30 1216858 6238901 The analytic function can be found in line 3 of the previous query and the result has the alias region_sales. The aggregate function (SUM(o.tot_sales)) in line 2 generates the total sales per customer and region as directed by the GROUP BY clause, and the analytic function in line 3 aggregates these sums for each region, thereby computing the aggregate sales per region. The value for the region_sales column is identical for all customers within the same region and is equal to the sum of all customer sales within that region. We can then wrap the query in an inline view,[1] filter out those customers with less than 20% of their region's total sales, and join the region and customer tables to generate the desired result set:
SELECT c.name cust_name, cust_sales.tot_sales cust_sales, r.name region_name, 100 * ROUND(cust_sales.tot_sales / cust_sales.region_sales, 2) percent_of_region FROM region r, customer c, (SELECT o.region_id region_id, o.cust_nbr cust_nbr, SUM(o.tot_sales) tot_sales, SUM(SUM(o.tot_sales)) OVER (PARTITION BY o.region_id) region_sales FROM orders o WHERE o.year = 2001 GROUP BY o.region_id, o.cust_nbr) cust_sales WHERE cust_sales.tot_sales > (cust_sales.region_sales * .2) AND cust_sales.region_id = r.region_id AND cust_sales.cust_nbr = c.cust_nbr; CUST_NAME CUST_SALES REGION_NAME PERCENT_OF_REGION ---------------------- ---------- -------------------- ----------------- Flowtech Inc. 1878275 New England 29 Spartan Industries 1788836 Mid-Atlantic 28 Madden Industries 1929774 SouthEast US 28 Evans Supply Corp. 1944281 SouthWest US 28 Malden Labs 1412006 SouthWest US 21 Worcester Technologies 2232703 NorthWest US 33 Alpha Technologies 1808949 Central US 29 Phillips Labs 1322747 Central US 21 Without getting into the details of how the SUM...OVER function works (we will discuss it later in this chapter under Section 13.4), we can see that Oracle is performing an aggregation of an aggregation rather than revisiting the detail rows twice. Thus, the query runs faster and should also prove easier to understand and maintain once the syntax is familiar. Unlike built-in functions such as DECODE, GREATEST, and SUBSTR, Oracle's suite of analytic functions can only be used in the SELECT clause of a query. This is because analytic functions are only executed after the FROM, WHERE, GROUP BY, and HAVING clauses have been evaluated. After the analytic functions have executed, the query's ORDER BY clause is evaluated in order to sort the final result set, and the ORDER BY clause is allowed to reference columns in the SELECT clause generated via analytic functions. The remainder of the chapter introduces the Oracle8i and Oracle9i analytic functions, grouped by functionality. |