13.1 Analytic SQL Overview

13.1 Analytic SQL Overview

The types of queries issued by Decision Support Systems (DSS) differ from those issued against OLTP systems. Consider the following business queries:

Find the top ten salespeople in each sales district last year.
Find all customers whose total orders last year exceeded 20% of the aggregate sales for their geographic region.
Identify the region that suffered the worst quarter-to-quarter sales decline last year.
Find the best and worst selling menu items by state for each quarter last year.

Queries such as these are staples of DSS, and are used by managers, analysts, marketing executives, etc. to spot trends, identify outliers, uncover business opportunities, and predict future business performance. DSS systems typically sit atop data warehouses, in which large quantities of scrubbed, aggregated data provide fertile grounds for researching and formulating business decisions.

While all of the previous queries can be easily expressed in English, they have historically been difficult to formulate using SQL for the following reasons:

They may require different levels of aggregation of the same data.
They may involve intra-table comparisons (comparing one or more rows in a table with other rows in the same table).
They may require an extra filtering step after the result set has been sorted (i.e., finding the top ten and bottom ten salespeople last month).

While it is possible to generate the desired results using such SQL features as self joins, inline views, and user-defined functions, the resulting queries can be difficult to understand and might yield unacceptably long execution times. To illustrate the difficulty in formulating such queries, we will walk through the construction of this query: "Find all customers whose total orders last year exceeded 20% of the aggregate sales for their geographic region."

For this and other examples in this chapter, we use a simple star schema consisting of a single fact table (called "orders") containing aggregated sales information across the following dimensions: region, salesperson, customer, and month. There are two main facets to this query, each requiring a different level of aggregation of the same data:

Sum all sales per region last year.
Sum all sales per customer last year.

After these two intermediate result sets have been constructed, we need to compare each customer's total to the total for their region and see if it exceeds 20%. The final result set will show the customer names along with their total sales, region name, and the percentage of their region's sales.

The query to aggregate sales by region looks as follows:

SELECT o.region_id region_id, SUM(o.tot_sales) tot_sales
FROM orders o
WHERE o.year = 2001
GROUP BY o.region_id;

REGION_ID  TOT_SALES
---------- ----------
         5    6585641
         6    6307766
         7    6868495
         8    6853015
         9    6739374
        10    6238901

The query to aggregate sales by customer would be:

SELECT o.cust_nbr cust_nbr, o.region_id region_id,
  SUM(o.tot_sales) tot_sales
FROM orders o
WHERE o.year = 2001
GROUP BY o.cust_nbr, o.region_id;

  CUST_NBR  REGION_ID  TOT_SALES
---------- ---------- ----------
         1          5    1151162
         2          5    1224992
         3          5    1161286
         4          5    1878275
         5          5    1169926
         6          6    1788836
         7          6     971585
         8          6    1141638
         9          6    1208959
        10          6    1196748
        11          7    1190421
        12          7    1182275
        13          7    1310434
        14          7    1929774
        15          7    1255591
        16          8    1068467
        17          8    1944281
        18          8    1253840
        19          8    1174421
        20          8    1412006
        21          9    1020541
        22          9    1036146
        23          9    1224992
        24          9    1224992
        25          9    2232703
        26         10    1808949
        27         10    1322747
        28         10     986964
        29         10     903383
        30         10    1216858

By placing each of the two queries in an inline view and joining them on region_id, we can locate those customers whose total sales exceeds 20% of their region, as in:

SELECT cust_sales.cust_nbr cust_nbr, cust_sales.region_id region_id,
  cust_sales.tot_sales cust_sales, region_sales.tot_sales region_sales
FROM
 (SELECT o.region_id region_id, SUM(o.tot_sales) tot_sales
  FROM orders o
  WHERE o.year = 2001
  GROUP BY o.region_id) region_sales,
 (SELECT o.cust_nbr cust_nbr, o.region_id region_id,
    SUM(o.tot_sales) tot_sales
  FROM orders o
  WHERE o.year = 2001
  GROUP BY o.cust_nbr, o.region_id) cust_sales
WHERE cust_sales.region_id = region_sales.region_id
  AND cust_sales.tot_sales > (region_sales.tot_sales * .2);

  CUST_NBR  REGION_ID CUST_SALES REGION_SALES
---------- ---------- ---------- ------------
         4          5    1878275      6585641
         6          6    1788836      6307766
        14          7    1929774      6868495
        17          8    1944281      6853015
        20          8    1412006      6853015
        25          9    2232703      6739374
        26         10    1808949      6238901
        27         10    1322747      6238901

The final step is to join the region and customer dimensions in order to include the customer and region names in the result set:

SELECT c.name cust_name,
  big_custs.cust_sales cust_sales, r.name region_name,
  100 * ROUND(big_custs.cust_sales / 
    big_custs.region_sales, 2)  percent_of_region
FROM region r, customer c,
 (SELECT cust_sales.cust_nbr cust_nbr, cust_sales.region_id region_id,
    cust_sales.tot_sales cust_sales, 
    region_sales.tot_sales region_sales
  FROM
   (SELECT o.region_id region_id, SUM(o.tot_sales) tot_sales
    FROM orders o
    WHERE o.year = 2001
    GROUP BY o.region_id) region_sales,
   (SELECT o.cust_nbr cust_nbr, o.region_id region_id,
      SUM(o.tot_sales) tot_sales
    FROM orders o
    WHERE o.year = 2001
    GROUP BY o.cust_nbr, o.region_id) cust_sales
  WHERE cust_sales.region_id = region_sales.region_id
    AND cust_sales.tot_sales > (region_sales.tot_sales * .2)) big_custs
WHERE big_custs.cust_nbr = c.cust_nbr
  AND big_custs.region_id = r.region_id;

CUST_NAME              CUST_SALES REGION_NAME          PERCENT_OF_REGION
---------------------- ---------- -------------------- -----------------
Flowtech Inc.             1878275 New England                         29
Spartan Industries        1788836 Mid-Atlantic                        28
Madden Industries         1929774 SouthEast US                        28
Evans Supply Corp.        1944281 SouthWest US                        28
Malden Labs               1412006 SouthWest US                        21
Worcester Technologies    2232703 NorthWest US                        33
Alpha Technologies        1808949 Central US                          29
Phillips Labs             1322747 Central US                          21

Using nothing more exotic than inline views, we can construct a single query that generates the desired results. The solution, however, has the following shortcomings:

The query is fairly complex.
Two passes through the same rows of the orders table are required to generate the different aggregation levels needed by the query.

Let's see how we can both simplify the query and perform the same work in a single pass through the orders table using one of the new analytic functions. Rather than issuing two separate queries to aggregate sales per region and per customer, we will create a single query that aggregates sales over both region and customer. We can then call an analytic function that performs a second level of aggregation to generate total sales per region:

SELECT o.region_id region_id, o.cust_nbr cust_nbr,
 SUM(o.tot_sales) tot_sales,
 SUM(SUM(o.tot_sales)) OVER (PARTITION BY o.region_id) region_sales
FROM orders o
WHERE o.year = 2001
GROUP BY o.region_id, o.cust_nbr;

REGION_ID   CUST_NBR  TOT_SALES REGION_SALES
---------- ---------- ---------- ------------
         5          1    1151162      6584167
         5          2    1223518      6584167
         5          3    1161286      6584167
         5          4    1878275      6584167
         5          5    1169926      6584167
         6          6    1788836      6307766
         6          7     971585      6307766
         6          8    1141638      6307766
         6          9    1208959      6307766
         6         10    1196748      6307766
         7         11    1190421      6868495
         7         12    1182275      6868495
         7         13    1310434      6868495
         7         14    1929774      6868495
         7         15    1255591      6868495
         8         16    1068467      6853015
         8         17    1944281      6853015
         8         18    1253840      6853015
         8         19    1174421      6853015
         8         20    1412006      6853015
         9         21    1020541      6726929
         9         22    1036146      6726929
         9         23    1212547      6726929
         9         24    1224992      6726929
         9         25    2232703      6726929
        10         26    1808949      6238901
        10         27    1322747      6238901
        10         28     986964      6238901
        10         29     903383      6238901
        10         30    1216858      6238901

The analytic function can be found in line 3 of the previous query and the result has the alias region_sales. The aggregate function (SUM(o.tot_sales)) in line 2 generates the total sales per customer and region as directed by the GROUP BY clause, and the analytic function in line 3 aggregates these sums for each region, thereby computing the aggregate sales per region. The value for the region_sales column is identical for all customers within the same region and is equal to the sum of all customer sales within that region. We can then wrap the query in an inline view,^[1] filter out those customers with less than 20% of their region's total sales, and join the region and customer tables to generate the desired result set:

^[1] Using an inline view will save us from having to join the region and customer tables to the orders table; otherwise, we would have to include columns from the region and customer tables in the GROUP BY clause.

SELECT c.name cust_name,
  cust_sales.tot_sales cust_sales, r.name region_name,
  100 * ROUND(cust_sales.tot_sales / 
    cust_sales.region_sales, 2)  percent_of_region
FROM region r, customer c,
 (SELECT o.region_id region_id, o.cust_nbr cust_nbr,
    SUM(o.tot_sales) tot_sales,
    SUM(SUM(o.tot_sales)) OVER (PARTITION BY o.region_id) region_sales
  FROM orders o
  WHERE o.year = 2001
  GROUP BY o.region_id, o.cust_nbr) cust_sales
WHERE cust_sales.tot_sales > (cust_sales.region_sales * .2)
  AND cust_sales.region_id = r.region_id
  AND cust_sales.cust_nbr = c.cust_nbr;

CUST_NAME              CUST_SALES REGION_NAME          PERCENT_OF_REGION
---------------------- ---------- -------------------- -----------------
Flowtech Inc.             1878275 New England                         29
Spartan Industries        1788836 Mid-Atlantic                        28
Madden Industries         1929774 SouthEast US                        28
Evans Supply Corp.        1944281 SouthWest US                        28
Malden Labs               1412006 SouthWest US                        21
Worcester Technologies    2232703 NorthWest US                        33
Alpha Technologies        1808949 Central US                          29
Phillips Labs             1322747 Central US                          21

Without getting into the details of how the SUM...OVER function works (we will discuss it later in this chapter under Section 13.4), we can see that Oracle is performing an aggregation of an aggregation rather than revisiting the detail rows twice. Thus, the query runs faster and should also prove easier to understand and maintain once the syntax is familiar.

Unlike built-in functions such as DECODE, GREATEST, and SUBSTR, Oracle's suite of analytic functions can only be used in the SELECT clause of a query. This is because analytic functions are only executed after the FROM, WHERE, GROUP BY, and HAVING clauses have been evaluated. After the analytic functions have executed, the query's ORDER BY clause is evaluated in order to sort the final result set, and the ORDER BY clause is allowed to reference columns in the SELECT clause generated via analytic functions.

The remainder of the chapter introduces the Oracle8i and Oracle9i analytic functions, grouped by functionality.