4.1 Aggregate Functions

4.1 Aggregate Functions

In essence, an aggregate function summarizes the results of an expression over a number of rows, returning a single value. The general syntax for most of the aggregate functions is as follows:

aggregate_function([DISTINCT | ALL] expression)

The syntax elements are:

aggregate_function: Gives the name of the function, e.g., SUM, COUNT, AVG, MAX, MIN, etc.
DISTINCT: Specifies that the aggregate function should consider only distinct values of the argument expression.
ALL: Specifies that the aggregate function should consider all values, including all duplicate values, of the argument expression. The default is ALL.
expression: Specifies a column, or any other expression, on which we want to perform the aggregation.

Let's look at a simple example. The following SQL uses the MAX function to find the maximum salary of all employees:

SELECT MAX(SALARY) FROM EMPLOYEE;

MAX(SALARY)
-----------
       5000

In subsequent sections, we use a series of slightly more involved examples that illustrate various aspects of aggregate function behavior. For those examples, we use the following CUST_ORDER table:

DESC CUST_ORDER
 Name                             Null?    Type
 -------------------------------- -------- --------------
 ORDER_NBR                        NOT NULL NUMBER(7)
 CUST_NBR                         NOT NULL NUMBER(5)
 SALES_EMP_ID                     NOT NULL NUMBER(5)
 SALE_PRICE                                NUMBER(9,2)
 ORDER_DT                         NOT NULL DATE
 EXPECTED_SHIP_DT                 NOT NULL DATE
 CANCELLED_DT                              DATE
 SHIP_DT                                   DATE
 STATUS                                    VARCHAR2(20)

SELECT ORDER_NBR, CUST_NBR, SALES_EMP_ID, SALE_PRICE,
ORDER_DT, EXPECTED_SHIP_DT
FROM CUST_ORDER;

ORDER_NBR  CUST_NBR SALES_EMP_ID SALE_PRICE ORDER_DT  EXPECTED_ 
---------- -------- ------------ ---------- --------- --------- 
      1001      231         7354         99 22-JUL-01 23-JUL-01
      1000      201         7354            19-JUL-01 24-JUL-01
      1002      255         7368            12-JUL-01 25-JUL-01
      1003      264         7368         56 16-JUL-01 26-JUL-01
      1004      244         7368         34 18-JUL-01 27-JUL-01
      1005      288         7368         99 22-JUL-01 24-JUL-01
      1006      231         7354            22-JUL-01 28-JUL-01
      1007      255         7368         25 20-JUL-01 22-JUL-01
      1008      255         7368         25 21-JUL-01 23-JUL-01
      1009      231         7354         56 18-JUL-01 22-JUL-01
      1012      231         7354         99 22-JUL-01 23-JUL-01
      1011      201         7354            19-JUL-01 24-JUL-01
      1015      255         7368            12-JUL-01 25-JUL-01
      1017      264         7368         56 16-JUL-01 26-JUL-01
      1019      244         7368         34 18-JUL-01 27-JUL-01
      1021      288         7368         99 22-JUL-01 24-JUL-01
      1023      231         7354            22-JUL-01 28-JUL-01
      1025      255         7368         25 20-JUL-01 22-JUL-01
      1027      255         7368         25 21-JUL-01 23-JUL-01
      1029      231         7354         56 18-JUL-01 22-JUL-01

20 rows selected.

4.1.1 NULLs and Aggregate Functions

Notice that the column SALE_PRICE in the CUST_ORDER table is nullable, and that it contains NULL values for some rows. To examine the effect of NULLs in an aggregate function, we execute the following SQL:

SELECT COUNT(*), COUNT(SALE_PRICE) FROM CUST_ORDER;

COUNT(*) COUNT(SALE_PRICE)
-------- -----------------
      20                14

Notice the difference in the output of COUNT(*) and COUNT(SALE_PRICE). This is because COUNT(SALE_PRICE) ignores NULLs, whereas COUNT(*) doesn't. The reason COUNT(*) doesn't ignore NULLs is because it counts rows, not COLUMN values. The concept of NULL doesn't apply to a row as a whole. Other than COUNT(*), there is only one other aggregate function that doesn't ignore NULLs, and that is GROUPING. All other aggregate functions ignore NULLs. We will discuss GROUPING in Chapter 12. For now, let's examine the effect of NULLs when they are ignored.

SUM, MAX, MIN, AVG, etc. all ignore NULLs. Therefore, if we are trying to find a value such as the average sale price in the CUST_ORDER table, the average will be of the 14 rows that have a value for that column. The following example shows the count of all rows, the total of all sale prices, and the average of all sale prices:

SELECT COUNT(*), SUM(SALE_PRICE), AVG(SALE_PRICE)
FROM CUST_ORDER;

       COUNT(*) SUM(SALE_PRICE) AVG(SALE_PRICE)
--------------- --------------- ---------------
             20             788      56.2857143

Note that AVG(SALE_PRICE) is not equal to SUM(SALE_PRICE) / COUNT(*). If it were, the result of AVG(SALE_PRICE) would have been 788 / 20 = 39.4. But, since the AVG function ignores NULLS, it divides the total sale price by 14, and not by 20 (788 / 14 = 56.2857143).

There may be situations where we want an average to be taken over all the rows in a table, not just the rows with non-NULL values for the column in question. In these situations we have to use the NVL function within the AVG function call to assign 0 (or some other useful value) to the column in place of any NULL values. (DECODE or the new COALESCE function can be used in place of NVL. See Chapter 9 for details.) Here's an example:

SELECT AVG(NVL(SALE_PRICE,0)) FROM CUST_ORDER;

AVG(NVL(SALE_PRICE,0))
----------------------
                  39.4

Notice that the use of NVL causes all 20 rows to be considered for average computation, and the rows with NULL values for SALE_PRICE are assumed to have a 0 value for that column.

4.1.2 Use of DISTINCT and ALL

Most aggregate functions allow the use of DISTINCT or ALL along with the expression argument. DISTINCT allows us to disregard duplicate expression values, while ALL causes duplicate expression values to be included in the result. Notice that the column CUST_NBR has duplicate values. Observe the result of the following SQL:

SELECT COUNT(CUST_NBR), COUNT(DISTINCT CUST_NBR), COUNT(ALL CUST_NBR)
FROM CUST_ORDER;

COUNT(CUST_NBR) COUNT(DISTINCTCUST_NBR) COUNT(ALLCUST_NBR)
--------------- ----------------------- ------------------
             20                       6                 20

There are six distinct values in the CUST_NBR column. Therefore, COUNT(DISTINCT CUST_NBR) returns 6, whereas COUNT(CUST_NBR) and COUNT(ALL CUST_NBR) both return 20. ALL is the default, which means that if we don't specify either DISTINCT or ALL before the expression argument in an aggregate function, the function will consider all the rows that have a non-NULL value for the expression.

An important thing to note here is that ALL doesn't cause an aggregate function to consider NULL values. For example, COUNT(ALL SALE_PRICE) in the following example still returns 14, and not 20.

SELECT COUNT(ALL SALE_PRICE) FROM CUST_ORDER;

COUNT(ALLSALE_PRICE)
--------------------
                  14

Since ALL is the default, we can explicitly use ALL with every aggregate function. However, the aggregate functions that take more than one argument as input don't allow the use of DISTINCT. These include CORR, COVAR_POP, COVAR_SAMP, and all the linear regression functions.

In addition, some functions that take only one argument as input don't allow the use of DISTINCT. This category includes STTDEV_POP, STDDEV_SAMP, VAR_POP, VAR_SAMP, and GROUPING.

If we try to use DISTINCT with an aggregate function that doesn't allow it, we will get an error. For example:

SELECT STDDEV_POP(DISTINCT SALE_PRICE)
FROM CUST_ORDER;
SELECT STDDEV_POP(DISTINCT SALE_PRICE)
       *
ERROR at line 1:
ORA-30482: DISTINCT option not allowed for this function

However, using ALL with such a function doesn't cause any error. For example:

SELECT STDDEV_POP(ALL SALE_PRICE)
FROM CUST_ORDER;

STDDEV_POP(ALLSALE_PRICE)
-------------------------
               29.5282639