Team LiB   Previous Section   Next Section

8.5 Calculating a Median

8.5.1 Problem

You want to calculate the median bulb life for all bulbs that have been tested. From the discussion earlier in this chapter, you should recognize that the median bulb life represents the case where the number of bulbs with shorter lives is equivalent to the number of bulbs with longer lives.

8.5.2 Solution

To calculate the median of the light-bulb test results, use the following query:

SELECT x.Hours median
FROM BulbLife x, BulbLife y
GROUP BY x.Hours
HAVING 
   SUM(CASE WHEN y.Hours <= x.Hours 
      THEN 1 ELSE 0 END)>=(COUNT(*)+1)/2 AND
   SUM(CASE WHEN y.Hours >= x.Hours 
      THEN 1 ELSE 0 END)>=(COUNT(*)/2)+1

median      
----------- 
1086

8.5.3 Discussion

This query follows the definition of the median very closely and uses the solution published several years ago by David Rozenshtein, Anatoly Abramovich, and Eugene Birger. Their solution is still regarded as one of the classical solutions to the problem of finding the median value in a sample. To understand their solution, it helps to look at the query in two phases. First, you have a GROUP BY query that returns the number of bulbs for each distinct lifetime. The following is a modified version of the first part of the solution query that returns the occurrence count corresponding to each distinct bulb-life value:

SELECT COUNT(*) occurrences, x.Hours xhours
FROM BulbLife x, BulbLife y
GROUP BY x.Hours

occurrences xhours      
----------- ----------- 
40          1043
20          1057
20          1073
20          1074
20          1077
20          1079
40          1085
20          1086
20          1087
20          1093
20          1096
20          1099
20          1103
20          1109
20          1114
20          1129
20          1131
20          1167

Because these results represent a self-join of the BulbLife table with itself, each group represents a number of detail rows equivalent to the number of rows in the sample. The two groups of 40 occurrences each exist because the data contains 2 cases with values of 1043 and 2 cases with values of 1085. The detail for the 1086 group is as follows:

xhours      yhours      
----------- ----------- 
1086        1043
1086        1043
1086        1057
1086        1073
1086        1074
1086        1077
1086        1079
1086        1085
1086        1085
1086        1086
1086        1087
1086        1093
1086        1096
1086        1099
1086        1103
1086        1109
1086        1114
1086        1129
1086        1131
1086        1167

The question now is whether the value 1086 represents the median. To determine that, follow these steps:

  1. Count the cases where the y.hours value is less than or equal to the x.hours value.

  2. Count the cases where the x.hours value is less than or equal to the y.hours value.

  3. Compare the two results. If they are equal, then 1086 is the median value.

The HAVING clause in our solution query performs the counts for steps 1 and 2 using the following two invocations of the SUM function combined with a CASE statement:

SUM(CASE WHEN y.Hours <= x.Hours 
   THEN 1 ELSE 0 END)
SUM(CASE WHEN y.Hours >= x.Hours 
   THEN 1 ELSE 0 END)

In our example, the values for these 2 sums work out to 10 and 11, respectively. Plug these two values in for the two SUM expressions in the HAVING clause, and you have the following:

10 >= (COUNT(*)+1)/2 AND
11 >= (COUNT(*)/2)+1

At this point, the two COUNT expressions deserve some additional explanation. They have been carefully crafted to allow us to derive a median, even in cases where we have an even number of values in the sample. Let's step back for a moment, and assume that our sample contained 21 values, instead of the 20 that it does contain. If that were the case, the two COUNT expressions would evaluate as follows:

(COUNT(*)+1)/2         (COUNT(*)/2)+1
(21+1)/2               (21/2)+1
22/2                   10+1
11                     11

In SQL Server, 21/2 represents an integer division and, hence, yields an integer value of 10 as the result.

Whenever you have an odd number of values in the sample, the two expressions will yield the same result. Given an even number, however, the first expression will yield a result that is one less than the other. Here is how the HAVING expression works out for the data in our example:

10 >= (20+1)/2 AND
11 >= (20/2)+1

10 >= 10 AND
11 >= 11

For the case where x.Hours = 1086, both expressions are true, so 1086 is returned as the median value. In actual fact, because we have an even number of values, there are 2 candidates for the median: 1086 and 1087. The value 1086 has 9 values below it and 10 above it. The value for 1087 has 10 values below it and 9 above it. Due to how we've written the COUNT expressions, our solution query arbitrarily returns the lower value as the median.

It's possible to use a slightly modified version of our solution query to return the financial median. Recall, from earlier in this chapter, that in the case of an even number of values, the financial median represents the mean of the two inner neighbors. With respect to our example, that would be the mean of 1086 and 1087, which works out to 1086.5. Use the following query to calculate the financial median:

SELECT 
   CASE WHEN COUNT(*)%2=1 
        THEN x.Hours 
        ELSE (x.Hours+MIN(CASE WHEN y.Hours>x.Hours 
                               THEN y.Hours 
                          END))/2.0 
   END median
FROM BulbLife x, BulbLife y
GROUP BY x.Hours
HAVING 
   SUM(CASE WHEN y.Hours <= x.Hours 
      THEN 1 ELSE 0 END)>=(count(*)+1)/2 AND
   SUM(CASE WHEN y.Hours >= x.Hours 
      THEN 1 ELSE 0 END)>=(count(*)/2)+1

The basic query remains the same, the only difference being in the SELECT statement's column list. If there is an odd number of cases, the median is reported directly as the x.Hours value. However, if the number of cases is even, the smallest y.Hoursvalue that is higher than the chosen x.Hours value is identified. This value is then added to the x.Hours value. The result is then divided by 2.0 to return the mean of those two values as the query's result.

The original query reported only the lesser of the two values that were in the middle of the sample. The added logic in the SELECT clause for the financial-median causes the mean of the two values to be calculated and reported. In our example, the result is the following:

median              
------------------- 
1086.5

After you run the query to obtain a financial median, you'll get a warning like this:

Warning: Null value eliminated from aggregate.

This is an important warning, because it's a sign that our query is working correctly. We did not want to write an ELSE clause for the CASE statement inside the MIN function. Because if we wrote such an ELSE clause and had it return a 0, the result of the MIN function would always be 0, and, consequently, the financial-median calculation would always be wrong for any sample with an even number of values.

    Team LiB   Previous Section   Next Section