8.5 Calculating a Median
8.5.1 Problem
You want to calculate the median bulb life for all bulbs that have been
tested. From the discussion earlier in this chapter, you should
recognize that the median bulb life represents the case where the
number of bulbs with shorter lives is equivalent to the number of
bulbs with longer lives.
8.5.2 Solution
To calculate the median of the light-bulb test results, use the
following query:
SELECT x.Hours median
FROM BulbLife x, BulbLife y
GROUP BY x.Hours
HAVING
SUM(CASE WHEN y.Hours <= x.Hours
THEN 1 ELSE 0 END)>=(COUNT(*)+1)/2 AND
SUM(CASE WHEN y.Hours >= x.Hours
THEN 1 ELSE 0 END)>=(COUNT(*)/2)+1
median
-----------
1086
8.5.3 Discussion
This query follows the definition of the median very closely and uses
the solution published several years ago by David Rozenshtein,
Anatoly Abramovich, and Eugene Birger. Their solution is still
regarded as one of the classical solutions to the problem of finding
the median value in a sample. To understand their solution, it helps
to look at the query in two phases. First, you have a GROUP BY query
that returns the number of bulbs for each distinct lifetime. The
following is a modified version of the first part of the solution
query that returns the occurrence count corresponding to each
distinct bulb-life value:
SELECT COUNT(*) occurrences, x.Hours xhours
FROM BulbLife x, BulbLife y
GROUP BY x.Hours
occurrences xhours
----------- -----------
40 1043
20 1057
20 1073
20 1074
20 1077
20 1079
40 1085
20 1086
20 1087
20 1093
20 1096
20 1099
20 1103
20 1109
20 1114
20 1129
20 1131
20 1167
Because these results represent a self-join of the BulbLife table
with itself, each group represents a number of detail rows equivalent
to the number of rows in the sample. The two groups of 40 occurrences
each exist because the data contains 2 cases with values of 1043 and
2 cases with values of 1085. The detail for the 1086 group is as
follows:
xhours yhours
----------- -----------
1086 1043
1086 1043
1086 1057
1086 1073
1086 1074
1086 1077
1086 1079
1086 1085
1086 1085
1086 1086
1086 1087
1086 1093
1086 1096
1086 1099
1086 1103
1086 1109
1086 1114
1086 1129
1086 1131
1086 1167
The question now is whether the value 1086 represents the median. To
determine that, follow these steps:
Count the cases where the y.hours value is less than or equal to the
x.hours value.
Count the cases where the x.hours value is less than or equal to the
y.hours value.
Compare the two results. If they are equal, then 1086 is the median
value.
The HAVING clause in our solution query performs the counts for steps
1 and 2 using the following two invocations of the SUM function
combined with a CASE statement:
SUM(CASE WHEN y.Hours <= x.Hours
THEN 1 ELSE 0 END)
SUM(CASE WHEN y.Hours >= x.Hours
THEN 1 ELSE 0 END)
In our example, the values for these 2 sums work out to 10 and 11,
respectively. Plug these two values in for the two SUM expressions in
the HAVING clause, and you have the following:
10 >= (COUNT(*)+1)/2 AND
11 >= (COUNT(*)/2)+1
At this point, the two COUNT expressions deserve some additional
explanation. They have been carefully crafted to allow us to derive a
median, even in cases where we have an even number of values in the
sample. Let's step back for a moment, and assume
that our sample contained 21 values, instead of the 20 that it does
contain. If that were the case, the two COUNT expressions would
evaluate as follows:
(COUNT(*)+1)/2 (COUNT(*)/2)+1
(21+1)/2 (21/2)+1
22/2 10+1
11 11
|
In SQL Server, 21/2 represents an integer division and, hence, yields
an integer value of 10 as the result.
|
|
Whenever you have an odd number of values in the sample, the two
expressions will yield the same result. Given an even number,
however, the first expression will yield a result that is one less
than the other. Here is how the HAVING expression works out for the
data in our example:
10 >= (20+1)/2 AND
11 >= (20/2)+1
10 >= 10 AND
11 >= 11
For the case where x.Hours = 1086, both expressions are true, so 1086
is returned as the median value. In actual fact, because we have an
even number of values, there are 2 candidates for the median: 1086
and 1087. The value 1086 has 9 values below it and 10 above it. The
value for 1087 has 10 values below it and 9 above it. Due to how
we've written the COUNT expressions, our solution
query arbitrarily returns the lower value as the median.
It's possible to use a slightly modified version of
our solution query to return the financial median. Recall, from
earlier in this chapter, that in the case of an even number of
values, the financial median represents the mean of the two inner
neighbors. With respect to our example, that would be the mean of
1086 and 1087, which works out to 1086.5. Use the following query to
calculate the financial median:
SELECT
CASE WHEN COUNT(*)%2=1
THEN x.Hours
ELSE (x.Hours+MIN(CASE WHEN y.Hours>x.Hours
THEN y.Hours
END))/2.0
END median
FROM BulbLife x, BulbLife y
GROUP BY x.Hours
HAVING
SUM(CASE WHEN y.Hours <= x.Hours
THEN 1 ELSE 0 END)>=(count(*)+1)/2 AND
SUM(CASE WHEN y.Hours >= x.Hours
THEN 1 ELSE 0 END)>=(count(*)/2)+1
The basic query remains the same, the only difference being in the
SELECT statement's column list. If there is an odd
number of cases, the median is reported directly as the x.Hours
value. However, if the number of cases is even, the smallest
y.Hoursvalue that is higher than the chosen
x.Hours value is identified. This value is then added to the x.Hours
value. The result is then divided by 2.0 to return the mean of those
two values as the query's result.
The original query reported only the lesser of the two values that
were in the middle of the sample. The added logic in the SELECT
clause for the financial-median causes the mean of the two values to
be calculated and reported. In our example, the result is the
following:
median
-------------------
1086.5
After you run the query to obtain a financial median,
you'll get a warning like this:
Warning: Null value eliminated from aggregate.
This is an important warning, because it's a sign
that our query is working correctly. We did not want to write an ELSE
clause for the CASE statement inside the MIN function. Because if we
wrote such an ELSE clause and had it return a 0, the result of the
MIN function would always be 0, and, consequently, the
financial-median calculation would always be wrong for any sample
with an even number of values.
|