8.7 Building Confidence Intervals

8.7.1 Problem

You want to check whether the statistics calculated from a sample are reasonably representative of the population's statistics. With respect to our example, assume that a light bulb's declared lifetime is 1100 hours. Based on a sample of lifetime tests, can you say with 95% confidence that the quality of production differs significantly from the declared measurement? To answer this question, you need to determine whether the confidence interval around the sample mean includes the declared lifetime. If the declared lifetime falls outside the confidence interval, then the sample is not consistent with the declared value, and we can assume that our declared lifetime for the light bulbs is probably wrong. Either quality has dropped and the bulbs are burning out more quickly, or quality has risen, causing the bulbs to last longer than we claim.

8.7.2 Solution

The solution is to execute a query that implements the calculations described earlier for computing a confidence interval. Recall that a confidence interval is the sample mean plus or minus a certain amount. Thus, the following solution query computes two values, the lower and upper bounds of the interval:

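SELECT 
   -- lower and upper bounds: mean -/+ standard error * t coefficient
   AVG(Hours)-STDEV(Hours)/SQRT(COUNT(*))*MAX(p) AS in1,
   AVG(Hours)+STDEV(Hours)/SQRT(COUNT(*))*MAX(p) AS in2
FROM BulbLife, T_distribution
WHERE df=(
   -- degrees of freedom; -1 marks the "infinite" row in T_distribution
   SELECT 
      CASE WHEN COUNT(*)<=29 
      THEN COUNT(*)-1 
      ELSE -1 END 
   FROM BulbLife)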

in1      in2                                                   
-------- -------- 
1077.11  1104.89

Based on the given sample, we cannot say that the quality of production has significantly changed, because the declared value of 1100 hours is within the computed confidence interval for the sample.
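As a quick check on that conclusion, the printed bounds can be unpacked into a midpoint and a margin. The arithmetic below uses only the two rounded values from the output, so the recovered mean and margin are themselves rounded figures rather than values read from the BulbLife table:

```latex
\begin{align*}
\text{midpoint} &= \frac{1077.11 + 1104.89}{2} = 1091.00 \\
\text{margin}   &= \frac{1104.89 - 1077.11}{2} = 13.89 \\
1077.11 &\le 1100 \le 1104.89
\end{align*}
```

The declared 1100 hours lies 9 hours above the midpoint, which is less than the margin of 13.89 hours, so it falls inside the interval.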

8.7.3 Discussion

The solution query calculates the mean of the sample, then subtracts from it and adds to it the standard error multiplied by the t-distribution coefficient from the T_distribution table. In our sample, the number of degrees of freedom is the number of cases in the sample less 1. The CASE expression ensures that the appropriate row is selected from the T_distribution table. If the sample contains 30 or more values, the CASE expression returns -1, because in the T_distribution table the coefficient for an infinite number of degrees of freedom is identified by a df value of -1. Expressions in the SELECT clause of the solution query calculate the standard error (the standard deviation divided by the square root of the sample size), multiply it by the coefficient from the T_distribution table, and then calculate the interval around the mean.
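It may help to see the calculation written as a formula. The symbols here are notation introduced for illustration and do not appear in the query: the sample mean corresponds to AVG(Hours), s to STDEV(Hours), n to COUNT(*), and t to the coefficient p read from the T_distribution table.

```latex
\[
\text{in1} = \bar{x} - t \cdot \frac{s}{\sqrt{n}},
\qquad
\text{in2} = \bar{x} + t \cdot \frac{s}{\sqrt{n}}
\]
```

The fraction is the standard error of the mean, and its product with t is the margin on either side of the mean.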

This example is interesting because it shows how to refer to a table of coefficients from within a query. You could instead retrieve the coefficient with a separate query, store it in a local variable, and then use it in a second query to compute the confidence interval, but that's less efficient than the technique shown here, where all the work is done in just one query.
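For comparison, the two-step alternative might look something like the following sketch. The variable names @n and @t are hypothetical, and the FLOAT type for the coefficient is an assumption, since the definition of the p column is not shown here. The sketch also assumes that T_distribution holds one row per df value.

```sql
-- Sketch only: @n and @t are illustrative names, and FLOAT is an
-- assumed type for the p column of T_distribution.
DECLARE @n INT, @t FLOAT

-- Step 1: count the sample, then look up the matching coefficient.
SELECT @n = COUNT(*) FROM BulbLife

SELECT @t = MAX(p)
FROM T_distribution
WHERE df = CASE WHEN @n <= 29 THEN @n - 1 ELSE -1 END

-- Step 2: use the stored coefficient to compute the interval.
SELECT
   AVG(Hours)-STDEV(Hours)/SQRT(COUNT(*))*@t AS in1,
   AVG(Hours)+STDEV(Hours)/SQRT(COUNT(*))*@t AS in2
FROM BulbLife
```

Under those assumptions this should return the same two bounds as the solution query, at the cost of three statements instead of one.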