8.9 Exploring Patterns with Autocorrelation8.9.1 ProblemYou want to find trend or seasonal patterns in your data. For example, you wish to correlate light-bulb sales data over a number of months. To save yourself some work, you'd like to automatically correlate a number of samples. 8.9.2 SolutionUse the autocorrelation technique described earlier in this chapter, and print a graphical representation of the results. You need to calculate up to 15 correlations of the light-bulb sales data. 15 is a somewhat arbitrary number. In our experience, we rarely find a gain by going beyond that number of correlations. For the first correlation, you want to compare each month's sales data with that from the immediately preceding month. For the second correlation, you want to compare each month's sales data with that from two months prior. You want this pattern to repeat 15 times, with the lag increasing each time, so that the final correlation compares each month's sales data with that from 15 months prior. You then wish to plot the results in the form of a graph. Because you want to compute 15 separate correlations, you need to do more than just join the BulbSales table with itself. You actually want 15 such joins, with the lag between months increasing each time. You can accomplish this by joining the BulbSales table with itself, and then joining those results with the Pivot table. See the Pivot table recipe in Chapter 1 for an explanation of the Pivot table. After you create the necessary Pivot table, you can use the query shown in the following example to generate and graph the 15 correlations that you wish to see: SELECT p.i lag, STUFF( STUFF(SPACE(40), CAST(ROUND((COUNT(*)*SUM(x.Sales*y.Sales)- SUM(x.Sales)*SUM(y.Sales))/ (SQRT(COUNT(*)*SUM(SQUARE(x.Sales))-SQUARE(SUM(x.Sales)))* SQRT(COUNT(*)*SUM(SQUARE(y.Sales))-SQUARE(SUM(y.Sales))))* 20,0)+20 AS INT),1,'*'),20,1,'|') autocorrelation FROM BulbSales x, BulbSales y, Pivot p WHERE x.Id=y.Id-p.i AND p.i BETWEEN 1 AND 15 GROUP BY p.i ORDER BY p.i lag autocorrelation ----------- ---------------------------------------- 1 | * 2 | * 3 | * 4 | * 5 | * 6 | * 7 | * 8 | * 9 | * 10 | * 11 | * 12 | * 13 | * 14 | * 15 | * You can see from this graph that, as the lag increases, the correlation drops towards 0. This clearly indicates a trend. The more distant any two cases are, the less correlated they are. Conversely, the closer any two cases are, the greater their correlation. You can make another observation regarding the 3rd, 6th, 9th, 12th, and 15th lags. Each of those lags shows an increased correlation, which indicates that you have some sort of seasonal pattern that is repeated every three periods. This is true. If you look at the sample data, you will see that sales results increase significantly every three months. 8.9.3 DiscussionThis code demonstrates how you can extend a correlation, or any other kind of formula, so that it is calculated several times on the same data. In our solution, the first 15 correlation coefficients are calculated from the sample. Each coefficient represents the correlation between the sample and the sample lagged by one or more steps. We use the Pivot table to generate all the lagged data sets. Let's look at a simplified version of the query that doesn't contain the plotting code: SELECT p.i lag, (COUNT(*)*SUM(x.Sales*y.Sales)-SUM(x.Sales)*SUM(y.Sales))/ (SQRT(COUNT(*)*SUM(SQUARE(x.Sales))-SQUARE(SUM(x.Sales)))* SQRT(COUNT(*)*SUM(SQUARE(y.Sales))-SQUARE(SUM(y.Sales)))) autocorrelation FROM BulbSales x, BulbSales y, Pivot p WHERE x.Id=y.Id-p.i AND p.i BETWEEN 1 AND 15 GROUP BY p.i ORDER BY p.i When you execute this query, the first few lines of the result should look like this: lag autocorrelation ----------- ------------------- 1 0.727 2 0.703 3 0.936 4 0.650 5 ... The code in the SELECT clause that calculates the correlations is the same as that used in the earlier recipe for calculating a single correlation. The difference here is that we define the matching months differently. Since our research spans a period greater than one year, we cannot match months anymore. Instead, we have to use the period index. Hence, the WHERE clause specifies the condition, x.id=y.id-p.i. For pivot value 1, each x value will be matched with the preceding y value. For pivot value 2, each x value will be matched by the y value from two periods in the past. This pattern continues for all 15 lags. For every calculation, the data must be lagged. We use the Pivot table to generate 15 groups, where the group index is also used as a lag coefficient in the WHERE clause. The result is a list of correlation coefficients, each calculated for a combination of the sample data and correspondingly lagged sampled data. Now that you understand the use of the Pivot table to generate the 15 lags that we desire, you can return your attention to our original query. To make use of autocorrelation, we need to print the data in a graphical format. We do that by printing an asterisk (*) in the autocorrelation column of the output. The greater the correlation value, the further to the right the asterisk will appear. The following query highlights the Transact-SQL expression that we use to plot the results. To simplify things, we use the pseudovariable correlation coefficient (CORR) to represent that part of the code that calculates the correlation values. SELECT p.i lag, STUFF( STUFF(SPACE(40), CAST(ROUND(CORR*20,0)+20 AS INT),1,'*'),20,1,'|') autocorrelation FROM BulbSales x, BulbSales y, Pivot p WHERE x.Id=y.Id-p.i AND p.i BETWEEN 1 AND 15 GROUP BY p.i ORDER BY p.i You have to look at this expression from the inside out. The CORR is multiplied by 20, and any decimal digits are rounded off. This translates our correlation coefficient values from the range -1 to 1 into the range -20 to 20. We add 20 to this result to shift those values into the range 0 to 40. For example, -0.8 is translated into 4, and 0.7 is translated into 34. Since all the values are positive, we can place each asterisk into its correct relative position simply by preceding it with the number of spaces indicated by our result. The SPACE function in our expression generates a string containing 40 spaces. The STUFF function is then used to insert an asterisk into this string of blanks. After that, the outermost STUFF function inserts a vertical bar (|) character to indicate the middle of the range. This allows us to see easily when a correlation coefficient is positive or negative. Any asterisks to the left of the vertical bar represent negative coefficients. If the coefficient came out as zero, it would be overwritten by the vertical bar. |