Wednesday, December 21, 2011

Dangers of Linear Regression


Linear regression is adequate for short-term forecasts but dangerous over long time periods. Any given regression provides a snapshot of the conditions under which its data were collected. As those conditions change over time, the predictive capacity of the regression declines.
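A minimal sketch of this decay, using an invented series rather than real data: the hypothetical `true_value` function grows by 5 per period for ten periods, then flattens to 1 per period. A line fitted to the first ten periods forecasts well just past the change and badly far beyond it. The helper `fit_line` is plain ordinary least squares written out by hand so the example needs no libraries.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def true_value(t):
    """Hypothetical process: conditions change at t = 10 (assumed for illustration)."""
    return 100 + 5 * t if t < 10 else 150 + 1 * (t - 10)


# Fit only on the periods observed before conditions changed.
train_t = list(range(10))
train_y = [true_value(t) for t in train_t]
a, b = fit_line(train_t, train_y)


def forecast_error(t):
    """Absolute gap between the fitted line and the hypothetical true value."""
    return abs((a + b * t) - true_value(t))


short_term_error = forecast_error(11)  # just after the change
long_term_error = forecast_error(30)   # long after the change

print(f"error at t=11: {short_term_error:.1f}")
print(f"error at t=30: {long_term_error:.1f}")
```

The fitted line is not wrong about the period it was built on; it simply keeps describing conditions that no longer hold.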

Maintaining the predictive capacity requires repeating the process of data collection, cleaning, and regression. This runs up against the limitations of data: statistical methods require samples large enough to produce statistically significant results. Long-term data series require that data collection methods, geography, and metrics remain constant over time, whereas 'ideal' data sets are collected at a single point in time. These limitations on the availability of data limit what can be regressed.

More dangerously, there is an increasing reliance on automatically calculated statistical analytics as measures of formal statistical validity, without the recognition that these measures are not 'absolute'. They are innovative methods developed to detect known errors in the application of other statistical methods. Successfully applying and interpreting their results requires a separate body of knowledge to identify and explain anomalies.
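One way to see why such measures are not 'absolute' is a sketch with invented data: a straight line is fitted to values that actually follow a curve (here, an assumed `y = x**2`). The automatically computed R-squared looks reassuring, but the residuals show a systematic pattern that the single number hides. The data and the choice of R-squared as the example statistic are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Hypothetical data that is curved, not linear.
xs = list(range(11))
ys = [x ** 2 for x in xs]

a, b = fit_line(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

# R-squared = 1 - (residual sum of squares / total sum of squares).
mean_y = sum(ys) / len(ys)
ss_res = sum(r ** 2 for r in residuals)
ss_tot = sum((y - mean_y) ** 2 for y in ys)
r_squared = 1 - ss_res / ss_tot

print(f"R-squared: {r_squared:.3f}")
# Residuals are positive at both ends and negative in the middle:
# a sign that the linear form itself is wrong, despite the high R-squared.
print([round(r, 1) for r in residuals])
```

Reading the residual pattern is the 'separate body of knowledge' at work; the summary statistic alone would not flag the problem.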

To improve model quality, there is a strong desire to reduce the number of variables present in a regression analysis. When faced with two highly correlated variables, only one may be included. This becomes extremely problematic if the two variables diverge over time: it becomes an open question which variable actually possessed predictive capacity, or whether either did. The validity of the model may instead have resulted from both variables' linkage to a third, unregressed variable. Regression models are only capable of showing correlation between variables, not causal relationships. Without an explicit causal linkage, it becomes possible to draw conclusions that are statistically valid but that have limited utility.
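The divergence problem can be sketched with invented numbers. Below, two hypothetical variables `x1` and `x2` move almost in lockstep for twenty periods, so only `x1` is kept in the model. The outcome is assumed, purely for illustration, to be driven by `x2`. Once `x2` stops tracking `x1`, the model built on `x1` fails, even though nothing in the original fit would have flagged the choice.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / (sxx * syy) ** 0.5


def x1_at(t):
    """Hypothetical variable kept in the model: keeps rising."""
    return float(t)


def x2_at(t):
    """Hypothetical variable dropped from the model: tracks x1 until t = 20, then stalls."""
    if t < 20:
        return t + (0.5 if t % 2 == 0 else -0.5)
    return 20.0


def outcome_at(t):
    """Assumed for illustration: the outcome is actually driven by x2."""
    return 2 * x2_at(t)


train_t = list(range(20))
train_x1 = [x1_at(t) for t in train_t]
train_x2 = [x2_at(t) for t in train_t]
train_y = [outcome_at(t) for t in train_t]

correlation = pearson(train_x1, train_x2)

# Model built on x1 alone, since x2 looked redundant.
a, b = fit_line(train_x1, train_y)
max_train_error = max(abs(y - (a + b * x)) for x, y in zip(train_x1, train_y))
late_error = abs(outcome_at(40) - (a + b * x1_at(40)))

print(f"correlation of x1 and x2 in training: {correlation:.3f}")
print(f"largest error in training: {max_train_error:.2f}")
print(f"error at t=40, after divergence: {late_error:.2f}")
```

Swapping the roles of `x1` and `x2`, or driving both from a third hidden variable, yields the same training fit, which is why the fit alone cannot say which variable carried the predictive capacity.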
