Linear regression is adequate for short-term forecasts, but dangerous over long time periods. Any given regression provides a snapshot of current conditions; as those conditions change over time, the predictive capacity of the regression declines. Maintaining that predictive capacity requires repeating the entire process of data collection, cleaning, and regression. This runs up against the limitations of data: statistical methods require statistically significant samples to function, and long-term data series require that collection methods, geography, and metrics remain constant over time, while 'ideal' data sets are collected at a single point in time. These limitations on the availability of data limit what can be regressed.
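A minimal sketch of that decay, using purely synthetic data and an assumed shift in conditions partway through the series:

```python
# Sketch of "snapshot" decay: a line fitted to one period's data loses
# predictive capacity once the underlying relationship shifts.  The data
# and the regime change at t=100 are invented purely for illustration.
import numpy as np

rng = np.random.default_rng(0)

t = np.arange(200)
x = rng.normal(size=200)
# Regime 1 (t < 100): y = 2x + noise.  Regime 2 (t >= 100): y = 0.5x + noise.
slope = np.where(t < 100, 2.0, 0.5)
y = slope * x + rng.normal(scale=0.3, size=200)

# Fit on the first regime only -- a snapshot of "current conditions".
coef = np.polyfit(x[:100], y[:100], deg=1)

def mse(xs, ys):
    return np.mean((np.polyval(coef, xs) - ys) ** 2)

print(f"in-sample MSE (regime 1):     {mse(x[:100], y[:100]):.3f}")
print(f"out-of-sample MSE (regime 2): {mse(x[100:], y[100:]):.3f}")  # much worse
```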
More dangerously, there is an increasing reliance on automatically calculated statistical analytics (R², p-values, and the like) as measures of formal statistical validity, without the recognition that these measures are not 'absolute'. They are themselves methods, developed to detect known errors in the application of other statistical methods. Successfully applying and interpreting them requires a separate body of knowledge to identify and explain anomalies.
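The classic illustration is Anscombe's quartet: four small data sets constructed so that the automatically computed summaries agree almost exactly, while the underlying structures, visible only on inspection, are completely different. A short sketch reproducing those summaries:

```python
# Anscombe's quartet (Anscombe, 1973): four data sets whose automatically
# computed regression summaries nearly coincide, despite wildly different
# shapes.  The values below are the classic published numbers.
import numpy as np

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

for name, (x, y) in quartet.items():
    x, y = np.asarray(x, float), np.asarray(y, float)
    slope, intercept = np.polyfit(x, y, deg=1)
    r2 = np.corrcoef(x, y)[0, 1] ** 2
    # All four print roughly slope=0.50, intercept=3.00, R^2=0.67 -- yet
    # only dataset I is the well-behaved linear cloud those numbers imply.
    print(f"{name}: slope={slope:.2f} intercept={intercept:.2f} R^2={r2:.2f}")
```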
To improve model quality, there is a strong desire to reduce the number of variables present in a regression analysis. When faced with two highly correlated variables, only one may be included. This becomes extremely problematic if the two variables later diverge: it is then an open question which variable actually possessed the predictive capacity, whether either did, or whether the model's validity actually resulted from the two variables' linkage to a third, unregressed variable. Regression models are only capable of showing correlation between variables, not causal relationships. Without that explicit causal linkage, it becomes possible to draw conclusions that are statistically valid but of limited utility.
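A minimal sketch of that failure mode, assuming synthetic data and an arbitrary divergence event: two proxies x1 and x2 both track an unregressed driver z, and the model fitted on x1 alone holds up only while that linkage holds.

```python
# Sketch of the confounder problem, with invented data: x1 and x2 are both
# driven by an unregressed third variable z, and y depends only on z.
# Dropping x2 for collinearity and regressing y on x1 "works" exactly as
# long as x1 stays linked to z.
import numpy as np

rng = np.random.default_rng(1)
n = 500

z = rng.normal(size=n)                  # the true driver, never regressed
x1 = z + rng.normal(scale=0.1, size=n)  # proxy 1
x2 = z + rng.normal(scale=0.1, size=n)  # proxy 2, dropped for collinearity
y = 3 * z + rng.normal(scale=0.5, size=n)

print("corr(x1, x2):", round(np.corrcoef(x1, x2)[0, 1], 3))  # ~0.99

# Regress y on x1 alone; the fit looks excellent...
slope, intercept = np.polyfit(x1, y, deg=1)
r2 = np.corrcoef(x1, y)[0, 1] ** 2
print(f"fit on x1: slope={slope:.2f}, R^2={r2:.2f}")

# ...until x1 diverges from z (say, a change in how x1 is measured).
x1_new = rng.normal(size=n)             # x1 no longer tracks z
y_new = 3 * z + rng.normal(scale=0.5, size=n)
pred_mse = np.mean((y_new - (slope * x1_new + intercept)) ** 2)
print(f"post-divergence prediction MSE: {pred_mse:.2f}")     # large
```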