(We have already made changes in response to feedback - 18 March 2008, and again on 21 March.) Here is the proposed restatement of the principle in full:
13.29 Do not use measures of statistical significance to assess a forecasting method or model.
Description: Even when correctly applied, significance tests are dangerous. Statistical significance tests calculate the probability, assuming the analyst’s null hypothesis is true, that relationships apparent in a sample of data are the result of chance variations that arose in selecting the sample. The probability that is calculated is affected by the size of the sample and the choice of null hypothesis. With large samples, even small differences from what would be expected in the data if the null hypothesis were true will be “statistically significant.” Choosing a different null hypothesis can change the conclusion. Statistical significance tests do not provide useful information on material significance or importance. Moreover, the tests are blind to common problems such as non-response error, response error, and misspecification of relationships. The proper approach to analyzing and communicating findings from empirical studies is to (1) calculate and report effect sizes; (2) estimate the range within which the actual effect size is likely to lie by taking account of prior knowledge and all potential sources of error in measuring the effect; and (3) conduct replications, extensions, and meta-analyses.
Purpose: To avoid the selection of invalid models or methods, and the rejection of valid ones.
Conditions: There are no empirically demonstrated conditions on this principle. Statistical significance tests should not be used unless it can be shown that the measures provide a net benefit in the situation under consideration.
Strength of evidence: Strong logical support and non-experimental evidence. There are many examples showing how significance testing has harmed decision-making. Despite repeated appeals for evidence that statistical significance tests can improve decisions, none has been forthcoming. Tests of statistical significance run contrary to the proper purpose of statistics—which is to help users make sense of data. Experimental studies are needed to identify the conditions, if any, under which tests of statistical significance can improve decision-making.
Source of evidence:
Armstrong, J. S. (2007). Significance tests harm progress in forecasting. International Journal of Forecasting, 23, 321-336, with commentary and a reply.
Hauer, E. (2004). The harm done by tests of statistical significance. Accident Analysis and Prevention, 36, 495-500.
Hubbard, R. & Armstrong, J. S. (2006). Why we don't really know what ‘statistical significance’ means: a major educational failure. Journal of Marketing Education, 28, 114-120.
Hunter, J. E. & Schmidt, F. L. (1996). Cumulative research knowledge and social policy formulation: The critical role of meta-analysis. Psychology, Public Policy, and Law, 2, 324-347.
Ziliak, S. T. & McCloskey, D. N. (2008). The cult of statistical significance: How the standard error costs us jobs, justice, and lives. Ann Arbor, MI: University of Michigan Press.
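The Description above notes that, with large samples, even small differences become "statistically significant," and recommends reporting effect sizes and a likely range instead. The following Python sketch illustrates that point under stated assumptions. The function name `compare_methods`, the error figures (mean errors of 10.1 and 10.0 with a common standard deviation of 5.0), and the normal approximation are all hypothetical choices made for illustration; they do not come from any of the studies cited. The interval shown reflects sampling error only, whereas the principle also calls for taking account of prior knowledge and other sources of error.

```python
import math


def compare_methods(mean_a, mean_b, sd, n):
    """Compare two groups of equal size n that share a standard deviation sd.

    Returns the two-sided p-value (normal approximation), the standardized
    effect size, and an approximate 95% interval for the raw difference.
    All inputs are illustrative summary statistics, not real data.
    """
    diff = mean_a - mean_b
    se = sd * math.sqrt(2.0 / n)          # standard error of the difference
    z = diff / se
    # Two-sided p-value under the null hypothesis of zero difference.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    effect_size = diff / sd               # standardized difference (Cohen's d)
    interval = (diff - 1.96 * se, diff + 1.96 * se)
    return p_value, effect_size, interval


# Same small difference in mean error, evaluated at three sample sizes.
for n in (100, 10_000, 1_000_000):
    p, d, (low, high) = compare_methods(10.1, 10.0, 5.0, n)
    print(f"n={n:>9,}  p={p:.4f}  effect size={d:.3f}  "
          f"95% interval=({low:.3f}, {high:.3f})")
```

In this sketch the effect size stays at about 0.02 of a standard deviation at every sample size, while the p-value moves from far above 0.05 to far below it as n grows. The effect size and interval convey how large the difference is; the p-value mostly tracks how much data was collected.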
Have you ever been told you should "stand in the other person's shoes" in order to predict the decisions they will make? This is a common and plausible recommendation in popular business books and everyday life but, to date, there has been no experimental evidence on its usefulness. When Kesten Green and Scott Armstrong formalized this advice as a method they call "role thinking", they found that the resulting forecasts were little more accurate than guessing what the decisions might be. This finding is consistent with earlier findings that situations involving interactions between people with different roles are too complicated for experts to predict usefully when they rely on trying to think through what will happen. The group forecasting method of simulated interaction, on the other hand, allows realistic representations of group interactions and does provide accurate forecasts. Green and Armstrong's paper has been accepted for publication in a special issue of the International Journal of Forecasting on group forecasting. A copy of their working paper is available here.
Andreas Graefe and Scott Armstrong report results from an experiment comparing the accuracy of three structured approaches (Delphi, nominal groups, and prediction markets) with that of traditional face-to-face meetings. The four methods were compared on a quantitative judgment task in which information was not widely dispersed among participants.
Overall, Delphi performed best, followed by nominal groups, prediction markets and unstructured meetings. Of the three structured approaches, only Delphi outperformed a simple average of participants' prior individual estimates.
The authors also report participants' ratings of the group methods. Participants preferred methods involving personal interaction, such as meetings and nominal groups. Prediction markets were rated least favorably.
The pre-print version of the paper, which will be published by the International Journal of Forecasting, is available here.
For the eighth year running, the International Institute of Forecasters is offering two $5,000 grants, funded by the SAS Institute, to support research on how to improve forecasting methods and business forecasting practice. For more information on the SAS Grants, visit the Researchers Page.