Let me follow up on this post.

I try to drill into my high school students that in classical statistics we make probability statements about our methods, not about the unknown real-world parameters. (I can’t believe what I first wrote in the post. I would mark it wrong if a student wrote it.)

When we use a significance level, or alpha, of .05, we are saying that there is a chance of 1 out of 20 that our study will not replicate. So, in the universe of studies published using alpha of .05 as the criterion, we would expect that a number would fail to replicate.

And what does fail to replicate mean? It does not mean that next time you fail to find a result that is significant at the .05 level. It does not mean that next time you fail to find an effect of X and instead only find an effect of half of X. It means that you fail to find any effect at all! If you find any effect at all, then in the ordinary way these methods are used, you have replicated your original study.

Another point that I drill in is that there is a difference between quantitative significance and statistical significance. If an effect of X is significant quantitatively but an effect of 0.5X is not significant quantitatively, then it is not really useful to report that the effect of X is significantly different from zero.

Some of these problems could be alleviated if it were standard to report confidence intervals rather than p-values. If you saw that a 90 percent confidence interval did not include zero but included very low values for the effect, you might be less unnerved by a subsequent study finding a low value for the effect. However, given the various biases I discussed in my previous post, I would want to see a wider confidence interval reported, something like the 99.5 percent interval.
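To make the contrast between a 90 percent and a 99.5 percent interval concrete, here is a minimal sketch. It assumes the estimate has an approximately normal sampling distribution, and the estimate of 2.0 with a standard error of 1.0 is made up purely for illustration; the function name is hypothetical as well.

```python
from statistics import NormalDist


def normal_confidence_interval(estimate, std_error, level):
    """Two-sided interval for an estimate, assuming a normal sampling distribution."""
    # Critical value z such that the central `level` probability lies within +/- z.
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return estimate - z * std_error, estimate + z * std_error


# Hypothetical numbers: an estimated effect of 2.0 with a standard error of 1.0.
estimate, std_error = 2.0, 1.0
ci_90 = normal_confidence_interval(estimate, std_error, 0.90)
ci_995 = normal_confidence_interval(estimate, std_error, 0.995)

print("90 percent interval:   (%.2f, %.2f)" % ci_90)
print("99.5 percent interval: (%.2f, %.2f)" % ci_995)
```

With these made-up numbers the 90 percent interval excludes zero while the 99.5 percent interval includes it, which is the kind of difference the wider interval is meant to expose.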

If we continue to use “statistical significance,” then I recommend against a binary “is or is not” classification. Instead, I would suggest three ranges: A p-value greater than .15 means that you were unable to control for other factors well enough to have confidence in the effect you found. The experimental method is not promising in that sense. A p-value between .001 and .15 means that your results should be treated as intriguing, justifying an attempt to replicate your study in order to verify that other factors did not cause the result. A p-value below .001 suggests that other factors are unlikely to have caused your result, as long as there was no flaw in the design or execution of the study.
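The three ranges can be written down as a small function. This is only a sketch of the rule as stated above: the function name and the labels are hypothetical, and putting the exact boundary values .001 and .15 in the middle range is an assumption, since the text does not say where they belong.

```python
def classify_p_value(p):
    """Map a p-value to one of the three suggested ranges."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("a p-value must lie between 0 and 1")
    if p > 0.15:
        # Noise from other factors was not controlled well enough.
        return "not promising"
    if p >= 0.001:
        # Worth an attempt at replication before drawing conclusions.
        return "intriguing"
    # Other factors are unlikely to explain the result, barring design flaws.
    return "unlikely due to other factors"


for p in (0.30, 0.04, 0.0005):
    print(p, classify_p_value(p))
```

Note that the conventional .05 threshold falls in the middle range here, so a result that would ordinarily be called significant is treated only as a reason to replicate.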

## READER COMMENTS

## jsalvati

## Dec 12 2010 at 4:42am

The Bayesian approach to statistics has the notable advantage of completely avoiding this problem as well as many other problems common in classical statistics. It’s simpler and more intuitive to boot.

## Khoth

## Dec 12 2010 at 7:22am

Isn’t this still wrong? 0.05 is P(study finds result | no real effect), not P(study finds no result | effect actually exists).

I’m not sure what you’re getting at with “other factors” at the end. The p-value can tell you about bad luck, but confounding factors can ruin any study, no matter how small the p-value.

## JPIrving

## Dec 12 2010 at 9:13am

jsalvati makes a good point. Fortunately, in the econometric context, standard OLS confidence intervals are nearly identical to their Bayesian probability interval counterparts. Or so I remember. My gut tells me that they should thus also be similar for IV and panel data results. Is this so?

In general I think that academic economists are forthright when interpreting their results. If only to avoid being savaged by their peers. The policy people…not so much.

## Arnold Kling

## Dec 12 2010 at 9:47am

Khoth, I should have said that an alpha of .05 means that *if* you find a significant effect, there is a 19 out of 20 chance that this finding will replicate (meaning that you will find some effect) in other studies.

By “other factors” I mean factors that are supposed to be random. They are the source of statistical noise in the study. Known confounders that are not random are of course a much worse problem.

## OneEyedMan

## Dec 12 2010 at 11:28am

I still think that Khoth is right. Alpha is just Type I statistical error and has nothing to do with what you describe, which is Type II error.

## OneEyedMan

## Dec 12 2010 at 11:30am

Well, 1-alpha is actually type I error.

## Khoth

## Dec 12 2010 at 11:48am

That still isn’t necessarily true. The chance of a finding replicating given that you found a significant effect the first time depends on the original probability that the result was correct, and on the probability of getting an effect when the result is correct, neither of which is captured by the significance value.
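A rough sketch of this dependence, using Bayes' rule. It assumes two independent studies with the same design, and it takes "replicating" to mean that the second study is also significant at the same alpha, which is narrower than the definition in the post. The function names and the input numbers (a 10 percent prior and 50 percent power) are hypothetical.

```python
def prob_effect_is_real(prior, power, alpha):
    """P(effect is real | first study was significant), by Bayes' rule."""
    true_positive = prior * power            # real effect, and detected
    false_positive = (1.0 - prior) * alpha   # no effect, significant by chance
    return true_positive / (true_positive + false_positive)


def prob_replication_significant(prior, power, alpha):
    """P(second study significant | first study significant)."""
    real = prob_effect_is_real(prior, power, alpha)
    return real * power + (1.0 - real) * alpha


# Hypothetical inputs: 10 percent of tested effects are real, power is 50 percent.
prior, power, alpha = 0.10, 0.50, 0.05
print(round(prob_effect_is_real(prior, power, alpha), 3))
print(round(prob_replication_significant(prior, power, alpha), 3))
```

Changing the prior or the power moves both numbers a great deal while alpha stays fixed at .05, so the significance level by itself does not pin down a replication rate.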

## OneEyedMan

## Dec 12 2010 at 12:39pm

Excellent point. My mistake. I think it works out to the following probability of replication failure for sets of independent data with the same DGP:

2*(1-2*Alpha)*Alpha*P(H0=T) + 2*(1-Beta)*Beta*P(H0=F)
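One way to check this arithmetic is to enumerate the outcomes of two independent studies directly. The sketch below is one reading of the comment and not necessarily what was intended: it takes "replication failure" to mean that the two studies reach different significance verdicts, and Beta to mean the Type II error rate. Under those assumptions the null-hypothesis term comes out as 2*Alpha*(1-Alpha), which differs from the (1-2*Alpha) factor in the formula as written. The function name and inputs are hypothetical.

```python
def prob_verdicts_disagree(alpha, beta, prob_null_true):
    """P(exactly one of two independent studies is significant)."""
    # If H0 is true, each study is significant with probability alpha.
    disagree_if_null = 2.0 * alpha * (1.0 - alpha)
    # If H0 is false, each study is significant with probability 1 - beta.
    disagree_if_effect = 2.0 * (1.0 - beta) * beta
    return (prob_null_true * disagree_if_null
            + (1.0 - prob_null_true) * disagree_if_effect)


# Hypothetical inputs: alpha of .05, 80 percent power, H0 true half the time.
print(round(prob_verdicts_disagree(0.05, 0.20, 0.50), 4))
```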

## Vacslav

## Dec 12 2010 at 2:18pm

I am shocked to hear “p-value” and “statistical significance” from Arnold who advocates personal choice and innovation via risk-taking.

Whether we accept the results as significant or not must ultimately be the decision based on personal preferences, including risk tolerances.

The alternative, of course, is government- or philosopher-king-imposed p-values everywhere.

## Vacslav

## Dec 16 2010 at 4:02am

From the abstract to McCloskey and Ziliak’s “Signifying Nothing: Reply to Hoover and Siegler” http://www.deirdremccloskey.org/articles/stats/sig.php:

[Comment edited for clarity of source–Econlib Ed.]
