The Perils of Published Studies
By Arnold Kling
The therapeutic power of the drugs appeared to be steadily falling. A recent study showed an effect that was less than half of that documented in the first trials, in the early nineties.
The article is interesting, but I object to much of its tone. For most of the article, Lehrer wants you to believe that there is something mysterious going on in the world, that scientists are baffled, and that we no longer know how to arrive at truth.
In my view, what is going on is somewhat more mundane. There are a number of factors that cause unreliable results to sometimes achieve prominence. They all stem from the fact that non-effects get less attention than effects. If you are looking for an effect and you fail to get it, (a) you are more likely to tinker with your experiment*, (b) you are less likely to report your results, and (c) if you report your results you are less likely to get them published.
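The declining effects that Lehrer describes fall out of this selection story directly. If only "significant" results get published, the published effect sizes overstate the truth, and later replications look like a mysterious decline. A minimal simulation sketch (the numbers here — a true effect of 0.2, Gaussian scores with unit noise, 30 patients per study — are all hypothetical, not from the article):

```python
import random
import statistics
import math

random.seed(0)

TRUE_EFFECT = 0.2   # hypothetical small real drug effect
N = 30              # hypothetical patients per study

def run_study():
    """One study: sample N improvement scores, return the estimated
    effect and a two-sided p-value from a z-test against zero."""
    scores = [random.gauss(TRUE_EFFECT, 1.0) for _ in range(N)]
    mean = statistics.mean(scores)
    z = mean / (1.0 / math.sqrt(N))
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return mean, p

results = [run_study() for _ in range(5_000)]
# selection: only studies that found a "significant" effect get published
published = [m for m, p in results if p < 0.05]

print(f"true effect:                     {TRUE_EFFECT}")
print(f"mean effect across all studies:  {statistics.mean(m for m, _ in results):.3f}")
print(f"mean effect in published subset: {statistics.mean(published):.3f}")
```

The published subset systematically overstates the true effect, so an honest replication of a published result will, on average, find something smaller — no mystery required.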
The statistical measure of the significance of a result is known as the p-value. It is supposed to be a measure of the probability that, say, the drug has no effect, given the average effect you found in your sample. The lower the p-value, the more confident you are that the drug has an effect that is not due to chance. The standard rule of thumb is to treat a p-value below .05 as significant. We call .05 the significance level.
[UPDATE: Oh, gosh! Did I write this? I have a bad cold, so forgive me. The p-value measures the probability that you would observe a result at least as extreme as yours, assuming that the drug has no effect. The lower the p-value, the more confident you are that your results were not due to chance.]
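That corrected definition can be made concrete by simulation: take one hypothetical trial's results, and ask how often pure noise would produce a sample mean at least as extreme (the sample size, Gaussian noise, and 0.3 effect used to fake the trial data are illustrative assumptions):

```python
import random
import statistics

random.seed(1)

# Hypothetical trial: 50 patients' improvement scores. We secretly
# generate them with a small real effect (0.3) just to have data.
observed = [random.gauss(0.3, 1.0) for _ in range(50)]
obs_mean = statistics.mean(observed)

# p-value: the probability of a sample mean at least as extreme as
# ours, assuming the drug has no effect (mean 0, same noise).
null_means = []
for _ in range(20_000):
    sample = [random.gauss(0.0, 1.0) for _ in range(50)]
    null_means.append(statistics.mean(sample))

p_value = sum(abs(m) >= abs(obs_mean) for m in null_means) / len(null_means)
print(f"observed mean effect: {obs_mean:.3f}, p-value: {p_value:.4f}")
```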
But if you look at the actual research process, the true p-value is much higher than what is reported. An imperfect but workable fix would be to standardize on a lower significance level. I think that for most ordinary research, the significance level ought to be set at .001. For blind data-mining exercises, such as looking among lots of genes to find correlations with diseases, the level ought to be lower, perhaps .0001 or less.
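The difference between the two thresholds is easy to see in a simulation: run many studies of a drug with no effect at all and count how many clear each bar (the sample size and Gaussian-noise setup are hypothetical):

```python
import random
import statistics
import math

random.seed(2)

def one_null_study(n=30):
    """One study of a drug with zero true effect: return the two-sided
    p-value from a z-test of the sample mean against zero."""
    scores = [random.gauss(0.0, 1.0) for _ in range(n)]
    z = statistics.mean(scores) / (1.0 / math.sqrt(n))
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

p_values = [one_null_study() for _ in range(10_000)]

for alpha in (0.05, 0.001):
    hits = sum(p < alpha for p in p_values)
    print(f"alpha = {alpha}: {hits} of 10,000 null studies look 'significant'")
```

At .05, roughly one null study in twenty looks like a discovery before any tinkering or selective reporting makes things worse; at .001, false alarms become rare enough that the publication filter has much less to feed on.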
The impact of a lower significance level would be to make many small-sample studies unpublishable. (I take it as given that publication will always be biased in favor of studies that find effects; we can’t fix that problem.) My guess is that this simple fix would have prevented a number of the scientific embarrassments mentioned in Lehrer’s article.
So, if there is a petition calling on researchers to set significance levels at .001 instead of .05, sign my name to it. I think that would be a better norm.
[UPDATE: Robin Hanson really jumps on this topic. He correctly points out that my approach would reduce excess gullibility but risks creating excess skepticism.]
(*Over 25 years ago, Ed Leamer blew the whistle on the phenomenon of specification searches in econometrics, which are an instance of this. It took a while, but his writing has had some influence. As recently as this spring, the Journal of Economic Perspectives had a relevant symposium, to which Leamer contributed. Russ Roberts talked to Leamer about his work.)