Economists Should Be More Careful With Their Statistics
By Robert P. Murphy
In recent years, researchers in the social sciences have realized that they face a “replication crisis.”1 As major results in experimental psychology fall apart under further scrutiny,2 economists might have taken comfort in the relative rigor of their own field. However, economists, too, have often been careless in their research design and have been overconfident in the strength of their published results.
In a recent episode of EconTalk, Russ Roberts interviewed Stanford University’s John Ioannidis3 to discuss his 2017 publication “The Power of Bias in Economics Research.”4 Ioannidis and his co-authors found that the vast majority of estimates of economic parameters were from studies that were “underpowered,” and this, in turn, meant that the published estimates of the magnitude of the effects were often biased upward.
Unfortunately, many economists (including me) have little training in the concept of “statistical power” and might be unable to grasp the significance of Ioannidis’ discussion. In this article, I give a primer on statistical power and bias that will help the reader appreciate Ioannidis et al.’s shocking results: After reviewing meta-analyses of more than 6,700 empirical studies, they concluded that most studies, by their very design, would often fail to detect the economic relationship under study. Perhaps worse, these “underpowered” studies also provided estimates of the economic parameters that were highly inflated: typically by 100%, and in one-third of cases by 300% or more.
Economists should familiarize themselves with the concept of statistical power to better appreciate the possible pitfalls of existing empirical work and to produce more-accurate research in the future.
A Primer on Power and Bias: Researchers Flipping Coins
Suppose that researchers are trying to determine whether the coins produced by a particular factory are “fair,” in the sense that they turn up Heads or Tails with a 50/50 probability. To that end, a researcher performs an experiment by flipping a coin a certain number of times and recording the sequence of Heads and Tails.
In this experimental setup, the “null hypothesis” is that the coin is fair. Thus, in order to reject the null and conclude that the coin is not fair, our researcher will need to see an outcome that is sufficiently lopsided towards either Heads or Tails.
Our researcher wants to protect himself from committing a “Type I error,” in which he would erroneously reject the null hypothesis. If the researcher committed a Type I error, his experiment would be giving him a “false positive.” That is, the researcher would announce to the world that the coin is not fair, even though it actually is fair.
For example, suppose that the researcher flips the coin only twice, and it comes up “Tails, Tails.” Prima facie, this sequence suggests that the coin is unfair—that it is biased towards Tails. However, even if the coin were perfectly fair, there is a 25% chance that it would generate 2 Tails (or 2 Heads) in a row. Since, after 2 flips, there is a combined 50% chance of seeing 2 of the same outcome (either “Heads, Heads” or “Tails, Tails”), it would be reckless for the researcher to announce, “The coin is unfair!” after only 2 observations.
To safeguard against a Type I error, the researcher adopts the standard convention of insisting on a “5% significance level.” This means that the researcher will announce that the coin is unfair only if a truly fair coin would produce an outcome at least that lopsided no more than 5% of the time.
How many flips will he need in order to have any chance of rejecting the null at the 5% level of significance? At least 6. Consider: A fair coin will come up Heads 6 times in a row with a probability of (1/2)^6 = (1/64), which is approximately 1.6% of the time. Likewise, a fair coin will generate 6 consecutive Tails about 1.6% of the time. Consequently, the chance of seeing either 6 Heads or 6 Tails—assuming the coin, in reality, is fair—is only (2/64) or 3.125%. That is lower than our 5% significance threshold. Therefore, if our researcher conducts an experiment involving 6 flips and observes either 6 Heads or 6 Tails, he can confidently announce to the world that the coin is not fair. (Note that if the researcher flips the coin 6 times and sees only 5 Heads and 1 Tail or 5 Tails and 1 Head, in any order, then that wouldn’t be sufficient evidence of an unfair coin.5)
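For readers who want to double-check this arithmetic, here is a short Python snippet (my own illustration, not part of any study) that computes, for several sample sizes, the two-sided probability of an all-Heads or all-Tails run, confirming that 6 flips is the first point at which that probability falls below 5%:

# For each sample size n, the probability that a FAIR coin produces
# n straight Heads or n straight Tails. The researcher can first reject
# the null at the 5% level when this probability drops below 0.05.
for n in range(2, 8):
    p_extreme = 2 * (0.5 ** n)        # P(all Heads) + P(all Tails)
    print(n, round(p_extreme, 4), p_extreme < 0.05)
# n = 5 gives 0.0625 (not significant); n = 6 gives 0.0312 (significant).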
The Tradeoff Between Type I and Type II Errors
However, even though his adoption of the 5% significance threshold protects our researcher from Type I errors, the small sample size (of 6 flips) makes him vulnerable to a Type II error. A Type II error occurs when a researcher fails to reject the null hypothesis when it really is false; in this case, the researcher falls prey to a “false negative.”
To quantify the probability of a Type II error, we need to specify exactly how the null hypothesis is false. In our example, suppose that the coin really is unfair and that it comes up Heads 3/4 of the time and Tails only 1/4 of the time. If our researcher flips the coin only 6 times, what is the likelihood that he will correctly announce to the world, “My research shows that this coin is unfair”?
To review our earlier computations, the convention of adopting a 5% threshold for “significance” means that in this small sample, our researcher must observe either 6 Heads or 6 Tails in order to rule out the null hypothesis of a fair coin. So what is the probability of observing such a sequence if, by stipulation in this example, the coin really is unfair and comes up Heads 3/4 of the time?
With this particular unfair coin, the probability of its coming up Heads 6 times out of 6 flips is (3/4)^6, or a little less than 18%. (There is also a very small probability of its coming up Tails 6 times in a row.) In other words, even though we assumed that this coin is unfair, our researcher has only an 18% chance of concluding this; there is a corresponding 82% probability of a Type II error. Thus, we say that the power of this experiment is only 18%—i.e., it is an underpowered study.
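Again, this is easy to verify directly. A minimal sketch, assuming the true probability of Heads is 3/4:

# Power of the 6-flip experiment: the probability that a coin with
# P(Heads) = 3/4 produces one of the two outcomes (6 Heads or 6 Tails)
# that would let the researcher reject the null.
p = 0.75
power = p ** 6 + (1 - p) ** 6
print(round(power, 3))   # ~0.178, i.e., roughly 18%; Type II error ~82%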
For a given sample size, there is a tradeoff between Type I and Type II errors. Just as it is routine to insist on a maximum 5% for the probability of a Type I error, there is a convention that a study have a power of at least 80%, so that the probability of a Type II error is held to 20% or lower. The only way for our researcher to increase the power of his study (i.e., reduce the chance of a false negative) without leaving himself more vulnerable to a false positive is to increase the sample size—to flip the coin more times.
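How many flips would our researcher need to reach the conventional 80% power? The sketch below (again my own illustration) builds an exact two-sided binomial test at the 5% level and searches for the smallest sample size whose power reaches 80%, assuming the coin truly comes up Heads 3/4 of the time; the search lands at 30 flips:

import math

def rejection_cutoff(n, alpha=0.05):
    """Smallest k such that seeing at least k Heads or at least k Tails
    has probability <= alpha under a fair coin."""
    for k in range(n, n // 2, -1):
        tail = sum(math.comb(n, i) for i in range(k, n + 1)) / 2 ** n
        if 2 * tail > alpha:
            return k + 1
    return n // 2 + 1  # not reached for any alpha below ~0.5

def power(n, p=0.75, alpha=0.05):
    k = rejection_cutoff(n, alpha)
    if k > n:
        return 0.0  # sample too small to ever reject the null
    win = sum(math.comb(n, i) * p**i * (1 - p)**(n - i)
              for i in range(k, n + 1))          # at least k Heads
    win += sum(math.comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(0, n - k + 1))     # at least k Tails
    return win

n = 6
while power(n) < 0.80:
    n += 1
print(n, round(power(n), 3))   # 30 flips, power ~0.80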
Underpowered Studies and the Problem of Bias
We can use our contrived but intuitive example to illustrate one more feature: A sample size of 6 flips means that there is a high probability that our researcher will fail to detect an unfair coin. It also means that in the unlikely event that he does detect it, he will overstate the actual magnitude of the coin’s unfairness.
To see this, suppose that many dozens of researchers are all vying for grants from the Anti-Coin League, an organization that funds efforts to discredit the public’s faith in these coins. Further suppose that we are still dealing with the case in which the coin really is unfair and comes up Heads 3/4 of the time. Now, if all of the researchers in this community conduct experiments involving only 6 coin flips, some of them will eventually observe 6 Heads in a row.6
Note that we are not questioning the integrity of the scientists involved; they aren’t cheating in any way. Each flips the coin 6 times and accurately reports the outcome. Furthermore, the scientific journals likewise have their standards and will publish only results that are “statistically significant.” Even though the Anti-Coin League is doling out the dough, the journals will publish only the results of a researcher who observed 6 Heads, because even a result of 5 Heads and 1 Tail could plausibly be due to chance.
Yet what happens when the researcher makes an estimate of just how unfair the coin is? Since he observes 6 Heads and 0 Tails—the necessary outcome to be “significant” and worthy of publication—the “sample mean” of the probability of Heads is 100%, while the “sample mean” probability of Tails is 0%. In other words, there will be an entire literature consisting of papers finding statistically significant evidence that the coins are unfair, in which the “best guess” is that the coins come up Heads all the time and never come up Tails.
So we see—in this particular example with a coin coming up Heads 3/4 of the time—that a sample size of 6 flips would mean that a given test had a power of only 18% and that the typical reported magnitude of the effect would be severely inflated.
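A quick Monte Carlo simulation makes this publication-filter effect vivid. The sketch below runs under the same assumptions (6 flips per study, true P(Heads) = 3/4, and only 6-of-a-kind results get published) and shows that the published literature’s average estimate of P(Heads) is essentially 100%:

import random

random.seed(0)
p_true, n_studies = 0.75, 100_000
published = []
for _ in range(n_studies):
    heads = sum(random.random() < p_true for _ in range(6))
    if heads in (0, 6):               # only "significant" studies get published
        published.append(heads / 6)   # the study's estimate of P(Heads)

print(len(published) / n_studies)        # ~0.18: the 18% power from above
print(sum(published) / len(published))   # ~1.0, versus the true value 0.75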
Larger Sample Size Helps on Both Dimensions
If we want to retain the “5% significance” threshold, then, as already noted, the only way to increase the power of our study is to increase the sample size. For example, suppose that instead of flipping the coin only 6 times, our researcher flips it 10 times. What effect does this have?
First, we need to recalculate how lopsided an outcome we would need to observe in order to reject the null hypothesis (“this is a fair coin”) with at least 95% confidence. It turns out we would still need to see either 9 or 10 of the same side (either Heads or Tails) in order to confidently reject the possibility that a fair coin generated such a sequence.7
Now that we know the observational threshold to reject the null, we can compute the power of our researcher’s new study, which relies on a sample size of 10. As before, suppose that, in reality, the coin is unfair and comes up Heads 3/4 of the time. With our larger sample size, the probability that such a coin will generate at least 9 Heads is a bit more than 24%. That is, by increasing the sample size from 6 to 10, we have boosted the power of our study—if we maintain our assumption that a coin comes up Heads 3/4 of the time—from 18% to 24%.
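These figures can be verified with a few lines of Python (again my own sketch, using the exact binomial probabilities):

import math

n, p = 10, 0.75
# Under the null (fair coin), the rejection region "9 or more of one side"
# has total probability ~0.0215 -- safely under the 5% cap.
alpha_used = 2 * sum(math.comb(n, k) for k in (9, 10)) / 2 ** n
print(round(alpha_used, 4))   # 0.0215

# Power: the chance a 3/4-Heads coin lands in that rejection region.
heads_side = sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in (9, 10))
tails_side = sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in (0, 1))
print(round(heads_side + tails_side, 3))   # ~0.244, up from ~0.178 with 6 flips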
Furthermore, if we imagine a community of researchers running experiments using 10 flips in each study, then, among the approximately 24% of them who find “statistically significant” evidence that the coins are unfair, about 77% of this batch of studies will report that the coins come up Heads 9 out of 10 flips, while the remaining 23% will estimate that the coins always come up Heads. Thus, these higher-powered studies are still very biased—they overestimate how unfair the coins are—but they are closer to the true value than the previous case, when 100% of the studies finding statistical significance concluded that the coins always came up Heads.
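The 77/23 split comes from conditioning on publication. Continuing the sketch above (still assuming a true P(Heads) of 3/4):

import math

n, p = 10, 0.75
p9 = math.comb(n, 9) * p**9 * (1 - p)   # exactly 9 Heads -> estimate P(Heads) = 0.9
p10 = p**10                             # exactly 10 Heads -> estimate P(Heads) = 1.0
sig = p9 + p10                          # ignore the negligible all-Tails outcomes
print(round(p9 / sig, 2), round(p10 / sig, 2))   # 0.77 and 0.23
print(round((p9 * 0.9 + p10 * 1.0) / sig, 2))    # mean published estimate ~0.92 vs. true 0.75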
It’s Harder to Detect a Weaker Signal
The final observation to make regarding our hypothetical coin research is that our measures of statistical power and bias depended on our assumption that the coin, in fact, came up Heads 3/4 of the time. Suppose, instead, that the coins were only a little unfair, coming up Heads only 51% of the time.
With our larger sample size of 10 flips, the threshold for ruling out the null hypothesis is the same. However, now that the signal is weaker, it will be much harder to detect the presence of an unfair coin. Specifically, if the coin, in reality, comes up Heads 51% of the time, there is only a 2% chance that a researcher in a given experiment would observe at least 9 Heads or at least 9 Tails. Thus, with such a weak signal, the power of our 10-flip experiment would drop to 2%. The problem of bias would be much larger, too: among the rare experiments that found statistical significance, most would vastly overstate the coin’s tendency to turn up Heads, while, perversely, 38% of this small batch of “unfair coin” studies would estimate that the coin comes up Tails 9 out of 10 times, getting both the magnitude and the sign wrong.
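One more sketch verifies these numbers for the barely unfair coin, under the stated assumption that the true P(Heads) is 0.51:

import math

n, p = 10, 0.51
heads_side = sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in (9, 10))
tails_side = sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in (0, 1))
power = heads_side + tails_side
print(round(power, 3))   # ~0.022: only a 2% chance of any "significant" finding

# Share of the significant studies that report exactly 9 Tails,
# i.e., conclude the coin is biased in the WRONG direction:
p_9_tails = math.comb(n, 1) * p * (1 - p) ** 9
print(round(p_9_tails / power, 2))   # ~0.38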
As our exaggerated examples illustrate, we cannot measure the power of a study without prior knowledge about the true value of the parameter being estimated. (In practice, of course, we can’t have this knowledge and so we must always estimate the power of a study as well.) Other things equal, the weaker the “true” effect—though different from zero—the lower the power of a given experiment or study, and the more likely it is that the researcher will end up exaggerating the actual magnitude of the effect should he correctly reject the null.
Back to Ioannidis
The above analysis sets the stage for comprehending the importance of Ioannidis et al.’s recent study. They relied on 159 “meta-analyses” of more than 6,700 empirical studies, which collectively provided more than 64,000 individual estimates of economic parameters. Using various techniques that give more weight to the more-reliable estimates within a literature, Ioannidis et al. first estimated the “true” value of each parameter; they could then retroactively calculate the statistical power of the studies in each area of inquiry, to see which were “adequately” powered (meaning that they had a power of at least 80%).
Their findings were sobering. When the authors ranked these 159 different “research areas” according to the percentage of their studies that had adequate statistical power, the median outcome was 10.5%, and that finding relies on the most generous of the techniques to estimate the “true value” of an economic parameter. The authors write: “That is, half of the areas of economics have approximately 10% or fewer of their estimates with adequate power.”
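Ioannidis et al.’s actual procedure is more involved, but the core idea of retrospective power can be sketched in a few lines. Assume a two-sided 5% z-test, a “true” effect size taken from the weighted meta-analysis, and a study’s reported standard error; the numbers below are hypothetical, purely for illustration:

from statistics import NormalDist

def retrospective_power(true_effect, se, alpha=0.05):
    """Power of a two-sided z-test, given an assumed true effect and a
    study's standard error. (Hypothetical sketch, not the authors' code.)"""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # ~1.96
    z = abs(true_effect) / se
    return (1 - NormalDist().cdf(z_crit - z)) + NormalDist().cdf(-z_crit - z)

# Made-up numbers: a noisy study vs. a precise study of the same effect.
print(round(retrospective_power(0.2, se=0.15), 2))   # ~0.27: underpowered
print(round(retrospective_power(0.2, se=0.05), 2))   # ~0.98: adequately powered (>= 0.80)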
Furthermore, the underpowered studies also implied very large biases in estimates of the magnitude of economic parameters. For example, of 39 separate estimates of the monetary value of a “statistical life”—a concept used in cost/benefit analyses of regulations—29 (74%) of the estimates were underpowered. For the 10 studies that had adequate power, the estimate of the value of a statistical life was $1.47 million, but the 39 studies collectively gave a mean estimate of $9.5 million. After our hypothetical examples of coin-flipping researchers, this real-world example leads one to suspect that the figure of $9.5 million is likely to be vastly exaggerated.
Conclusion
As a recent paper by Ioannidis et al. illustrates, economists should be more careful with their statistics. At the very least, empirical economists should broaden their horizons to realize that “statistical significance” by itself is not enough; one must also consider the power of a study. Currently, an alarming proportion of empirical investigations of economic parameters are underpowered, meaning that the “significant” results are quite possibly very biased. Listening to Russ Roberts’ recent interview with Ioannidis would help economists avoid pitfalls in future research.
1. The best timeline I have found of the brewing “replication crisis” in experimental psychology and sociology is provided by Andrew Gelman, professor of statistics and director of the Applied Statistics Center at Columbia University. See Gelman’s post, “What has happened down here is the winds have changed,” AndrewGelman.com, September 21, 2016.
2. For a description of a major analysis that couldn’t replicate even half of 100 psychology findings published in leading journals, see Benedict Carey, “Many Psychology Findings Not as Strong as Claimed, Study Says,” New York Times, August 27, 2015. As the article explains, “The new analysis, called the Reproducibility Project, found no evidence of fraud or that any original study was definitively false. Rather, it concluded that the evidence for most published findings was not nearly as strong as originally claimed.”
3. Ioannidis’ full bio is available at https://med.stanford.edu/profiles/john-ioannidis. The January 20, 2018 episode of EconTalk is available at John Ioannidis on Statistical Significance, Economics, and Replication.
4. John P. A. Ioannidis, T. D. Stanley, and Hristos Doucouliagos, “The Power of Bias in Economics Research,” The Economic Journal, Vol. 127, Issue 605, October 2017.
5. Specifically, the probability of seeing exactly 5 Heads out of 6 total flips is about 0.0938, and of seeing exactly 6 Heads is about 0.0156. So the probability of seeing at least 5 Heads or at least 5 Tails is about 0.2188, well above the 5% significance cutoff. In other words, our researcher couldn’t confidently reject the hypothesis that the coin is fair if he sees a 5-to-1 lopsided outcome with 6 total flips.
6. It’s even conceivable that a researcher observes 6 Tails in a row, but this outcome is expected to happen less than once per every 4,000 trials.
7. Specifically, the borderline outcome is when the coin comes up with one side (let’s say Heads) exactly 2 out of 10 times. The probability of this happening with a fair coin is (1/2)^10 * (10*9)/2 ≈ 0.0439. Therefore, the probability of seeing either exactly 2 Heads or exactly 2 Tails is approximately 0.0879, which is higher than the 5% threshold. In contrast, the probability of seeing either exactly 1 Head or exactly 1 Tail is only about 0.0195, so an outcome of at least 9 Heads or 9 Tails would lead the researcher to confidently reject the null hypothesis that this is a fair coin. (We also need to include the probability of seeing all 10 Heads or all 10 Tails, because the rule is “reject the null if we see at least 9 Heads or Tails,” but these outcomes are very rare and don’t tip our rule above a 5% total probability with a fair coin.)
*I thank Kevin Grier and Alan Murphy for helpful comments on an initial draft.
Robert P. Murphy is Research Assistant Professor with the Free Market Institute at Texas Tech University. He is the author of Choice: Cooperation, Enterprise, and Human Action (Independent Institute, 2015).
For more articles by Robert P. Murphy, see the Archive.