Extreme outliers: How meaningful are they?
By Scott Sumner
This post is about the way I think about extreme outliers. It’s very unscientific, but I hope the comment section will help me to better understand this issue.
Suppose you have two variables, X and Y, which are (supposedly) positively correlated. But the very highest value of X is associated with the very lowest value of Y. Or assume the two variables are supposed to be negatively correlated, but the highest value of X corresponds to the highest value of Y. Should that make us suspicious of the alleged relationship?
I was thinking about this a couple years ago when reading a paper on the impact of Smoot-Hawley on stock prices (unfortunately I’ve forgotten the author). The paper claimed that news making Smoot-Hawley more likely to pass tended to raise stock prices in both 1929 and 1930. My own research reached the same conclusion for 1929, but the opposite conclusion for 1930. Unfortunately, my own research was less systematic, mostly based on reading the NYT and observing that stocks seemed to fall on news that Smoot-Hawley was moving closer to passage in the spring of 1930. So that suggests the other study should be taken more seriously.
But there’s one data point I just can’t get past: Hoover’s decision to sign the bill (which was made on a Sunday). Hoover had been under heavy pressure to veto the bill, from US exporters, from foreigners, and from a letter signed by over 1000 economists. He rejected that advice, and the next day the New York market saw the biggest percentage decline in all of 1930 (about 300 trading days, as Saturday was still a workday back then). Even worse, the news media seemed to attribute the crash to the decision by Hoover to sign the bill, quoting stock traders who believed the same.
That’s just one data point, but it’s hard for me to get past that one data point. How could something that was supposedly boosting stock prices, and was by far the major news story, be associated with the biggest stock market crash of the year, immediately after the decision was actually made? It’s possible, but it seems really unlikely.
Suppose you believed that direct democracy led to bad political outcomes, because philosopher kings were much better than mob rule. In that case, how likely is it that the one country with by far the most direct democracy in the entire world would also be arguably the best governed in Europe, and perhaps the world? Possible, but how likely?
Suppose you noticed that America scores higher on happiness rankings than does Europe, on average. So you developed a hypothesis that social welfare states are less happy. How likely would it be that a country which by some measures has the world’s most generous social insurance system is also the world’s happiest country?
And while we are on the subject, suppose you thought deregulation and privatization made people unhappy. How likely would it be for the world’s most free market economy (excluding level of taxation and government spending) to also be the happiest?
Suppose you thought that the East Asian tiger economies were successful because they rejected the neoliberal agenda coming out of Washington, and instead had state-directed development strategies. If that were true, how likely is it that the two very richest East Asian economies would also be number one and two in the world in the Heritage Ranking of Economic Freedom?
Suppose you believed that monetary policy was ineffective at boosting NGDP at the zero bound. In that case, how likely is it that the fastest four-month stretch of NGDP growth in American history would occur during a period of near-zero interest rates, right after an easily identifiable monetary shock (March-July, 1933)?
And while we are at it, suppose you believed that the credit channel explains why growth is slow during and after a banking crisis. How likely is it that the fastest stretch of NGDP growth would occur during a period right after America’s worst banking crisis, and during a period when thousands of banks were still closed down? Again, not just growth during the financial crisis, but perhaps the fastest NGDP growth ever, during arguably the worst banking crisis ever.
Suppose you thought that inflation was caused by bottlenecks in the economy, and deflation was caused by slack. How likely is it that the price level (WPI) would rise by 20% during a period of 25% unemployment (1933-34)?
And speaking of the zero bound, just how likely is it that the biggest two-day stock rally in US history would (just randomly) occur immediately after Hoover announced a proposal to allow the Fed to print more money for each ounce of gold backing?
Suppose you thought that Mexican-Americans had a propensity to rape and murder. (Hmm, where have we heard that theory?) How likely is it that America’s most Mexican major city (of the top fifty) would also have the lowest murder rate, and perhaps the lowest violent crime rate?
Here’s how I look at it, and I want you to tell me why I’m wrong. If you have only a few observations, then extreme outliers are no big deal, but your study is also not very reliable. If your study includes a large number of observations, the odds that the most extreme value of X is paired with the most extreme value of Y in the opposite direction from the actual relationship seem very low. Am I too suspicious of extreme outliers? What do you think?
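One rough way to put numbers on this intuition is a small Monte Carlo sketch. It assumes X and Y are jointly normal with a true correlation rho, which is an assumption made purely for illustration and not a claim about any of the cases above. The function name, the parameters, and the chosen values of n and rho are all hypothetical. It estimates how often the observation with the highest X also has the lowest Y. If X and Y were unrelated, that chance would be exactly 1/n by symmetry, so the interesting question is how far a genuinely positive relationship pushes it below 1/n.

```python
import math
import random


def prob_max_x_has_min_y(n, rho, trials=20000, seed=0):
    """Estimate P(the observation with the largest X also has the smallest Y).

    Assumption (illustrative only): X and Y are standard normal with
    true correlation rho, and the n observations are independent.
    """
    rng = random.Random(seed)
    noise_scale = math.sqrt(1.0 - rho * rho)
    hits = 0
    for _ in range(trials):
        best_x = -math.inf
        y_at_best_x = 0.0
        min_y = math.inf
        for _ in range(n):
            x = rng.gauss(0.0, 1.0)
            # Build Y so that corr(X, Y) equals rho.
            y = rho * x + noise_scale * rng.gauss(0.0, 1.0)
            if x > best_x:
                best_x, y_at_best_x = x, y
            if y < min_y:
                min_y = y
        # A "perverse outlier": the top X value sits with the bottom Y value.
        if y_at_best_x == min_y:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    for n in (5, 30, 300):
        for rho in (0.0, 0.3, 0.6):
            p = prob_max_x_has_min_y(n, rho, trials=5000)
            print(f"n={n:4d}  rho={rho:.1f}  estimated probability={p:.4f}")
```

With few observations the perverse outlier is not especially rare even when the relationship is real, while with many observations it should become rare both because 1/n shrinks and because the true correlation works against it. Real data, such as daily stock returns, are not normally distributed, so the printed numbers are only a guide to the shape of the argument.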
PS. You should be suspicious of my “largest two-day rally,” which sounds like data mining and a violation of the EMH. But there really was news on the second day, as Congressional leaders agreed to speed the bill through Congress.