Nobelist Thomas Sargent said this about early tests of rational expectations macro models back in the 1970s:
I recall [future Nobelists] Bob Lucas and Ed Prescott both telling me that those tests were rejecting too many good models. 
That phrase, “[T]hose tests were rejecting too many good models,” has become a bit of a litmus test.  Either you chuckle slightly because you’ve seen a clear example of myside bias, or you treat it like a Zen koan, worthy of further contemplation.
I’m in the latter category.  
That’s because the tests they were talking about were designed to see whether these skeletal macro models were missing something.  And the data kept telling Prescott, Sargent, and Lucas, “Yes, your model is missing something. The world is not precisely like your model.” 
But who cares whether the world is exactly like your model?  The point of a model is that it’s simple.  You remember Steven Wright’s old joke:
I have a map of the United States…actual size.  It says, ‘Scale: 1 mile = 1 mile.’ I spent last summer folding it. 
And now it’s time to pull out the famous George Box line:
…all models are wrong, but some are useful.
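To see the koan in numbers, here’s a minimal sketch of how a standard significance test treats a model that is almost, but not exactly, right. The setup is hypothetical and the gap, noise level, and sample sizes are made-up values chosen to make the point visible: feed the test enough data and it rejects, no matter how good the approximation.

```python
# Toy illustration: a t-test against a model whose predicted mean is off
# by one percent. All numbers are illustrative, not from any real study.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_mean, model_mean = 1.01, 1.00   # the model is "wrong," but barely

for n in (100, 10_000, 1_000_000):
    data = rng.normal(true_mean, 1.0, size=n)
    _, p = stats.ttest_1samp(data, popmean=model_mean)
    print(f"n = {n:>9,}   p-value = {p:.4f}")

# As n grows, the p-value heads toward zero: the test keeps announcing
# that the world is not *precisely* like the model, which we already knew.
```

The verdict says nothing about whether the model is useful at the scale we care about; that’s the gap the next idea tries to fill.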
The rise of calibration was one solution to the 1:1-scale-map problem in macroeconomic theory.  The calibration approach is simple: Build a causal chain with reasonably strong links, and see whether the completed chain can bear substantial (but not infinite) weight in the real world.  It didn’t always play out that way in practice (strong links are sometimes in the eye of the beholder), but we should have ideals to live up to.  
This isn’t just about real business cycles (making hay while new technology shines); it’s a story that applies to New Keynesian models too: Is price stickiness big enough to explain how spending shocks can shift real output? [Answer: It depends on the kind of stickiness.  So be sure to grab the right weapon when you’re in any-weapon-to-hand mode.]  
Sargent again:
The idea of calibration is to ignore some of the probabilistic implications of your model but to retain others. Somehow, calibration was intended as a balanced response to professing that your model, though not correct, is still worthy as a vehicle for quantitative policy analysis.
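As a concrete, if stylized, illustration of that recipe: the classic real-business-cycle calibrations pinned down parameters from independent facts (factor shares, average returns, depreciation), then checked an untargeted moment against the data. The parameter values below are rough, textbook-style numbers, not anyone’s official estimates.

```python
# A minimal calibration sketch in the Cooley-Prescott spirit: choose
# parameters from facts the model wasn't built to explain, then see how
# close an untargeted prediction lands. All numbers are illustrative.
alpha = 0.33   # capital share, from labor's share of income
delta = 0.06   # depreciation rate, from the investment-to-capital ratio
beta  = 0.96   # discount factor, from an average real return near 4%

# Steady state of the neoclassical growth model:
#   1/beta = 1 + r   and   r = alpha * (Y/K) - delta
r = 1 / beta - 1                # implied real interest rate
k_over_y = alpha / (r + delta)  # untargeted prediction for K/Y

print(f"implied real rate: {r:.3f}")          # ~0.042
print(f"model K/Y:         {k_over_y:.2f}")   # ~3.2, vs roughly 3 in U.S. data

# The verdict is "close or not close," not a p-value: the model is wrong
# by construction, so the question is whether the chain bears weight.
```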

Among the successes of this approach (success at encouraging good science, not at providing correct answers to the fourth significant digit), I’d include:

1. Sargent and Ljungqvist’s work on the link between high European unemployment and rising “economic turbulence”: It helps explain why Europeans have had high unemployment only post-1973, even though they had generous unemployment benefits decades earlier.  Hint: Post-1973, long-accumulated job skills sometimes become worthless, something that didn’t happen before.
2. Mehra and Prescott’s Equity Premium Puzzle: A few numbers went a long way in shocking the profession into realizing that stocks aren’t risky enough to justify their high returns unless people are terrified of modest losses. Another claim: The kind of people who are terrified of modest losses are the kind of people who need a big incentive to shift their consumption from one period to another; yet the safe interest rate is very low.  You barely need to pay people anything to shift consumption across time.  (The back-of-envelope arithmetic is sketched after this list.)  
A puzzle with hundreds, maybe thousands, of resolutions, all of them probably wrong and some of them still useful. 
3. Lucas’s estimate of the cost of business cycles in terms of human well-being: Consumer spending wiggles little over the business cycle, so basic estimates showed that the average person wouldn’t pay much for an insurance policy that offered to stabilize average consumer spending.  That’s a sign typical business cycles (not this one) don’t influence human well-being very much.  An early extension (discussed on p. 10 here) showed that in a country with a U.S.-sized safety net you can’t use the “people might die” angle to boost the cost of business cycles by much.  A “quit your bellyaching” paper that spurred strong responses, partly because it was a “quit your bellyaching” paper, but mostly because the numbers just grated.  (Lucas’s arithmetic also appears after this list.)  
True enough to annoy: A sign Lucas did things right.  
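First, the Mehra-Prescott arithmetic. Under standard assumptions (CRRA utility with curvature gamma, lognormal consumption growth), two lines of algebra pin down how terrified of losses people would have to be, and what safe rate that terror implies. The inputs below are stylized, rounded versions of the usual U.S. figures; treat them as illustrative.

```python
# Mehra-Prescott back-of-envelope under CRRA utility and lognormal
# consumption growth. All inputs are stylized, rounded values.
import math

g       = 0.018   # mean log consumption growth
sigma_c = 0.036   # std dev of log consumption growth
premium = 0.062   # historical equity premium over T-bills
beta    = 0.98    # discount factor

# With equity modeled as a claim on consumption, the premium is roughly
# gamma * var(consumption growth). Invert for the implied risk aversion:
gamma = premium / sigma_c**2
print(f"implied risk aversion: {gamma:.0f}")   # ~48; "plausible" values are 1-10

# The companion risk-free rate puzzle: even a tamer gamma of 10, plus
# steady consumption growth, predicts a safe rate far above the ~1% we see.
gamma_tame = 10.0
rf = -math.log(beta) + gamma_tame * g - 0.5 * gamma_tame**2 * sigma_c**2
print(f"model risk-free rate at gamma = 10: {rf:.3f}")   # ~0.135, i.e. ~13%
```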
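Second, Lucas’s calculation. With the same CRRA setup, the share of average consumption a person would pay to erase all fluctuations around trend is roughly one half of gamma times the variance of those fluctuations. Again, the inputs are the commonly cited round numbers, used here purely for illustration.

```python
# Lucas's compensating-variation arithmetic: cost ~= 0.5 * gamma * sigma**2,
# where sigma is the std dev of log consumption around trend and gamma is
# the curvature of utility. Round, commonly cited inputs; illustrative only.
gamma = 1.0      # log utility
sigma = 0.032    # std dev of log consumption about trend, postwar U.S.

cost = 0.5 * gamma * sigma**2
print(f"welfare cost of cycles: {cost:.4%} of consumption")   # ~0.05%

# Even cranking gamma up to 10 only gets you to about half a percent:
print(f"at gamma = 10: {0.5 * 10 * sigma**2:.2%} of consumption")
```

Numbers that small are exactly what made the paper grate.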
In all three cases, there were fights about the numbers and about the setup.  None of these issues is settled (except for #3: Lucas is totally right there, not least because recessions are good for your health); instead, these papers reframed the issues in excellent ways.  
Calibration, which also goes by the name “quantitative theory,” is a move forward because it shifts the debate away from “positive or negative?” toward “big or small?”  Lots of things might matter a little, but we should focus our attention on things that matter a lot.  
Does your favorite idea matter a little or a lot?  How would you know? 
Coda: It was only in 1971 that medical researchers began to understand why aspirin works.  Before then, people knew it worked as a pain reliever in some settings but not in others; they just didn’t know why.  John Vane, author of the 1971 paper in Nature, won a Nobel for showing how aspirin works.
Causation first, comprehension later. 
Both are wonderful, but comprehension both sates and invigorates the mind.