I recently came across two good posts from Alex Tabarrok (at Marginal Revolution, via @TimHarford) about the dubious achievement of statistical significance. A standard "rule" for acceptable statistical significance is 5%: if your paper achieves a p-value of 5%, it gets published; otherwise, maybe not. The p-value is the probability of observing a difference at least as large as the one you found purely by chance, assuming there is no true underlying difference. An example may clarify (skip to the jump if you're familiar with this): Suppose I have two classrooms of 30 kids, and each takes a test. If one classroom scores an average of 78% and the other an average of 75%, this might not mean that one classroom is smarter than the other. Even if the kids in both classrooms had exactly the same average ability, there's a certain probability that, just by chance, we'd see a difference at least as big as the one we observed. That probability is the p-value. Obviously, if the p-value is quite large, say 70%, we would think it more likely the results occurred by chance. If the p-value is quite small, under the scientific community's 5% rule, we would say it seems much more likely that the two groups are in fact different, because otherwise observing such a large difference would be very unlikely.
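To make the classroom example concrete, here's a minimal sketch of a permutation test, one standard way to estimate a p-value. The scores are made up (drawn around the 78% and 75% averages in the example); the idea is just to show what "how often would chance alone produce a difference this big?" looks like in practice.

```python
import random
import statistics

def permutation_p_value(group_a, group_b, n_perm=10_000, seed=0):
    """Estimate a p-value: how often does randomly reshuffling the class
    labels produce a mean difference at least as large as the observed one?"""
    rng = random.Random(seed)
    observed = abs(statistics.mean(group_a) - statistics.mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_big = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_big += 1
    return at_least_as_big / n_perm

# Hypothetical scores for two classrooms of 30 kids each.
random.seed(42)
class_a = [random.gauss(78, 10) for _ in range(30)]
class_b = [random.gauss(75, 10) for _ in range(30)]
p = permutation_p_value(class_a, class_b)
print(f"p-value ≈ {p:.3f}")
```

If the printed p-value comes in under 0.05, the 5% rule would call the difference "significant"; a 3-point gap between classes of 30 with this much spread often doesn't clear that bar.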
The problems, as Tabarrok points out, are numerous. For one, the 5% cutoff is completely arbitrary. There's no reason a paper with a significance level of 6% isn't providing almost as much evidence of an effect as one with 5%. But more than that, by settling on a level of significance we feel "comfortable" with, we lose sight of what statistical significance is telling us: even if the drug didn't work at all, the school program wasn't effective, or the two groups were exactly the same, there's still a 5% chance the study would clear the bar by chance alone. Tabarrok concludes, based on the work of John Ioannidis, that with a very large population of researchers working, it becomes possible that many or most of the published papers report false results. The argument is complicated, so you'll have to read it yourself, but it's certainly worth thinking about.
I, however, find this issue less compelling than another he raises. Generally, if there's reason to believe the model being tested is well specified, scientists have a good sense of what is a real effect and what isn't. This confidence grows with repetition, as Tabarrok points out, because the likelihood of several studies finding an effect where none exists is even smaller than for any individual study. No, the real problem, to me, is that most studies aren't very well specified, which is what Tabarrok discusses in this piece. A test for statistical significance does indeed test for some difference between the two groups, but if we think X causes that difference, while our model leaves out the Z that actually causes it, the resulting publication can be badly misleading. We too often expect statistical significance to do our heavy lifting for us, and believe me, you can hear economics grad students celebrating whenever they get across the magic 5% line. But as I said, all statistical significance tells us is that there is a difference; it doesn't tell us why.
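Here's a toy simulation of that omitted-Z problem. The setup is invented for illustration: a hidden factor Z (say, family income) drives both X (test-prep spending) and Y (test scores), while X has no direct effect on Y at all. X and Y still end up strongly associated, and that association vanishes once Z is accounted for.

```python
import random
import statistics

random.seed(1)
n = 500
# Hidden confounder Z drives both X and Y; X has NO direct effect on Y.
z = [random.gauss(0, 1) for _ in range(n)]
x = [zi + random.gauss(0, 1) for zi in z]
y = [zi + random.gauss(0, 1) for zi in z]

def pearson(a, b):
    """Pearson correlation, computed from population moments."""
    ma, mb = statistics.mean(a), statistics.mean(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    return cov / (statistics.pstdev(a) * statistics.pstdev(b) * len(a))

# Raw association: looks like a real, "significant" X-Y effect.
r_xy = pearson(x, y)

# Control for Z by correlating what's left of X and Y after removing Z's
# contribution (the true coefficient on Z is 1 here, so subtraction suffices).
x_resid = [xi - zi for xi, zi in zip(x, z)]
y_resid = [yi - zi for yi, zi in zip(y, z)]
r_partial = pearson(x_resid, y_resid)

print(f"raw correlation: {r_xy:.2f}, controlling for Z: {r_partial:.2f}")
```

A significance test on the raw X-Y correlation would happily reject the null, and a misspecified model would declare that X "works"; only the model that includes Z reveals there's nothing there.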
For more on this last point, see our earlier piece on correlation not being causation.
Marginal Revolution: Most published research is false and The meaning of statistical significance
[An earlier version of this piece incorrectly attributed the above posts to Tyler Cowen, who co-writes the MR blog. They are by Alex Tabarrok.]