800 scientists say it's time to abandon "statistical significance"
P-values and "statistical significance" are widely misunderstood. Here's what they actually mean.
- By Brian Resnick
- on March 22, 2019 12:00 pm
[Image caption: There's a strong case that chasing p-values has led science astray. Credit: erhui1979/Getty Creative Images]
For too long, many scientists' careers have been built around the pursuit of a single statistic: p<.05.
In many scientific disciplines, that's the threshold beyond which study results can be declared "statistically significant," which is often interpreted to mean that it's unlikely the results were a fluke, a result of random chance.
But this isn't what it actually means in practice. Statistical significance is too often misunderstood and misused. That's why a trio of scientists writing in Nature this week are calling for the entire concept of statistical significance to be abandoned.
Their biggest argument: "Statistically significant" or "not statistically significant" is too easily misinterpreted to mean either "the study worked" or "the study did not work." A true effect can sometimes yield a p-value of greater than .05. And we know from recent years that science is rife with false-positive studies that achieved p-values of less than .05 (read my explainer on the replication crisis in social science for more).
The Nature commentary authors argue that the math is not the problem. Instead, it's human psychology. Bucketing results into "statistically significant" and "statistically non-significant," they write, leads to an overly black-and-white approach to scrutinizing science.
More than 800 other scientists and statisticians across the world have signed on to this manifesto. For now, it seems more like a provocative argument than the start of a real sea change. Nature, for one, "is not seeking to change how it considers statistical analysis in evaluation of papers at this time," the journal noted.
But the tides may be rising against "statistical significance." This isn't the first time scientists and statisticians have challenged the status quo. In 2016, I wrote about how a large group of them called for raising the threshold to .005, making it much harder to call a result "statistically significant." (Concurrently with the Nature commentary, the journal The American Statistician devoted an entire issue to the problem of statistical significance.) There's a wide recognition that p-values can be problematic.
I suspect this proposal will be heavily debated (as is everything in science). At least this latest call for radical change does highlight an important fact plaguing science: Statistical significance is widely misunderstood. Let me walk you through it. I think it will help you understand this debate better, and help you see that there are a lot more ways to judge the merits of a scientific finding than p-values.
Wait, what is a p-value? What's statistical significance?
Even the simplest definitions of p-values tend to get complicated, so bear with me as I break it down.
When researchers calculate a p-value, they're putting to the test what's known as the null hypothesis. First thing to know: This is not a test of the question the experimenter most desperately wants to answer.
Let's say the experimenter really wants to know if eating one bar of chocolate a day leads to weight loss. To test that, they assign 50 participants to eat one bar of chocolate a day. Another 50 are commanded to abstain from the delicious stuff. Both groups are weighed before the experiment and then after, and their average weight change is compared.
The null hypothesis is the devil's advocate argument. It states there is no difference in the weight loss of the chocolate eaters versus the chocolate abstainers.
Rejecting the null is a major hurdle scientists need to clear to prove their hypothesis. If the null stands, it means they haven't eliminated a major alternative explanation for their results. And what is science if not a process of narrowing down explanations?
So how do they rule out the null? They calculate some statistics.
The researcher basically asks: How ridiculous would it be to believe the null hypothesis is the true answer, given the results we're seeing?
Rejecting the null is kind of like the innocent until proven guilty principle in court cases, Regina Nuzzo, a mathematics professor at Gallaudet University, explained. In court, you start off with the assumption that the defendant is innocent. Then you start looking at the evidence: the bloody knife with his fingerprints on it, his history of violence, eyewitness accounts. As the evidence mounts, that presumption of innocence starts to look naive. At a certain point, jurors get the feeling, beyond a reasonable doubt, that the defendant is not innocent.
Null hypothesis testing follows a similar logic: If there are huge and consistent weight differences between the chocolate eaters and chocolate abstainers, the null hypothesis that there are no weight differences starts to look silly and you can reject it.
You might be thinking: Isn't this a pretty roundabout way to prove an experiment worked?
You are correct!
Rejecting the null hypothesis is indirect evidence of an experimental hypothesis. It says nothing about whether your scientific conclusion is correct.
Sure, the chocolate eaters may lose some weight. But is it because of the chocolate? Maybe. Or maybe they felt extra guilty eating candy every day, and they knew they were going to be weighed by strangers wearing lab coats (weird!), so they skimped on other meals.
Rejecting the null doesn't tell you anything about the mechanism by which chocolate causes weight loss. It doesn't tell you if the experiment is well designed, or well controlled for, or if the results have been cherry-picked.
It just helps you understand how rare the results are.
But (and this is a tricky, tricky point) it's not how rare the results of your experiment are. It's how rare the results would be in the world where the null hypothesis is true. That is, it's how rare the results would be if nothing in your experiment worked and the difference in weight was due to random chance alone.
Here's where the p-value comes in: The p-value quantifies this rareness. It tells you how often you'd see the numerical results of an experiment, or even more extreme results, if the null hypothesis is true and there's no difference between the groups.
If the p-value is very small, it means the numbers would rarely (but not never!) occur by chance alone. So when the p is small, researchers start to think the null hypothesis looks improbable. And they take a leap to conclude "their [experimental] data are pretty unlikely to be due to random chance," Nuzzo explains.
Here's another tricky point: Researchers can never completely rule out the null (just like jurors are not firsthand witnesses to a crime). So scientists instead pick a threshold where they feel pretty confident that they can reject the null. For many disciplines, that's now set at less than .05.
Ideally, a p of .05 means that if you ran the experiment 100 times, again assuming the null hypothesis is true, you'd see these same numbers (or more extreme results) about five times.
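To make that definition concrete, here is a minimal sketch of a permutation test in Python. It is not how any particular study computes its p-values; it is just one simple way to build "the world where the null hypothesis is true," by shuffling who counts as a chocolate eater and who counts as an abstainer. The function name, the number of shuffles, and the weight-change numbers are all made up for illustration.

```python
import random


def permutation_p_value(group_a, group_b, n_shuffles=10_000, seed=0):
    """Estimate how often a gap at least as big as the observed one
    shows up when group labels are assigned purely at random."""
    rng = random.Random(seed)
    n_a = len(group_a)

    def gap(first, second):
        # Absolute difference between the two group averages.
        return abs(sum(first) / len(first) - sum(second) / len(second))

    observed = gap(group_a, group_b)
    pooled = list(group_a) + list(group_b)

    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        # Shuffling the pooled data simulates the null world, where
        # the group label has no connection to the outcome.
        rng.shuffle(pooled)
        # Tiny tolerance guards against floating-point rounding.
        if gap(pooled[:n_a], pooled[n_a:]) >= observed - 1e-12:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_shuffles


# Hypothetical weight changes in kilograms (invented numbers, not real data).
chocolate = [-1.4, -0.2, -2.1, 0.3, -0.9, -1.7, 0.1, -1.1, -0.6, -1.9]
abstainers = [-0.3, 0.4, -0.8, 0.9, -0.1, 0.2, -0.5, 0.6, 0.0, -0.4]

print("estimated p-value:", permutation_p_value(chocolate, abstainers))
```

The number it prints is the share of shuffled, null-world experiments that produced a gap at least as large as the one observed. That is all a p-value is.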
And one last, super-thorny concept that almost everyone gets wrong: A p<.05 does not mean there's less than a 5 percent chance your experimental results are due to random chance. It does not mean there's only a 5 percent chance you've landed on a false positive. Nope. Not at all.
Again: A p-value of less than .05 means that there is less than a 5 percent chance of seeing these results (or more extreme results), in the world where the null hypothesis is true. This sounds nitpicky, but it's critical. It's the misunderstanding that leads people to be unduly confident in p-values. The false-positive rate for experiments at p=.05 can be much higher than 5 percent.
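A bit of arithmetic shows why. The sketch below is a back-of-the-envelope calculation, not a figure from the Nature commentary. It assumes you know two things nobody knows precisely in real life: what share of the hypotheses scientists test are actually true (`prior_true`), and how often a study detects a true effect when there is one (`power`). Both inputs, and the function name, are illustrative assumptions.

```python
def false_positive_share(prior_true, power, alpha=0.05):
    """Among results that clear the p < alpha bar, what share are
    false positives? All inputs are assumed, illustrative values."""
    true_positives = prior_true * power          # real effects that get detected
    false_positives = (1 - prior_true) * alpha   # null effects that sneak past
    return false_positives / (true_positives + false_positives)


# Assumption: 1 in 10 tested hypotheses is true, and studies
# detect a true effect 80 percent of the time.
share = false_positive_share(prior_true=0.10, power=0.80)
print(f"{share:.0%} of 'significant' results would be false positives")
```

Under those assumed inputs, roughly a third of the "significant" findings are false positives, even though every one of them cleared p<.05. Change the assumptions and the share changes, which is exactly why the 5 percent threshold can't be read as a false-positive rate.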
Let's repeat it: P-values don't necessarily tell you if an experiment worked or not
Psychology PhD student Kristoffer Magnusson has designed a pretty cool interactive calculator that estimates the probability of obtaining a range of p-values for any given true difference between groups. I used it to create the following scenario.
Let's say there's a study where the actual difference between two groups is equal to half a standard deviation. (Yes, this is a nerdy way of putting it. But think of it like this: It means 69 percent of those in the experimental group show results higher than the mean of the control group. Researchers call this a medium-size effect.) And let's say there are 50 people each in the experimental group and the control group.
In this scenario, you should only be able to obtain a p-value between .03 and .05 around 7.62 percent of the time.
If you ran this experiment over and over and over again, you'd actually expect to see a lot more p-values with a much lower number. That's what the following chart shows. The x-axis is the specific p-values, and the y-axis is the frequency you'd find them repeating this experiment. Look how many p-values you'd find below .001.
This is why many scientists get wary when they see too many results cluster around .05. It shouldn't happen that often and raises red flags that the results have been cherry-picked, or, in science-speak, "p-hacked." In science, it can be much too easy to game and tweak statistics to achieve significance.
And from this chart, you'll see: Yes, you can obtain a p-value of greater than .05 when an experimental hypothesis is true. It just shouldn't happen as often. In this case, around 9.84 percent of all p-values should fall between .05 and .1.
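If you'd like to check numbers like these yourself, here is a rough sketch using only Python's standard library. It is not the code behind Magnusson's calculator. It uses a normal approximation to a two-sided, two-group test, so its answers (roughly 7.6 percent and 9.8 percent for the two ranges above) land close to the calculator's figures but won't match every decimal. The function names are made up for this example.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def share_above_control_mean(effect_size):
    """Share of the experimental group scoring above the control mean,
    assuming both groups are normally distributed with equal spread."""
    return STD_NORMAL.cdf(effect_size)


def prob_p_below(p, effect_size=0.5, n_per_group=50):
    """Chance a study yields a two-sided p-value below p, given a true
    effect (in standard deviations). Normal approximation."""
    if p <= 0:
        return 0.0
    # With a true effect, the test statistic is centered here, not at zero.
    shift = effect_size * math.sqrt(n_per_group / 2)
    z_crit = STD_NORMAL.inv_cdf(1 - p / 2)
    upper_tail = 1 - STD_NORMAL.cdf(z_crit - shift)
    lower_tail = STD_NORMAL.cdf(-z_crit - shift)
    return upper_tail + lower_tail


def prob_p_between(low, high, effect_size=0.5, n_per_group=50):
    """Chance the p-value lands between low and high."""
    return (prob_p_below(high, effect_size, n_per_group)
            - prob_p_below(low, effect_size, n_per_group))


print(f"above control mean:  {share_above_control_mean(0.5):.1%}")
print(f"p between .03 and .05: {prob_p_between(0.03, 0.05):.2%}")
print(f"p between .05 and .10: {prob_p_between(0.05, 0.10):.2%}")
print(f"p below .001:          {prob_p_below(0.001):.2%}")
```

Setting `effect_size=0` puts you back in the null world, where every p-value is equally likely and `prob_p_below(0.05)` comes out to 5 percent.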
There are better, more nuanced approaches to evaluating science
Many scientists recognize there are more robust ways to evaluate a scientific finding. And they already engage in them. But they, somehow, don't currently hold as much power as "statistical significance." They are:
- Concentrating on effect sizes (how big of a difference does an intervention make, and is it practically meaningful?)
- Confidence intervals (what's the range of doubt built into any given answer?)
- Whether a result is a novel study or a replication (put some more weight into a theory many labs have looked into)
- Whether a study's design was preregistered (so that authors can't manipulate their results post-test), and whether the underlying data is freely accessible (so anyone can check the math)
- There are also alternative statistical techniques, like Bayesian analysis, that in some ways more directly evaluate a study's results. (P-values ask the question "how rare are my results?" Bayes factors ask the question "what is the probability my hypothesis is the best explanation for the results we found?" Both approaches have trade-offs.)
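The first two items on that list are easy to compute. Here is a minimal sketch of an effect size (Cohen's d, the difference in means measured in standard deviations) and a confidence interval for the difference between two groups. It relies on a normal approximation for the interval, which is a simplification; a careful analysis of small samples would use a t-distribution. The function name and the sample numbers are invented for illustration.

```python
import math
from statistics import NormalDist, mean, variance


def effect_size_and_ci(treatment, control, confidence=0.95):
    """Return (Cohen's d, (low, high)) for the difference in group means.
    The interval uses a normal approximation."""
    diff = mean(treatment) - mean(control)
    n_t, n_c = len(treatment), len(control)
    var_t, var_c = variance(treatment), variance(control)

    # Cohen's d: the raw difference scaled by the pooled standard deviation.
    pooled_sd = math.sqrt(
        ((n_t - 1) * var_t + (n_c - 1) * var_c) / (n_t + n_c - 2)
    )
    cohens_d = diff / pooled_sd

    # Confidence interval: the raw difference plus or minus a margin of error.
    std_error = math.sqrt(var_t / n_t + var_c / n_c)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return cohens_d, (diff - z * std_error, diff + z * std_error)


# Hypothetical scores (invented numbers, not real data).
treated = [2.0, 4.0, 6.0, 8.0]
untreated = [1.0, 3.0, 5.0, 7.0]

d, (low, high) = effect_size_and_ci(treated, untreated)
print(f"Cohen's d: {d:.2f}, 95% interval for the difference: {low:.2f} to {high:.2f}")
```

Reporting the size of the difference and the range of doubt around it tells a reader more than a bare "significant" or "not significant" label does.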
The real problem isn't with statistical significance; it's with the culture of science
The authors of the latest Nature commentary aren't calling for the end of p-values. They'd still like scientists to report them where appropriate, but not necessarily label them "significant" or not.
There's likely to be argument around this strategy. Some might think it's useful to have simple rules of thumb, or thresholds, to evaluate science. And we still need to have phrases in our language to describe scientific results. Erasing "statistical significance" might just confuse things.
In any case, changing the definition of statistical significance, or nixing it entirely, doesn't address the real problem. And the real problem is the culture of science.
In 2016, Vox sent out a survey to more than 200 scientists asking, "If you could change one thing about how science works today, what would it be and why?" One of the clear themes in the responses: The institutions of science need to get better at rewarding failure.
One young scientist told us, "I feel torn between asking questions that I know will lead to statistical significance and asking questions that matter."
The biggest problem in science isn't statistical significance; it's the culture. She felt torn because young scientists need publications to get jobs. Under the status quo, in order to get publications, you need statistically significant results. Statistical significance alone didn't lead to the replication crisis. The institutions of science incentivized the behaviors that allowed it to fester.