There's a huge debate going on in social science right now. The question is simple, and strikes near the heart of all research: What counts as solid evidence?
The answer matters because many disciplines are currently in the midst of a replication crisis, in which even textbook studies aren't holding up against rigorous retesting. The list includes ego depletion, the idea that willpower is a finite resource; the facial feedback hypothesis, which suggested that if we activate the muscles used in smiling, we become happier; and many, many more.
Scientists are now figuring out how to right the ship, to ensure that scientific studies published today won't be laughed at in a few years.
One of the thorniest issues in this debate is statistical significance. It's one of the most influential metrics used to determine whether a result gets published in a scientific journal.
Most casual readers of scientific research know that for results to be declared statistically significant, they need to pass a simple test. The answer to this test is called a p-value. And if your p-value is less than .05, bingo: you've got yourself a statistically significant result.
Now a group of 72 prominent statisticians, psychologists, economists, biomedical researchers, and others want to disrupt the status quo. A forthcoming paper in the journal Nature Human Behaviour argues that results should only be deemed statistically significant if they pass a higher threshold.
"We propose a change to P < 0.005," the authors write. "This simple step would immediately improve the reproducibility of scientific research in many fields."
This may sound nerdy, but it's important. If the change is accepted, the hope is that fewer false positives will corrupt the scientific literature. It's become too easy, using shady techniques known as p-hacking and outcome switching, to find some publishable result that reaches the .05 significance level.
"There's a major problem using p-values the way we have been using them," says John Ioannidis, a Stanford professor of health research and one of the authors of the paper. "It's causing a flood of misleading claims in the literature."
Don't be mistaken: This proposal won't solve all the problems in science. "I see it as a dam to contain the flood until we make sure we have the more permanent fixes," Ioannidis says. He calls it a quick fix, though not everyone agrees it's the best course of action.
At best, the proposal is an easy-to-implement change that protects the academic literature from faulty findings. At worst, it's a patronizing decree that avoids addressing the real problem at the heart of science's woes.
There is a lot to unpack and understand here. So we're going to take it slow.
Even the simplest definitions of p-values tend to get complicated. So bear with me as I break it down.
When researchers calculate a p-value, they're putting to the test what's known as the null hypothesis. First thing to know: This is not a test of the question the experimenter most desperately wants to answer.
Let's say the experimenter really wants to know whether eating one bar of chocolate a day leads to weight loss. To test that, they assign 50 participants to eat one bar of chocolate a day. Another 50 are commanded to abstain from the delicious stuff. Both groups are weighed before the experiment and again after, and their average weight changes are compared.
The null hypothesis is the devil's advocate argument. It states: There is no difference in the weight loss of the chocolate eaters versus the chocolate abstainers.
Rejecting the null is a major hurdle scientists need to clear to prove their theory. If the null stands, it means they haven't eliminated a major alternative explanation for their results. And what is science if not a process of narrowing down explanations?
So how do they rule out the null? They calculate some statistics.
The researcher basically asks: How ridiculous would it be to believe the null hypothesis is the true answer, given the results we're seeing?
Rejecting the null is "kind of like the innocent until proven guilty principle in court cases," Regina Nuzzo, a mathematics professor at Gallaudet University, explains. In court, you start off with the assumption that the defendant is innocent. Then you start looking at the evidence: the bloody knife with his fingerprints on it, his history of violence, eyewitness accounts. As the evidence mounts, that presumption of innocence starts to look naive. At a certain point, jurors get the feeling, beyond a reasonable doubt, that the defendant is not innocent.
Null hypothesis testing follows a similar logic: If there are huge and consistent weight differences between the chocolate eaters and the chocolate abstainers, the null hypothesis (that there are no weight differences) starts to look silly. And you can reject it.
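To make this concrete, here's a minimal sketch in Python of how a researcher might run that comparison. The weight-change numbers are invented purely for illustration; a standard two-sample t-test does the null-hypothesis comparison.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical weight changes (in kg) for the two groups of 50.
# These numbers are made up for illustration only.
chocolate_eaters = rng.normal(loc=-0.5, scale=2.0, size=50)
abstainers = rng.normal(loc=0.0, scale=2.0, size=50)

# A two-sample t-test asks: how surprising would a mean difference
# this large be if the null hypothesis (no difference) were true?
t_stat, p_value = stats.ttest_ind(chocolate_eaters, abstainers)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```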
Rejecting the null hypothesis is indirect evidence of an experimental hypothesis. It says nothing about whether your scientific conclusion is correct.
Sure, the chocolate eaters may lose some weight. But is it because of the chocolate? Maybe. Or maybe they felt extra guilty eating candy every day, and they knew they were going to be weighed by strangers wearing lab coats (weird!), so they skimped on other meals.
Rejecting the null doesn't tell you anything about the mechanism by which chocolate causes weight loss. It doesn't tell you whether the experiment is well designed, or well controlled, or whether the results have been cherry-picked.
It just helps you understand how rare the results are.
But (and this is a tricky, tricky point) it's not how rare the results of your experiment are. It's how rare the results would be in a world where the null hypothesis is true. That is, it's how rare the results would be if nothing in your experiment worked, and the difference in weight was due to random chance alone.
Here's where the p-value comes in: The p-value quantifies this rareness. It tells you how often you'd see the numerical results of an experiment, or even more extreme results, if the null hypothesis is true and there's no difference between the groups.
If the p-value is very small, it means the numbers would rarely (but not never!) occur by chance alone. And so, when the p is small, researchers start to think the null hypothesis looks improbable. And they take a leap to conclude their "[experimental] data are pretty unlikely to be due to random chance," Nuzzo explains.
And here's another tricky point: Researchers can never completely rule out the null (just as jurors are not firsthand witnesses to a crime). So scientists instead pick a threshold at which they feel pretty confident rejecting the null. That's now set at less than .05.
Ideally, a p of .05 means that if you ran the experiment 100 times, assuming the null hypothesis is true, you'd see these same numbers (or more extreme results) five times.
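You can check that logic with a quick simulation. Here's a minimal sketch in which the null is true by construction (both groups are drawn from the same distribution); roughly 5 percent of the simulated experiments should cross the .05 line by chance alone.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulate 10,000 experiments in which the null hypothesis is TRUE:
# both groups come from the same distribution.
p_values = np.array([
    stats.ttest_ind(rng.normal(size=50), rng.normal(size=50)).pvalue
    for _ in range(10_000)
])

# By chance alone, about 5% of these "null" experiments reach p < .05.
print(f"Fraction with p < .05: {(p_values < 0.05).mean():.3f}")
```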
And one last, super-thorny concept that almost everyone gets wrong: A p < .05 does not mean there's less than a 5 percent chance your experimental results are due to random chance. It does not mean there's only a 5 percent chance you've landed on a false positive. Nope. Not at all.
Again: A p of .05 means there's a less than 5 percent chance that, in the world where the null hypothesis is true, the results you're seeing would be due to random chance. This sounds nitpicky, but it's critical. It is the misunderstanding that leads people to be unduly confident in p-values. The false-positive rate for experiments at p = .05 can be much, much higher than 5 percent.
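Here's a back-of-the-envelope way to see why. The base rate and power below are my assumptions for illustration, not numbers from the paper: if only 10 percent of tested hypotheses are actually true and studies have 80 percent power, over a third of all "significant" results are false positives.

```python
# Back-of-the-envelope false discovery rate at p < .05.
# Assumed for illustration: 10% of tested hypotheses are true,
# and studies have 80% power to detect a real effect.
base_rate = 0.10   # fraction of tested hypotheses that are actually true
power = 0.80       # chance a real effect reaches significance
alpha = 0.05       # chance a null effect reaches significance anyway

true_positives = base_rate * power           # 0.08 of all studies
false_positives = (1 - base_rate) * alpha    # 0.045 of all studies

fdr = false_positives / (true_positives + false_positives)
print(f"Share of 'significant' results that are false: {fdr:.0%}")  # ~36%
```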
Okay. Still with me? It's okay if you need to take a break. Grab a soda. Catch up with Mom. She's wondering why you haven't called in a while. Tell her about your summer plans.
Because now we're going to dive into...
"Generally, p-values should not be used to make conclusions, but rather to identify possibilities, like a sniff test," Rebecca Goldin, the director of Stats.org and a math professor at George Mason University, explains in an email.
And for a long while, a sniff of a p that's less than .05 smelled pretty good. But over the past several years, researchers and statisticians have realized that a p < .05 is not as strong evidence as they once thought.
And to be sure, evidence for this is abundant.
Here's the most obvious, easy-to-understand piece of evidence: Many papers that used the .05 significance threshold have not replicated when retested with more methodologically rigorous designs.
A famous 2015 paper in Science attempted to replicate 100 findings published in a prominent psychological journal. Only 39 percent passed. Other disciplines have fared somewhat better: A similar replication effort in economics found that 60 percent of findings replicated. There's a reproducibility crisis in biomedicine too, but it hasn't been as specifically quantified.
The 2015 Science paper on psych studies offered some clues about which papers were more likely to replicate: Studies that yielded highly significant results (p less than .01) were more likely to reproduce than those that were just barely significant at the .05 level.
"Reporting effects that really aren't there undermines the credibility of science," says Valen Johnson, a co-author of the Nature Human Behaviour proposal who heads the statistics department at Texas A&M. "It's important that science adopt these higher standards before they claim they have made a discovery."
Elsewhere, researchers find evidence of an epidemic of statistical significance. "Practically everything that you read in a published paper has a nominally statistically significant result," says Ioannidis. "The large majority of these p-values of less than .05 do not correspond to some true effect."
For a long while, scientists thought p < .05 represented something rare. New work in statistics shows that it's not.
In a 2013 PNAS paper, Johnson used more advanced statistical techniques to test an assumption researchers commonly make: that a p of .05 means there's a 5 percent chance the null hypothesis is true. His analysis revealed that it doesn't. "In fact, there's a 25 percent to 30 percent chance the null hypothesis is true when the p-value is .05," Johnson said.
Remember: The p-value is supposed to assure researchers that their results are rare. Twenty-five percent is not rare.
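Johnson's result comes out of heavier Bayesian machinery, but a widely cited bound from Sellke, Bayarri, and Berger lands in the same ballpark with one line of arithmetic. Here's a minimal sketch, assuming 50-50 prior odds between the null and the alternative (that prior is my assumption for illustration, not Johnson's):

```python
import math

p = 0.05

# Sellke-Bayarri-Berger bound: for p < 1/e, the Bayes factor in favor
# of the alternative hypothesis is at most 1 / (-e * p * ln(p)).
max_bf_alternative = 1 / (-math.e * p * math.log(p))   # ~2.46

# Assuming 50-50 prior odds (an illustrative assumption), the
# posterior probability that the null is true is at least:
min_prob_null = 1 / (1 + max_bf_alternative)
print(f"P(null | p = .05) >= {min_prob_null:.0%}")     # ~29%
```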
For another way to think about all this, let's flip the question around: What if, instead of assuming the null hypothesis is true, we assume an experimental hypothesis is true?
Scientists and statisticians have shown that if experimental hypotheses are true, it should actually be somewhat uncommon for studies to keep churning out p-values of around .05. More often, when an effect is real, the p-value should come in lower.
Psychology PhD student Kristoffer Magnusson has designed a pretty cool interactive calculator that estimates the probability of obtaining a range of p-values for any given true difference between groups. I used it to create the following scenario.
Let's say there's a study where the actual difference between two groups is equal to half a standard deviation. (Yes, this is a nerdy way of putting it. But think of it like this: It means 69 percent of those in the experimental group show results higher than the mean of the control group. Researchers call this a medium-sized effect.) And let's say there are 50 people each in the experimental group and the control group.
In this scenario, you should only be able to obtain a p-value between .03 and .05 around 7.62 percent of the time.
If you ran this experiment over and over and over again, you'd actually expect to see far more p-values at much lower values. Plot the frequency of each p-value across repeated runs of the experiment and the distribution is heavily skewed toward small values; a large share of p-values fall below .001.
(And from this distribution you'll see: Yes, you can obtain a p-value greater than .05 even when your experimental hypothesis is true. It just shouldn't happen as often. In this case, around 9.84 percent of all p-values should fall between .05 and .1.)
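You can reproduce the gist of Magnusson's calculator with a short Monte Carlo simulation. This is a sketch of the same scenario (a true half-standard-deviation difference, 50 people per group); the exact fractions will wobble a bit from run to run.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Simulate experiments where the effect is REAL: the experimental
# group's mean sits half a standard deviation above the control's.
p_values = np.array([
    stats.ttest_ind(rng.normal(0.5, 1.0, 50), rng.normal(0.0, 1.0, 50)).pvalue
    for _ in range(20_000)
])

print(f"p in [.03, .05): {((p_values >= 0.03) & (p_values < 0.05)).mean():.1%}")  # ~7.6%
print(f"p in [.05, .10): {((p_values >= 0.05) & (p_values < 0.10)).mean():.1%}")  # ~9.8%
print(f"p < .001:        {(p_values < 0.001).mean():.1%}")  # a large share
```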
This is a specific, hypothetical scenario. But in general, it's weird that so many p-values in the published literature don't match this distribution. Sure, a few studies on a question should get a p-value of .05. But more should find lower numbers.
The biggest change the paper advocates is rhetorical: Results that currently meet the .05 level would be called "suggestive," and those that reach the stricter standard of .005 would be called "statistically significant."
"Journals can still publish weak (and of course null) results just like they always could," says Simine Vazire, a personality psychologist who edits Social Psychological and Personality Science (though she is not speaking on behalf of the journal). The language tweak will hopefully trickle down to press releases and news reports, which might avoid buzzwords such as "breakthrough."
The change, Vazire says, should make it so that authors need stronger results before they can make strong claims. That's all.
Historians of science are always quick to point out that Ronald Fisher, the UK statistician who invented the p-value, never intended it to be the final word on scientific evidence. To Fisher, statistical significance merely meant the hypothesis was worthy of a follow-up investigation. "In a way, we're proposing to return to his original vision of what statistical significance means," says Daniel Benjamin, a behavioral economist at the University of Southern California and the lead author of the proposal.
If labs do want to publish statistically significant results, it's going to be much harder.
Most concretely, it means labs will need to increase the number of participants in their studies by about 70 percent. "The change essentially requires six times stronger evidence," Benjamin says.
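Where does that 70 percent figure come from? A standard power calculation reproduces it. Here's a sketch using statsmodels, assuming a medium effect size and the conventional 80 percent power (those specific inputs are my assumptions, not the paper's):

```python
from statsmodels.stats.power import TTestIndPower

solver = TTestIndPower()

# Per-group sample size for a medium effect (d = 0.5) at 80% power,
# under the old and proposed significance thresholds.
n_old = solver.solve_power(effect_size=0.5, alpha=0.05, power=0.8)
n_new = solver.solve_power(effect_size=0.5, alpha=0.005, power=0.8)

print(f"n per group at alpha = .05:  {n_old:.0f}")   # ~64
print(f"n per group at alpha = .005: {n_new:.0f}")   # ~107
print(f"Increase: {n_new / n_old - 1:.0%}")          # ~70%
```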
The proposal's authors hope the increased burden of proof would nudge labs into adopting other practices science reformers have been calling for, such as sharing data with other labs to reach consensus conclusions and thinking more long-term about their work. Perhaps a lab's first experiment doesn't reach the new threshold. But a second experiment might. The higher threshold encourages labs to reproduce their own work before submitting it for publication.
The proposal has critics. One of them is Daniel Lakens, a psychologist at Eindhoven University of Technology in the Netherlands, who is currently organizing a rebuttal paper with dozens of authors.
Mainly, he says, the significance proposal might work to stifle scientific progress.
"A good metaphor is driving a car and setting a maximum speed," Lakens says. "You can set the maximum speed in your country to 20 miles an hour, and no one is going to get killed. You hit someone, they won't die. So that's pretty good, right? But we don't do this. We set the maximum speed a little higher, because then we actually get somewhere a little bit quicker. ... The same is for science."
Ideally, Lakens says, the level of statistical significance needed to support a hypothesis should depend on how outlandish that hypothesis is.
Yes, you'd want a very low p-value in a study that claims mental telepathy is possible. But do you need such an extreme threshold when testing a well-worn idea? The higher standard could prevent young PhDs with small budgets from testing their ideas.
Again, a p-value of .05 doesn't necessarily mean an experiment is a false positive. A good researcher would know how to follow up and suss out the truth.
Another critique of the proposal: It keeps scientific communities fixated on p-values, which, as discussed above, don't really tell you much about the merits of a hypothesis.
There are better, more nuanced approaches to evaluating science, such as focusing on effect sizes and confidence intervals, or using Bayesian methods that weigh the evidence for competing hypotheses directly.
Ioannidis admits that statistical significance alone "doesn't convey much about the meaning, the importance, the clinical value, utility [of research]."
Ideally, he says, scientists would retrain themselves not to rely on null-hypothesis testing. But we don't live in the ideal world. In the real world, p-values are a quick and easy tool any scientist can use to run their tests. And in our real world, p-values still carry a lot of weight in determining what gets published.
With the proposal, "you don't need to train all these millions of people in heavy statistics," Ioannidis says. "And it would work. It would help."
Redefining statistical significance is not an ideal solution to the problem of replication. It's a solution that nudges people toward adopting the ideal solution.
Though no one I spoke to said it directly, I wouldn't be surprised if some scientists find that a bit patronizing. Why couldn't they learn advanced statistics? Or come to appreciate more nuanced ways of evaluating results?
There's one critique of the proposal that the authors I spoke to agree with completely: Changing the definition of statistical significance doesn't address the real problem. And the real problem is the culture of science.
In 2016, Vox sent out a survey to more than 200 scientists, asking, "If you could change one thing about how science works today, what would it be and why?" One of the clear themes in the responses: The institutions of science need to get better at rewarding failure.
One young scientist told us: "I feel torn between asking questions that I know will lead to statistical significance and asking questions that matter."
The biggest problem in science isn't statistical significance. It's the culture. She felt torn because young scientists need publications to get jobs, and under the status quo, to get publications, you need statistically significant results. Statistical significance alone didn't lead to the replication crisis. The institutions of science incentivized the behaviors that allowed it to fester.
Keep in mind, this is all just a proposal, something to spark debate. To my knowledge, journals are not rushing to change their editorial standards overnight.
This will continue to be debated.
But if it turns out that it's still hard to publish "suggestive" results, and still difficult to secure grant money off them, then the institutions of science will not have learned their lesson. Yes, a lot of this is just tweaking the language of how we talk about science. But we have to make the words "suggestive" and "null results" matter.
"Failures, on average, are more valuable than positive studies," Ioannidis says.
Scientific institutions and journals know this. They don't always act like they do.