A case for overcoming the p-value

When is a scientific discovery really a discovery? When does a hunch about something that could be true turn into something mankind collectively holds to be true?

The world is complicated, and it is hard to come by new insights that are both meaningful and can hold their ground against rigorous testing and replication. This is especially true once we enter the messier realms of the social and biomedical sciences: most researchers do not have the luxury of physicists, who can carry out millions of measurements in tightly controlled environments (think of the Large Hadron Collider) until only a shadow of doubt remains about the validity of a discovery.

In recent years, many shadows of doubt have been cast on seemingly well-established results. The replication crisis in the social sciences, which continues to this day, has made these problems more widely known to the public. Take psychology as an example: modern paradigms like social priming, thought to be rock-solid for many years, are losing credibility, with this Nature article stating that “many researchers say they now see social priming not so much as a way to sway people’s unconscious behavior, but as an object lesson in how shaky statistical methods fooled scientists into publishing irreproducible results.” This finding is not the exception but rather the rule: according to a 2018 survey of 200 meta-analyses, “psychological research is, on average, afflicted with low statistical power”.

Maybe this shouldn’t come as a big surprise: experiments that involve humans or other complex living systems are harder to control, and cranking up sample sizes and hunting down all possible sources of error to bulletproof them can be quite laborious. This makes it easy for errors to slip in.

Significance Testing

And so scientists are naturally concerned with finding good criteria to judge if the results they obtained are really worth reporting. The idea of significance testing tries to establish objective measures that help to separate the wheat of science from the chaff of science.

The most popular criterion for statistical significance is probably (I’m 95 percent sure) the p-value. But as we will see, this seemingly “best” criterion can open the door to a whole new set of problems. It can make shady results look solid, can give nothing the air of significance, and can hide bad research behind a deceptive sense of objectivity.

The p-value, instead of helping us, might in many cases have its hand in the replication crisis the sciences are facing. Significance testing does not replace good science, but it can make it seem that it does. And that’s where its danger lies.

The p-value

The p-value is a fairly straightforward concept.

If you observe an effect in some data you have collected, and if you have a theory of what caused this effect (say you have given out medication in a double-blind study and have cured a certain number of patients in the treatment group and a certain number in the control group), how likely is it that you would have observed this effect even if the intervention didn’t have any effect at all (the null hypothesis)? In more abstract terms: how likely is the observation if the variables you are looking at have no relationship with each other?

An example: say you group 100 patients together that get the real medication and 100 patients that get a placebo. After carrying out the trial, in the first group 60 people are cured, while in the other group, only 48 people are cured. How likely is it that this could just be a random effect so that you make this observation even though the medication doesn’t do anything whatsoever?

This probability is the p-value. It has been common practice to set the significance threshold at 0.05: the probability of observing an effect at least this large, even if the null hypothesis were correct, needs to be smaller than 5 percent for the result to count as significant.
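To make the example concrete, here is a minimal sketch of how such a p-value could be computed for the 60-versus-48 trial. It assumes a two-proportion z-test with a normal approximation, which is my choice for illustration and not a method prescribed anywhere above; the function name is hypothetical.

```python
import math

def two_proportion_p_value(cured_a, n_a, cured_b, n_b):
    """Return (z, two-sided p) for H0: both groups share one cure rate."""
    rate_a, rate_b = cured_a / n_a, cured_b / n_b
    # Under the null hypothesis, both groups are pooled into one cure rate.
    pooled = (cured_a + cured_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_a - rate_b) / std_err
    # Normal tail probability: P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    return z, math.erfc(abs(z) / math.sqrt(2))

z, p_two_sided = two_proportion_p_value(60, 100, 48, 100)
# Halving is only meaningful here because the observed difference
# points in the direction the research hypothesis predicted.
p_one_sided = p_two_sided / 2
print(f"z = {z:.2f}, two-sided p = {p_two_sided:.3f}, one-sided p = {p_one_sided:.3f}")
```

Under these assumptions the two-sided p-value comes out at roughly 0.09 and the one-sided one at roughly 0.04, so the very same trial lands on either side of the 0.05 bar depending on a modeling choice.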

The general procedure can be outlined in the following way:

  1. Pick the research hypothesis
  2. State the null hypothesis
  3. Select a significance threshold, i.e. the accepted probability of error (commonly 0.05)
  4. Run a statistical significance test and compare the resulting p-value with the threshold
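The four steps can be sketched on the medication example from above. This sketch assumes a permutation test (shuffling who counts as "treated") as the significance test in step 4; that is one option among many, and the function and constant names are hypothetical.

```python
import random

# Step 1 (research hypothesis): the medication changes the cure rate.
# Step 2 (null hypothesis): group labels are irrelevant to being cured.
# Step 3 (threshold): the conventional 0.05.
ALPHA = 0.05

def permutation_p_value(cured_a, n_a, cured_b, n_b, shuffles=5000, seed=0):
    """Step 4: estimate a two-sided p-value by shuffling group labels."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    total_cured = cured_a + cured_b
    outcomes = [1] * total_cured + [0] * (n_a + n_b - total_cured)
    # Compare rate differences via integer cross-products to avoid
    # floating-point ties: |a/n_a - b/n_b| scaled by n_a * n_b.
    observed = abs(cured_a * n_b - cured_b * n_a)
    extreme = 0
    for _ in range(shuffles):
        rng.shuffle(outcomes)
        sim_a = sum(outcomes[:n_a])
        sim_b = total_cured - sim_a
        if abs(sim_a * n_b - sim_b * n_a) >= observed:
            extreme += 1
    return extreme / shuffles

p_value = permutation_p_value(60, 100, 48, 100)
print(f"p = {p_value:.3f}, significant at {ALPHA}: {p_value < ALPHA}")
```

With these numbers the estimate lands in the neighborhood of 0.1, so this particular test would not declare the example significant.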

Problems with the p-value

In my article on the Pandemics of Bad Statistics, I’ve talked about the false sense of certainty numbers can give us. Calculating a small p-value for a given hypothesis boils down the success of an experiment and the validity of its hypothesis into one small objective number.

But the objectivity and validity of this number are questionable for a number of reasons.

The p-value is intimately intertwined with the null hypothesis. But where does the null hypothesis come from, and what does it really state? In an ideal environment, the null hypothesis is a sharp point hypothesis of zero effect, yet this only makes sense in an absolutely optimal experiment, in which all other possible effects have been ruled out.

But every real-life study designed to examine real-life effects is noisy, and systematically so. Every such null hypothesis is a flawed idealization that pretends no systematic errors or noise sources remain, which is not true even in most highly controlled physics experiments, to say nothing of medicine or the social sciences.

Accordingly, in their 2019 paper “Abandon Statistical Significance”, McShane et al. write:

“…both the adequacy of the statistical model used to compute the p-value…as well as any and all forms of systematic or nonsampling error which vary by field but include measurement error; problems with reliability and validity; biased samples; nonrandom treatment assignment; missingness; nonresponse; failure of double-blinding; noncompliance; and confounding. The combination of these features of the biomedical and social sciences and this sharp point null hypothesis of zero effect and zero systematic error is highly problematic.”

Significant p-values can be obtained from pure noise, as has been shown in publicized examples (e.g. in Carney, 2010). Holding on to the p-value while testing against an idealized null hypothesis leads to the absurd effect that noisier studies are more likely to produce (or can more easily be made to produce) significant but false results than cleaner, more controlled studies, and are therefore, by the p-value criterion, more likely to be published!
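A small simulation illustrates how pure noise produces significant p-values. This is a sketch under assumptions of my own: both groups are drawn from the same 50 percent rate, the test is a two-proportion z-test, and "measuring several outcomes per study and reporting any hit" stands in as one hypothetical way a study can be made to yield significance. All names are illustrative.

```python
import math
import random

def p_value(hits_a, hits_b, n):
    """Two-sided two-proportion z-test for two groups of equal size n."""
    pooled = (hits_a + hits_b) / (2 * n)
    std_err = math.sqrt(pooled * (1 - pooled) * 2 / n)
    if std_err == 0:  # all-or-nothing in both groups: no visible difference
        return 1.0
    z = (hits_a - hits_b) / n / std_err
    return math.erfc(abs(z) / math.sqrt(2))

def share_of_significant_studies(outcomes_per_study, studies=1000, n=100, seed=1):
    """Fraction of pure-noise studies with at least one p < 0.05."""
    rng = random.Random(seed)
    significant = 0
    for _study in range(studies):
        for _outcome in range(outcomes_per_study):
            # Same 50% rate in both groups, so any difference is noise.
            a = sum(rng.random() < 0.5 for _ in range(n))
            b = sum(rng.random() < 0.5 for _ in range(n))
            if p_value(a, b, n) < 0.05:
                significant += 1
                break  # one hit is enough to "report an effect"
    return significant / studies

one_outcome = share_of_significant_studies(1)
ten_outcomes = share_of_significant_studies(10)
print(f"1 outcome: {one_outcome:.1%} significant, 10 outcomes: {ten_outcomes:.1%}")
```

With a single outcome, about one noise-only study in twenty clears the bar, as the threshold implies. With ten outcomes to choose from, the share should climb toward 1 - 0.95^10, or about 40 percent.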

And then the p-value threshold of 0.05 is entirely arbitrary. There is no real scientific basis for dichotomizing evidence into the two categories of statistically significant and statistically insignificant, and it is naive to assume that a plausible threshold should be the exact same for all kinds of experiments and hypotheses across all kinds of disciplines.
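How thin the line between the two categories can be shows up when the medication example from earlier is varied by a single patient. The sketch below assumes a two-sided two-proportion z-test and groups of 100 patients each; the test choice and the function name are my assumptions for illustration.

```python
import math

def two_sided_p(cured_a, cured_b, n=100):
    """Two-sided two-proportion z-test for two groups of equal size n."""
    pooled = (cured_a + cured_b) / (2 * n)
    std_err = math.sqrt(pooled * (1 - pooled) * 2 / n)
    z = (cured_a - cured_b) / n / std_err
    return math.erfc(abs(z) / math.sqrt(2))

ALPHA = 0.05
results = {}
for cured_treated in (60, 61, 62):
    p = two_sided_p(cured_treated, 48)
    results[cured_treated] = p
    verdict = "significant" if p < ALPHA else "not significant"
    print(f"{cured_treated} vs 48 cured: p = {p:.3f} -> {verdict}")
```

Under these assumptions, 61 cured patients leave the p-value just above 0.05 and 62 push it just below: one additional recovery flips the verdict, although the evidence has barely changed.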

Imposing an arbitrary threshold is detrimental to scientific thinking, which should always encompass many different, potentially undiscovered explanations.

These undiscovered explanations are, incidentally, also not included in the null hypothesis. As Andrew Gelman writes here, the null hypothesis is not a good hypothesis in itself; it rather plays the straw man to the alternative hypothesis. Comparing your hypothesis to a bad hypothesis does not disprove all other, potentially better hypotheses about what is going on in the data.

Last but not least, the p-value is routinely misinterpreted, which only exacerbates the problem. A meta-study of 791 papers showed that 49 percent (!) of them misused the p-value by classifying statistically non-significant results as indicating that no effect is present.
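Why a non-significant result does not show the absence of an effect can be seen in a simulation where the effect is real by construction. The sketch assumes true cure rates of 60 and 48 percent (the numbers from the earlier example, here treated as the truth), 100 patients per group, and a two-sided two-proportion z-test; all names are hypothetical.

```python
import math
import random

def two_sided_p(cured_a, cured_b, n):
    """Two-sided two-proportion z-test for two groups of equal size n."""
    pooled = (cured_a + cured_b) / (2 * n)
    std_err = math.sqrt(pooled * (1 - pooled) * 2 / n)
    if std_err == 0:
        return 1.0
    z = (cured_a - cured_b) / n / std_err
    return math.erfc(abs(z) / math.sqrt(2))

def share_detected(rate_treated=0.60, rate_placebo=0.48, n=100,
                   studies=2000, seed=2):
    """Fraction of simulated studies reaching p < 0.05 (in either direction)."""
    rng = random.Random(seed)
    detected = 0
    for _ in range(studies):
        cured_treated = sum(rng.random() < rate_treated for _ in range(n))
        cured_placebo = sum(rng.random() < rate_placebo for _ in range(n))
        if two_sided_p(cured_treated, cured_placebo, n) < 0.05:
            detected += 1
    return detected / studies

power = share_detected()
print(f"Studies that found the (real) effect significant: {power:.0%}")
```

In this setup fewer than half of the studies should reach significance even though the medication works in every single one of them. Reading each of the remaining studies as "no effect" would be wrong more often than not.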

Abandoning Significance Testing

Statistics is hard.

McShane et al.

As the replication crisis is ongoing, solutions are dearly needed to improve the quality of scientific inquiry. The discourse about the p-value and its misuse has been especially lively.

So what to do? As McShane et al. recommend, the scientific community should get rid of the p-value, or at least adopt a much more continuous view of it, away from any fixed thresholds, and treat it on an equal footing with the many other criteria that try to judge the outcome of an experiment.
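As one illustration of what a more continuous view could look like, the sketch below reports the size of the effect together with its uncertainty for the 60-versus-48 example. The choice of a 95 percent Wald interval for the difference in cure rates is my assumption, not a recommendation taken from McShane et al., and the function name is hypothetical.

```python
import math

def difference_with_interval(cured_a, n_a, cured_b, n_b, z_crit=1.96):
    """Difference in cure rates with an approximate 95% Wald interval."""
    rate_a, rate_b = cured_a / n_a, cured_b / n_b
    diff = rate_a - rate_b
    # Unpooled standard error: each group keeps its own observed rate.
    std_err = math.sqrt(rate_a * (1 - rate_a) / n_a
                        + rate_b * (1 - rate_b) / n_b)
    return diff, diff - z_crit * std_err, diff + z_crit * std_err

diff, low, high = difference_with_interval(60, 100, 48, 100)
print(f"Difference: {diff:+.0%} (roughly {low:+.0%} to {high:+.0%})")
```

The output is a range running from slightly below zero to about 26 percentage points. That says more than "not significant": the data are compatible with no benefit, but also with a large one.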

The sciences should reduce the power the p-value holds over the publication process, in which papers tend to be published and effects reported only when the 0.05 threshold is met. Many scientists have recently risen up against statistical significance (see this Nature article), and it seems like a paradigm shift is due. Yet this shift won’t happen unless a critical perspective on statistical significance is more widely taught in elementary statistics courses.

Little saves us from the trouble of taking a holistic view of the evidence, a view that needs to be hand-tailored to the needs of the experiment and the field. It is a dangerous tendency to conclude an experiment with binary statements about there being “an effect” or “no effect” based on the simple fact of the p-value being over or under a threshold.

But this takes time, and the politics of academia have contributed to the problem. For instance, the pressure to publish as first author discourages pooling data: it is better to “discover” and publish two noisy results than to combine data sets with the competition in order to get a better handle on the noise, and then share the dubious fame of having discovered nothing.

The problem is: a statistical mindset is crucial for doing good science, but statistics is hard, and doing good science is hard, especially in noisy and hard-to-control environments. The p-value is a tempting way out. Our brains love turning uncertainties into certainties (as I explore in my article on the Bayesian Brain Hypothesis), but this always runs the risk of introducing biases into our thinking and should be no justification for “uncertainty laundering”, as Gelman beautifully puts it.

Holding on too tightly to statistical significance and dichotomizing evidence based on it has compromised the quality of a lot of science and, ultimately, the faith societies put in science and the value they see in it.

And so it is high time to let it go.