Tuesday, November 4, 2025

Power and ‘fragile’ p-values


Do ‘fragile’ p values tell us anything?

I recently saw this article on p values in the psychology literature float across my social media feed. Paul C Bogdan makes the case that the severity of the replication crisis in science can be judged partly by the proportion of p values that are ‘fragile’, which he defines as between 0.01 and 0.05.

Of course, concern about the proportion of p values that are ‘significant but only just’ is a constant feature of the replication crisis. One of the standing problems with science is that researchers use questionable research practices to somehow nudge p values down to just under the threshold deemed to be “significant” evidence. Another standing concern is that researchers who might not use such practices in the analysis itself may not publish, or not be able to publish, their null results, leaving a bias towards positive results in the published literature (the “file-drawer” problem).

Bogdan argues that for studies with 80% power (defined as 1 minus the probability of accepting the null hypothesis when there is in fact a real effect in the data), 26% of the p values that are significant should be in this “fragile” range, based on simulations.
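That 26% benchmark can be sanity-checked without a full simulation. Here is a quick back-of-envelope version of my own (a normal approximation, not Bogdan’s method): if a two-sided test has 80% power at alpha = 0.05, the implied non-centrality parameter tells us how often the p value would also clear the stricter 0.01 threshold, and the shortfall is the ‘fragile’ share.

# back-of-envelope check of the 26% figure (normal approximation, my own sketch)
ncp <- qnorm(0.975) + qnorm(0.80)       # non-centrality implied by 80% power at alpha = 0.05
power_01 <- pnorm(ncp - qnorm(0.995))   # power of the same study at alpha = 0.01; about 0.59
(0.80 - power_01) / 0.80                # significant p values between 0.01 and 0.05; about 0.26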

The research Bogdan describes in the article linked above is a clever data processing exercise over the published psychology literature, to see what proportion of p values are in fact “fragile” and how this changes over time. He finds that “From before the replication crisis (2004–2011) to today (2024), the overall proportion of significant p values in the fragile range has dropped from 32% to nearly 26%”. As 26% is about what we’d expect if all the studies had power of 80%, this is seen as good news.

Is the replication crisis over? (To be fair, I don’t think Bogdan claims this last point.)

One of Bogdan’s own citations is this piece by Daniel Lakens, which is itself a critique of a similar earlier attempt at this. Lakens argues that “the changes in the ratio of fractions of p-values between 0.041–0.049 over the years are better explained by assuming the average power has decreased over time” rather than by changes in questionable research practices. I think I agree with Lakens on this.

I just don’t think the benchmark of 26% of significant p values being ‘fragile’ is solid enough to judge research practices against.

Anyway, all this intrigued me enough, when it was discussed first in Science (as “a big win”) and then on Bluesky, for me to want to do my own simulations to see how changes in effect sizes and sample sizes would change that 26%. My hunch was that the 26% was based on assumptions that all studies have 80% power and (given power has to be calculated for some assumed but unobserved true effect size) that the actual difference in the real world is close to the difference assumed in making that power calculation. Both these assumptions are clearly extremely brittle, but what is the impact if they are wrong?

From my rough playing out below, the impact is pretty material. We shouldn’t assume that changes in the proportion of significant p values that are between 0.01 and 0.05 tell us much about questionable research practices, because there is simply too much else confounding the whole thing: the pre-calculated power, how well the power calculations (and indeed the research chosen) reflect reality, the size of the differences we are looking for, and the sample sizes.

Do your own research simulations

To do this, I wrote a simple function experiment which draws two independent samples from two populations, with all observations normally distributed. For my purposes the two sample sizes are going to be the same and the standard deviations the same in both populations; only the means differ by population. But the function is set up for a more general exploration if I’m ever motivated.

The ideal scenario – the researcher’s power calculation matches the real world

With this function I first played around a bit to get a scenario where the power is very close to 80%. I got this with sample sizes of 53 each and a difference in the means of the two populations of 0.55 (remembering each population has a normal distribution of N(0, 1)).

I then checked this with a published power package: Bulus, M. (2023). pwrss: Statistical Power and Sample Size Calculation Tools. R package version 0.3.1. https://CRAN.R-project.org/package=pwrss. I’ve never used it before and downloaded it just to check I hadn’t made mistakes in my own calculations; later I also use it to speed some things up.

library(pwrss)
library(tidyverse)
library(scales)   # for percent(), used in the plots and table later

experiment <- function(d, m1 = 0, sd1 = 1, sd2 = 1, n1 = 50, n2 = n1, seed = NULL){
  if(!is.null(seed)){
    set.seed(seed)
  }
  x1 <- rnorm(n1, m1, sd1)
  x2 <- rnorm(n2, m1 + d, sd2)
  t.test(x1, x2)$p.value
}

reps <- 10000
res <- numeric(reps)

for(i in 1:reps){
  res[i] <- experiment(d = 0.55, n1 = 53)
}

Yes, that’s right, I’m using a for loop here. Why? Because it’s very readable, and very easy to write.
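(For what it’s worth, the same thing fits in a one-liner without the explicit loop; this is just an illustrative alternative of my own, and nothing later depends on it.)

# equivalent to the for loop above, written with replicate() instead
res_alt <- replicate(reps, experiment(d = 0.55, n1 = 53))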

Here’s what that gives us. My simulated power is 80%, Bulus’ package agrees with 80%, and 27% of the ‘significant’ (at alpha = 0.05) p values are in the fragile range. This isn’t the same as 26%, but it’s not a million miles away; it’s easy to imagine a few changes in the experiment that would lead to his 26% figure.

> # power from simulation
> 1 - mean(res > 0.05)
[1] 0.7964
> 
> # power from Bulus' package
> pwrss.t.2means(mu1 = 0.55, sd1 = 1, sd2 = 1, n2 = 53)
 Difference between Two means 
 (Independent Samples t Test) 
 H0: mu1 = mu2 
 HA: mu1 != mu2 
 ------------------------------ 
  Statistical power = 0.801 
  n1 = 53 
  n2 = 53 
 ------------------------------ 
 Alternative = "not equal" 
 Degrees of freedom = 104 
 Non-centrality parameter = 2.831 
 Type I error rate = 0.05 
 Type II error rate = 0.199 
> 
> # Of those experiments that have 'significant' results, what proportion are in 
> # the so-called fragile range (i.e. between 0.01 and 0.05)
> summ1 <- mean(res > 0.01 & res < 0.05) / mean(res < 0.05)
> print(summ1)
[1] 0.2746107

Changes in the difference and in the sample size

I made some arbitrary calls in that first run: a sample size of about 50 observations in each group, and a difference of about 0.5 standard deviations. What if I let the difference between the two populations be smaller or larger than this, and just set the number of observations to whatever is necessary to get 80% power? What change does this make to the proportion of p values that are ‘fragile’?

It turns out it makes a big difference, as we see in these two charts:

These are simulations still in the world where the researcher happens to guess the real world exactly right when they do their power calculation and pick a sample size to get 80% power. We see in the top chart that as the real world difference gets bigger, with constant power, the proportion of significant but ‘fragile’ p values goes up markedly. The second chart shows the same simulations, but focusing on the variation in sample size, which changes in compensation for the real world difference in populations to maintain the same power. Larger samples with the same power mean that you are looking for relatively smaller real world differences, and the proportion of significant p values that are ‘fragile’ gets smaller.

Here’s the code that did these simulations:

#--------------varying difference and sample sizes---------------
possible_diffs <- 10:200 / 100 # measured in standard deviations

# what sample size do we need to have 80% power
n_for_power <- sapply(possible_diffs, function(d){
  as.numeric(pwrss.t.2means(mu1 = d, power = 0.8, verbose = FALSE)$n[1])
})

prop_fragile <- numeric(length(possible_diffs))

# This takes some minutes to run, could be better if parallelised or done in
# Julia if we thought saving those minutes was important:
for(j in 1:length(possible_diffs)){
  for(i in 1:reps){
    res[i] <- experiment(d = possible_diffs[j], n1 = n_for_power[j])
  }
  prop_fragile[j] <- mean(res > 0.01 & res < 0.05) / mean(res < 0.05)
}

# Plot 1
tibble(prop_fragile, possible_diffs) |> 
  ggplot(aes(x = possible_diffs, y = prop_fragile)) +
  geom_point() +
  scale_y_continuous(label = percent) +
  labs(x = "Difference (in standard deviations) between two means",
       y = "Proportion of significant p values\nthat are between 0.01 and 0.05",
       title = "Two sample tests for difference between two means with power = 80%",
       subtitle = "t test for independent samples at a combination of sample size and population difference\nneeded to give the desired power. Both populations are standard normal distributions.")

# Plot 2
tibble(prop_fragile, n_for_power) |> 
  ggplot(aes(x = n_for_power, y = prop_fragile)) +
  geom_point() +
  scale_x_sqrt() +
  scale_y_continuous(label = percent) +
  labs(x = "Sample size needed to get 80% power for given difference of means",
       y = "Proportion of significant p values\nthat are between 0.01 and 0.05",
       title = "Two sample tests for difference between two means with power = 80%",
       subtitle = "t test for independent samples at a combination of sample size and population difference\nneeded to give the desired power. Both populations are standard normal distributions.")

Relaxing the assumptions

OK, so that was what we get when the power calculation was based on a true representation of the world, known before we did the experiment. Obviously this is never the case (or we wouldn’t need to do experiments): the actual difference between the two populations might be bigger or smaller than we expected, it might in fact be exactly zero, the shape and spread of the populations will differ from what we thought when we calculated the power, and so on.

I decided to try three simple breaks of the assumptions, to see what impact they have on the 27% of p values that were fragile:

  • The actual difference between populations is a random number, albeit on average equal to what was expected during the power calculation
  • the actual difference between populations is a coin flip between exactly what was expected (when the power calculation was made) and zero (i.e. the null hypothesis turns out to be true)
  • the actual difference between populations is a coin flip between a random number with average as expected and zero (i.e. a combination of the first two scenarios)

#------------------when true d is not what was expected---------------

reps <- 10000
res <- numeric(reps)

# we are going to let the actual difference deviate from that which was used
# in the power calculation, but say that on average the planned-for difference
# was correct
for(i in 1:reps){
  res[i] <- experiment(d = rnorm(1, 0.55, 0.5), n1 = 53)
}

# "actual" power:
1 - mean(res > 0.05)

# proportion of so-called fragile p values is much less
summ2 <- mean(res > 0.01 & res < 0.05) / mean(res < 0.05)

#---------when true d is same as expected except half the time H0 is true---------

for(i in 1:reps){
  res[i] <- experiment(d = sample(c(0, 0.55), 1), n1 = 53)
}


# proportion of so-called fragile p values is now *more*
summ3 <- mean(res > 0.01 & res < 0.05) / mean(res < 0.05)

#---------when true d is random, AND half the time H0 is true---------

for(i in 1:reps){
  res[i] <- experiment(d = sample(c(0, rnorm(1, 0.55, 0.5)), 1), n1 = 53)
}


# proportion of so-called fragile p values is now less
summ4 <- mean(res > 0.01 & res < 0.05) / mean(res < 0.05)

tibble(`Context` = c(
  "Difference is as expected during power calculation",
  "Difference is random, but on average is as expected",
  "Difference is as expected, except half the time null hypothesis is true",
  "Difference is random, AND null hypothesis true half the time"
), `Proportion of p-values that are fragile` = c(summ1, summ2, summ3, summ4)) |> 
  mutate(across(where(is.numeric), \(x) percent(x, accuracy = 1))) 

That gets us these interesting results:

Context                                                                    Proportion of p-values that are fragile
Difference is as expected during power calculation                         27%
Difference is random, but on average is as expected                        16%
Difference is as expected, except half the time null hypothesis is true    29%
Difference is random, AND null hypothesis true half the time               20%

There’s a marked variation here in the proportion of p values that are fragile. Arguably, the fourth of these scenarios is the closest approximation to the real world (although there is plenty of room for debate about this: how plausible are exactly-zero differences, really?). Either this, or the other realistic scenario (‘difference is random but on average is as expected’), gives a proportion of fragile p values well below the 27% we saw in our base scenario.
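As an aside, the jump in the third scenario is roughly what simple arithmetic predicts: when the null hypothesis is true, p values are uniform, so about four fifths of the few ‘significant’ results land between 0.01 and 0.05. Here is a rough back-of-envelope check of my own, using round numbers from the simulations above:

# half the studies have a true null (5% significant, 4/5 of those fragile),
# half have d = 0.55 with n = 53 (80% power, about 27% of significant results fragile)
p_sig_h0 <- 0.05; frag_h0 <- 0.04 / 0.05
p_sig_h1 <- 0.80; frag_h1 <- 0.27
(0.5 * p_sig_h0 * frag_h0 + 0.5 * p_sig_h1 * frag_h1) /
  (0.5 * p_sig_h0 + 0.5 * p_sig_h1)
# about 0.30, consistent with the 29% in the table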

Conclusion

There are just too many factors affecting the proportion of p values that will be between 0.01 and 0.05 to believe that variations in it reflect either an improvement or a worsening in research practices. These things include:

  • When the expected differences change, and sample sizes change to go with them for a given level of power, this materially affects the proportion of fragile p values we’d expect to see
  • When the real world differs from what the researcher expected when they did their power calculation, this also materially affects the proportion of fragile p values we’d expect to see
  • Anyway, researchers don’t all set their sample sizes to give 80% power, for various reasons, some of them good and some not so good

Final thought: none of the above tells us whether we have a replication crisis or not, and if so whether it’s getting better or worse. As it happens, I tend to think we do have one and that it is very serious. I think the peer review process works very poorly and could be improved, and academic publishing in general sets up terrible (and perhaps worsening) incentives. However, I think the criticism of the past decade or so has led to improvements (such as more access to reproducible code and data, more pre-registration, and generally raised awareness), which is in fact consistent with Bogdan’s substantive argument here. I just don’t think the ‘fragile’ p values are much evidence either way, and if we track them at all we should do so with great caution.


