Last week I posted about the proportion of p values that would be 'fragile' under various types of scientific experiments or other inferences. The proportion of p values that is fragile was defined as the proportion that are between 0.01 and 0.05. I was reacting to the question of whether it is a good thing that the proportion fragile in the psychology literature seems to have declined over a decade or two from about 32% to about 26%. I had concluded that this didn't really tell us much about progress in response to the replication crisis.
There was some good discussion on BlueSky after I posted on this, and I would now qualify my views a little. The key points that people made, and that I hadn't adequately taken into account in my presentation, were:
- When scientists pick an appropriate sample size via a power calculation, they are most often basing it on the sample size needed to give a certain amount of power for the "minimum difference we want to detect", rather than the difference they are actually expecting (a minimal example of this sort of calculation is sketched just after this list)
- One of my key plots showed an apparent increase in fragile p values even with constant power when the planned-for difference between means was large, when what was actually going on was probably an artefact of the tiny sample sizes in that very artificial scenario (to be honest, I am still unsure as I write this exactly what was happening, but I am fairly satisfied that it is not terribly important whatever it is, and it certainly doesn't need to be my main plot on the subject)
- More general thinking about "if this decrease in 'fragile' p values isn't showing something positive, then what is it showing?", which gives me a slightly more nuanced view.
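As a side note on that first bullet, here is a minimal sketch of the sort of power calculation meant, using base R's power.t.test() rather than the pwrss package used in the main code further down; it is an illustration only, not part of the analysis:

# Sample size per group for 80% power to detect a 'minimum difference we want
# to detect' of 0.2 standard deviations, at the usual two-sided 0.05 level:
power.t.test(delta = 0.2, sd = 1, sig.level = 0.05, power = 0.80)
# gives n of roughly 394 per group. If the actual difference in the population
# turns out to be bigger than 0.2 the study has more than 80% power; if it is
# smaller, less.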
This made me decide I should look more systematically into the difference between the "planned for" difference between two samples and the "actual" difference in the population. I had done some simulations of random variation of the actual difference around the planned-for one, but not systematically and comprehensively enough.
What I've done now is systematically compare every combination of a sampling strategy that gives 80% power for "minimum detectable differences" from 0.05 to 0.60 (12 different values, spaced 0.05 apart) with actual differences in means from 0.0 to 1.0 (11 different values, spaced 0.10 apart). With 10,000 simulations at each combination we have 12 x 11 x 10,000 = 1.32 million simulations.
We can see that many reasonable combinations of "planned for" and "actual" differences between the two populations give proportions of fragile p values that are quite different from 26%. In particular, in the scenario where the actual difference is bigger than the "minimum detectable difference" that the sample size was calculated to detect with 80% power (which is probably what most researchers are aiming for), the proportion fragile quickly gets well below 26%.
That proportion ranges from 80% when the true difference is zero (i.e. the null hypothesis is true) and the p value distribution is uniform over [0, 1], down to well below 10% when the sample size is enough for high (much greater than 80%) power, enough to pick up the true difference between the two populations.
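For what it's worth, the 80% figure under the null can be checked directly: when the null hypothesis is true the p value is uniform on [0, 1], so the share of significant results falling between 0.01 and 0.05 is 0.04 / 0.05 = 80%. A small standalone simulation (separate from the main code below) agrees:

# Two samples drawn from identical populations, so the null is true:
set.seed(123)
p <- replicate(10000, t.test(rnorm(50), rnorm(50))$p.value)
sum(p > 0.01 & p < 0.05) / sum(p < 0.05)
# roughly 0.80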
I think this presents the issues better than my blog last week did. Here's how I think about the issue now:
- Around 26% of p values will indeed be 'fragile' when the sample size has been set to give 80% power on the basis of a detectable difference between two populations that is indeed roughly what the actual difference turns out to be (a rough back-of-envelope check of this figure is sketched just after this list).
- In general, a proportion of fragile p values higher than this suggests that experiments are under-powered by design, or that decent designs with 80% power turned out to be based on minimum-detectable differences that often are not the actual differences in reality, or that something else sinister is going on.
- If in fact many or most experiments are based on realities where the actual difference between means is bigger than the minimum-detectable difference that the sample size was chosen for at 80% power, we would expect the proportion of p values that are fragile to be noticeably less than 26%.
- Taking these together, it is reasonable to say that the proportion of fragile p values declining from 32% to 26% is a good thing; but that 26% is probably still a bit too high, and shouldn't be treated as more than a very artificial benchmark; and that we can't be sure what is driving the decline.
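Here is the rough back-of-envelope check of the 26% benchmark promised in the first bullet above. It uses a normal approximation to the distribution of the test statistic rather than the noncentral t, so it is only indicative, but it lands close to that 26% figure:

# Noncentrality parameter that gives 80% power at the two-sided 0.05 level:
delta <- qnorm(0.975) + qnorm(0.80)
# Power at the 0.05 and 0.01 levels (ignoring the negligible lower tail):
power_05 <- 1 - pnorm(qnorm(0.975) - delta)   # 0.80 by construction
power_01 <- 1 - pnorm(qnorm(0.995) - delta)   # about 0.59
# Share of significant results that are 'fragile', i.e. between 0.01 and 0.05:
(power_05 - power_01) / power_05              # about 0.26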
Here's the R code to run these 1.32 million simulations, making use of the foreach and doParallel R packages for parallel computing to speed things up a bit:
library(pwrss)
library(tidyverse)
library(glue)
library(foreach)
library(doParallel)
#' Function to run a two sample experiment with t test on difference of means
#' @returns a single p value
#' @param d difference in means of the two populations that samples are drawn
#'   from. If sd1 and sd2 are both 1, then d is a proportion of that sd and
#'   everything is scaleless.
experiment <- function(d, m1 = 0, sd1 = 1, sd2 = 1, n1 = 50, n2 = n1, seed = NULL){
  if(!is.null(seed)){
    set.seed(seed)
  }
  x1 <- rnorm(n1, m1, sd1)
  x2 <- rnorm(n2, m1 + d, sd2)
  t.test(x1, x2)$p.value
}
#--------------varying difference and sample sizes---------------
pd1 <- tibble(planned_diff = seq(from = 0.05, to = 0.60, by = 0.05)) |>
  # what sample size do we need to have 80% power, based on that planned
  # "minimum difference to detect"?:
  mutate(n_for_power = sapply(planned_diff, function(d){
    as.numeric(pwrss.t.2means(mu1 = d, power = 0.8, verbose = FALSE)$n[1])
  }))
# the actual differences, which can be from zero up to much bigger than the
# minimum difference we planned enough power to detect:
pd2 <- tibble(actual_diff = seq(from = 0, to = 1.0, by = 0.1))
# Number of simulations to do for each combination of planned power (based on a
# given 'minimum difference to detect') and actual power (based on the true
# difference). When the true difference is zero in particular, only about 1/20
# of reps will come up with a 'significant' difference, so 10000 reps in total
# gives us a sample of about 500 significant tests to calculate the proportion
# that are fragile from, so still not huge. If I'd bothered I could have varied
# the number of reps for each combination based on the number that is really
# needed, but I didn't bother:
reps_each <- 10000
# combine the planned-for and actual differences in a data frame with a row
# for each repeated sim we're going to do:
data <- expand_grid(pd1, pd2) |>
  mutate(link = 1) |>
  full_join(tibble(link = 1,
                   rep = 1:reps_each),
            relationship = "many-to-many", by = "link") |>
  select(-link) |>
  mutate(p = NA)
print(glue("Running {nrow(data)} simulations. This will take a while."))
# set up a parallel processing cluster
cluster <- makeCluster(7)
registerDoParallel(cluster)
clusterEvalQ(cluster, {
  library(foreach)
})
clusterExport(cluster, c("data", "experiment"))
results <- foreach(i = 1:nrow(data), .combine = rbind) %dopar% {
  set.seed(i)
  # the row of data for just this simulation:
  d <- data[i, ]
  # perform the simulation and capture the p value:
  d$p <- experiment(d = d$actual_diff,
                    n1 = d$n_for_power,
                    seed = d$rep)
  # return the result as a row of data, which will be rbind-ed into a single
  # data frame from all the parallel processes:
  return(d)
}
stopCluster(cluster)
#--------------summarise and present results--------------------------
# Summarise and calculate the proportions:
res <- results |>
  group_by(planned_diff, n_for_power, actual_diff) |>
  summarise(number_sig = sum(p < 0.05),
            prop_fragile = sum(p > 0.01 & p < 0.05) / number_sig)
Here's how I draw two plots to summarise that: the line plot shown above, and a heatmap that is shown below the code. Overall I think the line plot is simpler and clearer to read.
# Line chart plot:
res |>
  mutate(pd_lab = glue("80% power planned for diff of {planned_diff}")) |>
  ggplot(aes(x = actual_diff, y = prop_fragile)) +
  facet_wrap(~pd_lab) +
  geom_vline(aes(xintercept = planned_diff), colour = "steelblue") +
  geom_hline(yintercept = 0.26, colour = "orange") +
  geom_point() +
  geom_line() +
  scale_y_continuous(labels = scales::percent) +
  labs(y = "Proportion of significant p values that are between 0.01 and 0.05",
       x = "Actual difference (in standard deviations)",
       subtitle = "Vertical blue line shows where the actual difference equals the minimum difference to detect that the 80% power calculation was based upon.
Horizontal orange line shows the observed average proportion of 'fragile' p values in the recent psychology literature.",
       title = "Fragility of p values in relation to actual and planned differences in a two sample t test.")
# Heatmap:
res |>
  ggplot(aes(x = actual_diff, y = as.ordered(planned_diff), fill = prop_fragile)) +
  geom_tile() +
  geom_tile(data = filter(res, prop_fragile > 0.25 & prop_fragile < 0.31),
            fill = "white", alpha = 0.1, colour = "white", linewidth = 2) +
  scale_fill_viridis_c(labels = scales::percent, direction = -1) +
  theme(panel.grid.minor = element_blank()) +
  labs(y = "Smallest detectable difference for 80% power",
       x = "Actual difference (in standard deviations)",
       fill = "Proportion of significant p values that are between 0.01 and 0.05:",
       subtitle = "Sample size is based on 80% power for the difference on the vertical axis. White boxes indicate where the proportion of fragile significant p values is between 25% and 31%.",
       title = "Fragility of p values in relation to actual and planned differences in a two sample t test.")
Here's the heatmap we get from that. It's prettier, but I think not actually as clear as the simpler, faceted line plot.
That's all for now. I don't think there are any big implications of this; just a better understanding of what proportion of p values we would expect to be in this fragile range, and what impacts on it.
I'm planning on some real life data, on a quite different topic, in my next post.
