Sunday, November 9, 2025

Distribution of p-values under the null hypothesis for discrete data


Motivation

A few months ago, in a side skirmish within the great p-curve controversy, Richard McElreath mentioned that p-values under the null hypothesis are not always uniformly distributed, as is commonly claimed. This prompted me to take a look at the phenomenon. I’ll admit I had in my head the basic idea that p-values are indeed uniformly distributed if the null hypothesis is true. It turns out this is only ‘generally’ the case, not always.

As is often the case for this blog, the main motivation was to make sure I understood something myself, so there’s nothing particularly new for the world in this post. But it might be interesting for some. There were a few odd fishhooks.

I wanted to check the case where we’re comparing the number of “successes” in a relatively small sample of binary success/failure outcomes, split into two groups. The null hypothesis is that the underlying probability of success is the same in each group. I wanted to see the distribution of the p-value for a test of this null hypothesis with sample sizes of 10, 100 or 1,000 observations in each group; for various values of the underlying probability of success; and for when the groups’ sample size is random but with mean 100.
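To make the setup concrete, here is a minimal sketch of one simulated dataset under this null hypothesis (the sample size and probability are just illustrative, and I use the built-in fisher.test() here rather than the methods discussed below):

# one simulated dataset under the null: two groups, same probability of success
set.seed(123)
n <- 10     # observations per group (illustrative)
p <- 0.5    # same underlying probability of success in both groups
x1 <- rbinom(1, size = n, prob = p)   # successes in group 1
x2 <- rbinom(1, size = n, prob = p)   # successes in group 2

# two-sided test of the null that both groups share one underlying probability
fisher.test(matrix(c(x1, n - x1, x2, n - x2), nrow = 2))$p.value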

Calculating the p value

The fishhooks were in how to calculate the p value. I actually went for three different ways:

  • My lazy approximation method is to estimate the variance of the difference between the two sample proportions under the null hypothesis and rely on asymptotic normality to get a probability of the value being as extreme as it actually is. This is known to be not particularly accurate, especially when the sample is small or the probability of success is close to 1 or 0. The advantage is I didn’t have to look anything up and it was easy to vectorise (a one-off sketch of the calculation appears just after this list). This method is called pval_hand in the plot below.
  • A better method is to use the Fisher exact test, as per the famous lady-tasting-tea analysis, but the out-of-the-box implementation I was using (from the corpora R package by Stephanie Evert) doesn’t work when the observed successes in both samples are zero. This method is called pval_fisher in the plot below.
  • The most out-of-the-box method of all is to use the prop.test() function from the stats package built into R. The disadvantage here was the effort of vectorising this function to run efficiently with large numbers of simulations. This method is called pval_proptest in the plot below.
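As promised, here is that one-off sketch of the three approaches on a single made-up example (x1 and x2 successes out of n trials in each group; the numbers are arbitrary):

# illustrative example: x1 and x2 successes out of n trials in each group
x1 <- 3; x2 <- 8; n <- 20

# 1. hand-made normal approximation (pval_hand)
p1 <- x1 / n; p2 <- x2 / n
pmid <- (p1 + p2) / 2                  # pooled proportion under the null
se <- sqrt(2 * pmid * (1 - pmid) / n)  # standard deviation of p1 - p2
2 * (1 - pnorm(abs(p1 - p2) / se))

# 2. Fisher exact test (pval_fisher)
corpora::fisher.pval(x1, n, x2, n)

# 3. built-in prop.test (pval_proptest)
prop.test(c(x1, x2), c(n, n))$p.value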

Because I was nervous about whether these methods might give materially different results, I started by comparing the results they gave, with different sample sizes and underlying parameters, using just 1,000 repetitions for each combination of sample size and parameter. This gives this comparison:

So we can see that the pval_fisher and pval_proptest methods give effectively the same results, while my hand-made method has a large number of discrepancies. Because of this I decided to stick with Evert’s corpora::fisher.pval. I just hardened it up with a wrapper function that defines the p value (probability of seeing data as extreme as this, if the null hypothesis is true) to be 1 if the observed successes are 0 in both samples:

library(tidyverse)
library(corpora)
library(GGally)
library(glue)
library(scales)
library(frs) # for svg_png()

#' Version of fisher.pval that won't break if k1 and k2 are both 0
tough_fisher <- function(k1, n1, k2, n2, set_both_zero = 1, ...){
  problems <- k1 + k2 == 0
  k1[problems] <- 1
  temp <- corpora::fisher.pval(k1, n1, k2, n2, ...)
  temp[problems] <- set_both_zero
  return(temp)
}
# note: needed a decision on what to do when k1 and k2 are both zero. I think
# the p value is 1 here, as p is the "probability of seeing data at least as extreme
# as this if the null hypothesis of no difference is true". Not sure why Fisher's
# exact test returns an error, worth looking into that.
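A quick check of the wrapper’s behaviour (the first call is the all-zeroes case that errors if passed straight to corpora::fisher.pval):

tough_fisher(0, 10, 0, 10)   # both samples all failures: returns 1 by the convention above
tough_fisher(3, 10, 7, 10)   # ordinary case: same as corpora::fisher.pval(3, 10, 7, 10)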

Results

So with that problem out of the way, I set out to calculate a lot of p-values. I did this in the situations

  • where the two groups were equally sized with 10, 100, or 1,000 observations each; or where they have a random number of observations, with mean 100, but still the same for each of the two groups; and
  • with an underlying probability of success of 0.2, 0.5, or 0.8.

The intuition for part of the reason the p-values aren’t uniformly distributed is that with a finite sample and a discrete outcome, there are only a finite number of possible values for the p-value. So of course it can’t have a perfectly uniform distribution, which would mean a p-value taking any value between 0 and 1 with equal probability. Further down we see (roughly) how many possible values the p-value takes. The smaller the sample sizes, the more we’d expect to see divergence from uniformity.
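For example, with 10 observations per group there are only 11 × 11 = 121 possible pairs of success counts, and fewer distinct p-values still, since many pairs share a value. A quick enumeration, using the tough_fisher wrapper defined above:

# enumerate every possible outcome with n = 10 observations per group
grid <- expand.grid(x1 = 0:10, x2 = 0:10)
pv <- tough_fisher(grid$x1, 10, grid$x2, 10)
length(unique(pv))   # well under the 121 possible outcome pairs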

And that’s exactly what we do see. Here is the full distribution of p-values for 100,000 simulations of each combination:

… and here is the distribution of just the p-values that are conventionally “significant”, i.e. below 0.05:

I think there are a few interesting subtleties here. In particular:

  • When the size of the two groups is a random variable (but still equal for the two groups) the distribution of p-values is much more uniform (but still not exactly uniform). Basically there are many more possibilities for the p-values to take with this extra randomness.
  • The p-values are more non-uniform when the underlying probability of success is 0.5 rather than 0.2 or 0.8.
  • The p-values can still be very non-uniform even with a large sample size of 1,000 observations per group (if the underlying probability of success is close to 0.5).

Here’s a potentially interesting little insight into one of the reasons why this works that way – the number of different unique p-values that we get for each combination of sample size and underlying probabilities:

  size_lab    `Prob=0.2` `Prob=0.5` `Prob=0.8`
1 n=10                32         38         31
2 n=100              391        533        403
3 n=1000            2915       3853       2903
4 n=Pois(100)       9812      12582       9898

I’m not drawing any conclusions for any metascience debates here. Just noting this interesting phenomenon in the distribution of p-values.

Here’s the rest of the code for doing these simulations.

# Takes about 30 seconds with 100,000 reps
st <- system.time({
  for(reps in c(1e3, 1e5)){
    set.seed(42)
    
    d <- expand_grid(
      prob = rep(c(0.2, 0.5, 0.8), each = reps),
      size = c(10, 100, 1000, NA)
    ) |> 
      mutate(size_lab = ifelse(is.na(size), "n=Pois(100)", glue("n={size}")),
             prob_lab = glue("Prob={prob}")) |> 
      mutate(size = ifelse(is.na(size), rpois(n(), 100), size)) |> 
      mutate(x1 = rbinom(n(), size = size, prob = prob),
             x2 = rbinom(n(), size = size, prob = prob),
             # observed proportions p1 and p2 from the two populations
             p1 = x1 / size,
             p2 = x2 / size,
             # under the null hypothesis, the shared probability of both pops:
             pmid = (p1 + p2) / 2,
             # observed difference of the two proportions:
             delt = abs(p1 - p2),
             # variance of each of p1 and p2: p(1-p)/n
             s1 = pmid * (1 - pmid) / size,
             # standard deviation of the difference: sqrt of the sum of two of those variances
             sddelt = sqrt(s1 + s1)) |> 
      # calculate p values
      mutate(pval_hand = 2 * (1 - pnorm(delt / sddelt)),
             pval_fisher = tough_fisher(x1, size, x2, size),
             pval_proptest = NA)
  
    if(reps < 10000){
      # when the number of reps is fairly small, I draw a pairs plot just to
      # compare the different ways of calculating p values:
      # - prop.test (out of the box R method). I couldn't find an easy way to
      #   vectorize this, which is why it is only done here, via a loop, when reps is small
      # - pval_hand (my handmade approximate method)
      # - pval_fisher (Fisher exact test, toughened up as above to give 1 when both k1 and k2 are 0)
      # The main conclusion from this is that my by-hand approximation isn't great!
      system.time({
        for (i in 1:nrow(d)){
          x <- d[i, ]
          d[i, "pval_proptest"] <- prop.test(c(x$x1, x$x2), c(x$size, x$size))$p.value
        }
      })
      # 5 seconds for reps=1000
      
      plot1 <- function(){
        # pairs plot comparing the three ways of calculating the p value
        d |> 
          select(pval_hand:pval_proptest, size_lab) |> 
          ggpairs() |> 
          print()
      }
      
      svg_png(plot1, glue("0299-pairs-{reps}"), w = 10, h = 8)      
    }
  
  
    plot2 <- d |> 
      ggplot(aes(x = pval_fisher)) +
      facet_grid(size_lab ~ prob_lab, scales = "free_y") +
      geom_histogram(fill = "steelblue") +
      scale_y_continuous(label = comma) +
      labs(title = "Distribution of all p values when a null hypothesis is true",
          subtitle = "Equal size binomial samples drawn from two populations with same underlying probability",
          x = "P value from Fisher's exact test
    (when zero positive cases in both samples, p value is set to 1)",
          y = glue("Count of simulations (out of {comma(reps)})"))
  
    plot3 <- d |>
      filter(pval_fisher < 0.05) |> 
      ggplot(aes(x = pval_fisher)) +
      facet_grid(size_lab ~ prob_lab, scales = "free_y") +
      geom_histogram(fill = "steelblue") +
      scale_y_continuous(label = comma) +
      labs(title = "Distribution of significant (<0.05) p values when a null hypothesis is true",
           subtitle = "Equal size binomial samples drawn from two populations with same underlying probability",
           x = "P value from Fisher's exact test
    (when zero positive cases in both samples, p value is set to 1)",
           y = glue("Count of simulations (out of {comma(reps)})"))
  
      
    svg_png(plot2, glue("0299-histogram-{reps}"), w = 9, h = 6)     
    svg_png(plot3, glue("0299-histogram-sig-only-{reps}"), w = 9, h = 6)      
  
  }
})

print(st)

# Number of unique p-values
d |> 
  group_by(prob_lab, size_lab) |> 
  summarise(number_p_values = length(unique(pval_fisher))) |> 
  spread(prob_lab, number_p_values)


