Saturday, March 7, 2026

If Non-Standard Errors Are Measuring Real Uncertainty, Should We Report Them?


First, thanks all for following along with this series. I've been writing about Claude Code since December 13th, 2025, and while today's post is related to that series, I decided to make it a more general post because it's not technically about Claude Code. I mean it is and it isn't. It's a more general question about statistics that Claude Code made me wonder about, I suppose is my point, and therefore I wanted to make this one a more general, open-ended one that doesn't get catalogued as a Claude Code post.

That said, it has genuinely been a labor of love, and the support from paying subscribers has meant a lot. And if you're new to the substack, I wanted to say that I believe if you subscribe, you'll get almost daily updates from me. Most of the time it'll be one of four kinds of posts:

  1. Posts about AI, and Claude Code specifically. These are usually centered on "using AI agents for practical empirical research". They're not often think pieces, though sometimes they are. But mainly I'm trying to illustrate using CC for actual practical empirical work,

  2. Causal inference. Traditionally I write explainers on here where I'm talking about causal inference methodologies or explaining estimators or basic tasks around them. I have a new book coming out this summer, plus I do talks on causal inference, so you'll also hear me just talking about things related to that too.

  3. A long list of links to articles and whatnot I've been reading that week, known as "Closing Tabs" because it's articles I've left open in my browser. They're usually links to articles about love, relationships, causal inference, pop culture, AI, and then random stuff.

  4. None of the above. This has included a lot of history of thought stuff, but it's also been books I've been reading, like The Courage to Be Disliked, about Adlerian psychology.

So that's roughly the deal. The Claude Code posts are always free; the open tabs are free. The causal inference and the #4 posts are randomly paywalled on the date of the post. And then everything eventually goes behind a paywall after around four days. I also post podcast episodes now and then, but I'm behind on season 5 of that, and it's always free.

And that's it. That's the gist of this substack. So if you aren't a paying subscriber yet, these Claude Code posts are free for about four days before they go behind the paywall, so hopefully that's enough to show you what's going on. But if this is the day you feel like becoming a paid subscriber: at $5/mo, I think it's a deal.

But today's post has to do with statistics. While I got the idea from my Claude Code series, it isn't about that. It's about something I'd been thinking about for a while and am now more openly wondering about: the underlying uncertainty implied by it. So it won't go with the Claude Code series.

Many analysts, one dataset, one treatment assignment

I want to talk about something that's been sitting in the back of my mind since I started this series, and which I think the experiment I've been running the past few weeks has started to make concrete. It's about a concept known as the many-analyst design. And it's about what I think Claude Code accidentally lets you do with it.

If you've been following along, you know I've been running a structured experiment. Same dataset. Same estimator (Callaway and Sant'Anna, not-yet-treated comparison group). Same research question. The only things I let vary were covariate selection and software package. I gave Claude Code five packages: two in Python, two in Stata, one in R. Three trials per package. Fifteen total runs of the same study, holding almost everything fixed, and letting just one dimension of researcher discretion vary.

What you get is a forest plot showing the distribution of all the estimates from the same dataset coming from different researchers. But what is the variation? Well, it's not the sampling distribution of the estimator, as that comes from iid sampling with hypothetically constructed alternative samples. That's one of the traditional sources of uncertainty in statistics, and it isn't that one.

It also isn't uncertainty in the treatment assignment, which is a design-based randomization dating back to Fisher 1935 and the lady tasting tea. That's a second source of variation in treatment estimates one might construe, and it isn't that one either.

Under both the iid sampling and design approaches, you can construct intervals and do hypothesis tests that help quantify the uncertainty in your estimates. They feed into either directly, analytically derived intervals, or you can use computational resampling-style procedures to get them. They mean different things, but they're both efforts to quantify uncertainty around point estimates and to capture confidence around some sought-after answer to a target question.
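
To make the resampling route concrete, here is a minimal sketch: a nonparametric bootstrap interval for a difference in means. Everything below is simulated, and every number is invented purely for illustration; it's the shape of the procedure, not anything from the actual experiment.

```python
# Minimal sketch: nonparametric bootstrap interval for a difference in means.
# All data below are simulated purely for illustration.
import numpy as np

rng = np.random.default_rng(0)
treated = rng.normal(loc=2.0, scale=1.0, size=200)   # hypothetical treated outcomes
control = rng.normal(loc=1.5, scale=1.0, size=200)   # hypothetical control outcomes
estimate = treated.mean() - control.mean()

boot = []
for _ in range(5000):
    # resample each group with replacement and recompute the estimate
    t_star = rng.choice(treated, size=treated.size, replace=True)
    c_star = rng.choice(control, size=control.size, replace=True)
    boot.append(t_star.mean() - c_star.mean())

lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"estimate = {estimate:.3f}, 95% bootstrap interval = [{lo:.3f}, {hi:.3f}]")
```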

It isn't clear what we might be learning from a forest plot of estimates done by many analysts. Except there does seem to be a distribution of estimates one could construe exists, and which could therefore be explored with Claude Code. This is only me thinking out loud, but bear with me as I do it.

Three kinds of uncertainty: sampling

There's a standard way to think about uncertainty in empirical work, and it really only has two flavors.

The first one everybody learns in statistics and econometrics: sampling uncertainty. You drew one sample from a population. That's your dataset. It's a fixed size. It has specific people inside it. It will seem to you like it's the only dataset that could've ever existed, because it's the only dataset that ever existed, but there could've been others. Thus there are counterfactuals in sampling-based inference, but the counterfactuals are the counterfactual samples based on a randomizing process of constructing samples.

The point is that it isn't the only dataset that could've ever existed. Since this sample contains real people, picked randomly from a larger pool called "the population", you could've had a different dataset of the same size with somewhere between a slightly different and an entirely different group of people in it. And if you had done your procedures on all of them, you'd have different calculations. Each calculation is a constant at that moment in that specific dataset, but because the dataset process is random, the dataset itself is random, and therefore the calculations are random variables too. There exist as many possible datasets as there are combinations of drawing a fixed n units from the fixed N "population", which for both large n and large N is huge. But under central limit theorems, things settle down at some known rates.

All calculations based on the sample are therefore random variables. Regression coefficients are random variables. Standard errors are random variables. T-statistics are random variables. Anything that is a number you calculated based on that specific dataset is, paradoxically, a random variable under iid sampling. And so we can use that source of randomness to make deductions about the sampling distribution of the estimator across all of the hypothetical samples.
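
If it helps to see the thought experiment in code, here is a minimal simulation sketch: invent a "population", draw many samples of the same size n from it, run the same regression on each, and look at how the slope moves around across samples. All of the numbers are made up; only the logic matters.

```python
# Minimal sketch: the repeated-sampling thought experiment for a regression slope.
# The "population" and the true slope below are invented for illustration.
import numpy as np

rng = np.random.default_rng(1)
N, n, beta = 100_000, 500, 0.7
x_pop = rng.normal(size=N)
y_pop = beta * x_pop + rng.normal(size=N)       # population with true slope 0.7

slopes = []
for _ in range(1000):
    idx = rng.choice(N, size=n, replace=False)  # one hypothetical sample of size n
    x, y = x_pop[idx], y_pop[idx]
    slopes.append(np.polyfit(x, y, 1)[0])       # OLS slope in that sample

print(f"mean of slopes = {np.mean(slopes):.3f}, sd across samples = {np.std(slopes):.3f}")
```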

This is the cool part about inference. The t-statistic, for instance, tells you about the distance between your standard-error-scaled coefficient and zero. The p-value tells you the share of probability mass under a given distribution that lies beyond a t-statistic like the one in your sample. And so on and so forth, but it's pretty magical and amazes me that this was worked out so rigorously by so many people going back centuries.

These methods are interesting because under iid sampling-based inference, you're able to make some meaningful statements about your estimate's proximity to the population parameter you care about.

Three kinds of uncertainty: treatment assignment

The second way to quantify uncertainty in our estimates is design-based inference, which Fisher and others developed and which has become increasingly central in modern causal work. Here you hold the sample entirely fixed and ask: what would have happened under a different treatment assignment? The randomization is the source of uncertainty, not the sampling process.

The famous story by Fisher of the lady tasting tea appears to be the origin, but perhaps it's even older. This approach is the foundation of randomization inference, where you perturb not the sampling process (e.g., bootstrap, jackknife), but rather work through the combinations of all possible treatment assignments that could've occurred, assume a sharp null treatment effect of zero (or some other constant), and then plot the percent of all estimates under alternative assignments that you could've gotten. And as before with the t-statistic distribution yielding its own p-value, here we get the exact p-value: the probability that chance alone would've produced a test statistic as large as the one we got under the real treatment assignment. You've seen this, if nowhere else, in synthetic control spaghetti plots.
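
Here is a minimal sketch of that logic: hold the data fixed, impose the sharp null, reshuffle the assignment many times, and count how often the reshuffled estimate is at least as large as the observed one. The outcomes and the effect size are simulated for illustration only.

```python
# Minimal sketch: randomization inference under the sharp null of no effect.
# Outcomes, effect size, and assignment below are simulated for illustration.
import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(size=100)
d = np.zeros(100, dtype=int)
d[rng.choice(100, size=50, replace=False)] = 1     # the realized assignment
y = y + 0.4 * d                                    # add a true effect of 0.4
observed = y[d == 1].mean() - y[d == 0].mean()

perm_stats = []
for _ in range(10_000):
    d_star = rng.permutation(d)                    # an alternative assignment
    perm_stats.append(y[d_star == 1].mean() - y[d_star == 0].mean())

p_value = np.mean(np.abs(perm_stats) >= abs(observed))
print(f"observed = {observed:.3f}, randomization p-value = {p_value:.4f}")
```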

And then there is work by Abadie, Athey, Imbens, and Wooldridge that combines both.

Three kinds of uncertainty: researcher/analyst

Both frameworks are elegant and super interesting. I'm teaching probability and statistics this semester, and so I'm especially enamored with the sampling approach. It's really deep, and it's the source of so many innovations in statistics like the central limit theorem, the law of large numbers, bias, and consistency.

But the thing about them that I never noticed until I read the many-analyst design papers is that in both sampling and design-based inference, the researcher is held fixed. In sampling, you hold n fixed and, in principle, sample repeatedly, which gives us a way to talk about estimators and estimands in precise terms. In design, you hold the sample fixed, but you work through the reassignments. Both of these allow for precise statements about uncertainty.

But in both of them, you're still holding the researcher fixed. Neither the standard errors nor the subsequent calculations based on them, like t-statistics, confidence intervals, or p-values, ever consider what would've happened had someone else worked on the same project as you. Both methodologies treat the researcher as fixed. Neither one is designed to capture what happens when you vary the researcher.

The many-analyst design does. But it isn't clear to me just what it is we can pull from it, except that there are indeed sources of uncertainty, tracing through our sample and assignment to the estimates, that come from the researcher. And not because of publication bias, but rather because of the myriad of decisions that must be made under uncertainty throughout the creation of the analytical sample and the estimates applied to it.

When Silberzahn and colleagues sent the same dataset to 29 independent research teams and asked them all the same question, they documented something the profession had been reluctant to quantify: the researcher is not a transparent pipe. The same data, same question, produced a spread of estimates. Not from sampling variation. Not from treatment randomization. But rather from the choices analysts make, which are at least partially endogenous to who they are, what software they trained on, what their advisor told them, what they read last month.

Listen to what they said at the end about that study:

“These findings suggest that significant variation in the results of analyses of complex data may be difficult to avoid, even by experts with honest intentions. Crowdsourcing data analysis, a strategy in which numerous research teams are recruited to simultaneously investigate the same research question, makes transparent how defensible, yet subjective, analytic choices influence research results.”

See, that variation is real. It's really a source of uncertainty: different team, same data, same question, same experiment, different calculation, different results. Different facts? Different truth? Which is it? Is the answer A or is it B?

We're used to this in some ways. Five people study the minimum wage and come to five conclusions. Why? Some used city-level employment data, some used state panel data, some looked at the UK, some were focused on the nineteenth century. Those were different samples and different treatment assignments, and therefore need not lead to the same estimand.

But when ten researchers working on the same question, using the same dataset and the same treatment assignment, come to different conclusions, it can't be any of those things. And the standard errors are correct in the sense that they assume the same team would've been handed every hypothetical sample, but those standard errors don't measure this other source of uncertainty. The variation in estimates that hypothetically comes from perturbing the researcher does.

Now here's the thing about the many-analyst design as a program: it's mostly theoretical. You cannot actually send your dataset to 185 independent teams every time you want to publish a paper. When it has been done in these papers, my sense is that it has been to document sources of bias in science. The goal was to document a fact about the world, to prove this third kind of uncertainty exists, not to propose a workflow any individual researcher could follow.

But now I'm wondering otherwise.

What Claude Code changes

The combinatorics of empirical research are staggering in a way that's easy to underappreciate. Think about it concretely. From raw data cleaning through estimation through table construction, you might face ten major decision points. At each, you might have two reasonable options. That's 2^10 possible situations that could've occurred by the time the estimates were calculated, or 1,024.

If there were 3 options for each of those ten tasks (e.g., cleaning, measurement), then it's 3^10 possible resulting situations, or 59,049. That's 59k different hypothetical estimates.
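
The arithmetic is easy to sketch. Below is a toy enumeration of hypothetical decision points; the node names and options are all invented, but the counting is the point.

```python
# Minimal sketch: enumerate every combination of a few hypothetical decision points.
# The decision names and options are invented for illustration.
from itertools import product

decisions = {
    "outlier_rule":  ["keep", "trim_1pct", "winsorize"],
    "covariates":    ["none", "demographics", "demographics_plus_region"],
    "sample_window": ["full", "post_2000", "balanced_panel"],
    "clustering":    ["state", "county", "none"],
}

paths = list(product(*decisions.values()))
print(f"{len(paths)} distinct analysis paths from just {len(decisions)} decisions")
# 3 options at each of 4 nodes gives 3**4 = 81; ten nodes at 3 options gives 3**10 = 59,049
```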

None of that is usually reported in a study. Researchers historically didn't even share their code. They didn't clearly articulate their design choices. They didn't show the robustness of their estimates to dropping this or that differently, or including this or that differently. Most probably don't even remember the forks in the road they took to get here. And yet if quite different choices could've been made, or would've been made by a different team, and the calculations at the end would've changed as a result, then the estimates are random variables for a third reason beyond sampling or treatment assignment.

Well, here's the thing. Finding these sources of possible variation automatically is hard. It can be hard for the researcher to see, since it's their own footprints and they may be too close to it. But then working through all those perturbations, or a large randomized sample of them, can also be hard. There's no standardized package to do this, and it's not obvious how we'd do it, except that here I'm noting it's a source of uncertainty, and therefore it's not clear why it wouldn't be prioritized when constructing intervals.

What Claude Code lets you do is automate the perturbation of a subset of those forks. Not all of them: enumerating every discretionary node in a study is itself an unsolved problem, and I'll come back to that. But you can pick a node. Covariate selection is a natural one, because it's genuinely discretionary. There is no algorithm that tells you which covariates to include to satisfy parallel trends in diff-in-diff, since parallel trends is not testable. And so different reasonable analysts will make different reasonable choices. Package selection is another, for reasons my experiment made pretty vivid, as 75% of the total variation in the outcome came from whether you used Python, Stata, or R (concerning).
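
Here is roughly what perturbing the covariate-selection node looks like, as a minimal sketch. A plain OLS with a treatment dummy stands in for whatever estimator the study actually uses (Callaway and Sant'Anna in my experiment), and the covariate sets and data are invented for illustration.

```python
# Minimal sketch: perturb one discretionary node (covariate selection) and
# collect the treatment estimate from each choice. Data and covariate sets
# are invented; OLS stands in for the study's actual estimator.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 2000
df = pd.DataFrame({
    "treated": rng.integers(0, 2, n),
    "age":     rng.normal(40, 10, n),
    "income":  rng.normal(50, 15, n),
    "urban":   rng.integers(0, 2, n),
})
df["y"] = 1.0 * df["treated"] + 0.02 * df["age"] + 0.01 * df["income"] + rng.normal(size=n)

# each entry is one "reasonable analyst's" covariate choice
covariate_sets = [
    [],
    ["age"],
    ["age", "income"],
    ["age", "income", "urban"],
]

estimates = []
for covs in covariate_sets:
    rhs = " + ".join(["treated"] + covs)
    fit = smf.ols(f"y ~ {rhs}", data=df).fit()
    estimates.append(fit.params["treated"])

print(pd.Series(estimates, index=[" + ".join(c) or "none" for c in covariate_sets]))
```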

The harder problem

There's a question I raised in the draft of this post and then kind of skated past, so let me come back to it directly.

Identifying the discretionary nodes (all of them, not just covariate selection) is independently a hard problem. In my experiment I specified the node upfront. I said: the only thing that varies is covariates. That's a controlled perturbation of one dimension. But in a real study, the discretionary nodes are everywhere, and you often don't know you're at one. You think you're making the obvious choice and you don't realize there were ten other reasonable choices you could have made. The many-analyst design forces you to see that by varying the team.

But I haven't solved the problem of having Claude enumerate all the discretionary nodes in a given study automatically. That's something I want to work on, and I think it's tractable. Something like: go through this pipeline step by step and flag every point where a reasonable analyst might have done something different. But I haven't built that yet.

But I'm thinking that if you were to construct intervals from analyst uncertainty, they would be based on perturbations around endogenous, not exogenous, nodes. Endogenous nodes are the ones that different researchers would've or might've chosen differently. Exogenous nodes don't vary across teams. So coding "African American" as 2 on the race variable might seem like a discretionary node, but it only would be if there were disagreement about it. And maybe there wouldn't be for that one.

But Claude Code could theoretically find all those nodes. It could find all of the discretionary nodes in a pipeline for you, write the code in a way that perturbs around them, and then produce a finite number of "situations" at the end, from which estimates are calculated each time. And I think you should be able to work out a p-value. How often is that node pivotal? How often do you find calculations as large as yours?
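
I don't have that built, but the last step is easy to sketch: given estimates from every enumerated situation, you can ask what share are at least as large as the one you reported, and how often the sign flips. Every number below is invented for illustration.

```python
# Minimal sketch: summarize the perturbation distribution of estimates.
# The perturbed estimates and the reported estimate are invented.
import numpy as np

perturbed = np.array([0.12, 0.35, 0.28, -0.05, 0.41, 0.30, 0.22, 0.18])  # hypothetical estimates
reported = 0.35                                                           # the spec you published

share_as_large = np.mean(perturbed >= reported)
share_sign_flips = np.mean(np.sign(perturbed) != np.sign(reported))

print(f"share of situations with an estimate >= reported: {share_as_large:.2f}")
print(f"share of situations where the sign flips:         {share_sign_flips:.2f}")
```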

I think Claude Code could do this for us, do it fast, and do it correctly. I think we could report a forest plot of those estimates.

What I'm not sure of is the large-sample properties of large-N teams. It's not clear to me why it wouldn't follow the same central limit theorems as the rest, but I suppose I pause because it's not entirely clear what the estimand even is if there are alternative measurement and package choices one could make in a given sample.
