This is part of my ongoing series on using Claude Code for practical applied empirical work valued by the quantitative social sciences. Specifically, it continues a difference-in-differences (DiD) thread that I started the other day, which you can find here:
In part 1, the DiD thread started with a slightly different experiment illustrating pure "hostile critic audits of your code", but when I looked at the data this week, I decided to change it, as I became less interested in illustrating the "referee2" auditor once I saw certain things (at least not yet). So I've decided to pivot this DiD series in a slightly different direction: a systematic investigation of what AI agents actually do when you hand them a real empirical problem and walk away, which is a variation on the "multi-analyst design" that Nick Huntington-Klein and others have been working on these past few years. If you're just joining, you can learn a little by reviewing the last post and video, but you also might be able to just start here, since in this video I go over the experiment I ran before the video started (running 15 sub-agents to do the replication). The first video, though, gives you the intuition for why I started running multiple agents in the first place.
Finally, the paper being replicated is this AEJ: Policy by Dias and Fontes, which studied a Brazilian mental health reform's effect on municipality-level homicide rates using difference-in-differences, specifically de Chaisemartin and D'Haultfoeuille (2020, AER). In this one, though, I use the Callaway and Sant'Anna method; both are often used with staggered treatment adoption.
Thanks again, everyone, for your support of the substack. If you are a paying subscriber, thank you! If you are not, enjoy! The Claude Code series remains free, but after a few days it goes behind a paywall. So if you're just joining, consider becoming a paying subscriber so you can read the other 25 posts I've done on Claude Code since early-to-mid December of 2025. The price is $5/month!
The previous Claude Code video walkthroughs have been quite long, often 60 to 90 minutes. In this one I tried to rein it in so that it's at least somewhat watchable. But it still came in at 38 minutes, and that required pausing it, too, leaving us with a bit of a cliffhanger. Still, I'll post the third part of the series next week, so for now let me just walk you through this one.
As I said above, if you watched the first video in this series, you saw me run a version of today's experiment. But leading into today's video, I peeked at the results, and I decided I was more interested in something different than I originally was, so last night I redid the whole thing (with Claude Code's help). The bones are largely the same in that I'm comparing sub-agent-driven coding in three languages (Python, R, and Stata) across five packages (csdid, csdid2, did, differences, diff-diff). So those parts are the same. And as I said, all of them run Callaway and Sant'Anna on the Brazilian municipality data.
But I decided to tighten the isolation protocol (I'll explain that) after reviewing the output from the part 1 experiment. I also adjusted the instructions and expanded the forensic analysis I do afterward. This led to a 52-page "beautiful deck" (built using the /compiledeck skill I use constantly, which is based on my "rhetoric of decks" essay that I also feed to Claude Code when creating decks). So think of everything going forward as a revision and extension of the original version.
As I discussed, there’s a literature I’ve been tremendous all in favour of for a number of years now which is usually referred to as the “multi-analyst design”. As I perceive it, this literature started with Silberzahn et al. in 2018, who gave 29 analysis groups the identical dataset and the identical query — whether or not dark-skinned soccer gamers obtain extra pink playing cards. The estimates ranged from strongly unfavourable to strongly optimistic. Identical information, similar query, wildly totally different solutions.
Nick Huntington-Klein and coauthors did something similar in 2021. They recruited seven economists to independently replicate two published causal results. Each received the same data and the same research question. No two replicators reported the same sample size. The standard deviation across their estimates was three to four times the typical reported standard error. I found that super fascinating for a few reasons. One, the standard errors we report are meant to approximate the standard deviation of the sampling distribution of the estimator. And yet Nick's team found a standard deviation across replicators that was four times larger than the mean standard error, which suggests they were quantifying a source of uncertainty that isn't remotely what standard errors are measuring. The other thing I was fascinated by was the idea that the confidence interval from any individual analysis dramatically understates the true uncertainty about the result: what if we had given this same project to someone else? Would they have made the same decisions? It depended on the number of researcher degrees of freedom and their relevance, as inputs, to the final estimates.
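To make that ratio concrete, here is a minimal sketch with made-up numbers (these are illustrative, not Huntington-Klein's actual data) of how you would compare the spread of replicators' point estimates to the uncertainty each of them reported:

```python
from statistics import mean, stdev

# Hypothetical point estimates and reported standard errors from
# seven independent replicators (illustrative numbers only).
estimates = [0.10, -0.05, 0.30, 0.02, -0.12, 0.25, 0.08]
std_errors = [0.04, 0.05, 0.03, 0.06, 0.05, 0.04, 0.05]

# Cross-replicator spread of the point estimates.
spread = stdev(estimates)

# Typical within-analysis uncertainty each replicator reported.
mean_se = mean(std_errors)

# If all variation were sampling noise, this ratio would be near 1.
# A ratio of 3 to 4 means the replicators are exposing a source of
# uncertainty that standard errors do not capture.
ratio = spread / mean_se
print(f"spread = {spread:.3f}, mean SE = {mean_se:.3f}, ratio = {ratio:.1f}")
```

With these invented numbers the ratio lands between 3 and 4, which is the order of magnitude Nick's team reported.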
Then there’s the Journal of Finance paper from Menkveld and coauthors in 2024, which coined the time period “non-standard errors” — the variation they doc positively doesn’t and can’t from sampling (not even bootstrapping) however from an accumulation of small analytical selections. And Borjas and Breznau in 2025, who discovered that with an immigration query, researcher ideology predicted the signal of the impact.
The common thread is that when researchers have discretion, estimates spread out even with the same raw dataset, the same goal, the same research question, and the same kinds of people doing the estimation. Give smart, well-trained people the same data and the same question, and the spread of answers is large, often larger than any individual analyst's reported uncertainty. The variation isn't errors, or p-hacking, either; it comes from researcher discretion and biases.
So this project is trying to do three things at once, and I want to explain what those are now.
First, I want to know whether running the same analysis with multiple independent AI agents could serve as a robustness audit. That's my referee2 code audit. But I've extended and integrated that into every DiD analysis to check the spread of estimates across independent runs, to see if it tells you something useful about how sensitive your conclusions are to the choices being made under the hood. I think that's interesting in and of itself, and it's not really the same thing as what my referee2 code auditor is doing.
Second, I wanted to test whether AI agents could approximate a many-analysts design like Nick's and others'. This is loosely connected to a different series I've been doing where I've been replicating studies using Claude Code and OpenAI's gpt-4o-mini to see if you can use one-shot classification with consumer LLMs. You can see the fifth of that five-part series here (and if you click through you'll find the other four):
I thought those were interesting illustrations of what you can do with Claude Code, but they were also interesting applications of LLMs for classification. The comparison was to a trained RoBERTa model built by hiring ~7 student workers to read and classify ~7,500 speeches and then classifying another 200,000 using RoBERTa. I wanted to see if you could do it much less expensively using gpt-4o-mini at OpenAI with one-shot batch requests. And I did that because the human version is powerful but expensive: you have to recruit analysts, coordinate them, wait for results. The frontier models continue to get cheaper and better, so if you can use them, you can do these things really inexpensively without losing much. Well, it's the same here. If AI agents produce qualitatively similar patterns of variation, that's a cheap diagnostic tool. Maybe we could run them and report forest plots alongside auditing our code, and non-standard errors could become something we report. So that's the other part of this exercise.
Third, I wanted to map the specific points where discretion enters a staggered DiD project. Not discretion in general: I wanted a concrete inventory. Which decisions do analysts agree on? Which ones generate all the variation? Where exactly does the uncertainty live? So that's what this is about.
So, in line with the many-analyst design, all the AI agents that Claude Code spawned were given the same dataset, the same question, the same estimator, and many other discretionary decisions fully made for them. Let me review that now.
The dataset covers 5,476 Brazilian municipalities. The treatment is the rollout of CAPS mental health centers (Centros de Atenção Psicossocial), which were adopted by different municipalities at different times between 2002 and 2016. The outcome is homicide rates. And there are roughly twenty potential covariates: economic variables like GDP, poverty, and inequality; demographic variables like population, age structure, and literacy; health variables like spending and professional counts; and geographic variables like temperature, altitude, and distance to the state capital.
The staggered adoption makes this a natural setting for the Callaway and Sant'Anna estimator. And the covariate set is genuinely ambiguous: reasonable analysts could disagree about which variables satisfy the conditional parallel trends assumption and which are mediators that should be excluded. See section 4.2 of our JEL to learn more about conditional parallel trends.
So with Claude Code, we wrote a single instructions file, a short markdown document, and gave each agent nothing else. The instructions specified: use the Callaway and Sant'Anna estimator; use a universal base period (Roth 2026) with the not-yet-treated control group; and, and this is the primary discretionary point [I tried to make it the only one too, but I'm still wondering if I missed something], select covariates that would satisfy conditional parallel trends and common support. This is a major point because virtually every DiD uses covariates, and their purpose in DiD is, as I said earlier, to satisfy an untestable assumption called "conditional parallel trends". So even though conditional parallel trends can be written down as a coherent mathematical object, in practice nobody knows what it is. This is a discretionary node in the chain of decision points that takes you from the raw data to the estimates, and in my own analysis the inclusion of covariates can play a large role in estimation, sometimes even flipping signs!
Then the agents had to produce a balanced event study from four periods before treatment to four periods after, report a simple ATT averaged over the post-treatment periods, and document every decision in structured checkpoint files.
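For intuition about the quantity each agent was computing, here is a toy hand-computation (invented numbers, not the CAPS data) of a single group-time effect ATT(g, t) under a universal base period g−1 with a not-yet-treated comparison group:

```python
from statistics import mean

# Toy panel: outcome by (unit, year). Units A and B adopt in 2006;
# units C and D adopt in 2012, so they are "not yet treated" in 2007.
y = {
    ("A", 2005): 10.0, ("A", 2007): 8.0,
    ("B", 2005): 12.0, ("B", 2007): 9.0,
    ("C", 2005): 10.0, ("C", 2007): 10.5,
    ("D", 2005): 11.0, ("D", 2007): 11.5,
}
treated, not_yet = ["A", "B"], ["C", "D"]
g, t = 2006, 2007   # adoption cohort and outcome year
base = g - 1        # universal base period: the year before adoption

# ATT(g, t): change for the treated cohort minus change for the
# not-yet-treated group, both measured from the base period.
treated_change = mean(y[(u, t)] for u in treated) - mean(y[(u, base)] for u in treated)
control_change = mean(y[(u, t)] for u in not_yet) - mean(y[(u, base)] for u in not_yet)
att_gt = treated_change - control_change
print(att_gt)  # -3.0 for these invented numbers
```

The real estimator adjusts these comparisons for covariates (doubly robust), which is exactly where the agents' discretion entered; the event study then averages the ATT(g, t) by event time and the simple ATT averages over the post-treatment periods.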
But as I said, the instructions did not say which covariates to use, which doubly robust variant to choose, or how to handle any data issues. Those were left to each agent's judgment.
I ran three independent agents on each of five packages: differences and diff-diff in Python, did in R, and csdid and csdid2 in Stata. Fifteen total runs across the five language-specific packages.
Each agent was a fresh Claude session launched via claude -p with no shared memory, no conversation history, and no access to any other agent's work. Each saw exactly two files: the shared instructions and a one-page appendix naming its assigned package.
The isolation protocol was strict. My own reference code was moved to a hidden directory. Each agent worked in an isolated temp directory. All prior output was archived outside the project before each new run. Output was moved to its final location only after the agent exited. There was no way for one agent to see what another had done.
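As a sketch of that protocol (the function names, file names, and prompt are illustrative, not my exact script), each run seeds a fresh throwaway directory with only the two permitted files before launching a non-interactive agent confined to it:

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

def prepare_workdir(instructions: Path, appendix: Path) -> Path:
    """Create a throwaway directory seeded with exactly the two files
    an agent is allowed to see: shared instructions + its appendix."""
    workdir = Path(tempfile.mkdtemp(prefix="did-agent-"))
    shutil.copy(instructions, workdir / "instructions.md")
    shutil.copy(appendix, workdir / "appendix.md")
    return workdir

def run_agent(workdir: Path, prompt: str) -> None:
    """Launch one fresh, non-interactive Claude session (`claude -p`)
    with cwd set to the isolated directory: no memory, no history,
    no shared state with any other run."""
    subprocess.run(["claude", "-p", prompt], cwd=workdir, check=True)
```

Archiving each run's output outside the project before the next launch, as described above, is what guarantees no agent can read another's work.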
Here is the runtime. Pedro and Brant will be happy to learn that their R package was the fastest. And csdid in Stata had a huge outlier (100 minutes), which pushed its mean above the rest, with a larger standard deviation.
This is the first result that surprised me, and it comes before any event study or ATT estimate.
All fifteen agents agreed on every structural decision. Control group: not-yet-treated, 15 out of 15. Base period: universal, 15 out of 15. Always-treated units: excluded because there's no pre-treatment data, 15 out of 15. Balanced cohorts: 2006 through 2012, 15 out of 15. Trimming: none applied, 15 out of 15.
Where there was a clear methodological default, something that follows directly from the instructions or from the estimator's design, every agent chose it. The variation lives somewhere else entirely.
The covariate heatmap tells the story. Log GDP per capita was included by 14 out of 15 agents. Population by 12 out of 15. These were seen as fundamental predictors of both treatment adoption and homicide trends: near-consensus inclusions.
On the other side, geographic variables were rejected by 14 out of 15. Mental health professionals: rejected by all 15 as endogenous to CAPS itself. Health establishments: same, all 15 excluded.
However then there’s the contested center. Poverty share was included by 10 out of 15. Well being spending by 7 out of 15. Bolsa Familia by solely 2 out of 15. The brokers disagreed on whether or not these have been confounders that wanted to be managed for or potential mediators that will soak up the remedy impact if included.
The reasoning across agents was qualitatively similar: they all mentioned endogeneity, collinearity, parsimony. But they drew the line in quantitatively different places. The boundary between "confounder" and "mediator" shifted from agent to agent, and that shifting is where the variation in results comes from.
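The heatmap itself is just a tally over the agents' structured checkpoint files. Here is a minimal sketch (the checkpoint structure and variable names shown are hypothetical, simplified from the real files):

```python
from collections import Counter

# Hypothetical excerpts from three agents' checkpoint files, each
# listing the covariates that agent chose to include.
checkpoints = [
    {"agent": "r-did-1", "covariates": ["log_gdp_pc", "population", "poverty_share"]},
    {"agent": "py-differences-1", "covariates": ["log_gdp_pc", "population"]},
    {"agent": "stata-csdid-1", "covariates": ["log_gdp_pc", "health_spending"]},
]

# Count how many agents included each covariate; the heatmap is this
# table arranged as covariates x agents.
inclusion = Counter(c for cp in checkpoints for c in cp["covariates"])
n = len(checkpoints)
for cov, k in inclusion.most_common():
    print(f"{cov}: {k} of {n}")
```

Near-consensus variables sit at the top of the tally and the contested middle shows up as intermediate counts, which is exactly the pattern in the deck.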
All fifteen chose doubly robust estimation. But the Stata agents split on which DR variant, DRIPW versus DRIMP, and those differ in finite samples. That split is package-driven rather than agent-driven; R and Python don't expose that choice.
The next section of the deck is called "The Event Study," and it's where the actual results start to get interesting: the per-package event study plots, the overlay of all fifteen, a genuine analytical error that two agents made, and the relationship between covariate count and the ATT estimate. After that there's the anatomy of discretion, the sampling distribution analysis, the comparison to Huntington-Klein's ratio, and my opinions about each of the five packages.
I’m not going to indicate you any of that but.
Partly because I'm trying to keep these walkthroughs shorter. Partly because the setup matters more than people think. The literature framing, the experiment design, the isolation protocol, the covariate heatmap: if you don't understand why these matter, the event studies are just lines on a graph.
But also because there's a genuine cliffhanger here. Fifteen agents agreed on everything structural. They disagreed on covariates. The question is: how much does that disagreement matter for the actual estimates? Is the spread tight enough that you'd feel comfortable reporting any single run? Or is it wide enough that the confidence interval from one analysis is basically meaningless?
I know the answer. You'll see it in Part 2.
So again, thanks for supporting the substack! If you like this stuff, consider becoming a supporter! Here's a Spotify playlist to help juice the pot.




