Wednesday, February 25, 2026

Multiple Agents Auditing Your Diff-in-Diff Code (Part 1)


This is part of a longer series I'm doing on Claude Code for the quantitative social sciences. I'm going to try (fingers crossed) to write a shorter post today. Before I do, I wanted to thank everyone for their support of the substack. It's a labor of love. The substack gives me an opportunity to write and express myself creatively while also sharing what I've learned about this or that, be it causal inference, AI, or some random kick I'm on.

I filmed a video walkthrough of me doing this exercise. You can see that video here. Note that during the process of the code audit, I realized that there were two more packages that I wanted to evaluate. As such, we'll illustrate the code audit idea using five diff-in-diff packages: two Stata packages, two Python packages, and one R package. We will be focusing primarily on the Callaway and Sant'Anna estimator, but this is really agnostic about the estimator, as we will also be focusing on the preprocessing stages, too. The opening of this substack explains the idea behind it (independent errors), and the rest explains the implementation. The video walks you through exactly what I did. I hope you find this useful. This'll be the first of many diff-in-diff exercises, so buckle up!

If you're a regular reader, maybe consider becoming a paying subscriber. I've set the price at the lowest price point ($5/mo) that substack allows. Enjoy!

Today's substack will be the first of many in which I illustrate using Claude Code in a project where the tasks include a pipeline of processing data and estimating average treatment effects using the Callaway and Sant'Anna method. But this one's fairly narrow in focus, which I think will make it more general to all people, regardless of whether they're using diff-in-diff. Today's substack is about code audits using multiple agents to replicate code in multiple languages. Here is the idea:

  • I think we should take advantage of Claude Code's agents to "audit our code" and do so aggressively. Almost like it's a health inspector whose goal is to shut us down.

  • I think we can use Claude Code's ability to speak multiple languages to do this.

And to talk about this, I'll illustrate it with some simple examples, along with a video walkthrough of me using it for some simple tasks.

Hallucination as Measurement Error

We in the social sciences must embrace using Claude Code in our workflow to eliminate all the errors it can be used to stop. There are several kinds of errors, and their causes are due to many things, things that are entirely unrelated to one another. Some of them are reasoning errors, and perhaps Claude Code can catch those (I've found it catches a fair amount), but the ones I want to talk about are coding errors.

As we shift towards AI agents writing more and more, if not all, of our code, we should consider the possibility that AI agents based on large language models like the generative pre-trained transformer (GPT) will always have problems with hallucination. But what if hallucination can be conceived of as measurement error? That is, hallucination in the context of writing code is random because the LLM is just probabilistically writing down the wrong executed code.

I'm not saying to you that I know this is the case so much as I'm saying that it could be a convenient fiction for us as quantitative social scientists to talk that way. For one, it's a way of talking we're far more familiar with than we are with the probabilistic nature of the LLMs in the first place. I doubt many of us have read the original "Attention Is All You Need" by Vaswani, et al. (which now has 232,500 cites since it first appeared in 2017). But I think all of us have at some point in our life read in some econometrics textbook the idea that there exists a variable that has been recorded incorrectly and as such the variable is classically mismeasured. Classically mismeasured in the sense that the variable's values will differ from the "true value" by that value plus some random noise, usually centered at zero and usually standardized to have some fixed variance, like a normal distribution. In such cases, regressions using it will have coefficients on that variable that are attenuated towards zero.
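To fix ideas, here is a minimal simulation of that textbook attenuation result (illustrative numbers, not tied to any real dataset): a regressor observed with mean-zero noise pulls the OLS slope toward zero by the factor Var(x) / (Var(x) + Var(noise)).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# True model: y depends on x with slope 2
x = rng.normal(size=n)
y = 2 * x + rng.normal(size=n)

# Classical measurement error: we observe x plus mean-zero noise with
# the same variance as x, so the attenuation factor is 1/(1+1) = 0.5
x_obs = x + rng.normal(size=n)

slope_clean = np.polyfit(x, y, 1)[0]
slope_noisy = np.polyfit(x_obs, y, 1)[0]

print(round(slope_clean, 1))  # ~2.0, the true slope
print(round(slope_noisy, 1))  # ~1.0, attenuated toward zero
```

The noise has nothing to do with y, yet it systematically biases the coefficient; that is the analogy I want you to hold in your head for hallucinated code.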

I would like for you to be open to that language, but applied to the code Claude Code generates for our analysis. It could be hidden somewhere in the pipeline. It could be somewhere in the regression commands. It could be somewhere in the lifting of the regression output into automated tables and figures. Maybe it's something seemingly small, like allowing the sample composition to change as fixed effects are added, not realizing that not all of the sample had those fixed effects, causing 50% of the sample to drop. Or maybe it's a merge syntax error. It could even be that classic Stata error:

replace olddog = 10 if olddog>10

which those of us who are old dogs will know not only top codes all values of olddog at 10 when olddog is greater than 10. It also replaces olddog with 10 for all missing values too, since Stata stores missing as larger than any number.

That's an old error, well known to Stata users, many of whom had to learn it either the hard way or on the Stata listserv from the prolific, extraordinarily helpful Stata legend Nick Cox. But note that that command is itself unique to Stata syntax.

So, let's try this. Let's just assume that 9 times out of 10 Claude Code doesn't make that mistake. Claude Code knows, because it has been trained on every conceivable writing about Stata, including the manuals, and including Nick's own words, that one of the correct ways to do it is this:

replace olddog = 10 if olddog>10 & olddog~=.

But on this day, Claude Code randomly left that last part out. And because on that day Claude Code randomly left that last part out, your olddog variable has been top coded at 10 both for those rows that had values greater than 10 (e.g., olddog = 15), as well as those rows whose olddog variable was missing (i.e., olddog = .). Why did it make that mistake in your code that day? It made that mistake randomly. But you only ran the pipeline once; you only generated the pipeline once. And as such, you pulled a bad draw unknowingly, and since the code did run, and it didn't throw an error, the syntax error cascaded down through your pipeline into your analysis as systematic measurement error, causing your results to be based on mismeasured variables in ways that could be severe depending on how many missing values there are in the data.
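To see how quietly this cascades, here is a minimal pandas analogue (hypothetical data; the bug itself is Stata-specific, so the buggy branch deliberately mimics Stata's rule that missing compares as larger than any number):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"olddog": [3.0, 15.0, np.nan, 8.0, np.nan, 22.0]})

# Buggy version: mimics Stata, where missing (.) is stored as larger
# than any number, so "if olddog > 10" is also true for missing rows.
buggy = df["olddog"].copy()
buggy[buggy.isna() | (buggy > 10)] = 10.0

# Correct version: top code only genuinely observed values above 10.
correct = df["olddog"].copy()
correct[correct > 10] = 10.0

print(buggy.tolist())    # [3.0, 10.0, 10.0, 8.0, 10.0, 10.0]
print(correct.tolist())  # [3.0, 10.0, nan, 8.0, nan, 10.0]
```

Nothing errors out: the buggy run simply converts every missing value into a valid-looking 10, and every downstream regression inherits that contamination.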

Hallucination Errors Are Independent Across Languages

So here's what I propose. I propose that you assume a second thing. I propose that you imagine that Claude Code will randomly hallucinate in its code. And that because you're offloading some of the cognitive work to it, and your skills are depreciating as a result, you must find a way to insert verification steps wherever possible using Claude Code in a targeted manner. And I propose that you consider this:

  • R will hallucinate with some probabilistic error, ε_R

  • Python will hallucinate with some probabilistic error, ε_P

  • Stata will hallucinate with some probabilistic error, ε_S

If the errors are independent, the probability all three hallucinate the same incorrect result is ε_R × ε_P × ε_S, a very small number. And I think it's reasonable to say that they won't, because if the errors really are syntax errors, then we shouldn't expect them to show up at the same time in the same place. If all three errors are pairwise independent, then we can write down these three covariance equations and set them equal to zero:

  1. Cov(ε_R, ε_P) = 0

  2. Cov(ε_R, ε_S) = 0

  3. Cov(ε_P, ε_S) = 0

Recall that these are zero because of the definition of covariance and the way in which the mean of the product of two independent random variables breaks out into the product of the mean of each one, causing the entire covariance expression to be zero.

Cov(ε_i, ε_j) = E[ε_i ε_j] - E[ε_i]E[ε_j]
             = E[ε_i]E[ε_j] - E[ε_i]E[ε_j]
             = 0
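The arithmetic behind "a very small number" is worth seeing once. With purely illustrative hallucination rates (assumptions, not measured values):

```python
# Illustrative per-language hallucination rates (assumed, not measured)
eps_R, eps_P, eps_S = 0.05, 0.05, 0.05

# Under independence, the probability that all three language
# replications produce the same wrong result is the product:
joint = eps_R * eps_P * eps_S
print(joint)  # ~1.25e-04, versus 0.05 for any single pipeline
```

A 1-in-20 error rate per pipeline becomes roughly 1-in-8,000 for all three agreeing on the same wrong answer, which is the whole payoff of cross-language replication.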

This is the principle I want you to keep in mind: if Claude Code or any AI agent is making errors due to language-specific syntax, and if those errors are random, then it's reasonable to assume the errors are stochastic and therefore independent of one another. That allows you to justify incorporating not just code audits into your process, but replication of your entire project in other languages.

Requesting Code Audits That Replicate in Multiple Languages

Which leads me to my next point: get Claude Code to audit your code systematically like a health inspector, as well as replicate your code in two other languages. These are two separate tasks, and while many people are already integrating "code audits" by hyper-antagonistic subagents into their AI agent workflow, that doesn't necessarily mean they're getting those hyper-antagonistic subagents to first replicate their code in the other languages already installed on their machine.

What you want, therefore, is a workflow with a pipeline of code that from start to finish is completely replicated in two other languages, such that at each stage, your code creates tables and figures that have the exact same values for all variables, and test statistics that are the same, down to several trailing digits.

This only works with code that's non-random, though. It won't work with bootstrapping, for instance, which is itself based on seeds that are probably unique to each language. So you may not be able to do this to check bootstrapped standard errors, since the resampling is random. Other examples where this kind of code audit won't work include:

  • Simulation-based estimators: simulated MLE, method of simulated moments (these draw random simulations as part of the likelihood approximation)

  • Bayesian MCMC: Gibbs sampling, Hamiltonian Monte Carlo (Stan, brms)

  • EM algorithms with random starting points: mixture models often randomize initial cluster assignments

  • Machine learning: SGD, random forests, neural net initialization

But it will work for many other things, including basic processing tasks (e.g., cleaning variables, merges) as well as many very common statistical modeling methods (e.g., OLS, difference-in-differences, instrumental variables, F tests, analytical standard errors, R-squared).

So, what you want is to have Claude Code not only audit the code using its own reasoning. You also want Claude Code to replicate the code, from start to finish (i.e., including the pre-analysis processing stages in your pipeline), in two other languages, and then have an agent check that the output produced in tables for all three is identical.
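That final checking step can itself be a small deterministic script. Here is one sketch, under stated assumptions: each replication has written its estimates (say, group-time ATTs and standard errors) to a table with the same columns and row order, and the DataFrames below stand in for those hypothetical files.

```python
import numpy as np
import pandas as pd

def tables_agree(tables, atol=1e-6):
    """Check every estimates table against the first: same column
    names, and values equal down to roughly six decimal places."""
    baseline = tables[0]
    for t in tables[1:]:
        if t.columns.tolist() != baseline.columns.tolist():
            return False
        if not np.allclose(t.to_numpy(), baseline.to_numpy(), atol=atol):
            return False
    return True

# Hypothetical ATT tables from the Stata, R, and Python runs; in
# practice these would come from pd.read_csv on each pipeline's output.
stata  = pd.DataFrame({"att": [0.1234567, -0.0456789], "se": [0.021, 0.034]})
r_out  = pd.DataFrame({"att": [0.1234567, -0.0456789], "se": [0.021, 0.034]})
py_out = pd.DataFrame({"att": [0.1239999, -0.0456789], "se": [0.021, 0.034]})

print(tables_agree([stata, r_out]))          # True
print(tables_agree([stata, r_out, py_out]))  # False: att differs at the 4th digit
```

The point of the tight tolerance is that, for deterministic stages, disagreement beyond trailing floating-point digits is a flag for the auditor, not noise to be waved away.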

Difference-in-Differences as a Case Study

So, here's to the video walkthrough. We have five language-specific packages for implementing both standard difference-in-differences, as well as more complex difference-in-differences with differential timing and the inclusion of covariates. Therefore, it's possible to do the kind of code audit I'm describing for difference-in-differences. The ones we'll use are csdid (Stata), csdid2 (Stata), did (R), differences (Python) and diff-diff (Python). And I will be focusing on auditing the parts of the pipeline and analysis that are deterministic.

Our example will come from this Brazilian study that I also analyze in my forthcoming book, Causal Inference: The Remix, which will be published this summer by Yale University Press. Here's the paper in question.

What I do in this video walkthrough is simple. I merely have Claude Code generate the code to estimate event study plots of the effect of the CAPS deinstitutionalization (i.e., closing mental health institutions) in Brazil on homicides, which is one of the several outcomes that the authors, Mateus Dias and Luiz Felipe Fontes, use in their fascinating and important study about mental health reform and hospitalization.

To do this, I'll use a program called Brazil.do that I wrote. It's a lengthy set of code, but we will have Claude Code take only the portion of it that cleans the data and estimates the effect using csdid and csdid2 in Stata. I'm using, in other words, the Callaway and Sant'Anna method, as it is one of the more popular methods for estimating aggregate effects under differential timing. But csdid is also a user-created package. The original command was in R, called did. And there are also two packages in Python. There is one written by Isaac Gerber in Python as well. You can find that one here.

But there is actually a second Python package for diff-in-diff called differences. It's written by Bernardo Dionisi. So we will also replicate the analysis in his diff-in-diff package in Python, alongside Isaac's.

Conclusion

This is going to be the first of many posts using Claude Code to estimate diff-in-diff, but today's was just about the "code audit" using a very particular version of my referee2 persona. And I want to stop here because this is a lengthy post already as it is. But in the next post, I'll review the results with you and we'll try to get to the bottom of whether there are problems that are due to the audit, the packages, both, or neither. The point today is simply to illustrate a particular workflow I've been developing to implement verification aggressively in analysis using Claude Code, but doing so with a very narrow yet extremely common and high value task: estimating treatment effects using diff-in-diff, which at the moment can be done using at least five different packages (2 in Stata, 1 in R, 2 in Python). So we'll see in the next post how it went!
