This isn’t technically a Claude Code post so much as an econometrics post. It’s an econometrics post about LLMs from a new paper by Jens Ludwig, Sendhil Mullainathan, and Ashesh Rambachan entitled “Large Language Models: An Applied Econometric Framework”, forthcoming in the Annual Review of Economics. You can find the abstract below.
As this is an econometrics explainer, and not so much a Claude Code one (even though it will be partly based on earlier posts I’ve done using Claude Code to analyze texts), I flipped a coin three times to decide on paywalling. Best two out of three wins, and in this case, that was heads, which means it goes behind the paywall.
So in this post, I’m going to walk through a new forthcoming paper by that powerhouse trio of coauthors. Rambachan, many of you will know, is a coauthor on the “credible parallel trends” paper with Jon Roth in ReStud from a few years ago. He has a fascinating research agenda, and was actually a guest on my podcast back in the day.
But this is something they worked on relating to using LLMs for either prediction or estimation tasks. Their article is about using LLMs to automate text classification in economics research. Specifically, replacing expensive human annotation with cheap LLM labels. The manuscript is a deep discussion of the measurement error problems that arise when you do.
The key theoretical result, which I’ll try to break down carefully, is that high accuracy alone doesn’t protect your regression estimates, because errors can correlate with your covariates in ways that destroy inference. Their solution is a small human-coded validation sample, used not to replace the LLM but rather to debias its labels.
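To see why accuracy alone isn’t enough, here’s a toy simulation of my own (not from the paper): a labeler that agrees with the truth about 95% of the time, but whose mistakes cluster at high values of the covariate, noticeably flattens the regression slope.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# A covariate and a true binary label that depends on it
x = rng.normal(size=n)
y_true = (x + rng.normal(size=n) > 0).astype(float)

# Hypothetical LLM labeler: accurate overall, but its error rate
# jumps from 2% to 18% in the right tail of x
p_error = np.where(x > 1, 0.18, 0.02)
flip = rng.random(n) < p_error
y_llm = np.where(flip, 1.0 - y_true, y_true)

accuracy = (y_llm == y_true).mean()

def slope(y, x):
    """Bivariate OLS slope of y on x."""
    return np.cov(y, x)[0, 1] / np.var(x, ddof=1)

print(f"accuracy:           {accuracy:.3f}")          # high, around 0.95
print(f"slope, true labels: {slope(y_true, x):.3f}")
print(f"slope, LLM labels:  {slope(y_llm, x):.3f}")   # flattened
```

If instead you flip labels at a constant 5% rate everywhere, the slope barely moves. It’s the correlation between the errors and x, not the error rate itself, that does the damage.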
I stumbled onto this as I was trying to do more to write up the work I did on here with Claude Code to re-analyze a paper from PNAS that classified 305,000 Congressional speeches from the late nineteenth century to the present according to the speaker’s sentiment about immigration. Here’s the first substack about it, but there were a total of three I did. I decided over spring break to figure out a strategy for how to write this up, and I’m reading the Ludwig, et al. (2026) paper now to try to see if this might be the angle.
The Paper That Caught My Eye
So, let me back up. There’s a cottage industry right now in writing papers about AI. It reminds me of Covid to some extent, where at first there were a few papers about Covid, then there were ten, then a hundred, then a thousand, then a hundred thousand, then it was a blizzard and I couldn’t keep up with anything and so just stuck with my normal research agenda rather than make any effort at a contribution.
I’m not saying AI is like that now, but it’s definitely pushing that way. I consider myself lucky that I actually find this fascinating. I developed a class on the economics of AI at Baylor in the spring of 2025, and have been a fairly intense power user of gen AI ever since March 2023. I’ve thought about its practical use for economics, both thinking theoretically about work and aggregate output, but also thinking about how it could be a tool for me to do research. I’ve written (in admittedly weird papers) about using it for prediction, and I now use Claude Code intensively for my own research projects, as well as to start different types of research projects that I otherwise never would have begun.
This reclassification of 305,000 speeches from the late nineteenth century to the present is an example of a project that I’d have never started had it not been for Claude Code. One thing led to the next thing until I felt like I had a set of findings, and now I needed to better understand what econometricians were saying about LLMs to see if there was something beyond the rote “replication” exercise I had been doing.
And that was when this paper caught my eye. It was while I started reviewing what economic historians were doing with large language models that I somehow found their paper, and now this substack is me taking a stab at explaining it to myself, as well as to you.
What the Paper Really Does
Here’s what I understand it to be, and I want to be honest about the fact that I’m working this out as I write.
Ludwig, Mullainathan, and Rambachan make a clean distinction between two ways economists use LLM outputs. The first is prediction problems and the second is estimation problems. The prediction problems relate to using LLMs to forecast some outcome, like Van Pham and I did in our paper. Or as Jared Black, Coco Sun, and I did in predicting the Harris-Trump election outcome for 100 days using an extension of the method that Van and I used (and failing miserably).
But others too. You often see people using LLMs to forecast some outcome. Stock returns from financial headlines, for instance. Even before Claude Code, that was itself a growing cottage industry of applied work by academics and industry people. Can LLMs predict, and if so, how do we know? And how do we use it? And which choices matter and which ones don’t? Because Van and I found that even seemingly relevant information fed to ChatGPT paradoxically caused prediction errors to rise.
But the second is, like I said, about using LLMs for estimation problems. That’s where you’d use the LLM to automate the measurement of some economic concept in order to use that measurement downstream. Maybe in a regression.
These two problems sound similar but they require completely different disciplines. Just as in prediction and causal inference, where we often use the same tool (e.g., regression) for very different tasks, it’s the same here.
No Such Thing As A Free Label
Labeling texts is expensive, or can be. You can use Mechanical Turk, but reports have been saying that the quality of MTurk has been deteriorating over the last decade. You could pay students, but that’s expensive as well. But if you had the labels, then you might want to estimate some population estimand, omega, with an estimator like the sample mean or a regression.
The problem is that it’s expensive, as I said, so the researcher substitutes LLM labels for the “true label” and runs that regression instead.
The question then is obviously about bias. When can I believe that the cheap estimate is reliable?
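In the simplest bivariate case the logic is easy to write down (my notation, not necessarily the paper’s). If the LLM label is the true label plus an error, the slope you actually estimate decomposes as:

```latex
\hat{y}_i = y_i^* + e_i, \qquad
\frac{\operatorname{Cov}(\hat{y}, x)}{\operatorname{Var}(x)}
  = \underbrace{\frac{\operatorname{Cov}(y^*, x)}{\operatorname{Var}(x)}}_{\text{the slope you want}}
  + \underbrace{\frac{\operatorname{Cov}(e, x)}{\operatorname{Var}(x)}}_{\text{bias from label errors}}
```

Accuracy governs how often the error is nonzero; the bias term depends on whether those errors line up with x. That’s why a highly accurate labeler can still wreck the slope, and why an inaccurate-but-random one wouldn’t.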
LLMs For Prediction versus Estimation
For prediction, the key requirement is what they call “no training leakage”. This is partly a matter of careful design, even down to how you split test and training samples. “No training leakage” means that the LLM’s training data can’t overlap with your evaluation sample. This sounds obvious. Whatever you think it might mean, you’d probably agree “leakage” doesn’t sound like a great thing. But in this case, what it can mean is that your prompt engineering techniques, like telling the model to “ignore information past this date” or whatever instructions, don’t actually do it. They’ve tested whether prompts can create the necessary moat such that GPT or whoever really doesn’t know the training data exists, and they can’t. Leakage regularly occurs. GPT-4o can literally memorize 344 out of 10,000 Congressional bill descriptions and then complete them verbatim from the first half alone, for instance.
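To make that memorization test concrete, here’s a minimal sketch of how you could check for verbatim completion yourself. The `complete` function is a stand-in for an actual LLM API call (hypothetical; this is not the paper’s code):

```python
def memorized_count(texts, complete, prefix_frac=0.5):
    """Count texts the model can finish verbatim from their first half.

    `complete(prefix)` is assumed to return the model's continuation of
    `prefix` (in practice, an LLM API call at temperature 0).
    """
    count = 0
    for text in texts:
        cut = int(len(text) * prefix_frac)
        prefix, suffix = text[:cut], text[cut:]
        # A verbatim completion of the held-out second half counts as leakage
        if complete(prefix).strip().startswith(suffix.strip()):
            count += 1
    return count


# Toy stand-in: a "model" that has memorized two of three training texts
corpus = [
    "A bill to amend the Internal Revenue Code of 1986.",
    "A bill to provide for the conservation of certain lands.",
    "A bill to reauthorize the Higher Education Act.",
]
memorized = set(corpus[:2])

def fake_complete(prefix):
    for text in memorized:
        if text.startswith(prefix):
            return text[len(prefix):]
    return "..."

print(memorized_count(corpus, fake_complete))  # → 2
```

The real question is just whether the completion matches the held-out half; everything else is bookkeeping.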
So prompt engineering doesn’t itself guarantee “no training leakage”. In fact, it’s worse than merely lacking a guarantee: it simply doesn’t work, and thus is not something you can use to satisfy the condition.
For estimation, which is what I’m doing in this PNAS replication idea I’ve been working on here for a month or so, the key requirement is a validation sample. This is different from the “no training leakage” concept, so put that out of your mind for now. With a validation sample, you need to collect your measurement (e.g., some labeled sentiment) on a small random subsample using the expensive, careful, human-coded method. You then use that subsample to debias the LLM’s labels.
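The mechanics are simple enough to sketch. Below is my toy version of the idea, a difference-style correction in the spirit of what’s sometimes called prediction-powered inference (I’m not claiming this is the paper’s exact estimator): use the LLM labels on the whole corpus, then use the small human-coded subsample to estimate and subtract the LLM’s bias.

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_val = 50_000, 500          # full corpus vs. small human-coded subsample

# Simulated ground truth and a systematically biased LLM labeler
y_true = rng.binomial(1, 0.30, size=n).astype(float)
p_flip = np.where(y_true == 1, 0.20, 0.02)   # misses 20% of positives
y_llm = np.where(rng.random(n) < p_flip, 1.0 - y_true, y_true)

# Human labels exist only on a random validation subsample
val = rng.choice(n, size=n_val, replace=False)

naive = y_llm.mean()                              # biased downward
correction = (y_true[val] - y_llm[val]).mean()    # estimated bias of the LLM
debiased = naive + correction                     # corrected estimate

print(f"truth:    {y_true.mean():.3f}")
print(f"naive:    {naive:.3f}")      # too low: the LLM misses positives
print(f"debiased: {debiased:.3f}")   # close to the truth
```

The cost of the correction shows up as extra variance, since it’s estimated from only 500 human labels, but in exchange the bias from the cheap labels is gone in expectation.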
There are some things in this that I feel like I hear echoes of, but I don’t yet know enough to be sure that’s what I’m hearing. But it does sound like the kind of sample-splitting econometrics I associate with work by the group that gave us double/debiased machine learning. And I also sort of feel like I hear echoes of the Abadie and Imbens (2011) bias-correction method for matching, as well as the Ben-Michael, et al. paper on augmented synthetic control. But like I said, those could just be echoes, and as I learn more, I’ll try to figure out those connections a bit better and decide whether they’re useful for my brain.