Friday, March 27, 2026

Do the AI Agents Writing Full Manuscripts at the Social Catalyst Lab P-Hack?


In this post, I used Claude Code to extract all 651 papers written by AI Agents in the Social Catalyst Lab's APE project. These papers were fully automated with no human guidance. They're program evaluation papers, meaning they estimate causal effects with real data using R scripts and modern econometric methods. And they're manuscripts presenting that information with words, pictures and tables. In this substack, I'll walk you through what I found when I sent all 651 of them to OpenAI to be analyzed along many dimensions, with the main goal being to answer this question: what do AI agents think humans think is the underlying rhetoric for communicating program evaluation findings to audiences in today's day and age?

The main questions I'm going to pursue, though, are to classify the papers by research design, estimator, types of data, the kinds of figures they make, the kinds of tables they make, whether they state an explicit estimand, and whether they p-hack. I did this by sending the 651 AI-generated manuscripts repeatedly to OpenAI's GPT-4o (not mini but the big boy model) for classification, as I found Claude Code repeatedly using unreliable methods for simple classifications, and sometimes even the metadata that the AI agents wrote about their own papers was wrong. So keep reading to learn more.

Thanks again for all your support of the substack! It's a real labor of love to do this. I say that every time, but what else can you say when that's in fact the truth. It's a labor of love. I love doing this, and it's work, and it takes time. I often write these either after midnight, or before 5am, or both. I'm unusually encouraged that so many of you think my approach to learning about Claude Code, through writing about it publicly in the pursuit of practical research tasks, is compelling and interesting. So thank you for your support, both the emotional support and the financial support of being a subscriber to the substack. Consider paying today! It's only $5/month, which is the lowest price that Substack lets you charge. With it you get some ridiculous number of articles I've written (I think it's 700+ or something) going back to early 2021. They're about causal inference, econometrics, pop culture, and now, artificial intelligence. But the Claude Code series, of which there are now 35 entries, are free initially to everyone (though after 4 days they go behind the paywall). But let's get started! Because I have some interesting pictures and analysis to share.

AI-Generated Manuscripts on Program Evaluation by the Social Catalyst Lab

Paul Goldsmith-Pinkham posted a substack the other day where he noted how common diff-in-diff is in the AI-published papers that the Social Catalyst Lab is running an experiment on. The experiment is called APE, and their goal is to automate 1,000 papers using AI Agents. Here is Paul's post on what he found when he pulled the papers down from their GitHub and looked into what they were.

On Twitter/X and LinkedIn, there has been a large discussion of the ability to "one-shot" policy evaluation, which David Yanagizawa-Drott has been pursuing in an interesting way: ape.socialcatalystlab.org…


Paul's post got me thinking. I wonder just what we can learn about what AI Agents think human researchers think about the implied rhetoric of empirical papers? I mean, they've written almost 700 manuscripts at this point. Fully formed, based on real data, real estimators, R scripts that executed and made actual tables of estimates, made actual figures, even included robustness tests. Papers with a thesis, estimating causal effects. Did they articulate a population estimand? When they write a paper using diff-in-diff versus when they write one using RDD, how do those papers appear to differ from one another? Do they show signs of p-hacking like humans do?

The only way I could think to do this, though, was to clone the repo and have Claude Code organize the files and submit them to OpenAI in batch requests. There are 651 manuscripts, and while I could've just looked at the metadata, what we had been learning together in doing that was that the metadata (written by the AI agent about what it thought the paper was doing) was not always accurate itself. Spot checks by two referee2 personas found the AI agent mistakenly calling one thing another. So I opted to send all 651 manuscripts to OpenAI to be more formally analyzed. For this I decided to pay top dollar too: not gpt-4o-mini, but gpt-4o. Claude said going up to 4.1 probably wasn't worth it, so I didn't.
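To give you a flavor of the workflow, here is a minimal sketch of what a batch classification run like this can look like with OpenAI's Batch API. The file paths, prompt wording, and classification fields below are illustrative stand-ins, not the exact ones from my run.

```python
# Build one classification request per manuscript and submit them as a
# single batch job. Paths, prompt, and fields are illustrative.
import json
from pathlib import Path
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "You are classifying an empirical economics paper. Return JSON with "
    "keys: research_design, estimator, data_sources, "
    "states_estimand (true/false), estimand."
)

with open("batch_requests.jsonl", "w") as f:
    for paper in sorted(Path("ape_papers").glob("*.md")):
        request = {
            "custom_id": paper.stem,  # e.g. "apep_0462"
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o",
                "response_format": {"type": "json_object"},
                "messages": [
                    {"role": "system", "content": PROMPT},
                    {"role": "user", "content": paper.read_text()},
                ],
            },
        }
        f.write(json.dumps(request) + "\n")

batch_file = client.files.create(
    file=open("batch_requests.jsonl", "rb"), purpose="batch"
)
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id)  # poll this id later and download the output file
```

One nice thing about the Batch API is that it runs at half the price of the live endpoint, which is part of why this whole exercise stayed so cheap.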

Breaking down the designs

First, I find slightly different things than what Paul found, but note that this isn't actually "me" finding anything. This is what OpenAI's gpt-4o classification said after reading the articles without any training whatsoever ("zero shot"). And I find 61% of the papers are diff-in-diff versus Paul finding 70%. GPT also classified 87 of the papers as triple diffs, which is interesting given I didn't give triple diff as a possible category. Still, I'm not sure, because look at the ways that the estimators vary for the diff-in-diff versus the triple diff papers.

Callaway and Sant'Anna is not the most common, but it's close. Over 40% of the DiD papers the AI wrote are using CS, and slightly more are using TWFE. Okay, fine. I wasn't surprised by that. What I am surprised by is that gpt-4o says a paper is using triple diff while the methodology used is an RDD estimator, rdrobust. Furthermore, CS doesn't have a triple diff implementation, so that too is odd.

So, what I did was ask Claude Code to read the triple diffs using CS and the triple diffs using rdrobust and tell me what's going on. Here's what's going on. There are 7 CS DDD papers in the APE dataset. These are papers that GPT-4o classified as triple diff because the research design is a triple-diff. They're comparing, apparently, across three dimensions: treated versus untreated sectors, in adopting versus non-adopting states, before versus after.

Yet weirdly enough, the estimator is CS applied to one of those dimensions. And Claude Code checked for me. The paper titled "apep_0462" runs att_gt() on a staggered treatment, even though the paper's conceptual design is a triple-diff. GPT-4o still classified the design as DDD but the estimator as CS, and both are arguably correct. GPT says, apparently, that the paper thinks of itself as a triple-diff but implements it by running CS on a pre-differenced outcome or a specific subgroup. Which is unorthodox to say the least, and so I asked it to extract the code for me here.

So in reality, though GPT-4o called this a triple diff, it did so because of the design. Apparently, it was a staggered diff-in-diff estimated with CS, because accidents (the outcome here) were measured by department and the treatment hit the departments at different time periods. And the reason GPT called it a triple diff and not just a diff-in-diff was apparently that the paper's narrative described itself as comparing departmental roads vs autoroutes within the same department. The third difference needed for the triple diff was actually a conceptual framing the AI made about the paper itself, not the estimator.

The rdrobust ones have similar issues. There are 3 of these papers, and what they actually are is not a traditional RDD but rather a "difference-in-discontinuities" design. The manuscript called apep_0611 literally says "RDD and diff-in-disc estimation." So what apparently happened is that GPT-4o classified the design as triple diff because it's differencing across a discontinuity. Recall that the diff-in-disc is an RDD run twice: once before the policy change, once after, and then differenced. But the estimator they used was rdrobust. This is no doubt a judgment call, since it could be either RDD or DDD, but GPT thought it best to call it triple diff.
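If the diff-in-disc idea is new to you, here is a minimal sketch of the logic in code. I'm using plain local linear regressions instead of rdrobust itself, and the data frame and variable names (r for the running variable with a cutoff at zero, post for the policy period) are made up for illustration.

```python
# Difference-in-discontinuities: estimate the RD jump at the cutoff
# separately before and after the policy, then difference the two jumps.
import pandas as pd
import statsmodels.formula.api as smf

def rd_jump(df: pd.DataFrame, bandwidth: float) -> float:
    """Local linear RD estimate of the jump in y at r = 0."""
    local = df[df["r"].abs() <= bandwidth].copy()
    local["above"] = (local["r"] >= 0).astype(int)
    # Separate slopes on each side of the cutoff; the coefficient on
    # "above" is the estimated discontinuity at r = 0.
    fit = smf.ols("y ~ above + r + above:r", data=local).fit()
    return fit.params["above"]

def diff_in_disc(df: pd.DataFrame, bandwidth: float = 10.0) -> float:
    jump_post = rd_jump(df[df["post"] == 1], bandwidth)
    jump_pre = rd_jump(df[df["post"] == 0], bandwidth)
    # The pre-period jump nets out any discontinuity that existed before
    # the policy; what's left is attributed to the policy itself.
    return jump_post - jump_pre
```

You can see why a classifier might reasonably file this under either RDD or DDD: the machinery is pure RDD, but the identifying variation comes from a difference of differences.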

So that's kind of what's going on in these papers. Their classifications, according to GPT, are based on the rhetoric of the paper itself, not just the estimator.

Data Visualization

Modern causal inference is extremely visual. RDD papers show plots of outcomes along the running variable. Diff-in-diff papers show event study coefficients with 95% confidence intervals. IV papers often show both the numerator and denominator representations of the Wald estimator: a graph of the first stage, and a graph of the correlation between the instrument and the outcome. And obviously synthetic control is nothing but pictures.

So then, what do the AI Agents who wrote the 651 papers at APE do in their causal studies? They also make pictures, and here are the types of pictures they make.

It's both surprising and not surprising at all that overwhelmingly, the triple diff and diff-in-diff papers use event study plots. It's almost disqualifying to be either of those and not do so. But that said, over 80% of the triple diffs are showing event study plots, and given I've literally never seen a single triple diff event study plot other than the ones I've included myself in my book, Causal Inference: the Remix, which comes out this summer, that seems unreal. I definitely want to pull those pictures out. My hunch tells me that these are the event studies for the diff-in-diff, not the event study coefficients on the triple interaction. But I'd have to dig into that more, and I will, just not now.

Interestingly, both RDD and IV show first stage plots. RDD shows a density test (as does at least one triple diff? I've got to read these modern triple diffs!). RDD shows covariate balance plots and, again, density tests, though not under the Justin McCrary shorthand name for them. You've also got treatment rollout maps, which is pretty interesting to me.

I know I said this was not surprising, but you know what? It's absolutely surprising. Because the Social Catalyst Lab never told these AI agents to do any of this, which means they're able to pattern match at a level that's mind boggling. It knows that in the rhetoric of a diff-in-diff paper, you produce event studies. No, take that back. You estimate them using R and panel data with policy events that vary over time for some but not all units. Just let that sink in: this mindless machine knows to do this.

I think we're too quick to gloss over stuff like this. If this had been the first thing we had learned that AI could do, we would've been shocked, but I think we're so accustomed to it now that we can't see it for what it actually is, which is downright eerie.

Which Tables Do AI Agents Produce?

We don't really always think of tables as rhetoric, but they are. We use them to communicate to other people in the hopes of persuading them. And over time, through repetition, through successful papers and failed papers, a series of canonical tables have become pretty much boilerplate, showing up in all our papers, including our causal estimation ones. And here is the distribution of the types of tables that appear in the 651 papers by research design.

Overwhelmingly, nearly everyone produces summary statistics, and of course they produce a table called "Main results". But it's interesting that they also overwhelmingly show robustness as its own set of tables. Covariate balance tests are quite common in RDD, but apparently less so in DD and DDD. My hunch is that this is due to a certain skepticism about the role of covariates in diff-in-diff designs, and so even though in our JEL (section 4.2) we discuss covariates at length and suggest balance tests, I don't think that's yet the most popular type of table you will see in diff-in-diff.

Interestingly, several of the diff-in-diff papers have tables of event study coefficients. I would not recommend that, but the fact that it's here is no surprise, as I regularly see it too, and strongly discourage it when I do.

Population Estimand

This semester I'm teaching two classes at Harvard: an undergraduate stats class called Gov 51 and a PhD class on probability. And in both of them, I've been talking at length about estimands, estimators, repeated sampling, sampling distributions of estimators, and estimates made in a single sample. And I'll say this: it's absolutely not something you teach once and then move on from. It's something that you have to teach repeatedly, or that I have to teach repeatedly anyway, and over time, across weeks even, over problem sets and exams, when things are asked a certain way over and over, you start to see earnest questions asked in class that clearly indicate the students understand these things, otherwise they wouldn't be asking the questions they're asking. And so I use pictures like this a lot for both classes: these hub and spoke pictures to illustrate the population estimand (here the conditional expectation function at the center), the spokes being single draws from the population creating samples that are identical in size to all the other samples. And all of these are hypothetical except one; only one of the samples exists. The others are counterfactual samples filled with actual people from the population.

Well, the causal parameters are also estimands. They too sit at the center. Now, I won't go down the rabbit hole of sampling-based versus design-based inference. I'll just say that every design has a population estimand associated with it, but it has not historically been nearly as common as one thinks to clearly articulate what it is or how it relates to the sample's own calculations.

And so I asked GPT-4o: how many of these papers explicitly identify the estimand of interest? And the answer is 81%. 528 out of 651 of the AI-generated papers explicitly identify the target parameter, and those are listed here.

The ATT, the average treatment effect on the treated (in potential outcomes notation, E[Y(1) - Y(0) | D = 1]), is apparently the most common parameter of interest. Which is interesting in a sense, because by revealed preference, one might argue that AI agents believe the ATT is the most important policy parameter we could be interested in. And frankly, I'm probably in that camp too.

I'm not sure I'd say that the DDD is an estimand, though one could of course run a triple diff in the population, and that triple interaction would have a calculation behind it that is a population quantity without any uncertainty attached. But I'm not sure if that's in fact what GPT-4o means here. Still, 422 of the papers mention the ATT, but 45 mention the LATE, which again is quite strange given GPT-4o also said only 20 of these papers should be classified as IV.

So plenty of mystery, and I'm curious what's going on, but for now I'm going to have to save that for another day. The more general thing I'll say is that it's decidedly modern, when you think about it, that AI Agents are being this precise up front about what they're going after. Because a lot of people historically weren't. They'd treat the regression coefficient as the estimand, which is indeed a calculation, but it is also probably an estimator, and it is also probably an estimate in a sample. But here we do appear, through the causal jargon, to be getting an expression of the estimand, which I think is itself fairly correct scientifically and methodologically. The papers appear to be clear about what they're going after.

I also had GPT-4o do a second analysis of the papers just to get a bit more information about them on this point. And here's what it found the second time.

So apparently, the overwhelming pattern is that AI agents name the estimand.

Units of Measurement in DD

So what's the unit of observation in the diff-in-diff papers, both the panel unit and the time unit? This was pretty interesting, and it confirmed my priors. These are mostly vanilla state-year panels. Why? Because 148 of them are state panels, which almost certainly are also yearly data.

But there is actually more variation in here than I was expecting. There are 44 papers using county level data, and 22 using city level data. Nine of them got their hands on individual level data. All of this was publicly available data, mind you.

The panel time dimension is far more skewed towards annual data, though there is some variation there too.

Reminder of Statistical Significance Rules

On my third request to OpenAI, I asked GPT-4o-mini to extract the coefficients and standard errors from all the regressions and place them in a simple JSON. Why didn't I do that on the first two pulls? Well, funny story: I thought I had. Anyway, I hadn't, so I had to do it again. This cost around 70 cents.
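The request itself was simple. Here is the spirit of it, with a made-up schema; the actual field names in my run may have differed.

```python
# Illustrative extraction prompt: one JSON record per reported coefficient.
EXTRACTION_PROMPT = """Read the manuscript and return JSON of the form:
{
  "estimates": [
    {
      "table": "Table 2",
      "specification": "primary or robustness",
      "coefficient": -0.042,
      "std_error": 0.021
    }
  ]
}
Include every regression coefficient that is reported with a standard
error, from every table in the paper."""
```

From there, t-statistics are just coefficients divided by standard errors, which is all the forensics below need.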

Before I get into this, let me share a little about one of the more common ways that researchers try to check whether there is evidence of p-hacking. They look at density plots and check whether there is bunching at the 95% critical values associated with the normal distribution.

Specifically, 95% of all probability mass in the normal distribution is located within 1.96 standard deviations of the mean. The t-statistic has a sampling distribution, and when that value exceeds 1.96, you can say it's in the tails of the normal (on one side, it's in the top 2.5%). Not impossible, but considered extreme, and since it's measurable, it's the way we have historically recognized whether something could or could not reject a null hypothesis.

And the p-value is the probability of observing a t-statistic as extreme or more extreme than the one you got, assuming the null is true. So if |t| = 1.96 in a two-sided test, then 5% of the sampling distribution lies in both tails combined: 2.5% above 1.96 and 2.5% below -1.96. That's the p-value: 0.05.
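In code, under the normal approximation, that's one line:

```python
# Two-sided p-value under the normal approximation: the probability of a
# t-statistic at least as extreme as the one observed, if the null is true.
from scipy.stats import norm

def two_sided_p(t: float) -> float:
    return 2 * norm.sf(abs(t))  # sf(x) = 1 - cdf(x), the upper tail

print(round(two_sided_p(1.96), 3))  # 0.05: 2.5% in each tail
```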

So, in these forensic style plots, if we see unusual bunching of estimates around 1.96, it's viewed as a sign of p-hacking, since there really isn't any other explanation for why estimates would bunch at 1.96 but not 1.76.

Are AI Agents p-hacking? Yes.

Okay, then what did we find? Are AI Agents p-hacking? In a word: yes. They're p-hacking the crap out of these papers! Or at least the pattern is consistent with it. The median t-statistic across all papers is 1.94, almost exactly the significance threshold. Only 49.8% of coefficients are significant at 5%, which sounds reasonable until you notice that the distribution has a suspicious discontinuity right at the threshold.

This zoomed figure is striking. There's a very visible spike right at t = 1.96. The overall density is declining smoothly through t = 1.5, 1.6, 1.7, but then POW: a noticeable jump right at the threshold for conventional statistical significance. Let me be even more specific than that.

In a 0.2-wide window, there are 135 t-statistics just below 1.96 and 205 just above, a ratio of 1.52. That means 52% more mass just above the significance threshold than just below. For comparison, Brodeur et al. (2020) found a ratio of about 1.4 for human papers in top economics journals. So the AI is p-hacking slightly worse than us!
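For the curious, the bunching ratio is nothing fancy; it's just counting mass on either side of 1.96. A minimal sketch, where coefs and ses are the arrays extracted earlier and I read the 0.2-wide window as 0.1 on each side of the threshold:

```python
# Bunching ratio: t-statistics just above 1.96 relative to just below.
import numpy as np

def bunching_ratio(coefs, ses, half_window: float = 0.1) -> float:
    t = np.abs(np.asarray(coefs) / np.asarray(ses))
    below = np.sum((t >= 1.96 - half_window) & (t < 1.96))
    above = np.sum((t >= 1.96) & (t < 1.96 + half_window))
    return above / below

# With the counts reported above: 205 just above and 135 just below.
print(round(205 / 135, 2))  # 1.52
```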

But here's the really interesting part. When you break it down by method, which is something Brodeur has done before, the IV papers have the worst bunching ratio: 3.5. There are 3.5 times more t-statistics just above 1.96 than just below for IV papers. Now, the sample for IV is small. Only 20 papers out of 651 are classified as IV, and yet those 20 papers show by far the strongest bunching signature. The AI learned from a literature where IV papers were the most aggressively selected, and it reproduced that selection faithfully.

But the really bad ones are DiD and RDD. The ratio for DiD is 1.5, and it's 1.87 for RDD. This is exactly the ordering you'd predict if the AI learned from the human literature, because Brodeur et al. found that IV papers had the worst p-hacking, which is what we find here. But I don't think Brodeur et al. found RDD bunching at the critical values this severe.

Now here's the crazy thing. The AI is stating its explicit estimand. It's choosing a design. It's using publicly available data. It's cleaning it. Its code is published on GitHub. And yet it turns out that it's p-hacking, because we have the full distribution of the papers, and we have bunching at statistical significance thresholds without any motivation to publish!

Think about that for a second. These agents are not being incentivized to publish. They aren't being told to find statistically significant results. They're simply being told, via a prompt, to work on particular topics using real data which must be cleaned, and so on, and yet there appears to be a collective process at work that is producing p-hacking.

The primary vs. robustness figure tells a related story. The primary specifications cluster more tightly near 1.96 than the robustness checks, which are more dispersed. Maybe that is what you'd expect, because maybe the agent picks its "main result" partly based on significance, and the robustness checks are whatever they happen to be. We don't really know all that went into the actual coding part of this; we just know the final result, and the final code itself. We do not know how the sausage was made, only that it was made.

Now, this isn't p-hacking the way humans do it. A human researcher might run 50 specifications and report the one that's significant. The AI agent doesn't have that latitude, though; its pipeline is deterministic. But what it does have is a training corpus. It appears to have learned what "good" results look like from millions of published papers, and published papers are already p-hacked. So when the AI agent writes a paper and selects which results to emphasize, it's reproducing the distributional signature of p-hacking even though the mechanism is different. It's mimicking the output of a p-hacked literature because that's what it thinks a paper looks like.

Conclusion

I asked Claude to calculate how much we spent on this, because I lost track of how many times I sent these papers back to OpenAI for classification. Getting 651 full manuscripts classified by GPT-4o in batch at OpenAI cost $12.28.

That's $12.28 for everything in this Substack post.

It's $11.59 for GPT-4o to read all 651 papers and classify each paper's research design, its estimators, descriptions of the paper's rhetoric, data sources, and the stated (or not) estimand. And it cost another $0.69 for GPT-4o-mini to extract 3,510 coefficients and standard errors for the p-hacking analysis.

What a crazy world it is. P-hacking AI agents writing papers just like us.
