In Parts 1 and 2, I showed you the setup and the punchline: gpt-4o-mini agreed with the original RoBERTa classifier on only 69% of individual speeches, but the aggregate trends — partisan polarization, country-of-origin patterns, the whole historical arc — were virtually identical. Over 100,000 labels changed and yet the original story didn't.
That result was interesting at first, but then it kept bugging me. How can you reclassify 100,000 speeches, roughly saying that the original RoBERTa model was wrong, and yet all subsequent analysis finds almost exactly the same things? What does that even imply about measurement itself?
So yesterday I spent an hour working with Claude Code to extend the analysis by classifying the speeches a second time at OpenAI to test a conjecture. I had two conjectures in fact — that the speeches being reclassified were the "marginal speeches," and that they were canceling out because they were roughly symmetric from anti to neutral and from pro to neutral. And I wanted to check: if that was the case, did it then mean this was a special case of using one-shot LLMs over human annotation w/ RoBERTa, one that only worked because there was a built-in cancellation mechanism like there is with labels that are (-1, 0, and 1)? Would it work with four categories that don't cancel out (e.g., race categories)?
So today I spent another hour with Claude Code trying to figure out why. I don't explore that last question in today's video, but I do note that Claude Code web crawled until it found four new datasets with categorized text that will let me evaluate the "three-body problem". But for now, this post is going to be everything else but that.
Thanks for your support! This substack is a labor of love, and the Claude Code series stays free for the first several days. So if you want to keep reading it for free, just make sure to keep your eyes peeled for updates! But maybe consider becoming a paying subscriber too, since it's only $5/month, which is the price of a cup of coffee!
Jason Fletcher's Question
My friend Jason Fletcher — a health economist at Wisconsin — asked a really good question when I showed him the results: does the agreement break down for older speeches? Congressional language in the 1880s is nothing like the 2010s. If gpt-4o-mini is a creature of modern text, you'd expect it to struggle with 19th-century rhetoric.
We built two tests. The surprise: overall agreement barely moves. It's 70% in the 1880s and 69% in the modern era. The LLM handles 19th-century speech about as well as 21st-century speech.
But beneath that stable surface, the composition rotates dramatically. Pro-immigration agreement rises from 44% in the early period to 68% in the modern era. Neutral agreement falls from 91% to 80%. They cancel in aggregate — a different kind of balancing act, hiding in plain sight.
You can find the "beautiful deck" here if you want to peruse it yourself.
My Conjecture: Marginal Cases
Here's the hypothesis I kept coming back to. The key measure in Card et al. is net tone — the share of pro-immigration speeches minus the share of anti-immigration speeches. It's a difference. And when the LLM reclassifies, it's overwhelmingly pulling speeches toward neutral from both sides. 33% of Pro goes to Neutral. 44% of Anti goes to Neutral. Direct flips between Pro and Anti are rare — only about 4-5%.
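As a minimal sketch of that arithmetic (the tiny DataFrame and column names below are mine, purely for illustration, not the actual analysis files):

```python
import pandas as pd

# Toy stand-in for the merged speech-level file: labels coded
# -1 (anti), 0 (neutral), +1 (pro).
df = pd.DataFrame({"label_roberta": [1, 1, 0, -1, -1, 0, 1, -1],
                   "label_llm":     [0, 1, 0,  0, -1, 0, 1, -1]})

def net_tone(labels):
    """Share of pro-immigration speeches minus share of anti-immigration speeches."""
    return (labels == 1).mean() - (labels == -1).mean()

print(net_tone(df["label_roberta"]), net_tone(df["label_llm"]))

# Flow table: where do the original Pro/Anti/Neutral labels end up under the LLM?
flows = pd.crosstab(df["label_roberta"], df["label_llm"], normalize="index")
print(flows)  # rows are original labels, columns are LLM labels
```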
So think of it like two graders scoring essays as A, B, or C. They disagree on a third of the essays, but the class average is the same each semester. That only works if the disagreements cancel — if the strict grader moves borderline A's down to B's and borderline C's up to B's at roughly equal rates. The B pile grows. The average doesn't move.
I had Claude Code build two formal tests. A one-sample t-test rejects perfect symmetry — the mean delta in net tone is about 5 percentage points, and the symmetry ratio is 0.82 rather than 1.0. The LLM pulls harder from Anti than from Pro. But 5 points is small relative to the 40-60 point partisan swings that define the story. The mechanism is asymmetric but correlated, and large-sample averaging absorbs what's left.
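For the curious, the core of that test is a one-liner; the per-period deltas below are synthetic stand-ins, since I'm not reproducing the real congress-by-congress numbers here:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical per-period differences in net tone (LLM minus RoBERTa),
# in percentage points; placeholders for the real values.
delta_net_tone = rng.normal(loc=5.0, scale=3.0, size=70)

# H0: reclassification is perfectly symmetric, so the mean delta is zero.
t_stat, p_value = stats.ttest_1samp(delta_net_tone, popmean=0.0)
print(f"mean delta = {delta_net_tone.mean():.1f} pp, t = {t_stat:.2f}, p = {p_value:.3g}")
```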
The Thermometer
To push this further, I wanted to see where on the spectrum the reclassified speeches actually fall. So we sent all 305,000 speeches back to OpenAI — same speeches, same model — but this time asking for a continuous score from -100 (anti-immigration) to +100 (pro-immigration), with 0 as neutral.
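The exact prompt isn't reproduced in this post, but a request along these lines would do the job (the wording, function name, and one-call-per-speech structure here are my own sketch; the real run went through the batch API):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

THERMOMETER_PROMPT = (
    "Rate the following congressional speech about immigration on a scale from "
    "-100 (strongly anti-immigration) to +100 (strongly pro-immigration), "
    "with 0 meaning neutral. Reply with a single integer only.\n\nSpeech:\n{speech}"
)

def thermometer_score(speech_text: str) -> int:
    """One-shot continuous tone score for a single speech (illustrative only)."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": THERMOMETER_PROMPT.format(speech=speech_text)}],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip())
```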
The prediction: if reclassification is really about marginal cases, the speeches that got reclassified should cluster near zero. They were always borderline. The LLM just called them differently.
Getting the data back from OpenAI was its own journey. The batch submission kept hitting SSL errors around batch 17 — probably Dropbox syncing interfering with the uploads. Claude Code diagnosed this, added retry logic with exponential backoff, and pushed all 39 batches through. Another ~$11, another ~2.6 hours of processing time. The batch API is still absurdly cheap.
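I don't have the exact fix Claude Code wrote, but the retry wrapper was presumably something like this (file names and the broad exception handling are illustrative assumptions):

```python
import time
from openai import OpenAI

client = OpenAI()

def upload_with_retry(path: str, max_attempts: int = 5):
    """Upload one batch input file, retrying transient failures (e.g. SSL errors
    while Dropbox is syncing the file) with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return client.files.create(file=open(path, "rb"), purpose="batch")
        except Exception as exc:  # the real fix likely caught ssl.SSLError specifically
            if attempt == max_attempts:
                raise
            wait = 2 ** attempt  # 2, 4, 8, 16 seconds...
            print(f"{path}: attempt {attempt} failed ({exc}); retrying in {wait}s")
            time.sleep(wait)

# Hypothetical loop over the 39 batch input files
for i in range(1, 40):
    batch_file = upload_with_retry(f"batch_inputs/thermometer_batch_{i:02d}.jsonl")
    client.batches.create(input_file_id=batch_file.id,
                          endpoint="/v1/chat/completions",
                          completion_window="24h")
```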
Once the results came back, we merged three datasets: the original RoBERTa labels, the LLM tripartite labels, and the new thermometer scores. Then we tested the hypothesis three ways.
First, the distributions. We plotted thermometer scores separately for speeches where the classifiers agreed versus speeches that got reclassified. The reclassified Pro-to-Neutral speeches cluster near zero from the right. The reclassified Anti-to-Neutral speeches cluster near zero from the left. The speeches where both classifiers agreed sit further out toward the poles. Exactly what the hypothesis predicts.
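The comparison is just a pair of overlaid histograms split on a reclassification flag; here is a sketch with synthetic data standing in for the merged file (column names are mine):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for the merged speech-level file: "thermometer" is the
# -100..+100 score, "reclassified" flags speeches where the LLM label
# differs from RoBERTa's.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "thermometer": np.concatenate([rng.normal(60, 25, 500),    # agreed, pro side
                                   rng.normal(-60, 25, 500),   # agreed, anti side
                                   rng.normal(0, 15, 400)]),   # reclassified, near zero
    "reclassified": [False] * 1000 + [True] * 400,
})

agreed = df.loc[~df["reclassified"], "thermometer"]
flipped = df.loc[df["reclassified"], "thermometer"]

plt.hist(agreed, bins=50, alpha=0.5, density=True, label="Classifiers agree")
plt.hist(flipped, bins=50, alpha=0.5, density=True, label="Reclassified")
plt.axvline(0, linestyle="--", linewidth=1)
plt.xlabel("Thermometer score (-100 anti to +100 pro)")
plt.ylabel("Density")
plt.legend()
plt.show()
```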
Second, the means. Reclassified speeches have thermometer scores dramatically closer to zero than agreed speeches. The marginal-cases story holds up quantitatively, not just visually.
Third, and most formally: we ran logistic regressions asking whether proximity to zero on the thermometer predicts the probability of reclassification. It does. Speeches near the boundary are far more likely to get reclassified than speeches at the poles. The relationship is monotonic and strong.
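A minimal version of that regression, again on synthetic data and with variable names of my own choosing:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Synthetic stand-in, as above: reclassification is more likely near zero.
rng = np.random.default_rng(1)
thermometer = rng.uniform(-100, 100, 5000)
p_flip = 1 / (1 + np.exp((np.abs(thermometer) - 30) / 10))  # highest near zero
df = pd.DataFrame({"thermometer": thermometer,
                   "reclassified": rng.uniform(size=5000) < p_flip})

# Does proximity to zero predict the probability of reclassification?
df["proximity_to_zero"] = -df["thermometer"].abs()
X = sm.add_constant(df["proximity_to_zero"])
logit = sm.Logit(df["reclassified"].astype(int), X).fit(disp=False)
print(logit.summary())
# A positive coefficient on proximity_to_zero means speeches near the
# boundary are more likely to be reclassified than speeches at the poles.
```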
And here we see a summary of the trends for all three — the original RoBERTa model, the LLM tripartite reclassification from last week, and the new thermometer classification from today. Same thing. They all agree, even though RoBERTa was trained on 7,500 speeches annotated by students, while I just used a one-shot method and spent $10-11 per pass using OpenAI's batch requests, which are 50% off if you submit in batches.
The Three-Body Problem
But here's what I can't stop thinking about. This cancellation mechanism has a very specific structure: two poles and a center. Pro and Anti are +1 and -1 on a one-dimensional scale, and Neutral is the absorbing middle. Losses from both poles wash toward the center, and because the measure is a difference, they cancel.
What happens with four categories? Or five? Or twenty? If there's no single absorbing center, does the whole thing fall apart?
I called this the three-body problem — partly as a joke, partly because I think there's something genuinely structural about having exactly three categories with a symmetric setup.
To test this, I had Claude Code — running in a separate terminal with --dangerously-skip-permissions — search online for publicly available datasets with 4+ human-annotated categories. It found four: AG News (4 categories), SST-5 sentiment (5 categories on an ordinal scale), 20 Newsgroups (20 categories), and DBpedia-14 (14 ontological categories). It downloaded all of them, wrote READMEs for each, and organized them in the project directory.
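If you want to pull the same four benchmarks yourself, something like this should work; the Hugging Face dataset IDs below are my best guess at standard versions, not a record of which mirrors Claude Code actually grabbed:

```python
from datasets import load_dataset
from sklearn.datasets import fetch_20newsgroups

ag_news = load_dataset("ag_news")              # 4 news topics
dbpedia = load_dataset("dbpedia_14")           # 14 ontological classes
sst5 = load_dataset("SetFit/sst5")             # one of several SST-5 mirrors (5-point ordinal sentiment)
newsgroups = fetch_20newsgroups(subset="all")  # 20 discussion-group categories

print(ag_news["train"].features["label"].names)
print(len(newsgroups.target_names))
```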
I haven't run the analysis yet. That's tomorrow. But the plan is to classify all four datasets with gpt-4o-mini, compare with the original human labels, and see whether aggregate distributions are preserved the way they were for the immigration speeches. If the three-category setup is special, we should see distribution preservation break down as the number of categories increases.
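One way to summarize "distribution preservation" in that comparison is a simple distance between the human and LLM label shares; the metric choice below is mine, not necessarily what the eventual analysis will use:

```python
import numpy as np
import pandas as pd

def total_variation_distance(human_labels, llm_labels):
    """Half the L1 distance between the two empirical label distributions:
    0 means the aggregate distribution is perfectly preserved, 1 means the
    two classifiers produce completely disjoint label shares."""
    categories = sorted(set(human_labels) | set(llm_labels))
    p = pd.Series(human_labels).value_counts(normalize=True).reindex(categories, fill_value=0)
    q = pd.Series(llm_labels).value_counts(normalize=True).reindex(categories, fill_value=0)
    return 0.5 * np.abs(p - q).sum()

# Toy example with 4 categories: item-level disagreement can be high even
# when the aggregate shares barely move.
human = ["World", "Sports", "Business", "Sci/Tech"] * 250
llm   = ["Sports", "World", "Business", "Sci/Tech"] * 250
print(total_variation_distance(human, llm))  # 0.0 despite 50% item-level disagreement
```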
What's Easy and What's Hard with Claude Code
This series started as an experiment in what Claude Code can actually do. Three sessions in, I'm developing a clearer picture.
What's easy now: writing analysis scripts that follow established patterns, submitting batch API jobs, producing publication-quality figures, building Beamer decks, managing file organization, and debugging infrastructure problems like SSL errors. Claude Code handles all of this faster than I could.
What's still hard: the thinking. The conjecture about marginal cases — that was mine. The connection to the three-body problem — mine. The decision to use a thermometer to test it — mine. Jason's question about temporal stability — his. Claude Code is extraordinary at executing ideas, but the ideas still have to come from somewhere.
The most productive workflow I've found is what I'd call conversational direction. I think out loud. Claude Code listens, proposes, executes. I steer. It builds. The dialogue is the thinking process.
What's Next
Next week, after Valentine's, I'll run the external dataset analysis and see if the three-body hypothesis holds up. I'll also build a proper deck for the thermometer results — following the rhetoric-of-decks principles I've been developing, with assertion titles, TikZ intuition diagrams, and beautiful figures.
If you want to see where this goes, stick around.
Thanks for following along. All Claude Code posts are free when they first go out, though everything goes behind the paywall eventually. Usually I flip a coin on what gets paywalled, but for Claude Code, every new post starts free. If you like this series, I hope you'll consider becoming a paying subscriber — it's only $5/month or $50/year, the minimum Substack allows.












