
Learn how to generate random numbers in Stata



Overview

I describe how to generate random numbers and discuss some features added in Stata 14. In particular, Stata 14 includes a new default random-number generator (RNG) called the Mersenne Twister (Matsumoto and Nishimura 1998), a new function that generates random integers, the ability to generate random numbers over an interval, and several new functions that generate random variates from nonuniform distributions.

Random numbers from the uniform distribution

In the example below, we use runiform() to create a simulated dataset with 10,000 observations on a (0,1)-uniform variable. Before using runiform(), we set the seed so that the results are reproducible.


. set obs 10000
number of observations (_N) was 0, now 10,000

. set seed 98034

. generate u1 = runiform()

The mean of a (0,1)-uniform is .5, and the standard deviation is \(\sqrt{1/12}\approx .289\). The estimates from the simulated data reported in the output below are close to the true values.


. summarize u1

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
          u1 |     10,000    .5004244    .2865088   .0000502    .999969

To draw uniform variates over (a, b) instead of over (0, 1), we specify runiform(a, b). In the example below, we draw uniform variates over (1, 2) and then estimate the mean and the standard deviation, which we can compare with their theoretical values of 1.5 and \(\sqrt{1/12}\approx .289\).


. generate u2 = runiform(1, 2)

. summarize u2

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
          u2 |     10,000    1.495698    .2887136   1.000088   1.999899

To draw integers uniformly over {a, a+1, …, b}, we specify runiformint(a, b). In the example below, we draw integers uniformly over {0, 1, …, 100} and then estimate the mean and the standard deviation, which we can compare with their theoretical values of 50 and \(\sqrt{(101^2-1)/12}\approx 29.155\).


. generate u3 = runiformint(0, 100)

. summarize u3

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
          u3 |     10,000     49.9804    29.19094          0        100
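For readers following along outside Stata, the same checks can be reproduced in Python with numpy (an illustrative analogue, not the author's code; numpy's default PCG64 generator differs from Stata's Mersenne Twister, so the individual draws will not match Stata's output):

```python
import numpy as np

# numpy's default generator is PCG64, not Stata's Mersenne Twister, so the
# draws differ from Stata's; the sample moments, however, are comparable.
rng = np.random.default_rng(98034)

u1 = rng.uniform(0, 1, size=10_000)     # analogue of runiform()
u2 = rng.uniform(1, 2, size=10_000)     # analogue of runiform(1, 2)
u3 = rng.integers(0, 101, size=10_000)  # analogue of runiformint(0, 100); upper bound is exclusive

print(u1.mean(), u1.std())  # roughly 0.5 and 0.289
print(u2.mean(), u2.std())  # roughly 1.5 and 0.289
print(u3.mean(), u3.std())  # roughly 50 and 29.155
```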

Set the seed and make results reproducible

We use set seed # to obtain the same random numbers each time, which makes the subsequent results reproducible. RNGs are based on a recursive formula. The “random” numbers produced are actually deterministic, but they appear to be random. Setting the seed specifies a starting place for the recursion, which causes the random numbers to be the same, as in the example below.


. drop _all

. set obs 6
number of observations (_N) was 0, now 6

. set seed 12345

. generate x = runiform()

. set seed 12345

. generate y = runiform()

. list x y

     +---------------------+
     |        x          y |
     |---------------------|
  1. | .3576297   .3576297 |
  2. | .4004426   .4004426 |
  3. | .6893833   .6893833 |
  4. | .5597356   .5597356 |
  5. | .5744513   .5744513 |
     |---------------------|
  6. | .2076905   .2076905 |
     +---------------------+
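The same-seed-same-sequence behavior is not specific to Stata. A minimal numpy sketch of the same idea (illustrative only, not Stata's generator):

```python
import numpy as np

# Two generators seeded identically produce identical sequences.
x = np.random.default_rng(12345).uniform(size=6)
y = np.random.default_rng(12345).uniform(size=6)
print(np.array_equal(x, y))  # True: same seed, same numbers
```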

Each time Stata is launched, the seed is set to 123456789.

After generating \(N\) random numbers, the RNG wraps around and starts producing the same sequence over again. \(N\) is known as the period of the RNG. Larger periods are better because we get more random numbers before the sequence wraps. The period of the Mersenne Twister is \(2^{19937}-1\), which is huge. Large periods are important when performing complicated simulation studies.
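To see concretely how an RNG wraps around, consider a deliberately tiny linear congruential generator in Python (a toy for illustration only; its period of 16 is astronomically smaller than any practical RNG's):

```python
# A deliberately tiny linear congruential generator; real RNGs work the same
# way but with vastly longer periods. The parameters satisfy the Hull-Dobell
# conditions, so the period is the full modulus m = 16.
def tiny_lcg(state, a=5, c=3, m=16):
    while True:
        state = (a * state + c) % m
        yield state

gen = tiny_lcg(7)
seq = [next(gen) for _ in range(32)]

print(seq[:16] == seq[16:])                 # True: after 16 draws the sequence wraps
print(sorted(seq[:16]) == list(range(16)))  # True: every state is visited once per period
```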

In Stata, the seed is a positive integer (between 0 and \(2^{31}-1\)) that Stata maps onto the state of the RNG. The state of an RNG corresponds to a place in the sequence. The mapping is not one-to-one because there are more states than seeds. If you want to pick up where you left off in the sequence, you need to restore the state, as in the example below.


. drop _all

. set obs 3
number of observations (_N) was 0, now 3

. set seed 12345

. generate x = runiform()

. local state `c(rngstate)'

. generate y = runiform()

. set rngstate `state'

. generate z = runiform()

. list

     +--------------------------------+
     |        x          y          z |
     |--------------------------------|
  1. | .3576297   .5597356   .5597356 |
  2. | .4004426   .5744513   .5744513 |
  3. | .6893833   .2076905   .2076905 |
     +--------------------------------+

After dropping the data and setting the number of observations to 3, we use generate to put random variates in x, store the state of the RNG in the local macro state, and then put random numbers in y. Next, we use set rngstate to restore the state to what it was before we generated y, and then we generate z. The random numbers in z are the same as those in y because restoring the state caused Stata to start at the same place in the sequence as before we generated y. See Programming an estimation command in Stata: Where to store your stuff for an introduction to local macros.
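numpy exposes the same save-and-restore pattern through the generator's state object (an illustrative analogue of c(rngstate) and set rngstate, not Stata code):

```python
import numpy as np

rng = np.random.default_rng(12345)
x = rng.uniform(size=3)

saved = rng.bit_generator.state   # analogue of c(rngstate): capture the state
y = rng.uniform(size=3)

rng.bit_generator.state = saved   # analogue of set rngstate: restore it
z = rng.uniform(size=3)

print(np.array_equal(y, z))  # True: z repeats y because we rewound the state
```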

Random variates from various distributions

So far, we have discussed generating uniformly distributed random numbers. Stata also provides functions that generate random numbers from other distributions. The function names are easy to remember: the letter r followed by the name of the distribution. Some common examples are rnormal(), rbeta(), and rweibull(). In the example below, we draw 5,000 observations from a standard normal distribution and summarize the results.


. drop _all

. set seed 12345

. set obs 5000
number of observations (_N) was 0, now 5,000

. generate w = rnormal()

. summarize w

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
           w |      5,000    .0008946    .9903156  -3.478898   3.653764

The estimated mean and standard deviation are close to their true values of 0 and 1.

A note on precision

So far, we generated random numbers with the default data type of float. Generating the random numbers with type double makes ties occur less frequently. Ties can still occur with type double because the large period of the Mersenne Twister exceeds the precision of \(2^{-53}\), so a long enough sequence of random numbers will have repeated numbers.
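The effect of storage type on ties is easy to demonstrate in Python (an illustrative sketch; numpy's float32/float64 play the role of Stata's float/double):

```python
import numpy as np

rng = np.random.default_rng(1)
u = rng.uniform(size=1_000_000)  # numpy draws are 64-bit doubles

ties_float  = len(u) - len(np.unique(u.astype(np.float32)))  # ties after rounding to 32-bit
ties_double = len(u) - len(np.unique(u))                     # ties at full 64-bit precision
print(ties_float, ties_double)  # many ties in float, almost surely none in double
```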

Conclusion

In this post, I showed how to generate random numbers using random-number functions in Stata. I also discussed how to make results reproducible by setting the seed. In subsequent posts, I will delve into other aspects of RNGs, including methods to generate random variates from other distributions and in Mata.

Reference

Matsumoto, M., and T. Nishimura. 1998. Mersenne Twister: A 623-dimensionally equidistributed uniform pseudo-random number generator. ACM Transactions on Modeling and Computer Simulation 8: 3–30.



Mechanistic Interpretability: Peeking Inside an LLM



Intro

How to examine and manipulate an LLM’s neural network: that is the subject of mechanistic interpretability research, and it can answer many exciting questions.

Remember: An LLM is a deep artificial neural network, made up of neurons and weights that determine how strongly those neurons are connected. What makes a neural network arrive at its conclusion? How much of the information it processes does it consider and analyze adequately?

These kinds of questions have been investigated in a vast number of publications at least since deep neural networks began showing promise. To be clear, mechanistic interpretability existed before LLMs did, and was already an exciting aspect of Explainable AI research with earlier deep neural networks. For instance, identifying the salient features that trigger a CNN to arrive at a given object classification or vehicle steering direction can help us understand how trustworthy and reliable the network is in safety-critical situations.

But with LLMs, the topic really took off and became far more interesting. Are the human-like cognitive abilities of LLMs real or fake? How does information travel through the neural network? Is there hidden knowledge inside an LLM?

In this post, you will find:

  • A refresher on LLM architecture
  • An introduction to interpretability methods
  • Use cases
  • A discussion of past research

In a follow-up article, we’ll look at Python code to apply some of these techniques, visualize the activations of the neural network, and more.

Refresher: The design of an LLM

For the purpose of this article, we need a basic understanding of the spots in the neural network where it is worth hooking in, to derive potentially useful information in the process. Therefore, this section is a quick reminder of the components of an LLM.

LLMs use a sequence of input tokens to predict the next token.

The inner workings of an LLM: Input tokens are embedded into a combined matrix, and transformer blocks enrich this hidden state with more context. The residual stream can then be unembedded to determine the token predictions. (Image by author)

Tokenizer: First, sentences are segmented into tokens. The purpose of the token vocabulary is to turn frequently used sub-words into single tokens. Each token has a unique ID.

However, tokens can be confusing and messy, since they provide an inaccurate representation of many things, including numbers and individual characters. Asking an LLM to calculate or to count letters is a fairly unfair thing to do. (With specialized embedding schemes, their performance can improve [1].)

Embedding: A look-up table is used to assign each token ID to an embedding vector of a given dimensionality. The look-up table is learned (i.e., derived during the neural network training), and tends to place co-occurring tokens closer together in the embedding space. The dimensionality of the embedding vectors is an important trade-off between the capabilities of LLMs and computing effort. Since the order of the tokens would otherwise not be apparent in subsequent steps, positional encoding is added to these embeddings. In rotary positional encoding, the cosine of the token position can be used. The embedding vectors of all input tokens provide the matrix that the LLM processes, the initial hidden states. As the LLM operates with this matrix, which moves through layers as the residual stream (also called the hidden state or representation space), it works in latent space.
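As a toy illustration of this step, here is how token IDs could become the initial hidden-state matrix (all sizes, token IDs, and weights below are invented placeholders; a real LLM learns its embedding table during training, and this sketch uses the classic sinusoidal encoding rather than rotary):

```python
import numpy as np

vocab_size, d_model, seq_len = 100, 16, 4
rng = np.random.default_rng(0)

# Placeholder for a learned embedding table: one row per vocabulary entry.
embedding_table = rng.normal(size=(vocab_size, d_model))

def sinusoidal_positions(seq_len, d_model):
    """Classic sinusoidal positional encoding (one common scheme)."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

token_ids = np.array([17, 3, 42, 9])  # hypothetical output of the tokenizer
hidden = embedding_table[token_ids] + sinusoidal_positions(seq_len, d_model)
print(hidden.shape)  # (4, 16): one row per token -- the initial residual stream
```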

Modalities other than text: LLMs can work with modalities other than text. In those cases, the tokenizer and embedding are modified to accommodate different modalities, such as sound or images.

Transformer blocks: A number of transformer blocks (dozens) refine the residual stream, adding context and more meaning. Each transformer layer consists of an attention component [2] and an MLP component. These components are fed the normalized hidden state. The output is then added to the residual stream.

  • Attention: Multiple attention heads (also dozens) add weighted information from source tokens to destination tokens (in the residual stream). Each attention head’s “nature” is parametrized via three learned matrices WQ, WK, WV, which essentially determine what the attention head is specialized in. Queries, keys, and values are calculated by multiplying these matrices with the hidden states for all tokens. The attention weights are then computed for each destination token from the softmax of the scaled dot products of the query and the key vectors of the source tokens. This attention weight describes the strength of the connection between the source and the destination for a given specialization of the attention head. Finally, the head outputs a weighted sum of the source tokens’ value vectors, and all the heads’ outputs are concatenated and passed through a learned output projection WO.
  • MLP: A fully connected feedforward network. This linear-nonlinear-linear operation is applied independently at each position. MLP networks typically contain a large share of the parameters in an LLM.
    MLP networks store much of the knowledge. Later layers tend to contain more semantic and less shallow knowledge [3]. This is relevant when deciding where to probe or intervene. (With some effort, these knowledge representations can be modified in a trained LLM via weight modification [4] or residual stream intervention [5].)
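A single attention head as described above can be sketched in a few lines of numpy (dimensions and weight matrices are random placeholders, not a trained model; a causal mask is included, as used in decoder-only LLMs):

```python
import numpy as np

d_model, d_head, seq_len = 16, 4, 5
rng = np.random.default_rng(0)

X = rng.normal(size=(seq_len, d_model))  # hidden states, one row per token
# Random placeholders for the three learned matrices of one head.
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_head)) for _ in range(3))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V      # queries, keys, values

scores = Q @ K.T / np.sqrt(d_head)       # scaled dot products
scores += np.triu(np.full((seq_len, seq_len), -np.inf), k=1)  # causal mask
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax

head_out = weights @ V                   # weighted sum of the value vectors
print(weights.sum(axis=-1))              # each row sums to 1
print(head_out.shape)                    # (5, 4)
```

In a real model, the outputs of all heads would then be concatenated and passed through the output projection WO.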

Unembedding: The final residual stream values are normalized and linearly mapped back to the vocabulary size to produce the logits for each input token position. Usually, we only need the prediction for the token following the last input token, so we use that one. The softmax function converts the logits for the final position into a probability distribution. One option is then chosen from this distribution (e.g., the most likely or a sampling-based option) as the next predicted token.
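The unembedding step amounts to a matrix product followed by a softmax; a toy numpy sketch with random placeholder weights:

```python
import numpy as np

d_model, vocab_size = 16, 100
rng = np.random.default_rng(0)

final_hidden = rng.normal(size=d_model)       # residual stream at the last position
W_U = rng.normal(size=(d_model, vocab_size))  # placeholder unembedding matrix

logits = final_hidden @ W_U
probs = np.exp(logits - logits.max())         # subtract max for numerical stability
probs /= probs.sum()                          # softmax: a probability distribution

next_token = int(np.argmax(probs))            # greedy choice; sampling is also possible
print(next_token, probs[next_token])
```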

If you want to learn more about how LLMs work and gain more intuition, Stephen McAleese’s [6] explanation is excellent.

Now that we have looked at the architecture, the question to ask is: What do the intermediate states of the residual stream mean? How do they relate to the LLM’s output? Why does this work?

Introduction to interpretability methods

Let’s take a look at our toolbox. Which components will help us answer our questions, and which methods can we apply to analyze them? Our options include:

  • Neurons:
    We can observe the activation of individual neurons.
  • Attention:
    We can observe the output of individual attention heads in each layer.
    We can observe the queries, keys, values, and attention weights of each attention head for each position and layer.
    We can observe the concatenated outputs of all attention heads in each layer.
  • MLP:
    We can observe the MLP output in each layer.
    We can observe the neural activations in the MLP networks.
    We can observe the LayerNorm mean/variance to track scale, saturation, and outliers.
  • Residual stream:
    We can observe the residual stream at each position, in each layer.
    We can unembed the residual stream in intermediate layers, to observe what would happen if we stopped there; earlier layers often yield more shallow predictions. (This is a useful diagnostic, but not fully reliable, since the unembedding mapping was trained for the final layer.)
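The last option, unembedding intermediate layers, is often called the logit lens. A schematic numpy sketch (random placeholder vectors stand in for a real model's recorded residual stream and unembedding matrix):

```python
import numpy as np

d_model, vocab_size, n_layers = 16, 100, 4
rng = np.random.default_rng(0)

W_U = rng.normal(size=(d_model, vocab_size))  # placeholder unembedding matrix
# Pretend these are the residual-stream vectors recorded after each layer.
residual_per_layer = [rng.normal(size=d_model) for _ in range(n_layers)]

# Apply the final unembedding at every intermediate layer and compare predictions.
for layer, h in enumerate(residual_per_layer):
    logits = h @ W_U
    print(f"layer {layer}: top token id = {int(np.argmax(logits))}")
```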

We can also derive more information:

  • Linear probes and classifiers: We can build a system that classifies the recorded residual stream into one group or another, or measures some feature within it.
  • Gradient-based attributions: We can compute the gradient of a particular output with respect to some or all of the neural values. The gradient magnitude indicates how sensitive the prediction is to changes in those values.
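A linear probe is just a small supervised model fitted to recorded activations. A self-contained numpy sketch with synthetic stand-in "activations" (a real probe would be trained on vectors recorded from an LLM):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 16

# Synthetic residual-stream vectors in which the direction w_true encodes a
# binary feature; in practice X would be recorded from a model.
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = (X @ w_true > 0).astype(float)

# Fit a least-squares linear probe on centered labels and measure accuracy.
w_probe, *_ = np.linalg.lstsq(X, y - 0.5, rcond=None)
accuracy = np.mean((X @ w_probe > 0) == (y > 0.5))
print(accuracy)  # close to 1.0 on this synthetic task
```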

All of this can be done while a given, static LLM runs an inference on a given prompt, or while we actively intervene:

  • Comparison of multiple inferences: We can swap, train, modify, or replace the LLM, or have it process different prompts, and record the aforementioned information.
  • Ablation: We can zero out neurons, heads, MLP blocks, or vectors in the residual stream and watch how it affects behavior. For example, this allows us to measure the contribution of a head, neuron, or pathway to token prediction.
  • Steering: We can actively steer the LLM by replacing or otherwise modifying activations in the residual stream.
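The contrastive recipe for building a steering vector can be sketched abstractly (toy random vectors stand in for residual streams recorded from two opposite prompts; the scaling factor is an arbitrary illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16

# Pretend these were recorded at a middle layer for a "happy" and a "sad" prompt.
resid_happy = rng.normal(size=d_model) + 2.0
resid_sad = rng.normal(size=d_model) - 2.0

steering_vector = resid_happy - resid_sad  # difference of the two recordings

# During a later inference, add a scaled copy of the vector to the residual stream.
resid_new = rng.normal(size=d_model)
resid_steered = resid_new + 0.5 * steering_vector
print(resid_steered.shape)
```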

Use cases

The interpretability methods discussed represent a vast arsenal that can be applied to many different use cases.

  • Model performance improvement or behavior steering via activation steering: For instance, in addition to a system prompt, a model’s behavior can be steered toward a certain trait or focus dynamically, without changing the model.
  • Explainability: Methods such as steering vectors, sparse autoencoders, and circuit tracing can be used to understand what the model does and why, based on its activations.
  • Safety: Detecting and discouraging undesirable features during training, or implementing run-time supervision to interrupt a model that is deviating. Detecting new or dangerous capabilities.
  • Drift detection: During model development, it is important to understand when a newly trained model is behaving differently and to what extent.
  • Training improvement: Understanding the contribution of parts of the model’s behavior to its overall performance optimizes model development. For example, unnecessary Chain-of-Thought steps can be discouraged during training, which leads to smaller, faster, or potentially more powerful models.
  • Scientific and linguistic learnings: Use the models as an object of study to better understand AI, language acquisition, and cognition.

LLM interpretability research

The field of interpretability has steadily developed over the past few years, answering exciting questions along the way. Just three years ago, it was unclear whether or not the learnings outlined below would manifest. This is a brief history of key insights:

  • In-context learning and pattern understanding: During LLM training, some attention heads acquire the ability to collaborate as pattern identifiers, greatly enhancing an LLM’s in-context learning capabilities [7]. Thus, some parts of LLMs represent algorithms that enable capabilities applicable outside the space of the training data.
  • World understanding: Do LLMs memorize all of their answers, or do they understand the content in order to form an internal mental model before answering? This topic has been heavily debated, and the first convincing evidence that LLMs create an internal world model was published at the end of 2022. To demonstrate this, the researchers recovered the board state of the game Othello from the residual stream [8, 9]. Many more indications followed swiftly. Space and time neurons were identified [10].
  • Memorization or generalization: Do LLMs merely regurgitate what they have seen before, or do they reason for themselves? The evidence here was somewhat unclear [11]. Intuitively, smaller LLMs form smaller world models (i.e., in 2023, the evidence for generalization was less convincing than in 2025). Newer benchmarks [12, 13] aim to limit contamination with material that may be within a model’s training data, and focus specifically on the generalization capability. Model performance there is still substantial.
    LLMs develop deeper generalization abilities for some concepts during their training. To quantify this, signals from interpretability methods have been used [14].
  • Superposition: Properly trained neural networks compress knowledge and algorithms into approximations. Because there are more features than there are dimensions to represent them, this leads to so-called superposition, where polysemantic neurons may contribute to multiple features of a model [15]. See Superposition: What Makes it Difficult to Explain Neural Network (Shuyang) for an explanation of this phenomenon. Basically, because neurons act in multiple capacities, interpreting their activation can be ambiguous and difficult. This is a major reason why interpretability research focuses more on the residual stream than on the activation of individual, polysemantic neurons.
  • Representation engineering: Beyond surface information, such as board states, space, and time, it is possible to identify semantically meaningful vector directions within the residual stream [16]. Once a direction is identified, it can be examined or modified. This can be used to identify or influence hidden behaviors, among other things.
  • Latent knowledge: Do LLMs possess internal knowledge that they keep to themselves? They do, and methods for discovering latent knowledge aim to extract it [17, 18]. If a model knows something that is not reflected in its prediction output, this is highly relevant to explainability and safety. Attempts have been made to audit such hidden objectives, which can be inserted into a model inadvertently or purposely, for research purposes [19].
  • Steering: The residual stream can be manipulated with such an additional activation vector to change the model’s behavior in a targeted manner [20]. To determine this steering vector, one can record the residual stream during two consecutive runs (inferences) with opposite prompts and subtract one from the other. For instance, this can flip the style of the generated output from happy to sad, or from safe to dangerous. The activation vector is usually injected into a middle layer of the neural network. Similarly, a steering vector can be used to measure how strongly a model responds in a given direction.
    Steering methods have been tried to reduce lies, hallucinations, and other undesirable tendencies of LLMs. However, this does not always work reliably. Efforts have been made to develop measures of how well a model can be guided toward a given concept [21].
  • Chess: The board state of chess games, as well as the language model’s estimation of the opponent’s skill level, can also be recovered from the residual stream [22]. Modifying the vector representing the expected skill level was also used to improve the model’s performance in the game.
  • Refusals: It was found that refusals could be prevented or elicited using steering vectors [23]. This suggests that some safety behaviors may be linearly accessible.
  • Emotion: LLMs can derive emotional states from a given input text, which can be measured. The results are consistent and psychologically plausible in light of cognitive appraisal theory [24]. This is interesting because it suggests that LLMs can mirror many of our human tendencies in their world models.
  • Features: As mentioned earlier, neurons in an LLM are not very helpful for understanding what is happening internally.
    Initially, OpenAI tried to have GPT-4 guess which features the neurons respond to, based on their activation in response to different example texts [25]. In 2023, Anthropic and others joined this major topic and applied auto-encoder neural networks to automate the interpretation of the residual stream [26, 27]. Their work enables the mapping of the residual stream into monosemantic features that describe an interpretable attribute of what is happening. However, it was later shown that not all of these features are one-dimensionally linear [28].
    The automation of feature analysis remains a topic of interest and research, with more work being done in this area [29].
    Currently, Anthropic, Google, and others are actively contributing to Neuronpedia, a mecca for researchers studying interpretability.
  • Hallucinations: LLMs sometimes produce untrue statements, or “hallucinate.” Mechanistic interventions have been used to identify the causes of hallucinations and mitigate them [30, 31].
    Features suitable for probing and influencing hallucinations have also been identified [32]. Accordingly, the model has some “self-knowledge” of when it is producing incorrect statements.
  • Circuit tracing: In LLMs, circuit analysis, i.e., the analysis of the interaction of attention heads and MLPs, allows for the precise attribution of behaviors to such circuits [33, 34]. Using this method, researchers can determine not only where information is within the residual stream but also how the given model computed it. Efforts are ongoing to do this on a larger scale.
  • Human brain comparisons and insights: Neural activity from humans has been compared to activations in OpenAI’s Whisper speech-to-text model [35]. Surprising similarities were found. However, this should not be overinterpreted; it may simply be a sign that LLMs have acquired effective strategies. Interpretability research allows such analyses to be carried out in the first place.
  • Self-referential first-person view and claims of consciousness: Interestingly, suppressing features related to deception led to more claims of consciousness and deeper self-referential statements by LLMs [36]. Again, the results should not be overinterpreted, but they are interesting to consider as LLMs become more capable and challenge us more often.

This review demonstrated the power of causal interventions on internal activations. Rather than relying on correlational observations of a black-box system, the system can be dissected and analyzed.

Conclusion

Interpretability is an exciting research area that provides surprising insights into an LLM’s behavior and capabilities. It can even reveal interesting parallels to human cognition. Many (mostly narrow) LLM behaviors can be explained for a given model to produce useful insights. However, the sheer number of models and the number of possible questions to ask will likely prevent us from fully deciphering any large model, or even all of them, as the enormous time investment may simply not yield sufficient benefit. This is why shifts to automated analysis are happening, to apply mechanistic insight systematically.

These methods are valuable additions to our toolbox in both industry and research, and all users of future AI systems may benefit from these incremental insights. They enable improvements in reliability, explainability, and safety.

Contact

This is a complex and extensive topic, and I’m happy to receive pointers, comments, and corrections. Feel free to send a message to jvm (at) taggedvision.com

References

  • [1] McLeish, Sean, Arpit Bansal, Alex Stein, Neel Jain, John Kirchenbauer, Brian R. Bartoldson, Bhavya Kailkhura, et al. 2024. “Transformers Can Do Arithmetic with the Right Embeddings.” Advances in Neural Information Processing Systems 37: 108012–41. doi:10.52202/079017-3430.
  • [2] Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. “Attention Is All You Need.” Advances in Neural Information Processing Systems 30: 5999–6009.
  • [3] Geva, Mor, Roei Schuster, Jonathan Berant, and Omer Levy. 2021. “Transformer Feed-Forward Layers Are Key-Value Memories.” doi:10.48550/arXiv.2012.14913.
  • [4] Meng, Kevin, Arnab Sen Sharma, Alex Andonian, Yonatan Belinkov, and David Bau. 2023. “Mass-Editing Memory in a Transformer.” doi:10.48550/arXiv.2210.07229.
  • [5] Hernandez, Evan, Belinda Z. Li, and Jacob Andreas. “Inspecting and Editing Knowledge Representations in Language Models.” https://github.com/evandez/REMEDI.
  • [6] Stephen McAleese. 2025. “Understanding LLMs: Insights from Mechanistic Interpretability.” https://www.lesswrong.com/posts/XGHf7EY3CK4KorBpw/understanding-llms-insights-from-mechanistic
  • [7] Olsson, et al. 2022. “In-context Learning and Induction Heads.” Transformer Circuits Thread. https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html
  • [8] Li, Kenneth, Aspen K. Hopkins, David Bau, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. 2023. “Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task.” https://arxiv.org/abs/2210.13382v4.
  • [9] Nanda, Neel, Andrew Lee, and Martin Wattenberg. 2023. “Emergent Linear Representations in World Models of Self-Supervised Sequence Models.” https://arxiv.org/abs/2309.00941v2
  • [10] Gurnee, Wes, and Max Tegmark. 2023. “Language Models Represent Space and Time.” https://arxiv.org/abs/2310.02207v1.
  • [11] Wu, Zhaofeng, Linlu Qiu, Alexis Ross, Ekin Akyürek, Boyuan Chen, Bailin Wang, Najoung Kim, Jacob Andreas, and Yoon Kim. 2023. “Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks.” https://arxiv.org/abs/2307.02477v1.
  • [12] “An Investigation of Robustness of LLMs in Mathematical Reasoning: Benchmarking with Mathematically-Equivalent Transformation of Advanced Mathematical Problems.” 2025. https://openreview.net/forum?id=Tos7ZSLujg
  • [13] White, Colin, Samuel Dooley, Manley Roberts, Arka Pal, Ben Feuer, Siddhartha Jain, Ravid Shwartz-Ziv, et al. 2025. “LiveBench: A Challenging, Contamination-Limited LLM Benchmark.” doi:10.48550/arXiv.2406.19314.
  • [14] Nanda, Neel, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. 2023. “Progress Measures for Grokking via Mechanistic Interpretability.” doi:10.48550/arXiv.2301.05217.
  • [15] Elhage, Nelson, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, et al. 2022. “Toy Models of Superposition.” https://arxiv.org/abs/2209.10652v1 (February 18, 2024).
  • [16] Zou, Andy, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, et al. 2023. “Representation Engineering: A Top-Down Approach to AI Transparency.”
  • [17] Burns, Collin, Haotian Ye, Dan Klein, and Jacob Steinhardt. 2022. “Discovering Latent Knowledge in Language Models Without Supervision.”
  • [18] Cywiński, Bartosz, Emil Ryd, Senthooran Rajamanoharan, and Neel Nanda. 2025. “Towards Eliciting Latent Knowledge from LLMs with Mechanistic Interpretability.” doi:10.48550/arXiv.2505.14352.
  • [19] Marks, Samuel, Johannes Treutlein, Trenton Bricken, Jack Lindsey, Jonathan Marcus, Siddharth Mishra-Sharma, Daniel Ziegler, et al. “Auditing Language Models for Hidden Objectives.”
  • [20] Turner, Alexander Matt, Lisa Thiergart, David Udell, Gavin Leech, Ulisse Mini, and Monte MacDiarmid. 2023. “Activation Addition: Steering Language Models Without Optimization.” https://arxiv.org/abs/2308.10248v3.
  • [21] Rütte, Dimitri von, Sotiris Anagnostidis, Gregor Bachmann, and Thomas Hofmann. 2024. “A Language Model’s Guide Through Latent Space.” doi:10.48550/arXiv.2402.14433.
  • [22] Karvonen, Adam. “Emergent World Models and Latent Variable Estimation in Chess-Playing Language Models.” https://github.com/adamkarvonen/chess.
  • [23] Arditi, Andy, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. 2024. “Refusal in Language Models Is Mediated by a Single Direction.” doi:10.48550/arXiv.2406.11717.
  • [24] Tak, Ala N., Amin Banayeeanzade, Anahita Bolourani, Mina Kian, Robin Jia, and Jonathan Gratch. 2025. “Mechanistic Interpretability of Emotion Inference in Large Language Models.” doi:10.48550/arXiv.2502.05489.
  • [25] Bills, Steven, Nick Cammarata, Dan Mossing, Henk Tillman, Leo Gao, Gabriel Goh, Ilya Sutskever, Jan Leike, Jeff Wu, and William Saunders. 2023. “Language Models Can Explain Neurons in Language Models.” https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html.
  • [26] “Towards Monosemanticity: Decomposing Language Models With Dictionary Learning.” https://transformer-circuits.pub/2023/monosemantic-features/index.html.
  • [27] Cunningham, Hoagy, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. 2023. “Sparse Autoencoders Find Highly Interpretable Features in Language Models.”
  • [28] Engels, Joshua, Eric J. Michaud, Isaac Liao, Wes Gurnee, and Max Tegmark. 2025. “Not All Language Model Features Are One-Dimensionally Linear.” doi:10.48550/arXiv.2405.14860.
  • [29] Shaham, Tamar Rott, Sarah Schwettmann, Franklin Wang, Achyuta Rajaram, Evan Hernandez, Jacob Andreas, and Antonio Torralba. 2025. “A Multimodal Automated Interpretability Agent.” doi:10.48550/arXiv.2404.14394.
  • [30] Chen, Shiqi, Miao Xiong, Junteng Liu, Zhengxuan Wu, Teng Xiao, Siyang Gao, and Junxian He. 2024. “In-Context Sharpness as Alerts: An Inner Representation Perspective for Hallucination Mitigation.” doi:10.48550/arXiv.2403.01548.
  • [31] Yu, Lei, Meng Cao, Jackie C. K. Cheung, and Yue Dong. 2024. “Mechanistic Understanding and Mitigation of Language Model Non-Factual Hallucinations.” In Findings of the Association for Computational Linguistics: EMNLP 2024, eds. Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen. Miami, Florida, USA: Association for Computational Linguistics, 7943–56. doi:10.18653/v1/2024.findings-emnlp.466.
  • [32] Ferrando, Javier, Oscar Obeso, Senthooran Rajamanoharan, and Neel Nanda. 2025. “Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models.”
  • [33] Lindsey, et al. 2025. “On the Biology of a Large Language Model.” Transformer Circuits.
  • [34] Wang, Kevin, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. 2022. “Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small.” http://arxiv.org/abs/2211.00593.
  • [35] “Deciphering Language Processing in the Human Brain through LLM Representations.” https://research.google/blog/deciphering-language-processing-in-the-human-brain-through-llm-representations/
  • [36] Berg, Cameron, Diogo de Lucena, and Judd Rosenblatt. 2025. “Massive Language Fashions Report Subjective Expertise Beneath Self-Referential Processing.” doi:10.48550/arXiv.2510.24797.

Anthropic Releases Claude Opus 4.6 With 1M Context, Agentic Coding, Adaptive Reasoning Controls, and Expanded Security Tooling Capabilities


Anthropic has released Claude Opus 4.6, its most capable model to date, focused on long-context reasoning, agentic coding, and high-value knowledge work. The model builds on Claude Opus 4.5 and is now available on claude.ai, the Claude API, and major cloud providers under the ID claude-opus-4-6.

Model focus: agentic work, not single answers

Opus 4.6 is designed for multi-step tasks where the model must plan, act, and revise over time. The Anthropic team uses it in Claude Code and reports that it focuses more on the hardest parts of a task, handles ambiguous problems with better judgment, and stays productive over longer sessions.

The model tends to think more deeply and revisit its reasoning before answering. This improves performance on hard problems but can increase cost and latency on easy ones. Anthropic exposes an /effort parameter with four levels (low, medium, high by default, and max) so developers can explicitly trade off reasoning depth against speed and cost per endpoint or use case.

Beyond coding, Opus 4.6 targets practical knowledge-work tasks:

  • running financial analyses
  • doing research with retrieval and browsing
  • using and creating documents, spreadsheets, and presentations

Within Cowork, Anthropic's autonomous work surface, the model can run multi-step workflows that span these artifacts without continuous human prompting.

Long-context capabilities and developer controls

Opus 4.6 is the first Opus-class model with a 1M-token context window, currently in beta. For prompts above 200k tokens in this 1M-context mode, pricing rises to $10 per 1M input tokens and $37.50 per 1M output tokens. The model supports up to 128k output tokens, which is enough for very long reports, code reviews, or structured multi-file edits in a single response.

To make long-running agents manageable, Anthropic ships several platform features around Opus 4.6:

  • Adaptive thinking: the model can decide when to use extended thinking based on task difficulty and context, instead of always running at maximum reasoning depth.
  • Effort controls: four discrete effort levels (low, medium, high, max) expose a clean control surface for latency vs. reasoning quality.
  • Context compaction (beta): the platform automatically summarizes and replaces older parts of the conversation as a configurable context threshold is approached, reducing the need for custom truncation logic.
  • US-only inference: workloads that must stay in US regions can run at 1.1× token pricing.

These controls target a common real-world pattern: agentic workflows that accumulate hundreds of thousands of tokens while interacting with tools, documents, and code over many steps.

Product integrations: Claude Code, Excel, and PowerPoint

Anthropic has upgraded its product stack so that Opus 4.6 can drive more practical workflows for engineers and analysts.

In Claude Code, a new 'agent teams' mode (research preview) lets users create multiple agents that work in parallel and coordinate autonomously. This is aimed at read-heavy tasks such as codebase reviews. Each sub-agent can be taken over interactively, including via tmux, which fits terminal-centric engineering workflows.

Claude in Excel now plans before acting, can ingest unstructured files and infer their structure, and can apply multi-step transformations in a single pass. When paired with Claude in PowerPoint, users can move from raw data in Excel to structured, on-brand slide decks. The model reads layouts, fonts, and slide masters, so generated decks stay aligned with existing templates. Claude in PowerPoint is currently in research preview for Max, Team, and Enterprise plans.

Benchmark profile: coding, search, long-context retrieval

Anthropic positions Opus 4.6 as state-of-the-art on several external benchmarks that matter for coding agents, search agents, and professional decision support.

https://www.anthropic.com/information/claude-opus-4-6

Key results include:

  • GDPval-AA (economically valuable knowledge work in finance, legal, and related domains): Opus 4.6 outperforms OpenAI's GPT-5.2 by around 144 Elo points and Claude Opus 4.5 by 190 points. This implies that, in head-to-head comparisons, Opus 4.6 beats GPT-5.2 on this evaluation about 70% of the time.
  • Terminal-Bench 2.0: Opus 4.6 achieves the highest reported score on this agentic coding and system-task benchmark.
  • Humanity's Last Exam: on this multidisciplinary reasoning test with tools (web search, code execution, and others), Opus 4.6 leads other frontier models, including GPT-5.2 and Gemini 3 Pro configurations, under the documented harness.
  • BrowseComp: Opus 4.6 performs better than any other model on this agentic search benchmark. When Claude models are combined with a multi-agent harness, scores increase to 86.8%.

Long-context retrieval is a central improvement. On the 8-needle 1M variant of MRCR v2, a 'needle-in-a-haystack' benchmark in which facts are buried within 1M tokens of text, Opus 4.6 scores 76%, compared with 18.5% for Claude Sonnet 4.5. Anthropic describes this as a qualitative shift in how much context a model can actually use without context rot.

Additional performance gains appear in:

  • root-cause analysis of complex software failures
  • multilingual coding
  • long-term coherence and planning
  • cybersecurity tasks
  • life sciences, where Opus 4.6 performs almost 2× better than Opus 4.5 on computational biology, structural biology, organic chemistry, and phylogenetics evaluations

On Vending-Bench 2, a long-horizon economic performance benchmark, Opus 4.6 earns $3,050.53 more than Opus 4.5 under the reported setup.

Key Takeaways

  • Opus 4.6 is Anthropic's highest-end model with 1M-token context (beta): It supports 1M input tokens and up to 128k output tokens, with premium pricing above 200k tokens, making it suitable for very long codebases, documents, and multi-step agentic workflows.
  • Explicit controls for reasoning depth and cost via effort and adaptive thinking: Developers can tune /effort (low, medium, high, max) and let 'adaptive thinking' decide when extended reasoning is needed, exposing a clear latency vs. accuracy vs. cost trade-off for different routes and tasks.
  • Strong benchmark performance on coding, search, and economic-value tasks: Opus 4.6 leads on GDPval-AA, Terminal-Bench 2.0, Humanity's Last Exam, BrowseComp, and MRCR v2 1M, with large gains over Claude Opus 4.5 and GPT-class baselines in long-context retrieval and tool-augmented reasoning.
  • Tight integration with Claude Code, Excel, and PowerPoint for real workloads: Agent teams in Claude Code, structured Excel transformations, and template-aware PowerPoint generation position Opus 4.6 as a backbone for practical engineering and analyst workflows, not just chat.




Spain's Ministry of Science shuts down systems after breach claims



Spain's Ministry of Science (Ministerio de Ciencia) announced a partial shutdown of its IT systems, affecting several citizen- and company-facing services.

Ministerio de Ciencia, Innovación y Universidades is the Spanish government body responsible for science policy, research, innovation, and higher education.

Among other things, it maintains administrative systems used by researchers, universities, and students that handle high-value, sensitive information.


The Ministry stated that the decision was in response to a "technical incident" but did not provide further details. However, a threat actor is claiming an attack on the institution's systems and has published data samples as proof of the breach.

"Due to a technical incident currently under analysis, the electronic office of the Ministry of Science, Innovation and Universities has been partially closed," reads an announcement on the main page of the ministry's website.

"All ongoing administrative procedures are suspended, while safeguarding the rights and legitimate interests of all persons affected by this temporary closure."

Notice on the Ministry's website
Source: BleepingComputer

To mitigate the impact of the disruption, the Ministry will extend all deadlines for affected procedures, in accordance with Article 32 of Law 39/2015.

A threat actor using the alias 'GordonFreeman' (from the Half-Life game title) has offered to the highest bidder data allegedly stolen from the Spanish ministry.

The alleged hacker leaked data samples on underground forums that include personal records, email addresses, enrollment applications, and screenshots of documents and other official paperwork.

Threat actor's post
Source: Kela

The threat actor states that they breached Spain's Ministry of Science by exploiting a critical Insecure Direct Object Reference (IDOR) vulnerability that gave them valid credentials for "full admin-level access."

It is worth noting that the forum where the information appeared is now offline, and the data has not surfaced on alternative platforms yet.

The leaked images appear legitimate, although BleepingComputer has no way to confirm their authenticity or any of the attacker's other claims. We have contacted Ministerio de Ciencia about these allegations, but a statement wasn't immediately available.

Meanwhile, Spanish media outlets report that a ministry spokesperson confirmed that the IT systems disruption is related to a cyberattack.


The 'mono' virus raises the risk of MS and cancer in some. 22 genes hint at why.


Around 90% of people are infected with Epstein-Barr virus at some point in their lifetimes. For most of them, the virus causes a mild, transient illness or no symptoms at all. But for a subset of people, Epstein-Barr can eventually contribute to chronic illnesses, such as lupus and multiple sclerosis, or to the development of cancer.

Now, new research uncovers 22 human genes that can make an Epstein-Barr infection more likely to turn into a chronic condition.

Deno Sandbox launched for running AI-generated code


Deno Land, maker of the Deno runtime, has launched Deno Sandbox, a secure environment built for code generated by AI agents. The company also announced the long-awaited general availability of Deno Deploy, a serverless platform for running JavaScript and TypeScript applications. Both were announced on February 3.

Now in beta, Deno Sandbox provides lightweight Linux microVMs running as protected environments in the Deno Deploy cloud. Deno Sandbox defends against prompt injection attacks, the company said, where a user or AI attempts to run malicious code. Secrets such as API keys never enter the sandbox and only appear when an outbound HTTP request is sent to a pre-approved host, according to the company.

Deno Sandbox was created in response to the rise of AI-driven development, explained Deno co-creator Ryan Dahl, as more LLM-generated code is being released with the ability to call external APIs using real credentials, without human review. In this scenario, he wrote, "Sandboxing the compute isn't enough. You must control network egress and protect secrets from exfiltration." Deno Sandbox provides both, according to Dahl. It specializes in workloads where code must be generated, evaluated, or safely executed on behalf of an untrusted user.

Apple's M5 Ultra secret may have been spilled


China joins race to develop space-based data centers with 5-year plan


It looks like China is getting in on the race to launch data centers into space.

The state-run China Global Television Network (CGTN) reported on Thursday (Jan. 29) that the country's main space company, the state-owned China Aerospace Science and Technology Corporation (CASC), will work on space-based data centers as part of a larger five-year plan to expand the nation's already significant presence in space.

I Asked Claude to Replicate a PNAS Paper Using OpenAI's Batch API. Here's What Happened (Part 1)



I've been experimenting with Claude Code for months now, using it for everything from writing lecture slides to debugging R scripts to managing my chaotic academic life. But most of those tasks, if I'm being honest, are things I could do myself. They're just faster with Claude.

Last week I decided to try something different. I wanted to see if Claude could help me do something I genuinely didn't know how to do: pull in a replication package from a published paper, run NLP classification on 300,000 text records using OpenAI's Batch API, and compare the results to the original findings.

This is Part 1 of that story. Part 2 will have the results, but I'm writing this before the batch job finishes, so we're all in suspense together. Here's the video walkthrough. I'll post the second part once this is done and we'll check it together.

Thanks again for all your support! These Claude posts, and the Substack more generally, are labors of love. Please consider becoming a paying subscriber! It's $5/month, $50/year, or founding-member prices of $250! For which I can give you full and complete awareness on your deathbed in return.

The paper I chose was Card et al. (PNAS 2022): "Computational analysis of 140 years of US political speeches reveals more positive but increasingly polarized framing of immigration." Let me dive in and tell you about it. If you haven't read it, the headline findings are striking:

  1. Overall sentiment toward immigration is MORE positive today than a century ago. The shift occurred between WWII and the 1965 Immigration Act.

  2. But the parties have polarized dramatically. Democrats now use unprecedentedly positive language about immigrants. Republicans use language as negative as the average legislator during the 1920s quota era.

The authors classified ~200,000 congressional speeches and ~5,000 presidential communications using a RoBERTa model fine-tuned on human annotations. Each speech segment was labeled PRO-IMMIGRATION, ANTI-IMMIGRATION, or NEUTRAL. But my question was: can we update this paper using a modern large language model to do the classification? And can we do it live, without me doing anything other than dictating the task to Claude Code?

If the answer is yes, then researchers can use off-the-shelf LLMs for text classification at scale, cheaper and faster than training custom models, and for many of us that's a great lesson to learn. But I think this exercise also doubles as a demonstration that even if you feel intimidated by such a task, you shouldn't be, because I basically do this whole thing by typing my instructions in and letting Claude Code do the entire job, including finding the replication package, unzipping it, and extracting the speeches!

If no, we learn something about where human-annotated training data still matters. And maybe we learn that this use of Claude Code via "dictation" is not all it's cracked up to be.

Let me be clear about what makes this difficult:

  1. Scale. We're talking about 304,995 speech segments. You can't just paste them into ChatGPT one at a time.

  2. Infrastructure. OpenAI's Batch API is the right tool for this: it's 50% cheaper than real-time API calls and can handle huge jobs. But setting it up requires understanding file formats, authentication, job submission, result parsing, and error handling.

  3. Methodology. Even once you get the API working, you have to think carefully about prompt design, label normalization, and how to compare your results to the original paper's.

  4. Coordination. The replication data lives on GitHub. The API key lives somewhere secure. The code needs to be modular and well documented. The results need to be interpretable.

I wanted to see if Claude Code could handle the whole pipeline, from downloading the data to submitting the batch job, while I watched and asked questions.

I started by telling Claude what I wanted to do using something called "plan mode". Plan mode is a button you pull down in the desktop app. I have a long back-and-forth with Claude Code about what I want done, he works it out, I review it, and then we're ready; I agree and he does it. If nothing else, by watching the video and skipping to the plan-mode part, you can see what I did.

So what I did was save the paper myself locally (as I had a feeling he might not be able to get past the PNAS download button, but who knows), then explain exactly what I wanted done. But I explained my request backwards. That is, I told him what I wanted at the very end, which was a classification of the speeches the authors had, but using OpenAI batch-requested classification with the gpt-4o-mini LLM. Then I worked backwards from there and said I wanted an explainer deck, I wanted an audit using referee2 before he ran it, I wanted a split PDF using my pdf-splitter skill from my repo, and so on. It's easier to understand if you watch it.

Once we agreed, and after some tweaking in plan mode, Claude immediately did something I appreciated: it created a self-contained project folder rather than scattering files across my existing course directory.

exercises/llm_replication/
├── article/
│   └── splits/           # PDF chunks + notes
├── code/
│   ├── 01_prepare_data.py
│   ├── 02_create_batch.py
│   ├── 03_submit_batch.py
│   ├── 04_download_results.py
│   └── 05_compare_results.py
├── data/
│   ├── raw/              # Downloaded replication data
│   ├── processed/        # Cleaned CSV
│   ├── batch_input/      # JSONL files for API
│   └── batch_output/     # Results
├── deck/
│   └── deck.md           # Presentation slides
├── plan.md
└── README.md

This structure made sense to me. Each script does one thing. The data flows from raw → processed → batch_input → batch_output → results. If something breaks, you know where to look. So this is more or less replicable, and I can use it to show my students next week when we review the paper and replicate it, more or less, with an LLM rather than the methodology the authors used.

The replication package from Card et al. is 1.39 GB. How do I know that? Because Claude Code searched for and found it. He pulled the zipped file into my local directory and checked its size using whatever terminal tool it is that lets you check file sizes. Here's where he put it.

He then unzipped the file and placed it in that ./data directory. It includes the speech texts, the RoBERTa model predictions, and the original human annotations. So this is now the PNAS paper, from the ground up.

When Claude downloaded the data and explored its structure, here's what we found:

  • Congressional speeches: 290,800 segments in a .jsonlist file

  • Presidential communications: 14,195 segments in a separate file

  • Each record includes: text, date, speaker, party, chamber, and the original model's probability scores for each label

Interestingly, that's a little different from what the PNAS paper says, which reported 200,000 congressional speeches and 5,000 presidential communications. This came out to 305,000. So I look forward to digging more into that.

I also have the original paper's classifier output probabilities for all three classes. If a speech has probabilities (anti=0.6, neutral=0.3, pro=0.1), we take the argmax: ANTI-IMMIGRATION. This comes from their own analysis.
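The argmax rule is a one-liner; here's a quick sketch (the dict layout is my own illustration, not the replication package's schema):

```python
# Probability scores for one speech, keyed by label.
probs = {"ANTI-IMMIGRATION": 0.6, "NEUTRAL": 0.3, "PRO-IMMIGRATION": 0.1}

# Argmax: the label with the highest probability wins.
label = max(probs, key=probs.get)
print(label)  # ANTI-IMMIGRATION
```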

Claude then wrote 01_prepare_data.py to load both files, extract the relevant fields, compute the argmax labels, and save everything to a clean CSV. Running it produced:

Total records: 304,995

--- Original Label Distribution ---
  ANTI-IMMIGRATION: 48,234 (15.8%)
  NEUTRAL: 171,847 (56.3%)
  PRO-IMMIGRATION: 84,914 (27.8%)

Most speeches are neutral, which makes sense: a lot of congressional speech is procedural.
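A minimal sketch of what a script like 01_prepare_data.py has to do (the field names and score ordering below are assumptions; the real replication files may use a different schema):

```python
import csv
import json

LABELS = ["ANTI-IMMIGRATION", "NEUTRAL", "PRO-IMMIGRATION"]

def prepare(jsonl_path: str, csv_path: str) -> None:
    """Read one JSON record per line, argmax the scores, write a clean CSV."""
    with open(jsonl_path) as src, open(csv_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(["text", "date", "party", "label"])
        for line in src:
            rec = json.loads(line)
            scores = rec["scores"]  # assumed order: [anti, neutral, pro]
            label = LABELS[max(range(len(LABELS)), key=lambda i: scores[i])]
            writer.writerow([rec["text"], rec["date"], rec["party"], label])
```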

Or are they neutral? That's what we're going to find out. When we have the LLM do the classification, we'll see if maybe there is still money left on the table. Then we'll create a transition matrix to compare what the LLM classified as ANP against what the original authors classified as ANP. We'll see if some things get shifted around.
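The transition matrix is cheap to build once both label columns exist; a sketch with made-up labels:

```python
from collections import Counter

# Hypothetical labels for five speeches from each classifier.
roberta = ["NEUTRAL", "NEUTRAL", "PRO", "ANTI", "PRO"]
llm = ["NEUTRAL", "PRO", "PRO", "ANTI", "NEUTRAL"]

# Each cell of the transition matrix counts one (original, LLM) label pair.
matrix = Counter(zip(roberta, llm))
print(matrix[("NEUTRAL", "PRO")])  # 1: one speech shifted from NEUTRAL to PRO
```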

This is where it gets interesting. How do you tell gpt-4o-mini to classify political speeches the same way a fine-tuned RoBERTa model did?

Claude's first draft was detailed, maybe too detailed:

You are a research assistant analyzing political speeches about immigration...

Classification categories:

1. PRO-IMMIGRATION
   - Valuing immigrants and their contributions
   - Favoring less restrictive immigration policies
   - Emphasizing humanitarian concerns, family unity, cultural contributions
   - Using positive frames like "hardworking," "contributions," "families"

2. ANTI-IMMIGRATION
   - Opposing immigration or favoring more restrictions
   - Emphasizing threats, crime, illegality, or economic competition
   - Using negative frames like "illegal," "criminals," "flood," "invasion"
   ...

I had a concern: by listing specific keywords, were we biasing the model toward pattern-matching rather than semantic understanding?

This is exactly the kind of methodological question that matters in research. If you tell the model "speeches with the word 'flood' are anti-immigration," you're not really testing whether it understands tone; you're testing whether it can grep.

We decided to keep the detailed prompt for now but flagged it as something to revisit. A simpler prompt might actually perform better for a replication study, where you want the LLM's unbiased judgment. What I think I'll do is a Part 3 where we resubmit with a new prompt that doesn't lead the LLM as much as I did here. But I think it's still useful, even with the original prompting, to see whether this more advanced LLM, which is much better at discerning context than earlier models (even RoBERTa), comes to the same or different conclusions.

So now we get to the OpenAI part. I fully understand that this part is a mystery to many people. What exactly am I going to be doing in this fourth step? That's where I think relying on Claude Code for help answering your questions, as well as for learning how to do it, and then using referee2 to audit the code, is going to be helpful. But here's the gist.

To get the speeches classified, we have to upload them to OpenAI. But OpenAI's Batch API expects JSONL files, where each line is a complete API request. So, without me even explaining how to do it, Claude wrote 02_create_batch.py to generate these.
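For orientation, one line of such a JSONL file looks roughly like this (the custom_id scheme and prompt text are my own placeholders, not the actual script's):

```python
import json

# One Batch API request per line; custom_id ties the result back to the speech.
request = {
    "custom_id": "speech-000001",
    "method": "POST",
    "url": "/v1/chat/completions",
    "body": {
        "model": "gpt-4o-mini",
        "messages": [
            {"role": "system",
             "content": "Classify the speech as PRO-IMMIGRATION, ANTI-IMMIGRATION, or NEUTRAL."},
            {"role": "user", "content": "Mr. Speaker, these hardworking families..."},
        ],
    },
}
line = json.dumps(request)  # append one such line per speech to the .jsonl file
```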

A few technical details that matter:

  • Chunking: We split the 304,995 records into 39 batch files of 8,000 records each. This keeps file sizes manageable.

  • Truncation: Some speeches are very long. We truncate at 3,000 characters to fit within context limits. Claude added logging to track how many records get truncated.

  • Cost estimation: Before creating anything, the script estimates the total cost:

--- Estimated Cost (gpt-4o-mini with Batch API) ---
  Input tokens:  140,373,889 (~$10.53)
  Output tokens: 1,524,975 (~$0.46)
  TOTAL ESTIMATED COST: $10.99
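The estimate is easy to check by hand (the rates below, $0.075 per 1M input tokens and $0.30 per 1M output tokens at the 50% batch discount, are my assumption about what the script used):

```python
input_tokens = 140_373_889
output_tokens = 1_524_975

# Assumed gpt-4o-mini batch rates: half the real-time per-token price.
input_cost = input_tokens / 1_000_000 * 0.075
output_cost = output_tokens / 1_000_000 * 0.30

print(f"${input_cost:.2f} + ${output_cost:.2f} = ${input_cost + output_cost:.2f}")
# $10.53 + $0.46 = $10.99
```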

Less than eleven dollars to classify 300,000 speeches! That's remarkable. A few years ago, this would have required training your own model or paying for expensive human annotation. But now, for $11 and what will take a mere 24 hours, I, a one-man show doing all of this within an hour on video, got this submitted. Un. Real.

03_submit_batch.py is where money gets spent, so Claude built in several safety features:

  1. A --dry-run flag that shows what would be submitted without actually submitting

  2. An explicit confirmation prompt that requires typing "yes" before proceeding

  3. Retry logic with exponential backoff for handling API errors

  4. Tracking files that save batch IDs so you can check status later

I appreciated the defensive programming. When you're about to spend money on an API call, you want to be sure you're doing what you intend.
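Point 3, retry with exponential backoff, is worth sketching because it's the part most home-rolled scripts get wrong (the helper below is illustrative, not the actual 03_submit_batch.py):

```python
import time

def with_backoff(submit_fn, max_tries=5, base_delay=1.0):
    """Call submit_fn; on failure wait base_delay, 2x, 4x, ... before retrying."""
    for attempt in range(max_tries):
        try:
            return submit_fn()
        except Exception:
            if attempt == max_tries - 1:
                raise  # out of retries: surface the error to the caller
            time.sleep(base_delay * 2 ** attempt)

# Demo: a submission that fails twice with a transient error, then succeeds.
calls = {"n": 0}
def flaky_submit():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient API error")
    return "batch_abc123"

print(with_backoff(flaky_submit, base_delay=0.01))  # batch_abc123
```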

Here's where things got meta.

I have a system I mentioned the other day called personas. The one persona I have so far is an aggressive "auditor" called "Referee 2". I use him by opening a separate Claude instance, so that I don't have Claude Code reviewing its own code. This second Claude Code instance is Referee 2. It didn't write the code we're using to submit the batch requests. Its sole job is to review the other Claude Code's code with the critical eye of an academic reviewer and then write a referee report. The idea is to catch problems before you run expensive jobs or publish embarrassing mistakes.

So I asked Referee 2 to audit the whole project: the code, the methodology, and the presentation deck. You can see me doing this in the video. The report came back with a recommendation of "Minor Revision Before Submission", academic-speak for "this is good, but fix a few things first." I got an R&R!

  1. Label normalization edge cases. The original code checked whether "PRO" appeared in the response, but what if the model returns "NOT PRO-IMMIGRATION"? The string "PRO" is in there, but that's clearly not a pro-immigration classification. Referee 2 suggested using startswith() instead of in, with exact matching as the first check.

  2. Missing metrics. Raw agreement rate doesn't account for chance agreement. If both classifiers label 56% of speeches as NEUTRAL, they'll agree on a lot of neutral speeches just by chance. Referee 2 recommended adding Cohen's kappa.

  3. Temporal stratification. Speeches from 1880 use different language than speeches from 2020. Does gpt-4o-mini understand nineteenth-century political rhetoric as well as modern speech? Referee 2 suggested analyzing agreement rates separately for pre-1950 and post-1950 speeches.

  4. The prompt design question. Referee 2 echoed my concern about the detailed prompt potentially biasing results toward keyword matching.

On the plus side, Referee 2 noted:

  • Clear code structure, with one script per task

  • Defensive programming in the submission script

  • Good logging throughout

  • The deck following "Rhetoric of Decks" principles (more on this below)
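The normalization fix from point 1 can be sketched like this (the function name and fallback label are my own, not the project's actual code):

```python
def normalize_label(response: str) -> str:
    """Map a raw model response to one of the three labels, exact match first."""
    text = response.strip().upper()
    labels = ["PRO-IMMIGRATION", "ANTI-IMMIGRATION", "NEUTRAL"]
    if text in labels:
        return text
    # startswith() avoids the "NOT PRO-IMMIGRATION" trap that a bare `in` check hits.
    for label in labels:
        if text.startswith(label):
            return label
    return "UNPARSEABLE"

print(normalize_label("pro-immigration"))      # PRO-IMMIGRATION
print(normalize_label("NOT PRO-IMMIGRATION"))  # UNPARSEABLE
```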

I implemented the required fixes. I had to pause the recording at certain points, but I think it took about thirty minutes in all. The code is now more robust than it would have been without the review.

One thing I've learned from teaching: if you can't explain what you did in slides, you probably don't fully understand it yourself.

I asked Claude to create a presentation deck explaining the project. But I gave it constraints: follow the "Rhetoric of Decks" philosophy I've been developing, which emphasizes:

  • One idea per slide

  • Beauty is function (no decoration without purpose)

  • The slide serves the spoken word (slides are anchors, not documents)

  • Narrative arc (Problem → Investigation → Resolution)

I’m going to save the deck, though, for tomorrow when the results are finished, so that we can all look at it together! Cliffhanger!

As of the moment I’m typing this, the batch has been sent. Here’s where we stand right now: some of the jobs are nearly done, and some have yet to start.

But here are some of the things I’m wondering as I wait.

  1. Will the LLM agree with the fine-tuned model? The original paper reports ~65% accuracy for tone classification, with most errors between neutral and the extremes. If gpt-4o-mini achieves comparable agreement, that’s a validation of zero-shot LLM classification. If it’s much lower, we learn that fine-tuning still matters.

  2. Will agreement differ by time period? Will the LLM do better on modern speeches (post-1965) than on 19th-century rhetoric? The training data for GPT models skews recent. Or does it?

  3. Will agreement differ by party? If the LLM systematically disagrees with RoBERTa on Republican speeches but not Democratic ones (or vice versa), that tells us something about how these models encode political language. I can examine all of this using a transition matrix table, which I’ll show you, to see how the classifications differ.
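A transition matrix here is just a cross-tabulation of the two classifiers’ labels. A minimal sketch with made-up toy data (the real analysis would subset by party or period before tabulating):

```python
from collections import Counter

def transition_matrix(roberta_labels, llm_labels):
    """Count label pairs: counts[(roberta_label, llm_label)]."""
    return Counter(zip(roberta_labels, llm_labels))

# Toy data, not the actual results.
roberta = ["PRO", "PRO", "NEUTRAL", "ANTI", "NEUTRAL"]
llm     = ["PRO", "NEUTRAL", "NEUTRAL", "ANTI", "PRO"]
counts = transition_matrix(roberta, llm)
# counts[("PRO", "NEUTRAL")] counts speeches that RoBERTa labeled PRO
# but the LLM labeled NEUTRAL.
```

Off-diagonal cells are exactly the disagreements worth reading by hand.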

  4. What will the disagreements look like? I’m genuinely curious to read examples where the two classifiers diverge. That’s often where you learn the most.

This started as a test of Claude Code’s capabilities. Can it handle a real research task with multiple moving parts? Can it handle a “hard task”?

The answer so far is yes—with caveats. Claude needed guidance on methodology. It benefited enormously from the Referee 2 review. And I had to stay engaged throughout, asking questions and pushing back on decisions. Notice this was not “here’s a job, now go do it.” I’m quite engaged the whole time, but that’s also how I work. I think I’ll always be in the “dialogue a lot with Claude Code” camp.

But the workflow worked. We went from “I want to replicate this paper” to “batch job submitted” in about an hour. The code is clean and was double-checked (audited) by Referee 2. The documentation is thorough. The methodology is defensible. We’re updating a paper. I’m one guy in my pajamas filming this whole thing so you can see for yourself how to use Claude Code to do a difficult task.

To me, the real mystery of Claude Code is why the copy-paste method of coding seems to make me less attentive, while Claude Code for some reason keeps me more engaged, more attentive. I still don’t quite understand psychologically why that would be the case, but I’ve seen over and over that on projects using Claude Code, I don’t have the slippery grasp on what I’ve done and how I’ve done it that I often had with the copy-paste method of using ChatGPT to code. That kind of copy-paste is mostly mindless button pushing. How I use Claude Code is not like that, and therein lies the real value. Claude didn’t just do the work—it did the work in a way that taught me what was happening. I think that, at least for now, is labor-productivity enhancing. I’m doing new tasks I couldn’t do before, I’m getting to answers I can study faster, I’m thinking more, I’m staying engaged, and, interestingly, I bet I’m spending the same amount of time on research but less time on the stuff that isn’t actually “real research.”

The batch job will take up to 24 hours to complete. Once it’s done, I’ll download the results and run the comparison analysis.

Part 2 will cover:

  • Overall agreement rate and Cohen’s Kappa

  • The transition matrix (which labels does the LLM get “wrong”?)

  • Agreement by time period, party, and source

  • Examples of interesting disagreements

  • What this means for researchers considering LLM-based text classification

Until then, I’m staring at a tracking file with 39 batch IDs and waiting.

Stay tuned.

Technical details for the curious:

  • Model: gpt-4o-mini

  • Data: 304,995

  • Estimated cost: $10.99 (with 50% batch discount)

  • Classification labels: PRO-IMMIGRATION, ANTI-IMMIGRATION, NEUTRAL

  • Comparison metric: Agreement rate + Cohen’s Kappa

  • Time stratification: Pre-1950 vs. Post-1950 (using Congress number as proxy)

Repository (original paper’s replication data):
github.com/dallascard/us-immigration-speeches

Paper citation:
Card, D., Chang, S., Becker, C., Mendelsohn, J., Voigt, R., Boustan, L., Abramitzky, R., & Jurafsky, D. (2022). Computational analysis of 140 years of US political speeches reveals more positive but increasingly polarized framing of immigration.
PNAS, 119(31), e2120510119.

CSS Bar Charts Using Modern Functions



New CSS features can often make it easier and more efficient to code designs we already knew how to create. This efficiency might stem from reduced code or fewer hacks, or from improved readability thanks to the new features.

In that spirit, let’s revamp what’s under the hood of a bar chart.

We begin by laying out a grid.

.chart {
  display: grid;
  grid-template-rows: repeat(100, 1fr);
  /* etc. */
}

The chart metric is based on percentage, as in “some number out of 100.” So let’s say we’re working with a grid containing 100 rows. That ought to stress test it, right?

Next, we add the bars to the grid with the grid-column and grid-row properties:

.chart-bar {
  grid-column: sibling-index();
  grid-row: span attr(data-value number);
  /* etc. */
}

Right off the bat, I want to note a couple of things. First is that sibling-index() function. It’s brand new and has incomplete browser support as of this writing (come on, Firefox!), though it’s currently supported in the latest Chrome and Safari (but not on iOS, apparently). Second is that attr() function. We’ve had it for a while, but it was recently upgraded and now accepts data-attributes. So when we have one of those in our markup — like data-value="32" — that’s something the function can read.
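The post doesn’t show the markup itself, so here is a hypothetical sketch of what the CSS above could be reading from. The class names match the CSS; the data-value numbers are made up for illustration:

```html
<!-- Hypothetical markup: each bar carries its percentage in data-value -->
<ul class="chart">
  <li class="chart-bar" data-value="32">32%</li>
  <li class="chart-bar" data-value="46">46%</li>
  <li class="chart-bar" data-value="27">27%</li>
</ul>
```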

With these in place, that’s really all we need to create a pretty darn good bar chart in vanilla CSS! The following demo has fallbacks in place so that you can still see the final result in case your browser hasn’t adopted these new features:

Yes, that was easy to do, but it’s best to know exactly why it works. So, let’s break that down.

Automatically Establishing Grid Columns

Declaring the sibling-index() function on the grid-column property explicitly places the list items in consecutive columns. I say “explicitly” because we’re telling the grid exactly where to place each item based on its position in the markup. The first item goes in the first column, the second in the second column, and so on.

That’s the power of sibling-index() — the grid intelligently generates the order for us without us having to do it manually via CSS variables.

/* First bar: sibling-index() = 1 */
grid-column: sibling-index();

/* ...results in: */
grid-column: 1;
grid-column-start: 1; grid-column-end: auto;

/* Second bar: sibling-index() = 2 */
grid-column: sibling-index();

/* ...results in: */
grid-column: 2;
grid-column-start: 2; grid-column-end: auto;

/* etc. */

Automatically Establishing Grid Rows

It’s pretty much the same thing! But in this case, each bar occupies a certain number of rows based on the percentage it represents. The grid gets these values from the data-value attribute in the markup, effectively telling the grid how tall each bar in the chart should be.

/* First bar: data-value="32" */
grid-row: span attr(data-value number);

/* ...results in: */
grid-row: span 32;

/* Second bar: data-value="46" */
grid-row: span attr(data-value number);

/* ...results in: */
grid-row: span 46;

The attr() function, when provided with a data type parameter (the number keyword in our case), casts the value it retrieves into that specific type. In our example, attr() returns the value of data-value as a number, which is then used to determine the number of rows each bar should span.

Let’s Make Different Charts!

Since we have the nuts and bolts of this approach down, I figured I’d push things a bit and demonstrate how we can apply the same techniques to all kinds of CSS-only charts.

For example, we can use grid-row values to control the vertical direction of the bars:

Or we can skip bars altogether and use markers instead:

We can also swap the columns and rows for horizontal bar charts:

Wrapping up

Pretty exciting, right? Just look at all the tricks we used to pull this off before the days of sibling-index() and an upgraded attr():