Poolside AI has launched the first two models in its Laguna family: Laguna M.1 and Laguna XS.2. Alongside them, the company is releasing pool, a lightweight terminal-based coding agent and dual Agent Client Protocol (ACP) client-server, the same environment Poolside uses internally for agent RL training and evaluation, now available as a research preview.
What Are These Models, and Why Should You Care?
Both Laguna M.1 and Laguna XS.2 are Mixture-of-Experts (MoE) models. Instead of activating all parameters for every token, MoE models route each token through only a subset of specialized sub-networks called 'experts.' This gives a large total parameter count, and the potential headroom that comes with it, while only paying the compute cost of a much smaller "activated" parameter count at inference time.
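To make the routing idea concrete, here is a minimal sketch of a top-k MoE layer in NumPy. The dimensions, the softmax-over-top-k router, and the expert definitions are all illustrative assumptions for exposition, not Laguna's actual design (Laguna uses sigmoid gating, covered below); the point is only that each token's output is computed by k experts, not all of them.

```python
import numpy as np

def moe_layer(x, gate_w, experts, k=2):
    """Route each token through its top-k experts (illustrative sketch)."""
    logits = x @ gate_w                         # (tokens, n_experts) router scores
    topk = np.argsort(logits, axis=-1)[:, -k:]  # indices of the k highest-scoring experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        w = np.exp(logits[t, topk[t]])          # softmax over only the selected experts
        w /= w.sum()
        for weight, e in zip(w, topk[t]):
            out[t] += weight * experts[e](x[t])  # only k expert FFNs run per token
    return out

rng = np.random.default_rng(0)
d, n_experts = 8, 4                             # toy sizes, not Laguna's
experts = [lambda v, W=rng.normal(size=(d, d)) / d: v @ W for _ in range(n_experts)]
y = moe_layer(rng.normal(size=(3, d)), rng.normal(size=(d, n_experts)), experts)
print(y.shape)  # (3, 8): full hidden size out, but only 2 of 4 experts ran per token
```

The compute saving is exactly this ratio: a token pays for k expert forward passes, while the model's capacity scales with all n_experts.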
Laguna M.1 is a 225B total-parameter MoE model with 23B activated parameters, trained from scratch on 30T tokens using 6,144 interconnected NVIDIA Hopper GPUs. It completed pre-training at the end of last year and serves as the foundation for the entire Laguna family. On benchmarks, it reaches 72.5% on SWE-bench Verified, 67.3% on SWE-bench Multilingual, 46.9% on SWE-bench Pro, and 40.7% on Terminal-Bench 2.0.
Laguna XS.2 is the second-generation MoE and Poolside's first open-weight model, built on everything learned since training M.1. At 33B total parameters with 3B activated per token, it is designed for agentic coding and long-horizon work on a local machine, compact enough to run on a Mac with 36 GB of RAM via Ollama. It scores 68.2% on SWE-bench Verified, 62.4% on SWE-bench Multilingual, 44.5% on SWE-bench Pro, and 30.1% on Terminal-Bench 2.0. Poolside will also release Laguna XS.2-base soon for practitioners who want to fine-tune.
Architecture: The Efficiency Decisions in XS.2
XS.2 uses sigmoid gating with per-layer rotary scales, enabling a mixed Sliding Window Attention (SWA) and global attention layout in a 3:1 ratio across 40 total layers: 30 SWA layers and 10 global attention layers. Sliding Window Attention limits each token's attention to a local window of 512 tokens rather than the full sequence, dramatically cutting KV cache memory. The global attention layers at a 1-in-4 ratio preserve long-range dependencies without paying the full cost everywhere. The model also quantizes the KV cache to FP8, further reducing memory per token.
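The KV-cache saving from this layout is simple arithmetic, sketched below using only the numbers stated above (40 layers, 30 SWA with a 512-token window, 10 global, 131,072-token context). Per-entry sizes (head count, head dim, FP8 bytes) are not public, so we count cached positions and compare against a hypothetical all-global baseline.

```python
def kv_cache_entries(seq_len, n_swa=30, n_global=10, window=512):
    """Count cached (key, value) positions across all layers.
    Per-entry byte sizes are omitted since they aren't public."""
    swa = n_swa * min(seq_len, window)  # each SWA layer keeps at most `window` positions
    glb = n_global * seq_len            # each global layer keeps the full sequence
    return swa + glb

full = kv_cache_entries(131_072, n_swa=0, n_global=40)  # hypothetical all-global baseline
mixed = kv_cache_entries(131_072)                       # XS.2's 30 SWA + 10 global layout
print(f"mixed/full cache ratio: {mixed / full:.3f}")    # prints 0.253: roughly a 4x cut
```

At full context the mix caches about a quarter of the positions an all-global model would, before the additional ~2x saving from FP8 quantization of the cache itself.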
Under the hood, XS.2 uses 256 experts with 1 shared expert, supports a context window of 131,072 tokens, and features native reasoning support: interleaved thinking between tool calls, with per-request control over enabling or disabling thinking.
Training: Three Areas Where Poolside Pushed Hard
The Poolside team trains all its models from scratch using its own data pipeline, its own training codebase (Titan), and its own agent RL infrastructure. Three areas saw particular investment for Laguna.
AutoMixer: Optimizing the Data Mix Automatically. Data curation, and the mix that goes into training, is extremely impactful on final model performance. Rather than relying on manual heuristics, Poolside developed an automixing framework that trains a swarm of roughly 60 proxy models, each on a different data mix, and measures performance across key capability groups: code, math, STEM, and common sense. Surrogate regressors are then fit to approximate how changes in dataset proportions affect downstream evaluations, giving a learned mapping from data mix to performance that can be directly optimized. The approach is inspired by prior work including Olmix, MDE, and RegMix, adapted to Poolside's setting with richer data groupings.
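The surrogate-regressor step can be sketched in a few lines. Everything here is a stand-in: 5 data sources, a linear regressor, and a synthetic "true" score function replace Poolside's real evals and (presumably nonlinear) surrogates; only the shape of the pipeline, train ~60 proxies on sampled mixes, fit mix-to-score, then optimize the mix, follows the article.

```python
import numpy as np

rng = np.random.default_rng(0)
n_proxies, n_sources = 60, 5  # ~60 proxy runs per the article; 5 sources is illustrative

# Each proxy model is "trained" on a random data mix (proportions sum to 1)...
mixes = rng.dirichlet(np.ones(n_sources), size=n_proxies)
# ...and scored on a capability group; a synthetic linear response stands in for real evals.
true_w = np.array([0.9, 0.4, 0.7, 0.2, 0.5])
scores = mixes @ true_w + rng.normal(scale=0.01, size=n_proxies)

# Fit the surrogate regressor: a learned mapping from mix proportions to eval score.
coef, *_ = np.linalg.lstsq(mixes, scores, rcond=None)

# Optimize the mix under the surrogate. For a linear surrogate on the simplex the
# optimum degenerates to the single best source; real surrogates are nonlinear,
# so the optimized mix is a genuine blend.
best = np.zeros(n_sources)
best[np.argmax(coef)] = 1.0
print(best)  # [1. 0. 0. 0. 0.]
```

The appeal of the scheme is that the expensive part (60 proxy trainings) is paid once, after which candidate mixes can be evaluated against the cheap surrogate instead of against a full training run.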
On the data side, both Laguna models were trained on more than 30T tokens. Poolside's diversity-preserving data curation approach, which keeps portions of mid- and lower-quality buckets alongside top-quality data to avoid STEM bias, yields roughly 2x more unique tokens compared to precision-focused pipelines, with the gain persisting at longer training horizons. A separate deduplication analysis also showed that global deduplication disproportionately removes high-quality data, informing how the team tuned its pipeline. Synthetic data contributes about 13% of the final training mix in Laguna XS.2, with the Laguna series using roughly 4.4T+ synthetic tokens in total.
Muon Optimizer. Rather than AdamW, the most common optimizer in large-model training, Poolside used a distributed implementation of the Muon optimizer through all training stages of both models. In initial pre-training ablations, the research team achieved the same training loss as an AdamW baseline in roughly 15% fewer steps, with large absolute evaluation uplifts on the final model, and achieved learning-rate transfer across model scales. An additional benefit: Muon requires just one state per parameter rather than two, reducing memory requirements for both training and checkpointing. During pre-training of Laguna M.1, the overhead from the optimizer was less than 1% of the training step time.
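For readers unfamiliar with Muon, the core idea is to keep a single momentum buffer per weight matrix and orthogonalize it with a few Newton-Schulz iterations before applying the update. The sketch below follows the publicly documented reference formulation (the quintic coefficients come from Keller Jordan's open-source Muon implementation); it says nothing about Poolside's distributed variant, and the learning rate and momentum values are illustrative.

```python
import numpy as np

def newton_schulz(G, steps=5):
    """Approximately orthogonalize G, pushing its singular values toward 1.
    Quintic coefficients follow the public Muon reference implementation."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)       # normalize so the iteration converges
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X  # matmul-only, GPU-friendly iteration
    return X

def muon_step(W, grad, momentum, lr=0.02, beta=0.95):
    """One Muon update: a single momentum buffer per parameter (vs two in AdamW),
    with the orthogonalized momentum used as the update direction."""
    momentum = beta * momentum + grad
    W -= lr * newton_schulz(momentum)
    return W, momentum

rng = np.random.default_rng(0)
W, m = rng.normal(size=(16, 16)), np.zeros((16, 16))
W, m = muon_step(W, rng.normal(size=(16, 16)), m)
s = np.linalg.svd(newton_schulz(m))[1]      # singular values of the update direction
print(s.min() > 0.3, s.max() < 1.5)         # prints: True True (all pushed toward 1)
```

The single momentum buffer is where the checkpointing saving comes from, and the iteration being pure matmuls is why the per-step overhead can stay below 1%.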
Poolside also runs periodic hash checks on model weights across training replicas to catch silent data corruption (SDC) from faulty GPUs, especially errors in arithmetic logic units and pipeline registers, which, unlike DRAM and SRAM, are not covered by ECC protection.
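The check itself is conceptually simple: in data-parallel training, every replica should hold bit-identical weights, so replicas can periodically exchange a digest and flag any disagreement. A minimal sketch of that idea (the hash choice and tensor layout are assumptions, not Poolside's actual mechanism):

```python
import hashlib
import numpy as np

def weight_digest(tensors):
    """Hash all weight tensors into one digest; replicas compare digests and
    any mismatch flags a silently corrupted GPU (illustrative sketch)."""
    h = hashlib.sha256()
    for name in sorted(tensors):  # deterministic tensor order across replicas
        h.update(name.encode())
        h.update(np.ascontiguousarray(tensors[name]).tobytes())
    return h.hexdigest()

replica_a = {"layer0.w": np.ones((4, 4), dtype=np.float32)}
replica_b = {"layer0.w": np.ones((4, 4), dtype=np.float32)}
assert weight_digest(replica_a) == weight_digest(replica_b)  # healthy replicas agree

replica_b["layer0.w"][2, 3] += 1e-6  # one silently flipped value on one replica...
print(weight_digest(replica_a) != weight_digest(replica_b))  # ...changes the digest: True
```

Because a cryptographic hash changes on any single-bit difference, even corruption far below the noise floor of loss curves is caught as soon as the next check runs.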
Async On-Policy Agent RL. This is arguably the most complex piece of the Laguna training stack. Poolside built a fully asynchronous online RL system where actor processes pull tasks from a dataset, spin up sandboxed containers, and run the production agent binary against each task using the freshly deployed model. The resulting trajectories are scored, filtered, and written to Iceberg tables, while the trainer continuously consumes these records and produces the next checkpoint, with inference and training running asynchronously in parallel and throughput tuned to balance off-policy staleness.
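The producer-consumer shape of that loop can be sketched with threads and a queue. Everything here is a toy stand-in: the queue plays the role of the Iceberg tables, the dict plays a scored trajectory, and sandboxing, filtering, and checkpoint deployment are elided; the sketch only shows actors and trainer overlapping without a global synchronization barrier.

```python
import queue
import threading

trajectories = queue.Queue(maxsize=100)  # stand-in for the Iceberg trajectory store

def actor(actor_id, n_tasks):
    """Pull tasks, run the agent in a sandbox (simulated), score, and enqueue."""
    for task in range(n_tasks):
        traj = {"actor": actor_id, "task": task, "reward": 1.0}  # placeholder rollout
        trajectories.put(traj)  # written the moment it finishes: no epoch barrier

def trainer(n_updates):
    """Continuously consume trajectories while actors keep producing."""
    consumed = 0
    while consumed < n_updates:
        trajectories.get()      # blocks only when no trajectory is ready yet
        consumed += 1
    return consumed

actors = [threading.Thread(target=actor, args=(i, 5)) for i in range(4)]
for t in actors:
    t.start()
print(trainer(20))              # trainer overlaps with 4 actors: prints 20
for t in actors:
    t.join()
```

The real system's tuning problem lives in the gap this sketch hides: the faster actors run ahead of the trainer's latest checkpoint, the staler (more off-policy) their trajectories become, which is the balance the article says throughput is tuned for.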
Key Takeaways
- Poolside releases its first open-weight model: Laguna XS.2 is a 33B total-parameter MoE model with only 3B activated parameters per token, available under an Apache 2.0 license and compact enough to run locally on a Mac with 36 GB of RAM via Ollama.
- Strong benchmark performance at small scale: Laguna XS.2 scores 68.2% on SWE-bench Verified and 44.5% on SWE-bench Pro, while the larger Laguna M.1 (225B total, 23B activated) reaches 72.5% on SWE-bench Verified and 46.9% on SWE-bench Pro; both were trained from scratch on 30T tokens.
- Muon optimizer beats AdamW by 15% in training efficiency: Poolside replaced AdamW with a distributed implementation of the Muon optimizer, reaching the same training loss in roughly 15% fewer steps, with lower memory requirements (just one state per parameter instead of two).
- AutoMixer replaces manual data mixing with learned optimization: Instead of handcrafted data recipes, Poolside trains a swarm of ~60 proxy models on different data mixes and fits surrogate regressors to optimize dataset proportions, with synthetic data making up ~13% of Laguna XS.2's final training mix out of a total of 4.4T+ synthetic tokens.
- Fully asynchronous agent RL with GPUDirect RDMA weight transfer: Poolside's RL system runs inference and training in parallel, transferring hundreds of gigabytes of BF16 weights between nodes in under 5 seconds via GPUDirect RDMA, using a token-in, token-out actor design and the CISPO algorithm for off-policy training stability.
Check out the model weights and technical details.
