DeepReinforce Releases Ornith-1.0: An Open-Source Coding Model Family That Learns Its Own RL Scaffolds

June 25, 2026

DeepReinforce has released Ornith-1.0, an open-source model family built for agentic coding. The lineup spans four sizes, from a 9B dense model to a 397B mixture-of-experts flagship. Each checkpoint ships under the MIT license on Hugging Face. The models are post-trained on top of pretrained Gemma 4 and Qwen 3.5.

Most coding agents pair a model with a fixed, human-designed harness. Ornith-1.0 instead learns to write its own. The DeepReinforce research team reports state-of-the-art results among open models of comparable size.

TL;DR

Ornith-1.0 ships in 9B, 31B, 35B-MoE, and 397B-MoE sizes under MIT, built on Gemma 4 and Qwen 3.5.
The model learns its own scaffold during RL, jointly optimizing the harness and the solution.
Ornith-1.0-397B tops Claude Opus 4.7 on both headline benchmarks, but not Opus 4.8 or the larger GLM-5.2-744B.
Three layers (a fixed trust boundary, a deterministic monitor, and a frozen LLM judge) guard against reward hacking.

What is Ornith-1.0?

Ornith-1.0 is a set of reasoning models tuned for coding agents. The variants are 9B Dense, 31B Dense, 35B MoE, and 397B MoE. The 35B model is mixture-of-experts and activates roughly 3B parameters per token. FP8 and GGUF builds are also published for faster local serving.

Every model is a reasoning model. Replies open with a reasoning block before the final answer. The serving recipes enable a reasoning parser, so that trace returns in a separate reasoning_content field. The models also emit well-formed tool calls for agent loops.

Deployment is straightforward. The 9B model is about 19GB in bf16 and serves on a single 80GB GPU. Serving recipes target vLLM, SGLang, and Transformers. Each model exposes an OpenAI-compatible endpoint. Standard agent frameworks therefore work without code changes.
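The 19GB figure is consistent with simple arithmetic, since bf16 stores each parameter in 2 bytes. The sketch below is a back-of-the-envelope estimate only: the function name and the 9.5e9 parameter count are illustrative assumptions (the article gives just the rounded "9B" label), and the estimate ignores KV cache and activation memory.

```python
def weight_footprint_gb(num_params: float, bytes_per_param: float = 2.0) -> float:
    """Approximate size of the model weights in decimal gigabytes.

    bytes_per_param is 2.0 for bf16 and 1.0 for FP8.
    """
    return num_params * bytes_per_param / 1e9


# A nominal 9e9-parameter model in bf16.
print(weight_footprint_gb(9e9))             # 18.0
# Assumed exact count of 9.5e9 parameters, which would match "about 19GB".
print(weight_footprint_gb(9.5e9))           # 19.0
# The same assumed count in FP8 takes roughly half the space.
print(weight_footprint_gb(9.5e9, 1.0))      # 9.5
```

Whatever the exact parameter count, weights of this size leave most of an 80GB GPU free for the KV cache, which matters at long context lengths.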

Interactive Explainer

[Interactive explainer: a step-by-step training demo, a benchmark chart selectable by model tier and benchmark, and an expandable list of the three defense layers. Scores are vendor-reported and not independently verified.]

Vendor-reported scores for the two smaller tiers, taken from the explainer's chart data:

Benchmark	Ornith-1.0-35B-A3B	Qwen3.5-35B-A3B	Qwen3.6-35B-A3B	Gemma4-31B	Qwen3.5-397B
Terminal-Bench 2.1	64.2	41.4	52.5	42.1	53.5
SWE-Bench Verified	75.6	70	73.4	52	76.4
SWE-Bench Pro	50.4	44.6	49.5	35.7	51.6
SWE-Bench Multilingual	69.3	60.3	67.2	51.7	69.3
NL2Repo	34.6	20.5	29.4	15.5	36.8
ClawEval Avg	69.8	65.4	68.7	48.5	70.7

Benchmark	Ornith-1.0-9B	Qwen3.5-9B	Qwen3.5-35B-A3B	Gemma4-12B	Gemma4-31B
Terminal-Bench 2.1	43.1	21.3	41.4	21	42.1
SWE-Bench Verified	69.4	53.2	70	44.2	52
SWE-Bench Pro	42.9	31.3	44.6	27.6	35.7
SWE-Bench Multilingual	52	39.7	60.3	32.5	51.7
NL2Repo	27.2	16.2	20.5	10.3	15.5
ClawEval Avg	63.1	53.2	65.4	32.5	48.5

The Self-Scaffolding Idea

Most coding agents rely on a scaffold, also called a harness. A scaffold wraps the model with memory, tools, error handling, and orchestration logic. AI teams usually hand-design one scaffold per task category.

Ornith-1.0 treats the scaffold as a learnable object instead. During reinforcement learning, the scaffold co-evolves with the model's policy. Each RL step runs in two stages.

First, the model reads the task and its previous scaffold. It then proposes a refined scaffold. Second, it uses that scaffold and the task to generate a solution rollout. Reward from the rollout flows back to both stages.

So the model is optimized to author orchestration, not just answers. Over training, higher-reward scaffolds are mutated and selected automatically. Per-task strategies emerge without hand-engineered harness design.
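The two-stage step can be sketched as a toy loop. This is an illustration of the control flow only, not DeepReinforce's training code: every function and class name here is hypothetical, the "policy" is a random stub, and a real system would use the shared reward to update the model on tokens from both stages rather than just keeping the best scaffold.

```python
import random
from dataclasses import dataclass


@dataclass
class StepRecord:
    scaffold: str   # stage 1 output: the refined scaffold
    solution: str   # stage 2 output: the solution rollout
    reward: float   # one scalar credited to both stages


def propose_scaffold(task: str, previous: str, rng: random.Random) -> str:
    """Stage 1 (stub): refine the previous scaffold for this task."""
    tweak = rng.choice(
        ["add retry on tool error", "run tests before submit", "summarize long logs"]
    )
    return f"{previous}; {tweak}"


def run_rollout(task: str, scaffold: str) -> str:
    """Stage 2 (stub): produce a solution while following the scaffold."""
    return f"patch for {task!r} using [{scaffold}]"


def verify(solution: str) -> float:
    """Toy verifier: rewards rollouts whose scaffold runs the tests."""
    return 1.0 if "run tests before submit" in solution else 0.0


def rl_step(task: str, previous: str, rng: random.Random) -> StepRecord:
    scaffold = propose_scaffold(task, previous, rng)   # stage 1
    solution = run_rollout(task, scaffold)             # stage 2
    reward = verify(solution)                          # shared by both stages
    return StepRecord(scaffold, solution, reward)


rng = random.Random(0)
best = StepRecord("base harness", "", 0.0)
for _ in range(5):
    rec = rl_step("fix failing test", best.scaffold, rng)
    if rec.reward >= best.reward:   # keep scaffolds that do at least as well
        best = rec
print(best.scaffold, best.reward)
```

The point of the sketch is the data flow: the scaffold from stage 1 is an input to stage 2, and a single reward is attached to the pair, so improving the scaffold is rewarded in the same way as improving the answer.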

Training also runs asynchronously, using a pipeline-RL setup. A staleness weight downweights older, off-policy tokens and drops them past a threshold. The optimization uses a token-level GRPO objective.
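The article does not give the formulas, so the following is a sketch under stated assumptions rather than the published objective. It assumes an exponential staleness decay with a hard cutoff, the usual GRPO group-standardized advantage, and normalization by the total token count in the group. The function names and the decay and max_lag values are hypothetical, and the clipped importance ratio used in GRPO-style losses is omitted for brevity.

```python
import math
from statistics import mean, pstdev


def staleness_weight(lag: int, decay: float = 0.5, max_lag: int = 4) -> float:
    """Weight for a token sampled `lag` policy versions ago.

    Assumed form: exponential decay, dropped entirely past max_lag.
    """
    if lag > max_lag:
        return 0.0
    return math.exp(-decay * lag)


def group_advantages(rewards, eps: float = 1e-6):
    """GRPO-style advantage: reward standardized within the rollout group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def token_level_coefficients(rewards, token_lags):
    """Per-token loss coefficients for one group of rollouts.

    Each token gets advantage * staleness weight, divided by the total
    number of tokens in the group (token-level averaging, instead of
    averaging within each sequence first).
    """
    advantages = group_advantages(rewards)
    total_tokens = sum(len(lags) for lags in token_lags)
    return [
        [adv * staleness_weight(lag) / total_tokens for lag in lags]
        for adv, lags in zip(advantages, token_lags)
    ]


# Two rollouts for the same task; the last token of the second is too stale.
coeffs = token_level_coefficients([1.0, 0.0], [[0, 0, 1], [0, 2, 5]])
print(coeffs)
```

With token-level averaging, long rollouts contribute more tokens to the loss than short ones, which is the practical difference from per-sequence averaging.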

Guarding Against Reward Hacking

Letting a model write its own scaffold invites reward hacking. A scaffold could read visible test files and hardcode expected outputs. It could also copy an oracle solution sitting in the environment. The DeepReinforce team describes three defense layers.

The outer trust boundary is fixed and immutable. The environment, tool surface, and test isolation stay outside the model's reach. The model evolves only its inner policy scaffold.
A deterministic monitor flags banned actions. Reading withheld paths or editing verification scripts earns zero reward. These trajectories are excluded from the advantage computation.
A frozen LLM judge acts as a veto. It sits on top of the verifier, not as the primary reward.
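The three layers above can be read as a reward gate. The sketch below is one plausible arrangement, not the team's implementation: the path prefixes, file names, and class are hypothetical, the judge verdict is assumed to be computed elsewhere, and keeping judge-vetoed trajectories in the advantage computation is an assumption the article does not confirm.

```python
from dataclasses import dataclass, field

# Layer 1 (trust boundary): these rules live outside anything the model can edit.
BANNED_READ_PREFIXES = ("tests/hidden/", "oracle/")   # hypothetical withheld paths
PROTECTED_FILES = {"run_verifier.sh"}                 # hypothetical verification script


@dataclass
class Trajectory:
    reads: list = field(default_factory=list)    # paths the rollout read
    writes: list = field(default_factory=list)   # paths the rollout modified
    verifier_passed: bool = False                # primary reward signal
    judge_vetoed: bool = False                   # verdict of the frozen LLM judge


def monitor_flags(traj: Trajectory) -> bool:
    """Layer 2: deterministic check for banned actions."""
    read_banned = any(p.startswith(BANNED_READ_PREFIXES) for p in traj.reads)
    edited_verifier = any(p in PROTECTED_FILES for p in traj.writes)
    return read_banned or edited_verifier


def gated_reward(traj: Trajectory):
    """Return (reward, include_in_advantage)."""
    if monitor_flags(traj):
        return 0.0, False    # zero reward and excluded from the advantage
    if traj.judge_vetoed:
        return 0.0, True     # Layer 3: the judge can only veto, never award
    return (1.0 if traj.verifier_passed else 0.0), True


cheater = Trajectory(reads=["tests/hidden/test_api.py"], verifier_passed=True)
honest = Trajectory(reads=["src/api.py"], writes=["src/api.py"], verifier_passed=True)
print(gated_reward(cheater))   # (0.0, False)
print(gated_reward(honest))    # (1.0, True)
```

Ordering matters in this arrangement: the deterministic monitor runs first, so a flagged trajectory never reaches the judge, and the judge can only lower a reward the verifier has already granted.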

Benchmark

DeepReinforce reports vendor numbers across several agentic coding benchmarks. At flagship scale, Ornith-1.0-397B posts 77.5 on Terminal-Bench 2.1 and 82.4 on SWE-Bench Verified. On SWE-Bench Verified, that 82.4 trails only Claude Opus 4.8 (87.6) among the listed models. On Terminal-Bench 2.1, the picture is more mixed.

Ornith-1.0-397B beats Claude Opus 4.7 (70.3) on Terminal-Bench 2.1. However, it trails Claude Opus 4.8 (85) and the larger GLM-5.2-744B (81.0). So the ‘state-of-the-art’ claim is scoped to open models of comparable size.

The smaller models carry the efficiency case. The 35B model scores 64.2 on Terminal-Bench 2.1, above Qwen3.5-397B's 53.5. The 9B model reaches 43.1 on Terminal-Bench 2.1 and 69.4 on SWE-Bench Verified.

Benchmark	Ornith-1.0-397B	Qwen3.5-397B	Qwen3.7-Max	GLM-5.2-744B	Minimax-M3-428B	DeepSeek-V4-Pro-1.6T	Claude Opus 4.7	Claude Opus 4.8
Terminal-Bench 2.1	77.5	53.5	73.5	81.0	64	64	70.3	85
SWE-Bench Verified	82.4	76.4	80.4	–	–	80.6	80.8	87.6
SWE-Bench Pro	62.2	51.6	60.6	62.1	59	55.4	64.3	69.2
SWE-Bench Multilingual	78.9	69.3	78.3	–	–	76.2	–	–
NL2Repo	48.2	36.8	47.2	48.9	42.1	–	–	69.7
ClawEval Avg	77.1	70.7	65.2	–	–	75.8	78.2	–

Use Cases and a Quick Start

The models target terminal-native coding agents and repository-scale work. Practical fits include multi-file refactors, bug localization, and test-driven patches. The 9B model suits edge or single-GPU setups where latency and cost matter. The 397B model targets maximum accuracy on long, multi-step tasks.

For example, a developer can run the 9B model locally to triage a failing test suite. A platform team can self-host the 397B model for an internal coding agent.

Serving takes a single vLLM command:

vllm serve deepreinforce-ai/Ornith-1.0-9B \
    --served-model-name Ornith-1.0-9B \
    --max-model-len 262144 \
    --enable-auto-tool-choice --tool-call-parser qwen3_xml \
    --reasoning-parser qwen3 \
    --trust-remote-code

Then call it with any OpenAI client:

from openai import OpenAI

# Point the client at the local vLLM server; the key is a placeholder.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Ornith-1.0-9B",
    messages=[{"role": "user", "content": "Write a Python is_prime(n)."}],
    temperature=0.6, top_p=0.95,
)
msg = resp.choices[0].message
print(getattr(msg, "reasoning_content", None))  # the reasoning trace
print(msg.content)                              # the final answer

The reasoning trace returns in reasoning_content, with the answer in content. Recommended sampling is temperature=0.6, top_p=0.95, top_k=20. The model also plugs into OpenHands, OpenClaw, and OpenCode.
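The OpenAI client has no top_k argument, so the recommended top_k=20 needs another route. With a vLLM server, extra sampling parameters can be passed through the client's extra_body field. The sketch below builds such a request and optionally attaches a tool definition for agent loops; build_request, send, and the run_tests tool are hypothetical names for illustration, not part of the Ornith release.

```python
def build_request(prompt: str, model: str = "Ornith-1.0-9B", tools=None) -> dict:
    """Build keyword arguments for client.chat.completions.create()."""
    request = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.6,
        "top_p": 0.95,
        # Not a named OpenAI argument; forwarded as-is in the request body.
        "extra_body": {"top_k": 20},
    }
    if tools:
        request["tools"] = tools
    return request


# Hypothetical tool in the OpenAI function-calling schema.
RUN_TESTS_TOOL = {
    "type": "function",
    "function": {
        "name": "run_tests",
        "description": "Run the project's test suite and return the output.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}


def send(request: dict, base_url: str = "http://localhost:8000/v1"):
    """Send the request to a running OpenAI-compatible server."""
    from openai import OpenAI  # imported lazily; only needed when sending

    client = OpenAI(base_url=base_url, api_key="EMPTY")
    return client.chat.completions.create(**request)


request = build_request("Why does tests/test_api.py fail?", tools=[RUN_TESTS_TOOL])
print(request["extra_body"])
```

When the model decides to call a tool, the response carries the call in message.tool_calls instead of a final answer in content; the agent framework runs the tool and sends the result back as a follow-up message.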
