JetBrains Releases Mellum2: A 12B MoE Model for Fast, Specialized Tasks in Multi-Model AI Pipelines

June 2, 2026


JetBrains has released Mellum2, open-sourcing the weights under the Apache 2.0 license. The first version of Mellum was a completion-focused 4B dense model. Mellum2 is its successor: a general-purpose model specialized in software engineering. It covers code generation and editing, debugging, multi-step reasoning, tool use and function calling, agentic coding, and conversational programming assistance.

The JetBrains team positions Mellum2 as a “focal model”: a fast, specialized component inside larger AI systems, not a standalone replacement for frontier models.

Architecture

Mellum2 uses a Mixture-of-Experts (MoE) architecture with 12B total parameters and 2.5B active parameters per token. In MoE models, only a subset of parameters runs on each token. Here, the model has 64 experts and activates 8 per token. This keeps per-token compute equivalent to a 2.5B dense model, while the total parameter count provides greater capacity for specialization.
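To make the routing idea concrete, here is a minimal sketch of top-k expert selection in plain Python. Only the 64/8 split comes from the article. The softmax-then-renormalize weighting is one common scheme and is an assumption here, not a confirmed detail of Mellum2, and the "experts" are toy scalar functions rather than real feed-forward blocks.

```python
import math
import random

NUM_EXPERTS = 64  # total experts, from the article
TOP_K = 8         # experts activated per token, from the article


def top_k_route(router_logits, k=TOP_K):
    """Pick the k highest-scoring experts and renormalise their weights."""
    peak = max(router_logits)
    exps = [math.exp(v - peak) for v in router_logits]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    chosen = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
    kept = sum(probs[i] for i in chosen)
    return [(i, probs[i] / kept) for i in chosen]


def moe_layer(x, experts, router_logits):
    """Weighted sum over the selected experts only; the others never run."""
    routes = top_k_route(router_logits)
    return sum(w * experts[i](x) for i, w in routes), routes


# Toy stand-ins (assumption): each "expert" just scales its input.
experts = [lambda x, scale=i + 1: x * scale for i in range(NUM_EXPERTS)]
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

output, routes = moe_layer(1.0, experts, logits)
print(len(routes), "of", NUM_EXPERTS, "experts ran")
print("fraction of experts active per token:", TOP_K / NUM_EXPERTS)
```

Only the eight selected experts are evaluated for a given token, which is why compute tracks the active parameter count rather than the total.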

Key architectural details:

Layers: 28
Hidden size: 2304
MoE experts: 64 total, 8 activated per token
Attention: Grouped-Query Attention (GQA) with 32 query heads and 4 KV heads
Sliding Window Attention (SWA): Applied to three of every four layers, with a window size of 1,024. Full attention runs on the remaining layer.
Context length: 131,072 tokens
Multi-Token Prediction (MTP) head: Serves as an auxiliary pre-training objective and as a built-in draft model for speculative decoding
Precision: bfloat16
Vocabulary size: 98,304
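The attention figures above imply a simple piece of arithmetic about how much key/value state has to be cached at long context. The sketch below works it through. The layer count, window size, and context length come from the list; which layer in each group of four uses full attention is not stated, so placing it last in the group is an assumption, as is the idea that a sliding-window layer only needs to cache the last 1,024 positions.

```python
NUM_LAYERS = 28        # from the article
SWA_WINDOW = 1_024     # from the article
CONTEXT_LEN = 131_072  # from the article
GROUP = 4              # three SWA layers plus one full-attention layer


def layer_kinds(num_layers=NUM_LAYERS, group=GROUP):
    """Assumed layout: the last layer of each group of four is full attention."""
    return ["full" if (i + 1) % group == 0 else "swa" for i in range(num_layers)]


def cached_positions(context_len, kinds, window=SWA_WINDOW):
    """Token positions whose keys/values stay cached, summed over all layers."""
    return sum(
        context_len if kind == "full" else min(context_len, window)
        for kind in kinds
    )


kinds = layer_kinds()
hybrid = cached_positions(CONTEXT_LEN, kinds)
all_full = NUM_LAYERS * CONTEXT_LEN

print(kinds.count("swa"), "SWA layers,", kinds.count("full"), "full-attention layers")
print(f"cached positions relative to all-full attention: {hybrid / all_full:.1%}")
```

Under these assumptions, 21 layers use the sliding window and 7 use full attention, and at the full 131,072-token context the hybrid layout caches roughly a quarter of the positions that an all-full-attention stack would.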

The model handles natural language and code. It is not multimodal: there is no image or video input.
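The MTP head in the list above is described as a built-in draft model for speculative decoding. As a rough illustration of that technique, the sketch below implements the greedy variant with toy functions: a cheap draft proposes several tokens, the main model checks them, and the verified prefix is kept. `target_next` and `draft_next` are hypothetical stand-ins, and the acceptance rule shown is the generic greedy one, not necessarily what Mellum2's serving stack uses.

```python
def target_next(tokens):
    """Toy 'main model': next token is a deterministic function of the prefix."""
    return (sum(tokens) * 31 + len(tokens)) % 50


def draft_next(tokens):
    """Toy 'draft head': agrees with the target except on some prefix lengths."""
    guess = target_next(tokens)
    return guess if len(tokens) % 4 else (guess + 1) % 50


def speculative_step(tokens, k=4):
    """Propose k draft tokens, keep the verified prefix, add one target token."""
    draft, prefix = [], list(tokens)
    for _ in range(k):
        nxt = draft_next(prefix)
        draft.append(nxt)
        prefix.append(nxt)

    accepted = list(tokens)
    for proposed in draft:
        verified = target_next(accepted)  # real systems batch these checks
        accepted.append(verified)
        if verified != proposed:
            return accepted  # mismatch: the target's token replaces the draft's
    accepted.append(target_next(accepted))  # every draft matched: bonus token
    return accepted


def generate(prompt, new_tokens, k=4):
    """Speculative generation; always advances by at least one token per step."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < new_tokens:
        tokens = speculative_step(tokens, k)
    return tokens[: len(prompt) + new_tokens]


def greedy(prompt, new_tokens):
    """Reference: plain one-token-at-a-time decoding with the target only."""
    tokens = list(prompt)
    for _ in range(new_tokens):
        tokens.append(target_next(tokens))
    return tokens


print(generate([1, 2, 3], 20) == greedy([1, 2, 3], 20))
```

The output matches plain greedy decoding token for token. The benefit in a real system is that the main model verifies several proposed tokens in one forward pass instead of producing them one at a time.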

Pre-Training

Pre-training spans roughly 10.6 trillion tokens through a three-phase curriculum. The data mixture progressively shifts from diverse web content toward curated code and mathematical content across the three phases.

Training used the Muon optimizer under FP8 hybrid precision, with a Warmup-Hold-Decay learning rate schedule that decays linearly to zero.
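The schedule described above has three segments, which a short function can make explicit. This is a sketch only: the article states the linear decay to zero, while the linear warmup shape and every step count and learning rate in the example are hypothetical values chosen for illustration.

```python
def whd_lr(step, total_steps, peak_lr, warmup_steps, decay_steps):
    """Learning rate at `step` for a warmup-hold-decay schedule."""
    if step < warmup_steps:
        # Warmup: ramp linearly up to the peak (assumed shape).
        return peak_lr * (step + 1) / warmup_steps
    decay_start = total_steps - decay_steps
    if step < decay_start:
        # Hold: stay at the peak for the bulk of training.
        return peak_lr
    # Decay: linear to zero at the final step, as the article describes.
    remaining = total_steps - step
    return peak_lr * max(remaining, 0) / decay_steps


# Hypothetical numbers, not Mellum2's actual hyperparameters.
for step in (0, 99, 500, 900, 1000):
    print(step, whd_lr(step, total_steps=1000, peak_lr=1e-3,
                       warmup_steps=100, decay_steps=200))
```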

After pre-training, the base model’s context window was extended to 128K tokens using a layer-selective YaRN method before post-training began.

The Model Family

The JetBrains team released six checkpoints covering the full training pipeline:

Checkpoint	Description
Mellum2-12B-A2.5B-Base-Pretrain	Base checkpoint before long-context extension
Mellum2-12B-A2.5B-Base	Final base model after context extension
Mellum2-12B-A2.5B-Instruct-SFT	Supervised fine-tuned instruction checkpoint
Mellum2-12B-A2.5B-Thinking-SFT	Supervised fine-tuned thinking checkpoint
Mellum2-12B-A2.5B-Instruct	RL-tuned instruction model
Mellum2-12B-A2.5B-Thinking	RL-tuned thinking model

Post-training follows two stages: supervised fine-tuning (SFT), then reinforcement learning with verifiable rewards (RLVR) on math, executable coding, tool use, instruction following, reasoning, and knowledge tasks.
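A "verifiable reward" for executable coding can be as simple as running a candidate solution against tests and scoring pass or fail. The sketch below illustrates that idea only; it is not JetBrains' reward function, and the candidate and test strings are made up for the example. Running untrusted code with `exec` in-process is unsafe, so a real pipeline would use a sandbox with time and memory limits.

```python
def coding_reward(candidate_source, test_source):
    """Return 1.0 if the candidate passes every assertion, else 0.0."""
    namespace = {}
    try:
        exec(candidate_source, namespace)  # define the candidate's functions
        exec(test_source, namespace)       # run the assertions against them
    except Exception:
        # Syntax errors, runtime errors, and failed assertions all score zero.
        return 0.0
    return 1.0


# Hypothetical model outputs and tests, for illustration only.
good = "def reverse(s):\n    return s[::-1]\n"
bad = "def reverse(s):\n    return s\n"
tests = "assert reverse('abc') == 'cba'\nassert reverse('') == ''\n"

print(coding_reward(good, tests), coding_reward(bad, tests))
```

Because the score comes from executing the code rather than from a learned judge, it cannot be satisfied by output that merely looks plausible.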

The Instruct variant answers directly, without an externalized chain of thought. Use it for low-latency tasks: direct answers, tool use, and instruction following.

The Thinking variant emits an explicit reasoning trace before its final answer. Use it for complex debugging, multi-step planning, or agentic flows where step-by-step reasoning matters.

Benchmark Results

All numbers below are self-reported by JetBrains. The comparison set is open-weight models in the 4B–14B range.

Coding:

Benchmark	Mellum2 Instruct	Qwen3.5 (4B)	Qwen3.5 (9B)	Ministral 3 (14B)	OLMo-3 (7B)	Seed-Coder (8B)
LiveCodeBench v6	37.2	51.0	63.7	42.4	28.2	28.1
EvalPlus	78.4	69.4	71.8	74.1	67.3	73.8
MultiPL-E	67.1	51.0	67.1	71.5	36.1	77.0

Tool Use:

Benchmark	Mellum2 Instruct	Qwen3.5 (4B)	Qwen3.5 (9B)	Ministral 3 (14B)	OLMo-3 (7B)
BFCL v3	66.3	64.1	70.5	52.7	41.9
BFCL v4	44.2	52.0	60.6	38.8	19.8

Math:

Benchmark	Mellum2 Instruct	Qwen3.5 (4B)	Qwen3.5 (9B)	Ministral 3 (14B)	OLMo-3 (7B)
AIME 2025+2026	41.7	38.3	58.3	33.3	40.0
GSM-Plus	80.5	85.2	87.9	86.6	85.8

Knowledge and Conversational:

Benchmark	Mellum2 Instruct	Qwen3.5 (4B)	Qwen3.5 (9B)	Ministral 3 (14B)	OLMo-3 (7B)
MMLU-Redux	78.1	87.5	91.1	85.9	71.8
GPQA Diamond	40.9	76.8	79.8	58.6	40.9
IFEval	75.8	82.1	83.9	67.3	83.2
MixEval	62.2	65.9	71.1	71.2	59.4

Benchmark notes:

EvalPlus is the mean of HumanEval+ and MBPP+
AIME is the mean of AIME 2025 and AIME 2026 (30 questions each)
BFCL v4 is the macro-average of 5 subtasks: v1, v2, v3, web search, memory
Seed-Coder (8B) does not support native tool calling; BFCL scores are not listed for it
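With this many columns it is easy to lose track of where Mellum2 Instruct actually lands. The short script below copies the scores from the tables above and computes its rank on each benchmark. The ranking rule (ties share the better rank) is a choice made for this example.

```python
# Scores copied from the tables above; Mellum2 Instruct is always listed first.
SCORES = {
    "LiveCodeBench v6": [37.2, 51.0, 63.7, 42.4, 28.2, 28.1],
    "EvalPlus":         [78.4, 69.4, 71.8, 74.1, 67.3, 73.8],
    "MultiPL-E":        [67.1, 51.0, 67.1, 71.5, 36.1, 77.0],
    "BFCL v3":          [66.3, 64.1, 70.5, 52.7, 41.9],
    "BFCL v4":          [44.2, 52.0, 60.6, 38.8, 19.8],
    "AIME 2025+2026":   [41.7, 38.3, 58.3, 33.3, 40.0],
    "GSM-Plus":         [80.5, 85.2, 87.9, 86.6, 85.8],
    "MMLU-Redux":       [78.1, 87.5, 91.1, 85.9, 71.8],
    "GPQA Diamond":     [40.9, 76.8, 79.8, 58.6, 40.9],
    "IFEval":           [75.8, 82.1, 83.9, 67.3, 83.2],
    "MixEval":          [62.2, 65.9, 71.1, 71.2, 59.4],
}


def mellum_rank(row):
    """1-based rank of the first entry; tied scores share the better rank."""
    return 1 + sum(score > row[0] for score in row[1:])


for name, row in SCORES.items():
    print(f"{name:18} rank {mellum_rank(row)} of {len(row)}")
```

EvalPlus comes out first of six and GSM-Plus last of five, which lines up with the strengths and limitations listed later in the article.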

https://blog.jetbrains.com/ai/2026/06/mellum2-goes-open-source-a-fast-model-for-ai-workflows/

Use Cases

JetBrains identifies four production scenarios where Mellum2’s latency and efficiency profile is relevant:

Routing and orchestration: In a multi-model system, a router analyzes incoming prompts and selects the appropriate model or tool for each task. Mellum2’s low per-token compute makes it suitable for this high-frequency classification step.
Low-latency RAG pipelines: Retrieval-Augmented Generation (RAG) systems retrieve relevant context, summarize it, and generate a response. Mellum2 handles retrieval summarization at lower latency than larger dense models.
Sub-agents in complex workflows: Agent pipelines break tasks into steps: context gathering, planning, validation, and execution. Mellum2 can handle repetitive or latency-sensitive steps instead of routing every step through a single large frontier model.
Private and local deployment: The Apache 2.0 license permits self-hosting without restrictions. Engineers can run Mellum2 on their own infrastructure, keeping code and data under their control.
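The routing scenario above can be sketched as a small dispatch function. Everything here is illustrative: `classify` is a keyword heuristic standing in for what would be a call to Mellum2 itself, `"frontier-model"` is a placeholder name, and the Thinking repository id is inferred from the checkpoint naming rather than confirmed by the article.

```python
from dataclasses import dataclass

# The Instruct id appears in the article's serve command; the Thinking id is
# inferred from the checkpoint table, and FRONTIER is a placeholder.
INSTRUCT = "JetBrains/Mellum2-12B-A2.5B-Instruct"
THINKING = "JetBrains/Mellum2-12B-A2.5B-Thinking"
FRONTIER = "frontier-model"


@dataclass
class Route:
    target: str
    reason: str


def classify(prompt):
    """Hypothetical stand-in for a fast classification call to a small model."""
    text = prompt.lower()
    if any(word in text for word in ("design", "architecture", "migrate")):
        return "open_ended"
    if any(word in text for word in ("debug", "traceback", "plan")):
        return "multi_step"
    return "direct"


def route(prompt, classifier=classify):
    """Map a prompt to the cheapest model expected to handle it."""
    label = classifier(prompt)
    if label == "open_ended":
        return Route(FRONTIER, "broad task, escalate to a larger model")
    if label == "multi_step":
        return Route(THINKING, "needs an explicit reasoning trace")
    return Route(INSTRUCT, "direct answer, lowest latency")


for prompt in ("Rename this variable", "Debug this traceback",
               "Design a new architecture for billing"):
    print(prompt, "->", route(prompt).target)
```

Passing the classifier in as a parameter keeps the dispatch logic testable without a running model, and lets the heuristic be swapped for a real model call later.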

Strengths and Limitations

Strengths:

MoE design activates only 2.5B of 12B parameters per token, so per-token compute is equivalent to a 2.5B dense model
MTP head enables speculative decoding with no separate draft model
131,072-token context window
Full checkpoint set released: base pretrain, base, SFT, and RL-tuned variants for both Instruct and Thinking
Apache 2.0 license: permits commercial use, self-hosting, and fine-tuning
Strong EvalPlus (78.4) and BFCL v3 (66.3) scores relative to 4B–14B comparisons
vLLM support, including optional tool calling via --tool-call-parser hermes

Limitations:

Text and code only: no image or multimodal input
LiveCodeBench v6 (37.2) trails Qwen3.5 9B (63.7) and Ministral 3 14B (42.4)
GPQA Diamond (40.9) and MMLU-Redux (78.1) are below most models in the comparison set
GSM-Plus (80.5) is below all comparable models listed
Not designed for frontier-level tasks: JetBrains explicitly positions Mellum2 as a component model

Marktechpost’s Visual Explainer

Overview

JetBrains Open-Sources Mellum2

A 12B Mixture-of-Experts model released under Apache 2.0 on June 2, 2026. Trained from scratch on ~10.6 trillion tokens for software engineering tasks.

Architecture

How Mellum2 Is Built

MoE activates 8 of 64 experts per token, so per-token compute stays equivalent to a 2.5B dense model. An MTP head enables speculative decoding with no separate draft model.

Experts (total / active): 64 / 8
SWA window: 1,024 (¾ of layers)

Pre-Training

Training Pipeline

Three-phase curriculum progressively shifts from diverse web data toward curated code and math. Context extended to 128K via layer-selective YaRN before post-training.

Data: ~10.6 trillion tokens across three curriculum phases
Optimizer: Muon under FP8 hybrid precision
LR Schedule: Warmup-Hold-Decay with linear decay to zero
Context Extension: Layer-selective YaRN to 128K tokens
Post-Training: SFT → RLVR on coding, math, tool use, reasoning, knowledge
Design Constraint: Inference efficiency on commodity GPUs validated by ablation

Model Family

Six Checkpoints Released

Full pipeline from base pretrain through RL-tuned variants. Use Instruct for direct low-latency answers. Use Thinking for explicit step-by-step reasoning traces.

BASE	Mellum2-12B-A2.5B-Base-Pretrain	Before context extension
BASE	Mellum2-12B-A2.5B-Base	After YaRN extension
SFT	Mellum2-12B-A2.5B-Instruct-SFT	Supervised instruction
SFT	Mellum2-12B-A2.5B-Thinking-SFT	Supervised thinking
RLVR	Mellum2-12B-A2.5B-Instruct	RL-tuned, no CoT
RLVR	Mellum2-12B-A2.5B-Thinking	RL-tuned, explicit CoT

Benchmarks

Evaluation Results (Instruct Variant)

All numbers self-reported by JetBrains. Comparison set: open-weight models in the 4B–14B range.

Benchmark	Mellum2	Qwen3.5 9B	Ministral 3 14B	OLMo-3 7B
LiveCodeBench v6	37.2	63.7	42.4	28.2
EvalPlus	78.4	71.8	74.1	67.3
MultiPL-E	67.1	67.1	71.5	36.1
BFCL v3	66.3	70.5	52.7	41.9
AIME 2025+2026	41.7	58.3	33.3	40.0
IFEval	75.8	83.9	67.3	83.2

Use Cases

Where Mellum2 Fits in Production

JetBrains positions Mellum2 as a “focal model” that handles high-frequency, latency-sensitive steps inside larger AI pipelines.

Routing & Orchestration: Analyze prompts and select the right model or tool per task
RAG Pipelines: Summarize retrieved context at low latency before response generation
Sub-Agents: Handle repetitive steps in agent pipelines (context gathering, validation, planning)
Private Deployment: Apache 2.0 permits full self-hosting with no external API calls required

Strengths & Limitations

What Works and What Doesn’t

Mellum2 is designed for efficiency in component roles, not frontier-level capability across all benchmarks.

✓ Strengths

2.5B active params: compute of a dense 2.5B model
MTP head enables built-in speculative decoding
131K-token context window
Strong EvalPlus (78.4) and BFCL v3 (66.3)
Apache 2.0: commercial use, fine-tuning, self-hosting
vLLM support with tool calling

✗ Limitations

Text and code only: no multimodal input
LiveCodeBench v6 (37.2) below Qwen3.5 9B (63.7)
GPQA Diamond (40.9) below most comparisons
GSM-Plus (80.5) trails all models listed
Not a frontier replacement: component role only

Quick Start

Deploy with vLLM

Install vLLM and serve the Instruct variant. Enable tool calling with the hermes parser for function-calling workflows.

pip install vllm

# Basic serve
vllm serve JetBrains/Mellum2-12B-A2.5B-Instruct \
  --max-model-len 131072

# With tool calling
vllm serve JetBrains/Mellum2-12B-A2.5B-Instruct \
  --max-model-len 131072 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes

Model weights: huggingface.co/JetBrains/mellum-2 · Technical report: arXiv:2605.31268

Getting Started

Serve Mellum2 with vLLM:

pip install vllm
vllm serve JetBrains/Mellum2-12B-A2.5B-Instruct --max-model-len 131072

With tool calling enabled:

vllm serve JetBrains/Mellum2-12B-A2.5B-Instruct \
  --max-model-len 131072 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes
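Once the server is running with tool calling enabled, a client can send tool definitions through vLLM's OpenAI-compatible chat completions endpoint. The sketch below uses only the standard library. It assumes the server is on vLLM's default port 8000, and the `run_tests` tool is a hypothetical definition invented for this example.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # vLLM's default port; adjust if changed
MODEL = "JetBrains/Mellum2-12B-A2.5B-Instruct"

# Hypothetical tool, defined only for this example.
RUN_TESTS_TOOL = {
    "type": "function",
    "function": {
        "name": "run_tests",
        "description": "Run the test suite for a given path.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}


def build_request(user_message):
    """Chat completion payload that lets the model decide whether to call a tool."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": user_message}],
        "tools": [RUN_TESTS_TOOL],
        "tool_choice": "auto",
    }


def send(payload, base_url=BASE_URL):
    """POST the payload to the running server and return the parsed JSON reply."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def tool_calls(response):
    """Extract (name, parsed arguments) pairs from a chat completion response."""
    message = response["choices"][0]["message"]
    return [
        (call["function"]["name"], json.loads(call["function"]["arguments"]))
        for call in message.get("tool_calls") or []
    ]
```

With the server up, `tool_calls(send(build_request("Run the tests in src/")))` returns whichever calls the model chose to make. The application then executes them and sends the results back in a follow-up message with the `tool` role.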

Using the Hugging Face Transformers library:

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and the Instruct checkpoint from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("JetBrains/Mellum2-12B-A2.5B-Instruct")
model = AutoModelForCausalLM.from_pretrained("JetBrains/Mellum2-12B-A2.5B-Instruct")

# Format the conversation with the model's chat template
messages = [{"role": "user", "content": "Write a Python function to reverse a string."}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Generate, then decode only the newly generated tokens (skip the prompt)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

