Liquid AI Ships LFM2.5-230M with llama.cpp, MLX, vLLM, SGLang, and ONNX Support for On-Device Inference

June 28, 2026


Liquid AI has shipped LFM2.5-230M, the company's smallest model to date. The release targets a specific job: running agentic tasks on phones, robots, and automation devices. Both the base and instruction-tuned checkpoints are open-weight on Hugging Face.

The pitch is narrow on purpose. This isn't a general reasoning model. It's built for data extraction and tool use on edge hardware.

TL;DR

Liquid AI's LFM2.5-230M is its smallest model yet: 230M params, open-weight, built on LFM2.
Runs on-device at 213 tok/s on a Galaxy S25 Ultra and 42 tok/s on a Raspberry Pi 5.
Beats larger models (Qwen3.5-0.8B, Gemma 3 1B) on instruction following and data extraction.
Tuned for tool use and extraction; not for math, code generation, or creative writing.
Day-one support across llama.cpp, MLX, vLLM, SGLang, and ONNX, with a 293–375 MB footprint.

What Is LFM2.5-230M?

LFM2.5-230M is a 230-million-parameter, text-only model. It's built on the LFM2 architecture. The model has 14 layers in total. Eight are double-gated LIV convolution blocks. The remaining six are grouped-query attention (GQA) blocks. The hybrid architecture targets fast CPU inference.

The context length is 32,768 tokens. The vocabulary size is 65,536. The knowledge cutoff is mid-2024. It supports ten languages, including English, Chinese, Arabic, and Japanese.

The Liquid AI team ships two checkpoints. LFM2.5-230M-Base is the pre-trained model for fine-tuning. LFM2.5-230M is the general-purpose instruction-tuned model. The license is lfm1.0.

Training and Post-Training

The model was pre-trained on 19 trillion tokens. That total includes a 32K context extension phase. The post-training recipe then runs in three stages.

First comes supervised fine-tuning with distillation from the larger LFM2.5-350M. Second is direct preference optimization (DPO). Third is multi-domain reinforcement learning. This preserves flexibility for downstream specialization.

The distillation step is what keeps a 230M model competitive with larger checkpoints. It inherits behavior from the bigger LFM2.5-350M on targeted tasks.

Benchmark

The Liquid AI team evaluated LFM2.5-230M across ten benchmarks. They span knowledge, instruction following, data extraction, and tool use.

The instruction-following results support the narrow pitch. On IFEval, LFM2.5-230M scores 71.71. That beats Qwen3.5-0.8B (59.94) and Gemma 3 1B IT (63.49). On IFBench it scores 38.40, ahead of both. On CaseReportBench, a medical data-extraction test, it scores 22.51.

Model	Params	IFEval	IFBench	CaseReportBench	BFCLv4	MMLU-Pro
LFM2.5-230M	230M	71.71	38.40	22.51	21.03	20.25
LFM2.5-350M	350M	76.96	40.69	32.45	21.86	20.01
Granite 4.0-H-350M	350M	61.27	17.22	12.44	13.28	13.14
Qwen3.5-0.8B (Instruct)	800M	59.94	22.87	13.83	18.70	37.42
Gemma 3 1B IT	1B	63.49	20.33	2.28	7.17	14.04

Against the competing models in the table, LFM2.5-230M leads on instruction following and data extraction; only its larger sibling, LFM2.5-350M, scores higher. It trails on broad knowledge: its MMLU-Pro score is 20.25, behind Qwen3.5-0.8B's 37.42. It is also weak on some agentic tool use: on τ²-Bench Telecom it scores just 5.26.
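To make the size of those leads concrete, here is a small sketch that recomputes the margins from the table. It uses only the scores printed above; the helper name `margins_over_rivals` and the choice to exclude same-family LFM2.5 models from the comparison are assumptions made for illustration.

```python
# Scores copied from the table above: (IFEval, IFBench, CaseReportBench).
SCORES = {
    "LFM2.5-230M": (71.71, 38.40, 22.51),
    "LFM2.5-350M": (76.96, 40.69, 32.45),
    "Granite 4.0-H-350M": (61.27, 17.22, 12.44),
    "Qwen3.5-0.8B (Instruct)": (59.94, 22.87, 13.83),
    "Gemma 3 1B IT": (63.49, 20.33, 2.28),
}
BENCHMARKS = ("IFEval", "IFBench", "CaseReportBench")


def margins_over_rivals(target="LFM2.5-230M", family_prefix="LFM2.5"):
    """Return {benchmark: (best_rival, margin)}, ignoring same-family models."""
    rivals = {k: v for k, v in SCORES.items() if not k.startswith(family_prefix)}
    result = {}
    for i, bench in enumerate(BENCHMARKS):
        # Strongest competitor on this benchmark.
        best = max(rivals, key=lambda name: rivals[name][i])
        result[bench] = (best, round(SCORES[target][i] - rivals[best][i], 2))
    return result


if __name__ == "__main__":
    for bench, (rival, margin) in margins_over_rivals().items():
        print(f"{bench}: +{margin:.2f} points over {rival}")
```

On these numbers the gap to the strongest outside model is 8.22 points on IFEval, 15.53 on IFBench, and 8.68 on CaseReportBench.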

Liquid AI is direct about the limits. It doesn't recommend the model for reasoning-heavy workloads. That means advanced math, code generation, and creative writing.

Use Cases With Examples

The model fits two jobs well.

The first is large-scale data extraction pipelines. Picture a pipeline parsing 100,000 medical reports into structured fields. A 4-bit build with a 293–375 MB memory footprint runs that on commodity CPUs. You extract locally, with no per-token API bill.
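A single step of such a pipeline might look like the sketch below. It is an illustration under stated assumptions, not Liquid AI's code: `generate` stands for any wrapper you write around your runtime (llama.cpp, Transformers, and so on) that takes a prompt and returns text, and the field names, prompt wording, and `fake_generate` stand-in are all hypothetical.

```python
import json
from typing import Callable, Optional

# Hypothetical target schema; the field names are illustrative only.
FIELDS = ("patient_age", "diagnosis", "medication")

PROMPT_TEMPLATE = (
    "Extract the following fields from the report and answer with a single "
    "JSON object using exactly these keys: {keys}. Use null when a field is "
    "absent.\n\nReport:\n{report}"
)


def extract_fields(report: str, generate: Callable[[str], str]) -> Optional[dict]:
    """Run one report through the model and validate the JSON it returns.

    `generate` is an assumed wrapper: prompt string in, completion text out.
    Returns None when no valid JSON object can be recovered.
    """
    prompt = PROMPT_TEMPLATE.format(keys=", ".join(FIELDS), report=report)
    raw = generate(prompt)
    # The reply may carry text around the JSON; keep the outermost braces.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        data = json.loads(raw[start : end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    # Keep only the expected keys, filling gaps with None.
    return {key: data.get(key) for key in FIELDS}


def fake_generate(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    return 'Sure: {"patient_age": 54, "diagnosis": "pneumonia", "medication": null}'


if __name__ == "__main__":
    print(extract_fields("54-year-old with pneumonia.", fake_generate))
```

Returning `None` on malformed output lets the pipeline log and retry a report instead of writing a partial record.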

The second job is lightweight on-device agentic workloads. Think of a home automation hub that turns speech into tool calls. Or a phone assistant that routes a request to the right function.

As an early signal, Liquid AI deployed the model on a Unitree G1 humanoid robot. It ran entirely on the robot's onboard NVIDIA Jetson Orin. There the model acted as a skill-selection layer. It turned one natural-language instruction into a sequence of tool calls. Those calls invoked low-level skills from NVIDIA's SONIC framework.

LFM2.5 supports function calling in four steps. You define tools as JSON in the system prompt. The model writes a Pythonic function call between special tokens. You execute the call and return the result. The model then writes a plain-text answer.
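The first of those steps can be sketched in a few lines. This is an illustration rather than Liquid AI's reference code: the `build_messages` helper and the tool's `description` field are hypothetical, and the "List of tools:" prefix is assumed from the prompt pattern Liquid AI documents.

```python
import json

# Hypothetical tool definition used only for illustration.
TOOLS = [
    {
        "name": "get_candidate_status",
        "description": "Look up the status of a job candidate.",
        "parameters": {"candidate_id": {"type": "string"}},
    }
]


def build_messages(user_text: str, tools: list) -> list:
    """Step 1: place the tool definitions, serialized as JSON, in the system prompt."""
    system = "List of tools: " + json.dumps(tools)
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_text},
    ]


if __name__ == "__main__":
    question = "What is the current status of candidate ID 12345?"
    for message in build_messages(question, TOOLS):
        print(message["role"], "->", message["content"])
```

The resulting list has the same role/content shape that `tokenizer.apply_chat_template` accepts, so it can be rendered into the model's chat format before generation.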

By default the call is a Python list. It sits between the <|tool_call_start|> and <|tool_call_end|> tokens. Here is the documented pattern, with the tool JSON abbreviated:

<|im_start|>system
List of tools: [{"name": "get_candidate_status",
  "parameters": {"candidate_id": {"type": "string"}}}]<|im_end|>
<|im_start|>user
What is the current status of candidate ID 12345?<|im_end|>
<|im_start|>assistant
<|tool_call_start|>[get_candidate_status(candidate_id="12345")]<|tool_call_end|>Checking the current status of candidate ID 12345.<|im_end|>

You can also force JSON-formatted calls via the system prompt.
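For the default Pythonic format, the call has to be pulled out of the reply and dispatched on your side. The sketch below shows one cautious way to do that with Python's `ast` module, which parses the call without executing it. The parser, the `REGISTRY` mapping, and the canned `get_candidate_status` implementation are hypothetical; only the two delimiter tokens come from the pattern above.

```python
import ast

START, END = "<|tool_call_start|>", "<|tool_call_end|>"


def parse_tool_calls(text: str) -> list:
    """Return [(function_name, kwargs), ...] from a Pythonic tool-call span."""
    begin = text.find(START)
    finish = text.find(END, begin + len(START))
    if begin == -1 or finish == -1:
        return []
    span = text[begin + len(START) : finish].strip()
    # Parse without executing: the span should be a list of call expressions.
    tree = ast.parse(span, mode="eval")
    if not isinstance(tree.body, ast.List):
        raise ValueError("expected a Python list of calls")
    calls = []
    for node in tree.body.elts:
        if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)):
            raise ValueError("expected plain function calls")
        if node.args or any(kw.arg is None for kw in node.keywords):
            raise ValueError("only keyword arguments are handled in this sketch")
        # literal_eval accepts constants and containers, never arbitrary code.
        kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
        calls.append((node.func.id, kwargs))
    return calls


def get_candidate_status(candidate_id: str) -> dict:
    """Hypothetical tool implementation with a canned answer."""
    return {"candidate_id": candidate_id, "status": "interview scheduled"}


# Only functions listed here can be invoked by the model.
REGISTRY = {"get_candidate_status": get_candidate_status}

if __name__ == "__main__":
    reply = (
        '<|tool_call_start|>[get_candidate_status(candidate_id="12345")]'
        "<|tool_call_end|>Checking the current status of candidate ID 12345."
    )
    for name, kwargs in parse_tool_calls(reply):
        print(name, kwargs, "->", REGISTRY[name](**kwargs))
```

Two design notes. Looking functions up in an explicit registry, rather than calling `eval` on the model's output, keeps the model from invoking anything you did not expose. And if the delimiters are registered as special tokens in the tokenizer, decoding with `skip_special_tokens=True` would strip them, so decode with that flag off when you expect a tool call.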

Running It: A Minimal Example

The model works with Transformers 5.0.0 and up. The recommended generation settings are temperature 0.1, top_k 50, and repetition_penalty 1.05. Note the do_sample=True flag, which is required for these sampling settings to apply.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LiquidAI/LFM2.5-230M"

# Load the instruction-tuned checkpoint in bfloat16 on the available device.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    dtype="bfloat16",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Render the chat template and tokenize in one step.
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is C. elegans?"}],
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Sample with the recommended settings.
output = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.1,
    top_k=50,
    repetition_penalty=1.05,
    max_new_tokens=512,
)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
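As a rough guide to what a 512-token generation budget costs on the devices quoted in the TL;DR, here is a back-of-the-envelope estimate. It assumes the published tok/s figures describe steady decode throughput and it ignores prompt processing, so treat the results as ballpark values rather than measurements.

```python
# Throughput figures quoted in this article, in tokens per second.
THROUGHPUT_TOK_S = {"Galaxy S25 Ultra": 213, "Raspberry Pi 5": 42}


def generation_seconds(new_tokens: int, tok_per_s: float) -> float:
    """Time to emit `new_tokens`, assuming constant decode throughput."""
    return new_tokens / tok_per_s


if __name__ == "__main__":
    for device, speed in THROUGHPUT_TOK_S.items():
        print(f"{device}: {generation_seconds(512, speed):.1f} s for 512 tokens")
```

Under those assumptions a full 512-token reply takes about 2.4 seconds on the phone and about 12.2 seconds on the Raspberry Pi 5. Short tool calls finish far sooner, since they need only a few dozen tokens.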

Liquid AI also publishes fine-tuning recipes. They cover SFT, DPO, and GRPO with LoRA, via Unsloth and TRL. Each ships as a Colab notebook.
