Monday, March 23, 2026

Clarifai Reasoning Engine Achieves 414 Tokens Per Second on Kimi K2.5


TL;DR

Using custom CUDA kernels and speculative decoding optimized for reasoning workloads, we achieved 414 tokens per second of throughput on Kimi K2.5 running on Nvidia B200 GPUs, making us one of the first providers to reach 400+ tokens per second on a trillion-parameter reasoning model.


Ahead of Nvidia GTC, we're excited to share that the Clarifai Reasoning Engine achieves 414 tokens per second (TPS) of throughput on Kimi K2.5, positioning us among the top inference providers for frontier reasoning models as measured by Artificial Analysis. Running on Nvidia B200 GPU infrastructure, our platform delivers production-grade performance for agentic workflows and complex reasoning tasks.

Figure 1: Clarifai achieves 414 tokens per second on Kimi K2.5, ranking among the fastest inference providers on Artificial Analysis benchmarks.

Why Kimi K2.5 performance matters

Kimi K2.5 is a 1-trillion-parameter reasoning model with a 384-expert Mixture-of-Experts architecture that activates 32 billion parameters per request. Built by Moonshot AI with native multimodal training on 15 trillion mixed visual and text tokens, the model delivers strong performance across key benchmarks: 50.2% HLE with tools, 76.8% SWE-Bench Verified, and 78.4% BrowseComp.

As a reasoning model, Kimi K2.5 generates extended thinking sequences before its final answers. Clarifai achieves a time to first answer token of 6 seconds, which includes the model's internal thinking time before it produces a response. Throughput directly impacts end-to-end response time for agentic systems, code generation, and multimodal reasoning tasks. At 414 TPS, we deliver the speed required for production deployments.
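To make these numbers concrete, here is a back-of-the-envelope latency model using the figures quoted above; the 2,000-token response length is an illustrative assumption, not a benchmark value:

```python
def response_time_s(output_tokens, ttft_s=6.0, tps=414.0):
    """End-to-end response time: time to first answer token plus decode time.

    ttft_s and tps default to the figures quoted in this post;
    output_tokens is whatever the workload produces.
    """
    return ttft_s + output_tokens / tps

# e.g. a 2,000-token answer:
total = response_time_s(2000)  # 6.0 + 2000 / 414 ≈ 10.8 seconds
```

At these speeds, decode time stays under five seconds even for long answers, so the fixed thinking time dominates the user-visible latency.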


Figure 2: Time to first answer token (TTFT) performance across inference providers, measured by Artificial Analysis with 10,000 input tokens.

How we optimize for throughput

The Clarifai Reasoning Engine uses three core optimizations for large reasoning models:

Custom CUDA kernels reduce memory stalls and improve cache locality. By optimizing low-level GPU operations, we keep streaming multiprocessors active during inference rather than waiting on data movement.

Speculative decoding predicts likely token paths and prunes misses quickly. This reduces wasted computation during the model's thinking sequence, a pattern common in reasoning workloads.

Adaptive optimization continuously learns from workload behavior. The system dynamically adjusts batching, memory reuse, and execution paths based on actual request patterns. These improvements compound over time, especially for the repetitive tasks common in agentic workflows.
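Clarifai's actual kernels and decoder are proprietary, but the speculative-decoding idea can be sketched in a few lines. In this toy Python illustration, a cheap draft model proposes a short run of tokens and an "expensive" target model verifies them, accepting the longest matching prefix; both models are deterministic stand-in functions, not real LLMs:

```python
def target_next(seq):
    # Stand-in for the expensive target model: deterministic next-token rule.
    return (sum(seq) * 7 + 3) % 50

def draft_next(seq):
    # Stand-in for the cheap draft model: agrees with the target most of the
    # time but occasionally misses.
    if len(seq) % 4 == 0:
        return (sum(seq) * 7 + 5) % 50  # deliberate miss
    return target_next(seq)

def speculative_decode(prompt, n_tokens, k=4):
    seq = list(prompt)
    generated = 0
    while generated < n_tokens:
        # 1. The draft model proposes k tokens autoregressively.
        ctx, proposal = list(seq), []
        for _ in range(k):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2. The target verifies; accept the longest matching prefix.
        for t in proposal:
            expected = target_next(seq)
            if t == expected and generated < n_tokens:
                seq.append(t)
                generated += 1
            else:
                # 3. First miss: take the target's token and re-draft.
                if generated < n_tokens:
                    seq.append(expected)
                    generated += 1
                break
    return seq[len(prompt):]

def greedy_decode(prompt, n_tokens):
    # Reference: plain token-by-token decoding with the target model alone.
    seq = list(prompt)
    for _ in range(n_tokens):
        seq.append(target_next(seq))
    return seq[len(prompt):]
```

The key property, which the sketch preserves, is that speculative decoding produces exactly the same output as decoding with the target model alone; the speedup comes from verifying several drafted tokens per target step instead of generating one at a time.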

Running on Nvidia B200 infrastructure gives us the hardware foundation to push performance boundaries, while our inference optimization stack delivers the software-level gains.

Building with Kimi K2.5

Kimi K2.5 is now available on the Clarifai Platform. Try it out in the Playground or via the API to get started.
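As a rough sketch of what an API call could look like, the snippet below builds an OpenAI-style chat completion request using only the standard library. The endpoint URL, model identifier, and credential environment variable are illustrative assumptions; consult the Clarifai API documentation for the real values:

```python
import json
import os
import urllib.request

# Hypothetical values -- check the Clarifai docs for the actual
# endpoint and model identifier.
API_URL = "https://api.clarifai.com/v2/ext/openai/v1/chat/completions"
MODEL_ID = "moonshotai/kimi-k2.5"

def build_request(prompt, api_key):
    """Build an OpenAI-style chat completion request (not yet sent)."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )

# To actually send it:
# req = build_request("Plan a test suite for this repo", os.environ["CLARIFAI_PAT"])
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```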

If you need dedicated compute to deploy Kimi K2.5 and other comparable top open models at scale for production workloads, get in touch with our team.


