What changed after we split the pools
We ran a two-week proof of concept. I split the cluster into two pools: eight GPUs dedicated to prompt processing and the remaining GPUs handling token generation. No new hardware, no new cluster, just a configuration change in the serving layer and a routing policy that sent each request to the right pool based on its inference phase. The prompt-processing pool hit 90–95% compute utilization consistently because that's all it did. No token generation competing for scheduling slots. No decode requests sitting idle while a prefill burst hogged the cores.
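For concreteness, here's a minimal sketch of what a phase-based routing policy can look like. Every name, pool size, and endpoint below is an illustrative assumption, not the customer's actual serving config.

```python
# Hypothetical phase-based router: pool names, sizes, and endpoints
# are illustrative assumptions, not the real deployment.
import itertools
from dataclasses import dataclass
from enum import Enum

class Phase(Enum):
    PREFILL = "prefill"   # prompt processing: compute-bound
    DECODE = "decode"     # token generation: memory-bandwidth-bound

@dataclass
class Pool:
    name: str
    endpoints: list[str]

# Two pools carved out of the same cluster; only the serving-layer
# config changes, never the hardware.
POOLS = {
    Phase.PREFILL: Pool("prompt-processing", [f"gpu-{i}:8000" for i in range(8)]),
    Phase.DECODE: Pool("token-generation", [f"gpu-{i}:8000" for i in range(8, 32)]),
}

# Round-robin inside each pool keeps the sketch simple; a real router
# would weigh queue depth or KV-cache pressure instead.
_cursors = {phase: itertools.cycle(pool.endpoints) for phase, pool in POOLS.items()}

def route(phase: Phase) -> str:
    """Return the endpoint that should serve a request in this phase."""
    return next(_cursors[phase])

if __name__ == "__main__":
    print(route(Phase.PREFILL))  # gpu-0:8000
    print(route(Phase.DECODE))   # gpu-8:8000
```

In practice this split usually lives inside the serving framework (disaggregated prefill/decode, with the decode pool picking up the prefill pool's KV cache); the round-robin here just stands in for whatever load-balancing policy the cluster already uses.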
The token-generation pool was the bigger surprise. By batching hundreds of concurrent decode requests together, the memory reads got amortized across more work. Bandwidth utilization climbed above 70%, far better than the 30% we'd been seeing when decode requests were interleaved with prefill on the same GPU. Overall compute efficiency roughly doubled.
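The amortization effect is easy to see with a back-of-the-envelope model. In the sketch below, each decode step streams the model weights from HBM once and emits one token per request in the batch, so the fixed cost of the weight read is spread across the whole batch. The weight size and bandwidth figures are assumptions, and KV-cache traffic (which grows with batch size and eventually caps the gains) is deliberately left out.

```python
# Back-of-the-envelope model of decode batching. Weight size and
# bandwidth are illustrative assumptions, not measurements.
WEIGHT_BYTES = 14e9     # e.g. a ~7B-parameter model in fp16 (assumed)
HBM_BANDWIDTH = 2.0e12  # bytes/s the GPU can stream from HBM (assumed)

def decode_tokens_per_second(batch_size: int) -> float:
    # One decode step reads the full weights once and produces one
    # token per request in the batch, so the read is amortized
    # batch_size ways. KV-cache reads are ignored for simplicity.
    step_seconds = WEIGHT_BYTES / HBM_BANDWIDTH
    return batch_size / step_seconds

for b in (1, 32, 256):
    print(f"batch={b:>3}: ~{decode_tokens_per_second(b):,.0f} tokens/s")
```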
The cost math followed. The customer was spending about $2M annually on inference GPU-hours. After disaggregation they were on track to cut that by $600–800K while serving the same request volume at the same latency targets. No new hardware purchased. Same GPUs, same cluster, same model weights, different architecture.
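Spelled out, those figures work out to a 30–40% reduction in GPU-hour spend:

```python
# The savings arithmetic from the paragraph above, spelled out.
annual_gpu_spend = 2_000_000          # ~$2M/year on inference GPU-hours
savings_low, savings_high = 600_000, 800_000

print(f"reduction: {savings_low / annual_gpu_spend:.0%}"
      f"-{savings_high / annual_gpu_spend:.0%}")  # -> reduction: 30%-40%
```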
