What changed after we split the pools
We ran a two-week proof of concept. I split the cluster into two pools: eight GPUs dedicated to prompt processing and the remaining GPUs handling token generation. No new hardware, no new cluster, just a configuration change in the serving layer and a routing policy that sent each request to the right pool based on its inference phase. The prompt-processing pool hit 90–95% compute utilization consistently because that's all it did. No token generation competing for scheduling slots. No decode requests sitting idle while a prefill burst hogged the cores.
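For concreteness, here's a minimal sketch of what a phase-based routing policy can look like. Every name, pool size, and endpoint below is an illustrative assumption, not the customer's actual serving config.

```python
# Hypothetical phase-based router: pool names, sizes, and endpoints
# are illustrative assumptions, not the real deployment.
import itertools
from dataclasses import dataclass
from enum import Enum

class Phase(Enum):
    PREFILL = "prefill"   # prompt processing: compute-bound
    DECODE = "decode"     # token generation: memory-bandwidth-bound

@dataclass
class Pool:
    name: str
    endpoints: list[str]

# Two pools carved out of the same cluster; only the serving-layer
# config changes, never the hardware.
POOLS = {
    Phase.PREFILL: Pool("prompt-processing", [f"gpu-{i}:8000" for i in range(8)]),
    Phase.DECODE: Pool("token-generation", [f"gpu-{i}:8000" for i in range(8, 32)]),
}

# Round-robin inside each pool keeps the sketch simple; a real router
# would weigh queue depth or KV-cache pressure instead.
_cursors = {phase: itertools.cycle(pool.endpoints) for phase, pool in POOLS.items()}

def route(phase: Phase) -> str:
    """Return the endpoint that should serve a request in this phase."""
    return next(_cursors[phase])

if __name__ == "__main__":
    print(route(Phase.PREFILL))  # gpu-0:8000
    print(route(Phase.DECODE))   # gpu-8:8000
```

In practice this split usually lives inside the serving framework (disaggregated prefill/decode, with the decode pool picking up the prefill pool's KV cache); the round-robin here just stands in for whatever load-balancing policy the cluster already uses.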
The token-generation pool was the bigger surprise. By batching hundreds of concurrent decode requests together, the memory reads got amortized across more work. Bandwidth utilization climbed above 70%, far better than the 30% we'd been seeing when decode requests were interleaved with prefill on the same GPU. Overall compute efficiency roughly doubled.
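The amortization effect is easy to see with a back-of-the-envelope model. In the sketch below, each decode step streams the model weights from HBM once and emits one token per request in the batch, so the fixed cost of the weight read is spread across the whole batch. The weight size and bandwidth figures are assumptions, and KV-cache traffic (which grows with batch size and eventually caps the gains) is deliberately left out.

```python
# Back-of-the-envelope model of decode batching. Weight size and
# bandwidth are illustrative assumptions, not measurements.
WEIGHT_BYTES = 14e9     # e.g. a ~7B-parameter model in fp16 (assumed)
HBM_BANDWIDTH = 2.0e12  # bytes/s the GPU can stream from HBM (assumed)

def decode_tokens_per_second(batch_size: int) -> float:
    # One decode step reads the full weights once and produces one
    # token per request in the batch, so the read is amortized
    # batch_size ways. KV-cache reads are ignored for simplicity.
    step_seconds = WEIGHT_BYTES / HBM_BANDWIDTH
    return batch_size / step_seconds

for b in (1, 32, 256):
    print(f"batch={b:>3}: ~{decode_tokens_per_second(b):,.0f} tokens/s")
```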
The cost math followed. The customer was spending about $2M annually on inference GPU-hours. After disaggregation they were on track to cut that by $600–800K while serving the same request volume at the same latency targets. No new hardware purchased. Same GPUs, same cluster, same model weights, different architecture.
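Spelled out, those figures work out to a 30–40% reduction in GPU-hour spend:

```python
# The savings arithmetic from the paragraph above, spelled out.
annual_gpu_spend = 2_000_000          # ~$2M/year on inference GPU-hours
savings_low, savings_high = 600_000, 800_000

print(f"reduction: {savings_low / annual_gpu_spend:.0%}"
      f"-{savings_high / annual_gpu_spend:.0%}")  # -> reduction: 30%-40%
```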
