GPU communication overhead is a measurable bottleneck in manufacturing AI workloads. Based on knowledge cited by the mKernel venture, communication can devour 43.6% of the ahead go and 32% of end-to-end coaching time. Throughout standard Combination-of-Consultants (MoE) fashions, inter-device communication can account for as much as 47% of complete execution time. Researchers from UC Berkeley’s UCCL venture have launched mKernel, a library of persistent CUDA kernels that fuse intra-node NVLink communication, inter-node RDMA, and compute right into a single kernel.
The Downside: Host-Pushed Communication
The usual mannequin for multi-GPU communication is host-driven: the CPU runs the management path and calls right into a library like NCCL or NVSHMEM. The library points the collective operation — an AllReduce, an AllGather, and so forth. — throughout GPUs. Compute and communication run on separate CUDA streams and overlap at kernel boundaries.
The analysis crew identifies two issues with this method:
(1) CPUs are usually not scaling with GPU compute. A GB300 NVL72 rack integrates 72 Blackwell Extremely GPUs and 36 Grace CPUs, delivering 720 PFLOP/s FP8/FP6, 1.44 EFLOP/s FP4 Tensor Core efficiency, and 130 TB/s of all-to-all intra-rack NVLink bandwidth. At these speeds, microsecond-scale host orchestration overhead — a cudaLaunchKernel name, a CPU-side “all writes accomplished” verify, an inter-stream occasion — reveals up straight as pipeline bubbles.
(2) Host-driven methods overlap compute and communication at coarse kernel boundaries. Finer-grained overlap on the tile or chunk stage is just not potential from the host facet.
The choice is GPU-driven communication: the GPU itself triggers transfers, with communication fused into the identical kernel because the compute. Most present fused kernel libraries function inside a single node, or a single GPU. mKernel targets the multi-node case.
What mKernel Does
mKernel is a library of persistent CUDA kernels. Every kernel fuses intra-node NVLink communication, inter-node RDMA, and dense compute right into a single kernel.
Multi-GPU + multi-node, in a single kernel: Each intra-node NVLink and inter-node RDMA reside inside the identical persistent kernel.
Nice-grained intra-kernel overlap: Compute and communication overlap at tile/chunk granularity, protecting each intra-node and inter-node GPU communication.
Persistent kernel with SM specialization: CTAs self-assign roles: compute, intra-comm, inter-send, inter-reduce. The variety of SMs devoted to every position is tunable per form.
GPU-driven networking constructed on libibverbs: mKernel makes use of GPU-initiated RDMA writes with out relying on NCCL or NVSHMEM. The communication backend is written from scratch to maximise efficiency and assist heterogeneous networking units.
The 5 Fused Kernels
| Kernel | What it fuses | Description |
|---|---|---|
| AllGather + GEMM | AllGather → GEMM | Every rank holds a shard of A. Whereas ranks collect friends’ shards over NVLink/RDMA, the native GEMM consumes tiles as quickly as they arrive. |
| GEMM + AllReduce | GEMM → AllReduce | Computes C = A @ B and reduces partial outputs throughout all ranks in a single launch. Output tiles are pushed into the discount tree the moment they’re produced. |
| MoE Dispatch + GEMM | All-to-All dispatch → grouped GEMM | Routes MoE tokens to their skilled ranks (intra-node NVLink + inter-node all-to-all) and runs the per-expert grouped GEMM in the identical kernel. Tokens are processed as quickly as they land — no staging buffer round-trip. |
| Ring Consideration | Ring KV alternate → FlashAttention | Sequence-parallel consideration throughout ranks. Every step rotates a KV chunk across the ring whereas the native FlashAttention consumes the previously-received chunk. Compute and the ring ship/recv run concurrently inside a single persistent kernel. |
| GEMM + ReduceScatter | GEMM → ReduceScatter | Computes C = A @ B and reduce-scatters the output. Every output tile is diminished and forwarded to its proudly owning rank as quickly as it’s produced. |
Analysis Setup
The analysis crew evaluated mKernel on two 2-node × 8-H200 clusters that differ solely of their inter-node cloth:
| Testbed | Nodes × GPUs | Intra-node | Inter-node transport | NIC |
|---|---|---|---|---|
| AWS EFA | 2 × 8 H200 | NVLink | AWS EFA / SRD | 16 × 200 Gb/s EFA per node |
| ConnectX-7 | 2 × 8 H200 | NVLink | InfiniBand | 8 × 400 Gb/s NVIDIA ConnectX-7 per node |
mKernel was benchmarked in opposition to NCCL, Triton-distributed, Flux, Mercury, MagiAttention, Transformer-Engine, and ring-flash-attention. The crew notes that additional benchmarking at bigger scale continues to be in progress.
Backends and Necessities
mKernel helps two networking backends:
| Backend | Macro | Transport | The place it runs |
|---|---|---|---|
| CX7 | -DINTERNODE_BACKEND_IBVERBS |
libibverbs RC | ConnectX-7 / InfiniBand / RoCE |
| EFA | -DINTERNODE_BACKEND_EFA |
libibverbs + efadv (SRD) | AWS p5/p5e (H200, EFA) |
Each backends share the identical host-side API and the identical on-GPU kernel. Solely the proxy/session implementation differs (session.h for CX7, session_efa.h for EFA). Necessities: NVIDIA Hopper GPUs (default construct targets sm_90a), CUDA 12.9, Python with PyTorch. The CX7 backend requires libibverbs improvement headers and libraries. The EFA backend requires AWS EFA set up with libfabric, libibverbs, efadv, and EFA headers below EFA_HOME=/decide/amazon/efa by default.
Marktechpost’s Visible Explainer
Key Takeaways
- mKernel fuses intra-node NVLink, inter-node RDMA, and compute right into a single persistent CUDA kernel.
- Communication overhead accounts for as much as 47% of execution time in MoE fashions per cited manufacturing knowledge.
- 5 kernels are included: AllGather+GEMM, GEMM+AllReduce, MoE Dispatch+GEMM, Ring Consideration, and GEMM+ReduceScatter.
- GPU-initiated RDMA is carried out straight by way of
libibverbs— no NCCL or NVSHMEM dependency. - At present requires Hopper GPUs (
sm_90a) and ConnectX-7 or AWS EFA networking; Blackwell assist is on the roadmap.
Try the Repo and Technical Particulars. Additionally, be happy to observe us on Twitter and don’t neglect to affix our 150k+ ML SubReddit and Subscribe to our Publication. Wait! are you on telegram? now you possibly can be part of us on telegram as effectively.
Have to accomplice with us for selling your GitHub Repo OR Hugging Face Web page OR Product Launch OR Webinar and so forth.? Join with us
