Tuesday, June 9, 2026

Meet mKernel: A Multi-GPU, Multi-Node Fused Kernel Library for GPU-Pushed Communication


GPU communication overhead is a measurable bottleneck in manufacturing AI workloads. Based on knowledge cited by the mKernel venture, communication can devour 43.6% of the ahead go and 32% of end-to-end coaching time. Throughout standard Combination-of-Consultants (MoE) fashions, inter-device communication can account for as much as 47% of complete execution time. Researchers from UC Berkeley’s UCCL venture have launched mKernel, a library of persistent CUDA kernels that fuse intra-node NVLink communication, inter-node RDMA, and compute right into a single kernel.

The Downside: Host-Pushed Communication

The usual mannequin for multi-GPU communication is host-driven: the CPU runs the management path and calls right into a library like NCCL or NVSHMEM. The library points the collective operation — an AllReduce, an AllGather, and so forth. — throughout GPUs. Compute and communication run on separate CUDA streams and overlap at kernel boundaries.

The analysis crew identifies two issues with this method:

(1) CPUs are usually not scaling with GPU compute. A GB300 NVL72 rack integrates 72 Blackwell Extremely GPUs and 36 Grace CPUs, delivering 720 PFLOP/s FP8/FP6, 1.44 EFLOP/s FP4 Tensor Core efficiency, and 130 TB/s of all-to-all intra-rack NVLink bandwidth. At these speeds, microsecond-scale host orchestration overhead — a cudaLaunchKernel name, a CPU-side “all writes accomplished” verify, an inter-stream occasion — reveals up straight as pipeline bubbles.

(2) Host-driven methods overlap compute and communication at coarse kernel boundaries. Finer-grained overlap on the tile or chunk stage is just not potential from the host facet.

The choice is GPU-driven communication: the GPU itself triggers transfers, with communication fused into the identical kernel because the compute. Most present fused kernel libraries function inside a single node, or a single GPU. mKernel targets the multi-node case.

What mKernel Does

mKernel is a library of persistent CUDA kernels. Every kernel fuses intra-node NVLink communication, inter-node RDMA, and dense compute right into a single kernel.

Multi-GPU + multi-node, in a single kernel: Each intra-node NVLink and inter-node RDMA reside inside the identical persistent kernel.

Nice-grained intra-kernel overlap: Compute and communication overlap at tile/chunk granularity, protecting each intra-node and inter-node GPU communication.

Persistent kernel with SM specialization: CTAs self-assign roles: compute, intra-comm, inter-send, inter-reduce. The variety of SMs devoted to every position is tunable per form.

GPU-driven networking constructed on libibverbs: mKernel makes use of GPU-initiated RDMA writes with out relying on NCCL or NVSHMEM. The communication backend is written from scratch to maximise efficiency and assist heterogeneous networking units.

The 5 Fused Kernels

Kernel What it fuses Description
AllGather + GEMM AllGather → GEMM Every rank holds a shard of A. Whereas ranks collect friends’ shards over NVLink/RDMA, the native GEMM consumes tiles as quickly as they arrive.
GEMM + AllReduce GEMM → AllReduce Computes C = A @ B and reduces partial outputs throughout all ranks in a single launch. Output tiles are pushed into the discount tree the moment they’re produced.
MoE Dispatch + GEMM All-to-All dispatch → grouped GEMM Routes MoE tokens to their skilled ranks (intra-node NVLink + inter-node all-to-all) and runs the per-expert grouped GEMM in the identical kernel. Tokens are processed as quickly as they land — no staging buffer round-trip.
Ring Consideration Ring KV alternate → FlashAttention Sequence-parallel consideration throughout ranks. Every step rotates a KV chunk across the ring whereas the native FlashAttention consumes the previously-received chunk. Compute and the ring ship/recv run concurrently inside a single persistent kernel.
GEMM + ReduceScatter GEMM → ReduceScatter Computes C = A @ B and reduce-scatters the output. Every output tile is diminished and forwarded to its proudly owning rank as quickly as it’s produced.

Analysis Setup

The analysis crew evaluated mKernel on two 2-node × 8-H200 clusters that differ solely of their inter-node cloth:

Testbed Nodes × GPUs Intra-node Inter-node transport NIC
AWS EFA 2 × 8 H200 NVLink AWS EFA / SRD 16 × 200 Gb/s EFA per node
ConnectX-7 2 × 8 H200 NVLink InfiniBand 8 × 400 Gb/s NVIDIA ConnectX-7 per node

mKernel was benchmarked in opposition to NCCL, Triton-distributed, Flux, Mercury, MagiAttention, Transformer-Engine, and ring-flash-attention. The crew notes that additional benchmarking at bigger scale continues to be in progress.

Backends and Necessities

mKernel helps two networking backends:

Backend Macro Transport The place it runs
CX7 -DINTERNODE_BACKEND_IBVERBS libibverbs RC ConnectX-7 / InfiniBand / RoCE
EFA -DINTERNODE_BACKEND_EFA libibverbs + efadv (SRD) AWS p5/p5e (H200, EFA)

Each backends share the identical host-side API and the identical on-GPU kernel. Solely the proxy/session implementation differs (session.h for CX7, session_efa.h for EFA). Necessities: NVIDIA Hopper GPUs (default construct targets sm_90a), CUDA 12.9, Python with PyTorch. The CX7 backend requires libibverbs improvement headers and libraries. The EFA backend requires AWS EFA set up with libfabric, libibverbs, efadv, and EFA headers below EFA_HOME=/decide/amazon/efa by default.

Marktechpost’s Visible Explainer

01 / 07 — Overview

What’s mKernel?

mKernel is an open-source library of persistent CUDA kernels from UC Berkeley’s UCCL venture. It fuses intra-node NVLink communication, inter-node RDMA, and dense compute right into a single kernel.

Most present fused kernel libraries function inside a single node or a single GPU. mKernel is designed from the begin to span node boundaries.

43.6%

of ahead go consumed by communication in manufacturing

47%

of complete execution time in standard MoE fashions

32%

of end-to-end coaching time consumed by communication

02 / 07 — The Downside

Why Host-Pushed Communication Falls Quick

The usual mannequin is host-driven: the CPU calls NCCL or NVSHMEM, which points collective operations throughout GPUs. The UCCL crew identifies two issues.

CPUs are usually not scaling with GPUs. A GB300 NVL72 rack delivers 720 PFLOP/s FP8/FP6 and 1.44 EFLOP/s FP4. At these speeds, microsecond-scale overhead from cudaLaunchKernel, CPU-side sync checks, and inter-stream occasions reveals up straight as pipeline bubbles.

🔲

Overlap is simply too coarse. Host-driven methods overlap compute and communication solely at kernel boundaries. Finer-grained overlap on the tile or chunk stage is just not potential from the host facet.

🔀

The reply: GPU-driven communication. The GPU itself triggers fine-grained transfers, fused into the identical kernel because the compute.

03 / 07 — Design

4 Core Design Properties

🖧

Multi-GPU + multi-node, in a single kernel. Intra-node NVLink and inter-node RDMA each reside inside the identical persistent kernel.

🔬

Nice-grained intra-kernel overlap. Compute and communication overlap at tile/chunk granularity, protecting each intra-node and inter-node communication.

⚙️

Persistent kernel with SM specialization. CTAs self-assign roles: compute, intra-comm, inter-send, inter-reduce. SM break up is tunable per form.

📡

GPU-driven networking by way of libibverbs. Makes use of GPU-initiated RDMA writes. No NCCL or NVSHMEM dependency. Communication backend is written from scratch.

04 / 07 — Kernels

The 5 Fused Kernels

AllGather + GEMM

AllGather —> GEMM

Every rank holds a shard of A. The native GEMM consumes tiles over NVLink/RDMA as they arrive — matmul begins earlier than the collective finishes.

GEMM + AllReduce

GEMM —> AllReduce

Computes C = A @ B and reduces partial outputs throughout all ranks in a single launch. Output tiles enter the discount tree the moment they’re produced.

MoE Dispatch + GEMM

All-to-All dispatch —> grouped GEMM

Routes MoE tokens to skilled ranks by way of NVLink + inter-node all-to-all, then runs per-expert grouped GEMM in the identical kernel. No staging buffer round-trip.

Ring Consideration

Ring KV alternate —> FlashAttention

Sequence-parallel consideration throughout ranks. Every step rotates a KV chunk across the ring whereas the native FlashAttention consumes the previously-received chunk.

GEMM + ReduceScatter

GEMM —> ReduceScatter

Computes C = A @ B and reduce-scatters the output. Every tile is diminished and forwarded to its proudly owning rank as quickly as it’s produced.

05 / 07 — Analysis

Analysis Setup

Examined on two 2-node × 8-H200 clusters differing solely in inter-node cloth.

Testbed Nodes × GPUs Inter-node NIC
AWS EFA 2 × 8 H200 AWS EFA / SRD 16 × 200 Gb/s EFA per node
ConnectX-7 2 × 8 H200 InfiniBand 8 × 400 Gb/s CX7 per node

Each testbeds use NVLink intra-node. Benchmarked in opposition to: NCCL, Triton-distributed, Flux, Mercury, MagiAttention, Transformer-Engine, and ring-flash-attention. Bigger-scale benchmarking continues to be in progress.

06 / 07 — Backends & Necessities

Backends & Necessities

Backend Transport The place it runs
CX7 libibverbs RC ConnectX-7 / InfiniBand / RoCE
EFA libibverbs + efadv (SRD) AWS p5/p5e (H200, EFA)

📋

Necessities: NVIDIA Hopper GPUs (default sm_90a), CUDA 12.9, Python with PyTorch. CX7 wants libibverbs headers. EFA wants libfabric, libibverbs, efadv below EFA_HOME=/decide/amazon/efa.

📝

License & Attribution: MIT licensed. MMA/compute code tailored from ThunderKittens (HazyResearch).

07 / 07 — Roadmap & Key Takeaways

Roadmap & Key Takeaways

Fused GPU-driven multi-node kernels (AG+GEMM, GEMM+AR, MoE Dispatch+GEMM, Ring Consideration, GEMM+RS)

ConnectX-7 and AWS EFA backends

🚧

Full heterogeneous accelerator/NIC assist with topology-aware discovery, placement, routing

🚧

Inter-node megakernels: collapsing a number of fused steps right into a single megakernel spanning a transformer layer

🚧

Blackwell GPU assist

Fuses NVLink, inter-node RDMA, and compute right into a single persistent CUDA kernel

5 kernels: AllGather+GEMM, GEMM+AllReduce, MoE Dispatch+GEMM, Ring Consideration, GEMM+ReduceScatter

GPU-initiated RDMA by way of libibverbs — no NCCL or NVSHMEM dependency

Requires Hopper GPUs (sm_90a) and ConnectX-7 or AWS EFA networking

Key Takeaways

  • mKernel fuses intra-node NVLink, inter-node RDMA, and compute right into a single persistent CUDA kernel.
  • Communication overhead accounts for as much as 47% of execution time in MoE fashions per cited manufacturing knowledge.
  • 5 kernels are included: AllGather+GEMM, GEMM+AllReduce, MoE Dispatch+GEMM, Ring Consideration, and GEMM+ReduceScatter.
  • GPU-initiated RDMA is carried out straight by way of libibverbs — no NCCL or NVSHMEM dependency.
  • At present requires Hopper GPUs (sm_90a) and ConnectX-7 or AWS EFA networking; Blackwell assist is on the roadmap.

Try the Repo and Technical ParticularsAdditionally, be happy to observe us on Twitter and don’t neglect to affix our 150k+ ML SubReddit and Subscribe to our Publication. Wait! are you on telegram? now you possibly can be part of us on telegram as effectively.

Have to accomplice with us for selling your GitHub Repo OR Hugging Face Web page OR Product Launch OR Webinar and so forth.? Join with us


Related Articles

Latest Articles