Saturday, July 4, 2026

Residual Context Diffusion Language Models


Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to purely autoregressive language models because they can decode multiple tokens in parallel. However, state-of-the-art block-wise dLLMs rely on a “remasking” mechanism that decodes only the most confident tokens and discards the rest, effectively wasting computation. We demonstrate that recycling computation from the discarded tokens is beneficial, as these tokens retain contextual information useful for subsequent decoding iterations. In light of this, we propose Residual Context Diffusion (RCD), a module that converts these discarded token representations into contextual residuals and injects them back for the next denoising step. RCD uses a decoupled two-stage training pipeline to bypass the memory bottlenecks associated with backpropagation. We validate our method on both long CoT reasoning (SDAR) and short CoT instruction following (LLaDA) models. We demonstrate that a standard dLLM can be efficiently converted to the RCD paradigm with merely ∼1 billion tokens. RCD consistently improves frontier dLLMs by 5–10 points in accuracy with minimal extra computation overhead across a wide range of benchmarks. Notably, on the most challenging AIME tasks, RCD nearly doubles baseline accuracy and attains up to 4–5x fewer denoising steps at equivalent accuracy levels.
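The remasking step and the idea of recycling discarded predictions can be illustrated with a toy sketch. The abstract does not say how RCD computes or injects its residuals, so everything here is an assumption made for illustration: the `MASK` sentinel, the helper names `remask_step` and `inject`, the choice of a probability-weighted average of token embeddings as the residual, and the scaling factor `alpha`. It is not the authors' implementation.

```python
import math

MASK = -1  # hypothetical sentinel marking a position that is still masked


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def remask_step(tokens, logits, embeddings, k):
    """One toy denoising step.

    Commits the k most confident masked positions. The remaining masked
    positions stay masked, but their predicted distributions are kept as
    residual vectors rather than thrown away.
    """
    probs = {i: softmax(logits[i]) for i, t in enumerate(tokens) if t == MASK}
    # Rank masked positions by the probability of their top prediction.
    ranked = sorted(probs, key=lambda i: max(probs[i]), reverse=True)

    new_tokens = list(tokens)
    for i in ranked[:k]:
        new_tokens[i] = probs[i].index(max(probs[i]))

    # Assumption: residual = probability-weighted average of token embeddings.
    dim = len(embeddings[0])
    residuals = {
        i: [sum(p * embeddings[v][d] for v, p in enumerate(probs[i]))
            for d in range(dim)]
        for i in ranked[k:]
    }
    return new_tokens, residuals


def inject(mask_embedding, residual, alpha=0.5):
    """Add a scaled residual to the mask embedding (alpha is hypothetical)."""
    return [m + alpha * r for m, r in zip(mask_embedding, residual)]
```

In this sketch, a caller would run `remask_step` once per denoising iteration and pass `inject(mask_embedding, residuals[i])` as the input embedding for each position `i` that is still masked, so that the next iteration sees the earlier predictions instead of a bare mask token.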
