We suggest that reinforcement learning (RL) from partial expert demonstrations is not merely a training heuristic, but a promising framework for solving complex sequence generation tasks. Supervised fine-tuning (SFT) relies on dense ground-truth labels, which become increasingly costly as sequence length grows. RL, on the other hand, struggles with sparse rewards and a combinatorially large output space. We address this by introducing adaptive backtracking (AdaBack), a per-sample curriculum learning algorithm that reveals only a partial prefix of the target output during training. The supervision length is adjusted dynamically for each sample based on the model's past reward signal, allowing it to incrementally learn to complete reasoning chains by conditioning on correct partial solutions. We study this intermediate regime between SFT and RL and argue that per-sample curriculum learning is more than a trade-off between efficiency and generality; it can succeed on tasks with long sequences of latent dependencies where both SFT and RL fail to generalize. Using a synthetic task with latent parity constraints, we show that our adaptive curriculum over partial answers reliably solves problems that are otherwise intractable. On mathematical reasoning benchmarks (MATH, GSM8k), we find that curriculum learning enables models to solve problems that RL alone cannot, acquiring new reasoning capabilities through incremental exposure to partial solutions.
- † École Polytechnique Fédérale de Lausanne (EPFL)
- * Equal supervision
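
For concreteness, the following is a minimal sketch of the per-sample supervision-length update described in the abstract; the interfaces (`model.generate`, `reward_fn`) and the fixed adjustment step are illustrative assumptions, not the paper's actual implementation.

```python
def adaback_step(model, prompt, target, ratio, reward_fn, step=0.1):
    """One illustrative AdaBack update for a single training sample.

    `ratio` is the fraction of the ground-truth target revealed as a prefix;
    it is shrunk when the model succeeds and grown when it fails, so each
    sample follows its own curriculum. All names here are hypothetical.
    """
    # Reveal a partial prefix of the target output as supervision.
    k = int(ratio * len(target))
    prefix = target[:k]

    # The model completes the remainder, conditioned on the prompt
    # plus the revealed prefix.
    completion = model.generate(prompt + prefix)

    # Sparse reward: e.g. 1 if the full sequence is correct, else 0.
    reward = reward_fn(prompt, prefix + completion)

    # Per-sample curriculum: on success, reveal less next time (harder);
    # on failure, reveal more (easier).
    if reward > 0:
        ratio = max(0.0, ratio - step)
    else:
        ratio = min(1.0, ratio + step)

    return reward, ratio
```

In this sketch, driving `ratio` toward zero corresponds to the pure-RL regime (no revealed prefix), while `ratio = 1` recovers full SFT-style supervision, matching the intermediate regime the abstract describes.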
