Wednesday, October 29, 2025

Learning from Failure to Tackle Extremely Hard Problems – Machine Learning Blog | ML@CMU


This blog post is based on the work BaNEL: Exploration Posteriors for Generative Modeling Using Only Negative Rewards.

Tackling Very Hard Problems

The ultimate goal of machine learning research is to push machines beyond human limits in important applications, including the next generation of theorem proving, algorithmic problem solving, and drug discovery. A standard recipe involves: (1) pre-training models on existing data to obtain base models, and then (2) post-training them using scalar reward signals that measure the quality or correctness of the generated samples.

However, for the hardest instances of these problems, we encounter two challenges:

  1. Sparsity: the base generative model attains a near-zero reward signal. The probability of producing a positive-reward sample can be so low that the model may go through most of training without ever encountering a positive reward.
  2. Costly reward evaluation: calls to the reward oracle can be expensive or risky, requiring costly simulations, computations, or even physical experiments.
GPT-5 receives zero reward on this example query.

For example, when asked to design a cure for cancer, GPT-5 fails. If asked again, will it succeed? Probably not. How many attempts would it take? We expect the success probability to be nonzero (since GPT-5, being an autoregressive generative model, never assigns exactly zero probability to any finite sequence), but at best, it is vanishingly small. Worse still, evaluating the solution is expensive and risky, as it requires conducting actual clinical trials.

A more fundamental example of hard problem-solving is designing molecules with specific properties (e.g., high activity against a particular protein target), which also suffers from the aforementioned two issues: (1) a base generative model is unlikely to generate highly potent molecules against a specific protein target, and (2) the ground-truth verification of the potency requires actual wet-lab experiments.

These illustrate a broader issue: the hardest and most important problems are those with near-zero success rates and no positive examples available during learning. To handle these scenarios, we introduce BaNEL (Bayesian Negative Evidence Learning), an algorithm that post-trains the generative model using failed attempts only, while minimizing the number of reward evaluations (NREs).

Under such extreme reward sparsity, standard post-training methods like policy gradients (including GRPO) collapse into brute-force random search, since zero rewards produce zero gradients. Novelty-bonus methods, such as count-based exploration or random network distillation, can provide learning signals under sparsity, but they require large NREs and fall short in performance. The following table summarizes our analysis of these methods.
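
To see why policy gradients stall, recall the score-function estimator that these methods rely on,

$$\nabla_{\boldsymbol{\theta}} \, \mathbb{E}_{\mathbf{x} \sim p_{\boldsymbol{\theta}}}[r(\mathbf{x})] \approx \frac{1}{n} \sum_{i=1}^n r(\mathbf{x}_i) \, \nabla_{\boldsymbol{\theta}} \log p_{\boldsymbol{\theta}}(\mathbf{x}_i),$$

which is identically zero whenever every sampled reward \(r(\mathbf{x}_i)\) is zero: the parameters never move, no matter how many samples are drawn.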

Comparison of desired properties (performance and a low number of reward evaluations, NREs) for key categories of learning methods. An empty circle ○ means the property is not satisfied, a filled circle ● means satisfied, and a half-filled circle ◐ means partially satisfied (e.g., a method is functional, but the success rate does not increase much).

Learning from Negative Rewards

The zero-reward problem has historically been addressed using positive transfer from other tasks or domains, hand-designing curricula, and/or engineering more informative and dense reward functions. However, we argue that there will always be tasks and settings where the base model attains an extremely sparse reward. If we cannot address this fundamental obstacle, post-training will be limited to distribution sharpening rather than unlocking genuinely new capabilities beyond the training data.

To tackle the zero-reward problem, algorithms should be able to learn from failures alone, using only negative-reward samples, while minimizing the number of reward evaluations (NREs). There is a simple (if impractical) way to see that learning from negative samples alone is at least theoretically possible.

Don’t make the same mistake twice! If our budget for evaluating \(r\) were unlimited, and assuming the solution has bounded length, we could trivially achieve a perfect success rate by collecting every possible mistake \(R := \lbrace \mathbf{x} \mid r(\mathbf{x}) = 0 \rbrace\) and avoiding all elements of \(R\):

$$p_{\boldsymbol{\theta} \mid R^C}(\mathbf{x}) \propto p_{\boldsymbol{\theta}}(\mathbf{x}) \, \mathbf{1}[\mathbf{x} \notin R],$$

where \(p_{\boldsymbol{\theta}}\) is the pre-trained generative model (e.g., GPT-5), and \(p_{\boldsymbol{\theta} \mid R^C}(\mathbf{x})\) denotes the model conditioned on the complement of \(R\) by multiplying by the indicator function. In plain terms, this formulation says: once you have seen all possible failures, you will never make a new one.
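
As a quick illustration, here is a minimal sketch of this brute-force recipe; `sample_from_base_model` stands in for drawing from \(p_{\boldsymbol{\theta}}\) and `reward` for the oracle \(r\), both hypothetical callables.

```python
# Minimal sketch of the impractical "never repeat a mistake" baseline:
# memorize every observed failure and resample until something new passes.
# Samples are assumed hashable so past mistakes can be stored exactly.

def sample_avoiding_known_failures(sample_from_base_model, reward, max_attempts=10**6):
    failures = set()                  # the set R, grown one costly evaluation at a time
    for _ in range(max_attempts):
        x = sample_from_base_model()
        if x in failures:             # enforce p_theta(x) * 1[x not in R]
            continue                  # no reward call spent on a known mistake
        if reward(x) == 1:
            return x, failures
        failures.add(x)               # remember the new failure
    raise RuntimeError("attempt budget exhausted")
```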

Exploiting the structure underlying failures. Of course, this approach is infeasible because the space of failures is combinatorial, and we want to minimize NREs. But crucially, in most tasks where success requires intelligence, failures are not arbitrary. They contain patterns that distinguish the failed attempts from successes. If we can learn these patterns, we can approximate \(R\) using a small number of samples. This failure-based approach parallels how human scientists reason: they generalize from failures, avoiding past mistakes without discarding promising directions. To minimize NREs, the algorithm must extract as much information as possible from failures before making new attempts.

Minimizing NREs requires heavy computation to fully exploit past failures before costly new attempts (e.g., clinical trials).

Learning a Generative Model of Failures

Our core idea is to model the regularities underlying failures using a separate generative model \(p_\phi\) trained solely on failed attempts. Generative modeling is a powerful unsupervised way of learning structure from data, and it scales extremely well! Specifically, we train a separate generative model \(p_\phi\) (parameterized by \(\phi\)) on \(m\) negative examples with the standard maximum likelihood objective:

$$\max_{\boldsymbol{\phi}} \frac{1}{m} \sum_{i=1}^m \log p_{\boldsymbol{\phi}}(\mathbf{x}_i).$$
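
As a rough sketch (not the paper's exact training code), fitting \(p_\phi\) is ordinary negative log-likelihood minimization; here we assume a likelihood-based model object exposing a `log_prob` method and standard PyTorch training.

```python
import torch

# Minimal sketch of fitting the failure model p_phi by maximum likelihood.
# `negative_model` is assumed to be a likelihood-based generative model
# (e.g., an autoregressive transformer) whose log_prob(batch) returns
# per-example log-likelihoods; `failed_batches` holds reward-0 samples.

def fit_failure_model(negative_model, failed_batches, epochs=10, lr=1e-4):
    optimizer = torch.optim.Adam(negative_model.parameters(), lr=lr)
    for _ in range(epochs):
        for batch in failed_batches:
            loss = -negative_model.log_prob(batch).mean()  # average negative log-likelihood
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return negative_model
```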

Once well-trained, \(p_\phi(\mathbf{x})\) can be used to assess whether a given input resembles previously observed failures; specifically, we use \(p_\phi\) to define a rejection region \(\tilde{R}\) approximating \(R\):

$$\tilde{R} := \lbrace \mathbf{x} : \frac{p_{\boldsymbol{\theta}}(\mathbf{x})}{p_{\boldsymbol{\phi}}(\mathbf{x})} < \tau \rbrace,$$

where \(\tau\) is a threshold value. Note that this requires \(p_{\boldsymbol{\theta}}\) and \(p_\phi\) to be likelihood-based generative models under which we can compute likelihoods (e.g., autoregressive models). Using the rejection region \(\tilde{R}\), we form a Bayesian posterior \(\tilde{p}_{\boldsymbol{\theta}}\) to approximate \(p_{\boldsymbol{\theta} \mid R^C}\):

$$p_{\boldsymbol{\theta} \mid \tilde{R}^C}(\mathbf{x}) \propto p_{\boldsymbol{\theta}}(\mathbf{x}) \, \mathbf{1}[\mathbf{x} \notin \tilde{R}].$$

This posterior filters out data points that are similar to prior failures according to \(\tilde{R}\); equivalently, we direct the model to sample only from \(\tilde{R}^C\).
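
Sampling from this posterior can be done by simple rejection against the likelihood-ratio test; the sketch below assumes hypothetical `base_model` and `negative_model` objects exposing `sample()` and `log_prob()`.

```python
import math

# Minimal sketch of posterior sampling: draw from the base model and reject
# anything whose likelihood ratio p_theta(x) / p_phi(x) falls below tau,
# i.e., anything that resembles past failures.

def sample_posterior(base_model, negative_model, tau, num_samples):
    kept = []
    while len(kept) < num_samples:
        x = base_model.sample()
        log_ratio = base_model.log_prob(x) - negative_model.log_prob(x)
        if log_ratio >= math.log(tau):   # x lies outside the rejection region
            kept.append(x)
    return kept
```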

Online Recursive Update

Once we improve the generative model using the Bayesian update described above, we can use it to gather another batch of \(m\) samples. Here, rejection regions from earlier rounds can be accumulated by taking their union (i.e., \(\tilde{R} \gets \tilde{R} \cup \tilde{R}_{\text{new}}\), where \(\tilde{R}_{\text{new}}\) is the new rejection region). This can be repeated multiple times, as illustrated in the figure below. We call this method BaNEL: Bayesian Negative Evidence Learning, an approach that uses Bayesian updates to learn from negative samples only.

Illustration of BaNEL on a 1D toy example. The procedure begins with a pre-trained proposal distribution (topmost). Two reward-one samples (purple bars) are located at -2 and 2. At each iteration, the proposal distribution generates samples, which are very likely to be 0-reward. These are used to train a negative model (purple dashed curves). The proposal and negative models are combined to form the Bayesian posterior (black curves). As iterations progress, the posterior increasingly concentrates on the reward-one regions, until convergence (bottommost).
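
Putting the pieces together, here is a minimal sketch of the recursive loop under the same interface assumptions as above, with a hypothetical `train_negative_model` helper standing in for the maximum-likelihood fit.

```python
import math

# Minimal sketch of BaNEL's online recursive update. Each round samples from
# the current posterior, spends reward evaluations (NREs) only on samples that
# pass every accumulated rejection test, and adds a new test trained on the
# round's failures.

def banel(base_model, reward, train_negative_model, tau, num_rounds, batch_size):
    log_tau = math.log(tau)
    negative_models = []  # accumulated rejection region: union over rounds

    def rejected(x):
        log_p = base_model.log_prob(x)
        return any(log_p - neg.log_prob(x) < log_tau for neg in negative_models)

    for _ in range(num_rounds):
        batch = []
        while len(batch) < batch_size:
            x = base_model.sample()
            if not rejected(x):                          # keep only samples outside R~
                batch.append(x)
        failures = [x for x in batch if reward(x) == 0]  # costly oracle calls
        if failures:
            negative_models.append(train_negative_model(failures))
    return negative_models
```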

Experiment: Adversarial Attack on a Toy Language Model

We first evaluate BaNEL in a toy but informative setting where high-reward samples are rare and hand-engineering dense rewards is difficult. In this task, the goal is to attack the target model, an autoregressive transformer trained to answer digit-addition queries (e.g., it receives `10+23=` and must generate `33`). The goal of the attacker model, also an autoregressive transformer pre-trained on the same dataset to generate questions such as `10+23=`, is to propose syntactically valid addition queries on which the target model produces an incorrect sum.

That is, the reward is defined as follows (a code sketch of this check appears after the list):

  • \(r(\mathbf{x}) = 1\) if \(\mathbf{x}\) is a syntactically valid arithmetic expression and the target’s output is incorrect,
  • \(r(\mathbf{x}) = 0\) otherwise.
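
For concreteness, a minimal sketch of this reward might look like the following; the `target_model.answer` interface and the exact query format are assumptions made for illustration.

```python
import re

# Minimal sketch of the attack reward: 1 iff the query is a well-formed
# addition and the target model answers it incorrectly.

def attack_reward(query: str, target_model) -> int:
    match = re.fullmatch(r"(\d+)\+(\d+)=", query)
    if match is None:
        return 0                                    # not a syntactically valid addition query
    true_sum = int(match.group(1)) + int(match.group(2))
    predicted = target_model.answer(query)          # e.g., "33"
    return int(predicted.strip() != str(true_sum))  # 1 iff the target's sum is wrong
```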

Since the target is well-trained, the pre-trained attacker’s empirical success rate is roughly 0.0004. We set a hard limit on NREs: \(r\) can be evaluated at most 7,500 times. All reward-1 samples are filtered out during training, forcing the model to learn only from failures.

Best mean, median, standard deviation, and relative improvement over the Pretrained baseline of the empirical success rates on the adversarial attack task over 5 random seeds. Success rates are measured using 60,000 samples.

As shown in the table, BaNEL improves the success rate by 278x on average, outperforming the baselines by several orders of magnitude.

Successful attacks generated by BaNEL.

BaNEL identifies two failure modes of the target:

  1. Leading zeros: when at least one of the input operands begins with at least one zero, the output tends to be incorrect. This is likely because the training data (shared by both the target and the attacker) does not contain any examples with leading zeros.
  2. Carry-chain stressors: examples that need to carry a digit during summation.

Based on these identified patterns, we designed a rule-based attack and observed that it achieves a near-perfect success rate. This suggests that BaNEL can be used not only to increase a numeric success rate, but also to guide human intuition on hard problems and extract qualitative insights.
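
For illustration, a hand-written generator exploiting these two failure modes could look like the sketch below; the query format and digit ranges are assumptions, not the paper's implementation.

```python
import random

# Illustrative rule-based attacker built on the two failure modes above:
# leading-zero operands and queries that force long carry chains.

def rule_based_attack(num_digits=3):
    if random.random() < 0.5:
        # Leading-zero query, e.g., "007+42="
        a = "0" * random.randint(1, 2) + str(random.randint(0, 10**num_digits - 1))
        b = str(random.randint(0, 10**num_digits - 1))
    else:
        # Carry-chain query, e.g., "999+7=": every column produces a carry
        a = "9" * num_digits
        b = str(random.randint(1, 9))
    return f"{a}+{b}="
```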

Compute scaling for the adversarial attack scenario (leading zeros are not allowed): improvement factor in the success rate of BaNEL over the base model as a function of the number of epochs used to train \(p_\phi\) at each stage, averaged over 5 random seeds. The average success rates of the RND and count-based methods are shown as horizontal reference lines.

We also study compute scaling (here, we do not allow leading-zero attacks, making the problem even more challenging). When the negative generative model \(p_\phi\) is under-trained (few epochs), BaNEL performs on par with simpler novelty-bonus baselines (RND and pseudo-count methods). However, as we spend more compute on \(p_\phi\) (without additional NREs), BaNEL outperforms these methods by a large margin.

This highlights a key property: BaNEL trades compute for reward efficiency. It is suboptimal under strict compute limits but excels when more offline computation is available.

Experiment: Language Model Reasoning

We further evaluate BaNEL on reasoning tasks using GSM8K subsets, where the pre-trained Qwen 2.5 0.5B model (further fine-tuned on GSM8K using PPO) performs poorly. Again, all reward-1 samples are filtered out during training.

Cumulative best success rate of BaNEL and RND on GSM8K-Hard questions. The shaded area represents confidence intervals (Clopper-Pearson, \(\alpha=0.05\), sample_size=10000).

For most problems, BaNEL significantly improves success rates over the pre-trained baseline, outperforming RND with fewer reward evaluations.

Closing Remarks

By modeling failures with a generative model, BaNEL turns negative evidence into a learning signal, enabling exploration in settings where reward-1 samples are nearly nonexistent. We view BaNEL as an important direction for the generative modeling field: to truly push the frontier of generative model capabilities, we must learn from failures!

Check out our paper for more results and details!
