Richard Sutton’s “Bitter Lesson” is often read as a warning against building too much human knowledge into AI systems. In the long run, the methods that win are not the ones that encode our clever intuitions most directly, but the ones that scale: search, learning, and other general methods that can absorb more compute and data.
Modern foundation model pre-training looks, at first glance, like a triumph of that lesson. We take a general architecture, expose it to massive data, and train it with a simple self-supervised objective. Language models predict the next token. Vision models reconstruct masked patches, align views, or match teacher representations. The recipe is simple and scalable.
But there is a catch.
Pre-training may follow the Bitter Lesson in how it trains the models, but not in how it chooses what the model should be trained on. The objective is still chosen outside the training loop. We conduct a large pre-training run, evaluate downstream performance, adjust the recipe, and run again. The learner optimizes one self-supervised learning objective, but the downstream feedback arrives only after the whole training process. This is a very coarse control loop.
This paper asks whether that loop can be made tighter and more direct. Our question is: given an unlabeled data stream and a small set of verifiable downstream examples, can we use these examples during continued pre-training? The proposed answer is value-based pre-training with downstream feedback (V-pretraining). The key idea is to separate two roles that are usually collapsed. There is still a learner, the foundation model being pretrained. But there is also a lightweight task designer. The learner is updated only by a self-supervised loss on unlabeled data. The designer, however, learns how to construct the self-supervised task: which target distribution to use in language modeling, or which views and masks to use in self-supervised vision training.
Figure 1 makes this distinction clear. Standard continued pretraining fixes a construction rule before training begins: for text, the next-token target is a one-hot token; for vision, the crop, mask, or augmentation pipeline is fixed. V-pretraining replaces that fixed rule with a learnable designer, while keeping the learner’s update self-supervised.
This distinction matters. V-pretraining is not supervised fine-tuning, preference optimization, or reinforcement learning from feedback. In those methods, downstream labels, preferences, or rewards directly update the learner. In V-pretraining, downstream examples are used only to train the task designer. The learner never receives a downstream supervised gradient. The feedback path is indirect.
One-step estimation
The technical question is how the designer knows which self-supervised task is useful. Ideally, we would choose task constructions that lead to the best downstream model after a full continued-pretraining trajectory. But differentiating through an entire pretraining run is not practical. V-pretraining uses a local surrogate instead.
Suppose a candidate self-supervised task produces a pretraining gradient, \(g_{\rm pre}\). A small feedback batch produces a downstream gradient, \(g_{\rm down}\). If we took a small learner step of size \(\eta\) using \(g_{\rm pre}\), the downstream loss would change approximately as:
$$L_{\rm down}(\theta - \eta g_{\rm pre}) \approx L_{\rm down}(\theta) - \eta\, g_{\rm down}^{\prime} g_{\rm pre}$$
So the inner product \(g_{\rm down}^{\prime} g_{\rm pre}\) estimates whether this unlabeled self-supervised update is likely to reduce downstream loss: when it is positive, a small step along \(-g_{\rm pre}\) lowers the downstream loss to first order. V-pretraining trains the designer to construct tasks whose learner gradients align with the downstream gradient. After that, the constructed targets or views are detached, and the learner takes an ordinary self-supervised update.
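To make the one-step picture concrete, here is a minimal sketch on a toy linear-softmax learner. Everything in it is an assumption made for illustration, not the paper’s implementation: the single-example "batches", the designer parameters `phi` that spread a fixed `mass` over the whole toy vocabulary, the step sizes, and the hand-derived designer gradient. It takes one designer step that raises the alignment score, then one ordinary learner step with the target held constant, and compares the actual change in downstream loss with the first-order prediction.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def one_hot(i, n):
    v = np.zeros(n)
    v[i] = 1.0
    return v

def downstream_loss(W, x, y):
    # Cross-entropy of the linear-softmax learner on one feedback example.
    return -np.log(softmax(W @ x)[y])

rng = np.random.default_rng(0)
V, d = 6, 4                                  # toy vocabulary and feature sizes
W = 0.1 * rng.normal(size=(V, d))            # learner parameters (theta)
x_pre, y_pre = np.array([1.0, 0.5, -0.5, 2.0]), 2     # unlabeled-stream example
x_down, y_down = np.array([0.8, 0.4, -0.2, 1.5]), 3   # feedback example

phi = np.zeros(V)    # hypothetical designer parameters
mass = 0.3           # probability mass the designer may move off the true token
eta = 1e-3           # learner step size
designer_lr = 1.0    # designer step size

p_pre = softmax(W @ x_pre)
g_down = np.outer(softmax(W @ x_down) - one_hot(y_down, V), x_down)

def soft_target(phi):
    return (1.0 - mass) * one_hot(y_pre, V) + mass * softmax(phi)

def alignment(phi):
    # For cross-entropy against a soft target q, the learner gradient is (p - q) x^T.
    g_pre = np.outer(p_pre - soft_target(phi), x_pre)
    return float(np.sum(g_down * g_pre))

# Designer step: gradient ascent on the alignment score.
u = g_down @ x_pre                      # alignment = (p - q) . u
r = softmax(phi)
grad_phi = -mass * r * (u - r @ u)      # chain rule through q and the softmax
alignment_before = alignment(phi)
phi = phi + designer_lr * grad_phi
alignment_after = alignment(phi)

# Learner step: the target is now a constant (detached), so this is a plain
# self-supervised update that never sees the downstream label.
q = soft_target(phi)
g_pre = np.outer(p_pre - q, x_pre)
W_new = W - eta * g_pre

predicted_change = -eta * alignment_after
actual_change = (downstream_loss(W_new, x_down, y_down)
                 - downstream_loss(W, x_down, y_down))
print("alignment before/after designer step:", alignment_before, alignment_after)
print("predicted vs actual downstream change:", predicted_change, actual_change)
```

The downstream example enters only through `g_down`, which is used to update `phi`. The learner update itself uses nothing but the self-supervised gradient `g_pre`.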
Instantiations and main results
The concrete instantiations are simple and revealing. In language, we instantiate the task design with adaptive top-K soft target construction. Standard next-token prediction uses a one-hot target: the true next token gets probability one. V-pretraining keeps the same text stream and context, but allows the designer to place a bounded amount of probability mass over a small candidate set that always includes the true next token plus high-probability alternatives from the current learner. The learner still trains by cross-entropy on continued-pretraining text; the feedback examples only shape the target distribution through the designer.
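A minimal sketch of what such a target construction could look like is given below. The function name `build_soft_target`, the `designer_scores` layout (K candidate scores followed by one gate score), and the `max_mass` cap are hypothetical choices for illustration. The source only specifies that the candidate set contains the true token plus high-probability alternatives and that the moved mass is bounded.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def build_soft_target(learner_logits, true_token, designer_scores, k=4, max_mass=0.3):
    """Build a soft next-token target over the vocabulary.

    learner_logits:  (V,) next-token logits from the current learner.
    designer_scores: (k + 1,) hypothetical designer outputs; the first k entries
                     score the candidates, the last one gates how much mass moves.
    """
    vocab = learner_logits.shape[0]
    # Candidate set: the true token plus the k - 1 most likely other tokens.
    order = np.argsort(-learner_logits)
    alternatives = [int(t) for t in order if t != true_token][: k - 1]
    candidates = np.array([true_token] + alternatives)

    # Bounded mass: a sigmoid gate scaled into (0, max_mass).
    moved = max_mass / (1.0 + np.exp(-designer_scores[-1]))
    weights = softmax(designer_scores[: len(candidates)])

    target = np.zeros(vocab)
    target[true_token] = 1.0 - moved
    target[candidates] += moved * weights   # the true token may get some mass back
    return target, candidates

def cross_entropy(target, logits):
    # Learner loss: ordinary cross-entropy against the (detached) soft target.
    m = logits.max()
    log_probs = logits - (m + np.log(np.sum(np.exp(logits - m))))
    return float(-(target * log_probs).sum())

logits = np.array([2.0, 0.5, 1.5, -1.0, 0.0, 1.0])
target, cands = build_soft_target(logits, true_token=4, designer_scores=np.zeros(5))
loss = cross_entropy(target, logits)
```

With all designer scores at zero, half of `max_mass` is moved and spread evenly over the candidates, so the true token keeps most of the probability. Setting `max_mass=0` recovers the standard one-hot target.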
In the main language experiments, the learner is continued-pretrained on NuminaMath-CoT, while 1,024 GSM8K training examples are used only as feedback for the task designer. Under matched wall-clock training budgets, V-pretraining improves GSM8K Pass@1 across the tested Qwen models. The largest reported single-run gain is for Qwen2.5-0.5B, which improves from 22.20 to 29.60. In replicated Qwen1.5 runs, V-pretraining improves the 0.5B, 4B, and 7B models, with the 4B model moving from 56.48±1.56 to 58.98±1.03.

In vision, the same principle is applied to self-supervised view construction. The learner is a DINO-style visual backbone trained on unlabeled ImageNet images. The designer modifies instance-wise views or masks so that the resulting self-supervised gradient better aligns with downstream dense-prediction feedback from ADE20K segmentation and NYUv2 depth estimation. The backbone itself is still updated only by the self-supervised DINO loss.
The main vision results show the same pattern: the targeted downstream capabilities improve without obvious collapse of general representations. For DINOv3-ViT-L, ADE20K mIoU improves from 51.33 to 52.47, NYUv2 RMSE improves from 0.5752 to 0.5522, and ImageNet-1K linear accuracy improves from 84.07 to 84.59. The paper also reports transfer checks on image retrieval, where dense-task feedback improves most Oxford/Paris retrieval protocols, though not uniformly.
A natural concern is that this is just a shortcut. Maybe the continued-pretraining data contains benchmark duplicates. Maybe soft labels help regardless of feedback. Maybe the designer is secretly smuggling supervision into the learner.
The paper includes several controls against these explanations. After decontaminating NuminaMath-CoT by removing near-duplicates of GSM8K and MATH, V-pretraining still remains above the baseline, although with a smaller margin. Random feedback and uniform top-K smoothing perform worse than the baseline in the Qwen1.5-4B ablation, while self-distillation improves but does not match V-pretraining. These controls suggest that the gain is not just label smoothing, self-distillation, contamination, or extra stochasticity; the downstream-aligned value signal is doing work.
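One way to see why these controls are informative is to write them as fixed, feedback-free rules within the same family of targets. The sketch below is an assumption about how such baselines could be expressed, not the paper’s exact ablation code: the helper names, `k`, and `mass` are all hypothetical, and "self-distillation" is read here as weighting candidates by the learner’s own probabilities.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def candidate_set(logits, true_token, k):
    # True token plus the k - 1 most likely other tokens under the learner.
    order = np.argsort(-logits)
    others = [int(t) for t in order if t != true_token][: k - 1]
    return np.array([true_token] + others)

def mix_target(vocab, true_token, candidates, weights, mass):
    # Keep 1 - mass on the true token and spread `mass` over the candidates.
    target = np.zeros(vocab)
    target[true_token] = 1.0 - mass
    target[candidates] += mass * weights
    return target

def one_hot_target(logits, true_token):
    # Standard next-token prediction: nothing is moved.
    return mix_target(len(logits), true_token, np.array([true_token]), np.ones(1), 0.0)

def uniform_topk_target(logits, true_token, k=4, mass=0.3):
    # Label-smoothing-style control: equal weight on every candidate.
    cands = candidate_set(logits, true_token, k)
    return mix_target(len(logits), true_token, cands,
                      np.full(len(cands), 1.0 / len(cands)), mass)

def self_distill_target(logits, true_token, k=4, mass=0.3):
    # Self-distillation-style control: weights follow the learner's own beliefs.
    cands = candidate_set(logits, true_token, k)
    return mix_target(len(logits), true_token, cands, softmax(logits[cands]), mass)

logits = np.array([2.0, 0.5, 1.5, -1.0, 0.0, 1.0])
baseline = one_hot_target(logits, true_token=4)
smoothed = uniform_topk_target(logits, true_token=4)
distilled = self_distill_target(logits, true_token=4)
```

None of these rules consults downstream examples. In this framing, what separates V-pretraining from them is that the candidate weights come from a designer trained on the alignment signal.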
Does this extend the Bitter Lesson?
This brings us back to the Bitter Lesson. A shallow reading of the lesson might say: don’t inject downstream information into pre-training; just scale next-token prediction. But that is not quite the point. The lesson is not that feedback is bad. It is that hand-designed structure tends to lose to general methods that can learn from scalable signals.
Current pre-training is only partly “bitter.” The learner is trained by a scalable self-supervised objective, but the task recipe is still usually fixed by hand. We choose the data mixture, masking rule, augmentation pipeline, target format, and curriculum outside the training loop. Downstream feedback then arrives only after a run is evaluated.
V-pretraining makes one part of that recipe learnable. The learner still updates only on unlabeled self-supervised data, but a task designer uses downstream feedback to decide which self-supervised prediction problems are likely to be useful. In the paper’s terms, feedback changes the task construction rather than directly supervising the learner.
That is the more bitter version of pre-training: not just scaling a fixed proxy task, but learning which proxy tasks produce valuable updates. Pre-training should not only learn from data. It should learn what to predict.
For more details: Value-Based Pre-Training with Downstream Feedback
