Friday, April 17, 2026

Cram Less to Fit More: Training Data Pruning Improves Memorization of Facts


This paper was accepted at the Workshop on Navigating and Addressing Data Problems for Foundation Models at ICLR 2026.

Large language models (LLMs) can struggle to memorize factual knowledge in their parameters, often leading to hallucinations and poor performance on knowledge-intensive tasks. In this paper, we formalize fact memorization from an information-theoretic perspective and study how training data distributions affect fact accuracy. We show that fact accuracy is suboptimal (below the capacity limit) whenever the amount of information contained in the training data facts exceeds model capacity. This is further exacerbated when the fact frequency distribution is skewed (e.g. a power law). We propose data selection schemes, based on the training loss alone, that aim to limit the number of facts in the training data and flatten their frequency distribution. On semi-synthetic datasets containing high-entropy facts, our selection method effectively boosts fact accuracy to the capacity limit. When pretraining language models from scratch on an annotated Wikipedia corpus, our selection method enables a GPT2-Small model (110M parameters) to memorize 1.3X more entity facts compared to standard training, matching the performance of a 10X larger model (1.3B parameters) pretrained on the full dataset.
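To make the selection idea concrete, here is a minimal, hypothetical sketch of loss-based filtering. It is not the paper's exact algorithm; the band thresholds `low` and `high` and the function name are illustrative assumptions. The intuition it encodes: very low training loss suggests a fact repeated so often that further copies are redundant (dropping them flattens the frequency distribution), while very high loss suggests a fact too rare or high-entropy to fit within model capacity (dropping it limits the number of distinct facts).

```python
def select_by_loss(examples, losses, low=0.5, high=4.0):
    """Keep examples whose per-example training loss falls inside a band.

    examples : list of training examples (any type)
    losses   : per-example training losses from a reference model
    low/high : illustrative thresholds, not values from the paper

    Dropping loss < low discards redundant repeats of frequent facts;
    dropping loss > high discards facts unlikely to fit in capacity.
    """
    return [ex for ex, loss in zip(examples, losses) if low <= loss <= high]


# Toy usage: four examples whose losses span the band.
examples = ["fact_common", "fact_mid", "fact_rare", "fact_noise"]
losses = [0.1, 1.0, 3.0, 6.0]
kept = select_by_loss(examples, losses)  # -> ["fact_mid", "fact_rare"]
```

Because the filter reads only scalar losses, it needs no entity annotations at selection time, which is what makes a loss-only scheme attractive for raw pretraining corpora.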
