Friday, April 17, 2026

Cram Less to Fit More: Training Data Pruning Improves Memorization of Facts


This paper was accepted at the Workshop on Navigating and Addressing Data Problems for Foundation Models at ICLR 2026.

Large language models (LLMs) can struggle to memorize factual knowledge in their parameters, often leading to hallucinations and poor performance on knowledge-intensive tasks. In this paper, we formalize fact memorization from an information-theoretic perspective and study how training data distributions affect fact accuracy. We show that fact accuracy is suboptimal (below the capacity limit) whenever the amount of information contained in the training data facts exceeds model capacity. This is further exacerbated when the fact frequency distribution is skewed (e.g. a power law). We propose data selection schemes, based on the training loss alone, that aim to limit the number of facts in the training data and flatten their frequency distribution. On semi-synthetic datasets containing high-entropy facts, our selection method effectively boosts fact accuracy to the capacity limit. When pretraining language models from scratch on an annotated Wikipedia corpus, our selection method enables a GPT2-Small model (110M parameters) to memorize 1.3X more entity facts compared to standard training, matching the performance of a 10X larger model (1.3B parameters) pretrained on the full dataset.
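To make the selection idea concrete, here is a minimal, hypothetical sketch of loss-based filtering. It is not the paper's exact algorithm; the band thresholds `low` and `high` and the function name are illustrative assumptions. The intuition it encodes: very low training loss suggests a fact repeated so often that further copies are redundant (dropping them flattens the frequency distribution), while very high loss suggests a fact too rare or high-entropy to fit within model capacity (dropping it limits the number of distinct facts).

```python
def select_by_loss(examples, losses, low=0.5, high=4.0):
    """Keep examples whose per-example training loss falls inside a band.

    examples : list of training examples (any type)
    losses   : per-example training losses from a reference model
    low/high : illustrative thresholds, not values from the paper

    Dropping loss < low discards redundant repeats of frequent facts;
    dropping loss > high discards facts unlikely to fit in capacity.
    """
    return [ex for ex, loss in zip(examples, losses) if low <= loss <= high]


# Toy usage: four examples whose losses span the band.
examples = ["fact_common", "fact_mid", "fact_rare", "fact_noise"]
losses = [0.1, 1.0, 3.0, 6.0]
kept = select_by_loss(examples, losses)  # -> ["fact_mid", "fact_rare"]
```

Because the filter reads only scalar losses, it needs no entity annotations at selection time, which is what makes a loss-only scheme attractive for raw pretraining corpora.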
