Saturday, January 17, 2026

The Data-Quality Illusion: Rethinking Classifier-Based Quality Filtering for LLM Pretraining


Large-scale models are pretrained on massive web-crawled datasets containing documents of mixed quality, making data filtering essential. A popular method is Classifier-based Quality Filtering (CQF), which trains a binary classifier to distinguish between pretraining data and a small, high-quality set. It assigns each pretraining document a quality score, defined as the classifier's score, and retains only the top-scoring documents. We provide an in-depth analysis of CQF. We show that while CQF improves downstream task performance, it does not necessarily improve language modeling on the high-quality dataset. We explain this paradox by the fact that CQF implicitly filters the high-quality dataset as well. We further compare the behavior of models trained with CQF to that of models trained on synthetic data of increasing quality, obtained via random token permutations, and find starkly different trends. Our results challenge the view that CQF captures a meaningful notion of data quality.
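To make the CQF recipe concrete, here is a minimal sketch in pure Python. It is an illustration under stated assumptions, not the authors' implementation: the bag-of-words features, the plain SGD logistic regression, the function names (`train_quality_classifier`, `quality_score`, `cqf_filter`), and the toy documents are all hypothetical. Real pipelines typically use a stronger text classifier and far larger samples. The sketch trains a binary classifier with label 1 for the high-quality set and label 0 for a sample of pretraining data, scores every document in a pool, and keeps the top-scoring fraction.

```python
import math
import random
from collections import Counter


def featurize(text):
    """Bag-of-words counts over lowercased whitespace tokens (an assumed feature choice)."""
    return Counter(text.lower().split())


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def quality_score(model, text):
    """Classifier probability that `text` belongs to the high-quality set."""
    weights, bias = model
    feats = featurize(text)
    z = bias + sum(weights.get(tok, 0.0) * cnt for tok, cnt in feats.items())
    return sigmoid(z)


def train_quality_classifier(pretrain_sample, hq_docs, epochs=200, lr=0.5, seed=0):
    """Fit logistic regression by SGD: label 0 = pretraining data, label 1 = high-quality set."""
    data = [(featurize(d), 0.0) for d in pretrain_sample]
    data += [(featurize(d), 1.0) for d in hq_docs]
    rng = random.Random(seed)
    weights, bias = {}, 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for feats, label in data:
            z = bias + sum(weights.get(t, 0.0) * c for t, c in feats.items())
            err = sigmoid(z) - label  # gradient of the log loss w.r.t. z
            for tok, cnt in feats.items():
                weights[tok] = weights.get(tok, 0.0) - lr * err * cnt
            bias -= lr * err
    return weights, bias


def cqf_filter(pool, model, keep_fraction):
    """Rank pool documents by quality score and keep the top `keep_fraction`."""
    ranked = sorted(pool, key=lambda d: quality_score(model, d), reverse=True)
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:n_keep]


# Toy data, invented purely for illustration.
pretrain_sample = ["click here to win free prizes", "buy cheap pills now free shipping"]
hq_docs = ["the theorem follows from the lemma", "we prove the bound by induction"]
pool = [
    "win free pills now",
    "we prove the theorem by induction",
    "cheap prizes click here",
    "the lemma follows from the bound",
]

model = train_quality_classifier(pretrain_sample, hq_docs)
kept = cqf_filter(pool, model, keep_fraction=0.5)
```

Note that the score measures only how much a document resembles the high-quality set relative to the pretraining sample. That is the quantity whose interpretation as "quality" the abstract calls into question.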
