This paper was accepted at the Workshop on Memory for LLM-Based Agentic Systems at ICLR.
Language models have continually grown to compress more world knowledge into their parameters, but the knowledge that can be pretrained into them is upper-bounded by their parameter size. In particular, the capacity of Small Language Models (SLMs) is limited, leading to factually incorrect generations. This problem is often mitigated by giving the SLM access to an external source: the ability to query a larger model, documents, or a database. Under this setting, we study the fundamental question of which tokens an SLM can and should learn during pretraining, versus which ones it should delegate via a token. We find that this is not merely a question of loss: although the loss is predictive of whether a predicted token mismatches the ground truth, some tokens are acceptable in that they are fair alternative continuations of a pretraining document, and should not trigger delegation even when their loss is high. We find that a spaCy grammar parser can help augment the loss signal to identify which tokens the SLM should learn to delegate to prevent factual errors, and which are safe to learn and predict even under high losses. We propose LaCy, a novel pretraining method based on this token selection philosophy. Our experiments demonstrate that LaCy models successfully learn which tokens to predict and where to delegate for help. This results in higher FactScores when generating in a cascade with a larger model, and outperforms Rho- or LLM-judge-trained SLMs while being simpler and cheaper.
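The selection philosophy described above can be sketched as a simple decision rule combining the pretraining loss with a grammatical signal. This is a minimal illustrative sketch, not the paper's actual criterion: the function name `label_token`, the loss threshold, and the use of "factual" part-of-speech categories (e.g., proper nouns or named entities, as a spaCy parser could supply) are all assumptions made for illustration.

```python
# Hypothetical sketch of loss-plus-grammar token selection for pretraining.
# In the paper's setting, a grammar parser such as spaCy would supply the
# is_factual_pos flag (e.g., proper nouns / entity mentions); here it is a
# plain boolean input so the sketch stays self-contained.

def label_token(loss: float, is_factual_pos: bool, loss_threshold: float = 2.5) -> str:
    """Decide whether an SLM should learn to predict a token or delegate it.

    loss           -- the SLM's cross-entropy loss on this ground-truth token
    is_factual_pos -- whether the token is in a grammatically "factual"
                      category (illustrative assumption)
    loss_threshold -- illustrative cutoff separating low- from high-loss tokens
    """
    if loss <= loss_threshold:
        return "learn"      # low loss: the SLM already predicts this reliably
    if is_factual_pos:
        return "delegate"   # high loss on a factual token: risk of factual error
    return "learn"          # high loss but a fair alternative continuation is fine


if __name__ == "__main__":
    # Toy examples: an entity-like token vs. a function word, both high loss.
    print(label_token(3.1, True))    # factual + high loss -> delegate
    print(label_token(3.0, False))   # non-factual, safe even under high loss
    print(label_token(0.5, True))    # low loss -> learn regardless
```

The point of the sketch is that loss alone is insufficient: the third branch keeps high-loss tokens learnable when the grammatical signal says they are safe alternative continuations.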
- †University of Cambridge
- **Work done while at Apple