What research can be pursued with small models trained to complete real programs? Typically, researchers study program synthesis through large language models (LLMs), which introduce issues such as knowing what is in or out of distribution, understanding fine-tuning effects, understanding the effects of tokenization, and higher demands on compute and storage to carry out experiments. We present a system called Cadmus that comprises an integer virtual machine (VM), a dataset composed of real programs spanning diverse tasks, and an autoregressive transformer model trained for under $200 of compute cost. The system can be used to study program completion, out-of-distribution representations, inductive reasoning, and instruction following in a setting where researchers have efficient and affordable fine-grained control over the training distribution and the ability to inspect and instrument models. Smaller models working on complex reasoning tasks enable instrumentation and investigations that may be prohibitively expensive on larger models. To demonstrate that these tasks are complex enough to be of interest, we show that Cadmus models outperform GPT-5 (achieving 100% accuracy versus GPT-5's 95%) even on the simple task of completing correct integer-arithmetic programs in our domain-specific language (DSL), while providing transparency into the dataset's relationship to the problem. We also show that GPT-5 brings unknown priors into its reasoning process when solving the same tasks, demonstrating a confounding factor that prevents the use of large-scale LLMs for some investigations where the training set's relationship to the task must be fully understood.
** Work performed while at Apple
