Tuesday, January 13, 2026

Data-Centric Lessons To Improve Speech-Language Pretraining


Spoken Question Answering (SQA) is a core capability for useful and interactive artificial intelligence systems. Recently, several speech-language models (SpeechLMs) have been released with a particular focus on improving their SQA performance. However, a lack of controlled ablations of pretraining data processing and curation makes it challenging to understand which factors account for performance, despite substantial gains from similar studies in other data modalities. In this work, we address this gap by conducting a data-centric exploration of SpeechLM pretraining. We focus on three research questions fundamental to speech-language pretraining data: (1) how to process raw web-crawled audio content for speech-text pretraining, (2) how to construct synthetic pretraining datasets to augment web-crawled data, and (3) how to interleave (text, audio) segments into training sequences. We apply the insights from our controlled data-centric ablations to pretrain a 3.8B-parameter SpeechLM, called SpeLangy, that outperforms models up to 3x larger by 10.2% absolute performance. We hope our findings highlight the impact of effective data curation for speech-language pretraining and guide future data-centric exploration of SpeechLMs.
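The third research question, interleaving (text, audio) segments into a single training sequence, can be illustrated with a minimal sketch. The segment format, boundary markers, and audio token names below are assumptions for illustration, not the scheme used in the paper.

```python
# Hypothetical sketch: flattening aligned (modality, tokens) segments into
# one interleaved training sequence for a SpeechLM.
# Boundary markers and token names are illustrative assumptions.

AUDIO_START = "<audio>"
AUDIO_END = "</audio>"

def interleave_segments(segments):
    """Flatten aligned (modality, tokens) segments into one sequence.

    Audio-token spans are wrapped in boundary markers so the model can
    tell modalities apart within a single interleaved sequence.
    """
    sequence = []
    for modality, tokens in segments:
        if modality == "audio":
            sequence.append(AUDIO_START)
            sequence.extend(tokens)
            sequence.append(AUDIO_END)
        elif modality == "text":
            sequence.extend(tokens)
        else:
            raise ValueError(f"unknown modality: {modality}")
    return sequence

# Example: a text span, a discrete-audio-token span, then more text.
example = [
    ("text", ["The", "speaker", "asks:"]),
    ("audio", ["a512", "a7", "a88"]),
    ("text", ["followed", "by", "a", "reply."]),
]
print(interleave_segments(example))
```

In practice the audio side would hold discrete codec or SSL-derived token IDs rather than strings; the point is only that text and audio spans share one flat sequence with explicit modality boundaries.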
