Thursday, February 26, 2026

Closing the Gap Between Text and Speech Understanding in LLMs


Large Language Models (LLMs) can be adapted to extend their text capabilities to speech inputs. However, these speech-adapted LLMs consistently underperform their text-based counterparts (and even cascaded pipelines) on language understanding tasks. We term this shortfall the text-speech understanding gap: the performance drop observed when a speech-adapted LLM processes spoken inputs relative to when the original text-based LLM processes the equivalent text. Recent approaches to narrowing this gap either rely on large-scale speech synthesis of text corpora, which is costly and heavily dependent on synthetic data, or on large-scale proprietary speech datasets, which are not reproducible. As a result, there remains a need for more data-efficient solutions for closing the text-speech understanding gap. In this work, we analyze the gap as driven by two factors: (i) forgetting of text capabilities during adaptation, and (ii) cross-modal misalignment between speech and text. Based on this analysis, we introduce SALAD (Sample-efficient Alignment with Learning through Active selection and cross-modal Distillation), which combines cross-modal distillation with targeted synthetic data to improve alignment while mitigating forgetting. Applied to 3B and 7B LLMs, SALAD achieves competitive performance with a strong open-weight model across broad-domain benchmarks in knowledge, language understanding, and reasoning, while training on over an order of magnitude less speech data from public corpora.
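The abstract describes the distillation component only at a high level. As a rough illustration, the sketch below shows one common way a cross-modal distillation objective of this kind is implemented: a frozen text LLM acts as the teacher on the transcript, and the speech-adapted student is trained to match the teacher's next-token distribution on the paired audio. This is a minimal sketch under assumed details, not the paper's actual objective; the function name, tensor shapes, and temperature handling are illustrative.

```python
import torch
import torch.nn.functional as F

def cross_modal_distillation_loss(text_logits: torch.Tensor,
                                  speech_logits: torch.Tensor,
                                  temperature: float = 1.0) -> torch.Tensor:
    """KL divergence from a frozen text teacher to a speech-adapted student.

    text_logits:   (batch, seq_len, vocab) - teacher run on the transcript.
    speech_logits: (batch, seq_len, vocab) - student run on the paired audio.
    Assumes the two outputs are aligned to the same target token positions,
    which is itself a simplification of any real speech-text alignment.
    """
    # Teacher distribution: detached so no gradients flow to the text model.
    teacher_probs = F.softmax(text_logits.detach() / temperature, dim=-1)
    # Student distribution in log space, as required by F.kl_div.
    student_log_probs = F.log_softmax(speech_logits / temperature, dim=-1)
    # Standard distillation rescaling by T^2 to keep gradient magnitudes
    # comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature ** 2
```

In practice such a term would be combined with the usual cross-entropy loss on the targeted synthetic data, with the mixing weight controlling the trade-off between alignment and forgetting that the abstract highlights.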
