Thursday, February 26, 2026

Closing the Gap Between Text and Speech Understanding in LLMs


Large Language Models (LLMs) can be adapted to extend their text capabilities to speech inputs. However, these speech-adapted LLMs consistently underperform their text-based counterparts (and even cascaded pipelines) on language understanding tasks. We term this shortfall the text-speech understanding gap: the performance drop observed when a speech-adapted LLM processes spoken inputs relative to when the original text-based LLM processes the equivalent text. Recent approaches to narrowing this gap either rely on large-scale speech synthesis of text corpora, which is costly and heavily dependent on synthetic data, or on large-scale proprietary speech datasets, which are not reproducible. As a result, there remains a need for more data-efficient solutions for closing the text-speech understanding gap. In this work, we analyze the gap as driven by two factors: (i) forgetting of text capabilities during adaptation, and (ii) cross-modal misalignment between speech and text. Based on this analysis, we introduce SALAD (Sample-efficient Alignment with Learning through Active selection and cross-modal Distillation), which combines cross-modal distillation with targeted synthetic data to improve alignment while mitigating forgetting. Applied to 3B and 7B LLMs, SALAD achieves competitive performance with a strong open-weight model across broad-domain benchmarks in knowledge, language understanding, and reasoning, while training on over an order of magnitude less speech data from public corpora.
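The abstract describes the distillation component only at a high level. As a rough illustration, the sketch below shows one common way a cross-modal distillation objective of this kind is implemented: a frozen text LLM acts as the teacher on the transcript, and the speech-adapted student is trained to match the teacher's next-token distribution on the paired audio. This is a minimal sketch under assumed details, not the paper's actual objective; the function name, tensor shapes, and temperature handling are illustrative.

```python
import torch
import torch.nn.functional as F

def cross_modal_distillation_loss(text_logits: torch.Tensor,
                                  speech_logits: torch.Tensor,
                                  temperature: float = 1.0) -> torch.Tensor:
    """KL divergence from a frozen text teacher to a speech-adapted student.

    text_logits:   (batch, seq_len, vocab) - teacher run on the transcript.
    speech_logits: (batch, seq_len, vocab) - student run on the paired audio.
    Assumes the two outputs are aligned to the same target token positions,
    which is itself a simplification of any real speech-text alignment.
    """
    # Teacher distribution: detached so no gradients flow to the text model.
    teacher_probs = F.softmax(text_logits.detach() / temperature, dim=-1)
    # Student distribution in log space, as required by F.kl_div.
    student_log_probs = F.log_softmax(speech_logits / temperature, dim=-1)
    # Standard distillation rescaling by T^2 to keep gradient magnitudes
    # comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature ** 2
```

In practice such a term would be combined with the usual cross-entropy loss on the targeted synthetic data, with the mixing weight controlling the trade-off between alignment and forgetting that the abstract highlights.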
