Friday, January 23, 2026

Microsoft Releases VibeVoice-ASR: A Unified Speech-to-Textual content Mannequin Designed to Deal with 60-Minute Lengthy-Type Audio in a Single Move


Microsoft has launched VibeVoice-ASR as a part of the VibeVoice household of open supply frontier voice AI fashions. VibeVoice-ASR is described as a unified speech-to-text mannequin that may deal with 60-minute long-form audio in a single move and output structured transcriptions that encode Who, When, and What, with assist for Custom-made Hotwords.

VibeVoice sits in a single repository that hosts Textual content-to-Speech, actual time TTS, and Computerized Speech Recognition fashions below an MIT license. VibeVoice makes use of steady speech tokenizers that run at 7.5 Hz and a next-token diffusion framework the place a Massive Language Mannequin causes over textual content and dialogue and a diffusion head generates acoustic element. This framework is principally documented for TTS, but it surely defines the general design context through which VibeVoice-ASR lives.

https://huggingface.co/microsoft/VibeVoice-ASR

Lengthy kind ASR with a single international context

Not like typical ASR (Computerized Speech Recognition) methods that first lower audio into brief segments after which run diarization and alignment as separate parts, VibeVoice-ASR is designed to just accept as much as 60 minutes of steady audio enter inside a 64K token size funds. The mannequin retains one international illustration of the total session. This implies the mannequin can keep speaker identification and subject context throughout all the hour as a substitute of resetting each few seconds.

60-minute Single-Move Processing

The first key function is that many typical ASR methods course of lengthy audio by reducing it into brief segments, which may lose international context. VibeVoice-ASR as a substitute takes as much as 60 minutes of steady audio inside a 64K token window so it may well keep constant speaker monitoring and semantic context throughout all the recording.

That is vital for duties like assembly transcription, lectures, and lengthy assist calls. A single move over the entire sequence simplifies the pipeline. There is no such thing as a have to implement customized logic to merge partial hypotheses or restore speaker labels at boundaries between audio chunks.

Custom-made Hotwords for area accuracy

Custom-made Hotwords are the second key function. Customers can present hotwords similar to product names, group names, technical phrases, or background context. The mannequin makes use of these hotwords to information the popularity course of.

This lets you bias decoding towards the proper spelling and pronunciation for area particular tokens with out retraining the mannequin. For instance, a dev-user can move inner undertaking names or buyer particular phrases at inference time. That is helpful when deploying the identical base mannequin throughout a number of merchandise that share comparable acoustic circumstances however very totally different vocabularies.

Microsoft additionally ships a finetuning-asr listing with LoRA based mostly superb tuning scripts for VibeVoice-ASR. Collectively, hotwords and LoRA superb tuning give a path for each mild weight adaptation and deeper area specialization.

Wealthy Transcription, diarization, and timing

The third function is Wealthy Transcription with Who, When, and What. The mannequin collectively performs ASR, diarization, and timestamping, and returns a structured output that signifies who stated what and when.

See under the three analysis figures named DER, cpWER, and tcpWER.

https://huggingface.co/microsoft/VibeVoice-ASR
  • DER is Diarization Error Fee, it measures how properly the mannequin assigns speech segments to the proper speaker
  • cpWER and tcpWER are phrase error price metrics computed below conversational settings

These graphs summarize how properly the mannequin performs on multi speaker lengthy kind knowledge, which is the first goal setting for this ASR system.

The structured output format is properly fitted to downstream processing like speaker particular summarization, motion merchandise extraction, or analytics dashboards. Since segments, audio system, and timestamps already come from a single mannequin, downstream code can deal with the transcript as a time aligned occasion log.

Key Takeaways

  • VibeVoice-ASR is a unified speech to textual content mannequin that handles 60 minute lengthy kind audio in a single move inside a 64K token context.
  • The mannequin collectively performs ASR, diarization, and timestamping so it outputs structured transcripts that encode Who, When, and What in a single inference step.
  • Custom-made Hotwords let customers inject area particular phrases similar to product names or technical jargon to enhance recognition accuracy with out retraining the mannequin.
  • Analysis with DER, cpWER, and tcpWER focuses on multi speaker conversational eventualities which aligns the mannequin with conferences, lectures, and lengthy calls.
  • VibeVoice-ASR is launched within the VibeVoice open supply stack below MIT license with official weights, superb tuning scripts, and a web-based Playground for experimentation.

Take a look at the Mannequin WeightsRepo and Playground. Additionally, be at liberty to comply with us on Twitter and don’t overlook to affix our 100k+ ML SubReddit and Subscribe to our Publication. Wait! are you on telegram? now you possibly can be a part of us on telegram as properly.


Related Articles

Latest Articles