NVIDIA Releases Nemotron 3.5 ASR: A 600M-Parameter Cache-Conscious Streaming Mannequin Transcribing 40 Language-Locales in Actual Time

June 6, 2026

162

NVIDIA’s Nemotron Speech group has launched Nemotron 3.5 ASR. It’s a 600M-parameter streaming Computerized Speech Recognition (ASR) mannequin. A single checkpoint transcribes 40 language-locales in actual time. Punctuation and capitalization are inbuilt natively. The mannequin ships as open weights on Hugging Face. The license is OpenMDW-1.1. The structure is a Cache-Conscious FastConformer-RNNT.

What’s Nemotron 3.5 ASR

Nemotron 3.5 ASR extends nvidia/nemotron-speech-streaming-en-0.6b to many languages. It provides prompt-based language-ID conditioning to the bottom mannequin. That lets one 600M-parameter checkpoint cowl 40 language-locales. No per-language mannequin or model-swapping is required.

The mannequin targets two workloads. The primary is low-latency streaming for dwell audio. The second is high-throughput batch transcription. Output is production-ready textual content with correct casing and punctuation. No separate punctuation-restoration step is required.

Picture supply: https://huggingface.co/nvidia/nemotron-3.5-asr-streaming-0.6b

How Cache-Conscious FastConformer-RNNT Works

The mannequin has two foremost items. The primary is a Cache-Conscious FastConformer encoder with 24 layers. FastConformer is an environment friendly evolution of the Conformer structure. It makes use of linearly scalable consideration. The second piece is an RNNT (Recurrent Neural Community Transducer) decoder. RNNT emits textual content body by body as audio streams in.

The “cache-aware” design is the effectivity lever. Buffered streaming re-processes overlapping audio home windows at each step. That repeats the identical work and provides delay. This mannequin caches encoder self-attention and convolution activations as a substitute. It reuses these cached states as new audio arrives. So every audio body is processed precisely as soon as, with no overlap. Compute and end-to-end latency each drop, with out an accuracy penalty.

The Latency Knob: `att_context_size`

One inference setting controls the latency-accuracy tradeoff. It’s the consideration context dimension, att_context_size. Smaller context emits textual content sooner however sees much less future audio. Bigger context raises accuracy at greater latency.

The identical checkpoint covers the total vary. Settings map to chunk sizes of 80ms, 160ms, 320ms, 560ms, and 1.12s. For instance, [56,0] offers an 80ms ultra-low-latency mode. The [56,13] setting offers 1.12s for highest accuracy. Groups choose the working level at inference time, with no retraining.

Language Detection and Protection

The 40 language-locales embrace English, Spanish, German, and French variants. Additionally they cowl Arabic, Japanese, Korean, Mandarin, Hindi, and Thai. A number of different European and Nordic languages are included too.

Language conditioning works two methods. Setting target_lang to a identified locale normally offers the very best accuracy. Setting target_lang=auto lets the mannequin detect the language itself. In auto mode, it emits a language tag after terminal punctuation. One deployment can then transcribe mixed-language site visitors. No separate language-ID element is required.

Comparability

Product	Firm	Entry	Native streaming	Language protection	Reported latency	Pricing mannequin
Nemotron 3.5 ASR	NVIDIA	Open weights (OpenMDW-1.1), self-host; hosted on DeepInfra	Sure — cache-aware FastConformer-RNNT	40 language-locales	80ms–1.12s, configurable at inference	Free to self-host; usage-based through host
Whisper large-v3	OpenAI	Open weights (MIT), self-host; API	No — offline/batch	~99 languages	Not streaming-native	Self-host free; API ~$0.006/min (batch)
Nova-3	Deepgram	Closed API; on-premise/self-host (enterprise)	Sure — streaming + batch	Multilingual; +10 monolingual languages added Jan 2026	Low-latency streaming (reported sub-300ms)	~$0.0077/min (Nova-3 Monolingual, PAYG)
Common-3 Professional Streaming	AssemblyAI	Closed API (EU endpoint out there)	Sure	6 languages: English, Spanish, French, German, Italian, Portuguese	Sub-300ms (official); first partial ~750ms	Utilization-based (PAYG)
Scribe v2 Realtime	ElevenLabs	Closed API	Sure	90+ languages (99 per ElevenLabs)	~150ms (p50)	~$0.28/hour
Ursa / streaming	Speechmatics	API + on-premise + edge	Sure — streaming + batch	50+ languages with automated identification	Extremely-low latency (positioned)	Enterprise/utilization

Superb-Tuning Outcomes

As a result of the weights are open, groups can fine-tune for a language, area, or accent. NVIDIA printed a labored instance on Greek and Bulgarian. It fine-tuned the bottom checkpoint with the identical Cache-Conscious FastConformer-RNNT recipe. Every clip carried a target_lang tag for language conditioning. Coaching knowledge got here from public corpora, together with Granary, Widespread Voice, and FLEURS.

Outcomes had been measured as WER on held-out FLEURS, on the 80ms setting. Greek WER fell from 35 to 24, a 32% relative enchancment. Bulgarian fell from 22 to fifteen, a 31% relative enchancment. These are uncooked WER percentages on the lowest-latency streaming mode. NVIDIA notes that evaluating at deployment latency, on held-out knowledge, offers sincere numbers.

Strengths and Concerns

Strengths:

One 600M-parameter checkpoint covers 40 language-locales, chopping deployment sprawl.
Cache-aware streaming processes every body as soon as, reported at 17x buffered concurrency on an H100.
att_context_size tunes latency from 80ms to 1.12s at inference, with no retraining.
Punctuation, capitalization, and auto language tagging are inbuilt.
Open weights enabled a 31–32% relative WER drop on Greek and Bulgarian after fine-tuning.

Concerns:

The mannequin handles English, however NVIDIA recommends its devoted English mannequin for English-only use.
The 80ms mode trades some accuracy for the bottom latency.
Japanese and Korean use CER, so cross-language error comparisons want care.
Throughput figures are measured on H100, so outcomes on different GPUs will differ.
The manufacturing NIM with gRPC streaming is introduced, however not but launched.

Key Takeaways

NVIDIA’s Nemotron 3.5 ASR is an open-weights (OpenMDW-1.1), 600M-parameter streaming mannequin transcribing 40 language-locales from one checkpoint.
Its Cache-Conscious FastConformer-RNNT design processes every audio body as soon as, reported at 17x the concurrent streams of buffered approaches on an H100.
Latency is configurable from 80ms to 1.12s at inference through att_context_size, with no retraining.
A brief fine-tune reduce FLEURS WER 32% on Greek (35→24) and 31% on Bulgarian (22→15), on the 80ms setting.
It’s self-hostable and streaming-native, in contrast to closed APIs (Deepgram, AssemblyAI, ElevenLabs) or offline Whisper.

Marktechpost’s Visible Explainer

NEMOTRON 3.5 ASR
1 / 10

NVIDIA · STREAMING SPEECH AI · OPEN WEIGHTS

Nemotron 3.5 ASR

A 600M-parameter cache-aware streaming mannequin that transcribes 40 language-locales in actual time, from a single checkpoint.

600M parameters
40 language-locales
80ms–1.12s latency
OpenMDW-1.1

01 — WHAT IT IS

One mannequin, 40 language-locales

Extends nvidia/nemotron-speech-streaming-en-0.6b with prompt-based language-ID conditioning.
A single 600M-parameter checkpoint covers 40 language-locales. No model-swapping.
Punctuation and capitalization are inbuilt. No separate post-processing step.
Targets two workloads: low-latency streaming and high-throughput batch.
NVIDIA nonetheless recommends its English-only mannequin for English-only use.

02 — ARCHITECTURE

Cache-Conscious FastConformer-RNNT

A 24-layer FastConformer encoder paired with an RNNT decoder.
Buffered streaming re-processes overlapping audio home windows at each step.
This mannequin caches encoder self-attention and convolution states, then reuses them.
Every audio body is processed precisely as soon as, with no overlap.
Compute and end-to-end latency drop, with no accuracy penalty.

03 — THE LATENCY KNOB

One setting tunes latency vs. accuracy

att_context_size	Chunk (latency)	Use case
`[56,0]`	80ms (Extremely-Low)	Extremely low latency voice brokers
`[56,1]`	160ms (Low)	Interactive voice brokers
`[56,3]`	320ms (Balanced)	Conversational AI, dwell caption
`[56,6]`	560ms (Medium)	Greater accuracy, cheap latency
`[56,13]`	1.12s (Excessive)	Highest accuracy

Identical checkpoint, chosen at inference time. No retraining required.

04 — LANGUAGES & DETECTION

Protection and automated language ID

40 language-locales, together with English, Spanish, German, and French variants.
Additionally covers Arabic, Japanese, Korean, Mandarin, Hindi, and Thai.
Set target_lang to a identified locale for the very best accuracy.
Set target_lang=auto to let the mannequin detect the language.
In auto mode, it emits a language tag after terminal punctuation.
One deployment handles mixed-language site visitors, with no separate language-ID element.

05 — THROUGHPUT

Half the dimensions, extra concurrent streams

NVIDIA compares it in opposition to Parakeet RNNT 1.1B multilingual, which makes use of buffered streaming.
Nemotron 3.5 ASR is roughly half the dimensions: 0.6B versus 1.1B.
The group reviews 17x the concurrent streams of buffered approaches, on the identical H100.
Avoiding redundant recomputation lowers the fee per stream in manufacturing.

The 17x determine is from the discharge announcement; the mannequin card states the qualitative declare immediately.

06 — FINE-TUNING RESULTS

A brief fine-tune lifts weaker languages

Language	Base WER	Superb-tuned	Relative
Greek	35	24	32%
Bulgarian	22	15	31%

Uncooked WER (%) on held-out FLEURS on the 80ms setting. Information: Granary, Widespread Voice, FLEURS.

07 — AVAILABILITY & ACCESS

Open weights, plus a hosted path

Weights on Hugging Face below the OpenMDW-1.1 license.
Runtime is NeMo 26.06 or newer. Enter have to be mono-channel.
Hosted on DeepInfra, which provides phrase boosting for area vocabulary.
NVIDIA says a NIM launch is deliberate for later within the month, with gRPC streaming.
Said GPU assist: Ampere, Hopper, Blackwell, Lovelace, Turing, Volta, and Jetson.

08 — HOW IT COMPARES

The place it sits within the panorama

Product	Entry	Streaming	Languages
Nemotron 3.5 ASR	Open weights	Native	40 locales
Whisper large-v3	Open weights	No (batch)	~99
Deepgram Nova-3	API / on-prem	Native	Multilingual
AssemblyAI U-3 Professional	API	Native	6
ElevenLabs Scribe v2	API	Native	90+
Google Chirp / Azure	API	Native	100+ / 140+

Latency and WER will not be immediately comparable throughout distributors; this compares construction, not a rating.

09 — KEY TAKEAWAYS

The brief model

An open-weights 600M streaming mannequin transcribing 40 language-locales from one checkpoint.
Cache-aware design processes every body as soon as; reported 17x buffered concurrency on an H100.
Latency configurable from 80ms to 1.12s at inference, with no retraining.
A brief fine-tune reduce FLEURS WER 32% (Greek) and 31% (Bulgarian).
Self-hostable and streaming-native, in contrast to closed APIs or offline Whisper.

Curated for AI engineers by Marktechpost — practitioner-first protection of AI & ML.

Take a look at the Mannequin weights. Additionally, be at liberty to comply with us on Twitter and don’t neglect to affix our 150k+ ML SubReddit and Subscribe to our Publication. Wait! are you on telegram? now you’ll be able to be a part of us on telegram as properly.

Must associate with us for selling your GitHub Repo OR Hugging Face Web page OR Product Launch OR Webinar and many others.? Join with us

NVIDIA Releases Nemotron 3.5 ASR: A 600M-Parameter Cache-Conscious Streaming Mannequin Transcribing 40 Language-Locales in Actual Time

What’s Nemotron 3.5 ASR

How Cache-Conscious FastConformer-RNNT Works

The Latency Knob: `att_context_size`

Language Detection and Protection

Comparability

Superb-Tuning Outcomes

Strengths and Concerns

Strengths:

Concerns:

Key Takeaways

Marktechpost’s Visible Explainer

Nemotron 3.5 ASR

One mannequin, 40 language-locales

Cache-Conscious FastConformer-RNNT

One setting tunes latency vs. accuracy

Protection and automated language ID

Half the dimensions, extra concurrent streams

A brief fine-tune lifts weaker languages

Open weights, plus a hosted path

The place it sits within the panorama

The brief model

Related Articles

One in every of NASA’s Most Necessary Deep Area Observatories Hit by Spanish Wildfires

It is Friday. There is a heatwave. I am feeling lazy. I am simply going to publish some tweets.

Tabular LLMs: An Introduction to the Basis Fashions That Predict Your Spreadsheet

Latest Articles

One in every of NASA’s Most Necessary Deep Area Observatories Hit by Spanish Wildfires

It is Friday. There is a heatwave. I am feeling lazy. I am simply going to publish some tweets.

Tabular LLMs: An Introduction to the Basis Fashions That Predict Your Spreadsheet

What Occurred, What Issues, What’s Subsequent

The hunt to maintain organs alive exterior the physique

NVIDIA Releases Nemotron 3.5 ASR: A 600M-Parameter Cache-Conscious Streaming Mannequin Transcribing 40 Language-Locales in Actual Time

What’s Nemotron 3.5 ASR

How Cache-Conscious FastConformer-RNNT Works

The Latency Knob: att_context_size

Language Detection and Protection

Comparability

Superb-Tuning Outcomes

Strengths and Concerns

Strengths:

Concerns:

Key Takeaways

Marktechpost’s Visible Explainer

Nemotron 3.5 ASR

One mannequin, 40 language-locales

Cache-Conscious FastConformer-RNNT

One setting tunes latency vs. accuracy

Protection and automated language ID

Half the dimensions, extra concurrent streams

A brief fine-tune lifts weaker languages

Open weights, plus a hosted path

The place it sits within the panorama

The brief model

Related Articles

Latest Articles

The Latency Knob: `att_context_size`