NVIDIA Releases Nemotron Speech ASR: Low-Latency Speech Recognition

Nemotron Speech ASR: Cache Aware Streaming for Voice Agents

NVIDIA has released Nemotron Speech ASR, a 600M-parameter streaming English transcription model designed for low-latency applications such as voice agents and live captioning. The model, available as a checkpoint on Hugging Face, reports a word error rate (WER) of 7.84% at a 0.16-second chunk size.

Why This Matters

Traditional streaming ASR often relies on overlapping windows, reprocessing the same audio repeatedly to maintain context, which raises computational cost and causes latency drift. Nemotron Speech ASR uses a cache-aware design that avoids this redundant computation and keeps latency stable and predictable, which is crucial for real-time voice interaction where delays hurt usability. A mismanaged streaming ASR pipeline can easily degrade agent responsiveness, reducing user engagement and driving up infrastructure costs.
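To make the cost difference concrete, the toy count below compares how many frames an encoder would touch under the two strategies. It is a minimal sketch, not the model's implementation: the function names, the frame counts, and the left-context size are illustrative assumptions, and a real cache-aware encoder stores intermediate activations rather than simply skipping frames.

```python
def frames_encoded_overlapping(total_frames: int, chunk: int, left_context: int) -> int:
    """Frames pushed through the encoder when every step re-encodes its left context."""
    encoded = 0
    for start in range(0, total_frames, chunk):
        end = min(start + chunk, total_frames)
        window_start = max(0, start - left_context)  # context is re-read each step
        encoded += end - window_start
    return encoded


def frames_encoded_cached(total_frames: int, chunk: int) -> int:
    """Frames pushed through the encoder when past context is served from a cache."""
    encoded = 0
    for start in range(0, total_frames, chunk):
        end = min(start + chunk, total_frames)
        encoded += end - start  # only the new frames are computed
    return encoded


if __name__ == "__main__":
    # Hypothetical stream: 100 frames, 10-frame chunks, 30 frames of left context.
    total, chunk, left = 100, 10, 30
    overlapping = frames_encoded_overlapping(total, chunk, left)
    cached = frames_encoded_cached(total, chunk)
    print(f"overlapping windows: {overlapping} frames encoded")
    print(f"cache-aware:         {cached} frames encoded")
    print(f"redundancy factor:   {overlapping / cached:.1f}x")
```

With these made-up numbers the overlapping scheme encodes 340 frames for 100 frames of audio, while the cached scheme encodes each frame once. The gap grows with the amount of left context that is re-read per step.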

Key Insights

Cache Aware design: Eliminates recomputation of overlapping context in streaming, improving efficiency.
Latency/Accuracy Tradeoff: WER falls from 7.84% at a 0.16s chunk size to 7.16% at 1.12s, so developers can trade latency for accuracy.
Scalability: Supports approximately 560 concurrent streams on an NVIDIA H100 GPU with a 320ms chunk size, a 3x improvement over baseline streaming systems.
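The figures above can be turned into a few back-of-the-envelope numbers. The sketch below only rearranges the values quoted in this article. The helper names are hypothetical, and the baseline stream count is merely what the "3x" claim implies, not a separately reported measurement.

```python
# Values quoted in the article.
WER_SHORT, CHUNK_SHORT_S = 7.84, 0.16   # WER (%) at the small chunk size
WER_LONG, CHUNK_LONG_S = 7.16, 1.12     # WER (%) at the large chunk size
STREAMS, CHUNK_MS = 560, 320            # concurrent streams on one H100
SPEEDUP = 3                             # claimed improvement over baseline


def relative_wer_reduction(wer_from: float, wer_to: float) -> float:
    """Fraction of errors removed when moving from wer_from to wer_to."""
    return (wer_from - wer_to) / wer_from


def chunks_per_second(streams: int, chunk_ms: int) -> float:
    """Chunks the GPU must finish per second to keep every stream real-time."""
    return streams * 1000 / chunk_ms


def implied_baseline_streams(streams: int, speedup: float) -> float:
    """Stream count a baseline would reach if the speedup claim holds exactly."""
    return streams / speedup


if __name__ == "__main__":
    print(f"relative WER reduction: {relative_wer_reduction(WER_SHORT, WER_LONG):.1%}")
    print(f"chunk size ratio:       {CHUNK_LONG_S / CHUNK_SHORT_S:.0f}x")
    print(f"chunks per second:      {chunks_per_second(STREAMS, CHUNK_MS):.0f}")
    print(f"implied baseline:       {implied_baseline_streams(STREAMS, SPEEDUP):.0f} streams")
```

Read this way, a chunk that is seven times longer buys a relative WER reduction of under nine percent, which is why the smaller chunk sizes are attractive for voice agents.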

Working Example

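# Example inference code (conceptual). Loading this checkpoint through the
# transformers pipeline is an assumption of this sketch; check the model card
# for the supported loading path.
import numpy as np
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="nvidia/nemotron-speech-streaming-en-0.6b",
    device="cuda",  # or "cpu"
)

# Placeholder input: 160 ms of silence at an assumed 16 kHz sampling rate.
# Replace with a real audio chunk between 80 ms and 1.12 s long.
sampling_rate = 16_000
audio_chunk = np.zeros(int(0.16 * sampling_rate), dtype=np.float32)

# The ASR pipeline accepts raw samples together with their sampling rate.
result = pipe({"raw": audio_chunk, "sampling_rate": sampling_rate})
print(result["text"])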

Practical Applications

Voice Assistants: Real-time transcription for faster response times in conversational AI.
Pitfall: Failing to configure the att_context_size parameter appropriately can lead to suboptimal latency-accuracy tradeoffs and potentially increase computational costs.
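As a rough guide to what the parameter controls, the sketch below maps an att_context_size value to the chunk duration it would imply. Two assumptions are not stated in this article and should be checked against the model card: that att_context_size is a [left, right] pair counted in encoder frames, and that each encoder frame spans 80 ms. The helper name is hypothetical.

```python
FRAME_MS = 80  # assumed duration of one encoder frame, in milliseconds


def chunk_latency_ms(att_context_size, frame_ms: int = FRAME_MS) -> int:
    """Chunk duration implied by a [left, right] attention context.

    Assumes the chunk is the current frame plus `right` look-ahead frames.
    The left context affects accuracy and cache size, not the chunk duration.
    """
    _left, right = att_context_size
    if right < 0:
        raise ValueError("right context must be non-negative")
    return (right + 1) * frame_ms


if __name__ == "__main__":
    # Illustrative settings; the left context of 70 is an assumption.
    for setting in ([70, 0], [70, 1], [70, 6], [70, 13]):
        print(f"att_context_size={setting} -> {chunk_latency_ms(setting)} ms chunk")
```

Under these assumptions a right context of 1 gives the 0.16s chunk and a right context of 13 gives the 1.12s chunk quoted above, so choosing a larger right context than the application needs adds latency for only a modest accuracy gain.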

References:

https://www.marktechpost.com/2026/01/06/nvidia-ai-released-nemotron-speech-asr-a-new-open-source-transcription-model-designed-from-the-ground-up-for-low-latency-use-cases-like-voice-agents/
