xAI Launches Grok STT and TTS APIs for Enterprise Voice Developers

These articles are AI-generated summaries. Please check the original sources for full details.

xAI Launches Standalone Grok Speech-to-Text and Text-to-Speech APIs, Targeting Enterprise Voice Developers

Elon Musk’s xAI has launched standalone Speech-to-Text (STT) and Text-to-Speech (TTS) APIs built on the same infrastructure that powers Grok Voice. The new STT engine has a reported 5.0% error rate on phone call entity recognition, compared with 12.0% for ElevenLabs.

Why This Matters

Enterprise voice applications often fail when processing technical entities like account numbers or currencies in noisy environments, where competitors like AssemblyAI see error rates as high as 21.3%. By providing built-in Inverse Text Normalization and speaker diarization, xAI addresses the gap between raw transcription and the structured, low-latency data required for legal, medical, and financial use cases.

Key Insights

  • Grok STT achieves a 5.0% error rate on phone call entity recognition versus Deepgram’s 13.5% (xAI Research, 2026).
  • Inverse Text Normalization automatically converts spoken phrases like ‘one hundred sixty-seven thousand dollars’ into structured output like ‘$167,000.00’.
  • Expressive TTS control is enabled through wrapping tags and inline tags such as [laugh] or [sigh], which reduce emotional flatness.
  • The APIs support 12 audio formats including raw formats like PCM, µ-law, and A-law for legacy telephony integration.
  • The TTS WebSocket streaming endpoint allows for unlimited text input length and immediate audio playback before full processing is complete.
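
To make the Inverse Text Normalization bullet concrete, here is a minimal Python sketch of the idea. It is not xAI's implementation and does not call the Grok APIs. The function names (words_to_int, normalize_currency) are hypothetical, and the sketch assumes English number words followed by "dollar" or "dollars", with no support for "and", cents, or other currencies.

```python
# Toy inverse text normalization (ITN) for spoken dollar amounts.
# Illustrative only: names and rules here are assumptions, not xAI's method.

UNITS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}
SCALES = {"thousand": 1_000, "million": 1_000_000, "billion": 1_000_000_000}
NUMBER_WORDS = set(UNITS) | set(TENS) | set(SCALES) | {"hundred"}


def words_to_int(words):
    """Convert a list of lowercase number words into an integer."""
    total = 0    # sum of completed scale groups (thousands, millions, ...)
    current = 0  # value of the group currently being accumulated
    for word in words:
        if word in UNITS:
            current += UNITS[word]
        elif word in TENS:
            current += TENS[word]
        elif word == "hundred":
            current = max(current, 1) * 100
        elif word in SCALES:
            total += max(current, 1) * SCALES[word]
            current = 0
        else:
            raise ValueError(f"not a number word: {word!r}")
    return total + current


def normalize_currency(text):
    """Rewrite '<number words> dollars' as '$N,NNN.00'; leave other text alone."""
    out, run = [], []
    for token in text.replace("-", " ").split():
        core = token.rstrip(".,;!?")
        suffix = token[len(core):]  # trailing punctuation, if any
        low = core.lower()
        if low in NUMBER_WORDS and not suffix:
            run.append(low)
        elif low in ("dollar", "dollars") and run:
            out.append(f"${words_to_int(run):,.2f}{suffix}")
            run = []
        else:
            out.extend(run)  # number words not followed by a currency word
            run = []
            out.append(token)
    out.extend(run)
    return " ".join(out)


print(normalize_currency("one hundred sixty-seven thousand dollars"))
# $167,000.00
```

A production ITN layer would also handle dates, account numbers, and ordinals, and would usually run inside the recognizer instead of as a post-processing pass on plain text.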

Practical Applications

  • Use case: Starlink customer support utilizes the stack for automated troubleshooting and real-time transcription. Pitfall: Using batch processing for live support calls leads to latency that breaks the conversational flow.
  • Use case: Enterprise meeting tools use speaker diarization to separate multi-speaker recordings into distinct transcripts. Pitfall: Lack of word-level timestamps in transcripts makes searching through video recordings nearly impossible for legal documentation.
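
The diarization use case and the timestamp pitfall can be illustrated with a short Python sketch. It assumes a transcript is available as a list of words, each carrying a speaker label and start/end times in seconds. The Word and Turn types and both function names are hypothetical and do not reflect the actual response schema of the Grok STT API.

```python
# Sketch: group diarized words into speaker turns and search by timestamp.
# The data shapes here are assumptions made for illustration only.
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float   # seconds from the start of the recording
    end: float
    speaker: str   # diarization label, e.g. "A" or "B"


@dataclass
class Turn:
    speaker: str
    start: float
    end: float
    text: str


def group_turns(words):
    """Merge consecutive words from the same speaker into one turn."""
    turns = []
    for word in words:
        if turns and turns[-1].speaker == word.speaker:
            last = turns[-1]
            last.end = word.end
            last.text += " " + word.text
        else:
            turns.append(Turn(word.speaker, word.start, word.end, word.text))
    return turns


def find_word(words, query):
    """Return (speaker, start_time) for every occurrence of a word."""
    target = query.lower()
    return [
        (w.speaker, w.start)
        for w in words
        if w.text.lower().strip(".,;!?") == target
    ]


if __name__ == "__main__":
    sample = [
        Word("Hello", 0.0, 0.4, "A"),
        Word("there.", 0.4, 0.8, "A"),
        Word("Hi.", 1.0, 1.2, "B"),
        Word("Contract", 1.5, 2.0, "A"),
    ]
    for turn in group_turns(sample):
        print(f"[{turn.start:.1f}-{turn.end:.1f}] {turn.speaker}: {turn.text}")
    print(find_word(sample, "contract"))
```

With word-level timestamps, a search hit maps directly to a position in the recording. Without them, find_word would have nothing to return beyond the matching text, which is the pitfall described above.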
