Elon Musk’s AI company xAI has launched two standalone audio APIs, a Speech-to-Text (STT) API and a Text-to-Speech (TTS) API, both built on the same infrastructure that powers Grok Voice on mobile apps, Tesla vehicles, and Starlink customer support. The release moves xAI squarely into the competitive speech API market currently occupied by ElevenLabs, Deepgram, and AssemblyAI.
What Is the Grok Speech-to-Text API?
Speech-to-Text is the technology that converts spoken audio into written text. For developers building meeting transcription tools, voice agents, call center analytics, or accessibility features, an STT API is a core building block. Rather than developing this from scratch, developers call an endpoint, send audio, and receive a structured transcript in return.
The Grok STT API is now generally available, offering transcription across 25 languages in both batch and streaming modes. Batch mode is designed for processing pre-recorded audio files, while streaming enables real-time transcription as audio is captured. Pricing is kept simple: Speech-to-Text costs $0.10 per hour for batch and $0.20 per hour for streaming.
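To make the pricing concrete, here is a minimal cost sketch in Python. The two rates come from the figures above; the function and constant names are illustrative only, and the sketch assumes billing is pro-rated linearly by audio duration, which the announcement does not spell out.

```python
# Per-hour prices quoted in the article. The names below are illustrative
# and are not part of any xAI SDK.
STT_RATE_PER_HOUR = {"batch": 0.10, "streaming": 0.20}


def estimate_stt_cost(audio_seconds: float, mode: str = "batch") -> float:
    """Return an estimated cost in USD for transcribing the given audio.

    Assumes linear, pro-rated billing with no minimum charge or rounding.
    """
    if mode not in STT_RATE_PER_HOUR:
        raise ValueError(f"unknown mode: {mode!r}")
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    hours = audio_seconds / 3600
    return hours * STT_RATE_PER_HOUR[mode]


if __name__ == "__main__":
    # 1,000 hours of recorded calls processed in batch mode
    print(f"batch:     ${estimate_stt_cost(1000 * 3600, 'batch'):.2f}")
    # 90 minutes of live audio transcribed in streaming mode
    print(f"streaming: ${estimate_stt_cost(90 * 60, 'streaming'):.2f}")
```

At these rates, streaming costs exactly twice as much as batch for the same audio, so workloads that do not need live results are cheaper to run in batch.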
The API includes word-level timestamps, speaker diarization, and multichannel support, along with intelligent Inverse Text Normalization that correctly handles numbers, dates, currencies, and more. It also accepts 12 audio formats: nine container formats (WAV, MP3, OGG, Opus, FLAC, AAC, MP4, M4A, MKV) and three raw formats (PCM, µ-law, A-law), with a maximum file size of 500 MB per request.
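A client can check the format and size limits before uploading. The sketch below is a hypothetical pre-flight check, not part of any xAI SDK. The mapping from format names to file extensions is an assumption (especially for the raw formats), and "500 MB" is read conservatively as 500 × 10^6 bytes.

```python
from pathlib import Path

# "500 MB" interpreted conservatively as decimal megabytes (an assumption).
MAX_FILE_BYTES = 500 * 10**6

# Extensions guessed from the format names in the article; the service may
# identify formats differently (for example by content type).
CONTAINER_EXTENSIONS = {
    ".wav", ".mp3", ".ogg", ".opus", ".flac", ".aac", ".mp4", ".m4a", ".mkv",
}
RAW_EXTENSIONS = {".pcm", ".ulaw", ".alaw"}  # hypothetical names for raw audio


def check_audio(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    ext = Path(filename).suffix.lower()
    if ext not in CONTAINER_EXTENSIONS | RAW_EXTENSIONS:
        problems.append(f"unsupported extension: {ext or '(none)'}")
    if size_bytes > MAX_FILE_BYTES:
        problems.append(
            f"file is {size_bytes} bytes, above the {MAX_FILE_BYTES}-byte limit"
        )
    return problems


if __name__ == "__main__":
    print(check_audio("meeting.flac", 120 * 10**6))  # no problems expected
    print(check_audio("meeting.wma", 600 * 10**6))   # two problems expected
```

Files that exceed the limit would need to be split or re-encoded before a batch request, or sent through the streaming mode instead.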
Speaker diarization is the process of separating audio by individual speakers, answering the question ‘who said what.’ This is essential for multi-speaker recordings like meetings, interviews, or customer calls. Word-level timestamps assign precise start and end times to each word in the transcript, enabling use cases like subtitle generation, searchable recordings, and legal documentation. Inverse Text Normalization converts spoken forms like ‘one hundred sixty-seven thousand nine hundred eighty-three dollars and fifteen cents’ into readable structured output: “$167,983.15”.
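As an illustration of how word-level timestamps and speaker labels combine for subtitle generation, the sketch below turns a list of timed words into SRT cues. The input shape (dictionaries with `text`, `start`, `end`, and `speaker` keys) is a hypothetical stand-in; the actual Grok STT response schema is not described here and may differ.

```python
def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: list[dict], max_cue_seconds: float = 5.0) -> str:
    """Group timed words into SRT cues.

    A new cue starts when the speaker changes or when adding the next word
    would stretch the cue beyond max_cue_seconds.
    """
    cues, current = [], []
    for word in words:
        if current and (
            word["speaker"] != current[0]["speaker"]
            or word["end"] - current[0]["start"] > max_cue_seconds
        ):
            cues.append(current)
            current = []
        current.append(word)
    if current:
        cues.append(current)

    blocks = []
    for index, cue in enumerate(cues, start=1):
        text = " ".join(w["text"] for w in cue)
        span = f"{srt_time(cue[0]['start'])} --> {srt_time(cue[-1]['end'])}"
        blocks.append(f"{index}\n{span}\nSpeaker {cue[0]['speaker']}: {text}")
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    # Hypothetical transcript fragment with two speakers.
    sample = [
        {"text": "Hello", "start": 0.0, "end": 0.4, "speaker": 0},
        {"text": "there", "start": 0.5, "end": 0.9, "speaker": 0},
        {"text": "Hi", "start": 1.2, "end": 1.5, "speaker": 1},
    ]
    print(words_to_srt(sample))
```

The same grouping logic would serve searchable recordings: each cue carries a speaker label and a time span that can be indexed.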
Benchmark Performance
xAI’s research team is making strong claims on accuracy. On phone call entity recognition (names, account numbers, dates), Grok STT claims a 5.0% error rate versus ElevenLabs at 12.0%, Deepgram at 13.5%, and AssemblyAI at 21.3%. That is a substantial margin if it holds in production. For video and podcast transcription, Grok and ElevenLabs tied at a 2.4% error rate, with Deepgram and AssemblyAI trailing at 3.0% and 3.2% respectively. The xAI team also reports a 6.9% word error rate on general audio benchmarks.
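For readers who want to check such figures on their own audio, word error rate is conventionally defined as the word-level edit distance (substitutions + deletions + insertions) divided by the number of words in the reference transcript. The sketch below implements that standard definition. How xAI normalized text or scored entities in its benchmarks is not stated, so this is not a reproduction of its methodology.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Compute WER as word-level Levenshtein distance / reference length.

    Text is lowercased and split on whitespace; real evaluations usually
    apply more normalization (punctuation, numerals) before scoring.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed prefix of ref
    # and the first j words of hyp.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # ref word missing from hyp
            insertion = curr[j - 1] + 1  # extra word in hyp
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    ref = "please transfer fifty dollars to account seven"
    hyp = "please transfer fifteen dollars to account seven"
    print(f"WER: {word_error_rate(ref, hyp):.1%}")  # one substitution in seven words
```

Because the score depends heavily on the test audio and on text normalization, vendor-reported error rates are best treated as a starting point for an in-house comparison.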



What Is the Grok Text-to-Speech API?
Text-to-Speech converts written text into spoken audio. Developers use TTS APIs to power voice assistants, read-aloud features, podcast generation, IVR (interactive voice response) systems, and accessibility tools.
The Grok TTS API delivers fast, natural speech synthesis with detailed control via speech tags, and is priced at $4.20 per 1 million characters. The API accepts up to 15,000 characters per REST request; for longer content, a WebSocket streaming endpoint is available that has no text length limit and begins returning audio before the full input is processed. The API supports 20 languages and five distinct voices: Ara, Eve, Leo, Rex, and Sal, with Eve set as the default.
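When the WebSocket endpoint is not an option, long text has to be divided to fit the 15,000-character REST limit. The sketch below shows one way to do that at sentence boundaries, along with a cost estimate at the quoted rate. The helper names are hypothetical, the sentence splitter is deliberately naive, and whether speech tags count toward billed characters is not stated, so the estimate simply counts every character.

```python
import re

MAX_REST_CHARS = 15_000            # per-request limit quoted in the article
TTS_RATE_PER_MILLION_CHARS = 4.20  # USD, quoted in the article


def split_for_rest(text: str, limit: int = MAX_REST_CHARS) -> list[str]:
    """Split text into chunks of at most `limit` characters.

    Breaks at sentence boundaries where possible; a single sentence longer
    than the limit is cut at the limit as a last resort.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        while len(sentence) > limit:  # oversized sentence: hard split
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


def estimate_tts_cost(char_count: int) -> float:
    """Estimated USD cost, assuming every character is billed."""
    return char_count / 1_000_000 * TTS_RATE_PER_MILLION_CHARS


if __name__ == "__main__":
    book_chapter = "This is a sentence. " * 2000  # about 40,000 characters
    parts = split_for_rest(book_chapter)
    print(len(parts), "requests, longest:", max(len(p) for p in parts))
    print(f"estimated cost: ${estimate_tts_cost(len(book_chapter)):.4f}")
```

Splitting at sentence boundaries keeps each request's audio from starting or ending mid-phrase, which matters when the resulting clips are concatenated.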
Beyond voice selection, developers can insert inline and wrapping speech tags to control delivery. These include inline tags like [laugh], [sigh], and [breath], as well as wrapping tags that enclose a span of text, letting developers create engaging, lifelike delivery without complex markup. This expressiveness addresses one of the core limitations of traditional TTS systems, which often produce technically correct but emotionally flat output.
Key Takeaways
- xAI has launched two standalone audio APIs, Grok Speech-to-Text (STT) and Text-to-Speech (TTS), built on the same production stack already serving millions of users across Grok mobile apps, Tesla vehicles, and Starlink customer support.
- The Grok STT API offers real-time and batch transcription across 25 languages with speaker diarization, word-level timestamps, Inverse Text Normalization, and support for 12 audio formats, priced at $0.10/hour for batch and $0.20/hour for streaming.
- On phone call entity recognition benchmarks, Grok STT reports a 5.0% error rate, significantly outperforming ElevenLabs (12.0%), Deepgram (13.5%), and AssemblyAI (21.3%), with particularly strong performance in medical, legal, and financial use cases.
- The Grok TTS API supports five expressive voices (Ara, Eve, Leo, Rex, Sal) across 20 languages, with inline speech tags like [laugh] and [sigh], plus wrapping tags, giving developers fine-grained control over vocal delivery, priced at $4.20 per 1 million characters.

