Gemini 3.1 Flash TTS: the next generation of expressive AI speech
What Happened
Gemini 3.1 Flash TTS is now available across Google products.
Our Take
Google shipped Gemini 3.1 Flash TTS across its product surface — Search, Assistant, Android. It's a native multimodal model generating speech directly, not a text-to-speech pipeline bolted onto an LLM. Latency and naturalness benchmarks aren't public yet, but the architecture eliminates the tokenize-then-vocalize bottleneck.
For teams building voice agents or audio-first interfaces, the real shift is cost. Flash-tier models run at a fraction of Pro pricing, and native TTS means one API call instead of chaining an LLM with ElevenLabs or Play.ht. Most developers overpay for voice by stitching two vendors together when a single multimodal endpoint handles both reasoning and speech.
If you're running a voice agent on OpenAI's TTS plus GPT-4o, benchmark Gemini Flash TTS on latency and per-token cost before your next billing cycle. Teams doing text-only RAG can ignore this entirely.
What To Do
Benchmark Gemini 3.1 Flash TTS against your current LLM-plus-TTS pipeline on p95 latency and cost per minute of generated audio; a single multimodal call removes the overhead of stitching two vendors together.
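A minimal sketch of that benchmark, assuming you wrap each pipeline behind the same callable interface (`synthesize` here is a placeholder for your own client code, not a real Gemini or vendor API):

```python
import time
import statistics
from typing import Callable, List


def bench_p95(synthesize: Callable[[str], object],
              prompts: List[str],
              warmup: int = 2) -> float:
    """Return p95 wall-clock latency in seconds for a TTS callable.

    `synthesize` is any callable that takes text and returns audio --
    wrap your current LLM+TTS chain and the candidate single-call
    endpoint the same way so the comparison is apples to apples.
    """
    for p in prompts[:warmup]:          # warm up connections / caches
        synthesize(p)
    samples = []
    for p in prompts:
        t0 = time.perf_counter()
        synthesize(p)
        samples.append(time.perf_counter() - t0)
    # quantiles(n=20) yields 19 cut points; the last is the 95th percentile
    return statistics.quantiles(samples, n=20)[-1]
```

Run both wrappers over the same prompt set, then put the two p95 numbers next to each provider's per-minute audio billing before deciding.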
What Skeptics Say
Google launches TTS models regularly and deprecates them just as fast. Flash-tier quality may not hold up for production voice agents where expressiveness and emotion control actually matter — ElevenLabs still owns that margin.