Gemini 3.1 Flash TTS Preview
Google · Gemini 3.1
Google's expressive Gemini 3.1 Flash TTS preview, an API-first text-to-speech model spanning 70+ languages.
Overview
Freshness note: Model capabilities, limits, and pricing can change quickly. This profile is a point-in-time snapshot last verified on May 1, 2026.
Gemini 3.1 Flash TTS Preview is Google’s text-to-speech model in the Gemini Flash line, launched on April 15, 2026. It takes text input and returns audio output, and it extends language coverage to 70+ languages, close to a 3x jump over the previous Gemini 2.5 Flash TTS generation. Google describes it as a substantial generational step in naturalness, expressiveness, and steerability through audio tags, and it is exposed across the Gemini API, AI Studio, Vertex AI, and Google Vids for Workspace.
The “Preview” tag means pricing and behavior may shift before general availability (GA). For production systems that depend on stable voices and pricing, plan for a re-baseline when the model leaves preview.
Capabilities
Google’s launch materials highlight a specific capability profile:
- Text-to-speech generation across 70+ languages, nearly tripling the previous coverage.
- Expressive control through audio tags that influence tone, pace, and delivery style.
- Steerability that supports voice continuity across longer narrations.
- API-first design with the same Gemini ergonomics used elsewhere in the family.
Token semantics are different from chat models: input tokens correspond to text, while output tokens are audio tokens billed at the audio-output rate.
Technical Details
Public anchors at this snapshot:
- Model ID: gemini-3.1-flash-tts-preview
- Input: text
- Output: audio (audio tokens, not generated text)
- Coverage: 70+ languages
- Available through the Gemini API, Google AI Studio, Vertex AI, and Google Vids
Because this is an audio-output model, the maxOutput field in this profile is set to 0 to reflect that token-style output ceilings do not apply directly. Audio output is billed and limited differently per Google’s TTS documentation.
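To make the text-in, audio-out shape concrete, here is a minimal sketch of a request body. It assumes the preview model accepts the same generateContent JSON shape documented for earlier Gemini TTS models (an audio response modality plus a speech config); that is not confirmed by this profile. The helper name build_tts_request and the default voice name "Kore" are illustrative placeholders, so check Google’s current TTS documentation before relying on any of them. The sketch only builds the payload and sends nothing.

```python
import json

# Model ID as listed in this profile.
MODEL_ID = "gemini-3.1-flash-tts-preview"


def build_tts_request(text: str, voice_name: str = "Kore") -> dict:
    """Build a generateContent-style JSON body that asks for audio output.

    Assumptions (not confirmed for this preview model):
    - the request shape matches earlier Gemini TTS models;
    - `voice_name` is a placeholder for whichever prebuilt voice is available.
    """
    return {
        "contents": [{"parts": [{"text": text}]}],
        "generationConfig": {
            # Request audio rather than generated text.
            "responseModalities": ["AUDIO"],
            "speechConfig": {
                "voiceConfig": {
                    "prebuiltVoiceConfig": {"voiceName": voice_name}
                }
            },
        },
    }


body = build_tts_request("Welcome back. Here is today's summary.")
print(MODEL_ID)
print(json.dumps(body, indent=2))
```

Keeping payload construction in a pure function makes it easy to swap the voice or re-check the shape when the model leaves preview, without touching the code that performs the network call.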
Pricing & Access
Listed Google pricing for Gemini 3.1 Flash TTS Preview:
- Input (text): $1.00 per 1M tokens
- Output (audio): $20.00 per 1M audio tokens
- Batch mode input: $0.50 per 1M tokens
- Batch mode output: $10.00 per 1M audio tokens
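The listed rates translate into a simple per-request estimate. The sketch below uses only the four prices above; the function name estimate_cost and the token counts in the example are illustrative assumptions, and real audio token counts must come from the API's usage reporting.

```python
# USD per 1M tokens, copied from the listed preview pricing (subject to change).
RATES = {
    "standard": {"input": 1.00, "output": 20.00},
    "batch": {"input": 0.50, "output": 10.00},
}


def estimate_cost(text_tokens: int, audio_tokens: int, mode: str = "standard") -> float:
    """Estimate the USD cost of one job from text-input and audio-output tokens."""
    if mode not in RATES:
        raise ValueError(f"unknown mode: {mode!r}")
    rate = RATES[mode]
    total = text_tokens * rate["input"] + audio_tokens * rate["output"]
    return total / 1_000_000


# Hypothetical job: 2,000 text tokens in, 50,000 audio tokens out.
print(f"standard: ${estimate_cost(2_000, 50_000):.3f}")
print(f"batch:    ${estimate_cost(2_000, 50_000, mode='batch'):.3f}")
```

For that hypothetical job the estimate is about $1.002 in standard mode and $0.501 in batch mode. Audio output dominates the bill, since it is priced at 20x the text-input rate.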
A free tier is available, though Google notes that data from free-tier usage may be used for product improvement.
Access options:
- Gemini API for developers
- Google AI Studio for free experimentation
- Vertex AI for enterprise users
- Google Vids for Workspace subscribers
Best Use Cases
Choose Gemini 3.1 Flash TTS Preview for:
- Multilingual voice generation across long-tail languages where competing TTS models still lag.
- Narration and content production that benefits from audio-tag-driven expressiveness.
- Workspace and Vids workflows that already live inside Google’s ecosystem.
- Prototyping voice features through the free Google AI Studio tier before locking in providers.
For premium voice quality with the most mature ecosystem of voice presets and cloning options, ElevenLabs remains the more conventional default. For OpenAI-native pipelines, OpenAI’s TTS models remain the easier integration.
Comparisons
- Gemini 2.5 Flash TTS (Google): Direct predecessor; 3.1 Flash TTS extends language coverage and expressive control.
- ElevenLabs voice models: More mature voice-cloning and preset ecosystem; Gemini 3.1 Flash TTS leans on Google integration and multilingual breadth.
- OpenAI TTS models: Comparable API-first TTS with simpler pricing; Gemini 3.1 Flash TTS differentiates on language coverage and steerability.