Gemini 3.1 Flash TTS Preview

Overview

Freshness note: Model capabilities, limits, and pricing can change quickly. This profile is a point-in-time snapshot last verified on June 8, 2026.

Gemini 3.1 Flash TTS Preview is Google’s April 2026 text-to-speech model in the Gemini Flash line. It takes text input, returns audio output, and covers 70+ languages. Google describes it as a substantial generational step in naturalness, expressiveness, and steerability through audio tags. It is exposed through the Gemini API, AI Studio, Vertex AI, and Google Vids for Workspace.

The “Preview” tag means pricing and behavior may shift before GA. For production systems that depend on stable voices and pricing, plan for a re-baseline when the model leaves preview.

Capabilities

Google’s launch materials highlight the following capabilities:

Text-to-speech generation across 70+ languages, nearly tripling the previous coverage.
Expressive control through audio tags that influence tone, pace, and delivery style.
Steerability that supports voice continuity across longer narrations.
API-first design with the same Gemini ergonomics used elsewhere in the family.

Token semantics differ from those of chat models: input tokens correspond to text, while output tokens are audio tokens billed at the audio-output rate.
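To make the API-first design concrete, the following is a minimal sketch of a request body for the Gemini API's generateContent endpoint. It assumes that this model accepts the same request shape Google documents for the earlier Gemini 2.5 TTS models (responseModalities set to AUDIO plus a speechConfig with a prebuilt voice). That shape, the voice name "Kore", and the helper names build_tts_request and endpoint_for are assumptions for illustration, not details confirmed by this profile. The sketch only builds the payload and does not send it.

```python
import json

# Model ID as listed in this profile.
MODEL_ID = "gemini-3.1-flash-tts-preview"


def build_tts_request(text: str, voice_name: str = "Kore") -> dict:
    """Build a generateContent JSON body that asks for audio output.

    Assumption: the request shape mirrors the documented Gemini 2.5 TTS
    format. "Kore" is a prebuilt voice from that earlier model and may
    not exist for this one.
    """
    if not text.strip():
        raise ValueError("text must be non-empty")
    return {
        "contents": [{"parts": [{"text": text}]}],
        "generationConfig": {
            "responseModalities": ["AUDIO"],
            "speechConfig": {
                "voiceConfig": {
                    "prebuiltVoiceConfig": {"voiceName": voice_name}
                }
            },
        },
    }


def endpoint_for(model_id: str = MODEL_ID) -> str:
    """Return the REST URL for generateContent (v1beta path assumed)."""
    base = "https://generativelanguage.googleapis.com/v1beta/models"
    return f"{base}/{model_id}:generateContent"


if __name__ == "__main__":
    body = build_tts_request("Say warmly: welcome back!")
    print(endpoint_for())
    print(json.dumps(body, indent=2))
```

Keeping payload construction separate from the HTTP call makes it easy to unit-test the request and to swap in an official SDK later. Check the current Gemini API reference before relying on any field name.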

Technical Details

Public anchors at this snapshot:

Model ID: gemini-3.1-flash-tts-preview
Input: text
Output: audio (audio tokens, not generated text)
Input token limit: 8,192
Output token limit: 16,384
Coverage: 70+ languages
Available through the Gemini API, Google AI Studio, Vertex AI, and Google Vids

Because this is an audio-output model, the output token limit listed above is Google’s published audio-output token limit rather than a generated-text ceiling. Google’s model page lists audio generation and Batch API support. It lists caching, code execution, file search, function calling, grounding, structured outputs, thinking, URL context, image generation, and the Live API as not supported for this model.
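The limits and the feature list above can be encoded as a small pre-flight check that runs before a request is sent. This is a sketch only: the feature keys (for example "function_calling") and the preflight function are hypothetical names chosen for illustration, and the token count is expected to come from the caller, for instance from a token-counting API call.

```python
# Limits and feature support as listed in this snapshot.
INPUT_TOKEN_LIMIT = 8_192
OUTPUT_TOKEN_LIMIT = 16_384

SUPPORTED = {"audio_generation", "batch_api"}
UNSUPPORTED = {
    "caching", "code_execution", "file_search", "function_calling",
    "grounding", "structured_outputs", "thinking", "url_context",
    "image_generation", "live_api",
}


def preflight(requested_features, input_tokens):
    """Return a list of human-readable problems; empty means OK.

    The feature keys are illustrative labels, not API identifiers.
    """
    problems = []
    for feature in sorted(set(requested_features)):
        if feature in UNSUPPORTED:
            problems.append(f"feature not supported: {feature}")
        elif feature not in SUPPORTED:
            problems.append(f"unknown feature: {feature}")
    if input_tokens > INPUT_TOKEN_LIMIT:
        problems.append(
            f"input of {input_tokens} tokens exceeds limit of {INPUT_TOKEN_LIMIT}"
        )
    return problems


if __name__ == "__main__":
    for issue in preflight(["audio_generation", "function_calling"], 9_000):
        print(issue)
```

Returning a list of problems instead of raising on the first one lets a caller report every issue at once, which is useful when validating batch jobs.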

Pricing & Access

Listed Google pricing for Gemini 3.1 Flash TTS Preview:

Input (text): $1.00 per 1M tokens
Output (audio): $20.00 per 1M audio tokens
Batch mode input: $0.50 per 1M tokens
Batch mode output: $10.00 per 1M audio tokens
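The listed rates translate into a simple per-request cost estimate. The sketch below uses the four prices above; the estimate_cost helper and the example token counts are hypothetical, and the real audio-token count for a given clip has to come from the API's usage metadata.

```python
from decimal import Decimal

# USD per 1M tokens, as listed in this snapshot.
RATES = {
    "standard": {"input": Decimal("1.00"), "output": Decimal("20.00")},
    "batch": {"input": Decimal("0.50"), "output": Decimal("10.00")},
}
TOKENS_PER_UNIT = Decimal(1_000_000)


def estimate_cost(input_tokens: int, audio_tokens: int, mode: str = "standard") -> Decimal:
    """Estimate the USD cost of one request from its token counts."""
    if mode not in RATES:
        raise ValueError(f"unknown mode: {mode}")
    if input_tokens < 0 or audio_tokens < 0:
        raise ValueError("token counts must be non-negative")
    rate = RATES[mode]
    total = Decimal(input_tokens) * rate["input"] + Decimal(audio_tokens) * rate["output"]
    return total / TOKENS_PER_UNIT


if __name__ == "__main__":
    # Illustrative request: 2,000 text tokens in, 16,000 audio tokens out.
    print("standard:", estimate_cost(2_000, 16_000))
    print("batch:   ", estimate_cost(2_000, 16_000, mode="batch"))
```

For the illustrative request, the standard estimate is 2,000 × $1.00 / 1M + 16,000 × $20.00 / 1M = $0.002 + $0.32 = $0.322, and batch mode halves it to $0.161. Audio output dominates the bill, so output length matters far more than prompt length.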

A free tier is available, though Google notes that data from free-tier usage may be used for product improvement.

Google’s deprecation table lists no shutdown date for gemini-3.1-flash-tts-preview at this snapshot.

Access options:

Gemini API for developers
Google AI Studio for free experimentation
Vertex AI for enterprise users
Google Vids for Workspace subscribers

Best Use Cases

Choose Gemini 3.1 Flash TTS Preview for:

Multilingual voice generation across long-tail languages where competing TTS models still lag.
Narration and content production that benefits from audio-tag-driven expressiveness.
Workspace and Vids workflows that already live inside Google’s ecosystem.
Prototyping voice features through the free Google AI Studio tier before locking in providers.

For premium voice quality with the most mature ecosystem of voice presets and cloning options, ElevenLabs remains the more conventional default. For OpenAI-native pipelines, OpenAI’s TTS models remain the easier integration.

Comparisons

Gemini 2.5 Flash TTS (Google): Direct predecessor; 3.1 Flash TTS extends language coverage and expressive control.
ElevenLabs voice models: More mature voice-cloning and preset ecosystem; Gemini 3.1 Flash TTS leans on Google integration and multilingual breadth.
OpenAI TTS models: Comparable API-first TTS with simpler pricing; Gemini 3.1 Flash TTS differentiates on language coverage and steerability.