Gemini 3.1 Flash Live Preview

Overview

Freshness note: Model capabilities, limits, and pricing can change quickly. This profile is a point-in-time snapshot last verified on June 8, 2026.

Gemini 3.1 Flash Live Preview is Google’s current low-latency live dialogue model for realtime multimodal interaction. Google positions it as an audio-to-audio system optimized for real-time dialogue, acoustic nuance, numeric precision, and multimodal awareness, rather than a standard text model with speech bolted on afterward.

That makes it relevant for voice agents, guided assistants, live tutoring, and operational support flows where turn-taking quality matters as much as raw reasoning.

Capabilities

Google’s current model page lists support for text, image, audio, and video inputs with text and audio outputs. It also shows Live API support, function calling, search grounding, and thinking support, which means the model can call tools and ground its answers during a live session instead of serving only as a speech demo.

The interesting distinction is that some common platform capabilities are intentionally absent. Google’s current docs mark batch usage, caching, code execution, structured outputs, and URL context as unsupported here. That is a useful practical signal: the model is specialized for live interactive work, not broad batch automation.
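One way to put this capability list to work is a pre-flight check before routing a workload to the model. The sketch below hard-codes the supported and unsupported features as listed above at this snapshot. The feature keys and the names `SUPPORTED`, `UNSUPPORTED`, and `missing_features` are illustrative assumptions, not part of any Google SDK.

```python
# Feature flags transcribed from the capability list above (snapshot only).
# The keys and helper names are hypothetical, not part of any Google SDK.
SUPPORTED = {"live_api", "function_calling", "search_grounding", "thinking"}
UNSUPPORTED = {
    "batch",
    "caching",
    "code_execution",
    "structured_outputs",
    "url_context",
}


def missing_features(required):
    """Return the required features the docs mark as unsupported, sorted by name."""
    return sorted(set(required) & UNSUPPORTED)


def is_good_fit(required):
    """True only if every required feature is in the documented supported set."""
    return set(required) <= SUPPORTED
```

A workload that needs `structured_outputs` or `batch` would fail this check, which matches the guidance that such work belongs on a non-live route.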

Technical Details

Google’s model docs list:

Model code: gemini-3.1-flash-live-preview
Input token limit: 131,072
Output token limit: 65,536
Inputs: text, image, audio, and video
Outputs: text and audio

The same docs position it as part of the Live API surface, which is the real implementation constraint. Teams should think in terms of session behavior, latency, and dialogue quality, not just one-off prompt-response throughput. Google’s migration notes also say the default thinking level is minimal for latency, async function calling is not yet supported, and proactive audio plus affective dialogue are not yet supported on the 3.1 Flash Live model.
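If a session's accumulated context counts against the input limit (an assumption in this sketch, not something the listing above states), the limit behaves more like a running budget than a per-request cap. A minimal sketch of tracking that budget follows; the helper names are hypothetical, and how token counts are obtained is left to the application.

```python
# Limits transcribed from the model listing above (snapshot only).
INPUT_TOKEN_LIMIT = 131_072
OUTPUT_TOKEN_LIMIT = 65_536


def fits_limits(input_tokens, max_output_tokens):
    """Check a planned turn against the documented input and output limits."""
    return (
        0 <= input_tokens <= INPUT_TOKEN_LIMIT
        and 0 <= max_output_tokens <= OUTPUT_TOKEN_LIMIT
    )


def input_headroom(used_tokens):
    """Tokens left before the input limit is reached, never below zero."""
    return max(0, INPUT_TOKEN_LIMIT - used_tokens)
```

A session manager could call `input_headroom` after each turn and trim or summarize history when the remaining budget gets small.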

Pricing & Access

Google’s current pricing page lists paid-tier pricing at:

Input: $0.75 per 1M text tokens
Input: $3.00 per 1M audio tokens or $0.005/min
Input: $1.00 per 1M image/video tokens or $0.002/min
Output: $4.50 per 1M text tokens
Output: $12.00 per 1M audio tokens or $0.018/min

Signal Lens stores the text input and text output rates in frontmatter for baseline comparison, but real deployment cost depends heavily on modality mix. As of this snapshot, Google’s deprecation table lists no shutdown date for gemini-3.1-flash-live-preview.
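To see how modality mix drives cost, the listed rates can be applied to a rough usage estimate. The sketch below transcribes the paid-tier rates above and offers two hypothetical helpers, one for token-based and one for per-minute estimates. The dictionary keys and function names are assumptions for illustration, and real billing may differ from this simple arithmetic.

```python
# Paid-tier rates transcribed from the list above (USD, snapshot only).
# Keys and function names are hypothetical labels for this sketch.
RATE_PER_1M_TOKENS = {
    "text_in": 0.75,
    "audio_in": 3.00,
    "image_video_in": 1.00,
    "text_out": 4.50,
    "audio_out": 12.00,
}
RATE_PER_MINUTE = {
    "audio_in": 0.005,
    "image_video_in": 0.002,
    "audio_out": 0.018,
}


def token_cost(tokens_by_kind):
    """Estimate cost in USD from token counts keyed by modality and direction."""
    return sum(
        RATE_PER_1M_TOKENS[kind] * count / 1_000_000
        for kind, count in tokens_by_kind.items()
    )


def minute_cost(minutes_by_kind):
    """Estimate cost in USD from minutes of media keyed by modality and direction."""
    return sum(
        RATE_PER_MINUTE[kind] * minutes
        for kind, minutes in minutes_by_kind.items()
    )
```

For example, `minute_cost({"audio_in": 10, "audio_out": 10})` comes to roughly $0.23, with audio output accounting for most of it. This is why a voice-heavy deployment can look very different from the text-only baseline.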

Best Use Cases

Use Gemini 3.1 Flash Live Preview for realtime voice agents, multimodal live support, conversational tutoring, guided demos, or assistant experiences where interruptibility and natural audio behavior matter.

It is a weak fit for offline summarization, large batch processing, or heavily structured JSON pipelines. Those are better served by non-live Gemini routes.

Comparisons

Gemini 2.5 Flash Native Audio (Google): 2.5 remains a solid live-audio route; 3.1 Flash Live is the newer Gemini 3.1 live lane.
Gemini 3 Flash Preview (Google): Flash Preview is broader and better for non-live agentic work, while Flash Live is specialized for realtime dialogue.
GPT Realtime-style stacks: Same general class of product, with the platform choice usually driven by ecosystem fit, tooling, and deployment preferences.