Gemini 3.1 Flash Live Preview

Google · Gemini 3.1

Google's low-latency Gemini 3.1 live model for realtime audio-to-audio and multimodal dialogue.

Type: multimodal
Context: 131K tokens
Max Output: 66K tokens
Status: preview
Input: $0.75/1M tok
Output: $4.50/1M tok
API Access: Yes
License: proprietary
Tags: live-api, native-audio, multimodal, voice-agents, realtime, preview
Released March 2026 · Updated April 13, 2026

Overview

Freshness note: Model capabilities, limits, and pricing can change quickly. This profile is a point-in-time snapshot last verified on April 13, 2026.

Gemini 3.1 Flash Live Preview is Google’s current low-latency live dialogue model for realtime multimodal interaction. Google positions it as an audio-to-audio system optimized for realtime dialogue, acoustic nuance, numeric precision, and multimodal awareness, rather than a standard text model with speech added afterward.

That makes it relevant for voice agents, guided assistants, live tutoring, and operational support flows where turn-taking quality matters as much as raw reasoning.

Capabilities

Google’s current model page lists support for text, image, audio, and video inputs with text and audio outputs. It also shows Live API support, function calling, search grounding, and thinking support, so the model can call tools and ground its answers during a live session rather than serving only as a speech interface.

Equally notable is what is missing. Google’s current docs mark batch usage, caching, code execution, structured outputs, and URL context as unsupported here. The practical signal is that the model is specialized for live interactive work, not broad batch automation.
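As a quick way to apply the lists above when scoping a project, the sketch below records the supported and unsupported features as plain data and reports which requirements would rule this model out. The feature names and the `unmet_requirements` helper are hypothetical labels chosen for illustration, not identifiers from Google's API, and the flags are a snapshot of this profile only.

```python
# Capability flags transcribed from this profile (point-in-time snapshot).
# The string labels are illustrative, not official API identifiers.
SUPPORTED = {"live_api", "function_calling", "search_grounding", "thinking"}
UNSUPPORTED = {
    "batch",
    "caching",
    "code_execution",
    "structured_outputs",
    "url_context",
}


def unmet_requirements(required):
    """Return the required features this profile lists as unsupported."""
    return sorted(set(required) & UNSUPPORTED)


if __name__ == "__main__":
    needs = ["function_calling", "structured_outputs", "batch"]
    print(unmet_requirements(needs))
```

An empty result only means nothing on the unsupported list was requested; features that appear in neither set still need to be checked against Google's documentation.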

Technical Details

Google’s model docs list:

  • Model code: gemini-3.1-flash-live-preview
  • Input token limit: 131,072
  • Output token limit: 65,536
  • Inputs: text, image, audio, and video
  • Outputs: text and audio

The same docs position it as part of the Live API surface, which is the real implementation constraint. Teams should think in terms of session behavior, latency, and dialogue quality, not just one-off prompt-response throughput.
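The listed limits can be encoded as a simple pre-flight check. This is a minimal sketch that assumes the caller already has token counts from some other source; how tokens accumulate over a live session is not described in this profile, and the `fits_limits` helper is hypothetical.

```python
# Token limits as listed in this profile (point-in-time snapshot).
INPUT_TOKEN_LIMIT = 131_072
OUTPUT_TOKEN_LIMIT = 65_536


def fits_limits(input_tokens: int, max_output_tokens: int) -> bool:
    """Check a planned request against the listed input and output limits."""
    input_ok = 0 <= input_tokens <= INPUT_TOKEN_LIMIT
    output_ok = 0 < max_output_tokens <= OUTPUT_TOKEN_LIMIT
    return input_ok and output_ok


if __name__ == "__main__":
    print(fits_limits(100_000, 4_096))
    print(fits_limits(131_073, 4_096))
```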

Pricing & Access

Google’s current pricing page lists paid-tier pricing at:

  • Input: $0.75 per 1M text tokens
  • Input: $3.00 per 1M audio tokens or $0.005/min
  • Input: $1.00 per 1M image/video tokens or $0.002/min
  • Output: $4.50 per 1M text tokens
  • Output: $12.00 per 1M audio tokens or $0.018/min

Signal Lens stores the text input and text output rates in frontmatter for baseline comparison, but real deployment cost depends heavily on modality mix.
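To see how much the modality mix moves the bill, the per-token rates listed above can be combined into a rough estimator. This is a sketch under stated assumptions: it uses only the per-1M-token rates from this snapshot, ignores the per-minute alternatives and any free tier, and the `estimate_cost` helper and its key names are hypothetical.

```python
# USD per 1M tokens, transcribed from the pricing list in this profile.
# Keys are (direction, modality); the labels are illustrative.
RATES_PER_MILLION = {
    ("input", "text"): 0.75,
    ("input", "audio"): 3.00,
    ("input", "image_video"): 1.00,
    ("output", "text"): 4.50,
    ("output", "audio"): 12.00,
}


def estimate_cost(usage):
    """Estimate USD cost from a mapping of (direction, modality) to tokens."""
    return sum(
        RATES_PER_MILLION[key] * tokens / 1_000_000
        for key, tokens in usage.items()
    )


if __name__ == "__main__":
    session = {
        ("input", "audio"): 200_000,
        ("input", "text"): 50_000,
        ("output", "audio"): 100_000,
    }
    print(f"${estimate_cost(session):.4f}")
```

In this example the audio output alone accounts for $1.20 of roughly $1.84, while the text input contributes under four cents, which is why the text-only baseline rates understate the cost of a voice-heavy deployment.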

Best Use Cases

Use Gemini 3.1 Flash Live Preview for realtime voice agents, multimodal live support, conversational tutoring, guided demos, or assistant experiences where interruptibility and natural audio behavior matter.

It is a weak fit for offline summarization, large batch processing, or heavily structured JSON pipelines. Those are better served by non-live Gemini routes.

Comparisons

  • Gemini 2.5 Flash Native Audio (Google): 2.5 remains a solid live-audio route; 3.1 Flash Live is the newer Gemini 3.1 live lane.
  • Gemini 3 Flash Preview (Google): Flash Preview is broader and better for non-live agentic work, while Flash Live is specialized for realtime dialogue.
  • GPT Realtime-style stacks: Same general class of product, with the platform choice usually driven by ecosystem fit, tooling, and deployment preferences.