GPT-Realtime-2 — Signal Lens

Overview

Freshness note: Model capabilities, limits, and pricing can change quickly. This profile is a point-in-time snapshot last verified on June 8, 2026.

GPT-Realtime-2 is OpenAI’s May 2026 flagship realtime voice model. OpenAI positions it as the first GPT-5-class voice model in the Realtime API: a speech-to-speech model that can reason through harder requests, keep a live conversation moving, use tools, and recover more gracefully when a task changes midstream.

Treat GPT-Realtime-2 as the forward-looking replacement for older GPT Realtime routes when a product needs higher voice-agent intelligence, not just low latency.

Capabilities

GPT-Realtime-2 is built for live voice agents that need to listen, reason, act, and speak while the conversation is still unfolding. OpenAI highlights stronger instruction following, better tool-calling reliability, audible tool transparency through short preambles, and better handling of specialized terminology in domains such as healthcare and customer support.

The model also raises the practical ceiling for long-running voice sessions. OpenAI increased the realtime context window from the earlier 32K tier to 128K tokens, which makes it more realistic to build agents that preserve customer context, tool results, and conversation state across longer interactions.
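Even with a 128K window, a long session can eventually outgrow it, so a client may want to track its own budget. The following is a minimal sketch, not OpenAI's method: the HistoryItem type, the precomputed per-item token counts, and the 32,000-token reserve are all assumptions made for illustration. It keeps the most recent items that fit in the window after setting aside the reserve.

```python
from dataclasses import dataclass
from typing import List

CONTEXT_WINDOW = 128_000  # published context window from this profile


@dataclass
class HistoryItem:
    """Hypothetical record of one conversation item and its token count."""
    label: str
    tokens: int  # assumed to be measured elsewhere; not computed here


def trim_history(items: List[HistoryItem], reserve: int = 32_000) -> List[HistoryItem]:
    """Keep the newest items that fit in CONTEXT_WINDOW minus a reserve.

    The reserve is an assumed safety margin for the model's reply and
    tool results; tune it for the actual workload.
    """
    budget = CONTEXT_WINDOW - reserve
    kept: List[HistoryItem] = []
    used = 0
    # Walk from newest to oldest and stop at the first item that does not fit.
    for item in reversed(items):
        if used + item.tokens > budget:
            break
        kept.append(item)
        used += item.tokens
    kept.reverse()  # restore chronological order
    return kept


history = [
    HistoryItem("early account lookup", 60_000),
    HistoryItem("tool results", 50_000),
    HistoryItem("latest turns", 40_000),
]
print([item.label for item in trim_history(history)])
```

Dropping the oldest items first is the simplest policy. A production agent might instead summarize older turns so that customer context survives trimming.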

Technical Details

Current published limits:

Context window: 128,000 tokens
Max output: 32,000 tokens
Configurable reasoning effort: minimal, low, medium, high, and xhigh
Function calling: supported
Structured outputs: not supported

GPT-Realtime-2 supports text, audio, and image input, with text and audio output through realtime interaction paths. It is suitable for tool-using voice agents, but teams that require strict JSON contracts still need validation and fallback layers because structured outputs are not supported on this route.
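A validation and fallback layer of the kind mentioned above could look roughly like this. It is a sketch under stated assumptions: the refund tool, its field names (order_id, refund_amount), and the shape of the fallback response are hypothetical and do not come from the Realtime API. Only the standard-library json module is used.

```python
import json
from typing import Any, Dict, Optional

# Hypothetical contract for one tool's arguments: field name -> allowed types.
REQUIRED_FIELDS = {
    "order_id": (str,),
    "refund_amount": (int, float),
}


def parse_tool_arguments(raw: str) -> Optional[Dict[str, Any]]:
    """Return the parsed arguments if they satisfy the contract, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    for name, allowed in REQUIRED_FIELDS.items():
        value = data.get(name)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, allowed):
            return None
    return data


def handle_tool_call(raw: str) -> Dict[str, Any]:
    """Validate first; fall back to a clarification request on failure."""
    args = parse_tool_arguments(raw)
    if args is None:
        # Fallback path: do not run the tool; ask the agent to re-collect input.
        return {"status": "needs_clarification"}
    return {"status": "ok", "arguments": args}


print(handle_tool_call('{"order_id": "A-1001", "refund_amount": 25.5}'))
print(handle_tool_call('{"order_id": "A-1001"'))  # malformed JSON
```

The fallback here only reports that clarification is needed. What the agent does next (re-ask the caller, retry the tool call, or hand off to a human) is a product decision.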

Pricing & Access

Published OpenAI API pricing:

Text input: $4.00 per 1M tokens
Cached text input: $0.40 per 1M tokens
Text output: $24.00 per 1M tokens
Audio input: $32.00 per 1M tokens
Cached audio input: $0.40 per 1M tokens
Audio output: $64.00 per 1M tokens
Image input: $5.00 per 1M tokens
Cached image input: $0.50 per 1M tokens

Access is through OpenAI’s Realtime API and related realtime-oriented endpoints. Budget with actual session traffic, not just text-token estimates: production voice products are usually dominated by audio tokens and live-session duration.
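To show why audio tends to dominate the bill, here is a small cost estimator built on the prices listed above. The per-session token counts in the example are invented for illustration, and the function name and dictionary keys are hypothetical. Real counts should come from measured session traffic.

```python
from typing import Dict

# USD per 1M tokens, copied from the published prices in this profile.
PRICE_PER_MILLION: Dict[str, float] = {
    "text_input": 4.00,
    "cached_text_input": 0.40,
    "text_output": 24.00,
    "audio_input": 32.00,
    "cached_audio_input": 0.40,
    "audio_output": 64.00,
    "image_input": 5.00,
    "cached_image_input": 0.50,
}


def estimate_session_cost(token_counts: Dict[str, int]) -> float:
    """Sum the cost in USD for one session, given token counts per category."""
    total = 0.0
    for kind, tokens in token_counts.items():
        if kind not in PRICE_PER_MILLION:
            raise KeyError(f"unknown token category: {kind}")
        total += tokens / 1_000_000 * PRICE_PER_MILLION[kind]
    return total


# Hypothetical token counts for a single support call.
example_session = {
    "audio_input": 20_000,
    "audio_output": 30_000,
    "text_input": 5_000,
    "text_output": 1_000,
}
print(f"Estimated cost: ${estimate_session_cost(example_session):.3f}")
```

In this made-up session, the audio categories account for $2.56 of roughly $2.60 in total, which is the pattern the budgeting advice above warns about.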

Best Use Cases

Use GPT-Realtime-2 for customer-support agents, phone and browser voice assistants, travel or commerce copilots, healthcare intake helpers with human oversight, field-service assistants, and multilingual support flows where interruptions and tool calls happen during conversation.

Avoid using it as a default for offline transcription, batch audio analysis, or text-only agents. Those workloads are usually cheaper and simpler on transcription, translation, or non-realtime text models.

Comparisons

GPT-realtime-1.5 (OpenAI): Older realtime route that remains available; GPT-Realtime-2 is the upgrade for stronger reasoning and tool use.
GPT-Realtime-Translate (OpenAI): Dedicated live translation model, priced per minute; it is not a general voice-agent model.
Gemini 3.1 Flash Live Preview (Google): Comparable realtime voice-agent category, with provider choice usually driven by ecosystem fit, latency behavior, and tool integration.