GPT-Realtime-2
OpenAI · GPT Realtime
OpenAI's GPT-5-class realtime voice model for reasoning, tool-using speech agents, and live support workflows.
Overview
Freshness note: Model capabilities, limits, and pricing can change quickly. This profile is a point-in-time snapshot last verified on May 8, 2026.
GPT-Realtime-2 is OpenAI’s May 2026 flagship realtime voice model. OpenAI positions it as the first GPT-5-class voice model in the Realtime API: a speech-to-speech model that can reason through harder requests, keep a live conversation moving, use tools, and recover more gracefully when a task changes midstream.
Treat GPT-Realtime-2 as the forward-looking replacement for older GPT Realtime routes when a product needs higher voice-agent intelligence rather than just low latency.
Capabilities
GPT-Realtime-2 is built for live voice agents that need to listen, reason, act, and speak while the conversation is still unfolding. OpenAI highlights stronger instruction following, better tool-calling reliability, audible tool transparency through short preambles, and better handling of specialized terminology in domains such as healthcare and customer support.
The model also raises the practical ceiling for long-running voice sessions. OpenAI increased the realtime context window from the earlier 32K tier to 128K tokens, which makes it more realistic to build agents that preserve customer context, tool results, and conversation state across longer interactions.
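A larger window still fills up in long sessions, so an agent usually needs some policy for what to drop. The following is a minimal sketch of one such policy (discard the oldest items first), not OpenAI's own session handling. It assumes the application already knows a token count for each conversation item; the `trim_history` helper, the item labels, and the `reserve_tokens` headroom value are hypothetical.

```python
from collections import deque

CONTEXT_WINDOW_TOKENS = 128_000  # context window cited in this profile


def trim_history(items, reserve_tokens=8_000):
    """Drop the oldest items until the history fits the token budget.

    items: list of (label, token_count) tuples, oldest first. The token
    counts are assumed to come from the application's own accounting.
    reserve_tokens: hypothetical headroom kept free for the next turn.
    Returns (kept_items, total_tokens_kept).
    """
    budget = CONTEXT_WINDOW_TOKENS - reserve_tokens
    kept = deque(items)
    total = sum(tokens for _, tokens in kept)
    while kept and total > budget:
        _, tokens = kept.popleft()  # discard the oldest item first
        total -= tokens
    return list(kept), total


# Hypothetical session history, oldest first.
history = [
    ("greeting and account lookup", 70_000),
    ("tool results", 50_000),
    ("latest turns", 20_000),
]
kept_items, kept_tokens = trim_history(history)
```

Dropping the oldest items is the simplest policy. A production agent might instead summarize old turns or pin important tool results so that customer context survives trimming.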
Technical Details
Current published limits:
- Context window: 128,000 tokens
- Max output: 32,000 tokens
- Configurable reasoning effort: minimal, low, medium, high, and xhigh
- Function calling: supported
- Structured outputs: not supported
GPT-Realtime-2 accepts text, audio, and image input and produces text and audio output over realtime sessions. It suits tool-using voice agents, but teams that require strict JSON contracts still need their own validation and fallback layers, because structured outputs are not supported on this route.
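One possible shape for such a validation layer is sketched below. It uses only the Python standard library and checks a function call's argument string before the application acts on it. The `parse_tool_arguments` helper, the `LOOKUP_ORDER_SCHEMA` mapping, and its field names are hypothetical illustrations, not part of any OpenAI SDK.

```python
import json

# Hypothetical tool contract: argument name -> required Python type.
LOOKUP_ORDER_SCHEMA = {"order_id": str, "include_history": bool}


def parse_tool_arguments(raw, schema):
    """Validate a JSON argument string against a flat name -> type schema.

    Returns (arguments, None) on success or (None, error_message) on
    failure, so the caller can choose a fallback such as asking the
    model to retry or handing off to a human.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc.msg}"
    if not isinstance(data, dict):
        return None, "arguments must be a JSON object"
    for name, expected_type in schema.items():
        if name not in data:
            return None, f"missing field: {name}"
        if not isinstance(data[name], expected_type):
            return None, f"wrong type for field: {name}"
    extra = sorted(set(data) - set(schema))
    if extra:
        return None, f"unexpected fields: {extra}"
    return data, None


# Well-formed arguments pass; truncated JSON is rejected with a reason.
ok_args, ok_error = parse_tool_arguments(
    '{"order_id": "A1", "include_history": true}', LOOKUP_ORDER_SCHEMA
)
bad_args, bad_error = parse_tool_arguments('{"order_id": 7', LOOKUP_ORDER_SCHEMA)
```

Returning an error string instead of raising keeps the live conversation moving: the agent can speak a short recovery line while the application retries the call or falls back.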
Pricing & Access
Published OpenAI API pricing:
- Text input: $4.00 per 1M tokens
- Cached text input: $0.40 per 1M tokens
- Text output: $24.00 per 1M tokens
- Audio input: $32.00 per 1M tokens
- Cached audio input: $0.40 per 1M tokens
- Audio output: $64.00 per 1M tokens
- Image input: $5.00 per 1M tokens
- Cached image input: $0.50 per 1M tokens
Access is through OpenAI’s Realtime API and related realtime-oriented endpoints. Budget from measured session traffic, not just text-token estimates: production voice products are usually dominated by audio tokens and live-session duration.
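To make the budgeting point concrete, the sketch below applies the per-token rates listed above to one session's token counts. The rates are copied from this profile and may change; the `estimate_session_cost` helper and the example token counts are hypothetical and should be replaced with measured usage.

```python
# USD per 1M tokens, copied from the published rates in this profile.
RATES_PER_MILLION = {
    "text_input": 4.00,
    "cached_text_input": 0.40,
    "text_output": 24.00,
    "audio_input": 32.00,
    "cached_audio_input": 0.40,
    "audio_output": 64.00,
    "image_input": 5.00,
    "cached_image_input": 0.50,
}


def estimate_session_cost(token_counts):
    """Return the estimated USD cost for one session.

    token_counts maps a rate category (a key of RATES_PER_MILLION)
    to the number of tokens billed in that category.
    """
    unknown = sorted(set(token_counts) - set(RATES_PER_MILLION))
    if unknown:
        raise ValueError(f"unknown token categories: {unknown}")
    return sum(
        RATES_PER_MILLION[category] * tokens / 1_000_000
        for category, tokens in token_counts.items()
    )


# Hypothetical session: a small text prompt plus mostly audio traffic.
example_session = {
    "text_input": 2_000,
    "audio_input": 10_000,
    "audio_output": 12_000,
}
cost = estimate_session_cost(example_session)
print(f"Estimated session cost: ${cost:.3f}")
```

In this hypothetical session the text prompt contributes under one cent while audio accounts for nearly all of the cost, which is why text-only estimates tend to understate voice budgets.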
Best Use Cases
Use GPT-Realtime-2 for customer-support agents, phone and browser voice assistants, travel or commerce copilots, healthcare intake helpers with human oversight, field-service assistants, and multilingual support flows where interruptions and tool calls happen during conversation.
Avoid using it as a default for offline transcription, batch audio analysis, or text-only agents. Those workloads are usually cheaper and simpler on transcription, translation, or non-realtime text models.
Comparisons
- GPT-realtime-1.5 (OpenAI): Older realtime route that remains available; GPT-Realtime-2 is the stronger upgrade for reasoning and tool use.
- GPT-Realtime-Translate (OpenAI): Dedicated live-translation model priced per minute; it is not a general voice-agent model.
- Gemini 3.1 Flash Live Preview (Google): Comparable realtime voice-agent category, with provider choice usually driven by ecosystem fit, latency behavior, and tool integration.