GPT-Realtime-Translate
OpenAI · GPT Realtime
OpenAI's realtime speech-to-speech translation model for live multilingual audio experiences.
Overview
Freshness note: Model capabilities, limits, and pricing can change quickly. This profile is a point-in-time snapshot last verified on May 8, 2026.
GPT-Realtime-Translate is OpenAI’s dedicated live speech-to-speech translation model. It is not a general realtime assistant model. Its job is to translate speech while source audio is still arriving, returning translated audio and transcript deltas quickly enough for live multilingual products.
OpenAI positions it for customer support, education, events, media, creator platforms, and cross-border sales workflows where waiting for batch translation would break the experience.
Capabilities
According to OpenAI’s launch materials, the model supports more than 70 input languages and 13 output languages. It is built for speech-to-speech translation that preserves meaning while keeping pace with the speaker, and it is designed to handle regional pronunciation and domain-specific language.
Because it is a dedicated translation route, it does not support function calling or structured outputs. Treat it as a specialized audio model in a larger system rather than the agent brain for a workflow.
Technical Details
Current published limits and supported features:
- Context window: 16,000 tokens
- Max output: 2,000 tokens
- Audio input and audio output
- Text output for transcripts and deltas
- Streaming: supported
- Function calling: not supported
- Structured outputs: not supported
The model uses the realtime translation endpoint and is priced by audio duration rather than text tokens.
Pricing & Access
Published OpenAI API pricing:
- Realtime audio duration: $0.034 per minute
- Equivalent per-second rate: about $0.00057
Access is through OpenAI’s realtime translation API. The billing model is duration-based, so cost estimates should start from expected conversation minutes rather than token counts.
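Because billing is by duration, a rough budget can be worked out from expected minutes alone. The sketch below uses the $0.034 per minute rate quoted above. The function names are illustrative, and it assumes billed minutes equal total session audio minutes, which should be checked against OpenAI's current pricing page.

```python
# Rate taken from this profile's pricing snapshot; verify before relying on it.
RATE_PER_MINUTE_USD = 0.034


def per_second_rate_usd(rate_per_minute: float = RATE_PER_MINUTE_USD) -> float:
    """Convert the per-minute rate into a per-second rate."""
    return rate_per_minute / 60


def estimate_cost_usd(minutes_per_session: float, sessions: int = 1,
                      rate_per_minute: float = RATE_PER_MINUTE_USD) -> float:
    """Estimate spend for a number of sessions of a given average length.

    Assumption: billed duration equals session audio duration.
    """
    if minutes_per_session < 0 or sessions < 0:
        raise ValueError("minutes_per_session and sessions must be non-negative")
    return minutes_per_session * sessions * rate_per_minute


if __name__ == "__main__":
    # Example: 100 support calls averaging 10 minutes each.
    print(f"per second: ${per_second_rate_usd():.5f}")
    print(f"100 x 10-minute calls: ${estimate_cost_usd(10, 100):.2f}")
```

At the quoted rate, 1,000 translated minutes come to about $34, so call volume and average call length are the main inputs to any estimate.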
Best Use Cases
Use GPT-Realtime-Translate for live translated customer calls, multilingual support queues, live event interpretation, product education videos, cross-language sales conversations, and creator tools where translated audio needs to arrive while the source is still being spoken.
For offline localization or document translation, a text model with a post-editing workflow will usually be cheaper and easier to govern.
Comparisons
- GPT-Realtime-2 (OpenAI): Better for reasoning and tool-using voice agents; Translate is the specialized live translation lane.
- GPT-Realtime-Whisper (OpenAI): Transcribes live speech to text; Translate produces translated speech and transcript deltas.
- Gemini Live-style translation flows: Similar product class, with the choice depending on language coverage, latency, and platform alignment.
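The split between the OpenAI models above can be expressed as a small routing rule inside a larger system. This is a minimal sketch with hypothetical names (`Task`, `pick_model`). The strings are the display names used in this profile, not confirmed API model identifiers.

```python
from enum import Enum


class Task(Enum):
    LIVE_TRANSLATION = "live_translation"      # translated speech plus transcript deltas
    LIVE_TRANSCRIPTION = "live_transcription"  # speech to text only
    VOICE_AGENT = "voice_agent"                # reasoning and tool use


# Display names from this profile; actual API identifiers may differ.
MODEL_FOR_TASK = {
    Task.LIVE_TRANSLATION: "GPT-Realtime-Translate",
    Task.LIVE_TRANSCRIPTION: "GPT-Realtime-Whisper",
    Task.VOICE_AGENT: "GPT-Realtime-2",
}


def pick_model(task: Task, needs_tools: bool = False) -> str:
    """Return the model lane for a task.

    The translation model does not support function calling, so a
    translation task that also needs tools is rejected here rather
    than silently routed to a model that cannot do both.
    """
    if needs_tools and task is not Task.VOICE_AGENT:
        raise ValueError(
            f"{MODEL_FOR_TASK[task]} is not the tool-using lane; "
            "handle tool calls in a separate voice-agent component"
        )
    return MODEL_FOR_TASK[task]


if __name__ == "__main__":
    print(pick_model(Task.LIVE_TRANSLATION))
    print(pick_model(Task.VOICE_AGENT, needs_tools=True))
```

Raising an error for a translation task that needs tools keeps the translation lane narrow, in line with treating it as a specialized audio component rather than the agent itself.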