GPT-Realtime-Whisper

OpenAI · GPT Realtime

OpenAI's streaming speech-to-text model for low-latency realtime transcription.

Type: audio
Context: 16K tokens
Max Output: 2K tokens
Status: current
API Access: Yes
License: proprietary
Tags: realtime, transcription, speech-to-text, audio, streaming, api
Released May 2026 · Updated May 8, 2026

Overview

Freshness note: Model capabilities, limits, and pricing can change quickly. This profile is a point-in-time snapshot last verified on May 8, 2026.

GPT-Realtime-Whisper is OpenAI’s realtime transcription model for applications that need transcript deltas from live audio while a person is still speaking. It extends Whisper-style speech-to-text into lower-latency realtime use cases; it is not a replacement for every offline transcription model.

Use this page when the question is “how do we transcribe live audio streams quickly?” rather than “which voice model should run a full conversation?”

Capabilities

GPT-Realtime-Whisper is designed for live captions, meetings, classrooms, broadcast workflows, customer-support monitoring, healthcare intake capture, and voice-agent pipelines that need continuous understanding of user speech.

The model supports streaming and is tuned to balance latency against accuracy on realtime audio. It does not support function calling or structured outputs, because it is a transcription model rather than a general agent model.
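
To make "transcript deltas" concrete, here is a minimal sketch of how a client might fold streamed deltas into a live caption. The event shape (`type`, `item_id`, `text`) and the `TranscriptBuffer` class are hypothetical and chosen for illustration only; they are not taken from OpenAI's documentation, so check the actual event schema before adapting this.

```python
from dataclasses import dataclass, field


@dataclass
class TranscriptBuffer:
    """Collects streamed transcript deltas into per-utterance text."""

    # Maps a hypothetical utterance id to the text received so far.
    partial: dict[str, str] = field(default_factory=dict)
    # Finalized utterances, in the order they were completed.
    final: list[str] = field(default_factory=list)

    def on_event(self, event: dict) -> None:
        # ASSUMPTION: this event shape is invented for illustration:
        #   {"type": "delta", "item_id": "...", "text": "..."}
        #   {"type": "done",  "item_id": "..."}
        item_id = event["item_id"]
        if event["type"] == "delta":
            # Append the new fragment to whatever we already have.
            self.partial[item_id] = self.partial.get(item_id, "") + event["text"]
        elif event["type"] == "done":
            # Move the utterance from in-progress to finalized.
            self.final.append(self.partial.pop(item_id, "").strip())

    def live_caption(self) -> str:
        # Finalized text followed by whatever is still being spoken.
        parts = self.final + [text.strip() for text in self.partial.values()]
        return " ".join(part for part in parts if part)
```

Keeping in-progress and finalized text separate lets a caption UI show provisional words immediately while still treating completed utterances as stable.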

Technical Details

Current published limits:

  • Context window: 16,000 tokens
  • Max output: 2,000 tokens
  • Input: audio
  • Output: text
  • Streaming: supported
  • Function calling: not supported
  • Structured outputs: not supported

The model uses realtime transcription sessions and is priced by audio duration rather than text tokens.

Pricing & Access

Published OpenAI API pricing:

  • Realtime audio duration: $0.017 per minute
  • Equivalent per-second rate: about $0.00028

Access is through OpenAI’s realtime transcription API. Estimate cost from concurrent audio minutes and expected operating hours, then separately budget any downstream summarization or agent steps.
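
The estimate described above is simple arithmetic, sketched here under stated assumptions. The only figure taken from this page is the $0.017 per-minute rate; the function names and the example workload (streams, hours, days) are hypothetical placeholders, and downstream summarization or agent costs are deliberately excluded.

```python
PRICE_PER_MINUTE_USD = 0.017  # published rate quoted on this page


def per_second_rate(price_per_minute: float = PRICE_PER_MINUTE_USD) -> float:
    """Convert the per-minute price into a per-second price."""
    return price_per_minute / 60


def estimate_monthly_cost(
    concurrent_streams: int,
    hours_per_day: float,
    days_per_month: int,
    price_per_minute: float = PRICE_PER_MINUTE_USD,
) -> float:
    """Estimate transcription spend from total billed audio minutes.

    Assumes every stream carries audio for the full operating window,
    which gives an upper bound if streams are idle part of the time.
    """
    audio_minutes = concurrent_streams * hours_per_day * 60 * days_per_month
    return round(audio_minutes * price_per_minute, 2)
```

For example, 10 concurrent streams running 8 hours a day for 22 days is 105,600 audio minutes, which `estimate_monthly_cost(10, 8, 22)` prices at about $1,795.20 at the rate above.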

Best Use Cases

Use GPT-Realtime-Whisper for live captions, transcript feeds for voice agents, support-call indexing while the call is active, accessibility tooling, meeting notes, classroom transcription, and broadcast workflows that need low-latency text.

For batch file transcription where latency does not matter, compare against GPT-4o Transcribe, GPT-4o mini Transcribe, and legacy Whisper pricing before choosing this route.

Comparisons

  • GPT-4o Transcribe (OpenAI): Better fit for high-quality offline transcription and file ingestion.
  • GPT-4o mini Transcribe (OpenAI): Cheaper modern STT route for non-live pipelines.
  • GPT-Realtime-Translate (OpenAI): Translates speech across languages; GPT-Realtime-Whisper focuses on same-language speech-to-text.
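
The comparisons above can be read as a small decision rule. The sketch below encodes one possible reading of it; the `Route` enum and `pick_route` function are hypothetical, the enum values are the display names used on this page rather than API model identifiers, and sending all translation work to GPT-Realtime-Translate is a simplifying assumption.

```python
from enum import Enum


class Route(str, Enum):
    # Display names from this page, NOT verified API model identifiers.
    REALTIME_WHISPER = "GPT-Realtime-Whisper"
    REALTIME_TRANSLATE = "GPT-Realtime-Translate"
    TRANSCRIBE = "GPT-4o Transcribe"
    MINI_TRANSCRIBE = "GPT-4o mini Transcribe"


def pick_route(live: bool, translate: bool = False, cost_sensitive: bool = False) -> Route:
    """Pick a transcription route from three coarse requirements."""
    if translate:
        # Cross-language output is outside same-language speech-to-text.
        return Route.REALTIME_TRANSLATE
    if live:
        # Low-latency text from a live stream.
        return Route.REALTIME_WHISPER
    # Offline or batch work: trade quality against price.
    return Route.MINI_TRANSCRIBE if cost_sensitive else Route.TRANSCRIBE
```

A real selection would also weigh measured accuracy on your own audio and current pricing, which this rule does not model.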