Gemini Embedding 2 — Signal Lens

Overview

Freshness note: Model capabilities, limits, and pricing can change quickly. This profile is a point-in-time snapshot last verified on June 8, 2026.

Gemini Embedding 2 is Google’s first multimodal embedding model in the Gemini line. It maps text, images, audio, video, and PDFs into a single shared vector space, enabling cross-modal retrieval and semantic search without building separate per-modality embedding pipelines. Google’s current model page lists gemini-embedding-2 as stable.

This is the entry to reach for when the question is “how does Google embed multimodal content for retrieval?” rather than “which Google model should I use for chat or reasoning?”

Capabilities

Google’s release and pricing materials highlight the following capabilities:

Multimodal embedding across text, images, audio, video, and PDF inputs into a unified vector space.
Direct cross-modal retrieval, including text queries against image, audio, or video corpora without bridging models.
Standard embedding-API ergonomics through the Gemini API and Vertex AI.
Batch API support at a 50% discount for offline embedding pipelines.
Flexible output dimensions from 128 to 3,072, with 768, 1,536, and 3,072 as Google’s recommended sizes.

Multimodal input is the main break from earlier Gemini embedding models, which were text-only.
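
To make the shared-vector-space idea concrete, here is a minimal sketch of cross-modal ranking by cosine similarity. The 4-dimensional vectors, the item names, and the rank_by_similarity helper are all hypothetical stand-ins; real vectors would come from the embedding API and have 128 to 3,072 dimensions. Cosine similarity is a common choice for comparing embeddings, assumed here rather than taken from Google's documentation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, corpus):
    """Return (item_id, score) pairs sorted from most to least similar."""
    scored = [
        (item_id, cosine_similarity(query_vec, vec))
        for item_id, vec in corpus.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy embeddings for three non-text items (made-up values).
corpus = {
    "image:dog_photo.jpg": [0.9, 0.1, 0.0, 0.1],
    "audio:podcast_clip.mp3": [0.1, 0.8, 0.3, 0.0],
    "video:cooking_demo.mp4": [0.0, 0.2, 0.9, 0.2],
}

# Hypothetical embedding of a text query such as "a photo of a dog".
query = [0.8, 0.2, 0.1, 0.0]

ranked = rank_by_similarity(query, corpus)
for item_id, score in ranked:
    print(f"{item_id}: {score:.3f}")
```

Because every modality lands in the same space, the text query is compared directly against image, audio, and video vectors with no bridging model in between. With these toy values the image item ranks first.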

Technical Details

Public anchors at this snapshot:

Model ID: gemini-embedding-2
Multimodal inputs across text, image, audio, video, and PDF
Output: dense vector embeddings, not generated tokens
Input token limit: 8,192
Output dimensions: 128 to 3,072
Available through both the Gemini API for developers and Vertex AI for enterprise

Because embedding models return a vector rather than generated tokens, this profile records maxOutput as 0.
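
The choice of output dimension mostly shows up as index size. The sketch below estimates raw vector storage for the three recommended sizes, assuming 4-byte float32 values and ignoring any index overhead or quantization; both are assumptions of this example, not figures from Google, and index_size_bytes is a hypothetical helper.

```python
BYTES_PER_FLOAT32 = 4  # assumption: vectors are stored as float32

MIN_DIMENSIONS = 128
MAX_DIMENSIONS = 3072


def index_size_bytes(num_vectors, dimensions, bytes_per_value=BYTES_PER_FLOAT32):
    """Raw storage for num_vectors embeddings, excluding index overhead."""
    if not MIN_DIMENSIONS <= dimensions <= MAX_DIMENSIONS:
        raise ValueError(
            f"dimensions must be between {MIN_DIMENSIONS} and {MAX_DIMENSIONS}"
        )
    return num_vectors * dimensions * bytes_per_value


# Compare the three recommended sizes for a corpus of one million items.
for dims in (768, 1536, 3072):
    size_gb = index_size_bytes(1_000_000, dims) / 1e9
    print(f"{dims} dimensions: {size_gb:.3f} GB")
```

Under these assumptions, one million vectors take roughly 3.1 GB at 768 dimensions and 12.3 GB at 3,072, so halving the dimension halves raw storage.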

Pricing & Access

Listed Google pricing per 1M tokens for Gemini Embedding 2:

Text input: $0.20
Image input: $0.45
Audio input: $6.50
Video input: $12.00
Batch API: 50% discount across the listed modalities
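
As a worked example of the rates above, the following sketch estimates the cost of an embedding job with and without the batch discount. The token counts are invented for illustration, and estimate_cost is a hypothetical helper; how image, audio, and video inputs translate into tokens is not covered here, so the caller is assumed to supply token counts per modality.

```python
# Listed prices in USD per 1M tokens at this snapshot.
PRICE_PER_MILLION_TOKENS = {
    "text": 0.20,
    "image": 0.45,
    "audio": 6.50,
    "video": 12.00,
}
BATCH_DISCOUNT = 0.50  # Batch API: 50% off the listed rates


def estimate_cost(token_counts, batch=False):
    """Estimate USD cost for a dict of {modality: token_count}."""
    total = 0.0
    for modality, tokens in token_counts.items():
        if modality not in PRICE_PER_MILLION_TOKENS:
            raise KeyError(f"no listed price for modality: {modality}")
        total += tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS[modality]
    if batch:
        total *= 1 - BATCH_DISCOUNT
    return total


# Hypothetical workload: token counts are made up for illustration.
workload = {"text": 50_000_000, "image": 10_000_000, "audio": 2_000_000}

print(f"Standard: ${estimate_cost(workload):.2f}")
print(f"Batch:    ${estimate_cost(workload, batch=True):.2f}")
```

For this workload the arithmetic is 50 x $0.20 + 10 x $0.45 + 2 x $6.50 = $27.50 at standard rates, or $13.75 through the Batch API. Audio dominates the bill despite having the fewest tokens.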

Access options:

Gemini API for developers
Google AI Studio
Vertex AI for enterprise customers
Compatible with standard Gemini SDK clients

Best Use Cases

Choose Gemini Embedding 2 for:

Multimodal retrieval-augmented generation (RAG) that needs to index images, audio, video, or PDFs alongside text.
Cross-modal search products where users issue text queries against non-text corpora.
Knowledge bases that consolidate document understanding across formats without per-modality embedding pipelines.
Teams already running on Gemini, Vertex AI, or Google Cloud who benefit from staying inside one provider.

For pure-text retrieval at lower cost, OpenAI’s text-embedding-3-small remains substantially cheaper. For self-hosted control, open-weight embedding models such as BGE or E5 are still relevant alternatives.

Comparisons

OpenAI text-embedding-3-large (OpenAI): Strong text-only embedding model at lower cost per million tokens; Gemini Embedding 2 differentiates on multimodality.
OpenAI text-embedding-3-small (OpenAI): Cheap baseline text embeddings; Gemini Embedding 2 targets multimodal retrieval rather than low-cost text-only embedding.
Earlier Gemini embedding models (Google): Text-only predecessors; Gemini Embedding 2 adds multimodal input and is the current recommended migration target.