Gemini Embedding 2 Preview
Google · Gemini Embedding
Google's first multimodal Gemini embedding model, mapping text, images, audio, video, and PDFs into a unified vector space.
Overview
Freshness note: Model capabilities, limits, and pricing can change quickly. This profile is a point-in-time snapshot last verified on May 1, 2026.
Gemini Embedding 2 Preview is Google’s first multimodal embedding model in the Gemini line. It maps text, images, audio, video, and PDFs into a single shared vector space, enabling cross-modal retrieval and semantic search without building separate per-modality embedding pipelines. The “Preview” tag means Google may adjust pricing or behavior before the model reaches general availability (GA). Production systems should therefore budget for price changes and be designed so the corpus can be re-embedded if the GA model produces different vectors.
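One way to keep a corpus ready for re-embedding is to store the producing model ID next to every vector, so stale entries can be found after a model change. The sketch below is a minimal illustration under that assumption. The record layout and the `stale_records` helper are hypothetical and are not part of any Google SDK.

```python
from dataclasses import dataclass

# Model ID as listed in this profile; everything else here is illustrative.
CURRENT_MODEL_ID = "gemini-embedding-2-preview"


@dataclass
class EmbeddingRecord:
    doc_id: str
    model_id: str          # which model produced the vector
    vector: list[float]    # the stored embedding


def stale_records(records: list[EmbeddingRecord], active_model_id: str) -> list[str]:
    """Return the doc IDs whose vectors were produced by a different model."""
    return [r.doc_id for r in records if r.model_id != active_model_id]


# Toy data: the second record was embedded with a placeholder older model.
records = [
    EmbeddingRecord("doc-1", "gemini-embedding-2-preview", [0.1, 0.2]),
    EmbeddingRecord("doc-2", "some-older-model", [0.3, 0.4]),
]
needs_reembedding = stale_records(records, CURRENT_MODEL_ID)
```

Vectors from different model versions generally should not be mixed in one index, which is why the sketch flags every record whose model ID differs from the active one.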
This is the entry to reach for when the question is “how does Google embed multimodal content for retrieval?” rather than “which Google model should I use for chat or reasoning?”
Capabilities
Google’s release and pricing materials highlight a specific capability profile:
- Multimodal embedding across text, images, audio, video, and PDF inputs into a unified vector space.
- Direct cross-modal retrieval, including text queries against image, audio, or video corpora without bridging models.
- Standard embedding-API ergonomics through the Gemini API and Vertex AI.
- Batch API support at a 50% discount for offline embedding pipelines.
This is meaningfully different from earlier Gemini embedding models, which were text-only.
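Because all modalities share one vector space, cross-modal retrieval reduces to comparing a query vector against stored item vectors. The sketch below illustrates that step with cosine similarity in plain Python. The three-dimensional vectors and item names are toy placeholders standing in for real embeddings, and no API call is made.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query_vector: list[float], corpus: dict[str, list[float]]) -> list[tuple[str, float]]:
    """Score every corpus item against the query, most similar first."""
    scored = [(item_id, cosine_similarity(query_vector, vec)) for item_id, vec in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical corpus: one vector each for an image, an audio clip, and a video.
corpus = {
    "image:cat.jpg": [0.9, 0.1, 0.0],
    "audio:meeting.mp3": [0.0, 0.2, 0.9],
    "video:demo.mp4": [0.1, 0.9, 0.1],
}
query_vector = [0.8, 0.2, 0.1]  # stands in for the embedding of a text query
ranking = rank(query_vector, corpus)
```

In a real system the brute-force loop would typically be replaced by a vector index, but the scoring logic stays the same regardless of which modality produced each vector.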
Technical Details
Public anchors at this snapshot:
- Model ID: gemini-embedding-2-preview
- Multimodal inputs across text, image, audio, video, and PDF
- Output: dense vector embeddings, not generated tokens
- Available through both the Gemini API for developers and Vertex AI for enterprise
Token-context fields here are nominal anchors rather than chat-style limits. Embedding models do not generate tokens, so maxOutput is set to 0 to reflect that the output is a vector rather than text. Treat the listed contextWindow as a quick comparison number; consult Google’s official model card for current per-modality input limits before designing chunking strategies.
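Since per-modality limits should come from the model card, a chunker is easier to maintain when the limit is a parameter rather than a hard-coded value. The sketch below is one possible text chunker under two loud assumptions: whitespace-separated words are used as a crude proxy for tokens, and `HYPOTHETICAL_MAX_WORDS` is a placeholder, not a documented limit.

```python
# Placeholder only; replace with a value derived from Google's model card.
HYPOTHETICAL_MAX_WORDS = 200


def chunk_words(text: str, max_words: int, overlap: int = 0) -> list[str]:
    """Split text into chunks of at most max_words words, sharing `overlap` words."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be in [0, max_words)")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


example_chunks = chunk_words("a b c d e", max_words=3, overlap=1)
```

A production pipeline would count tokens with the provider's own tokenizer or token-counting endpoint instead of splitting on whitespace.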
Pricing & Access
Listed Google pricing per 1M tokens for Gemini Embedding 2 Preview:
- Text input: $0.20
- Image input: $0.45
- Audio input: $6.50
- Video input: $12.00
- Batch API: 50% discount on text input
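The listed rates can be turned into a quick budget estimate. The sketch below uses only the prices shown above and applies the batch discount to text input, as listed. The workload figures and the `estimate_cost` helper are hypothetical, and the rates should be rechecked against Google's pricing page before use.

```python
# USD per 1M tokens, copied from the list above (point-in-time snapshot).
PRICE_PER_MILLION_TOKENS = {
    "text": 0.20,
    "image": 0.45,
    "audio": 6.50,
    "video": 12.00,
}
BATCH_TEXT_DISCOUNT = 0.50  # listed as applying to text input


def estimate_cost(tokens_by_modality: dict[str, int], use_batch: bool = False) -> float:
    """Estimate embedding cost in USD for a workload split by modality."""
    total = 0.0
    for modality, tokens in tokens_by_modality.items():
        rate = PRICE_PER_MILLION_TOKENS[modality]
        if use_batch and modality == "text":
            rate *= 1 - BATCH_TEXT_DISCOUNT
        total += tokens / 1_000_000 * rate
    return total


# Hypothetical workload: 10M text tokens, 2M image tokens, 0.5M audio tokens.
workload = {"text": 10_000_000, "image": 2_000_000, "audio": 500_000}
standard_cost = estimate_cost(workload)
batch_cost = estimate_cost(workload, use_batch=True)
```

For this workload the estimate comes to $6.15 at standard rates and $5.15 with the batch discount on text. Audio dominates the total despite having the smallest token count.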
Access options:
- Gemini API for developers
- Google AI Studio
- Vertex AI for enterprise customers
- Compatible with standard Gemini SDK clients
Best Use Cases
Choose Gemini Embedding 2 Preview for:
- Multimodal retrieval-augmented generation (RAG) that needs to index images, audio, video, or PDFs alongside text.
- Cross-modal search products where users issue text queries against non-text corpora.
- Knowledge bases that consolidate document understanding across formats without per-modality embedding pipelines.
- Teams already running on Gemini, Vertex AI, or Google Cloud who benefit from staying inside one provider.
For pure-text retrieval at lower cost, OpenAI’s text-embedding-3-small remains substantially cheaper. For self-hosted control, open-weight embedding models such as BGE or E5 are still relevant alternatives.
Comparisons
- OpenAI text-embedding-3-large (OpenAI): Strong text-only embedding model at lower cost per million tokens; Gemini Embedding 2 differentiates on multimodality.
- OpenAI text-embedding-3-small (OpenAI): Cheap baseline text embeddings; Gemini Embedding 2 is a different product class.
- Earlier Gemini embedding models (Google): Text-only predecessors; Embedding 2 Preview is the multimodal step change.