Google has announced one of its most significant updates for enterprise customers: the public preview of Gemini Embedding 2, its first natively multimodal embedding model. Built on the Gemini architecture, the model maps text, images, video, audio, and documents into a single embedding space, enabling multimodal search and classification across different types of media.

The model is already available in Public Preview via the Gemini API and Vertex AI.

What is an embedding?

An embedding is a numerical representation of data, a vector of numbers, that lets models compare and interpret information. Embeddings can be built not only from text, but also from images, audio, and video.

Text embeddings are numerical vectors that map words or phrases into a multidimensional space. In that space, texts with similar meaning end up close together, which is what lets machine learning models recognize meaning and relationships within text.
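The idea of "closeness in vector space" is usually measured with cosine similarity. A toy sketch with invented 4-dimensional vectors (real models produce thousands of dimensions, and these numbers are purely illustrative):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: 1.0 means identical direction, near 0 means unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy "embeddings": related words point in similar directions.
king  = np.array([0.9, 0.8, 0.1, 0.3])
queen = np.array([0.8, 0.9, 0.2, 0.3])
apple = np.array([0.1, 0.2, 0.9, 0.7])

print(cosine_similarity(king, queen))  # high: similar meaning
print(cosine_similarity(king, apple))  # lower: unrelated meaning
```

This is the operation a vector database performs at scale when it searches over stored embeddings.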

As Google explains:

“Expanding on our previous text-only foundation, Gemini Embedding 2 maps text, images, videos, audio and documents into a single, unified embedding space, and captures semantic intent across over 100 languages. This simplifies complex pipelines and enhances a wide variety of multimodal downstream tasks—from Retrieval-Augmented Generation (RAG) and semantic search to sentiment analysis and data clustering.”

Key features of the new embedding model

Gemini Embedding 2 can generate embeddings for a wide range of data types. Here are the capabilities of the model across different media formats:

  • Text: supports an input context of up to 8192 tokens
  • Images: can process up to 6 images per request and supports formats such as PNG and JPEG
  • Video: supports videos up to 120 seconds in MP4 and MOV formats
  • Audio: can process audio data directly without requiring text transcription
  • Documents: can generate embeddings for PDF files up to 6 pages

An important clarification: these are per-request input limits, not limits on the system’s memory or storage. You cannot, for example, upload a several-hundred-page PDF in a single call; the document has to be split into segments of up to six pages, and each segment sent separately.

Another important point is that embeddings accumulate: once the segments you send are converted into vectors, they can be stored together in a database, which later enables search across all files. The same principle applies to video and audio.
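The split-then-accumulate workflow can be sketched as follows. Note that embed_segment here is a placeholder standing in for a real embedding API call, not the actual Gemini SDK; only the chunking logic and the accumulation pattern are the point:

```python
import numpy as np

PAGES_PER_REQUEST = 6  # per-request document limit described above

def chunk_pages(pages: list[str], size: int = PAGES_PER_REQUEST) -> list[list[str]]:
    """Split a long document into segments that each fit in one request."""
    return [pages[i:i + size] for i in range(0, len(pages), size)]

def embed_segment(segment: list[str]) -> np.ndarray:
    """Placeholder for a real embedding call; returns a deterministic fake vector."""
    rng = np.random.default_rng(len("".join(segment)))
    return rng.standard_normal(3072)

# A 20-page document becomes four requests: 6 + 6 + 6 + 2 pages.
pages = [f"page {i}" for i in range(20)]
segments = chunk_pages(pages)
vector_store = [embed_segment(s) for s in segments]  # accumulate for later search
print(len(segments))  # 4
```

In production, vector_store would be a vector database rather than a Python list, but the pattern is the same.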

How Gemini Embedding 2 differs from previous approaches

Previously, connecting two types of data required a different approach. CLIP, for example, relied on two separate encoders: one for text and one for images. The limitation of this design is that the encoders processed their inputs independently, and their outputs were aligned only afterwards using contrastive learning.

Because this alignment occurred only at the final stage, it missed deeper cross-modal connections. With this approach, the system could understand that certain images correspond to certain pieces of text, but it was not always able to capture the more complex interactions and relationships between different types of data, some of which could have formed in the intermediate layers of the network.

A key advantage of the model is that it can natively understand combined inputs, such as image + text within a single request. This allows the model to process complex data and capture relationships between different types of media.

Matryoshka Representation Learning

One nuance of working with embeddings is that higher-dimensional vectors can capture more detail, but they also require more memory. To address this, Gemini Embedding 2 uses a technique called Matryoshka Representation Learning (MRL).

The idea behind it is that a single representation vector can be truncated to fewer dimensions while still retaining its usefulness for tasks such as search or text comparison. In other words, the technique allows information to be nested within the vector, reducing its size while preserving performance.

MRL directs the most important information into the earliest dimensions of the vector, instead of distributing the semantic signal evenly across all 3072 dimensions. To use this feature, developers need to pass the output_dimensionality parameter.

It is important to note that dimensions smaller than 3072 are not normalized by default, so vectors must be normalized manually before computing similarity. If this step is skipped, the resulting distance metrics may become distorted.
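The truncate-then-normalize step looks like this in practice. A minimal sketch with a random stand-in vector, showing why skipping normalization skews similarity scores:

```python
import numpy as np

def truncate_and_normalize(embedding: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` MRL dimensions, then rescale to unit length."""
    truncated = embedding[:dims]
    return truncated / np.linalg.norm(truncated)

# Stand-in for a full 3072-dimensional embedding, normalized to unit length.
full = np.random.default_rng(0).standard_normal(3072)
full /= np.linalg.norm(full)

small = truncate_and_normalize(full, 768)
print(small.shape)                  # (768,)
print(np.linalg.norm(small))        # 1.0 after manual normalization
print(np.linalg.norm(full[:768]))   # < 1.0 without it: dot products would be deflated
```

Because a raw truncated vector has norm below 1, dot-product similarities computed from it are systematically smaller, which is exactly the distortion the warning above refers to.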

As Google notes:

“This allows flexible scaling of the output embedding size, reducing it from the default dimension of 3072. As a result, developers can balance performance and storage costs. For the best quality, we recommend using dimensions of 3072, 1536, or 768.”

Two-stage retrieval with MRL

Another advantage of MRL is that it enables an effective two-stage retrieval algorithm. In the first stage, smaller vectors can be used for fast retrieval from the index. In the second stage, the retrieved results can be re-ranked using the full 3072-dimensional vectors. This approach gives developers the accuracy of a large model with the latency profile of a much smaller one.
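The two stages can be sketched with plain numpy over a synthetic corpus. The corpus, the truncation size of 256, and the shortlist size of 50 are all arbitrary choices for illustration:

```python
import numpy as np

def normalize(m: np.ndarray) -> np.ndarray:
    """Rescale vectors (rows) to unit length so dot product = cosine similarity."""
    return m / np.linalg.norm(m, axis=-1, keepdims=True)

rng = np.random.default_rng(42)
corpus = normalize(rng.standard_normal((1000, 3072)))  # full-size document vectors
query = normalize(rng.standard_normal(3072))

SMALL = 256   # truncated dimension for the fast first stage
TOP_K = 50    # shortlist size passed to re-ranking

# Stage 1: coarse search over cheap truncated (and re-normalized) vectors.
small_corpus = normalize(corpus[:, :SMALL])
small_query = normalize(query[:SMALL])
candidates = np.argsort(small_corpus @ small_query)[::-1][:TOP_K]

# Stage 2: exact re-ranking of the shortlist with full 3072-dim vectors.
reranked = candidates[np.argsort(corpus[candidates] @ query)[::-1]]
print(reranked[:5])  # best matches, ordered by full-precision similarity
```

Stage 1 touches every document but only 256 dimensions each; stage 2 touches all 3072 dimensions but only 50 documents, which is where the latency savings come from.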

Performance and multimodal capabilities

Gemini Embedding 2 stands out for its performance in multimodal tasks, including advanced capabilities for working with speech, text, images, and video.

The new embedding model puts Google in direct competition with other multimodal embedding providers, and supporting different types of data within a single embedding space is a significant step forward.

In practice, this could reduce the need for separate pipelines for each data modality, since a unified model simplifies the process of handling multiple types of data. However, it is still important to note that in many real-world deployments additional layers remain necessary, including metadata processing, compliance requirements, and access control mechanisms.

Overall, Google states that the new model sets a new performance standard, improving how developers work with embeddings and enabling more efficient multimodal systems.

What this means for enterprise databases

Google itself uses embeddings widely across its products, in workloads such as Retrieval-Augmented Generation (RAG), large-scale data management, and traditional search.

Information inside companies often exists as a fragmented and scattered set of data. A single customer issue may involve a PDF document, several emails, chat text, and an audio recording. In the past, working with each of these formats required four separate pipelines, but with Gemini Embedding 2 the situation changes significantly: it becomes much easier for organizations to search their data regardless of format.

According to the company, several early-access partners are already using the model to build multimodal applications.

“We chose Gemini embeddings to help legal professionals find critical information during the discovery process in litigation – a highly technical challenge in a high-stakes setting, and one Gemini excels at.” – Max Christoff, CTO of Everlaw

Getting started and pricing

The model is already available for testing and integration into projects, but it is still subject to updates and improvements. A full, stable release (General Availability, GA) will be rolled out later.

Developers can explore interactive Colab notebooks for the Gemini API and Vertex AI to learn how to implement the new model.

Gemini Embedding 2 can also be integrated with popular frameworks and tools such as LangChain, LlamaIndex, Haystack, Weaviate, Qdrant, ChromaDB, and Vector Search.

Integrations with the above-mentioned ecosystems are important because embedding models are rarely used in isolation. They are typically placed behind a vector index, which stores embeddings of the data corpus and performs nearest-neighbor search. This infrastructure enables a wide range of applications, including enterprise search, customer support assistants, and content moderation systems.
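A vector index can be reduced to two operations: store unit-normalized embeddings, and rank them against a query by dot product. A minimal in-memory sketch (the class, payload strings, and toy 4-dimensional vectors are invented for illustration; real deployments would use one of the stores listed above):

```python
import numpy as np

class VectorIndex:
    """Toy in-memory vector index: brute-force nearest-neighbor search."""

    def __init__(self, dim: int):
        self.dim = dim
        self.vectors: list[np.ndarray] = []
        self.payloads: list[str] = []

    def add(self, vector: np.ndarray, payload: str) -> None:
        assert vector.shape == (self.dim,)
        self.vectors.append(vector / np.linalg.norm(vector))  # store unit-norm
        self.payloads.append(payload)

    def search(self, query: np.ndarray, k: int = 3) -> list[str]:
        matrix = np.stack(self.vectors)
        scores = matrix @ (query / np.linalg.norm(query))  # cosine similarity
        return [self.payloads[i] for i in np.argsort(scores)[::-1][:k]]

index = VectorIndex(dim=4)
index.add(np.array([1.0, 0.0, 0.0, 0.0]), "invoice.pdf, page 1")
index.add(np.array([0.0, 1.0, 0.0, 0.0]), "support-call.mp3")
index.add(np.array([0.9, 0.1, 0.0, 0.0]), "invoice.pdf, page 2")
print(index.search(np.array([1.0, 0.05, 0.0, 0.0]), k=2))
```

With a multimodal model, the same index can hold vectors for PDFs, emails, and audio side by side, which is what makes cross-format enterprise search possible.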

Access channels

Access to the new Gemini Embedding 2 model is available through two main channels:

  • Gemini API – designed for developers and rapid prototyping, with a simplified pricing model. Ideal for startups and small businesses.
  • Vertex AI (Google Cloud) – built for large enterprises and high-scale projects, offering advanced security features and seamless integration with the broader Google Cloud ecosystem.

Gemini API pricing

  • Free Tier: Developers can experiment with the model at no cost, subject to usage limits (typically 60 requests per minute). Data from this tier may be used by Google to improve its products.
  • Paid Tier: For production-level usage: text, images, and video at $0.25 per 1M tokens; native audio (without transcription) at $0.50 per 1M tokens.

Vertex AI pricing for enterprises

  • Flex PayGo: Ideal for unpredictable workloads; pay only for what you use.
  • Provisioned Throughput: Guarantees consistent capacity and low-latency performance for high-traffic applications.
  • Batch Prediction: Optimized for processing massive historical archives where speed is less critical but volume is extremely high.

All official Gemini API and Vertex AI Colab notebooks are licensed under Apache License 2.0, a permissive license that allows developers to use, modify, and even commercialize the code without the obligation to open-source their own projects.