Google

Vertex AI Multi-Modal

Foundation embedding model projecting text, images, and video into a shared space.

Model Type: Google Proprietary Multi-Modal (Generic)
Overall Rank: #6
Avg. R@1: 38.67%
Avg. mAP@20: 32.14%
Embedding Size: 1408
Input Size: N/A
Datasets: 5

About This Model

Overview

Google's Vertex AI platform exposes multimodalembedding@001, a multimodal embedding service that maps text, images, and video into a shared semantic space. It is intended for retrieval and semantic-similarity tasks that benefit from comparing content across modalities, served through a managed API rather than a self-hosted model.
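A minimal call through the Vertex AI Python SDK (google-cloud-aiplatform) might look like the sketch below. The project ID, location, and file names are placeholders, and a real invocation requires an authenticated Google Cloud project.

```python
# Hedged sketch of calling multimodalembedding@001 via the Vertex AI
# Python SDK (google-cloud-aiplatform). All identifiers marked as
# placeholders are assumptions, not values from this page.
def embed_image_and_text(image_path: str, text: str, dim: int = 1408):
    """Return (image_embedding, text_embedding) for one image/text pair."""
    # Imports deferred so the sketch can be read without the SDK installed.
    from vertexai.vision_models import Image, MultiModalEmbeddingModel

    model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding@001")
    result = model.get_embeddings(
        image=Image.load_from_file(image_path),
        contextual_text=text,
        dimension=dim,  # 1408 by default; the API also accepts lower values
    )
    return result.image_embedding, result.text_embedding


if __name__ == "__main__":
    import vertexai

    # Placeholder project/location; replace with your own GCP settings.
    vertexai.init(project="your-project-id", location="us-central1")
    img_vec, txt_vec = embed_image_and_text("product.jpg", "red running shoe")
    print(len(img_vec), len(txt_vec))
```

Because text and image vectors land in the same space, the two returned embeddings can be compared directly with cosine similarity.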

Architecture

Google Cloud documentation describes multimodalembedding@001 as a multimodal embedding service and documents its vector dimensionality, but does not provide a detailed public architecture description. The service returns 1,408-dimensional embeddings by default and also supports lower output dimensions.

Capabilities

The embeddings support semantic search, recommendation, content moderation, classification, and similarity-based retrieval across text, image, and video. Text and image embeddings share the same dimensionality and semantic space, enabling cross-modal use cases such as text-to-image retrieval.
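Cross-modal retrieval in a shared space reduces to nearest-neighbor search by cosine similarity. The self-contained sketch below uses tiny made-up 3-dimensional vectors as stand-ins for the 1,408-dimensional embeddings the service returns; the file names and values are illustrative only.

```python
# Minimal text-to-image retrieval sketch: rank catalog images by cosine
# similarity to a text query embedding. Vectors here are toy stand-ins
# for real API embeddings.
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def rank_images(text_vec, image_vecs):
    """Return (image_id, score) pairs sorted best-first."""
    scored = [(img_id, cosine_similarity(text_vec, v))
              for img_id, v in image_vecs.items()]
    return sorted(scored, key=lambda s: s[1], reverse=True)


# Toy catalog: the query vector is constructed to sit nearest "shoe.jpg".
query = [0.9, 0.1, 0.0]
catalog = {
    "shoe.jpg":   [0.8, 0.2, 0.1],
    "wrench.jpg": [0.1, 0.9, 0.2],
    "cable.jpg":  [0.0, 0.3, 0.9],
}
ranking = rank_images(query, catalog)
print(ranking[0][0])  # → shoe.jpg
```

The same loop works unchanged whichever modality produced each vector, which is the practical payoff of a shared embedding space.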

Performance Across Datasets

Dataset                    Category      R@1     R@5     mAP
Stanford Online Products   E-commerce    76.88%  87.73%  56.32%
Products-10K               E-commerce    63.26%  82.21%  43.08%
DIY v1                     Hardware/DIY  24.74%  47.60%  34.69%
Automotive v1              Automotive    19.69%  43.21%  24.77%
Clips-and-Connectors v1    Industrial     8.77%  21.80%   1.83%
Average                                  38.67%  56.51%  32.14%