Vertex AI Multi-Modal
About This Model
Overview
Google's Vertex AI platform exposes multimodalembedding@001, a multimodal embedding service that maps text, images, and video into a shared semantic space. It is intended for retrieval and semantic similarity tasks that benefit from comparing content across modalities in a managed API workflow.
Architecture
Google Cloud documentation describes multimodalembedding@001 as a multimodal embedding service and documents its vector dimensionality, but does not provide a detailed public architecture description. The service returns 1,408-dimensional embeddings by default and also supports lower output dimensions.
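To make the output-dimension option concrete, the following is a minimal sketch of how a request body for the service might be assembled. The field names (`instances`, `text`, `image.bytesBase64Encoded`, `parameters.dimension`) and the list of accepted dimensions reflect one reading of the public prediction request format and are assumptions here; check them against the current Google Cloud documentation before use. The helper `build_embedding_request` is hypothetical, and the sketch only builds the payload without sending it.

```python
import base64
import json

# Assumed set of output sizes; verify against the current documentation.
ASSUMED_DIMENSIONS = (128, 256, 512, 1408)


def build_embedding_request(text=None, image_bytes=None, dimension=1408):
    """Build a JSON-serialisable body for one text and/or image instance.

    Hypothetical helper: the field names below are assumptions about the
    request format, not a confirmed specification.
    """
    if dimension not in ASSUMED_DIMENSIONS:
        raise ValueError(f"unsupported dimension: {dimension}")
    if text is None and image_bytes is None:
        raise ValueError("provide text, image_bytes, or both")

    instance = {}
    if text is not None:
        instance["text"] = text
    if image_bytes is not None:
        # Binary image data is carried as a base64-encoded ASCII string.
        encoded = base64.b64encode(image_bytes).decode("ascii")
        instance["image"] = {"bytesBase64Encoded": encoded}

    return {"instances": [instance], "parameters": {"dimension": dimension}}


# Request a reduced 512-dimensional text embedding.
body = build_embedding_request(text="red cordless drill", dimension=512)
print(json.dumps(body, indent=2))
```

Lower dimensions reduce storage and search cost per vector, usually at some cost in retrieval quality, so the choice is worth validating on your own data.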
Capabilities
The embeddings support semantic search, recommendation, content moderation, classification, and similarity-based retrieval across text, image, and video. Text and image embeddings share the same dimensionality and semantic space, enabling cross-modal use cases such as text-to-image retrieval.
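Because text and image embeddings live in the same space, text-to-image retrieval reduces to ranking image vectors by their similarity to a text vector. The sketch below illustrates this with cosine similarity on toy three-dimensional vectors. The vectors, file names, and the `rank_images` helper are made up for illustration; real embeddings would come from the service, and cosine similarity is a common choice rather than one the source prescribes.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_images(text_embedding, image_embeddings):
    """Return (image_id, score) pairs sorted from most to least similar."""
    scored = [
        (image_id, cosine_similarity(text_embedding, vector))
        for image_id, vector in image_embeddings.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy stand-ins for a text query embedding and an image gallery.
query = [0.9, 0.1, 0.0]
gallery = {
    "drill.jpg": [0.8, 0.2, 0.1],
    "hammer.jpg": [0.1, 0.9, 0.2],
    "bolt.jpg": [0.0, 0.1, 0.9],
}

ranking = rank_images(query, gallery)
for image_id, score in ranking:
    print(f"{image_id}: {score:.3f}")
```

For large galleries, the exhaustive loop would typically be replaced by an approximate nearest-neighbour index, but the ranking principle is the same.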
Performance Across Datasets
| Dataset | Category | R@1 | R@5 | mAP |
|---|---|---|---|---|
| Stanford Online Products | E-commerce | 76.88% | 87.73% | 56.32% |
| Products-10K | E-commerce | 63.26% | 82.21% | 43.08% |
| DIY v1 | Hardware/DIY | 24.74% | 47.60% | 34.69% |
| Automotive v1 | Automotive | 19.69% | 43.21% | 24.77% |
| Clips-and-Connectors v1 | Industrial | 8.77% | 21.80% | 1.83% |
| Average | | 38.67% | 56.51% | 32.14% |
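
The Average row is consistent with an unweighted mean over the five datasets, which the sketch below recomputes from the table values. It also includes a `recall_at_k` helper using a common definition of R@k (a query counts as a hit if any of its top-k results shares the query's label); the source does not state how its metrics were computed, so treat that definition as an assumption.

```python
def recall_at_k(ranked_labels, query_label, k):
    """1.0 if any of the top-k retrieved labels matches the query, else 0.0.

    Assumed definition; the evaluation protocol is not given in the source.
    """
    return 1.0 if query_label in ranked_labels[:k] else 0.0


def mean(values):
    return sum(values) / len(values)


# Per-dataset scores in table order (percentages).
scores = {
    "R@1": [76.88, 63.26, 24.74, 19.69, 8.77],
    "R@5": [87.73, 82.21, 47.60, 43.21, 21.80],
    "mAP": [56.32, 43.08, 34.69, 24.77, 1.83],
}

# Unweighted (macro) average: each dataset counts equally, regardless of size.
averages = {metric: round(mean(values), 2) for metric, values in scores.items()}
print(averages)

# Toy query: the correct label first appears at rank 3.
toy_ranking = ["clip_b", "clip_c", "clip_a", "clip_d"]
print(recall_at_k(toy_ranking, "clip_a", 1), recall_at_k(toy_ranking, "clip_a", 5))
```

A macro average weights the small, hard datasets as heavily as the large e-commerce ones, which is why the average sits well below the Stanford Online Products and Products-10K scores.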