Stanford Online Products
About This Dataset
Overview
The Stanford Online Products (SOP) dataset is a widely used benchmark for deep metric learning and instance-level image retrieval. It contains 120,053 images representing 22,634 products across 12 broad categories, collected from real e-commerce listings on eBay. The categories are bicycle, cabinet, chair, coffee maker, fan, kettle, lamp, mug, sofa, stapler, table, and toaster, so the images vary substantially in object appearance and imaging conditions.
SOP is particularly challenging because each product is represented by only a handful of images (about 5.3 on average, from 120,053 images over 22,634 products), so models must learn fine-grained, instance-level distinctions rather than rely on broad category differences.
Dataset Composition
The dataset is divided into training and test splits of nearly equal size. The training split consists of the first 11,318 products (59,551 images); the test split contains the remaining 11,316 products (60,502 images).
In this study, we focus exclusively on the test split and evaluate models using a self-retrieval protocol, where each query image must retrieve other images belonging to the same product instance.
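The self-retrieval protocol can be sketched as follows. This is a minimal illustration, not the evaluation code behind the reported numbers: it assumes cosine similarity as the ranking score and assumes the query image itself is removed from the gallery. The function names, toy embeddings, and product labels are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def recall_at_k(embeddings, labels, k):
    """Fraction of queries with at least one same-product image in the top k.

    Assumption: every image is used as a query against all other images,
    and the query itself is excluded from the ranked list.
    """
    n = len(embeddings)
    hits = 0
    for q in range(n):
        ranked = sorted(
            (i for i in range(n) if i != q),
            key=lambda i: cosine(embeddings[q], embeddings[i]),
            reverse=True,
        )
        if any(labels[i] == labels[q] for i in ranked[:k]):
            hits += 1
    return hits / n


# Toy data: two hypothetical products with two images each.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.8, 0.6]]
labels = ["mug_1", "mug_1", "lamp_7", "lamp_7"]

print(recall_at_k(embeddings, labels, k=1))  # 0.75
print(recall_at_k(embeddings, labels, k=3))  # 1.0
```

In the toy data, the last image lies closer to both mug images than to its own product's other image, so it counts as a miss at k=1 and a hit only at k=3. At the scale of the full test split, the pairwise loop would normally be replaced by a batched matrix product or an approximate nearest-neighbour index.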
Dataset Statistics
| No. of | Train | Test |
|---|---|---|
| Images | 59,551 | 60,502 |
| Categories | 12 | 12 |
| Products | 11,318 | 11,316 |
References
- Oh Song, Hyun, et al. "Deep metric learning via lifted structured feature embedding." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
Model Performance on Stanford Online Products
| Rank | Model | Provider | Embedding Size | Input Size (px) | R@1 | R@5 | mAP@20 |
|---|---|---|---|---|---|---|---|
| 1 | GEM v5.1 (ours) | nyris | 768 | 336 | 86.87% | 94.17% | 72.45% |
| 2 | SigLIP2 SO400M | Google | 1152 | 384 | 80.28% | 90.01% | 60.79% |
| 3 | PE-Core L/14 | Meta | 1024 | 336 | 80.09% | 89.83% | 59.46% |
| 4 | Vertex AI Multi-Modal | Google | 1408 | N/A | 76.88% | 87.73% | 56.32% |
| 5 | Gemini Embedding 2 | Google | 3072 | N/A | 75.63% | 86.69% | 55.13% |
| 6 | Cohere Embed v4 | Cohere | 1536 | N/A | 68.00% | 79.76% | 45.09% |
| 7 | DINOv3 ViT-L/16 | Meta | 1024 | 224 | 66.61% | 77.92% | 42.47% |
| 8 | Jina Embeddings v4 | Jina AI | 2048 | Dynamic | 59.48% | 72.24% | 35.15% |
| 9 | Nomic Embed MM 3B | Nomic AI | 2048 | Dynamic | 56.92% | 69.34% | 32.60% |
| 10 | DINOv2 Large | Meta | 1024 | 224 | 56.34% | 67.74% | 31.91% |
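The page does not define how the mAP@20 column is computed, and several normalizations are in common use. The sketch below shows one common variant, in which the average precision for a query is normalized by min(k, number of matching gallery images). This normalization is an assumption rather than the confirmed definition behind the table, and the function names and toy inputs are hypothetical.

```python
def average_precision_at_k(relevance, num_relevant, k):
    """AP@k for a single query.

    relevance: list of booleans, one per ranked gallery item (best first).
    num_relevant: total number of matching images in the gallery.
    Assumption: normalize by min(k, num_relevant).
    """
    if num_relevant == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, is_match in enumerate(relevance[:k], start=1):
        if is_match:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / min(k, num_relevant)


def mean_average_precision_at_k(queries, k):
    """Mean of AP@k over (relevance, num_relevant) pairs, one per query."""
    return sum(average_precision_at_k(rel, n, k) for rel, n in queries) / len(queries)


# Toy example: two queries with hand-written ranked relevance lists.
queries = [
    ([True, False, True], 2),  # AP = (1/1 + 2/3) / 2
    ([False, True], 1),        # AP = (1/2) / 1
]
print(mean_average_precision_at_k(queries, k=20))
```

Unlike R@K, which only asks whether any match appears in the top K, this metric rewards ranking all matching images near the top. That is consistent with the mAP@20 values in the table being lower than the recall values.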
Sample Images
Curated query-reference pairs from this dataset. Each row shows a query image and its matching reference images.