PE-Core L/14

Model Type: Generic
Overall Rank: #4
Avg. R@1: 40.39%
Avg. mAP@20: 32.40%
Embedding Size: 1024
Input Size: 336
Datasets: 5
About This Model
Overview
The Perception Encoder (PE) is a vision foundation model developed by Meta. It is designed as a general visual backbone for retrieval and related vision tasks. In this benchmark, it represents a generic open-weight image encoder rather than a domain-specialized product retrieval model.
Architecture
The L/14-336 variant used in this study is built on a Vision Transformer Large (ViT-L) backbone with a 14×14-pixel patch size and 336×336 image inputs. In this evaluation, it is used as a vision-only encoder that produces 1,024-dimensional image embeddings.
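As a rough illustration of how such an encoder is typically used for retrieval, the sketch below embeds a batch of images and L2-normalizes the 1,024-dimensional outputs. Note that `load_pe_core` is a hypothetical placeholder for the actual checkpoint-loading API, and the resize policy and normalization constants are CLIP-style assumptions, not the model's documented preprocessing pipeline.

```python
import torch
import torch.nn.functional as F
from torchvision import transforms
from PIL import Image

def load_pe_core():
    # Hypothetical loader: stands in for the real PE-Core L/14 loading code.
    # Assumed: a vision-only encoder mapping 336x336 RGB images to 1024-d vectors.
    raise NotImplementedError("replace with the actual PE-Core L/14 loader")

preprocess = transforms.Compose([
    transforms.Resize(336),                # assumed resize policy
    transforms.CenterCrop(336),            # match the 336x336 input size
    transforms.ToTensor(),
    transforms.Normalize(                  # assumed (CLIP-style) constants
        mean=(0.48145466, 0.4578275, 0.40821073),
        std=(0.26862954, 0.26130258, 0.27577711),
    ),
])

@torch.no_grad()
def embed(model, paths):
    batch = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in paths])
    feats = model(batch)                   # (N, 1024) image embeddings
    return F.normalize(feats, dim=-1)      # unit-normalize for cosine retrieval
```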
Capabilities
The Perception Encoder is evaluated here as a general-purpose visual representation model for image retrieval, providing a comparison point against both specialized product models and managed multimodal embedding APIs.
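In this setup, retrieval reduces to nearest-neighbor search over the embeddings. A minimal sketch, assuming query and gallery matrices are already unit-normalized as in the embedding example above:

```python
import torch

def retrieve(query_emb: torch.Tensor, gallery_emb: torch.Tensor, k: int = 5):
    """Return the indices of the top-k gallery items for each query.

    Assumes both tensors are L2-normalized, so the dot product
    equals cosine similarity.
    """
    sims = query_emb @ gallery_emb.T       # (Q, G) cosine similarities
    return sims.topk(k, dim=-1).indices    # (Q, k) gallery indices
```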
Performance Across Datasets
| Dataset | Category | R@1 | R@5 | mAP@20 |
|---|---|---|---|---|
| Stanford Online Products | E-commerce | 80.09% | 89.83% | 59.46% |
| Products-10K | E-commerce | 65.65% | 83.47% | 41.38% |
| DIY v1 | Hardware/DIY | 25.26% | 49.34% | 34.28% |
| Automotive v1 | Automotive | 18.82% | 37.63% | 24.53% |
| Clips-and-Connectors v1 | Industrial | 12.14% | 26.85% | 2.37% |
| Average | | 40.39% | 57.42% | 32.40% |
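As a sanity check, the Average row is consistent with an unweighted macro-average over the five datasets; the snippet below reproduces it from the per-dataset values copied out of the table.

```python
# Per-dataset scores, in table order (SOP, Products-10K, DIY, Automotive, Clips).
r1    = [80.09, 65.65, 25.26, 18.82, 12.14]
r5    = [89.83, 83.47, 49.34, 37.63, 26.85]
map20 = [59.46, 41.38, 34.28, 24.53, 2.37]

for name, vals in [("R@1", r1), ("R@5", r5), ("mAP@20", map20)]:
    print(f"Avg. {name}: {sum(vals) / len(vals):.2f}%")
# Avg. R@1: 40.39%, Avg. R@5: 57.42%, Avg. mAP@20: 32.40%
```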