
PE-Core L/14

Vision foundation model trained with contrastive vision-language objective.

Meta · Open Source · Vision Only

Model Type: Generic
Overall Rank: #4
Avg. R@1: 40.39%
Avg. mAP@20: 32.40%
Embedding Size: 1024
Input Size: 336
Datasets: 5

About This Model

Overview

The Perception Encoder (PE) is a vision foundation model developed by Meta. It is designed as a general visual backbone for retrieval and related vision tasks. In this benchmark, it represents a generic open-weight image encoder rather than a domain-specialized product retrieval model.

Architecture

The L/14 variant used in this study is a Vision Transformer Large (ViT-L) with a 14×14 patch size and 336×336 image inputs. In this evaluation it serves as a vision-only encoder producing 1,024-dimensional image embeddings.
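As a rough illustration of how such 1,024-dimensional embeddings are used for retrieval, the sketch below ranks a gallery by cosine similarity to a query embedding. The random vectors stand in for real encoder outputs; nothing here uses the actual PE-Core API.

```python
import numpy as np

EMBED_DIM = 1024  # embedding size reported for PE-Core L/14 on this page

def rank_gallery(query_emb: np.ndarray, gallery_embs: np.ndarray) -> np.ndarray:
    """Rank gallery items by cosine similarity to the query.

    query_emb: (EMBED_DIM,) raw embedding from the encoder.
    gallery_embs: (N, EMBED_DIM) raw gallery embeddings.
    Returns gallery indices sorted from most to least similar.
    """
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    sims = g @ q  # cosine similarity after L2 normalization
    return np.argsort(-sims)

# Toy example: random vectors standing in for real image embeddings
rng = np.random.default_rng(0)
gallery = rng.normal(size=(5, EMBED_DIM))
query = gallery[2] + 0.01 * rng.normal(size=EMBED_DIM)  # near-duplicate of item 2
ranking = rank_gallery(query, gallery)
print(ranking[0])  # nearest gallery item should be index 2
```

L2-normalizing both sides makes the dot product equal to cosine similarity, which is the standard scoring for contrastively trained encoders like PE.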

Capabilities

Perception Encoder is evaluated here as a general-purpose visual representation model for image retrieval, providing a comparison point against both specialized product models and managed multimodal embedding APIs.

Performance Across Datasets

Dataset                     Category       R@1      R@5      mAP@20
Stanford Online Products    E-commerce     80.09%   89.83%   59.46%
Products-10K                E-commerce     65.65%   83.47%   41.38%
DIY v1                      Hardware/DIY   25.26%   49.34%   34.28%
Automotive v1               Automotive     18.82%   37.63%   24.53%
Clips-and-Connectors v1     Industrial     12.14%   26.85%    2.37%
Average                                    40.39%   57.42%   32.40%
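The R@1, R@5, and mAP columns above come from per-query ranked retrieval lists. A minimal sketch of these metrics for a single query is given below; the helper names are hypothetical, and the mAP normalization (dividing by min(#relevant, k)) is one common convention that is assumed here, not confirmed by the benchmark.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """1.0 if any relevant item appears in the top-k results, else 0.0."""
    return float(any(r in relevant_ids for r in ranked_ids[:k]))

def average_precision_at_k(ranked_ids, relevant_ids, k):
    """AP@k: mean precision at each rank where a relevant item is retrieved,
    normalized by min(#relevant, k)."""
    hits, precisions = 0, []
    for i, r in enumerate(ranked_ids[:k], start=1):
        if r in relevant_ids:
            hits += 1
            precisions.append(hits / i)
    denom = min(len(relevant_ids), k)
    return sum(precisions) / denom if denom else 0.0

# One query: relevant items are "a" and "d"; the system returned this top-5 list
ranked = ["b", "a", "c", "d", "e"]
relevant = {"a", "d"}
r1 = recall_at_k(ranked, relevant, 1)                # 0.0 (top-1 is "b")
r5 = recall_at_k(ranked, relevant, 5)                # 1.0
ap20 = average_precision_at_k(ranked, relevant, 20)  # (1/2 + 2/4) / 2 = 0.5
print(r1, r5, ap20)
```

The dataset-level numbers in the table would then be these per-query values averaged over all queries, and mAP@20 is that average of AP@20 scores.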