PE-Core L/14

Model Type: Generic
Overall Rank: #4
Avg. R@1: 40.39%
Avg. mAP@20: 32.40%
Embedding Size: 1024
Input Size: 336
Datasets: 5
About This Model
Overview
The Perception Encoder (PE) is a vision foundation model developed by Meta. It is designed as a general visual backbone for retrieval and related vision tasks. In this benchmark, it represents a generic open-weight image encoder rather than a domain-specialized product retrieval model.
Architecture
The L/14-336 variant used in this study is built on a Vision Transformer Large (ViT-L) backbone with a 14×14-pixel patch size and 336×336 image inputs. In this evaluation, it is used as a vision-only encoder that produces 1,024-dimensional image embeddings.
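As a rough illustration of how such an encoder is typically used for retrieval, the sketch below embeds a batch of images and L2-normalizes the 1,024-dimensional outputs. Note that `load_pe_core` is a hypothetical placeholder for the actual checkpoint-loading API, and the resize policy and normalization constants are CLIP-style assumptions, not the model's documented preprocessing pipeline.

```python
import torch
import torch.nn.functional as F
from torchvision import transforms
from PIL import Image

def load_pe_core():
    # Hypothetical loader: stands in for the real PE-Core L/14 loading code.
    # Assumed: a vision-only encoder mapping 336x336 RGB images to 1024-d vectors.
    raise NotImplementedError("replace with the actual PE-Core L/14 loader")

preprocess = transforms.Compose([
    transforms.Resize(336),                # assumed resize policy
    transforms.CenterCrop(336),            # match the 336x336 input size
    transforms.ToTensor(),
    transforms.Normalize(                  # assumed (CLIP-style) constants
        mean=(0.48145466, 0.4578275, 0.40821073),
        std=(0.26862954, 0.26130258, 0.27577711),
    ),
])

@torch.no_grad()
def embed(model, paths):
    batch = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in paths])
    feats = model(batch)                   # (N, 1024) image embeddings
    return F.normalize(feats, dim=-1)      # unit-normalize for cosine retrieval
```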
Capabilities
The Perception Encoder is evaluated here as a general-purpose visual representation model for image retrieval, providing a comparison point against both specialized product models and managed multimodal embedding APIs.
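In this setup, retrieval reduces to nearest-neighbor search over the embeddings. A minimal sketch, assuming query and gallery matrices are already unit-normalized as in the embedding example above:

```python
import torch

def retrieve(query_emb: torch.Tensor, gallery_emb: torch.Tensor, k: int = 5):
    """Return the indices of the top-k gallery items for each query.

    Assumes both tensors are L2-normalized, so the dot product
    equals cosine similarity.
    """
    sims = query_emb @ gallery_emb.T       # (Q, G) cosine similarities
    return sims.topk(k, dim=-1).indices    # (Q, k) gallery indices
```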
Performance Across Datasets
| Dataset | Category | R@1 | R@5 | mAP@20 |
|---|---|---|---|---|
| Stanford Online Products | E-commerce | 80.09% | 89.83% | 59.46% |
| Products-10K | E-commerce | 65.65% | 83.47% | 41.38% |
| DIY v1 | Hardware/DIY | 25.26% | 49.34% | 34.28% |
| Automotive v1 | Automotive | 18.82% | 37.63% | 24.53% |
| Clips-and-Connectors v1 | Industrial | 12.14% | 26.85% | 2.37% |
| Average | | 40.39% | 57.42% | 32.40% |
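As a sanity check, the Average row is consistent with an unweighted macro-average over the five datasets; the snippet below reproduces it from the per-dataset values copied out of the table.

```python
# Per-dataset scores, in table order (SOP, Products-10K, DIY, Automotive, Clips).
r1    = [80.09, 65.65, 25.26, 18.82, 12.14]
r5    = [89.83, 83.47, 49.34, 37.63, 26.85]
map20 = [59.46, 41.38, 34.28, 24.53, 2.37]

for name, vals in [("R@1", r1), ("R@5", r5), ("mAP@20", map20)]:
    print(f"Avg. {name}: {sum(vals) / len(vals):.2f}%")
# Avg. R@1: 40.39%, Avg. R@5: 57.42%, Avg. mAP@20: 32.40%
```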