Perception Encoder: The best visual embeddings are not at the output of the network
Abstract
We introduce Perception Encoder (PE), a state-of-the-art encoder for image and video understanding trained via simple vision-language learning. Traditionally, vision encoders have relied on a variety of pretraining objectives, each tailored to specific downstream tasks such as classification, captioning, or localization. Surprisingly, after scaling our carefully tuned image pretraining recipe and refining with our robust video data engine, we find that contrastive vision-language training alone can produce strong, general embeddings for all of these downstream tasks. There is only one caveat: these embeddings are hidden within the intermediate layers of the network. To draw them out, we introduce two alignment methods: language alignment for multimodal language modeling, and spatial alignment for dense prediction. Together with the core contrastive checkpoint, our PE family of models achieves state-of-the-art performance on a wide variety of tasks, including zero-shot image and video classification and retrieval; document, image, and video Q&A; and spatial tasks such as detection, depth estimation, and tracking. To foster further research, we are releasing our models, code, and a novel dataset of synthetically and human-annotated videos.
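To make the "hidden within the intermediate layers" claim concrete, below is a minimal sketch of how one might capture per-layer embeddings from a ViT-style vision encoder using PyTorch forward hooks, then evaluate each layer separately (e.g., with a linear probe per task). The `blocks` attribute and overall module layout are assumptions in line with common timm/ViT implementations, not PE's actual API.

```python
# Sketch: probe intermediate-layer embeddings of a CLIP-style vision
# transformer. Names like `vision_encoder.blocks` are illustrative
# assumptions, not PE's released interface.
import torch
import torch.nn as nn

def collect_layer_embeddings(vision_encoder: nn.Module, images: torch.Tensor):
    """Run one forward pass and capture the output of every transformer block."""
    features = []
    hooks = []
    # Assumption: the encoder exposes its transformer blocks as `blocks`,
    # as in typical ViT implementations; PE's layout may differ.
    for block in vision_encoder.blocks:
        hooks.append(block.register_forward_hook(
            lambda _mod, _inp, out: features.append(out.detach())
        ))
    with torch.no_grad():
        vision_encoder(images)
    for h in hooks:
        h.remove()
    return features  # one [batch, tokens, dim] tensor per layer
```

Pooling each layer's tokens and fitting a task-specific linear probe on the result is one way to observe the paper's claim that the best-performing layer is often not the last one.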
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- Breaking the Encoder Barrier for Seamless Video-Language Understanding (2025)
- TULIP: Towards Unified Language-Image Pretraining (2025)
- DiffCLIP: Differential Attention Meets CLIP (2025)
- The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer (2025)
- Scaling Language-Free Visual Representation Learning (2025)
- LV-MAE: Learning Long Video Representations through Masked-Embedding Autoencoders (2025)
- MIEB: Massive Image Embedding Benchmark (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space.
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot recommend
Models citing this paper: 7
Datasets citing this paper: 1
Spaces citing this paper: 0