Update README.md
README.md CHANGED
@@ -1,11 +1,10 @@
---
license: apache-2.0
---
-
# Model Details

Perception Encoder (PE) is a state-of-the-art encoder for image and video understanding trained via simple vision-language learning. It was introduced in "[Perception Encoder: The best visual embeddings
-are hidden inside the network](https://TBC)".
+are not at the output of the network](https://ai.meta.com/research/publications/perception-encoder-the-best-visual-embeddings-are-not-at-the-output-of-the-network/)".

**Model Developer**: Meta

@@ -16,42 +15,26 @@ are hidden inside the network](https://TBC)".

| Scale | Tower | Params | Width | Depth | MLP | Heads | CLIP Dim | Resolution | Patch Size | Text Context Length |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
-| **B** | Vision | 0.09B | 768 | 12 | 3072 | 12 | 1024 |
-| | Text | 0.31B | 1024 | 24 | 4096 | 16 | 1024 |
+| **B** | Vision | 0.09B | 768 | 12 | 3072 | 12 | 1024 | 224 | 16 | 32 |
+| | Text | 0.31B | 1024 | 24 | 4096 | 16 | 1024 | 224 | 16 | 32 |
| **L** | Vision | 0.32B | 1024 | 24 | 4096 | 16 | 1024 | 336 | 14 | 32 |
| | Text | 0.31B | 1024 | 24 | 4096 | 16 | 1024 | 336 | 14 | 32 |
-| **G** | Vision | 1.88B | 1536 | 50 | 8960 | 16 | 1280 |
-| | Text | 0.47B | 1280 | 24 | 5120 | 20 | 1280 |
+| **G** | Vision | 1.88B | 1536 | 50 | 8960 | 16 | 1280 | 448 | 14 | 72 |
+| | Text | 0.47B | 1280 | 24 | 5120 | 20 | 1280 | 448 | 14 | 72 |


# How to use

## PE codebase
-We provide the pretraining code in https://github.com/
+We provide the pretraining code in https://github.com/facebookresearch/perception_models

You can find more details in the GitHub repo.

-
-# Evaluation
-We evaluate the pretrained MobileLLM models on Zero-shot Common Sense Reasoning tasks
-
-Here is the table in Markdown format:
-
-## Zero-Shot Image Results
-
-<img src="https://huggingface.co/facebook/PE-Core-G14-448/resolve/main/docs/pe_zeroshot_image.png" style="width: 100%; margin: 0;" />
-
-## Zero-Shot Video Results
-
-<img src="https://huggingface.co/facebook/PE-Core-G14-448/resolve/main/docs/pe_zeroshot_video.png" style="width: 90%; margin: 0" />
-
-
# Citation
-
If you find our code useful for your research, please consider citing:

@article{PE,
-title={Perception Encoder},
+title={Perception Encoder: The best visual embeddings are not at the output of the network},
author={},
journal={arXiv:xxx.xxxxx},
year={2025}
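The "How to use" section of this card only links out to the PE codebase, so here is a minimal, hypothetical sketch of CLIP-style zero-shot classification with a PE encoder. The module paths and helper names below (`core.vision_encoder.pe`, `pe.CLIP.from_config`, `transforms.get_image_transform`, `transforms.get_text_tokenizer`) are assumptions about the repo's interface, not confirmed API; the GitHub repo is the authoritative reference.

```python
# Hypothetical usage sketch for the PE codebase; all module paths and
# helpers below are assumptions, not confirmed API.
import torch
from PIL import Image

import core.vision_encoder.pe as pe                   # assumed module path
import core.vision_encoder.transforms as transforms   # assumed module path

device = "cuda" if torch.cuda.is_available() else "cpu"

# Assumed config-name loader; "PE-Core-L14-336" corresponds to the L row
# of the config table (336px resolution, patch size 14).
model = pe.CLIP.from_config("PE-Core-L14-336", pretrained=True).to(device).eval()

preprocess = transforms.get_image_transform(model.image_size)    # assumed helper
tokenizer = transforms.get_text_tokenizer(model.context_length)  # assumed helper

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
text = tokenizer(["a diagram", "a dog", "a cat"]).to(device)

with torch.no_grad():
    # Assumed to return normalized image/text embeddings plus the learned
    # temperature (logit scale), as in standard CLIP implementations.
    image_features, text_features, logit_scale = model(image, text)
    probs = (logit_scale * image_features @ text_features.T).softmax(dim=-1)

print(probs)  # per-label zero-shot probabilities for the image
```

A loader keyed by config name would resolve the per-scale resolution and text context length from the table automatically, which is why the sketch reads them off the model rather than hard-coding them.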