Update README.md
README.md CHANGED
```diff
@@ -3589,19 +3589,28 @@ model-index:
     value: 87.7629629899278
 ---
 
+<p align="center">
+<iframe src="images/gme_logo.pdf" width="auto" height="auto"></iframe>
+</p>
+
 <p align="center"><b>GME: General Multimodal Embeddings</b></p>
 
-## GME-Qwen2VL-
+## GME-Qwen2VL-2B
+
+We are excited to present the `GME-Qwen2VL` series of unified **multimodal embedding models**,
+which are based on the advanced [Qwen2-VL](https://huggingface.co/collections/Qwen/qwen2-vl-66cee7455501d7126940800d) multimodal large language models (MLLMs).
 
-which are based on advanced [Qwen2-VL](https://huggingface.co/collections/Qwen/qwen2-vl-66cee7455501d7126940800d) multimodal large language models (MLLMs).
+The `GME` models support three types of input: **text**, **image**, and **image-text pair**, all of which produce universal vector representations with strong retrieval performance.
 
+**Key Enhancements of GME Models**:
 
+- **Unified Multimodal Representation**: GME models can process both single-modal and combined-modal inputs, resulting in a unified vector representation.
+This enables versatile retrieval scenarios (Any2Any Search), supporting tasks such as text retrieval, image retrieval from text, and image-to-image searches.
+- **High Performance**: Achieves state-of-the-art (SOTA) results on our universal multimodal retrieval benchmark (**UMRB**) and demonstrates strong evaluation scores on the Massive Text Embedding Benchmark (**MTEB**).
 - **Dynamic Image Resolution**: Benefiting from `Qwen2-VL` and our training data, GME models support dynamic resolution image input.
+- **Strong Visual Retrieval Performance**: Enhanced by the Qwen2-VL model series, our models excel in visual document retrieval tasks that require a nuanced understanding of document screenshots.
+This capability is particularly beneficial for complex document understanding scenarios,
+such as multimodal retrieval-augmented generation (RAG) applications focused on academic papers.
 
 **Developed by**: Tongyi Lab, Alibaba Group
 
```
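The updated section states that the same model embeds text, images, and image-text pairs into one vector space, which is what makes Any2Any retrieval possible. Below is a minimal usage sketch of that idea. It assumes the `GmeQwen2VL` wrapper class and its `get_text_embeddings`, `get_image_embeddings`, and `get_fused_embeddings` methods, which the full model card distributes alongside this README; the wrapper, its method names, and the example inputs are assumptions taken from that context, not part of this diff.

```python
# Sketch only: GmeQwen2VL and its methods are assumed from the model card's
# companion inference script (gme_inference.py); they are not defined in this diff.
from gme_inference import GmeQwen2VL

gme = GmeQwen2VL("Alibaba-NLP/gme-Qwen2-VL-2B-Instruct")

texts = ["What is the weather like today?"]
images = ["https://example.com/weather_chart.png"]  # placeholder image path/URL

# The same model produces vectors for all three supported input types.
text_emb = gme.get_text_embeddings(texts=texts)                   # text only
image_emb = gme.get_image_embeddings(images=images)               # image only
pair_emb = gme.get_fused_embeddings(texts=texts, images=images)   # image-text pair

# Assuming the returned embeddings are normalized, a dot product acts as
# cosine similarity, enabling Any2Any search (text->image, image->image,
# text->image-text pair, and so on).
print((text_emb * image_emb).sum(dim=-1))
```

Because every input type lands in the same embedding space, a single index can serve text retrieval, text-to-image retrieval, and image-to-image retrieval without separate encoders.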