thenlper committed
Commit c013eb4
1 Parent(s): 0f07553

Update README.md

Files changed (1)
  1. README.md +16 -7
README.md CHANGED
@@ -3589,19 +3589,28 @@ model-index:
       value: 87.7629629899278
 ---
 
+<p align="center">
+<iframe src="images/gme_logo.pdf" width="auto" height="auto"></iframe>
+</p>
+
 <p align="center"><b>GME: General Multimodal Embeddings</b></p>
 
-## GME-Qwen2VL-7B
+## GME-Qwen2VL-2B
+
+We are excited to present the `GME-Qwen2VL` series of unified **multimodal embedding models**,
+which are based on the advanced [Qwen2-VL](https://huggingface.co/collections/Qwen/qwen2-vl-66cee7455501d7126940800d) multimodal large language models (MLLMs).
 
-We are excited to present `GME-Qwen2VL` models, our first generation **multimodal embedding models** for text and images,
-which are based on advanced [Qwen2-VL](https://huggingface.co/collections/Qwen/qwen2-vl-66cee7455501d7126940800d) multimodal large language models (MLLMs).
+The `GME` models support three types of input: **text**, **image**, and **image-text pair**, all of which can produce universal vector representations and have powerful retrieval performance.
 
-The `GME-Qwen2VL` models support three input forms: **text**, **image**, and **image-text pair**, all of which can produce universal vector representations and have powerful retrieval performance.
+**Key Enhancements of GME Models**:
 
-- **High Performance**: Achieves state-of-the-art (SOTA) results in our universal multimodal retrieval benchmark (**UMRB**) and strong **MTEB** evaluation scores.
+- **Unified Multimodal Representation**: GME models can process both single-modal and combined-modal inputs, resulting in a unified vector representation.
+This enables versatile retrieval scenarios (Any2Any Search), supporting tasks such as text retrieval, image retrieval from text, and image-to-image searches.
+- **High Performance**: Achieves state-of-the-art (SOTA) results in our universal multimodal retrieval benchmark (**UMRB**) and demonstrates strong evaluation scores on the Multimodal Textual Evaluation Benchmark (**MTEB**).
 - **Dynamic Image Resolution**: Benefiting from `Qwen2-VL` and our training data, GME models support dynamic resolution image input.
-Our models are able to perform leadingly in the **visual document retrieval** task which requires fine-grained understanding of document screenshots.
-You can control to balance performance and efficiency.
+- **Strong Visual Retrieval Performance**: Enhanced by the Qwen2-VL model series, our models excel in visual document retrieval tasks that require a nuanced understanding of document screenshots.
+This capability is particularly beneficial for complex document understanding scenarios,
+such as multimodal retrieval-augmented generation (RAG) applications focused on academic papers.
 
 **Developed by**: Tongyi Lab, Alibaba Group
 
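
The diff above describes a model that maps three input types (text, image, and image-text pair) into a single embedding space. As orientation for readers of this commit, the sketch below shows roughly how such a unified space is queried. Everything model-specific here is an assumption rather than part of this commit: the `gme_inference` module, the `GmeQwen2VL` class, its `get_text_embeddings` / `get_image_embeddings` / `get_fused_embeddings` methods, and the `Alibaba-NLP/gme-Qwen2-VL-2B-Instruct` repo id follow the style of the model card's companion inference script and should be verified against the repository.

```python
# Minimal usage sketch (assumptions noted above): the GmeQwen2VL wrapper, its
# method names, and the repo id are hypothetical here; check the model repo's
# inference script before relying on them.
import torch.nn.functional as F

from gme_inference import GmeQwen2VL  # assumed helper distributed with the model repo

gme = GmeQwen2VL("Alibaba-NLP/gme-Qwen2-VL-2B-Instruct")  # assumed repo id

texts = [
    "What does the chart on page 3 show?",
    "Quarterly revenue grew 12% year over year.",
]
images = ["doc_page_3.png", "doc_page_7.png"]  # placeholder paths to document screenshots

# Single-modal inputs: text-only and image-only embeddings live in one vector space.
text_emb = gme.get_text_embeddings(texts=texts)     # assumed to return torch.Tensor
image_emb = gme.get_image_embeddings(images=images)

# Combined-modal input: each image-text pair is fused into a single vector,
# which is what enables Any2Any retrieval (text->image, image->image, pair->text, ...).
fused_emb = gme.get_fused_embeddings(texts=texts, images=images)

# Normalise explicitly so dot products are cosine similarities, whether or not
# the wrapper already returns unit-length vectors.
text_emb = F.normalize(text_emb, dim=-1)
image_emb = F.normalize(image_emb, dim=-1)

print("fused:", fused_emb.shape)                      # one vector per image-text pair
print("text -> image similarity:\n", text_emb @ image_emb.T)
```

On the dynamic-resolution bullet: the underlying Qwen2-VL processor accepts `min_pixels` / `max_pixels` limits on image tokens, which is the usual knob for trading retrieval quality against compute; whether the GME wrapper exposes these directly is another detail to confirm in the inference script.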