Update README.md
README.md CHANGED
```diff
@@ -3589,19 +3589,28 @@ model-index:
     value: 87.7629629899278
 ---
 
+<p align="center">
+<iframe src="images/gme_logo.pdf" width="auto" height="auto"></iframe>
+</p>
+
 <p align="center"><b>GME: General Multimodal Embeddings</b></p>
 
-## GME-Qwen2VL-
+## GME-Qwen2VL-2B
+
+We are excited to present the `GME-Qwen2VL` series of unified **multimodal embedding models**,
+which are based on the advanced [Qwen2-VL](https://huggingface.co/collections/Qwen/qwen2-vl-66cee7455501d7126940800d) multimodal large language models (MLLMs).
 
-which are based on advanced [Qwen2-VL](https://huggingface.co/collections/Qwen/qwen2-vl-66cee7455501d7126940800d) multimodal large language models (MLLMs).
+The `GME` models support three types of input: **text**, **image**, and **image-text pair**, all of which produce universal vector representations with strong retrieval performance.
 
+**Key Enhancements of GME Models**:
 
+- **Unified Multimodal Representation**: GME models can process both single-modal and combined-modal inputs, resulting in a unified vector representation.
+This enables versatile retrieval scenarios (Any2Any Search), supporting tasks such as text retrieval, image retrieval from text, and image-to-image searches.
+- **High Performance**: Achieves state-of-the-art (SOTA) results on our universal multimodal retrieval benchmark (**UMRB**) and demonstrates strong evaluation scores on the Massive Text Embedding Benchmark (**MTEB**).
 - **Dynamic Image Resolution**: Benefiting from `Qwen2-VL` and our training data, GME models support dynamic resolution image input.
+- **Strong Visual Retrieval Performance**: Enhanced by the Qwen2-VL model series, our models excel in visual document retrieval tasks that require a nuanced understanding of document screenshots.
+This capability is particularly beneficial for complex document understanding scenarios,
+such as multimodal retrieval-augmented generation (RAG) applications focused on academic papers.
 
 **Developed by**: Tongyi Lab, Alibaba Group
 
```
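The updated section states that the same model embeds text, images, and image-text pairs into one vector space, which is what makes Any2Any retrieval possible. Below is a minimal usage sketch of that idea. It assumes the `GmeQwen2VL` wrapper class and its `get_text_embeddings`, `get_image_embeddings`, and `get_fused_embeddings` methods, which the full model card distributes alongside this README; the wrapper, its method names, and the example inputs are assumptions taken from that context, not part of this diff.

```python
# Sketch only: GmeQwen2VL and its methods are assumed from the model card's
# companion inference script (gme_inference.py); they are not defined in this diff.
from gme_inference import GmeQwen2VL

gme = GmeQwen2VL("Alibaba-NLP/gme-Qwen2-VL-2B-Instruct")

texts = ["What is the weather like today?"]
images = ["https://example.com/weather_chart.png"]  # placeholder image path/URL

# The same model produces vectors for all three supported input types.
text_emb = gme.get_text_embeddings(texts=texts)                   # text only
image_emb = gme.get_image_embeddings(images=images)               # image only
pair_emb = gme.get_fused_embeddings(texts=texts, images=images)   # image-text pair

# Assuming the returned embeddings are normalized, a dot product acts as
# cosine similarity, enabling Any2Any search (text->image, image->image,
# text->image-text pair, and so on).
print((text_emb * image_emb).sum(dim=-1))
```

Because every input type lands in the same embedding space, a single index can serve text retrieval, text-to-image retrieval, and image-to-image retrieval without separate encoders.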