Update README.md
README.md (CHANGED):

@@ -1,12 +1,19 @@
 ---
-
+base_model:
+- Qwen/Qwen2-VL-2B-Instruct
 datasets:
 - rp-yu/VPT_Datasets
 language:
 - en
+library_name: transformers
+license: apache-2.0
 metrics:
 - accuracy
-
-
-
-
+pipeline_tag: image-text-to-text
+---
+
+# Introducing Visual Perception Token into Multimodal Large Language Model
+
+This repository contains models based on the paper [Introducing Visual Perception Token into Multimodal Large Language Model](https://arxiv.org/abs/2502.17425). These models utilize Visual Perception Tokens to enhance the visual perception capabilities of multimodal large language models (MLLMs).
+
+Code: https://github.com/yu-rp/VisualPerceptionToken