Qwen2ViT-600M
Introduction
This repo contains the Qwen2ViT-600M vision encoder used to train the EMOVA series of models. Unlike traditional Vision Transformers that require a pre-defined input size, Qwen2ViT-600M is pre-trained with dynamic input resolutions and thus adapts to images of varying sizes and aspect ratios. This checkpoint is extracted from the Qwen2-VL series of models.
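Below is a minimal sketch of loading the checkpoint with Hugging Face transformers. It assumes the repo's custom modeling code can be loaded via AutoModel with trust_remote_code=True and that bf16 weights are appropriate; the exact forward-pass interface (patchified pixels plus a grid descriptor, as in Qwen2-VL) is not shown here.

# Minimal sketch (assumption, not from the official docs): load the standalone
# ViT checkpoint with transformers. The repo ships custom modeling code, so
# trust_remote_code=True is assumed to be required.
import torch
from transformers import AutoModel

vision_tower = AutoModel.from_pretrained(
    "Emova-ollm/qwen2vit600m",   # pre-trained ViT extracted from Qwen2-VL
    torch_dtype=torch.bfloat16,  # assumption: load in bf16 as in Qwen2-VL
    trust_remote_code=True,      # custom modeling code lives in the repo
)
print(sum(p.numel() for p in vision_tower.parameters()))  # roughly 600M parameters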
Usage
To train EMOVA with Qwen2ViT-600M, create a new model config and set its mm_vision_tower parameters as follows. An example is provided here. See our GitHub repo for more details on training EMOVA. A sketch of where this block sits in a full model config is shown after the snippet.
mm_vision_tower=dict(
    type='Qwen2VisionTower',  # wrapper class for the EMOVA vision encoder
    pretrained_model_name_or_path="Emova-ollm/qwen2vit600m/",  # HuggingFace repo of the pre-trained ViT
    trainable=True,  # True trains the ViT, False freezes it
),
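For context, here is a hypothetical illustration of where the mm_vision_tower block nests inside a larger EMOVA model config. The surrounding keys (type, mm_projector, llm) are placeholders, not the actual EMOVA config schema; refer to the example config linked above for the real layout.

# Hypothetical outer config -- placeholder keys only, for illustration.
model = dict(
    type='EMOVA',  # placeholder wrapper name
    mm_vision_tower=dict(
        type='Qwen2VisionTower',
        pretrained_model_name_or_path="Emova-ollm/qwen2vit600m/",
        trainable=True,  # set False to freeze the ViT during training
    ),
    # ... language model, projector, and training settings go here ...
)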
Citation
@article{chen2024emova,
  title={Emova: Empowering language models to see, hear and speak with vivid emotions},
  author={Chen, Kai and Gou, Yunhao and Huang, Runhui and Liu, Zhili and Tan, Daxin and Xu, Jing and Wang, Chunwei and Zhu, Yi and Zeng, Yihan and Yang, Kuo and others},
  journal={arXiv preprint arXiv:2409.18042},
  year={2024}
}

@article{Qwen2-VL,
  title={Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution},
  author={Wang, Peng and Bai, Shuai and Tan, Sinan and Wang, Shijie and Fan, Zhihao and Bai, Jinze and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Fan, Yang and Dang, Kai and Du, Mengfei and Ren, Xuancheng and Men, Rui and Liu, Dayiheng and Zhou, Chang and Zhou, Jingren and Lin, Junyang},
  journal={arXiv preprint arXiv:2409.12191},
  year={2024}
}