Qwen2ViT-600M
Introduction
This repo contains the Qwen2ViT-600M vision encoder used to train the EMOVA series of models. Unlike traditional Vision Transformers that require a pre-defined input size, Qwen2ViT-600M is pre-trained with dynamic input resolutions and thus adapts to images of varying sizes and aspect ratios. This checkpoint is extracted from the Qwen2-VL series of models.
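Below is a minimal sketch of loading the checkpoint with Hugging Face transformers. It assumes the repo's custom modeling code can be loaded via AutoModel with trust_remote_code=True and that bf16 weights are appropriate; the exact forward-pass interface (patchified pixels plus a grid descriptor, as in Qwen2-VL) is not shown here.

# Minimal sketch (assumption, not from the official docs): load the standalone
# ViT checkpoint with transformers. The repo ships custom modeling code, so
# trust_remote_code=True is assumed to be required.
import torch
from transformers import AutoModel

vision_tower = AutoModel.from_pretrained(
    "Emova-ollm/qwen2vit600m",   # pre-trained ViT extracted from Qwen2-VL
    torch_dtype=torch.bfloat16,  # assumption: load in bf16 as in Qwen2-VL
    trust_remote_code=True,      # custom modeling code lives in the repo
)
print(sum(p.numel() for p in vision_tower.parameters()))  # roughly 600M parameters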
Usage
To train EMOVA with Qwen2ViT-600M, create a new model config and set its mm_vision_tower parameters as follows. An example is provided here. See our GitHub repo for more details on training EMOVA. A sketch of where this block sits in a full model config is shown after the snippet.
mm_vision_tower=dict(
    type='Qwen2VisionTower',  # wrapper class for the EMOVA vision encoder
    pretrained_model_name_or_path="Emova-ollm/qwen2vit600m/",  # HuggingFace repo of the pre-trained ViT
    trainable=True,  # True trains the ViT, False freezes it
),
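For context, here is a hypothetical illustration of where the mm_vision_tower block nests inside a larger EMOVA model config. The surrounding keys (type, mm_projector, llm) are placeholders, not the actual EMOVA config schema; refer to the example config linked above for the real layout.

# Hypothetical outer config -- placeholder keys only, for illustration.
model = dict(
    type='EMOVA',  # placeholder wrapper name
    mm_vision_tower=dict(
        type='Qwen2VisionTower',
        pretrained_model_name_or_path="Emova-ollm/qwen2vit600m/",
        trainable=True,  # set False to freeze the ViT during training
    ),
    # ... language model, projector, and training settings go here ...
)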
Citation
@article{chen2024emova,
  title={Emova: Empowering language models to see, hear and speak with vivid emotions},
  author={Chen, Kai and Gou, Yunhao and Huang, Runhui and Liu, Zhili and Tan, Daxin and Xu, Jing and Wang, Chunwei and Zhu, Yi and Zeng, Yihan and Yang, Kuo and others},
  journal={arXiv preprint arXiv:2409.18042},
  year={2024}
}

@article{Qwen2-VL,
  title={Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution},
  author={Wang, Peng and Bai, Shuai and Tan, Sinan and Wang, Shijie and Fan, Zhihao and Bai, Jinze and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Fan, Yang and Dang, Kai and Du, Mengfei and Ren, Xuancheng and Men, Rui and Liu, Dayiheng and Zhou, Chang and Zhou, Jingren and Lin, Junyang},
  journal={arXiv preprint arXiv:2409.12191},
  year={2024}
}