# SoM-LLaVA Model Card
LLaVA-v1.5 mixed-trained with SoM-style data (QA + listing).
The model can understand tag-style visual prompts on the image (e.g., "what is the object tagged with ID 9?"), and also achieves improved performance on MLLM benchmarks (POPE, MME, SEED, MM-Vet, LLaVA-Wild), even when the input test images have no tags.
**For more information about SoM-LLaVA, check our [github page](https://github.com/zzxslp/SoM-LLaVA) and [paper](https://arxiv.org/abs/2404.16375)!**
## Getting Started
This model should be used with the [official LLaVA repo](https://github.com/haotian-liu/LLaVA) for training and evaluation.
If you would like to load the model in HF (transformers) style, use the converted model weights: [[SoM-LLaVA-v1.5-13B-HF](https://huggingface.co/zzxslp/som-llava-v1.5-13b-hf)]
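Below is a minimal loading sketch for the HF-converted checkpoint, assuming it follows the standard transformers LLaVA-1.5 interface; the image path and prompt are placeholders for your own SoM-tagged input.

```python
# Minimal sketch (assumption: the converted weights work with the standard
# transformers LLaVA-1.5 classes). Image path and prompt are placeholders.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "zzxslp/som-llava-v1.5-13b-hf"
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# A SoM-tagged image (numeric tags drawn on the objects in the image).
image = Image.open("tagged_image.png")
prompt = "USER: <image>\nWhat is the object tagged with id 9? ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```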
## Citation
If you find our data or model useful for your research and applications, please cite our paper:
```
@article{yan2024list,
  title={List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs},
  author={Yan, An and Yang, Zhengyuan and Wu, Junda and Zhu, Wanrong and Yang, Jianwei and Li, Linjie and Lin, Kevin and Wang, Jianfeng and McAuley, Julian and Gao, Jianfeng and others},
  journal={arXiv preprint arXiv:2404.16375},
  year={2024}
}
```