ASMv2 Model Card

Model details

Model type: ASMv2 is an open-source chatbot trained by fine-tuning LLaMA/Vicuna on multimodal instruction-following data. It adds the Relation Conversation (ReC) ability while retaining strong general capabilities. The model also supports grounding and referring, achieves state-of-the-art performance on region-level tasks, and can be adapted to the Scene Graph Generation task in an open-ended manner.

Model date: ASMv2 was trained in January 2024.

Paper or resources for more information: https://github.com/OpenGVLab/all-seeing

License

ASMv2 is open-sourced under the Apache License 2.0.

Where to send questions or comments about the model: https://github.com/OpenGVLab/all-seeing/issues

Intended use

Primary intended uses: The primary use of ASMv2 is research on large multimodal models and chatbots.

Primary intended users: The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.

Training dataset

The pretraining phase uses 5M filtered samples from CC12M, 10M filtered samples from AS-1B, and 15M filtered samples from GRiT.
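
As a quick sanity check on the pretraining mixture, the sketch below tallies the three sources and computes each one's share of the total. The names `PRETRAIN_SAMPLES` and `mixture_shares` are illustrative and not part of the ASMv2 codebase, and the counts are the rounded figures quoted in this card, so the shares are approximate.

```python
from fractions import Fraction

# Rounded sample counts quoted in this model card, in millions.
# The dictionary name and layout are illustrative assumptions.
PRETRAIN_SAMPLES = {"CC12M": 5, "AS-1B": 10, "GRiT": 15}


def mixture_shares(samples):
    """Return each source's share of the total as an exact fraction."""
    total = sum(samples.values())
    return {name: Fraction(count, total) for name, count in samples.items()}


if __name__ == "__main__":
    total = sum(PRETRAIN_SAMPLES.values())
    print(f"total: {total}M samples")
    for name, share in mixture_shares(PRETRAIN_SAMPLES).items():
        print(f"{name}: {float(share):.1%}")
```

Under these figures the pretraining set totals 30M samples, half of which come from GRiT.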

The instruction-tuning phase uses 4M samples collected from a variety of sources, including image-level datasets.

See here for more details.

Evaluation dataset

A collection of 20 benchmarks, including 5 academic VQA benchmarks, 7 multimodal benchmarks specifically proposed for instruction-following LMMs, 3 referring expression comprehension benchmarks, 2 region captioning benchmarks, 1 referring question answering benchmark, 1 scene graph generation benchmark, and 1 relation comprehension benchmark.
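
The per-category counts above can be checked against the stated total with a short tally. This is only a bookkeeping sketch: the category labels are paraphrased from this card, and `EVAL_BENCHMARKS` and `total_benchmarks` are hypothetical names, not identifiers from the ASMv2 evaluation code.

```python
# Benchmark counts per category, as listed in this model card.
# Keys are paraphrased labels; the structure is an illustrative assumption.
EVAL_BENCHMARKS = {
    "academic VQA": 5,
    "instruction-following multimodal": 7,
    "referring expression comprehension": 3,
    "region captioning": 2,
    "referring question answering": 1,
    "scene graph generation": 1,
    "relation comprehension": 1,
}


def total_benchmarks(counts):
    """Sum the number of benchmarks across all categories."""
    return sum(counts.values())


if __name__ == "__main__":
    print(total_benchmarks(EVAL_BENCHMARKS))
```

The seven categories sum to the 20 benchmarks stated in the card.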
