Overview
Qwen2.5-VL-7B-Instruct is a multimodal vision‑language model designed for advanced visual understanding, agentic reasoning, and structured output generation. It excels at interpreting complex visual content including text, charts, graphics, and layouts. The model supports video comprehension (limited to 8 seconds in this version) and can identify key events by locating relevant segments. Additionally, it delivers reliable structured extraction from documents such as invoices, forms, and tables. This repository hosts the quantized and compiled model for Ara240 DNPU.
Model Description
This is a quantized and compiled version of Qwen/Qwen2.5-VL-7B-Instruct optimized for Ara240 DNPU.
- Base Model: Qwen/Qwen2.5-VL-7B-Instruct
- Original Model Authors: Qwen Team, Alibaba Cloud
- Original License: Apache-2.0
- Modified by: NXP
Performance
- SpecD: Speculative decoding. A small draft model generates speculated tokens, which the main model then verifies.
- TTFT: Time to first token, reported for an 8-second clip with the prompt "Describe the video".
- Avg. Token Rate: Average token rate over the context length.
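The draft-and-verify idea behind SpecD can be sketched as follows. This is a minimal illustration of the greedy-acceptance variant of speculative decoding, not the Ara240 runtime's implementation. The names `speculative_decode`, `target_model`, `draft_model` and `greedy_decode` are hypothetical, and the "models" are toy functions standing in for real networks.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # maps a token sequence to the next token


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       max_new_tokens: int, num_speculated: int = 4) -> List[int]:
    """Greedy speculative decoding with toy next-token functions."""
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    while len(tokens) < goal:
        # 1. The draft model proposes a short run of speculated tokens.
        proposal: List[int] = []
        for _ in range(num_speculated):
            proposal.append(draft(tokens + proposal))
        # 2. The main model verifies the run position by position.
        #    (A real system would check all positions in one batched pass.)
        accepted: List[int] = []
        for guess in proposal:
            expected = target(tokens + accepted)
            if guess != expected:
                accepted.append(expected)  # first mismatch: keep the main model's token
                break
            accepted.append(guess)
        else:
            # Every guess matched, so the main model adds one more token.
            accepted.append(target(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:goal]


# Toy stand-ins (assumptions for illustration only).
def target_model(tokens: List[int]) -> int:
    return (tokens[-1] * 3 + 1) % 17


def draft_model(tokens: List[int]) -> int:
    guess = (tokens[-1] * 3 + 1) % 17
    # Deliberately wrong whenever the last token is a multiple of 4.
    return guess if tokens[-1] % 4 else (guess + 1) % 17


def greedy_decode(model: NextToken, prompt: List[int], n: int) -> List[int]:
    tokens = list(prompt)
    for _ in range(n):
        tokens.append(model(tokens))
    return tokens
```

With greedy acceptance the output is identical to decoding with the main model alone; the draft model only changes how many main-model steps are needed. Sampling-based variants use a probabilistic acceptance rule instead of the exact-match check shown here.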
| Model | Runtime | Context Length | SpecD | Params (billion) | Time To First Token (s) | Avg. Token Rate (tokens/s) | DDR Memory (GB) |
|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B-Instruct | r2.0.4 | 4096 | false | 7.61 | 9.26 | 6.255 | 11.97 |
Note: Complete benchmarks across the full range of context sizes will be documented in a future release.
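As a rough guide to what the table figures mean in practice, the sketch below estimates end-to-end response time from TTFT and the average token rate. It assumes the reported average rate applies uniformly to every generated token and that TTFT is similar to the benchmarked 8-second clip. Both are simplifications, and `estimate_response_time` is a hypothetical helper, not part of any NXP tooling.

```python
# Figures copied from the benchmark table (context length 4096, SpecD off).
TTFT_S = 9.26         # time to first token, in seconds
TOKENS_PER_S = 6.255  # average token rate


def estimate_response_time(num_output_tokens: int,
                           ttft_s: float = TTFT_S,
                           tokens_per_s: float = TOKENS_PER_S) -> float:
    """Estimate seconds until the last of `num_output_tokens` tokens arrives."""
    if num_output_tokens < 1:
        raise ValueError("num_output_tokens must be at least 1")
    # The first token arrives at TTFT; the remaining tokens stream
    # at the average rate.
    return ttft_s + (num_output_tokens - 1) / tokens_per_s
```

Under these assumptions, a 101-token answer to the benchmarked prompt would take roughly 25 seconds. Actual latency depends on input length and content.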
Modifications
This model is a derivative work with the following changes from the original:
- Quantization: W4A8 (4-bit weights, 8-bit activations), EQAT (wikitext-2-raw-v1)
- Compilation: Compiled for Ara240 DNPU
- Format: Converted to DVM format for NPU deployment
Original model available at: Qwen/Qwen2.5-VL-7B-Instruct.
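To give a feel for what W4A8 means numerically, the sketch below quantizes a few weights to 4-bit signed integers and a few activations to 8-bit signed integers, then compares an integer dot product against the floating-point result. It uses generic symmetric per-tensor rounding as an assumption. This card does not describe the internals of EQAT, so this is not a description of how this model was actually quantized. All names and values are illustrative.

```python
def quantize_symmetric(values, num_bits):
    """Map floats to signed integers using one shared scale factor."""
    qmax = 2 ** (num_bits - 1) - 1  # 7 for 4-bit, 127 for 8-bit
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the representable range.
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


# Made-up example values.
weights = [0.42, -0.17, 0.05, -0.36]
activations = [1.2, -0.25, 0.75, 3.0]

qw, w_scale = quantize_symmetric(weights, num_bits=4)      # "W4"
qa, a_scale = quantize_symmetric(activations, num_bits=8)  # "A8"

# Accumulate in integers, then rescale once at the end.
int_acc = sum(w * a for w, a in zip(qw, qa))
approx = int_acc * w_scale * a_scale
exact = sum(w * a for w, a in zip(weights, activations))
quantization_error = abs(approx - exact)
```

The gap between `approx` and `exact` is the kind of rounding error that can show up as accuracy differences relative to the original model.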
Limitations and Biases
This model inherits all limitations from the original Qwen/Qwen2.5-VL-7B-Instruct model. Additional limitations:
- Hardware-specific: Only runs on Ara240 DNPU
- Quantization effects: Accuracy may differ from the original model due to quantization
License
This model is released under the Apache License 2.0, the same license as the original Qwen/Qwen2.5-VL-7B-Instruct model.
Citation
- Qwen Team, “Qwen2.5-VL.” Jan. 2025. [Online]. Available: https://qwenlm.github.io/blog/qwen2.5-vl/
- P. Wang et al., “Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution,” arXiv preprint arXiv:2409.12191, 2024.
- J. Bai et al., “Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond,” arXiv preprint arXiv:2308.12966, 2023.
If you use this model, please cite both this work and the original model:
@misc{qwen2.5-VL,
title = {Qwen2.5-VL},
url = {https://qwenlm.github.io/blog/qwen2.5-vl/},
author = {Qwen Team},
month = {January},
year = {2025}
}
@article{Qwen2VL,
title={Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution},
author={Wang, Peng and Bai, Shuai and Tan, Sinan and Wang, Shijie and Fan, Zhihao and Bai, Jinze and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Fan, Yang and Dang, Kai and Du, Mengfei and Ren, Xuancheng and Men, Rui and Liu, Dayiheng and Zhou, Chang and Zhou, Jingren and Lin, Junyang},
journal={arXiv preprint arXiv:2409.12191},
year={2024}
}
@article{Qwen-VL,
title={Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
author={Bai, Jinze and Bai, Shuai and Yang, Shusheng and Wang, Shijie and Tan, Sinan and Wang, Peng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
journal={arXiv preprint arXiv:2308.12966},
year={2023}
}