Overview

Qwen2.5-VL-7B-Instruct is a multimodal vision‑language model designed for advanced visual understanding, agentic reasoning, and structured output generation. It excels at interpreting complex visual content including text, charts, graphics, and layouts. The model supports video comprehension (limited to 8-second clips in this version) and can identify key events by locating the relevant segments. It also delivers reliable structured extraction from documents such as invoices, forms, and tables. This repository hosts the quantized and compiled model for the Ara240 DNPU.

Model Description

This is a quantized and compiled version of Qwen/Qwen2.5-VL-7B-Instruct optimized for Ara240 DNPU.

  • Base Model: Qwen/Qwen2.5-VL-7B-Instruct
  • Original Model Authors: Qwen Team, Alibaba Cloud
  • Original License: Apache-2.0
  • Modified by: NXP

Performance

  • SpecD: Speculative decoding. A small draft model generates speculated tokens, which the main model then verifies.
  • TTFT: Time to first token. Reported for an 8-second clip and the prompt "Describe the video".
  • Avg. Token Rate: Average token rate over the context length.
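
The draft-then-verify idea behind SpecD can be sketched with a toy greedy decoder. This is an illustration only, not the Ara240 runtime: `speculative_decode`, `toy_target`, `toy_draft`, and the parameter `k` are hypothetical names, and the "models" are plain functions that map a token prefix to the next token.

```python
from typing import Callable, List

# A "model" here is any function mapping a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def greedy_decode(model: NextToken, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Baseline: one model call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       max_new_tokens: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding: the draft proposes k tokens, the target verifies."""
    tokens = list(prompt)
    limit = len(prompt) + max_new_tokens
    while len(tokens) < limit:
        # 1. The small draft model speculates k tokens ahead.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # 2. The main model checks each speculated token. A real runtime would
        #    score all k positions in one forward pass; this loop is for clarity.
        accepted: List[int] = []
        for tok in proposal:
            expected = target(tokens + accepted)
            if tok != expected:
                accepted.append(expected)  # first mismatch: keep the target's token
                break
            accepted.append(tok)
        else:
            # Every speculated token was accepted, so the target adds one more.
            accepted.append(target(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:limit]


def toy_target(prefix: List[int]) -> int:
    """Hypothetical stand-in for the main model."""
    return (prefix[-1] * 3 + 1) % 17


def toy_draft(prefix: List[int]) -> int:
    """Hypothetical stand-in for the draft model: agrees with the target most of the time."""
    return toy_target(prefix) if prefix[-1] % 5 else 0


if __name__ == "__main__":
    print(speculative_decode(toy_target, toy_draft, [1], max_new_tokens=12))
```

In this greedy variant the output is identical to decoding with the main model alone; the potential gain is that several tokens can be accepted per verification step.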

| Model | Runtime | Context Length | SpecD | Params (billion) | Time To First Token (s) | Avg. Token Rate (tokens/s) | DDR Memory (GB) |
|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B-Instruct | r2.0.4 | 4096 | false | 7.61 | 9.26 | 6.255 | 11.97 |

Note: Complete benchmarks across the full context size range will be documented in a future release.
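
To make the two metrics concrete, the sketch below derives them from token arrival timestamps and uses them for a rough latency estimate. It assumes the token rate is measured over the decode phase only (after the first token); the model card does not state the exact measurement method, and the function names are hypothetical.

```python
from typing import Sequence, Tuple


def ttft_and_token_rate(request_start: float,
                        token_times: Sequence[float]) -> Tuple[float, float]:
    """Return (TTFT in seconds, average tokens/second) from token arrival times.

    Assumption: the rate covers the decode phase only, i.e. the tokens that
    arrive after the first one.
    """
    if len(token_times) < 2:
        raise ValueError("need at least two token timestamps")
    ttft = token_times[0] - request_start
    decode_time = token_times[-1] - token_times[0]
    rate = (len(token_times) - 1) / decode_time
    return ttft, rate


def estimate_response_time(ttft_s: float, tokens_per_s: float, new_tokens: int) -> float:
    """Rough end-to-end time: TTFT covers the first token, the rest arrive at the average rate."""
    if new_tokens < 1:
        raise ValueError("new_tokens must be at least 1")
    return ttft_s + (new_tokens - 1) / tokens_per_s


if __name__ == "__main__":
    # Figures from the table above: TTFT 9.26 s, 6.255 tokens/s.
    seconds = estimate_response_time(9.26, 6.255, new_tokens=256)
    print(f"Estimated time for a 256-token answer: {seconds:.1f} s")
```

The estimate is only indicative: the reported TTFT applies to an 8-second clip with a short prompt, and other inputs may behave differently.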

Modifications

This model is a derivative work with the following changes from the original:

  • Quantization: W4A8 (4-bit weights, 8-bit activations) - EQAT (wikitext-2-raw-v1)
  • Compilation: Compiled for Ara240 DNPU
  • Format: Converted to DVM format for NPU deployment
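
As a rough illustration of what a W4A8 scheme implies, the sketch below applies plain symmetric round-to-nearest quantization and estimates weight storage at 4 bits per parameter. It is not the EQAT procedure used for this model, whose details are not given here; the function names are hypothetical, and the storage figure assumes every parameter is stored in 4 bits, which may not hold for all layers.

```python
from typing import List, Tuple


def quantize_symmetric(values: List[float], bits: int) -> Tuple[List[int], float]:
    """Symmetric round-to-nearest quantization to signed integers of the given width."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit, 127 for 8-bit
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the representable range.
    q = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q: List[int], scale: float) -> List[float]:
    """Map integer codes back to approximate real values."""
    return [x * scale for x in q]


def weight_storage_gb(params_billion: float, bits_per_weight: int) -> float:
    """Weight storage in GB (10^9 bytes), ignoring scales and other metadata."""
    return params_billion * bits_per_weight / 8


if __name__ == "__main__":
    weights = [0.62, -0.31, 0.08, -0.75]      # illustrative values only
    activations = [1.9, -0.4, 3.2, 0.05]      # illustrative values only
    w_q, w_scale = quantize_symmetric(weights, bits=4)      # the "W4" part
    a_q, a_scale = quantize_symmetric(activations, bits=8)  # the "A8" part
    print("4-bit weight codes:", w_q)
    print("reconstructed weights:", dequantize(w_q, w_scale))
    print("8-bit activation codes:", a_q)
    print("weights at 4 bits:", weight_storage_gb(7.61, 4), "GB")
```

The reported DDR figure of 11.97 GB is larger than the 4-bit weight estimate, which is expected since runtime memory typically also holds caches, activations, and other buffers.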

Original model available at: Qwen/Qwen2.5-VL-7B-Instruct.

Limitations and Biases

This model inherits all limitations from the original Qwen/Qwen2.5-VL-7B-Instruct model. Additional limitations:

  • Hardware-specific: Only runs on Ara240 DNPU
  • Quantization effects: Accuracy may differ from the original model due to quantization

License

This model is released under the Apache License 2.0, the same license as the original Qwen/Qwen2.5-VL-7B-Instruct model.

Citation

  • Qwen Team, “Qwen2.5-VL.” Jan. 2025. [Online]. Available: https://qwenlm.github.io/blog/qwen2.5-vl/
  • P. Wang et al., “Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution,” arXiv preprint arXiv:2409.12191, 2024.
  • J. Bai et al., “Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond,” arXiv preprint arXiv:2308.12966, 2023.

If you use this model, please cite both this work and the original model:

@misc{qwen2.5-VL,
    title = {Qwen2.5-VL},
    url = {https://qwenlm.github.io/blog/qwen2.5-vl/},
    author = {Qwen Team},
    month = {January},
    year = {2025}
}

@article{Qwen2VL,
  title={Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution},
  author={Wang, Peng and Bai, Shuai and Tan, Sinan and Wang, Shijie and Fan, Zhihao and Bai, Jinze and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Fan, Yang and Dang, Kai and Du, Mengfei and Ren, Xuancheng and Men, Rui and Liu, Dayiheng and Zhou, Chang and Zhou, Jingren and Lin, Junyang},
  journal={arXiv preprint arXiv:2409.12191},
  year={2024}
}

@article{Qwen-VL,
  title={Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
  author={Bai, Jinze and Bai, Shuai and Yang, Shusheng and Wang, Shijie and Tan, Sinan and Wang, Peng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
  journal={arXiv preprint arXiv:2308.12966},
  year={2023}
}