Overview

Qwen2.5-VL-7B-Instruct is a multimodal vision‑language model designed for advanced visual understanding, agentic reasoning, and structured output generation. It excels at interpreting complex visual content including text, charts, graphics, and layouts. The model supports video comprehension (limited to 8 seconds in this version) and can identify key events by locating relevant segments. Additionally, it delivers reliable structured extraction from documents such as invoices, forms, and tables. This repository hosts the quantized and compiled model for Ara240 DNPU.

Model Description

This is a quantized and compiled version of Qwen/Qwen2.5-VL-7B-Instruct optimized for Ara240 DNPU.

Base Model: Qwen/Qwen2.5-VL-7B-Instruct
Original Model Authors: Qwen Team, Alibaba Cloud
Original License: Apache-2.0
Modified by: NXP

Performance

SpecD: Speculative decoding. A small draft model generates speculated tokens, which the main model then verifies.
TTFT: Time to first token, reported for an 8-second clip with the prompt "Describe the video".
Avg. Token Rate: Average token generation rate over the context length.

Model	Runtime	Context Length	SpecD	Params (billion)	TTFT (s)	Avg. Token Rate (tokens/s)	DDR Memory (GB)
Qwen2.5-VL-7B-Instruct	r2.0.4	4096	false	7.61	9.26	6.255	11.97

Note: Complete benchmarks across the full range of context sizes will be documented in a future release.
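
As a rough illustration of how the two reported figures combine, the sketch below estimates end-to-end response time for a given number of output tokens. It assumes that the first token arrives at TTFT and that every later token arrives at the average token rate; real latency will vary with the input, the context length, and the runtime. The function name `estimate_latency_s` is hypothetical and not part of any NXP or Qwen API.

```python
# Figures taken from the Performance table (8-second clip, "Describe the video").
TTFT_S = 9.26          # time to first token, in seconds
TOKENS_PER_S = 6.255   # average token rate, in tokens per second


def estimate_latency_s(output_tokens: int,
                       ttft_s: float = TTFT_S,
                       tokens_per_s: float = TOKENS_PER_S) -> float:
    """Rough wall-clock estimate for generating `output_tokens` tokens.

    Assumption (not from the source): the first token arrives at TTFT and
    each remaining token takes 1 / tokens_per_s seconds.
    """
    if output_tokens < 1:
        raise ValueError("output_tokens must be at least 1")
    return ttft_s + (output_tokens - 1) / tokens_per_s


if __name__ == "__main__":
    for n in (1, 64, 256):
        print(f"{n:4d} tokens: ~{estimate_latency_s(n):.1f} s")
```

Under these assumptions, TTFT dominates short answers, while the token rate dominates long ones.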

Modifications

This model is a derivative work with the following changes from the original:

Quantization: W4A8 - EQAT (wikitext-2-raw-v1)
Compilation: Compiled for Ara240 DNPU
Format: Converted to DVM format for NPU deployment
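
To give a sense of what 4-bit weights mean for storage, the sketch below computes a back-of-the-envelope weight footprint from the parameter count. It assumes that every one of the 7.61 billion parameters is stored at the stated bit width and ignores quantization scales, zero points, and any layers kept at higher precision; none of these details are stated in this card. The helper name `weight_footprint_gb` is hypothetical.

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    Assumption (not from the source): all parameters use the same bit width,
    with no overhead for scales, zero points, or higher-precision layers.
    """
    total_bits = params_billion * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9


if __name__ == "__main__":
    params = 7.61  # billions of parameters, from the Performance table
    print(f"16-bit weights: ~{weight_footprint_gb(params, 16):.2f} GB")
    print(f" 4-bit weights: ~{weight_footprint_gb(params, 4):.2f} GB")
```

The reported DDR Memory figure of 11.97 GB is larger than the 4-bit weight estimate because it covers more than the weights; this card does not break that total down.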

Original model available at: Qwen/Qwen2.5-VL-7B-Instruct.

Limitations and Biases

This model inherits all limitations from the original Qwen/Qwen2.5-VL-7B-Instruct model. Additional limitations:

Hardware-specific: Only runs on Ara240 DNPU
Quantization effects: Accuracy may differ from the original model because of W4A8 quantization

License

This model is released under the Apache License 2.0, the same license as the original Qwen/Qwen2.5-VL-7B-Instruct model.

Citation

Qwen Team, “Qwen2.5-VL.” Jan. 2025. [Online]. Available: https://qwenlm.github.io/blog/qwen2.5-vl/
P. Wang et al., “Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution,” arXiv preprint arXiv:2409.12191, 2024.
J. Bai et al., “Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond,” arXiv preprint arXiv:2308.12966, 2023.

If you use this model, please cite both this work and the original model:

@misc{qwen2.5-VL,
    title = {Qwen2.5-VL},
    url = {https://qwenlm.github.io/blog/qwen2.5-vl/},
    author = {Qwen Team},
    month = {January},
    year = {2025}
}

@article{Qwen2VL,
  title={Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution},
  author={Wang, Peng and Bai, Shuai and Tan, Sinan and Wang, Shijie and Fan, Zhihao and Bai, Jinze and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Fan, Yang and Dang, Kai and Du, Mengfei and Ren, Xuancheng and Men, Rui and Liu, Dayiheng and Zhou, Chang and Zhou, Jingren and Lin, Junyang},
  journal={arXiv preprint arXiv:2409.12191},
  year={2024}
}

@article{Qwen-VL,
  title={Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
  author={Bai, Jinze and Bai, Shuai and Yang, Shusheng and Wang, Shijie and Tan, Sinan and Wang, Peng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
  journal={arXiv preprint arXiv:2308.12966},
  year={2023}
}
