---
license: apache-2.0
pipeline_tag: image-text-to-text
library_name: transformers
---

# AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition

AdaptVision is an efficient Vision-Language Model (VLM) paradigm that acquires visual tokens adaptively in a coarse-to-fine manner. Inspired by human active-vision mechanisms, it reduces the substantial computational overhead of VLMs by autonomously determining the minimum number of visual tokens each sample requires, and it selectively acquires additional visual information by invoking a bounding-box tool to crop key regions when necessary.

The model was presented in the paper [AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition](https://arxiv.org/abs/2512.03794).

For more details, please visit the project page. The official code can be found on the GitHub repository.
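
## Usage

Below is a minimal usage sketch assuming the checkpoint follows the standard transformers image-text-to-text interface declared in the metadata above. The repository ID, chat-template format, and generation settings are illustrative assumptions and may differ from the official release.

```python
# Minimal sketch; assumes a standard transformers image-text-to-text
# interface. The repo ID below is hypothetical -- replace it with the
# actual Hugging Face repository once published.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "AdaptVision-7B"  # hypothetical repo ID

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Build a chat-style prompt with one image and one question.
image = Image.open("example.jpg")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```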

## Citation

If you find this project useful in your research, please consider citing:

```bibtex
@article{lin2025adapt,
  title={AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition},
  author={Zichuan Lin and Yicheng Liu and Yang Yang and Lvfang Tao and Deheng Ye},
  journal={arXiv preprint arXiv:2512.03794},
  year={2025}
}
```