---
license: apache-2.0
base_model:
  - Qwen/Qwen2.5-VL-3B-Instruct
tags:
  - object-detection
  - multimodal
  - REC
  - VLM
  - zero-shot-object-detection
language:
  - zh
  - en
---

# VLM-FO1: Qwen2.5-VL-3B-v01

This repository contains the `VLM-FO1_Qwen2.5-VL-3B-v01` model, an implementation of the VLM-FO1 framework built on the Qwen2.5-VL-3B-Instruct base model.

VLM-FO1 is a novel plug-and-play framework designed to bridge the gap between the high-level reasoning of Vision-Language Models (VLMs) and the need for fine-grained visual perception.

## Model Details

### Model Description

VLM-FO1 endows pre-trained VLMs with superior fine-grained perception without compromising their inherent high-level reasoning and general understanding capabilities. It operates as a plug-and-play module that can be integrated with any existing VLM, establishing an effective and flexible paradigm for building the next generation of perception-aware models.

VLM-FO1 excels at a wide range of fine-grained perception tasks, including Object Grounding, Region Generative Understanding, Visual Region Reasoning, and more.
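For grounding tasks like REC, the model's textual response must be converted back into bounding boxes. The sketch below shows one way to do that, assuming the response follows the Qwen2.5-VL grounding convention of a JSON list with `bbox_2d` and `label` keys; the function name and the exact VLM-FO1 output format are assumptions, so check a real response before relying on this.

```python
import json
import re

def parse_grounding_output(text):
    """Extract (label, [x1, y1, x2, y2]) pairs from a model response.

    Assumes the model emits a JSON list of objects with "bbox_2d" and
    "label" keys (the Qwen2.5-VL grounding convention); the exact
    VLM-FO1 output format may differ.
    """
    # The JSON may be wrapped in a ```json ... ``` fence; grab the
    # outermost bracketed span and parse only that.
    match = re.search(r"\[.*\]", text, re.DOTALL)
    if match is None:
        return []
    objects = json.loads(match.group(0))
    return [(obj["label"], obj["bbox_2d"]) for obj in objects]

response = '```json\n[{"bbox_2d": [12, 34, 200, 180], "label": "red car"}]\n```'
print(parse_grounding_output(response))
# [('red car', [12, 34, 200, 180])]
```

Coordinates in this convention are absolute pixel values in the input image, so no rescaling is needed before drawing the boxes.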

- 🧩 **Plug-and-Play Modularity:** Our framework is designed as a set of enhancement modules that can be seamlessly integrated with any pre-trained VLM, preserving its original weights and capabilities.
- 🧠 **Hybrid Fine-grained Region Encoder (HFRE):** We introduce a novel Dual-Vision Encoder architecture that fuses semantic-rich features with perception-enhanced features, creating powerful region tokens that capture both high-level meaning and fine-grained spatial detail.
- 🎯 **State-of-the-Art Performance:** VLM-FO1 achieves SOTA results across a diverse suite of fine-grained perception benchmarks.
- ✅ **Preserves General Abilities:** Our two-stage training strategy ensures that fine-grained perception is gained without causing catastrophic forgetting of the base model's powerful general visual understanding abilities.
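The dual-encoder fusion described above can be sketched conceptually: per-region features from a semantic encoder and a perception-oriented encoder are combined and projected into the LLM's token space. Everything here (shapes, variable names, concatenation as the fusion operator, random projection weights) is a hypothetical illustration, not the actual VLM-FO1 implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-region features from the two vision encoders:
# semantic-rich features (e.g. from the VLM's own ViT) and
# perception-enhanced features (e.g. from a detection backbone).
num_regions, d_sem, d_per, d_model = 4, 8, 6, 16
semantic_feats = rng.standard_normal((num_regions, d_sem))
perception_feats = rng.standard_normal((num_regions, d_per))

# Fuse by concatenation, then project into the LLM's token space.
# The projection weights would be learned; random here for illustration.
w_proj = rng.standard_normal((d_sem + d_per, d_model))
region_tokens = np.concatenate([semantic_feats, perception_feats], axis=1) @ w_proj

print(region_tokens.shape)  # (4, 16)
```

The resulting region tokens would then be interleaved with the ordinary text and image tokens in the LLM's input sequence.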

## Model Sources

- **Paper:** [VLM-FO1](https://arxiv.org/abs/2509.25916) (arXiv:2509.25916)

## Citation

```bibtex
@article{liu2025vlm,
  title={VLM-FO1: Bridging the Gap Between High-Level Reasoning and Fine-Grained Perception in VLMs},
  author={Liu, Peng and Shen, Haozhan and Fang, Chunxin and Sun, Zhicheng and Liao, Jiajia and Zhao, Tiancheng},
  journal={arXiv preprint arXiv:2509.25916},
  year={2025}
}
```