---
language:
- en
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
pipeline_tag: visual-question-answering
---

<div align="center">
<h1>Seeing More, Saying More: Lightweight Language Experts are Dynamic Video Token Compressors</h1>
</div>

<h5 align="center">

[Paper](https://arxiv.org/abs/2509.00969)
[Code](https://github.com/NIneeeeeem/LangDC)
[Model](https://huggingface.co/Wangxc1000/LangDC)

<br>

</h5>

## LangDC Overview

Current large video-language models face efficiency issues because they must process massive numbers of visual tokens. Existing fixed-ratio token compression ignores the varying semantic density across video clips, which leads to inadequate representation of information-rich clips (too few tokens) and wasted computation on static or content-poor ones. To address this, we propose LangDC, a Language-aware Dynamic Token Compressor. LangDC leverages a lightweight language model to describe each video clip, converting the description into soft caption tokens that serve as the visual representation. Trained with our proposed semantic density-aware supervision, LangDC aims to 1) cover the key visual cues necessary for downstream reasoning and 2) dynamically adjust the compression ratio according to scene richness, as reflected by description length.

<p align="center">
<img src="asset/motivation_comparision.png" alt="Comparison of LangDC and existing token compressors.">
</p>

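To make the mechanism concrete, the following is a minimal, hypothetical sketch of language-aware dynamic compression. It is not the actual LangDC implementation; the class, the `density_head`, and all dimensions are illustrative stand-ins:

```python
import torch
import torch.nn as nn

class ToyLanguageAwareCompressor(nn.Module):
    """Toy stand-in for a lightweight language expert that 'describes' a clip
    and keeps a variable number of soft caption tokens as its representation."""

    def __init__(self, dim=256, max_caption_len=32):
        super().__init__()
        self.max_caption_len = max_caption_len
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.captioner = nn.TransformerDecoder(layer, num_layers=2)
        self.queries = nn.Parameter(torch.randn(max_caption_len, dim))
        # Predicts the clip's semantic density from pooled visual features.
        self.density_head = nn.Linear(dim, 1)

    def forward(self, clip_tokens: torch.Tensor) -> torch.Tensor:
        # clip_tokens: (num_visual_tokens, dim) for one clip.
        memory = clip_tokens.unsqueeze(0)                   # (1, T, dim)
        queries = self.queries.unsqueeze(0)                 # (1, L, dim)
        soft_captions = self.captioner(queries, memory)[0]  # (L, dim)
        # Richer clips get a longer "description", i.e. more tokens kept.
        density = torch.sigmoid(self.density_head(clip_tokens.mean(dim=0)))
        keep = max(1, int(density.item() * self.max_caption_len))
        return soft_captions[:keep]                         # (keep, dim)

compressor = ToyLanguageAwareCompressor()
print(compressor(torch.randn(196, 256)).shape)  # e.g. torch.Size([k, 256])
```

In this picture, a near-static clip would yield a low predicted density and keep only a few tokens, while an event-dense clip would keep many more.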

## Contributions
1) We propose LangDC, a novel language-aware token compression strategy. By using soft language tokens as the visual representation, LangDC adaptively adjusts compression ratios, improving token utilization over fixed-ratio techniques.

2) We propose semantic density-aware supervision for the token compressor. By explicitly providing reconstruction targets for token compression, we obtain a more compact feature set that is aware of information richness while still preserving key visual cues (a toy sketch of this supervision follows the method figure below).

3) Experimental results demonstrate that our method reduces FLOPs by 49% relative to the strong baseline VideoGPT+ while maintaining competitive performance. Qualitative results further show that compression adapts to the semantic density of each video clip.

<p align="center">
<img src="asset/fig_method.png" alt="Overview of LangDC.">
</p>
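The semantic density-aware supervision of contribution 2 can be pictured with a toy loss. This is a hedged sketch under assumptions (that compressed tokens are regressed onto embeddings of a ground-truth clip caption, and that the caption's length bounds the token budget); `toy_density_aware_loss`, `caption_embeds`, and the 0.1 weight are invented for illustration and are not the paper's formulation:

```python
import torch
import torch.nn.functional as F

def toy_density_aware_loss(soft_captions: torch.Tensor,
                           caption_embeds: torch.Tensor) -> torch.Tensor:
    """Toy reconstruction-style objective: compressed tokens are pulled toward
    embeddings of the clip's ground-truth caption, so they must retain the
    describable content, while the caption's length acts as a token budget."""
    n = min(soft_captions.size(0), caption_embeds.size(0))
    recon = F.mse_loss(soft_captions[:n], caption_embeds[:n])
    # Penalize spending noticeably more tokens than the description warrants.
    overshoot = max(soft_captions.size(0) - caption_embeds.size(0), 0)
    return recon + 0.1 * overshoot
```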

## Installation

We recommend setting up a conda environment for the project:
```shell
conda create --name=langdc python=3.11
conda activate langdc

git clone https://github.com/NIneeeeeem/LangDC.git
cd LangDC

pip install torch==2.4.0 torchvision==0.19.0 --index-url https://download.pytorch.org/whl/cu124
pip install transformers==4.41.0

pip install -r requirements.txt

export PYTHONPATH="./:$PYTHONPATH"
```
Additionally, install [FlashAttention](https://github.com/HazyResearch/flash-attention) for training:
```shell
pip install ninja

git clone https://github.com/HazyResearch/flash-attention.git
cd flash-attention
python setup.py install
```
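After installation, a quick sanity check can confirm the pinned PyTorch build and whether flash-attn is importable (a minimal sketch; flash-attn is only needed for training):

```python
# Quick environment sanity check for the steps above.
import torch

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())

try:
    import flash_attn
    print("flash-attn:", flash_attn.__version__)
except ImportError:
    print("flash-attn not installed (required for training only)")
```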
---

## Quantitative Evaluation 📊
We provide instructions to reproduce LangDC results on VideoMME, MVBench, LongVideoBench, VSIBench, and four open-ended QA benchmarks. Please follow the instructions in [eval/README.md](eval/README.md).

To reproduce the results in Table 1 of the Motivation section, please refer to [this repository](https://github.com/NIneeeeeem/VideoGPT-tokenadapter.git).

## Citations 📜

If you're using LangDC in your research or applications, please give us a star ⭐ and cite using this BibTeX:
```bibtex
@misc{wang2025seeing,
      title={Seeing More, Saying More: Lightweight Language Experts are Dynamic Video Token Compressors},
      author={Xiangchen Wang and Jinrui Zhang and Teng Wang and Haigang Zhang and Feng Zheng},
      year={2025},
      eprint={2509.00969},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```

## Acknowledgements

+ [VideoGPT+](https://github.com/mbzuai-oryx/VideoGPT-plus): A pioneering attempt at video-based conversation models.
+ [LLaVA](https://github.com/haotian-liu/LLaVA): Our codebase is built upon LLaVA and VideoGPT+.