---
language:
- en
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
pipeline_tag: visual-question-answering
---

<div align="center">
<h1>Seeing More, Saying More: Lightweight Language Experts are Dynamic Video Token Compressors</h1>
</div>

<h5 align="center">

[![arXiv](https://img.shields.io/badge/LangDC-2509.00969-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2509.00969)
[![GitHub](https://img.shields.io/badge/GitHub-Code-green?logo=github)](https://github.com/NIneeeeeem/LangDC)
[![HuggingFace](https://img.shields.io/badge/🤗%20Model-Huggingface-yellow)](https://huggingface.co/Wangxc1000/LangDC)

<br>

</h5>

## LangDC Overview

Current large video-language models face efficiency issues because they process massive numbers of visual tokens. Existing fixed-ratio token compression ignores the varying semantic density across video clips: information-rich clips are under-represented because they receive too few tokens, while static or content-poor clips waste computation. To address this, we propose LangDC, a Language-aware Dynamic Token Compressor. LangDC leverages a lightweight language model to describe each video clip, converting it into soft caption tokens that serve as the visual representation. Trained with our proposed semantic density-aware supervision, LangDC aims to 1) cover the key visual cues needed for downstream reasoning and 2) dynamically adjust the compression ratio to scene richness, as reflected by description length.

<p align="center">
<img src="asset/motivation_comparision.png" alt="Comparison of LangDC and existing token compressors.">
</p>

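To make the idea above concrete, here is a minimal PyTorch sketch of language-aware dynamic compression. It is only an illustration of the mechanism described in the overview, not the released LangDC code: the class name, the tokens-per-word budget rule, and the attention-pooling choice are our own simplifications.

```python
# Toy illustration only -- not the released LangDC implementation.
# Clips whose captions are longer (semantically denser) keep more visual tokens.
import torch
import torch.nn as nn

class ToyDynamicCompressor(nn.Module):
    """Pools a clip's visual tokens into a budget set by its caption length."""

    def __init__(self, dim: int, max_queries: int = 64):
        super().__init__()
        # Learned query bank; only the first `budget` queries are used per clip.
        self.queries = nn.Parameter(torch.randn(max_queries, dim) * 0.02)
        self.pool = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, clip_tokens: torch.Tensor, caption_len: int, tokens_per_word: int = 2):
        # Longer captions (denser clips) get a larger soft-token budget.
        budget = min(self.queries.shape[0], max(1, caption_len * tokens_per_word))
        queries = self.queries[:budget].unsqueeze(0).expand(clip_tokens.shape[0], -1, -1)
        compressed, _ = self.pool(queries, clip_tokens, clip_tokens)
        return compressed  # (batch, budget, dim)

# A "busy" clip with a 12-word caption keeps more tokens than a near-static one.
compressor = ToyDynamicCompressor(dim=1024)
feats = torch.randn(1, 256, 1024)                 # 256 raw visual tokens per clip
print(compressor(feats, caption_len=12).shape)    # torch.Size([1, 24, 1024])
print(compressor(feats, caption_len=3).shape)     # torch.Size([1, 6, 1024])
```
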
## Contributions
1) We propose LangDC, a novel language-aware token compression strategy. Using soft language tokens as the visual representation, it adaptively adjusts compression ratios, improving token utilization over fixed-ratio techniques.

2) We propose semantic density-aware supervision for the token compressor. By explicitly providing reconstruction targets for token compression, we derive a more compact feature set that not only is aware of information richness but also preserves key visual cues (a toy sketch of this idea follows the figure below).

3) Experimental results demonstrate that our method reduces FLOPs by 49% relative to the strong baseline VideoGPT+ while maintaining competitive performance. Additional qualitative results show that compression adapts to the semantic density of each video clip.

<p align="center">
<img src="asset/fig_method.png" alt="Overview of LangDC.">
</p>

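As referenced in contribution 2), the snippet below sketches what a reconstruction-style target for the compressed tokens could look like. It is our own illustration under simplifying assumptions (the function name and the alignment-loss form are hypothetical), not the objective used in the paper.

```python
# Toy illustration only -- not the paper's training objective. It sketches the idea of
# giving the compressed soft caption tokens explicit reconstruction targets: the text
# embeddings of the clip's caption, so denser clips impose more targets to cover.
import torch
import torch.nn.functional as F

def caption_reconstruction_loss(soft_tokens: torch.Tensor, caption_emb: torch.Tensor) -> torch.Tensor:
    """soft_tokens: (budget, dim) compressed visual tokens for one clip.
    caption_emb: (caption_len, dim) embeddings of the clip's caption tokens."""
    # Cosine similarity between every caption embedding and every soft token.
    sim = F.normalize(caption_emb, dim=-1) @ F.normalize(soft_tokens, dim=-1).T
    # Each caption token must be "covered" by its best-matching soft token,
    # so clips with longer captions demand a more expressive compressed set.
    return (1.0 - sim.max(dim=-1).values).mean()

# Example: 6 soft tokens trying to cover a 12-word caption.
loss = caption_reconstruction_loss(torch.randn(6, 1024), torch.randn(12, 1024))
print(loss.item())
```
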
## Installation

We recommend setting up a conda environment for the project:
```shell
conda create --name=langdc python=3.11
conda activate langdc

git clone https://github.com/NIneeeeeem/LangDC.git
cd LangDC

pip install torch==2.4.0 torchvision==0.19.0 --index-url https://download.pytorch.org/whl/cu124
pip install transformers==4.41.0

pip install -r requirements.txt

export PYTHONPATH="./:$PYTHONPATH"
```
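As a quick sanity check (our suggestion, not part of the official setup), you can confirm the pinned versions and CUDA visibility from Python before continuing:
```python
# Optional sanity check: verify the pinned package versions and that CUDA is visible.
import torch
import torchvision
import transformers

print("torch:", torch.__version__)                 # expect 2.4.0
print("torchvision:", torchvision.__version__)     # expect 0.19.0
print("transformers:", transformers.__version__)   # expect 4.41.0
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```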
Additionally, install [FlashAttention](https://github.com/HazyResearch/flash-attention) for training:
```shell
pip install ninja

git clone https://github.com/HazyResearch/flash-attention.git
cd flash-attention
python setup.py install
```
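Optionally (our own suggestion, not from the official instructions), run a short smoke test to verify the build before launching training:
```python
# Optional check: confirm FlashAttention built correctly and the fused kernel runs.
import torch
import flash_attn
from flash_attn import flash_attn_func

print("flash-attn:", flash_attn.__version__)

# FlashAttention kernels require a CUDA device and fp16/bf16 inputs.
if torch.cuda.is_available():
    q = torch.randn(1, 128, 8, 64, device="cuda", dtype=torch.float16)
    out = flash_attn_func(q, q, q)
    print("flash_attn_func output shape:", tuple(out.shape))
```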
---

## Quantitative Evaluation 📊
We provide instructions to reproduce LangDC results on VideoMME, MVBench, LongVideoBench, VSIBench, and four open-ended QA benchmarks. Please follow the instructions at [eval/README.md](eval/README.md).

To reproduce the results in Table 1 of the Motivation section, please refer to [this repository](https://github.com/NIneeeeeem/VideoGPT-tokenadapter.git).

## Citations 📜

If you're using LangDC in your research or applications, please give us a star ⭐ to support us and cite using this BibTeX:
```bibtex
@misc{wang2025seeing,
      title={Seeing More, Saying More: Lightweight Language Experts are Dynamic Video Token Compressors},
      author={Xiangchen Wang and Jinrui Zhang and Teng Wang and Haigang Zhang and Feng Zheng},
      year={2025},
      eprint={2509.00969},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```

## Acknowledgements

+ [VideoGPT+](https://github.com/mbzuai-oryx/VideoGPT-plus): A pioneering attempt at video-based conversation models.
+ [LLaVA](https://github.com/haotian-liu/LLaVA): Our codebase is built upon LLaVA and VideoGPT+.