---
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
datasets:
- longvideotool/LongVT-Parquet
license: apache-2.0
library_name: transformers
pipeline_tag: video-text-to-text
---

# LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling

<div align="center">

[Model Collection](https://huggingface.co/collections/lmms-lab/longvt)
[Paper](https://arxiv.org/abs/2511.16334)
[Project Page](https://evolvinglmms-lab.github.io/LongVT/)
[Code](https://github.com/EvolvingLMMs-Lab/LongVT)
</div>

## Overview

Large multimodal models (LMMs) have shown great potential for video reasoning with textual Chain-of-Thought.
However, they remain vulnerable to hallucination, especially when processing long-form videos where evidence is sparse and temporally dispersed.
Inspired by how humans comprehend long videos, first skimming globally and then examining relevant clips for details, we introduce **LongVT**, an end-to-end agentic framework that enables "Thinking with **Long** **V**ideos" via interleaved Multimodal Chain-of-**T**ool-Thought.
Specifically, we exploit LMMs' inherent temporal grounding ability as a native video cropping tool to zoom in on a specific video clip and resample finer-grained video frames.

This global-to-local reasoning loop continues until answers are grounded in retrieved visual evidence.
Given the scarcity of fine-grained question-answering (QA) data for long video reasoning, we curate and will release a data suite named **VideoSIAH** to facilitate both training and evaluation.
Specifically, our training data consist of 247.9K samples for tool-integrated cold-start supervised fine-tuning, 1.6K samples for agentic reinforcement learning, and 15.4K samples for agentic reinforcement fine-tuning.
Our evaluation benchmark consists of 1,280 QA pairs carefully curated through a semi-automatic data pipeline with human-in-the-loop validation.
With a meticulously designed three-stage training strategy and extensive empirical validation, LongVT consistently outperforms existing strong baselines across four challenging long-video understanding and reasoning benchmarks.

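To make the global-to-local loop above concrete, the sketch below mimics one round trip of interleaved tool calling in plain Python. The `<crop>` tag format, the `crop_and_resample` helper, and the stubbed `policy_step` model are illustrative placeholders only; the actual tool-calling protocol and prompts are defined in the [GitHub repository](https://github.com/EvolvingLMMs-Lab/LongVT).

```python
# Illustrative sketch of the global-to-local "Chain-of-Tool-Thought" loop.
# Every name and tag format here is a hypothetical placeholder, not LongVT's exact interface.
import re

# Hypothetical tool-call tag: "<crop>start_seconds, end_seconds</crop>".
CROP_PATTERN = re.compile(r"<crop>\s*(\d+(?:\.\d+)?)\s*,\s*(\d+(?:\.\d+)?)\s*</crop>")


def crop_and_resample(video_path, start_s, end_s, fps=2.0):
    """Placeholder for the native video cropping tool: denser frames for [start_s, end_s]."""
    n_frames = max(1, int((end_s - start_s) * fps))
    return [f"{video_path}@{start_s + i / fps:.1f}s" for i in range(n_frames)]


def policy_step(context):
    """Stub standing in for the LMM policy: first zoom in on a clip, then answer."""
    if not any("<crop>" in turn for turn in context):
        return "The relevant event seems to be near the end. <crop>901, 906</crop>"
    return "<answer>The goal is scored in the final minute.</answer>"


def chain_of_tool_thought(video_path, question, max_turns=4):
    # Start from a sparse, global view of the whole video plus the question.
    context = [f"[sparse frames of {video_path}]", question]
    response = ""
    for _ in range(max_turns):
        response = policy_step(context)
        context.append(response)
        match = CROP_PATTERN.search(response)
        if match:  # the model called the cropping tool: zoom in and resample finer frames
            start_s, end_s = float(match.group(1)), float(match.group(2))
            context.append(" ".join(crop_and_resample(video_path, start_s, end_s)))
        elif "<answer>" in response:  # the answer is grounded in the retrieved clip
            break
    return response


print(chain_of_tool_thought("long_video.mp4", "When is the goal scored?"))
```

In LongVT itself this loop is driven by the model's native tool calling rather than hand-written control flow; the sketch only illustrates the interleaved think, crop, resample, and answer pattern.
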
## Model Card

This model is the RL version of LongVT, trained on the [LongVT-Parquet](https://huggingface.co/datasets/longvideotool/LongVT-Parquet) dataset.

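If you want to peek at the training data, a minimal sketch with the `datasets` library is shown below; the split name is an assumption, so check the dataset card for the actual configurations and splits.

```python
# Minimal sketch for inspecting the training data with the Hugging Face datasets library.
# The split name below is an assumption; see the dataset card for the actual configs/splits.
from datasets import load_dataset

ds = load_dataset("longvideotool/LongVT-Parquet", split="train")
print(ds)     # dataset size and column names
print(ds[0])  # one raw training sample
```
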
## Usage & Evaluation

For detailed instructions on inference and evaluation, please refer to our [GitHub repository](https://github.com/EvolvingLMMs-Lab/LongVT). We recommend using the scripts and environment provided there to reproduce our results.

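As a quick start, the snippet below sketches plain, single-turn video question answering through the standard Qwen2.5-VL interfaces in `transformers`. It is a hedged sketch rather than the official inference path: it assumes this checkpoint loads with the stock Qwen2.5-VL classes, that `qwen-vl-utils` is installed, and `MODEL_ID` is a placeholder you must replace with this repository's id. The full agentic tool-calling loop and evaluation settings live in the GitHub repository.

```python
# Hedged quick-start sketch: single-turn video QA through the standard Qwen2.5-VL stack.
# Assumptions: this checkpoint is loadable with Qwen2_5_VLForConditionalGeneration,
# qwen-vl-utils is installed, and MODEL_ID is a placeholder for this repository's id.
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

MODEL_ID = "<this-model-repo-id>"  # placeholder

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "path/to/long_video.mp4", "fps": 1.0},
            {"type": "text", "text": "What happens right after the goal is scored?"},
        ],
    }
]

# Build the chat prompt and pack the sampled video frames.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt"
).to(model.device)

# Generate and strip the prompt tokens from the decoded output.
output_ids = model.generate(**inputs, max_new_tokens=512)
answer = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```
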
## Evaluation Results

| Model | Reasoning Prompt | Tool Calling | VideoMME<br>(~1018s) | VideoMMMU<br>(subtitle) | VideoMMMU<br>(adaptation) | VideoMMMU<br>(comprehension) | LVBench<br>(~4101s) | VideoSIAH-Eval<br>(~1688s) | Average Score |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| **Proprietary LMMs** | | | | | | | | | |
| GPT-4o | ✗ | ✗ | 77.2<sup>†</sup> | 66.0<sup>†</sup> | 62.0<sup>†</sup> | 55.7<sup>†</sup> | 30.8<sup>†</sup> | 17.4 | 51.5 |
| Gemini 1.5 Pro | ✗ | ✗ | 81.3<sup>†</sup> | 59.0<sup>†</sup> | 53.3<sup>†</sup> | 49.3<sup>†</sup> | 33.1<sup>†</sup> | - | 55.2 |
| **Open-Source (Sparse)** | | | | | | | | | |
| Qwen2.5-VL-7B | ✗ | ✗ | <u>62.6</u> | <u>37.3</u> | 28.0 | 36.7 | 30.7 | <u>28.1</u> | 37.2 |
| Video-R1-7B | ✓ | ✗ | 61.0 | 36.3 | 40.7 | 52.3 | 37.2 | 27.9 | <u>42.6</u> |
| VideoRFT-7B | ✓ | ✗ | 60.9 | 36.7 | 42.0 | <u>53.0</u> | 34.7 | 26.5 | 42.3 |
| Video-Thinker-7B | ✓ | ✗ | 61.0 | 34.3 | <u>44.7</u> | <u>53.0</u> | **52.2** | 10.4 | <u>42.6</u> |
| LongVT-7B-SFT (Ours) | ✓ | ✓ | 12.5 | **37.7** | **46.0** | **58.3** | 36.0 | 26.8 | 36.2 |
| **LongVT-7B-RL (Ours)** | ✓ | ✓ | **66.1** | 32.7 | <u>44.7</u> | 50.0 | <u>37.8</u> | **31.0** | **43.7** |
| **Open-Source (Dense)** | | | | | | | | | |
| Qwen2.5-VL-7B | ✗ | ✗ | 64.3 | 35.7 | **44.3** | **56.7** | 40.9 | 33.8 | 46.0 |
| Video-R1-7B | ✓ | ✗ | 60.5 | <u>37.3</u> | 38.7 | 46.3 | 40.1 | 33.1 | 42.7 |
| VideoRFT-7B | ✓ | ✗ | 49.2 | **37.7** | 40.7 | 48.7 | 18.7 | 26.9 | 37.0 |
| Video-Thinker-7B | ✓ | ✗ | 60.8 | **37.7** | 42.7 | 55.3 | **54.3** | 6.6 | 42.9 |
| LongVT-7B-SFT (Ours) | ✓ | ✓ | 64.9 | 32.3 | 42.0 | 49.7 | 41.1 | 34.8 | 44.1 |
| LongVT-7B-RL (Ours) | ✓ | ✓ | <u>66.1</u> | **37.7** | 42.3 | <u>56.3</u> | <u>41.4</u> | <u>35.9</u> | <u>46.6</u> |
| **LongVT-7B-RFT (Ours)** | ✓ | ✓ | **67.0** | 35.7 | <u>43.7</u> | **56.7** | 41.3 | **42.0** | **47.7** |

> **Performance Comparison with Existing Video-Centric LMMs across Various Long Video Understanding and Reasoning Benchmarks.** The best and second-best results among open-source models in each column are marked in **bold** and <u>underlined</u>, respectively. The numbers with "~" denote the average video duration of each benchmark. <sup>†</sup> indicates results sourced from official reports. **Reasoning Prompt** indicates whether a standard reasoning-style prompt (✓) or a direct question-answering prompt (✗) is applied; **Tool Calling** denotes whether native tool calling is enabled (✓) or disabled (✗) in the prompt.

## Citation

If you find LongVT useful for your research and applications, please cite using this BibTeX:

```bibtex
@misc{zhang2025openmmreasonerpushingfrontiersmultimodal,
      title={OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe},
      author={Kaichen Zhang and Keming Wu and Zuhao Yang and Kairui Hu and Bin Wang and Ziwei Liu and Xingxuan Li and Lidong Bing},
      year={2025},
      eprint={2511.16334},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2511.16334},
}
```

## Acknowledgements

We gratefully acknowledge the following open-source projects that made this work possible:

- [**lmms-eval**](https://github.com/EvolvingLMMs-Lab/lmms-eval) for providing the comprehensive evaluation framework for large multimodal models.
- [**lmms-engine**](https://github.com/EvolvingLMMs-Lab/lmms-engine) for the SFT training infrastructure and tools.
- [**verl**](https://github.com/volcengine/verl) for the reinforcement learning training framework.

We thank the developers and contributors of these projects for their excellent work and for making their code publicly available.