Instructions to use minuzero/VideoKR-Qwen3-VL-8B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use minuzero/VideoKR-Qwen3-VL-8B with Transformers:
# Load model directly from transformers import AutoProcessor, AutoModelForImageTextToText processor = AutoProcessor.from_pretrained("minuzero/VideoKR-Qwen3-VL-8B") model = AutoModelForImageTextToText.from_pretrained("minuzero/VideoKR-Qwen3-VL-8B") - Notebooks
- Google Colab
- Kaggle
VideoKR-Qwen3-VL-8B
📄 ArXiv | 💻 Code | 🤗 Collection
About
This repository contains the VideoKR-Qwen3-VL-8B model presented in VideoKR: Towards Knowledge- and Reasoning-Intensive Video Understanding (ICML 2026 Spotlight).
VideoKR-Qwen3-VL-8B is obtained through a standard SFT → GRPO pipeline on Qwen3-VL-8B-Instruct:
- Supervised fine-tuning on VideoKR-SFT-201K with CoT rationales → VideoKR-Qwen3-VL-8B-SFT
- GRPO reinforcement learning on VideoKR-RL-114K with verifiable rewards → this model
VideoKR is the first large-scale training corpus designed for knowledge- and reasoning-intensive video understanding, containing 315K video reasoning examples over 145K newly collected, CC-licensed expert-domain videos across 82 professional subjects.
Links
| Resource | Link |
|---|---|
| Training data | minuzero/VideoKR-Train |
| Evaluation data | minuzero/VideoKR-Eval |
| SFT checkpoint (Qwen2.5-VL) | minuzero/VideoKR-Qwen2.5-VL-7B-SFT |
| GRPO checkpoint (Qwen2.5-VL) | minuzero/VideoKR-Qwen2.5-VL-7B |
| SFT checkpoint (Qwen3-VL) | minuzero/VideoKR-Qwen3-VL-8B-SFT |
Performance
Results with 128 input frames. Within the Qwen3-VL-8B group, bold = best, underline = second best.
| Model | Video-MME | MVBench | LongVBench | General Avg | VideoMMMU | MMVU | SciVidBench | VideoKR-Eval | Knowledge Avg |
|---|---|---|---|---|---|---|---|---|---|
| Qwen3-VL-8B-Instruct | 68.2 | 67.9 | 61.6 | 65.9 | 61.8 | 59.6 | 33.4 | 39.0 | 48.5 |
| OneThinker | 65.8 | 69.3 | 61.4 | 65.5 | 62.9 | 61.6 | 33.8 | 38.3 | 49.2 |
| VideoAuto-R1 | 68.7 | 68.8 | 58.8 | 65.4 | 63.1 | 59.6 | 32.7 | 43.8 | 49.8 |
| Qwen3-VL-8B-Thinking | 67.6 | 68.0 | 60.0 | 65.2 | 64.9 | 60.5 | 33.0 | 41.5 | 50.0 |
| VideoKR (SFT + RL) | 67.8 | 67.0 | 61.5 | 65.4 | 63.0 | 64.8 | 32.8 | 45.3 | 51.5 |
VideoKR achieves the highest knowledge-intensive average (+3.0 over base, +1.5 over Qwen3-VL-8B-Thinking) among all Qwen3-VL-8B based methods, while maintaining competitive general video reasoning performance.
Evaluation
cd /path/to/VideoKR/lmms_eval
conda activate videokr_eval
export CUDA_VISIBLE_DEVICES=0
export VIDEOKR_MODEL=minuzero/VideoKR-Qwen3-VL-8B
export TASKS=videokr_eval
export BATCH_SIZE=1
export RUN_NAME=videokr_eval
bash examples/models/videokr_vllm.sh
Citation
If you find VideoKR useful in your research, please cite our paper:
@misc{fu2026videokrknowledgereasoningintensivevideo,
title={VideoKR: Towards Knowledge- and Reasoning-Intensive Video Understanding},
author={Lin Fu and Zheyuan Yang and Yang Wang and Tingyu Song and Arman Cohan and Yilun Zhao},
year={2026},
eprint={2606.05259},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2606.05259},
}
- Downloads last month
- 47