VideoKR-Qwen3-VL-8B

About

This repository contains the VideoKR-Qwen3-VL-8B model presented in VideoKR: Towards Knowledge- and Reasoning-Intensive Video Understanding (ICML 2026 Spotlight).

VideoKR-Qwen3-VL-8B is obtained through a standard SFT → GRPO pipeline on Qwen3-VL-8B-Instruct:

1. Supervised fine-tuning (SFT) on VideoKR-SFT-201K with chain-of-thought (CoT) rationales → VideoKR-Qwen3-VL-8B-SFT
2. GRPO reinforcement learning on VideoKR-RL-114K with verifiable rewards → this model
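
To give a sense of what "GRPO with verifiable rewards" means in the second stage, here is a minimal sketch of the two ingredients in their commonly used form: a rule-based reward that checks a sampled answer against the ground truth, and group-relative advantages computed by normalizing rewards within the group of responses sampled for one prompt. This is not the training code for this model. The `<answer>` tag format, the exact-match rule, and the function names `verifiable_reward` and `group_relative_advantages` are assumptions made for illustration.

```python
import re
import statistics


def verifiable_reward(response: str, gold: str) -> float:
    """Hypothetical rule-based reward: 1.0 if the tagged final answer matches gold.

    Assumes (for illustration only) that the model wraps its final answer
    in <answer>...</answer> tags.
    """
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    if match is None:
        return 0.0  # no parsable answer, no reward
    predicted = match.group(1).strip().lower()
    return 1.0 if predicted == gold.strip().lower() else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so responses
    are scored relative to their siblings rather than by a learned critic.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled responses to one multiple-choice question whose answer is "B".
group = [
    "The clip shows titration. <answer>B</answer>",
    "Likely distillation. <answer>C</answer>",
    "I am not sure.",
    "Endpoint color change. <answer> b </answer>",
]
rewards = [verifiable_reward(r, "B") for r in group]
advantages = group_relative_advantages(rewards)
print(rewards)
print([round(a, 3) for a in advantages])
```

Correct responses receive positive advantages and incorrect ones negative advantages. When every response in a group gets the same reward, all advantages are zero and that prompt contributes no learning signal.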

VideoKR is the first large-scale training corpus designed for knowledge- and reasoning-intensive video understanding, containing 315K video reasoning examples over 145K newly collected, CC-licensed expert-domain videos across 82 professional subjects.

Links

Resource	Link
Training data	minuzero/VideoKR-Train
Evaluation data	minuzero/VideoKR-Eval
SFT checkpoint (Qwen2.5-VL)	minuzero/VideoKR-Qwen2.5-VL-7B-SFT
GRPO checkpoint (Qwen2.5-VL)	minuzero/VideoKR-Qwen2.5-VL-7B
SFT checkpoint (Qwen3-VL)	minuzero/VideoKR-Qwen3-VL-8B-SFT

Performance

Results with 128 input frames. Within the Qwen3-VL-8B group, bold = best, underline = second best.

Model	Video-MME	MVBench	LongVBench	General Avg	VideoMMMU	MMVU	SciVidBench	VideoKR-Eval	Knowledge Avg
Qwen3-VL-8B-Instruct	68.2	67.9	61.6	65.9	61.8	59.6	33.4	39.0	48.5
OneThinker	65.8	69.3	61.4	65.5	62.9	61.6	33.8	38.3	49.2
VideoAuto-R1	68.7	68.8	58.8	65.4	63.1	59.6	32.7	43.8	49.8
Qwen3-VL-8B-Thinking	67.6	68.0	60.0	65.2	64.9	60.5	33.0	41.5	50.0
VideoKR (SFT + RL)	67.8	67.0	61.5	65.4	63.0	64.8	32.8	45.3	51.5

VideoKR achieves the highest knowledge-intensive average among all Qwen3-VL-8B-based methods (51.5: +3.0 over the base model and +1.5 over Qwen3-VL-8B-Thinking), while remaining competitive on general video benchmarks (65.4 general average versus 65.9 for the base model).
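
The averages and gains quoted above can be re-derived from the per-benchmark scores. The sketch below assumes that "General Avg" is the unweighted mean of the three general benchmarks and "Knowledge Avg" the unweighted mean of the four knowledge-intensive ones. The source does not state this explicitly, but it is consistent with the reported numbers. Recomputed values may differ from the table in the last digit, since the table entries are themselves rounded.

```python
# Scores transcribed from the table above (128 input frames).
GENERAL = ("Video-MME", "MVBench", "LongVBench")
KNOWLEDGE = ("VideoMMMU", "MMVU", "SciVidBench", "VideoKR-Eval")

# Each row lists the general benchmarks first, then the knowledge benchmarks.
scores = {
    "Qwen3-VL-8B-Instruct": [68.2, 67.9, 61.6, 61.8, 59.6, 33.4, 39.0],
    "OneThinker":           [65.8, 69.3, 61.4, 62.9, 61.6, 33.8, 38.3],
    "VideoAuto-R1":         [68.7, 68.8, 58.8, 63.1, 59.6, 32.7, 43.8],
    "Qwen3-VL-8B-Thinking": [67.6, 68.0, 60.0, 64.9, 60.5, 33.0, 41.5],
    "VideoKR (SFT + RL)":   [67.8, 67.0, 61.5, 63.0, 64.8, 32.8, 45.3],
}


def averages(row):
    """Return (general average, knowledge average) as unweighted means."""
    general = row[:len(GENERAL)]
    knowledge = row[len(GENERAL):]
    return sum(general) / len(general), sum(knowledge) / len(knowledge)


summary = {name: averages(row) for name, row in scores.items()}
for name, (general_avg, knowledge_avg) in summary.items():
    print(f"{name:22s} general={general_avg:.2f} knowledge={knowledge_avg:.2f}")

# Knowledge-average gains of VideoKR over the two Qwen3-VL-8B references.
videokr_k = summary["VideoKR (SFT + RL)"][1]
gain_over_base = videokr_k - summary["Qwen3-VL-8B-Instruct"][1]
gain_over_thinking = videokr_k - summary["Qwen3-VL-8B-Thinking"][1]
print(f"gain over base: {gain_over_base:+.2f}")
print(f"gain over Thinking: {gain_over_thinking:+.2f}")
```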

Evaluation

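# Run from the lmms_eval directory of the VideoKR codebase
cd /path/to/VideoKR/lmms_eval
conda activate videokr_eval

export CUDA_VISIBLE_DEVICES=0                      # GPU(s) visible to the run
export VIDEOKR_MODEL=minuzero/VideoKR-Qwen3-VL-8B  # checkpoint to evaluate
export TASKS=videokr_eval                          # evaluation task(s) to run
export BATCH_SIZE=1
export RUN_NAME=videokr_eval

bash examples/models/videokr_vllm.sh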

Citation

If you find VideoKR useful in your research, please cite our paper:

@misc{fu2026videokrknowledgereasoningintensivevideo,
      title={VideoKR: Towards Knowledge- and Reasoning-Intensive Video Understanding}, 
      author={Lin Fu and Zheyuan Yang and Yang Wang and Tingyu Song and Arman Cohan and Yilun Zhao},
      year={2026},
      eprint={2606.05259},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2606.05259}, 
}