VideoKR-Qwen3-VL-8B

📄 ArXiv  |  💻 Code  |  🤗 Collection

About

This repository contains the VideoKR-Qwen3-VL-8B model presented in VideoKR: Towards Knowledge- and Reasoning-Intensive Video Understanding (ICML 2026 Spotlight).

VideoKR-Qwen3-VL-8B is obtained through a standard SFT → GRPO pipeline on Qwen3-VL-8B-Instruct:

  1. Supervised fine-tuning on VideoKR-SFT-201K with CoT rationales → VideoKR-Qwen3-VL-8B-SFT
  2. GRPO reinforcement learning on VideoKR-RL-114K with verifiable rewards → this model
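To make the second stage concrete, here is a minimal sketch of the group-relative advantage that gives GRPO its name, paired with a simple verifiable reward. It is an illustration under stated assumptions, not the training code used for this model: the exact-match reward, the population standard deviation, and the `eps` constant are all hypothetical choices.

```python
from statistics import mean, pstdev


def exact_match_reward(prediction: str, reference: str) -> float:
    """Hypothetical verifiable reward: 1.0 if the normalized answers match, else 0.0."""
    return 1.0 if prediction.strip().lower() == reference.strip().lower() else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against the mean and std of its own sampling group.

    Assumption: population std plus a small eps; implementations differ on this detail.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one multiple-choice question whose reference answer is "B".
samples = ["B", "A", "b ", "C"]
rewards = [exact_match_reward(s, "B") for s in samples]
advantages = group_relative_advantages(rewards)

print(rewards)
print([round(a, 3) for a in advantages])
```

Correct samples receive a positive advantage and incorrect ones a negative advantage. A group whose samples all earn the same reward yields zero advantages and therefore contributes no gradient signal.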

VideoKR is the first large-scale training corpus designed for knowledge- and reasoning-intensive video understanding, containing 315K video reasoning examples over 145K newly collected, CC-licensed expert-domain videos across 82 professional subjects.

Links

| Resource | Link |
|---|---|
| Training data | minuzero/VideoKR-Train |
| Evaluation data | minuzero/VideoKR-Eval |
| SFT checkpoint (Qwen2.5-VL) | minuzero/VideoKR-Qwen2.5-VL-7B-SFT |
| GRPO checkpoint (Qwen2.5-VL) | minuzero/VideoKR-Qwen2.5-VL-7B |
| SFT checkpoint (Qwen3-VL) | minuzero/VideoKR-Qwen3-VL-8B-SFT |

Performance

Results with 128 input frames. Within the Qwen3-VL-8B group, bold = best, underline = second best.

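| Model | Video-MME | MVBench | LongVBench | General Avg | VideoMMMU | MMVU | SciVidBench | VideoKR-Eval | Knowledge Avg |
|---|---|---|---|---|---|---|---|---|---|
| Qwen3-VL-8B-Instruct | <u>68.2</u> | 67.9 | **61.6** | **65.9** | 61.8 | 59.6 | <u>33.4</u> | 39.0 | 48.5 |
| OneThinker | 65.8 | **69.3** | 61.4 | <u>65.5</u> | 62.9 | <u>61.6</u> | **33.8** | 38.3 | 49.2 |
| VideoAuto-R1 | **68.7** | <u>68.8</u> | 58.8 | 65.4 | <u>63.1</u> | 59.6 | 32.7 | <u>43.8</u> | 49.8 |
| Qwen3-VL-8B-Thinking | 67.6 | 68.0 | 60.0 | 65.2 | **64.9** | 60.5 | 33.0 | 41.5 | <u>50.0</u> |
| VideoKR (SFT + RL) | 67.8 | 67.0 | <u>61.5</u> | 65.4 | 63.0 | **64.8** | 32.8 | **45.3** | **51.5** |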
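The results above use 128 input frames per video. The card does not say how those frames are chosen, so the following is only a sketch of one common choice, uniform sampling over the whole clip. The function name `uniform_frame_indices` and the midpoint rule are assumptions for illustration.

```python
def uniform_frame_indices(total_frames: int, num_frames: int = 128) -> list[int]:
    """Pick up to num_frames indices spread evenly over [0, total_frames)."""
    if total_frames <= 0 or num_frames <= 0:
        return []
    if total_frames <= num_frames:
        # Short clip: keep every frame rather than repeating any.
        return list(range(total_frames))
    # Split the clip into num_frames equal segments and take each segment's midpoint.
    step = total_frames / num_frames
    return [int(step * (i + 0.5)) for i in range(num_frames)]


# Example: a 5-minute clip at 30 fps has 9000 frames.
indices = uniform_frame_indices(total_frames=9000)
print(len(indices), indices[:3], indices[-1])
```

Sampling segment midpoints rather than segment starts avoids always selecting frame 0 and keeps the last sampled frame away from the very end of the clip.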

Among all Qwen3-VL-8B-based methods, VideoKR achieves the highest knowledge-intensive average: 51.5, which is +3.0 over the Qwen3-VL-8B-Instruct base model and +1.5 over Qwen3-VL-8B-Thinking. It also stays competitive on general video reasoning, with a General Avg of 65.4 against 65.9 for the base model.
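The Knowledge Avg column and the quoted gains can be checked against the per-benchmark scores. This sketch assumes the average is an unweighted mean of VideoMMMU, MMVU, SciVidBench, and VideoKR-Eval, rounded half-up to one decimal place. Neither assumption is stated in the card, but together they reproduce the table.

```python
from decimal import Decimal, ROUND_HALF_UP

# Scores copied from the performance table, in the order
# VideoMMMU, MMVU, SciVidBench, VideoKR-Eval.
knowledge_scores = {
    "Qwen3-VL-8B-Instruct": ["61.8", "59.6", "33.4", "39.0"],
    "Qwen3-VL-8B-Thinking": ["64.9", "60.5", "33.0", "41.5"],
    "VideoKR (SFT + RL)": ["63.0", "64.8", "32.8", "45.3"],
}


def rounded_mean(values: list[str]) -> Decimal:
    """Unweighted mean, rounded half-up to one decimal (assumed rounding rule)."""
    total = sum(Decimal(v) for v in values)
    return (total / len(values)).quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


averages = {name: rounded_mean(vals) for name, vals in knowledge_scores.items()}
gain_over_base = averages["VideoKR (SFT + RL)"] - averages["Qwen3-VL-8B-Instruct"]
gain_over_thinking = averages["VideoKR (SFT + RL)"] - averages["Qwen3-VL-8B-Thinking"]

for name, avg in averages.items():
    print(f"{name}: {avg}")
print(f"gain over base: +{gain_over_base}, gain over Thinking: +{gain_over_thinking}")
```

`Decimal` is used instead of floats so that exact halves such as 48.45 round deterministically.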

Evaluation

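```bash
# Run from the lmms_eval directory of the VideoKR code repository.
cd /path/to/VideoKR/lmms_eval
conda activate videokr_eval

# Settings for the launch script: GPU, checkpoint, task, batch size, run label.
export CUDA_VISIBLE_DEVICES=0
export VIDEOKR_MODEL=minuzero/VideoKR-Qwen3-VL-8B
export TASKS=videokr_eval
export BATCH_SIZE=1
export RUN_NAME=videokr_eval

# Launch the vLLM-backed evaluation.
bash examples/models/videokr_vllm.sh
```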
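To evaluate more than one checkpoint with the same settings, the exports above can be driven from a small Python wrapper. This is a sketch only: `build_eval_env`, `run_eval`, and the per-checkpoint `RUN_NAME` scheme are hypothetical helpers, and it assumes the launch script also accepts the SFT checkpoint listed under Links. It defaults to a dry run, so nothing is launched until `dry_run=False`.

```python
import os
import subprocess
from typing import Optional

SCRIPT = ["bash", "examples/models/videokr_vllm.sh"]  # script named in this card


def build_eval_env(model: str, tasks: str = "videokr_eval", batch_size: int = 1,
                   gpus: str = "0", run_name: Optional[str] = None) -> dict:
    """Copy the current environment and set the variables exported above."""
    env = dict(os.environ)
    env.update({
        "CUDA_VISIBLE_DEVICES": gpus,
        "VIDEOKR_MODEL": model,
        "TASKS": tasks,
        "BATCH_SIZE": str(batch_size),
        # Hypothetical naming scheme: task plus checkpoint name, unless overridden.
        "RUN_NAME": run_name or f"{tasks}_{model.split('/')[-1]}",
    })
    return env


def run_eval(model: str, repo_dir: str, dry_run: bool = True) -> list:
    """Print the planned run, and launch the script only when dry_run is False."""
    env = build_eval_env(model)
    print(f"[{env['RUN_NAME']}] {' '.join(SCRIPT)} (cwd={repo_dir})")
    if not dry_run:
        subprocess.run(SCRIPT, cwd=repo_dir, env=env, check=True)
    return SCRIPT


# Assumption: both Qwen3-VL checkpoints work with the same launch script.
for checkpoint in ["minuzero/VideoKR-Qwen3-VL-8B-SFT", "minuzero/VideoKR-Qwen3-VL-8B"]:
    run_eval(checkpoint, repo_dir="/path/to/VideoKR/lmms_eval")
```

The wrapper still expects the `videokr_eval` conda environment to be active in the shell that starts it.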

Citation

If you find VideoKR useful in your research, please cite our paper:

@misc{fu2026videokrknowledgereasoningintensivevideo,
      title={VideoKR: Towards Knowledge- and Reasoning-Intensive Video Understanding}, 
      author={Lin Fu and Zheyuan Yang and Yang Wang and Tingyu Song and Arman Cohan and Yilun Zhao},
      year={2026},
      eprint={2606.05259},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2606.05259}, 
}