Towards Data-Efficient Video Pre-training with Frozen Image Foundation Models

Svetlana Orlova, Niccolò Cavagnero, Gijs Dubbelman

Eindhoven University of Technology.

Video foundation models achieve strong performance across many video understanding tasks, but typically require large-scale pre-training on massive video datasets, resulting in substantial data and compute costs. In contrast, modern image foundation models already provide powerful spatial representations. This raises an important question: can competitive video models be built by reusing these spatial representations and pre-training only for temporal reasoning? We take initial steps toward exploring a lightweight training paradigm that freezes a pre-trained image foundation model and trains only a recurrent temporal module to process streaming video. By reusing an image foundation model as a spatial encoder, this approach could significantly reduce the amount of video data and compute required compared to end-to-end video pre-training. In this work, we assess the feasibility of this approach before committing compute to video pre-training. Our empirical findings across multiple video understanding tasks suggest that strong temporal performance can emerge without large-scale video pre-training, motivating future work on recurrent video foundation models obtained by pre-training a temporal module on top of a frozen image foundation model.

💻 Code

Code, inference scripts, and setup instructions are in the GitHub repo: tue-mps/towards-video-image-frozen.

📍 Models

For every task we ship three checkpoints, all using the streaming protocol:

RVM — original RVM (frozen ViT + frozen GatedTransformerCore); only the readout is tuned.
DINOv3 + RVM_RNN — frozen DINOv3 image encoder + GatedTransformerCore + readout, both trained from scratch.
DINOv3 + GMMix — frozen DINOv3 image encoder + GatedMambaMix temporal core + readout, both trained from scratch.
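
The streaming protocol shared by these checkpoints can be pictured as a loop that encodes one frame at a time and carries a recurrent state forward. The following is a minimal pure-Python sketch under stated assumptions: `encode_frame`, `gated_update`, `readout`, and `stream_video` are hypothetical names, not the repository's API, and the scalar gate is a toy stand-in for GatedTransformerCore or GatedMambaMix rather than either module's actual computation.

```python
import math
from typing import List, Sequence

Vector = List[float]


def encode_frame(frame: Sequence[float]) -> Vector:
    """Hypothetical stand-in for the frozen image encoder (no trainable state)."""
    return [0.5 * value for value in frame]


def gated_update(state: Vector, features: Vector, gate_bias: float) -> Vector:
    """Toy gated recurrence: blend the previous state with the new features."""
    gate = 1.0 / (1.0 + math.exp(-gate_bias))  # sigmoid, strictly inside (0, 1)
    return [(1.0 - gate) * s + gate * f for s, f in zip(state, features)]


def readout(state: Vector) -> float:
    """Toy task head: a single scalar summary of the recurrent state."""
    return sum(state) / len(state)


def stream_video(frames: Sequence[Sequence[float]], gate_bias: float = 0.0) -> List[float]:
    """Process frames strictly in order and emit one prediction per frame."""
    state = [0.0] * len(frames[0])
    outputs = []
    for frame in frames:
        features = encode_frame(frame)                    # spatial features (frozen)
        state = gated_update(state, features, gate_bias)  # temporal state (trainable part)
        outputs.append(readout(state))                    # per-frame prediction
    return outputs
```

Because the state is the only thing carried between iterations, the prediction for a frame depends only on that frame and the ones before it, which is what makes the loop usable on a live stream.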

The `frozen` and `fine-tuned` columns in the tables below list which parts of each model are kept frozen and which are trained.
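
As a compact summary of those two columns, the sketch below maps each `--arch` value to its frozen components and derives the trained ones. The component labels (`encoder`, `rnn_module`, `readout`) and the helper `trainable_components` are hypothetical names chosen for illustration; only the `--arch` strings and the frozen/trained split come from this card.

```python
from typing import Dict, FrozenSet, List

# Illustrative component labels, ordered from input to output.
COMPONENTS = ("encoder", "rnn_module", "readout")

# Frozen components per `--arch` value, copied from the tables in this card.
FROZEN_BY_ARCH: Dict[str, FrozenSet[str]] = {
    "rvm": frozenset({"encoder", "rnn_module"}),
    "dinov3_rvmrnn": frozenset({"encoder"}),
    "dinov3_gatedmambamix": frozenset({"encoder"}),
}


def trainable_components(arch: str) -> List[str]:
    """Return the components that are trained for a given `--arch` value."""
    frozen = FROZEN_BY_ARCH[arch]
    return [name for name in COMPONENTS if name not in frozen]
```

For example, `trainable_components("rvm")` yields only the readout, while both DINOv3 variants also train the recurrent module.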

Something-Something v2 — action recognition

| Model | Acc % | ckpt | `--arch` | frozen | fine-tuned |
|---|---|---|---|---|---|
| RVM | 46.9 | `ckpts/SthSthV2/RVM_readout.ckpt` | `rvm` | Encoder, RNN module | Readout |
| DINOv3 + RVM_RNN | 67.1 | `ckpts/SthSthV2/DINOv3+Gated.ckpt` | `dinov3_rvmrnn` | Encoder | RNN module, Readout |
| DINOv3 + GMMix | 66.9 | `ckpts/SthSthV2/DINOv3+GMM.ckpt` | `dinov3_gatedmambamix` | Encoder | RNN module, Readout |

Waymo Open — object tracking

| Model | mIoU | ckpt | `--arch` | frozen | fine-tuned |
|---|---|---|---|---|---|
| RVM | 72.7 | `ckpts/Waymo/RVM_readout.ckpt` | `rvm` | Encoder, RNN module | Readout |
| DINOv3 + RVM_RNN | 85.7 | `ckpts/Waymo/DINOv3+Gated.ckpt` | `dinov3_rvmrnn` | Encoder | RNN module, Readout |
| DINOv3 + GMMix | 85.0 | `ckpts/Waymo/DINOv3+GMM.ckpt` | `dinov3_gatedmambamix` | Encoder | RNN module, Readout |
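
This card does not spell out how mIoU is computed. If it is a mean intersection-over-union over tracked boxes (an assumption), the metric could be sketched as follows for axis-aligned 2D boxes given as `(x_min, y_min, x_max, y_max)`. The names `box_iou` and `mean_iou` are illustrative, and the sketch returns a fraction in [0, 1], whereas the table values appear to be on a 0 to 100 scale.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_iou(predicted: Sequence[Box], ground_truth: Sequence[Box]) -> float:
    """Average IoU over paired per-frame boxes (assumes equal, non-zero length)."""
    scores = [box_iou(p, g) for p, g in zip(predicted, ground_truth)]
    return sum(scores) / len(scores)
```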

Perception Test — point tracking

| Model | AJ | ckpt | `--arch` | frozen | fine-tuned |
|---|---|---|---|---|---|
| RVM | 61.3 | `ckpts/PerceptionTest/RVM_readout.ckpt` | `rvm` | Encoder, RNN module | Readout |
| DINOv3 + RVM_RNN | 63.7 | `ckpts/PerceptionTest/DINOv3+Gated.ckpt` | `dinov3_rvmrnn` | Encoder | RNN module, Readout |
| DINOv3 + GMMix | 69.4 | `ckpts/PerceptionTest/DINOv3+GMM.ckpt` | `dinov3_gatedmambamix` | Encoder | RNN module, Readout |

ScanNet — depth estimation

| Model | AbsRel ↓ | ckpt | `--arch` | frozen | fine-tuned |
|---|---|---|---|---|---|
| RVM | 0.1293 | `ckpts/ScanNet/RVM_readout.ckpt` | `rvm` | Encoder, RNN module | Readout |
| DINOv3 + RVM_RNN | 0.0900 | `ckpts/ScanNet/DINOv3+Gated.ckpt` | `dinov3_rvmrnn` | Encoder | RNN module, Readout |
| DINOv3 + GMMix | 0.0885 | `ckpts/ScanNet/DINOv3+GMM.ckpt` | `dinov3_gatedmambamix` | Encoder | RNN module, Readout |
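
AbsRel conventionally denotes the mean absolute relative depth error, mean(|d_pred - d_gt| / d_gt), taken over pixels with valid ground truth. The sketch below follows that common definition; whether the evaluation here applies extra masking, depth caps, or scale alignment is not stated in this card, so treat the function (and its hypothetical name `abs_rel`) as an illustration only.

```python
from typing import Sequence


def abs_rel(predicted: Sequence[float], ground_truth: Sequence[float]) -> float:
    """Mean of |pred - gt| / gt over entries whose ground-truth depth is positive."""
    errors = [
        abs(p - g) / g
        for p, g in zip(predicted, ground_truth)
        if g > 0  # skip pixels without valid ground-truth depth
    ]
    if not errors:
        raise ValueError("no valid ground-truth depth values")
    return sum(errors) / len(errors)
```

Lower is better: a value of 0.09 means predictions deviate from the true depth by about 9% on average.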

NuScenes — camera pose estimation

| Model | RPEₜᵣ (mm) ↓ | ckpt | `--arch` | frozen | fine-tuned |
|---|---|---|---|---|---|
| RVM | 36.00 | `ckpts/NuScenes/RVM_readout.ckpt` | `rvm` | Encoder, RNN module | Readout |
| DINOv3 + RVM_RNN | 29.37 | `ckpts/NuScenes/DINOv3+Gated.ckpt` | `dinov3_rvmrnn` | Encoder | RNN module, Readout |
| DINOv3 + GMMix | 28.09 | `ckpts/NuScenes/DINOv3+GMM.ckpt` | `dinov3_gatedmambamix` | Encoder | RNN module, Readout |
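
Since some metrics above are higher-is-better and two are marked ↓, reading off the best checkpoint per task takes a little care. The sketch below copies the reported numbers from the five tables and picks the best model per task; the dictionary layout and the helper `best_model` are illustrative and not part of the repository.

```python
from typing import Dict, Tuple

# task -> (higher_is_better, {model: reported score}); values copied from this card.
RESULTS: Dict[str, Tuple[bool, Dict[str, float]]] = {
    "Something-Something v2": (True, {"RVM": 46.9, "DINOv3 + RVM_RNN": 67.1, "DINOv3 + GMMix": 66.9}),
    "Waymo Open": (True, {"RVM": 72.7, "DINOv3 + RVM_RNN": 85.7, "DINOv3 + GMMix": 85.0}),
    "Perception Test": (True, {"RVM": 61.3, "DINOv3 + RVM_RNN": 63.7, "DINOv3 + GMMix": 69.4}),
    "ScanNet": (False, {"RVM": 0.1293, "DINOv3 + RVM_RNN": 0.0900, "DINOv3 + GMMix": 0.0885}),
    "NuScenes": (False, {"RVM": 36.00, "DINOv3 + RVM_RNN": 29.37, "DINOv3 + GMMix": 28.09}),
}


def best_model(task: str) -> str:
    """Return the model with the best reported score, respecting metric direction."""
    higher_is_better, scores = RESULTS[task]
    pick = max if higher_is_better else min
    return pick(scores, key=scores.get)
```

On these numbers, a DINOv3-based variant is best on every task: DINOv3 + RVM_RNN leads on Something-Something v2 and Waymo Open, and DINOv3 + GMMix leads on the other three.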

✏️ Citation

If you find this work useful, please cite our paper:

@inproceedings{orlova2026frozen,
  title     = {Towards Data-Efficient Video Pre-training with Frozen Image Foundation Models},
  author    = {Orlova, Svetlana and Cavagnero, Niccol\`o and Dubbelman, Gijs},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
  year      = {2026},
}

Please also cite the works we build on:

@misc{simeoni2025dinov3,
  title={{DINOv3}},
  author={Sim{\'e}oni, Oriane and Vo, Huy V. and Seitzer, Maximilian and Baldassarre, Federico and Oquab, Maxime and Jose, Cijo and Khalidov, Vasil and Szafraniec, Marc and Yi, Seungeun and Ramamonjisoa, Micha{\"e}l and Massa, Francisco and Haziza, Daniel and Wehrstedt, Luca and Wang, Jianyuan and Darcet, Timoth{\'e}e and Moutakanni, Th{\'e}o and Sentana, Leonel and Roberts, Claire and Vedaldi, Andrea and Tolan, Jamie and Brandt, John and Couprie, Camille and Mairal, Julien and J{\'e}gou, Herv{\'e} and Labatut, Patrick and Bojanowski, Piotr},
  year={2025},
  eprint={2508.10104},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2508.10104},
}

@article{zoran2025recurrent,
  title={Recurrent Video Masked Autoencoders},
  author={Zoran, Daniel and Parthasarathy, Nikhil and Yang, Yi and Hudson, Drew A and Carreira, Joao and Zisserman, Andrew},
  journal={arXiv preprint arXiv:2512.13684},
  year={2025}
}
