Towards Data-Efficient Video Pre-training with Frozen Image Foundation Models
Svetlana Orlova, Niccolò Cavagnero, Gijs Dubbelman
Eindhoven University of Technology.
Video foundation models achieve strong performance across many video understanding tasks, but typically require large-scale pre-training on massive video datasets, resulting in substantial data and compute costs. In contrast, modern image foundation models already provide powerful spatial representations. This raises an important question: can competitive video models be built by reusing these spatial representations and pre-training only for temporal reasoning? We take initial steps toward exploring a lightweight training paradigm that freezes a pre-trained image foundation model and trains only a recurrent temporal module to process streaming video. By reusing an image foundation model as a spatial encoder, this approach could significantly reduce the amount of video data and compute required compared to end-to-end video pre-training. In this work, we explore the feasibility of this approach before investing in computing for video pre-training. Our empirical findings across multiple video understanding tasks suggest that strong temporal performance can emerge without large-scale video pre-training, motivating future work on recurrent video foundation models obtained by pre-training a temporal module on top of a frozen image foundation model.
Code
Code, inference scripts, and setup instructions are in the GitHub repo: tue-mps/towards-video-image-frozen.
Models
For every task we ship three checkpoints, all using the streaming protocol:
- RVM: the original RVM (frozen ViT + frozen GatedTransformerCore); only the readout is tuned.
- DINOv3 + RVM_RNN: frozen DINOv3 image encoder + GatedTransformerCore + readout, both trained from scratch.
- DINOv3 + GMMix: frozen DINOv3 image encoder + GatedMambaMix temporal core + readout, both trained from scratch.

In each results table, the frozen and fine-tuned columns list which parts of the model are kept frozen and which are trained.
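
To make the frozen/trained split and the streaming protocol concrete, here is a dependency-free toy sketch. Everything in it is hypothetical: `frozen_encoder`, `ToyRecurrentCore`, and `stream` are stand-ins rather than classes from the repository, and the gate is a fixed number instead of a learned parameter. It only illustrates the data flow: each frame passes through a fixed encoder, a recurrent state is updated, and a readout produces one prediction per frame without access to future frames.

```python
from typing import Callable, Iterable, Iterator, List, Optional

Vector = List[float]


def frozen_encoder(frame: Vector) -> Vector:
    """Stand-in for a frozen image encoder: a fixed mapping with no trainable state."""
    mean = sum(frame) / len(frame)
    return [x - mean for x in frame]


class ToyRecurrentCore:
    """Stand-in for the trainable temporal module (assumption: a simple gated update)."""

    def __init__(self, gate: float = 0.5) -> None:
        self.gate = gate  # in a real model this would be learned

    def step(self, state: Vector, features: Vector) -> Vector:
        g = self.gate
        # Blend the previous state with the current frame's features.
        return [(1.0 - g) * s + g * f for s, f in zip(state, features)]


def stream(
    frames: Iterable[Vector],
    core: ToyRecurrentCore,
    readout: Callable[[Vector], float],
) -> Iterator[float]:
    """Process frames one at a time, carrying the recurrent state forward."""
    state: Optional[Vector] = None
    for frame in frames:
        features = frozen_encoder(frame)
        if state is None:
            state = [0.0] * len(features)  # zero initial state
        state = core.step(state, features)
        yield readout(state)


if __name__ == "__main__":
    video = [[1.0, 3.0], [2.0, 6.0]]
    print(list(stream(video, ToyRecurrentCore(gate=0.5), readout=max)))
```

In this picture, only `ToyRecurrentCore` and the readout would receive gradient updates for the DINOv3-based variants, while for the RVM baseline only the readout would.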
Something-Something v2: action recognition

| Model | Acc % | ckpt | `--arch` | frozen | fine-tuned |
|---|---|---|---|---|---|
| RVM | 46.9 | `ckpts/SthSthV2/RVM_readout.ckpt` | `rvm` | Encoder, RNN module | Readout |
| DINOv3 + RVM_RNN | 67.1 | `ckpts/SthSthV2/DINOv3+Gated.ckpt` | `dinov3_rvmrnn` | Encoder | RNN module, Readout |
| DINOv3 + GMMix | 66.9 | `ckpts/SthSthV2/DINOv3+GMM.ckpt` | `dinov3_gatedmambamix` | Encoder | RNN module, Readout |
Waymo Open: object tracking

| Model | mIoU | ckpt | `--arch` | frozen | fine-tuned |
|---|---|---|---|---|---|
| RVM | 72.7 | `ckpts/Waymo/RVM_readout.ckpt` | `rvm` | Encoder, RNN module | Readout |
| DINOv3 + RVM_RNN | 85.7 | `ckpts/Waymo/DINOv3+Gated.ckpt` | `dinov3_rvmrnn` | Encoder | RNN module, Readout |
| DINOv3 + GMMix | 85.0 | `ckpts/Waymo/DINOv3+GMM.ckpt` | `dinov3_gatedmambamix` | Encoder | RNN module, Readout |
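
For readers unfamiliar with the metric, mIoU can be sketched as the mean intersection-over-union between predicted and ground-truth regions. The example below assumes axis-aligned boxes given as `(x_min, y_min, x_max, y_max)`; this card does not state whether the reported numbers use boxes or masks, or how frames and objects are averaged, so treat `box_iou` and `mean_iou` as illustrative names only.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), an assumed layout


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(pred: Sequence[Box], gt: Sequence[Box]) -> float:
    """Average IoU over paired predicted and ground-truth boxes."""
    if len(pred) != len(gt) or not pred:
        raise ValueError("pred and gt must be non-empty and of equal length")
    return sum(box_iou(p, g) for p, g in zip(pred, gt)) / len(pred)


if __name__ == "__main__":
    predicted = [(0.0, 0.0, 2.0, 2.0), (0.0, 0.0, 1.0, 1.0)]
    target = [(0.0, 0.0, 2.0, 2.0), (2.0, 2.0, 3.0, 3.0)]
    print(mean_iou(predicted, target))
```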
Perception Test: point tracking

| Model | AJ | ckpt | `--arch` | frozen | fine-tuned |
|---|---|---|---|---|---|
| RVM | 61.3 | `ckpts/PerceptionTest/RVM_readout.ckpt` | `rvm` | Encoder, RNN module | Readout |
| DINOv3 + RVM_RNN | 63.7 | `ckpts/PerceptionTest/DINOv3+Gated.ckpt` | `dinov3_rvmrnn` | Encoder | RNN module, Readout |
| DINOv3 + GMMix | 69.4 | `ckpts/PerceptionTest/DINOv3+GMM.ckpt` | `dinov3_gatedmambamix` | Encoder | RNN module, Readout |
ScanNet: depth estimation

| Model | AbsRel ↓ | ckpt | `--arch` | frozen | fine-tuned |
|---|---|---|---|---|---|
| RVM | 0.1293 | `ckpts/ScanNet/RVM_readout.ckpt` | `rvm` | Encoder, RNN module | Readout |
| DINOv3 + RVM_RNN | 0.0900 | `ckpts/ScanNet/DINOv3+Gated.ckpt` | `dinov3_rvmrnn` | Encoder | RNN module, Readout |
| DINOv3 + GMMix | 0.0885 | `ckpts/ScanNet/DINOv3+GMM.ckpt` | `dinov3_gatedmambamix` | Encoder | RNN module, Readout |
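
AbsRel (absolute relative error) is commonly defined as the mean of |d_pred - d_gt| / d_gt over valid pixels, so lower is better. The sketch below assumes that standard definition and treats non-positive ground-truth depths as invalid; the exact masking and depth range behind the numbers above are not stated in this card, and `abs_rel` is an illustrative name.

```python
from typing import Sequence


def abs_rel(pred: Sequence[float], gt: Sequence[float]) -> float:
    """Mean absolute relative depth error over pixels with valid ground truth."""
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have the same length")
    # Assumption: ground-truth depths <= 0 mark missing measurements and are skipped.
    errors = [abs(p - g) / g for p, g in zip(pred, gt) if g > 0]
    if not errors:
        raise ValueError("no valid ground-truth depths")
    return sum(errors) / len(errors)


if __name__ == "__main__":
    predicted_depth = [1.0, 3.0, 5.0]
    true_depth = [2.0, 4.0, 0.0]  # last pixel has no valid ground truth
    print(abs_rel(predicted_depth, true_depth))
```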
NuScenes: camera pose estimation

| Model | RPEₜᵣ (mm) ↓ | ckpt | `--arch` | frozen | fine-tuned |
|---|---|---|---|---|---|
| RVM | 36.00 | `ckpts/NuScenes/RVM_readout.ckpt` | `rvm` | Encoder, RNN module | Readout |
| DINOv3 + RVM_RNN | 29.37 | `ckpts/NuScenes/DINOv3+Gated.ckpt` | `dinov3_rvmrnn` | Encoder | RNN module, Readout |
| DINOv3 + GMMix | 28.09 | `ckpts/NuScenes/DINOv3+GMM.ckpt` | `dinov3_gatedmambamix` | Encoder | RNN module, Readout |
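
The ckpt and --arch columns in the tables above follow one pattern: `ckpts/<task directory>/<checkpoint file>`, with one `--arch` value per model variant. That layout can be captured in a small lookup. This is only a convenience sketch: `TASK_DIR`, `VARIANT`, and `checkpoint_for` are hypothetical names that do not come from the repository; only the path components and `--arch` strings are copied from the tables.

```python
from typing import Dict, Tuple

# Task name as used in this card -> checkpoint sub-directory (from the ckpt column).
TASK_DIR: Dict[str, str] = {
    "Something-Something v2": "SthSthV2",
    "Waymo Open": "Waymo",
    "Perception Test": "PerceptionTest",
    "ScanNet": "ScanNet",
    "NuScenes": "NuScenes",
}

# Model name -> (checkpoint file name, --arch value), identical across tasks.
VARIANT: Dict[str, Tuple[str, str]] = {
    "RVM": ("RVM_readout.ckpt", "rvm"),
    "DINOv3 + RVM_RNN": ("DINOv3+Gated.ckpt", "dinov3_rvmrnn"),
    "DINOv3 + GMMix": ("DINOv3+GMM.ckpt", "dinov3_gatedmambamix"),
}


def checkpoint_for(task: str, model: str) -> Tuple[str, str]:
    """Return (checkpoint path, --arch value) for a task/model pair."""
    ckpt_file, arch = VARIANT[model]
    return f"ckpts/{TASK_DIR[task]}/{ckpt_file}", arch


if __name__ == "__main__":
    path, arch = checkpoint_for("ScanNet", "DINOv3 + GMMix")
    print(path, arch)
```

How these values are passed to the inference scripts is described in the GitHub repo's setup instructions, not here.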
Citation
If you find this work useful, please cite our paper:
@inproceedings{orlova2026frozen,
title = {Towards Data-Efficient Video Pre-training with Frozen Image Foundation Models},
author = {Orlova, Svetlana and Cavagnero, Niccol\`o and Dubbelman, Gijs},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
year = {2026},
}
Please also cite the works we build on:
@misc{simeoni2025dinov3,
title={{DINOv3}},
author={Sim{\'e}oni, Oriane and Vo, Huy V. and Seitzer, Maximilian and Baldassarre, Federico and Oquab, Maxime and Jose, Cijo and Khalidov, Vasil and Szafraniec, Marc and Yi, Seungeun and Ramamonjisoa, Micha{\"e}l and Massa, Francisco and Haziza, Daniel and Wehrstedt, Luca and Wang, Jianyuan and Darcet, Timoth{\'e}e and Moutakanni, Th{\'e}o and Sentana, Leonel and Roberts, Claire and Vedaldi, Andrea and Tolan, Jamie and Brandt, John and Couprie, Camille and Mairal, Julien and J{\'e}gou, Herv{\'e} and Labatut, Patrick and Bojanowski, Piotr},
year={2025},
eprint={2508.10104},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2508.10104},
}
@article{zoran2025recurrent,
title={Recurrent Video Masked Autoencoders},
author={Zoran, Daniel and Parthasarathy, Nikhil and Yang, Yi and Hudson, Drew A and Carreira, Joao and Zisserman, Andrew},
journal={arXiv preprint arXiv:2512.13684},
year={2025}
}




