Towards Data-Efficient Video Pre-training with Frozen Image Foundation Models

Svetlana Orlova, Niccolò Cavagnero, Gijs Dubbelman

Eindhoven University of Technology.

Video foundation models achieve strong performance across many video understanding tasks, but they typically require pre-training on massive video datasets, which incurs substantial data and compute costs. In contrast, modern image foundation models already provide powerful spatial representations. This raises an important question: can competitive video models be built by reusing these spatial representations and pre-training only for temporal reasoning? We take initial steps toward a lightweight training paradigm that freezes a pre-trained image foundation model and trains only a recurrent temporal module to process streaming video. By reusing an image foundation model as the spatial encoder, this approach could significantly reduce the video data and compute required compared to end-to-end video pre-training. In this work, we explore the feasibility of the approach before committing compute to video pre-training. Our empirical findings across multiple video understanding tasks suggest that strong temporal performance can emerge without large-scale video pre-training, motivating future work on recurrent video foundation models obtained by pre-training a temporal module on top of a frozen image foundation model.

💻 Code

Code, inference scripts, and setup instructions are in the GitHub repo: tue-mps/towards-video-image-frozen.

πŸ“ Models

For every task we ship three checkpoints, all using the streaming protocol:

  1. RVM: original RVM (frozen ViT + frozen GatedTransformerCore); only the readout is tuned.
  2. DINOv3 + RVM_RNN: frozen DINOv3 image encoder + GatedTransformerCore + readout, with the core and readout both trained from scratch.
  3. DINOv3 + GMMix: frozen DINOv3 image encoder + GatedMambaMix temporal core + readout, with the core and readout both trained from scratch.

In the tables below, the frozen and fine-tuned columns list which parts of each model are kept frozen and which are trained.
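To make the streaming protocol concrete, here is a minimal sketch in plain Python. It is not the repository's implementation: `frozen_encoder`, `ToyGatedCore`, and `stream` are hypothetical stand-ins, and the fixed gate value replaces what would be learned parameters in a real temporal core. The sketch only illustrates the data flow described above: each frame passes through a fixed encoder, a recurrent state is updated one frame at a time, and a readout maps the state to a prediction.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def frozen_encoder(frame: Sequence[float]) -> Vector:
    """Toy stand-in for a frozen image encoder: a fixed map with no trainable state."""
    mean = sum(frame) / len(frame)
    return [mean, max(frame) - min(frame)]


class ToyGatedCore:
    """Toy recurrent core: state <- (1 - g) * state + g * features.

    In a real model the gate would be learned; here it is a constant (assumption).
    """

    def __init__(self, gate: float = 0.5) -> None:
        self.gate = gate

    def step(self, state: Vector, features: Vector) -> Vector:
        g = self.gate
        return [(1.0 - g) * s + g * f for s, f in zip(state, features)]


def stream(
    frames: Sequence[Sequence[float]],
    core: ToyGatedCore,
    readout: Callable[[Vector], float],
) -> List[float]:
    """Process frames one at a time, carrying only the recurrent state forward."""
    state: Vector = [0.0, 0.0]
    outputs: List[float] = []
    for frame in frames:
        features = frozen_encoder(frame)     # spatial features, never updated
        state = core.step(state, features)   # temporal module (trained)
        outputs.append(readout(state))       # task readout (trained)
    return outputs


if __name__ == "__main__":
    print(stream([[0.0, 2.0], [4.0, 4.0]], ToyGatedCore(), readout=sum))
```

Because only the state is carried between steps, memory use does not grow with video length, which is the practical appeal of a recurrent temporal module for streaming input.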

Something-Something v2: action recognition

| Model | Acc (%) | ckpt | --arch | frozen | fine-tuned |
|---|---|---|---|---|---|
| RVM | 46.9 | ckpts/SthSthV2/RVM_readout.ckpt | rvm | Encoder, RNN module | Readout |
| DINOv3 + RVM_RNN | 67.1 | ckpts/SthSthV2/DINOv3+Gated.ckpt | dinov3_rvmrnn | Encoder | RNN module, Readout |
| DINOv3 + GMMix | 66.9 | ckpts/SthSthV2/DINOv3+GMM.ckpt | dinov3_gatedmambamix | Encoder | RNN module, Readout |

Waymo Open: object tracking

| Model | mIoU | ckpt | --arch | frozen | fine-tuned |
|---|---|---|---|---|---|
| RVM | 72.7 | ckpts/Waymo/RVM_readout.ckpt | rvm | Encoder, RNN module | Readout |
| DINOv3 + RVM_RNN | 85.7 | ckpts/Waymo/DINOv3+Gated.ckpt | dinov3_rvmrnn | Encoder | RNN module, Readout |
| DINOv3 + GMMix | 85.0 | ckpts/Waymo/DINOv3+GMM.ckpt | dinov3_gatedmambamix | Encoder | RNN module, Readout |
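For readers unfamiliar with the metric, the following sketch shows how a mean IoU over tracked boxes is commonly computed. It assumes axis-aligned boxes given as `(x0, y0, x1, y1)` and a plain average over box pairs; the exact evaluation protocol used for the numbers above (box format, averaging over frames and tracks) is defined in the repository, not here, and the function names are hypothetical. The table values appear to be percentages, whereas this sketch returns a fraction in [0, 1].

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed axis-aligned


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    # Overlap extent along each axis, clamped at zero for disjoint boxes.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds: Sequence[Box], targets: Sequence[Box]) -> float:
    """Average IoU over paired predicted and ground-truth boxes."""
    if len(preds) != len(targets) or not preds:
        raise ValueError("preds and targets must be non-empty and equally long")
    return sum(box_iou(p, t) for p, t in zip(preds, targets)) / len(preds)
```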

Perception Test: point tracking

| Model | AJ | ckpt | --arch | frozen | fine-tuned |
|---|---|---|---|---|---|
| RVM | 61.3 | ckpts/PerceptionTest/RVM_readout.ckpt | rvm | Encoder, RNN module | Readout |
| DINOv3 + RVM_RNN | 63.7 | ckpts/PerceptionTest/DINOv3+Gated.ckpt | dinov3_rvmrnn | Encoder | RNN module, Readout |
| DINOv3 + GMMix | 69.4 | ckpts/PerceptionTest/DINOv3+GMM.ckpt | dinov3_gatedmambamix | Encoder | RNN module, Readout |

ScanNet: depth estimation

| Model | AbsRel ↓ | ckpt | --arch | frozen | fine-tuned |
|---|---|---|---|---|---|
| RVM | 0.1293 | ckpts/ScanNet/RVM_readout.ckpt | rvm | Encoder, RNN module | Readout |
| DINOv3 + RVM_RNN | 0.0900 | ckpts/ScanNet/DINOv3+Gated.ckpt | dinov3_rvmrnn | Encoder | RNN module, Readout |
| DINOv3 + GMMix | 0.0885 | ckpts/ScanNet/DINOv3+GMM.ckpt | dinov3_gatedmambamix | Encoder | RNN module, Readout |
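AbsRel is the mean absolute relative depth error, usually defined as the average of |predicted - ground truth| / ground truth over valid pixels. The sketch below implements that standard definition on flat lists of depths; treating pixels with non-positive ground truth as invalid is an assumption here, and the repository's evaluation code may mask or scale depths differently. The function name is hypothetical.

```python
from typing import Sequence


def abs_rel(pred: Sequence[float], gt: Sequence[float]) -> float:
    """Mean of |pred - gt| / gt over pixels with positive ground-truth depth."""
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have the same length")
    # Pixels without a valid ground-truth depth are skipped (assumption).
    valid = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not valid:
        raise ValueError("no valid ground-truth depths")
    return sum(abs(p - g) / g for p, g in valid) / len(valid)
```

Lower is better: a value of 0.09 means predictions deviate from the true depth by about 9% on average.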

NuScenes: camera pose estimation

| Model | RPEₜᵣ (mm) ↓ | ckpt | --arch | frozen | fine-tuned |
|---|---|---|---|---|---|
| RVM | 36.00 | ckpts/NuScenes/RVM_readout.ckpt | rvm | Encoder, RNN module | Readout |
| DINOv3 + RVM_RNN | 29.37 | ckpts/NuScenes/DINOv3+Gated.ckpt | dinov3_rvmrnn | Encoder | RNN module, Readout |
| DINOv3 + GMMix | 28.09 | ckpts/NuScenes/DINOv3+GMM.ckpt | dinov3_gatedmambamix | Encoder | RNN module, Readout |
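The checkpoint paths in the tables above follow a regular pattern: one directory per task and one file name per `--arch` value. The helper below is a hypothetical convenience, not part of the repository; it only reconstructs the paths listed in the tables and rejects unknown names. How the path and `--arch` value are passed to the inference scripts is described in the GitHub repo.

```python
# Task directories and per-architecture file names, copied from the tables above.
TASK_DIRS = ("SthSthV2", "Waymo", "PerceptionTest", "ScanNet", "NuScenes")

ARCH_FILES = {
    "rvm": "RVM_readout.ckpt",
    "dinov3_rvmrnn": "DINOv3+Gated.ckpt",
    "dinov3_gatedmambamix": "DINOv3+GMM.ckpt",
}


def checkpoint_path(task_dir: str, arch: str) -> str:
    """Return the checkpoint path listed in the tables for a task and --arch value."""
    if task_dir not in TASK_DIRS:
        raise KeyError(f"unknown task directory: {task_dir!r}")
    if arch not in ARCH_FILES:
        raise KeyError(f"unknown arch: {arch!r}")
    return f"ckpts/{task_dir}/{ARCH_FILES[arch]}"


if __name__ == "__main__":
    for task in TASK_DIRS:
        for arch in ARCH_FILES:
            print(arch, checkpoint_path(task, arch))
```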

✏️ Citation

If you find this work useful, please cite our paper:

@inproceedings{orlova2026frozen,
  title     = {Towards Data-Efficient Video Pre-training with Frozen Image Foundation Models},
  author    = {Orlova, Svetlana and Cavagnero, Niccol\`o and Dubbelman, Gijs},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
  year      = {2026},
}

Please also cite the works we build on:

@misc{simeoni2025dinov3,
  title={{DINOv3}},
  author={Sim{\'e}oni, Oriane and Vo, Huy V. and Seitzer, Maximilian and Baldassarre, Federico and Oquab, Maxime and Jose, Cijo and Khalidov, Vasil and Szafraniec, Marc and Yi, Seungeun and Ramamonjisoa, Micha{\"e}l and Massa, Francisco and Haziza, Daniel and Wehrstedt, Luca and Wang, Jianyuan and Darcet, Timoth{\'e}e and Moutakanni, Th{\'e}o and Sentana, Leonel and Roberts, Claire and Vedaldi, Andrea and Tolan, Jamie and Brandt, John and Couprie, Camille and Mairal, Julien and J{\'e}gou, Herv{\'e} and Labatut, Patrick and Bojanowski, Piotr},
  year={2025},
  eprint={2508.10104},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2508.10104},
}

@article{zoran2025recurrent,
  title={Recurrent Video Masked Autoencoders},
  author={Zoran, Daniel and Parthasarathy, Nikhil and Yang, Yi and Hudson, Drew A and Carreira, Joao and Zisserman, Andrew},
  journal={arXiv preprint arXiv:2512.13684},
  year={2025}
}