V-JEPA2 Enhanced Encoder for Video-Evoked BOLD Responses

This repository contains an enhanced V-JEPA2-large fMRI encoder trained to predict video-evoked BOLD responses on the fsaverage surface. It uses the same V-JEPA2 backbone family as the basic encoder and improves the readout in three ways: features from 24 layers (the basic encoder uses six), PLSTorch rank-16 decoders in place of Ridge, and 32-member bootstrap bagging with member-wise select-then-average layer selection.
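
The bagging scheme described above can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the repository's training code: closed-form ridge regression stands in for the PLSTorch rank-16 decoders, each member picks a single best layer by out-of-bag correlation, and the function names and toy sizes are hypothetical.

```python
import numpy as np


def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge weights (stand-in for the rank-16 PLS decoder)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ Y)


def mean_corr(pred, target):
    """Mean Pearson correlation across output columns (vertices)."""
    p = pred - pred.mean(axis=0)
    t = target - target.mean(axis=0)
    denom = np.linalg.norm(p, axis=0) * np.linalg.norm(t, axis=0) + 1e-12
    return float(((p * t).sum(axis=0) / denom).mean())


def bagged_select_then_average(layer_feats, Y, n_members=32, seed=0):
    """Each member fits one decoder per layer on a bootstrap sample and
    keeps its best layer by out-of-bag correlation (ASSUMED criterion)."""
    rng = np.random.default_rng(seed)
    n = Y.shape[0]
    members = []  # list of (layer_index, weights)
    for _ in range(n_members):
        idx = rng.integers(0, n, size=n)        # bootstrap rows
        oob = np.setdiff1d(np.arange(n), idx)   # rows not drawn
        best = None
        for layer, X in enumerate(layer_feats):
            W = fit_ridge(X[idx], Y[idx])
            score = mean_corr(X[oob] @ W, Y[oob])
            if best is None or score > best[0]:
                best = (score, layer, W)
        members.append((best[1], best[2]))
    return members


def predict(members, layer_feats):
    """Average the predictions of all bagged members."""
    return np.mean([layer_feats[layer] @ W for layer, W in members], axis=0)


# Toy data: 4 layers of 8-d features, 5 "vertices"; only layer 2 is informative.
rng = np.random.default_rng(0)
layer_feats = [rng.normal(size=(200, 8)) for _ in range(4)]
Y = layer_feats[2] @ rng.normal(size=(8, 5)) + 0.1 * rng.normal(size=(200, 5))
members = bagged_select_then_average(layer_feats, Y, n_members=8)
pred = predict(members, layer_feats)
```

On this toy problem every member selects the informative layer, so the averaged prediction reduces to an average of eight ridge fits on that layer.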

The default AutoModel variant is enhanced_all, trained on all available data from the BOLD Moments (BMD/Lahner 2024) and McMahon 2023 datasets.

In the comparison plot, the dashed reference line is the all-vertex human ceiling, obtained by applying an inverse Spearman-Brown correction, a square root transform, and a Spearman-Brown correction, in that order.
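
The two Spearman-Brown steps have simple closed forms. The sketch below chains the three named operations in the order given; the length factors `k_down` and `k_up` are hypothetical parameters, since the factors used for the plotted ceiling are not stated here.

```python
import math


def spearman_brown(r, k):
    """Predicted reliability when the amount of data is multiplied by k."""
    return k * r / (1.0 + (k - 1.0) * r)


def inverse_spearman_brown(r_k, k):
    """Reliability of one unit, given the reliability r_k of k pooled units."""
    return r_k / (k - (k - 1.0) * r_k)


def ceiling_from_reliability(r, k_down, k_up):
    """Inverse Spearman-Brown, then square root, then Spearman-Brown.
    ASSUMPTION: k_down and k_up are placeholders, not values from this card."""
    r_single = inverse_spearman_brown(r, k_down)
    c_single = math.sqrt(max(r_single, 0.0))
    return spearman_brown(c_single, k_up)
```

With `k_down = k_up = 1` both corrections are identities, and the function returns the square root of the reliability.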

Input/Output Contract

Input: one 3-second RGB video clip, represented as a float tensor shaped [B, T, C, H, W] with values in [0, 1].
Output: one vector of predicted z-scored fMRI beta responses per video, shaped [B, 20484].
Decoder-only input: precomputed pooled V-JEPA2 layer features can be passed to forward_features(features).
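
The output width 20484 equals 2 x 10242, which matches the fsaverage5 mesh (10242 vertices per hemisphere). A minimal sketch of splitting a prediction by hemisphere follows. It assumes left-hemisphere vertices come first, which is not stated in this card, so check the ordering against the repository code before relying on it. The helper name is hypothetical.

```python
import numpy as np

N_VERTICES = 20484             # output width from the contract above
N_PER_HEMI = N_VERTICES // 2   # 10242, the fsaverage5 per-hemisphere count


def split_hemispheres(prediction):
    """Split [B, 20484] predictions into two [B, 10242] halves.
    ASSUMPTION: the first half is the left hemisphere."""
    prediction = np.asarray(prediction)
    if prediction.shape[-1] != N_VERTICES:
        raise ValueError(
            f"expected last dim {N_VERTICES}, got {prediction.shape[-1]}"
        )
    return prediction[..., :N_PER_HEMI], prediction[..., N_PER_HEMI:]


left, right = split_hemispheres(np.zeros((1, N_VERTICES)))
```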

The video-input path resizes frames to 224 x 224 and applies ImageNet normalization before V-JEPA2 feature extraction.
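
The model applies these steps internally, so callers should pass unnormalized values in [0, 1]. For illustration only, the two operations look roughly like this in NumPy. The mean and standard deviation are the standard ImageNet statistics; the nearest-neighbour resize is an assumption, since the interpolation mode the model uses is not stated.

```python
import numpy as np

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def resize_nearest(video, size=224):
    """Nearest-neighbour resize of [B, T, C, H, W] frames to size x size.
    ASSUMPTION: the model's actual interpolation mode may differ."""
    h, w = video.shape[-2:]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return video[..., rows[:, None], cols[None, :]]


def imagenet_normalize(video):
    """Subtract the per-channel ImageNet mean and divide by the std."""
    mean = IMAGENET_MEAN.reshape(1, 1, 3, 1, 1)
    std = IMAGENET_STD.reshape(1, 1, 3, 1, 1)
    return (video - mean) / std


# Random 120 x 160 clip with values in [0, 1).
clip = np.random.default_rng(0).random((1, 16, 3, 120, 160), dtype=np.float32)
prepared = imagenet_normalize(resize_nearest(clip))
```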

Loading

import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "epfl-neuroai/vjepa2-encoder-enhanced",
    trust_remote_code=True,
)
model.eval()

# Dummy clip: batch of 1, 16 frames, RGB, 224 x 224, values in [0, 1].
video = torch.zeros(1, 16, 3, 224, 224)
with torch.no_grad():
    prediction = model(video)

print(prediction.shape)  # torch.Size([1, 20484])

For decoder-only debugging, skip loading the V-JEPA2 backbone and pass precomputed features to forward_features(features):

model = AutoModel.from_pretrained(
    "epfl-neuroai/vjepa2-encoder-enhanced",
    trust_remote_code=True,
    load_vjepa=False,
)

Variants

enhanced_all (default): enhanced decoder trained on all BMD/Lahner + McMahon data.
basic_all: six-layer Ridge baseline trained on all BMD/Lahner + McMahon data.
enhanced_joint_train_to_joint_val: enhanced decoder trained on joint train split and evaluated on joint validation.
basic_joint_train_to_joint_val: matched six-layer Ridge joint validation baseline.
enhanced_bmd_to_mcmahon, basic_bmd_to_mcmahon: train on BMD/Lahner, transfer to McMahon.
enhanced_mcmahon_to_bmd, basic_mcmahon_to_bmd: train on McMahon, transfer to BMD/Lahner.

Pass variant=... to from_pretrained to load a non-default checkpoint.
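
As a minimal sketch, a small wrapper can validate the name against the list above before loading. `load_variant` and `VARIANTS` are hypothetical helpers, not part of this repository; the `from_pretrained` call mirrors the Loading example with `variant` added.

```python
VARIANTS = (
    "enhanced_all",
    "basic_all",
    "enhanced_joint_train_to_joint_val",
    "basic_joint_train_to_joint_val",
    "enhanced_bmd_to_mcmahon",
    "basic_bmd_to_mcmahon",
    "enhanced_mcmahon_to_bmd",
    "basic_mcmahon_to_bmd",
)


def load_variant(variant="enhanced_all", **kwargs):
    """Load one of the listed checkpoints (hypothetical convenience wrapper)."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant {variant!r}; expected one of {VARIANTS}")
    # Imported lazily so that name validation works without transformers.
    from transformers import AutoModel

    return AutoModel.from_pretrained(
        "epfl-neuroai/vjepa2-encoder-enhanced",
        trust_remote_code=True,
        variant=variant,
        **kwargs,
    )
```

For example, `load_variant("basic_bmd_to_mcmahon")` would load the Ridge baseline trained on BMD/Lahner.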

Held-Out Metrics

variant	self-val corr	self-val MSE	transfer corr	transfer MSE
`basic_bmd_to_mcmahon`	0.252398	0.110704	0.141452	0.130703
`enhanced_bmd_to_mcmahon`	0.367588	0.104918	0.160433	0.131168
`basic_mcmahon_to_bmd`	0.381099	0.112007	0.099842	0.157052
`enhanced_mcmahon_to_bmd`	0.587974	0.095665	0.108977	0.145696
`basic_joint_train_to_joint_val`	0.249488	0.114624	-	-
`enhanced_joint_train_to_joint_val`	0.344799	0.110238	-	-
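
To read the table as enhanced-versus-basic differences, the correlation columns can be subtracted pairwise. The sketch below copies the values from the table; the dictionary layout and the `gains` helper are illustrative.

```python
# Correlations copied from the Held-Out Metrics table.
SELF_VAL_CORR = {
    "bmd_to_mcmahon": {"basic": 0.252398, "enhanced": 0.367588},
    "mcmahon_to_bmd": {"basic": 0.381099, "enhanced": 0.587974},
    "joint_train_to_joint_val": {"basic": 0.249488, "enhanced": 0.344799},
}
TRANSFER_CORR = {
    "bmd_to_mcmahon": {"basic": 0.141452, "enhanced": 0.160433},
    "mcmahon_to_bmd": {"basic": 0.099842, "enhanced": 0.108977},
}


def gains(table):
    """Enhanced minus basic correlation for each train/eval setting."""
    return {
        name: round(row["enhanced"] - row["basic"], 6)
        for name, row in table.items()
    }


for name, gain in gains(SELF_VAL_CORR).items():
    print(f"{name}: self-val corr gain {gain:+.6f}")
for name, gain in gains(TRANSFER_CORR).items():
    print(f"{name}: transfer corr gain {gain:+.6f}")
```

The enhanced decoder has the higher correlation in every setting. MSE does not always follow: for the BMD-to-McMahon transfer, the enhanced variant's transfer MSE (0.131168) is slightly higher than the basic one (0.130703).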

Data

This checkpoint was trained using data from the BOLD Moments Dataset (BMD/Lahner) and the McMahon social interaction video fMRI dataset. This repository does not include the underlying fMRI datasets or stimulus videos.

Files

enhanced_all.pth: default enhanced all-data decoder checkpoint.
basic_*.pth, enhanced_*.pth: comparison and evaluation decoder checkpoints.
vitl.pt: local V-JEPA2-large backbone weights.
config.json, configuration_vjepa2_fmri_encoder.py, modeling_vjepa2_fmri_encoder.py: custom Transformers files for AutoModel loading.
metrics.json: held-out metrics used by the model card.
assets/comparison_metrics.png: joint validation comparison plot.

Backbone Source

The loader uses the V-JEPA2 Torch Hub architecture with pretrained=False, then loads the local vitl.pt weights directly. This avoids relying on a moving external checkpoint URL while preserving compatibility with the decoder feature hooks.
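
The pattern described above might look roughly like the sketch below. It is not the repository's loader. The Torch Hub entry point name, its two-element return value, the optional "encoder" nesting, and the key prefixes are assumptions based on the public facebookresearch/vjepa2 hub code, and may not match how vitl.pt is stored.

```python
def clean_state_dict_keys(state_dict, prefixes=("module.", "backbone.")):
    """Strip leading wrapper prefixes that training code often adds to keys.
    ASSUMPTION: whether vitl.pt needs this is not stated in this card."""
    cleaned = {}
    for key, value in state_dict.items():
        stripped = True
        while stripped:
            stripped = False
            for prefix in prefixes:
                if key.startswith(prefix):
                    key = key[len(prefix):]
                    stripped = True
        cleaned[key] = value
    return cleaned


def load_local_backbone(weights_path="vitl.pt"):
    """Build the architecture without pretrained weights, then load local ones."""
    import torch  # lazy import keeps the helper above dependency-free

    # ASSUMPTION: entry point name and (encoder, predictor) return value.
    # torch.hub.load still fetches the hub code itself, just not the weights.
    encoder, _predictor = torch.hub.load(
        "facebookresearch/vjepa2", "vjepa2_vit_large", pretrained=False
    )
    checkpoint = torch.load(weights_path, map_location="cpu")
    # ASSUMPTION: weights may be nested under an "encoder" key.
    state_dict = checkpoint.get("encoder", checkpoint)
    encoder.load_state_dict(clean_state_dict_keys(state_dict))
    return encoder.eval()
```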

Citations

If you use this checkpoint, please cite the V-JEPA/V-JEPA 2 backbone papers and source datasets:

@article{bardes2024revisiting,
  title={Revisiting Feature Prediction for Learning Visual Representations from Video},
  author={Bardes, Adrien and Garrido, Quentin and Ponce, Jean and Chen, Xinlei and Rabbat, Michael and LeCun, Yann and Assran, Mahmoud and Ballas, Nicolas},
  journal={arXiv preprint arXiv:2404.08471},
  year={2024}
}

@article{assran2025vjepa2,
  title={V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning},
  author={Assran, Mido and Bardes, Adrien and Fan, David and Garrido, Quentin and Howes, Russell and Komeili, Mojtaba and Muckley, Matthew and Rizvi, Ammar and Roberts, Claire and Sinha, Koustuv and others},
  journal={arXiv preprint arXiv:2506.09985},
  year={2025}
}

@article{tang2025diverse,
  title={Diverse perceptual representations across visual pathways emerge from a single objective},
  author={Tang, Yingtian and Gokce, Abdulkadir and Al-Karkari, Khaled Jedoui and Yamins, Daniel and Schrimpf, Martin},
  journal={bioRxiv},
  pages={2025--07},
  year={2025},
  publisher={Cold Spring Harbor Laboratory}
}

@article{lahner2024modeling,
  title={Modeling short visual events through the BOLD moments video fMRI dataset and metadata},
  author={Lahner, Benjamin and Dwivedi, Kshitij and Iamshchinina, Polina and Graumann, Monika and Lascelles, Alex and Roig, Gemma and Gifford, Alessandro Thomas and Pan, Bowen and Jin, SouYoung and Ratan Murty, N Apurva and others},
  journal={Nature Communications},
  volume={15},
  number={1},
  pages={6241},
  year={2024},
  publisher={Nature Publishing Group UK London}
}

@article{mcmahon2023hierarchical,
  title={Hierarchical organization of social action features along the lateral visual pathway},
  author={McMahon, Emalie and Bonner, Michael F and Isik, Leyla},
  journal={Current Biology},
  volume={33},
  number={23},
  pages={5035--5047},
  year={2023},
  publisher={Elsevier}
}
