V-JEPA2 Enhanced Encoder for Video-Evoked BOLD Responses
This repository contains an enhanced V-JEPA2-large fMRI encoder trained to predict video-evoked BOLD responses on the fsaverage surface. It keeps the same V-JEPA2 backbone family as the basic encoder and improves the readout with 24 layer features, PLSTorch rank-16 decoders, and 32-member bootstrap bagging with member-wise select-then-average layer selection.
The default AutoModel variant is enhanced_all, trained on all available BoldMoments/Lahner2024 and McMahon2023 datasets.
In the comparison plot, the dashed reference line marks the all-vertex human ceiling, obtained by applying an inverse Spearman-Brown correction, a square-root transform, and then a Spearman-Brown correction.
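The three ceiling steps can be sketched with the standard Spearman-Brown formula and its inverse. This is a minimal illustration, not the repository's code: the card does not state how many repeats enter each correction, so `n_in` and `n_out` are hypothetical parameters, and the function names are invented for this example.

```python
import math


def spearman_brown(r: float, n: float) -> float:
    """Reliability of an n-fold average, given single-measurement reliability r."""
    return n * r / (1.0 + (n - 1.0) * r)


def inverse_spearman_brown(r_n: float, n: float) -> float:
    """Single-measurement reliability implied by the reliability r_n of an n-fold average."""
    return r_n / (n - (n - 1.0) * r_n)


def ceiling_sketch(r_measured: float, n_in: float, n_out: float) -> float:
    """Chain the three steps in the order the card lists them.

    n_in and n_out are assumptions; the card does not give their values.
    """
    r_single = inverse_spearman_brown(r_measured, n_in)
    r_root = math.sqrt(max(r_single, 0.0))  # clamp so the square root is defined
    return spearman_brown(r_root, n_out)


# Doubling the number of repeats lifts a reliability of 0.5 to 2/3,
# and the inverse correction undoes it.
r_double = spearman_brown(0.5, 2)
print(r_double, inverse_spearman_brown(r_double, 2))
```

With `n_in = n_out = 1` both corrections are the identity, so the sketch reduces to a plain square root of the measured reliability.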
Input/Output Contract
- Input: one 3-second RGB video clip, represented as a float tensor shaped `[B, T, C, H, W]` with values in `[0, 1]`.
- Output: one vector of predicted z-scored fMRI beta responses per video, shaped `[B, 20484]`.
- Decoder-only input: precomputed pooled V-JEPA2 layer features can be passed to `forward_features(features)`.
The video-input path resizes frames to 224 x 224 and applies ImageNet normalization before V-JEPA2 feature extraction.
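The sketch below illustrates the input contract and the preprocessing described above. It is not the repository's implementation: the helper names are hypothetical, bilinear interpolation is an assumption (the card does not name the resize mode), and the mean and standard deviation are the commonly used ImageNet statistics. Only the `[0, 1]` clip would be passed to the model, since the video-input path applies the resize and normalization itself.

```python
import torch
import torch.nn.functional as F

# Commonly used ImageNet channel statistics, shaped to broadcast over [B, T, C, H, W].
IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(1, 1, 3, 1, 1)
IMAGENET_STD = torch.tensor([0.229, 0.224, 0.225]).view(1, 1, 3, 1, 1)


def frames_to_clip(frames_uint8: torch.Tensor) -> torch.Tensor:
    """Convert [T, H, W, C] uint8 frames to a [1, T, C, H, W] float clip in [0, 1]."""
    clip = frames_uint8.permute(0, 3, 1, 2).float() / 255.0
    return clip.unsqueeze(0)


def resize_and_normalize(clip: torch.Tensor, size: int = 224) -> torch.Tensor:
    """Illustrative version of the resize + ImageNet normalization step."""
    b, t, c, h, w = clip.shape
    flat = clip.reshape(b * t, c, h, w)  # treat every frame as an image
    flat = F.interpolate(flat, size=(size, size), mode="bilinear", align_corners=False)
    clip = flat.reshape(b, t, c, size, size)
    return (clip - IMAGENET_MEAN) / IMAGENET_STD


# Random stand-in for decoded video frames (16 frames of 240 x 320 RGB).
frames = torch.randint(0, 256, (16, 240, 320, 3), dtype=torch.uint8)
clip = frames_to_clip(frames)             # this is what the model expects
normalized = resize_and_normalize(clip)   # what happens inside the video-input path
print(clip.shape, normalized.shape)
```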
Loading
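```python
import torch
from transformers import AutoModel

# Load the default (enhanced_all) encoder; custom model code requires trust_remote_code.
model = AutoModel.from_pretrained(
    "epfl-neuroai/vjepa2-encoder-enhanced",
    trust_remote_code=True,
)
model.eval()

# Dummy clip shaped [B, T, C, H, W] with values in [0, 1].
video = torch.zeros(1, 16, 3, 224, 224)
with torch.no_grad():
    prediction = model(video)
print(prediction.shape)  # [1, 20484]
```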
For decoder-only debugging:
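```python
from transformers import AutoModel

# Skip loading the V-JEPA2 backbone; only the decoder is available.
model = AutoModel.from_pretrained(
    "epfl-neuroai/vjepa2-encoder-enhanced",
    trust_remote_code=True,
    load_vjepa=False,
)
```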
Variants
- `enhanced_all` (default): enhanced decoder trained on all BMD/Lahner + McMahon data.
- `basic_all`: six-layer Ridge baseline trained on all BMD/Lahner + McMahon data.
- `enhanced_joint_train_to_joint_val`: enhanced decoder trained on the joint train split and evaluated on the joint validation split.
- `basic_joint_train_to_joint_val`: matched six-layer Ridge joint validation baseline.
- `enhanced_bmd_to_mcmahon`, `basic_bmd_to_mcmahon`: train on BMD/Lahner, transfer to McMahon.
- `enhanced_mcmahon_to_bmd`, `basic_mcmahon_to_bmd`: train on McMahon, transfer to BMD/Lahner.
Pass variant=... to from_pretrained to load a non-default checkpoint.
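As a sketch of how this might look in practice, the wrapper below checks a variant name against the names listed in this card before calling `from_pretrained`. The wrapper `load_encoder` and the `KNOWN_VARIANTS` tuple are hypothetical helpers written for this example; only the `variant=...` argument itself comes from the card.

```python
# Variant names as listed in this model card.
KNOWN_VARIANTS = (
    "enhanced_all",
    "basic_all",
    "enhanced_joint_train_to_joint_val",
    "basic_joint_train_to_joint_val",
    "enhanced_bmd_to_mcmahon",
    "basic_bmd_to_mcmahon",
    "enhanced_mcmahon_to_bmd",
    "basic_mcmahon_to_bmd",
)


def load_encoder(variant: str = "enhanced_all"):
    """Hypothetical helper: validate the name, then load that checkpoint."""
    if variant not in KNOWN_VARIANTS:
        raise ValueError(
            f"Unknown variant {variant!r}; expected one of {', '.join(KNOWN_VARIANTS)}"
        )
    # Imported lazily so the name check works even without transformers installed.
    from transformers import AutoModel

    return AutoModel.from_pretrained(
        "epfl-neuroai/vjepa2-encoder-enhanced",
        trust_remote_code=True,
        variant=variant,
    )


# Example (downloads weights from the Hub when run):
# model = load_encoder("enhanced_bmd_to_mcmahon")
```

Failing early on a misspelled name avoids a slower and less obvious error from the checkpoint lookup.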
Held-Out Metrics
| variant | self-val corr | self-val MSE | transfer corr | transfer MSE |
|---|---|---|---|---|
| `basic_bmd_to_mcmahon` | 0.252398 | 0.110704 | 0.141452 | 0.130703 |
| `enhanced_bmd_to_mcmahon` | 0.367588 | 0.104918 | 0.160433 | 0.131168 |
| `basic_mcmahon_to_bmd` | 0.381099 | 0.112007 | 0.099842 | 0.157052 |
| `enhanced_mcmahon_to_bmd` | 0.587974 | 0.095665 | 0.108977 | 0.145696 |
| `basic_joint_train_to_joint_val` | 0.249488 | 0.114624 | - | - |
| `enhanced_joint_train_to_joint_val` | 0.344799 | 0.110238 | - | - |
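To make the basic-versus-enhanced comparison easier to read, the short script below computes the relative correlation gain of each enhanced variant over its matched basic baseline. The numbers are copied from the table above; the dictionary layout and the `relative_gain` helper are only illustrative.

```python
# (self-val corr, transfer corr) per variant, copied from the table above.
# None marks entries the table leaves empty.
CORR = {
    "bmd_to_mcmahon": {"basic": (0.252398, 0.141452), "enhanced": (0.367588, 0.160433)},
    "mcmahon_to_bmd": {"basic": (0.381099, 0.099842), "enhanced": (0.587974, 0.108977)},
    "joint_train_to_joint_val": {"basic": (0.249488, None), "enhanced": (0.344799, None)},
}


def relative_gain(basic: float, enhanced: float) -> float:
    """Fractional improvement of the enhanced score over the basic score."""
    return (enhanced - basic) / basic


gains = {}
for split, rows in CORR.items():
    basic_self, basic_transfer = rows["basic"]
    enh_self, enh_transfer = rows["enhanced"]
    gains[split] = {
        "self_val": relative_gain(basic_self, enh_self),
        "transfer": (
            relative_gain(basic_transfer, enh_transfer)
            if basic_transfer is not None
            else None
        ),
    }

for split, g in gains.items():
    transfer = "n/a" if g["transfer"] is None else f"{g['transfer']:.1%}"
    print(f"{split}: self-val gain {g['self_val']:.1%}, transfer gain {transfer}")
```

For both transfer variants, the relative gain in self-validation correlation is larger than the gain in transfer correlation.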
Data
This checkpoint was trained using data from the BOLD Moments Dataset (BMD/Lahner) and the McMahon social interaction video fMRI dataset. This repository does not include the underlying fMRI datasets or stimulus videos.
Files
- `enhanced_all.pth`: default enhanced all-data decoder checkpoint.
- `basic_*.pth`, `enhanced_*.pth`: comparison and evaluation decoder checkpoints.
- `vitl.pt`: local V-JEPA2-large backbone weights.
- `config.json`, `configuration_vjepa2_fmri_encoder.py`, `modeling_vjepa2_fmri_encoder.py`: custom Transformers files for `AutoModel` loading.
- `metrics.json`: held-out metrics used by the model card.
- `assets/comparison_metrics.png`: joint validation comparison plot.
Backbone Source
The loader uses the V-JEPA2 Torch Hub architecture with pretrained=False, then loads the local vitl.pt weights directly. This avoids relying on a moving external checkpoint URL while preserving compatibility with the decoder feature hooks.
Citations
If you use this checkpoint, please cite the V-JEPA/V-JEPA 2 backbone papers and source datasets:
@article{bardes2024revisiting,
title={Revisiting Feature Prediction for Learning Visual Representations from Video},
author={Bardes, Adrien and Garrido, Quentin and Ponce, Jean and Chen, Xinlei and Rabbat, Michael and LeCun, Yann and Assran, Mahmoud and Ballas, Nicolas},
journal={arXiv preprint arXiv:2404.08471},
year={2024}
}
@article{assran2025vjepa2,
title={V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning},
author={Assran, Mido and Bardes, Adrien and Fan, David and Garrido, Quentin and Howes, Russell and Komeili, Mojtaba and Muckley, Matthew and Rizvi, Ammar and Roberts, Claire and Sinha, Koustuv and others},
journal={arXiv preprint arXiv:2506.09985},
year={2025}
}
@article{tang2025diverse,
title={Diverse perceptual representations across visual pathways emerge from a single objective},
author={Tang, Yingtian and Gokce, Abdulkadir and Al-Karkari, Khaled Jedoui and Yamins, Daniel and Schrimpf, Martin},
journal={bioRxiv},
pages={2025--07},
year={2025},
publisher={Cold Spring Harbor Laboratory}
}
@article{lahner2024modeling,
title={Modeling short visual events through the BOLD moments video fMRI dataset and metadata},
author={Lahner, Benjamin and Dwivedi, Kshitij and Iamshchinina, Polina and Graumann, Monika and Lascelles, Alex and Roig, Gemma and Gifford, Alessandro Thomas and Pan, Bowen and Jin, SouYoung and Ratan Murty, N Apurva and others},
journal={Nature communications},
volume={15},
number={1},
pages={6241},
year={2024},
publisher={Nature Publishing Group UK London}
}
@article{mcmahon2023hierarchical,
title={Hierarchical organization of social action features along the lateral visual pathway},
author={McMahon, Emalie and Bonner, Michael F and Isik, Leyla},
journal={Current Biology},
volume={33},
number={23},
pages={5035--5047},
year={2023},
publisher={Elsevier}
}
