ImageNet-1K Self-Supervised ViT Baselines
A collection of open and reproducible self-supervised Vision Transformer (ViT) checkpoints trained exclusively on ImageNet-1K (1.45M images).
Part of the Open-Knowledge-AI ImageNet-1K SSL collection. All models in this collection are pre-trained exclusively on ImageNet-1K using a unified codebase and training strategy, enabling direct comparison across methods and durations.
Codebase: github.com/Open-Knowledge-AI/lite_ssl
Method
This model is trained with LeJEPA, a heuristic-free, negative-free, and teacher-free SSL method. It combines SIGReg (Sketched Isotropic Gaussian Regularisation), which pushes the latent embeddings toward an isotropic Gaussian distribution, with an invariance loss that preserves discriminative structure.
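Schematically, the two terms can be combined as a weighted sum. The form below is a sketch based on the formulation in the LeJEPA paper as recalled here; the trade-off weight \lambda, the number of views V, and the exact view handling used in this codebase are assumptions, not values taken from this card.

```latex
\mathcal{L}_{\text{LeJEPA}}
  = (1-\lambda)\,\mathcal{L}_{\text{inv}} + \lambda\,\mathrm{SIGReg}(Z),
\qquad
\mathcal{L}_{\text{inv}}
  = \frac{1}{V}\sum_{v=1}^{V}\bigl\lVert z_v - \bar{z}_{\text{global}} \bigr\rVert_2^2
```

Here z_v is the embedding of view v, \bar{z}_{\text{global}} is the mean embedding of the global views of the same image, and SIGReg scores how far random one-dimensional projections of the embeddings Z deviate from a standard Gaussian.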
The backbone follows the ViT-v2 design from DINOv2, without register tokens.
The main branch of this repository exposes the model trained for 300 epochs. Please refer to the other revisions for the remaining checkpoints.
Training
| Setting | Value |
|---|---|
| Dataset | ImageNet-1K (1.45M images) |
| Architecture | ViT-B/16 (ViT-v2, no registers) |
| Epochs | 100, 300 |
| Views | 8 total: 2 global (224×224) + 6 local (96×96) |
| Precision | BF16 mixed (bf16-mixed) |
| GPUs | 8 (4 devices × 2 nodes, DDP via PyTorch Lightning) |
| Global batch size | 1024 (128 per GPU) |
| Optimizer | AdamW, layerwise LR decay 0.9, patch embed LR mult 0.2 (from DINOv2) |
| Learning rate | Base 5e-4 → effective 2e-3 (scaled by batch / 256); linear warmup for 10 epochs → cosine decay to 1e-6 |
| Weight decay | Cosine anneal 0.04 → 0.4 for DINO and iBOT; constant 5e-2 for LeJEPA |
| Gradient clipping | Norm, max 3.0 |
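The learning-rate and weight-decay rows can be read as the schedule sketched here. This is a minimal illustration, assuming warmup starts from zero and the schedule is evaluated per epoch; neither detail is stated in the table, and the function names are hypothetical rather than taken from the codebase.

```python
import math


def effective_lr(base_lr: float, global_batch: int, reference_batch: int = 256) -> float:
    """Linear scaling rule: the learning rate grows with the global batch size."""
    return base_lr * global_batch / reference_batch


def lr_at(epoch: float, total_epochs: int, peak_lr: float,
          warmup_epochs: int = 10, final_lr: float = 1e-6) -> float:
    """Linear warmup from 0 to peak_lr (assumed start), then cosine decay to final_lr."""
    if epoch < warmup_epochs:
        return peak_lr * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1.0 + math.cos(math.pi * progress))


def weight_decay_at(epoch: float, total_epochs: int,
                    start: float = 0.04, end: float = 0.4) -> float:
    """Cosine anneal of the weight decay from start to end (the DINO/iBOT setting)."""
    progress = epoch / total_epochs
    return end + 0.5 * (start - end) * (1.0 + math.cos(math.pi * progress))


global_batch = 8 * 128  # 8 GPUs x 128 images per GPU = 1024
peak_lr = effective_lr(5e-4, global_batch)  # 5e-4 * 1024 / 256 = 2e-3

for epoch in (0, 5, 10, 150, 300):
    print(epoch, f"lr={lr_at(epoch, 300, peak_lr):.2e}", f"wd={weight_decay_at(epoch, 300):.3f}")
```

For LeJEPA the weight decay is constant at 5e-2, so only `lr_at` applies.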
The exact config used for each run is available at configs/{dino,ibot,lejepa}/vit_{s,b}16_ep{100,300}.json in the code repository.
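The brace pattern above expands to twelve config paths. A short sketch that enumerates them, assuming every method, size, and epoch combination has a file (the card does not confirm this for each one):

```python
from itertools import product

methods = ("dino", "ibot", "lejepa")
sizes = ("s", "b")
epochs = (100, 300)

# Expand configs/{dino,ibot,lejepa}/vit_{s,b}16_ep{100,300}.json
config_paths = [
    f"configs/{method}/vit_{size}16_ep{ep}.json"
    for method, size, ep in product(methods, sizes, epochs)
]

for path in config_paths:
    print(path)
```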
Evaluation
Metrics are reported for the teacher weights of DINO and iBOT. LeJEPA is teacher-free, so its metrics come from the student weights.
| Model | IN-1K online probe (acc@1) | IN-1K linear probe (acc@1) | IN-1K k-NN (acc@1) | NYU Depth (δ1) | Pascal VOC (mAP) |
|---|---|---|---|---|---|
| DINO ViT-S/16 ep100 | 69.32 | - | - | - | - |
| DINO ViT-S/16 ep300 | 73.88 | - | - | - | - |
| DINO ViT-B/16 ep100 | 73.49 | - | - | - | - |
| DINO ViT-B/16 ep300 | Soon | - | - | - | - |
| iBOT ViT-S/16 ep100 | 69.70 | - | - | - | - |
| iBOT ViT-S/16 ep300 | 74.32 | - | - | - | - |
| iBOT ViT-B/16 ep100 | 76.50 | - | - | - | - |
| iBOT ViT-B/16 ep300 | 78.74 | - | - | - | - |
| LeJEPA ViT-S/16 ep100 | 61.85 | - | - | - | - |
| LeJEPA ViT-S/16 ep300 | 65.99 | - | - | - | - |
| LeJEPA ViT-B/16 ep100 | 69.28 | - | - | - | - |
| LeJEPA ViT-B/16 ep300 | 72.04 | - | - | - | - |
Online probe results are logged during pre-training. Linear probe, k-NN, and downstream evaluations are coming soon.
Checkpoints
from transformers import AutoModel

# Default checkpoint (teacher weights after 300 epochs)
objective = "dino"  # or "lejepa", "ibot"
model_size = "s"  # or "b"
pretrain_dataset = "in1k"  # only in1k is currently planned, unless compute can be expanded
hf_model_string = f"OK-AI/{objective}-vit{model_size}16-pretrain-{pretrain_dataset}"
model = AutoModel.from_pretrained(hf_model_string)

# Alternate training checkpoints, selected by revision
epoch_variant = 100  # or 300
state_dict_of = "student"  # "teacher" is the default for DINO and iBOT; LeJEPA only has a student
model = AutoModel.from_pretrained(
    hf_model_string,
    revision=f"ep{epoch_variant}/{state_dict_of}",
)
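The naming pattern above can be wrapped in a small helper that rejects combinations the card rules out. This is a sketch only: `checkpoint_ref` is a hypothetical function, not part of the repository, and whether every returned revision actually exists on the Hub is not confirmed here.

```python
def checkpoint_ref(objective: str, model_size: str, epochs: int, weights: str) -> tuple[str, str]:
    """Return (repo_id, revision) following the naming pattern used in this card."""
    if objective not in {"dino", "ibot", "lejepa"}:
        raise ValueError(f"unknown objective: {objective!r}")
    if model_size not in {"s", "b"}:
        raise ValueError(f"unknown model size: {model_size!r}")
    if epochs not in {100, 300}:
        raise ValueError(f"unknown epoch variant: {epochs!r}")
    if weights not in {"student", "teacher"}:
        raise ValueError(f"unknown weights: {weights!r}")
    if objective == "lejepa" and weights == "teacher":
        # LeJEPA is teacher-free, so only student weights are published.
        raise ValueError("LeJEPA only has student weights")
    repo_id = f"OK-AI/{objective}-vit{model_size}16-pretrain-in1k"
    revision = f"ep{epochs}/{weights}"
    return repo_id, revision


repo_id, revision = checkpoint_ref("ibot", "b", 100, "student")
print(repo_id, revision)
# The pair can then be passed on as:
# AutoModel.from_pretrained(repo_id, revision=revision)
```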
Usage
import requests
import torch
from PIL import Image
from transformers import AutoModel, AutoImageProcessor

model_id = "OK-AI/dino-vits16-pretrain-in1k"
processor = AutoImageProcessor.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
# Cast the pixel values to bfloat16 so they match the dtype of the model weights
inputs = processor(images=image, return_tensors="pt").to(torch.bfloat16)

with torch.no_grad():
    outputs = model(**inputs)

# The model returns a dict of the form:
# {
#     "latent": cls_tokens[:, 0],
#     "patch_latent": patch_tokens,
#     "raw_latent": x[:, 0],
#     "last_self_attention": attn,
#     "logits": self.head(cls_tokens[:, 0]),  # kept for compatibility only; the head is always identity here
# }

# CLS token: use for classification, retrieval, k-NN
cls = outputs["latent"]  # (1, 384)
# Patch tokens: use for dense tasks (depth, segmentation)
patches = outputs["patch_latent"]  # (1, 196, 384)
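As an illustration of the k-NN use of the CLS token, the sketch below classifies a query by majority vote over its nearest neighbours under cosine similarity. It uses plain Python lists and made-up 3-dimensional vectors so that it runs without downloading a checkpoint; real embeddings would come from `outputs["latent"]`, and the unweighted vote is a simplification that may differ from the k-NN protocol behind the numbers planned for the table.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_predict(query, bank, labels, k=3):
    """Majority vote among the k bank vectors most similar to the query."""
    ranked = sorted(range(len(bank)), key=lambda i: cosine(query, bank[i]), reverse=True)
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy "embeddings" standing in for CLS tokens of labelled images (illustrative values only)
bank = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1], [0.0, 1.0, 0.2], [0.1, 0.9, 0.0]]
labels = ["cat", "cat", "remote", "remote"]

prediction = knn_predict([0.8, 0.3, 0.0], bank, labels, k=3)
print(prediction)
```

For a full evaluation the same logic is usually vectorised, for example by L2-normalising the embedding matrix and taking a matrix product.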
Citation
@article{Sablayrolles2018Jun,
author = {Sablayrolles, Alexandre and Douze, Matthijs and Schmid, Cordelia and J{\ifmmode\acute{e}\else\'{e}\fi}gou, Herv{\ifmmode\acute{e}\else\'{e}\fi}},
title = {{Spreading vectors for similarity search}},
year = {2018},
month = jun,
doi = {10.48550/arXiv.1806.03198}
}
@article{Caron2021Apr,
author = {Caron, Mathilde and Touvron, Hugo and Misra, Ishan and J{\ifmmode\acute{e}\else\'{e}\fi}gou, Herv{\ifmmode\acute{e}\else\'{e}\fi} and Mairal, Julien and Bojanowski, Piotr and Joulin, Armand},
title = {{Emerging Properties in Self-Supervised Vision Transformers}},
year = {2021},
month = apr,
doi = {10.48550/arXiv.2104.14294}
}
@article{Zhou2021Nov,
author = {Zhou, Jinghao and Wei, Chen and Wang, Huiyu and Shen, Wei and Xie, Cihang and Yuille, Alan and Kong, Tao},
title = {{iBOT: Image BERT Pre-Training with Online Tokenizer}},
year = {2021},
month = nov,
doi = {10.48550/arXiv.2111.07832}
}
@article{Dong2022Dec,
author = {Dong, Xiaoyi and Bao, Jianmin and Zhang, Ting and Chen, Dongdong and Gu, Shuyang and Zhang, Weiming and Yuan, Lu and Chen, Dong and Wen, Fang and Yu, Nenghai},
title = {{CLIP Itself is a Strong Fine-tuner: Achieving 85.7{\%} and 88.0{\%} Top-1 Accuracy with ViT-B and ViT-L on ImageNet}},
year = {2022},
month = dec,
doi = {10.48550/arXiv.2212.06138}
}
@article{Oquab2023Apr,
author = {Oquab, Maxime and Darcet, Timoth{\ifmmode\acute{e}\else\'{e}\fi}e and Moutakanni, Th{\ifmmode\acute{e}\else\'{e}\fi}o and Vo, Huy and Szafraniec, Marc and Khalidov, Vasil and Fernandez, Pierre and Haziza, Daniel and Massa, Francisco and El-Nouby, Alaaeldin and Assran, Mahmoud and Ballas, Nicolas and Galuba, Wojciech and Howes, Russell and Huang, Po-Yao and Li, Shang-Wen and Misra, Ishan and Rabbat, Michael and Sharma, Vasu and Synnaeve, Gabriel and Xu, Hu and Jegou, Herv{\ifmmode\acute{e}\else\'{e}\fi} and Mairal, Julien and Labatut, Patrick and Joulin, Armand and Bojanowski, Piotr},
title = {{DINOv2: Learning Robust Visual Features without Supervision}},
year = {2023},
month = apr,
doi = {10.48550/arXiv.2304.07193}
}
@article{Balestriero2025Nov,
author = {Balestriero, Randall and LeCun, Yann},
title = {{LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics}},
year = {2025},
month = nov,
doi = {10.48550/arXiv.2511.08544}
}