ImageNet-1K Self-Supervised ViT Baselines
A collection of open and reproducible self-supervised Vision Transformer (ViT) checkpoints trained exclusively on ImageNet-1K (1.45M images).
Part of the Open-Knowledge-AI ImageNet-1K SSL collection. All models in this collection are pre-trained exclusively on ImageNet-1K using a unified codebase and training strategy, enabling direct comparison across methods and durations.
Codebase: github.com/Open-Knowledge-AI/lite_ssl
Method
This model is trained with iBOT: a masked image modelling objective where a momentum teacher acts as an online tokenizer, combined with a DINO-style global CLS distillation objective and a KoLeo regulariser on the CLS embeddings to encourage uniform representation spread.
The backbone follows the ViT-v2 design from DINOv2, without register tokens.
The main branch of this repository exposes the model trained for 300 epochs. Please refer to the other revisions for the other checkpoints.
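To make the KoLeo term mentioned above concrete, here is a minimal sketch in plain Python. It assumes the common formulation (L2-normalise the embeddings, then penalise small nearest-neighbour distances within the batch); the function names and the `eps` value are illustrative and not taken from the lite_ssl codebase, which may differ in details.

```python
import math


def l2_normalize(vec, eps=1e-8):
    """Scale a vector to (approximately) unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / (norm + eps) for x in vec]


def koleo_loss(embeddings, eps=1e-8):
    """Negative mean log-distance from each point to its nearest neighbour.

    Assumed formulation: minimising this value pushes embeddings apart,
    which encourages a more uniform spread on the unit sphere.
    """
    points = [l2_normalize(e) for e in embeddings]
    if len(points) < 2:
        raise ValueError("KoLeo needs at least two embeddings in the batch")
    total = 0.0
    for i, p in enumerate(points):
        # Distance to the closest *other* point in the batch
        nearest = min(math.dist(p, q) for j, q in enumerate(points) if j != i)
        total += math.log(nearest + eps)
    return -total / len(points)


spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
clustered = [[1.0, 0.0], [1.0, 0.01], [-1.0, 0.0], [0.0, -1.0]]
print(koleo_loss(spread) < koleo_loss(clustered))
```

A batch with two nearly identical embeddings gets a higher loss than an evenly spread one, which is the behaviour the regulariser relies on. In a real training loop the same computation would be done with batched tensor operations on the CLS embeddings.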
Training
| Setting | Value |
|---|---|
| Dataset | ImageNet-1K (1.45M images) |
| Architecture | ViT-B/16 (ViT-v2, no registers) |
| Epochs | 100, 300 |
| Views | 8 total: 2 global (224×224) + 6 local (96×96) |
| Precision | BF16 mixed (bf16-mixed) |
| GPUs | 8× (4 devices × 2 nodes, DDP via PyTorch Lightning) |
| Global batch size | 1024 (128 per GPU) |
| Optimizer | AdamW, layerwise LR decay 0.9, patch embed LR mult 0.2 (from DINOv2) |
| Learning rate | Base 5e-4 → effective 2e-3 (scaled by batch / 256); linear warmup 10 epochs → cosine decay to 1e-6 |
| Weight decay | Cosine anneal 0.04 → 0.4 (for DINO and iBOT), constant 5e-2 for LeJEPA |
| Gradient clipping | Norm, max 3.0 |
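
The learning-rate and weight-decay rows above can be read as the following schedule. This is a sketch under stated assumptions: the constants come from the table, but the per-epoch granularity, a warmup that starts from zero, and a weight-decay anneal that spans the whole run are assumptions (the actual code may step per iteration or start the warmup elsewhere). The function names are illustrative.

```python
import math

BASE_LR = 5e-4          # from the table
GLOBAL_BATCH = 1024     # from the table
WARMUP_EPOCHS = 10      # from the table
FINAL_LR = 1e-6         # from the table
WD_START, WD_END = 0.04, 0.4  # DINO / iBOT values from the table


def peak_lr(base_lr=BASE_LR, global_batch=GLOBAL_BATCH):
    """Linear scaling rule: base LR times batch size divided by 256."""
    return base_lr * global_batch / 256


def lr_at(epoch, total_epochs, warmup_epochs=WARMUP_EPOCHS):
    """Linear warmup (assumed to start at 0), then cosine decay to FINAL_LR."""
    peak = peak_lr()
    if epoch < warmup_epochs:
        return peak * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return FINAL_LR + 0.5 * (peak - FINAL_LR) * (1 + math.cos(math.pi * progress))


def weight_decay_at(epoch, total_epochs):
    """Cosine anneal from WD_START up to WD_END over the run (assumed span)."""
    progress = epoch / total_epochs
    return WD_END + 0.5 * (WD_START - WD_END) * (1 + math.cos(math.pi * progress))


for epoch in (0, 10, 150, 300):
    print(epoch, lr_at(epoch, 300), weight_decay_at(epoch, 300))
```

With a global batch of 1024 the scaling rule gives 5e-4 × 1024 / 256 = 2e-3, matching the effective rate in the table.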
The exact config used for this run is available at configs/{dino,ibot,lejepa}/vit_{s,b}16_ep{100,300}.json in the code repository.
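For readers unfamiliar with layerwise LR decay, the sketch below shows how a decay of 0.9 and a patch embed multiplier of 0.2 could translate into per-parameter multipliers for a 12-block ViT. It is modelled on the scheme used in the public DINOv2 code; whether lite_ssl applies exactly this rule is an assumption, and the parameter names (`patch_embed`, `blocks.N`, and so on) are hypothetical.

```python
def lr_multiplier(param_name, num_layers=12, decay=0.9, patch_embed_mult=0.2):
    """Per-parameter LR multiplier under layerwise decay (assumed scheme).

    Later blocks keep a multiplier close to 1; earlier blocks and the
    embeddings are scaled down geometrically.
    """
    # Default: parameters after the last block (e.g. the final norm)
    layer_id = num_layers + 1
    if param_name.startswith(("patch_embed", "cls_token", "pos_embed", "mask_token")):
        layer_id = 0
    elif param_name.startswith("blocks."):
        # "blocks.3.attn.qkv.weight" -> block index 3 -> layer_id 4
        layer_id = int(param_name.split(".")[1]) + 1
    mult = decay ** (num_layers + 1 - layer_id)
    if param_name.startswith("patch_embed"):
        # Extra multiplier applied on top of the layerwise decay
        mult *= patch_embed_mult
    return mult


for name in ("patch_embed.proj.weight", "blocks.0.attn.qkv.weight",
             "blocks.11.mlp.fc1.weight", "norm.weight"):
    print(name, lr_multiplier(name))
```

Each multiplier would then scale the scheduled learning rate for its parameter group.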
Evaluation
Metrics are only available for the teacher weights.
| Model | IN-1K online probe (acc@1) | IN-1K linear probe (acc@1) | IN-1K k-NN (acc@1) | NYU Depth (δ1) | Pascal VOC (mAP) |
|---|---|---|---|---|---|
| DINO ViT-S/16 ep100 | 69.32 | - | - | - | - |
| DINO ViT-S/16 ep300 | 73.88 | - | - | - | - |
| DINO ViT-B/16 ep100 | 73.49 | - | - | - | - |
| DINO ViT-B/16 ep300 | Soon | - | - | - | - |
| iBOT ViT-S/16 ep100 | 69.70 | - | - | - | - |
| iBOT ViT-S/16 ep300 | 74.32 | - | - | - | - |
| iBOT ViT-B/16 ep100 | 76.50 | - | - | - | - |
| iBOT ViT-B/16 ep300 | 78.74 | - | - | - | - |
| LeJEPA ViT-S/16 ep100 | 61.85 | - | - | - | - |
| LeJEPA ViT-S/16 ep300 | 65.99 | - | - | - | - |
| LeJEPA ViT-B/16 ep100 | 69.28 | - | - | - | - |
| LeJEPA ViT-B/16 ep300 | 72.04 | - | - | - | - |
Online probe results are logged during pre-training. Linear probe, k-NN, and downstream evaluations are coming soon.
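As an illustration of what a k-NN acc@1 number measures, here is a generic sketch: each query feature is labelled by majority vote among its k most cosine-similar features in a labelled bank. The evaluation protocol for this collection has not been published yet, so the choice of k, the unweighted vote, and all names below are assumptions.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_predict(query, bank, labels, k=3):
    """Majority vote over the k bank features most similar to the query."""
    ranked = sorted(range(len(bank)), key=lambda i: cosine(query, bank[i]), reverse=True)
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


def top1_accuracy(queries, targets, bank, labels, k=3):
    """Percentage of queries whose predicted label matches the target."""
    hits = sum(knn_predict(q, bank, labels, k) == t for q, t in zip(queries, targets))
    return 100.0 * hits / len(queries)


# Toy 2-D "features"; real features would be the model's CLS embeddings
bank = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = ["cat", "cat", "dog", "dog"]
queries = [[1.0, 0.05], [0.05, 1.0]]
targets = ["cat", "dog"]
print(top1_accuracy(queries, targets, bank, labels))
```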
Checkpoints
from transformers import AutoModel
# default checkpoint (teacher checkpoint after 300 epochs)
objective = "dino"  # or "ibot", "lejepa"
model_size = "s"  # or "b"
pretrain_dataset = "in1k"  # only in1k is planned unless more compute becomes available
hf_model_string = f"OK-AI/{objective}-vit{model_size}16-pretrain-{pretrain_dataset}"
model = AutoModel.from_pretrained(hf_model_string)
# alternate training checkpoints
epoch_variant = 100  # or 300
state_dict_of = "student"  # "teacher" is the default for dino and ibot; lejepa only has "student"
model = AutoModel.from_pretrained(
    hf_model_string,
    revision=f"ep{epoch_variant}/{state_dict_of}",
)
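
The naming scheme above can be wrapped in a small helper that validates its inputs before any download is attempted. This is a hypothetical convenience function, not part of the repository; it only encodes the patterns shown above, and not every combination is necessarily published (for example, DINO ViT-B/16 ep300 is still listed as "Soon").

```python
OBJECTIVES = {"dino", "ibot", "lejepa"}
MODEL_SIZES = {"s", "b"}
EPOCH_VARIANTS = {100, 300}


def checkpoint_ref(objective, model_size, epochs=300, weights=None):
    """Return (repo_id, revision) following the naming pattern shown above.

    Hypothetical helper: it checks the arguments against the options named
    in this card but does not check that the revision exists on the Hub.
    """
    if objective not in OBJECTIVES:
        raise ValueError(f"unknown objective: {objective!r}")
    if model_size not in MODEL_SIZES:
        raise ValueError(f"unknown model size: {model_size!r}")
    if epochs not in EPOCH_VARIANTS:
        raise ValueError(f"unknown epoch variant: {epochs!r}")
    if weights is None:
        # Teacher is the default for DINO and iBOT; LeJEPA only has a student
        weights = "student" if objective == "lejepa" else "teacher"
    if weights not in {"student", "teacher"}:
        raise ValueError(f"unknown weights: {weights!r}")
    if objective == "lejepa" and weights == "teacher":
        raise ValueError("LeJEPA checkpoints only provide student weights")
    repo_id = f"OK-AI/{objective}-vit{model_size}16-pretrain-in1k"
    revision = f"ep{epochs}/{weights}"
    return repo_id, revision


print(checkpoint_ref("ibot", "b", epochs=100, weights="student"))
```

The returned pair can be passed on as `AutoModel.from_pretrained(repo_id, revision=revision)`.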
Usage
import requests
import torch
from PIL import Image
from transformers import AutoModel, AutoImageProcessor
model_id = "OK-AI/dino-vits16-pretrain-in1k"
processor = AutoImageProcessor.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
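# Cast the pixel values to the model dtype (bfloat16 here) to avoid a dtype mismatch
inputs = processor(images=image, return_tensors="pt").to(model.dtype)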
with torch.no_grad():
    outputs = model(**inputs)
# `outputs` is a dict with the following entries:
# {
#     "latent": cls_tokens[:, 0],
#     "patch_latent": patch_tokens,
#     "raw_latent": x[:, 0],
#     "last_self_attention": attn,
#     "logits": self.head(cls_tokens[:, 0]),  # only exists for compatibility; the head is always identity in this case
# }
# CLS token: use for classification, retrieval, k-NN
cls = outputs["latent"]  # (1, 384)
# Patch tokens: use for dense tasks (depth, segmentation)
patches = outputs["patch_latent"]  # (1, 196, 384)
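
The 196 patch tokens in the shape above correspond to a 14×14 grid (224 / 16 = 14 patches per side). For dense tasks it is often convenient to map a token index back to its grid position. The sketch below assumes row-major token order, which is the usual ViT convention but is not confirmed for this model; the helper names are illustrative.

```python
def patch_grid(image_size=224, patch_size=16):
    """Number of patches per side for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image size must be a multiple of the patch size")
    side = image_size // patch_size
    return side, side


def token_to_rowcol(index, grid):
    """Map a flat patch-token index to (row, col), assuming row-major order."""
    _, cols = grid
    return divmod(index, cols)


def to_grid(tokens, grid):
    """Reshape a flat list of patch tokens into rows of the patch grid."""
    rows, cols = grid
    if len(tokens) != rows * cols:
        raise ValueError("token count does not match the grid size")
    return [tokens[r * cols:(r + 1) * cols] for r in range(rows)]


grid = patch_grid()                  # global 224x224 view
print(grid, grid[0] * grid[1])
print(token_to_rowcol(15, grid))
print(patch_grid(image_size=96))     # local 96x96 view
```

On the tensor itself, the equivalent of `to_grid` under the same ordering assumption would be `patches.reshape(1, 14, 14, -1)`.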
Citation
@article{Sablayrolles2018Jun,
author = {Sablayrolles, Alexandre and Douze, Matthijs and Schmid, Cordelia and J{\ifmmode\acute{e}\else\'{e}\fi}gou, Herv{\ifmmode\acute{e}\else\'{e}\fi}},
title = {{Spreading vectors for similarity search}},
year = {2018},
month = jun,
doi = {10.48550/arXiv.1806.03198}
}
@article{Caron2021Apr,
author = {Caron, Mathilde and Touvron, Hugo and Misra, Ishan and J{\ifmmode\acute{e}\else\'{e}\fi}gou, Herv{\ifmmode\acute{e}\else\'{e}\fi} and Mairal, Julien and Bojanowski, Piotr and Joulin, Armand},
title = {{Emerging Properties in Self-Supervised Vision Transformers}},
year = {2021},
month = apr,
doi = {10.48550/arXiv.2104.14294}
}
@article{Zhou2021Nov,
author = {Zhou, Jinghao and Wei, Chen and Wang, Huiyu and Shen, Wei and Xie, Cihang and Yuille, Alan and Kong, Tao},
title = {{iBOT: Image BERT Pre-Training with Online Tokenizer}},
year = {2021},
month = nov,
doi = {10.48550/arXiv.2111.07832}
}
@article{Dong2022Dec,
author = {Dong, Xiaoyi and Bao, Jianmin and Zhang, Ting and Chen, Dongdong and Gu, Shuyang and Zhang, Weiming and Yuan, Lu and Chen, Dong and Wen, Fang and Yu, Nenghai},
title = {{CLIP Itself is a Strong Fine-tuner: Achieving 85.7{\%} and 88.0{\%} Top-1 Accuracy with ViT-B and ViT-L on ImageNet}},
year = {2022},
month = dec,
doi = {10.48550/arXiv.2212.06138}
}
@article{Oquab2023Apr,
author = {Oquab, Maxime and Darcet, Timoth{\ifmmode\acute{e}\else\'{e}\fi}e and Moutakanni, Th{\ifmmode\acute{e}\else\'{e}\fi}o and Vo, Huy and Szafraniec, Marc and Khalidov, Vasil and Fernandez, Pierre and Haziza, Daniel and Massa, Francisco and El-Nouby, Alaaeldin and Assran, Mahmoud and Ballas, Nicolas and Galuba, Wojciech and Howes, Russell and Huang, Po-Yao and Li, Shang-Wen and Misra, Ishan and Rabbat, Michael and Sharma, Vasu and Synnaeve, Gabriel and Xu, Hu and Jegou, Herv{\ifmmode\acute{e}\else\'{e}\fi} and Mairal, Julien and Labatut, Patrick and Joulin, Armand and Bojanowski, Piotr},
title = {{DINOv2: Learning Robust Visual Features without Supervision}},
year = {2023},
month = apr,
doi = {10.48550/arXiv.2304.07193}
}
@article{Balestriero2025Nov,
author = {Balestriero, Randall and LeCun, Yann},
title = {{LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics}},
year = {2025},
month = nov,
doi = {10.48550/arXiv.2511.08544}
}