ImageNet-1K Self-Supervised ViT Baselines

A collection of open and reproducible self-supervised Vision Transformer (ViT) checkpoints trained exclusively on ImageNet-1K (1.45M images).

Part of the Open-Knowledge-AI ImageNet-1K SSL collection. All models in this collection are pre-trained exclusively on ImageNet-1K using a unified codebase and training strategy, enabling direct comparison across methods and durations.

Codebase: github.com/Open-Knowledge-AI/lite_ssl

Method

This model is trained with iBOT: a masked image modelling objective where a momentum teacher acts as an online tokenizer, combined with a DINO-style global CLS distillation objective and a KoLeo regulariser on the CLS embeddings to encourage uniform representation spread.
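
To make the KoLeo term concrete, here is a minimal pure-Python sketch of the Kozachenko-Leonenko regulariser (Sablayrolles et al., cited below): it penalises small nearest-neighbour distances between L2-normalised embeddings within a batch. This is an illustration under assumptions, not the lite_ssl implementation: the function names, the `eps` constant and the use of plain lists instead of tensors are all chosen here for readability.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def koleo_loss(embeddings, eps=1e-8):
    """Return -mean(log(distance to nearest neighbour)) over the batch.

    `eps` is an assumed numerical-stability constant.
    """
    points = [l2_normalize(v) for v in embeddings]
    total = 0.0
    for i, p in enumerate(points):
        # Distance from point i to its closest other point in the batch.
        nearest = min(math.dist(p, q) for j, q in enumerate(points) if j != i)
        total += math.log(nearest + eps)
    return -total / len(points)


# Toy batches of 2-D "CLS embeddings" (hypothetical values).
clustered = [[1.0, 0.0], [0.99, 0.1], [0.98, 0.2]]
spread = [[1.0, 0.0], [-0.5, 0.866], [-0.5, -0.866]]
```

On these toy batches the loss is lower for `spread` than for `clustered`, which is the behaviour the regulariser rewards: embeddings that are pushed apart on the unit sphere.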

The backbone follows the ViT-v2 design from DINOv2 — without register tokens.

The main branch of this repository exposes the model trained for 300 epochs. Please refer to the other revisions for the other checkpoints.

Training

Setting	Value
Dataset	ImageNet-1K (1.45M images)
Architecture	ViT-B/16 (ViT-v2, no registers)
Epochs	100, 300
Views	8 total — 2 global (224×224) + 6 local (96×96)
Precision	BF16 mixed (`bf16-mixed`)
GPUs	8× (4 devices × 2 nodes, DDP via PyTorch Lightning)
Global batch size	1024 (128 per GPU)
Optimizer	AdamW, layerwise LR decay 0.9, patch embed LR mult 0.2 (from DINOv2)
Learning rate	Base 5e-4 → effective 2e-3 (scaled by batch / 256); linear warmup 10 epochs → cosine decay to 1e-6
Weight decay	Cosine anneal 0.04 → 0.4 (for DINO and iBOT); constant 5e-2 for LeJEPA
Gradient clipping	Norm, max 3.0

The exact config used for this run is available at configs/{dino,ibot,lejepa}/vit_{s,b}16_ep{100,300}.json in the code repository.
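
The learning-rate and weight-decay rows of the table can be read as the schedules sketched below. This is a hedged illustration, not code from the repository: the function names are hypothetical, the warmup is assumed to start from zero, and the schedule is evaluated per epoch here, whereas training code typically steps it per iteration.

```python
import math


def scaled_lr(base_lr, global_batch_size, reference_batch=256):
    """Linear scaling rule: base LR multiplied by batch / 256."""
    return base_lr * global_batch_size / reference_batch


def lr_at(epoch, total_epochs, peak_lr, warmup_epochs=10, final_lr=1e-6):
    """Linear warmup (assumed to start at 0), then cosine decay to final_lr."""
    if epoch < warmup_epochs:
        return peak_lr * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1 + math.cos(math.pi * progress))


def weight_decay_at(epoch, total_epochs, start=0.04, end=0.4):
    """Cosine anneal of the weight decay from start to end (DINO and iBOT rows)."""
    progress = epoch / total_epochs
    return end + 0.5 * (start - end) * (1 + math.cos(math.pi * progress))


peak = scaled_lr(5e-4, 1024)  # 5e-4 * 1024 / 256 = 2e-3, the effective LR in the table
```

With these definitions, `lr_at(10, 300, peak)` returns the peak value and `lr_at(300, 300, peak)` returns the 1e-6 floor, while `weight_decay_at` moves from 0.04 at epoch 0 to 0.4 at the final epoch.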

Evaluation

Metrics are only available for the teacher weights.

Model	IN-1K online probe (acc@1)	IN-1K linear probe (acc@1)	IN-1K k-NN (acc@1)	NYU Depth (δ1)	Pascal VOC (mAP)
DINO ViT-S/16 ep100	69.32	-	-	-	-
DINO ViT-S/16 ep300	73.88	-	-	-	-
DINO ViT-B/16 ep100	73.49	-	-	-	-
DINO ViT-B/16 ep300	Soon	-	-	-	-
iBOT ViT-S/16 ep100	69.70	-	-	-	-
iBOT ViT-S/16 ep300	74.32	-	-	-	-
iBOT ViT-B/16 ep100	76.50	-	-	-	-
iBOT ViT-B/16 ep300	78.74	-	-	-	-
LeJEPA ViT-S/16 ep100	61.85	-	-	-	-
LeJEPA ViT-S/16 ep300	65.99	-	-	-	-
LeJEPA ViT-B/16 ep100	69.28	-	-	-	-
LeJEPA ViT-B/16 ep300	72.04	-	-	-	-

Online probe results are logged during pre-training. Linear probe, k-NN, and downstream evaluations are coming soon.

Checkpoints

from transformers import AutoModel

# default checkpoint (teacher checkpoint after 300 epochs)
objective = "dino"  # or "ibot", "lejepa"
model_size = "s"  # or "b"
pretrain_dataset = "in1k"  # currently only in1k is planned unless compute can be expanded.

hf_model_string = f"OK-AI/{objective}-vit{model_size}16-pretrain-{pretrain_dataset}"
model = AutoModel.from_pretrained(hf_model_string)

# alternate training checkpoints
epoch_variant = 100  # or 300
state_dict_of = "student"  # "teacher" is the default for DINO and iBOT; LeJEPA only has student weights
model = AutoModel.from_pretrained(
    hf_model_string,
    revision=f"ep{epoch_variant}/{state_dict_of}",
)
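
The naming pattern above can also be enumerated programmatically. The sketch below only builds the repository and revision strings implied by that pattern; the helper names are hypothetical, and it does not check that every combination has actually been published (some runs are still listed as pending in the evaluation table).

```python
from itertools import product

OBJECTIVES = ("dino", "ibot", "lejepa")
SIZES = ("s", "b")
EPOCHS = (100, 300)


def repo_id(objective, model_size, pretrain_dataset="in1k"):
    """Build the Hub repository id using the pattern shown above."""
    return f"OK-AI/{objective}-vit{model_size}16-pretrain-{pretrain_dataset}"


def revision(epoch, state_dict_of):
    """Build the revision string, e.g. 'ep100/student'."""
    return f"ep{epoch}/{state_dict_of}"


def available_states(objective):
    """LeJEPA only has student weights; DINO and iBOT have both."""
    return ("student",) if objective == "lejepa" else ("teacher", "student")


def all_checkpoints():
    """Yield (repo_id, revision) pairs implied by the naming pattern."""
    for objective, size, epoch in product(OBJECTIVES, SIZES, EPOCHS):
        for state in available_states(objective):
            yield repo_id(objective, size), revision(epoch, state)
```

Each yielded pair can be passed to `AutoModel.from_pretrained(repo, revision=rev)` exactly as in the snippet above.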

Usage

import requests

import torch

from PIL import Image
from transformers import AutoModel, AutoImageProcessor

model_id = "OK-AI/dino-vits16-pretrain-in1k"

processor = AutoImageProcessor.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

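# Cast the float32 pixel values to the model dtype (bfloat16) to avoid a dtype mismatch.
inputs = processor(images=image, return_tensors="pt").to(model.dtype)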
with torch.no_grad():
    outputs = model(**inputs)

# The forward pass returns a dict of the form:
# {
#     "latent": cls_tokens[:, 0],
#     "patch_latent": patch_tokens,
#     "raw_latent": x[:, 0],
#     "last_self_attention": attn,
#     "logits": self.head(cls_tokens[:, 0]),  # only exists for compatibility; the head is always identity in this case.
# }

# CLS token — use for classification, retrieval, k-NN
cls = outputs["latent"]        # (1, 384)

# Patch tokens — use for dense tasks (depth, segmentation)
patches = outputs["patch_latent"]   # (1, 196, 384)
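
For dense tasks the 196 patch tokens usually need to be laid back out on the 14×14 grid they came from (224 / 16 = 14 patches per side). The sketch below shows that bookkeeping in plain Python. The helper names are hypothetical, and row-major token order is an assumption that matches the usual ViT patch-embedding flatten but is not confirmed for this codebase.

```python
def patch_grid(image_size=224, patch_size=16):
    """Number of patches per side for a square input."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    return image_size // patch_size


def tokens_to_grid(tokens, side):
    """Reshape a flat token sequence into `side` rows of `side` tokens.

    Assumes row-major order (left to right, then top to bottom).
    """
    if len(tokens) != side * side:
        raise ValueError("token count does not match the grid size")
    return [tokens[row * side:(row + 1) * side] for row in range(side)]


side = patch_grid()                               # 14 for a 224x224 global view
grid = tokens_to_grid(list(range(side * side)), side)

# With the tensor from the snippet above, the equivalent would be
# patches.reshape(1, side, side, -1), assuming `patches` is a torch tensor.
```

The same arithmetic gives 6 patches per side, or 36 patch tokens, for the 96×96 local views used during pre-training.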

Citation

@article{Sablayrolles2018Jun,
    author = {Sablayrolles, Alexandre and Douze, Matthijs and Schmid, Cordelia and J{\ifmmode\acute{e}\else\'{e}\fi}gou, Herv{\ifmmode\acute{e}\else\'{e}\fi}},
    title = {{Spreading vectors for similarity search}},
    year = {2018},
    month = jun,
    doi = {10.48550/arXiv.1806.03198}
}

@article{Caron2021Apr,
    author = {Caron, Mathilde and Touvron, Hugo and Misra, Ishan and J{\ifmmode\acute{e}\else\'{e}\fi}gou, Herv{\ifmmode\acute{e}\else\'{e}\fi} and Mairal, Julien and Bojanowski, Piotr and Joulin, Armand},
    title = {{Emerging Properties in Self-Supervised Vision Transformers}},
    year = {2021},
    month = apr,
    doi = {10.48550/arXiv.2104.14294}
}

@article{Zhou2021Nov,
    author = {Zhou, Jinghao and Wei, Chen and Wang, Huiyu and Shen, Wei and Xie, Cihang and Yuille, Alan and Kong, Tao},
    title = {{iBOT: Image BERT Pre-Training with Online Tokenizer}},
    year = {2021},
    month = nov,
    doi = {10.48550/arXiv.2111.07832}
}

@article{Dong2022Dec,
    author = {Dong, Xiaoyi and Bao, Jianmin and Zhang, Ting and Chen, Dongdong and Gu, Shuyang and Zhang, Weiming and Yuan, Lu and Chen, Dong and Wen, Fang and Yu, Nenghai},
    title = {{CLIP Itself is a Strong Fine-tuner: Achieving 85.7{\%} and 88.0{\%} Top-1 Accuracy with ViT-B and ViT-L on ImageNet}},
    year = {2022},
    month = dec,
    doi = {10.48550/arXiv.2212.06138}
}

@article{Oquab2023Apr,
    author = {Oquab, Maxime and Darcet, Timoth{\ifmmode\acute{e}\else\'{e}\fi}e and Moutakanni, Th{\ifmmode\acute{e}\else\'{e}\fi}o and Vo, Huy and Szafraniec, Marc and Khalidov, Vasil and Fernandez, Pierre and Haziza, Daniel and Massa, Francisco and El-Nouby, Alaaeldin and Assran, Mahmoud and Ballas, Nicolas and Galuba, Wojciech and Howes, Russell and Huang, Po-Yao and Li, Shang-Wen and Misra, Ishan and Rabbat, Michael and Sharma, Vasu and Synnaeve, Gabriel and Xu, Hu and Jegou, Herv{\ifmmode\acute{e}\else\'{e}\fi} and Mairal, Julien and Labatut, Patrick and Joulin, Armand and Bojanowski, Piotr},
    title = {{DINOv2: Learning Robust Visual Features without Supervision}},
    year = {2023},
    month = apr,
    doi = {10.48550/arXiv.2304.07193}
}

@article{Balestriero2025Nov,
    author = {Balestriero, Randall and LeCun, Yann},
    title = {{LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics}},
    year = {2025},
    month = nov,
    doi = {10.48550/arXiv.2511.08544}
}
