ImageNet-1K Self-Supervised ViT Baselines

A collection of open and reproducible self-supervised Vision Transformer (ViT) checkpoints trained exclusively on ImageNet-1K (1.45M images).

Part of the Open-Knowledge-AI ImageNet-1K SSL collection. All models in this collection are pre-trained exclusively on ImageNet-1K using a unified codebase and training strategy, enabling direct comparison across methods and durations.

Codebase: github.com/Open-Knowledge-AI/lite_ssl

Method

This model is trained with LeJEPA, a heuristic-free, negative-free, and teacher-free SSL method. It combines SIGReg (Sketched Isotropic Gaussian Regularization), which pushes the latent embeddings toward an isotropic Gaussian distribution, with an invariance loss that preserves discriminative structure.
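
As a rough illustration of the two terms, the sketch below uses NumPy to compute a simplified SIGReg-style statistic (random 1-D projections whose empirical characteristic function is compared against that of a standard Gaussian) and a simple invariance term (squared distance of each view's embedding to the per-sample mean across views). The function names, the number of projections, the integration grid, and the Gaussian weighting are assumptions chosen for illustration; this is not the implementation in lite_ssl or the reference LeJEPA code.

```python
import numpy as np


def sigreg_statistic(z, num_directions=64, t_max=3.0, num_points=17, seed=0):
    """Simplified SIGReg-style statistic for embeddings z of shape (n, d)."""
    rng = np.random.default_rng(seed)
    # Random unit directions used to "sketch" the d-dimensional distribution.
    directions = rng.standard_normal((z.shape[1], num_directions))
    directions /= np.linalg.norm(directions, axis=0, keepdims=True)
    projections = z @ directions  # (n, num_directions)

    # Empirical characteristic function of each 1-D projection on a grid of t.
    t = np.linspace(-t_max, t_max, num_points)
    angles = projections[:, :, None] * t  # (n, num_directions, num_points)
    ecf_real = np.cos(angles).mean(axis=0)
    ecf_imag = np.sin(angles).mean(axis=0)

    # The characteristic function of N(0, 1) is exp(-t^2 / 2) (purely real).
    target = np.exp(-0.5 * t**2)
    squared_error = (ecf_real - target) ** 2 + ecf_imag**2

    # Gaussian-weighted Riemann sum over t, averaged over directions.
    weights = np.exp(-0.5 * t**2) * (t[1] - t[0])
    return float((squared_error @ weights).mean())


def invariance_loss(views):
    """Mean squared distance of each view to the per-sample mean; views: (v, n, d)."""
    center = views.mean(axis=0, keepdims=True)
    return float(((views - center) ** 2).mean())
```

On Gaussian embeddings the statistic stays close to zero, while collapsed embeddings (all identical) score much higher, which is the behaviour such a regulariser relies on.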

The backbone follows the ViT-v2 design from DINOv2 — without register tokens.

The main branch of this repository exposes the model trained for 300 epochs. Please refer to the other revisions for the other checkpoints.

Training

Setting	Value
Dataset	ImageNet-1K (1.45M images)
Architecture	ViT-B/16 (ViT-v2, no registers)
Epochs	100, 300
Views	8 total — 2 global (224×224) + 6 local (96×96)
Precision	BF16 mixed (`bf16-mixed`)
GPUs	8× (4 devices × 2 nodes, DDP via PyTorch Lightning)
Global batch size	1024 (128 per GPU)
Optimizer	AdamW, layerwise LR decay 0.9, patch embed LR mult 0.2 (from DINOv2)
Learning rate	Base 5e-4 → effective 2e-3 (scaled by batch / 256); linear warmup 10 epochs → cosine decay to 1e-6
Weight decay	Cosine anneal 0.04 → 0.4 for DINO and iBOT; constant 5e-2 for LeJEPA
Gradient clipping	Norm, max 3.0
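
The learning-rate and weight-decay rows above can be sketched in a few lines of plain Python. This is a minimal illustration, not the scheduler from the code repository: the function names are hypothetical, the warmup is assumed to start from zero, and the schedules are evaluated per epoch rather than per step.

```python
import math


def scaled_lr(base_lr, global_batch_size, reference_batch=256):
    """Linear scaling rule: base LR multiplied by batch / 256."""
    return base_lr * global_batch_size / reference_batch


def lr_at_epoch(epoch, total_epochs, peak_lr, warmup_epochs=10, final_lr=1e-6):
    """Linear warmup (assumed to start at 0) followed by cosine decay to final_lr."""
    if epoch < warmup_epochs:
        return peak_lr * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1.0 + math.cos(math.pi * progress))


def weight_decay_at_epoch(epoch, total_epochs, start=0.04, end=0.4):
    """Cosine anneal of the weight decay from start to end (DINO / iBOT setting)."""
    progress = epoch / total_epochs
    return end + 0.5 * (start - end) * (1.0 + math.cos(math.pi * progress))


peak = scaled_lr(5e-4, 1024)  # effective LR for a global batch of 1024
schedule = [lr_at_epoch(e, 300, peak) for e in range(301)]
```

With the values from the table, `scaled_lr(5e-4, 1024)` reproduces the 2e-3 effective learning rate, the schedule reaches that peak at epoch 10, and it ends at 1e-6.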

The exact config used for this run is available at configs/{dino,ibot,lejepa}/vit_{s,b}16_ep{100,300}.json in the code repository.
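
The layerwise LR decay and patch-embed multiplier from the optimizer row can be illustrated as a per-parameter LR multiplier. The sketch below is modelled on the convention used in DINOv2 (embeddings at layer 0, block `i` at layer `i + 1`, everything else undecayed); the parameter names (`blocks.<i>.`, `patch_embed`, `pos_embed`, `cls_token`) and the 12-block depth are assumptions and have not been checked against the configs in the repository.

```python
def layerwise_lr_multiplier(param_name, num_blocks=12, decay=0.9, patch_embed_mult=0.2):
    """Return the LR multiplier for one parameter (hypothetical naming scheme)."""
    # Default: parameters after the last block (e.g. final norm) are not decayed.
    layer_id = num_blocks + 1
    if param_name.startswith(("cls_token", "pos_embed", "patch_embed")):
        layer_id = 0
    elif param_name.startswith("blocks."):
        layer_id = int(param_name.split(".")[1]) + 1

    multiplier = decay ** (num_blocks + 1 - layer_id)
    # The patch embedding gets an additional, smaller multiplier.
    if param_name.startswith("patch_embed"):
        multiplier *= patch_embed_mult
    return multiplier


for name in ["patch_embed.proj.weight", "blocks.0.attn.qkv.weight",
             "blocks.11.mlp.fc2.weight", "norm.weight"]:
    print(f"{name}: {layerwise_lr_multiplier(name):.4f}")
```

Under these assumptions the last block trains at 0.9× the peak learning rate, the first block at 0.9^12×, and the patch embedding at 0.2 × 0.9^13×.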

Evaluation

Metrics are reported for the teacher weights where a teacher exists (DINO and iBOT); LeJEPA has only student weights.

Model	IN-1K online probe (acc@1)	IN-1K linear probe (acc@1)	IN-1K k-NN (acc@1)	NYU Depth (δ1)	Pascal VOC (mAP)
DINO ViT-S/16 ep100	69.32	-	-	-	-
DINO ViT-S/16 ep300	73.88	-	-	-	-
DINO ViT-B/16 ep100	73.49	-	-	-	-
DINO ViT-B/16 ep300	Soon	-	-	-	-
iBOT ViT-S/16 ep100	69.70	-	-	-	-
iBOT ViT-S/16 ep300	74.32	-	-	-	-
iBOT ViT-B/16 ep100	76.50	-	-	-	-
iBOT ViT-B/16 ep300	78.74	-	-	-	-
LeJEPA ViT-S/16 ep100	61.85	-	-	-	-
LeJEPA ViT-S/16 ep300	65.99	-	-	-	-
LeJEPA ViT-B/16 ep100	69.28	-	-	-	-
LeJEPA ViT-B/16 ep300	72.04	-	-	-	-

Online probe results are logged during pre-training. Linear probe, k-NN, and downstream evaluations are coming soon.

Checkpoints

from transformers import AutoModel

# default checkpoint (teacher checkpoint after 300 epochs)
objective = "dino"  # lejepa, ibot
model_size = "s"  # b
pretrain_dataset = "in1k"  # currently only in1k is planned unless compute can be expanded.

hf_model_string = f"OK-AI/{objective}-vit{model_size}16-pretrain-{pretrain_dataset}"
model = AutoModel.from_pretrained(hf_model_string)

# alternate training checkpoints
epoch_variant = 100  # 300
state_dict_of = "student"  # `teacher` is default for dino and ibot, lejepa only has student.
model = AutoModel.from_pretrained(
    hf_model_string,
    revision=f"ep{epoch_variant}/{state_dict_of}",
)

Usage

import requests

import torch

from PIL import Image
from transformers import AutoModel, AutoImageProcessor

model_id = "OK-AI/dino-vits16-pretrain-in1k"

processor = AutoImageProcessor.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
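image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# Cast the pixel values to the model dtype (bfloat16 here) to avoid a dtype mismatch.
inputs = processor(images=image, return_tensors="pt").to(model.dtype)
with torch.no_grad():
    outputs = model(**inputs)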

# Output structure:
# {
#     "latent": cls_tokens[:, 0],
#     "patch_latent": patch_tokens,
#     "raw_latent": x[:, 0],
#     "last_self_attention": attn,
#     "logits": self.head(cls_tokens[:, 0]),  # kept only for compatibility; the head is always an identity here.
# }

# CLS token — use for classification, retrieval, k-NN
cls = outputs["latent"]        # (1, 384)

# Patch tokens — use for dense tasks (depth, segmentation)
patches = outputs["patch_latent"]   # (1, 196, 384)
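
As a small follow-up to the snippet above, the sketch below shows two common ways to consume these outputs: reshaping the 196 patch tokens into a 14×14 spatial grid for dense tasks, and a cosine-similarity k-NN classifier over CLS embeddings. It works on NumPy arrays with stand-in data; the helper names are hypothetical, the row-major patch ordering is an assumption, and the plain majority vote is a simplification of the weighted k-NN protocols used in the SSL literature.

```python
import numpy as np


def patch_tokens_to_grid(patch_tokens, image_size=224, patch_size=16):
    """Reshape (batch, tokens, dim) patch tokens to (batch, side, side, dim)."""
    side = image_size // patch_size
    batch, num_tokens, dim = patch_tokens.shape
    if num_tokens != side * side:
        raise ValueError(f"expected {side * side} patch tokens, got {num_tokens}")
    # Assumes patches are ordered row by row (row-major).
    return patch_tokens.reshape(batch, side, side, dim)


def knn_predict(query, bank, bank_labels, k=5):
    """Majority-vote k-NN on cosine similarity between embeddings."""
    query = query / np.linalg.norm(query, axis=1, keepdims=True)
    bank = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    similarity = query @ bank.T  # (num_queries, bank_size)
    nearest = np.argsort(-similarity, axis=1)[:, :k]
    votes = bank_labels[nearest]  # (num_queries, k)
    return np.array([np.bincount(row).argmax() for row in votes])


# Stand-in data with the shapes produced by the ViT-S/16 example.
grid = patch_tokens_to_grid(np.zeros((1, 196, 384)))
print(grid.shape)
```

To feed real model outputs into these helpers, convert them first, for example with `outputs["latent"].float().cpu().numpy()`, since bfloat16 tensors cannot be converted to NumPy arrays directly.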

Citation

@article{Sablayrolles2018Jun,
    author = {Sablayrolles, Alexandre and Douze, Matthijs and Schmid, Cordelia and J{\'e}gou, Herv{\'e}},
    title = {{Spreading vectors for similarity search}},
    year = {2018},
    month = jun,
    doi = {10.48550/arXiv.1806.03198}
}

@article{Caron2021Apr,
    author = {Caron, Mathilde and Touvron, Hugo and Misra, Ishan and J{\'e}gou, Herv{\'e} and Mairal, Julien and Bojanowski, Piotr and Joulin, Armand},
    title = {{Emerging Properties in Self-Supervised Vision Transformers}},
    year = {2021},
    month = apr,
    doi = {10.48550/arXiv.2104.14294}
}

@article{Zhou2021Nov,
    author = {Zhou, Jinghao and Wei, Chen and Wang, Huiyu and Shen, Wei and Xie, Cihang and Yuille, Alan and Kong, Tao},
    title = {{iBOT: Image BERT Pre-Training with Online Tokenizer}},
    year = {2021},
    month = nov,
    doi = {10.48550/arXiv.2111.07832}
}

@article{Dong2022Dec,
    author = {Dong, Xiaoyi and Bao, Jianmin and Zhang, Ting and Chen, Dongdong and Gu, Shuyang and Zhang, Weiming and Yuan, Lu and Chen, Dong and Wen, Fang and Yu, Nenghai},
    title = {{CLIP Itself is a Strong Fine-tuner: Achieving 85.7{\%} and 88.0{\%} Top-1 Accuracy with ViT-B and ViT-L on ImageNet}},
    year = {2022},
    month = dec,
    doi = {10.48550/arXiv.2212.06138}
}

@article{Oquab2023Apr,
    author = {Oquab, Maxime and Darcet, Timoth{\'e}e and Moutakanni, Th{\'e}o and Vo, Huy and Szafraniec, Marc and Khalidov, Vasil and Fernandez, Pierre and Haziza, Daniel and Massa, Francisco and El-Nouby, Alaaeldin and Assran, Mahmoud and Ballas, Nicolas and Galuba, Wojciech and Howes, Russell and Huang, Po-Yao and Li, Shang-Wen and Misra, Ishan and Rabbat, Michael and Sharma, Vasu and Synnaeve, Gabriel and Xu, Hu and J{\'e}gou, Herv{\'e} and Mairal, Julien and Labatut, Patrick and Joulin, Armand and Bojanowski, Piotr},
    title = {{DINOv2: Learning Robust Visual Features without Supervision}},
    year = {2023},
    month = apr,
    doi = {10.48550/arXiv.2304.07193}
}

@article{Balestriero2025Nov,
    author = {Balestriero, Randall and LeCun, Yann},
    title = {{LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics}},
    year = {2025},
    month = nov,
    doi = {10.48550/arXiv.2511.08544}
}

Model size: 91.7M params (Safetensors, tensor type F32)
