ImageNet-1K Self-Supervised ViT Baselines

A collection of open and reproducible self-supervised Vision Transformer (ViT) checkpoints trained exclusively on ImageNet-1K (1.45M images).

Part of the Open-Knowledge-AI ImageNet-1K SSL collection. All models in this collection are pre-trained exclusively on ImageNet-1K using a unified codebase and training strategy, enabling direct comparison across methods and durations.

Codebase: github.com/Open-Knowledge-AI/lite_ssl


Method

This model is trained with LeJEPA, a heuristic-free, negative-free, and teacher-free SSL method. It combines SIGReg (Sketched Isotropic Gaussian Regularisation), which pushes the latent embeddings toward an isotropic Gaussian distribution, with an invariance loss that preserves discriminative structure.
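
As a rough illustration of the two terms described above, the following dependency-free sketch projects embeddings onto random unit directions and compares the empirical characteristic function of each 1-D projection with that of a standard Gaussian, then measures how much view embeddings of one image disagree. This is not the lite_ssl implementation: the function names, the number of directions, the frequency grid, and the Gaussian window are all assumptions made for illustration, and a real implementation would be vectorised in PyTorch.

```python
import math
import random


def sigreg_sketch(embeddings, num_directions=16, knots=17, t_max=3.0, seed=0):
    """Illustrative SIGReg-style penalty (hypothetical helper, not the lite_ssl code).

    embeddings: list of N vectors (lists of floats) of equal dimension.
    Returns a non-negative number that shrinks as the random 1-D projections
    of the embeddings look more like samples from N(0, 1).
    """
    rng = random.Random(seed)
    n, dim = len(embeddings), len(embeddings[0])
    dt = t_max / (knots - 1)
    total = 0.0
    for _ in range(num_directions):
        # Draw a random direction and normalise it to unit length.
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(d * d for d in direction))
        proj = [sum(e * d for e, d in zip(row, direction)) / norm for row in embeddings]
        for k in range(knots):
            t = k * dt
            target = math.exp(-0.5 * t * t)  # characteristic function of N(0, 1)
            real = sum(math.cos(t * p) for p in proj) / n
            imag = sum(math.sin(t * p) for p in proj) / n
            # Trapezoid quadrature weight multiplied by a Gaussian window (assumed).
            weight = (0.5 if k in (0, knots - 1) else 1.0) * dt * target
            total += weight * ((real - target) ** 2 + imag ** 2)
    return total / num_directions


def invariance_sketch(views):
    """Mean squared distance of each view embedding to the mean over views.

    views: list of V embeddings (lists of floats) of the same image.
    """
    num_views, dim = len(views), len(views[0])
    center = [sum(v[i] for v in views) / num_views for i in range(dim)]
    squared = sum((v[i] - center[i]) ** 2 for v in views for i in range(dim))
    return squared / (num_views * dim)
```

Under these assumptions, embeddings drawn from a standard Gaussian receive a much smaller penalty than fully collapsed (all-zero) embeddings, and identical view embeddings give an invariance term of zero. How the two terms are weighted against each other is a training hyperparameter not covered by this sketch.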

The backbone follows the ViT-v2 design from DINOv2, without register tokens.

The main branch of this repository exposes the model trained for 300 epochs. Please refer to the other revisions for other models.


Training

| Setting | Value |
| --- | --- |
| Dataset | ImageNet-1K (1.45M images) |
| Architecture | ViT-B/16 (ViT-v2, no registers) |
| Epochs | 100, 300 |
| Views | 8 total: 2 global (224×224) + 6 local (96×96) |
| Precision | BF16 mixed (bf16-mixed) |
| GPUs | 8× (4 devices × 2 nodes, DDP via PyTorch Lightning) |
| Global batch size | 1024 (128 per GPU) |
| Optimizer | AdamW, layerwise LR decay 0.9, patch embed LR mult 0.2 (from DINOv2) |
| Learning rate | Base 5e-4 → effective 2e-3 (scaled by batch / 256); linear warmup 10 epochs → cosine decay to 1e-6 |
| Weight decay | Cosine anneal 0.04 → 0.4 (for DINO and iBOT); constant 5e-2 for LeJEPA |
| Gradient clipping | Norm, max 3.0 |
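
The learning-rate and weight-decay entries in the table above can be checked with a few lines of arithmetic. The sketch below is not taken from the codebase. It assumes that warmup starts from zero, that schedules are evaluated per epoch rather than per step, and that layer-wise decay follows the DINOv2-style convention (layer 0 for the embeddings, 1 to 12 for the transformer blocks, 13 for everything after the last block). All function names are hypothetical.

```python
import math

# Values copied from the Training table.
BASE_LR = 5e-4
PER_GPU_BATCH = 128
NUM_GPUS = 8
WARMUP_EPOCHS = 10
FINAL_LR = 1e-6


def effective_lr(base_lr=BASE_LR, global_batch=PER_GPU_BATCH * NUM_GPUS):
    """Linear scaling rule: base LR times (global batch / 256)."""
    return base_lr * global_batch / 256


def lr_at(epoch, total_epochs, peak_lr, warmup_epochs=WARMUP_EPOCHS, final_lr=FINAL_LR):
    """Linear warmup (assumed to start at 0), then cosine decay to final_lr."""
    if epoch < warmup_epochs:
        return peak_lr * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1.0 + math.cos(math.pi * progress))


def weight_decay_at(epoch, total_epochs, start=0.04, end=0.4):
    """Cosine anneal from start to end (DINO/iBOT row; LeJEPA uses a constant 5e-2)."""
    progress = epoch / total_epochs
    return end + 0.5 * (start - end) * (1.0 + math.cos(math.pi * progress))


def layerwise_lr_mult(layer_id, num_layers=12, decay=0.9,
                      is_patch_embed=False, patch_embed_mult=0.2):
    """LR multiplier under the assumed DINOv2-style layer-wise decay convention."""
    mult = decay ** (num_layers + 1 - layer_id)
    # The patch embedding gets an extra multiplier on top of the layer-wise decay.
    return mult * patch_embed_mult if is_patch_embed else mult
```

With these definitions, `effective_lr()` reproduces the 2e-3 peak from the table (5e-4 × 1024 / 256), and the last transformer block trains at 0.9 times the peak rate while the patch embedding trains at 0.2 × 0.9^13 of it.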

The exact config used for this run is available at configs/{dino,ibot,lejepa}/vit_{s,b}16_ep{100,300}.json in the code repository.
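
The brace pattern in the path above expands to twelve candidate file names. The short sketch below enumerates them; it only expands the pattern and does not check that every file actually exists in the repository. The function name is hypothetical.

```python
from itertools import product

METHODS = ("dino", "ibot", "lejepa")
SIZES = ("s", "b")
EPOCHS = (100, 300)


def config_paths():
    """Expand configs/{dino,ibot,lejepa}/vit_{s,b}16_ep{100,300}.json."""
    return [
        f"configs/{method}/vit_{size}16_ep{epochs}.json"
        for method, size, epochs in product(METHODS, SIZES, EPOCHS)
    ]


for path in config_paths():
    print(path)
```

For the ViT-B/16 LeJEPA model trained for 300 epochs, the matching entry would be `configs/lejepa/vit_b16_ep300.json`.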


Evaluation

Metrics are only available for the teacher weights.

| Model | IN-1K online probe (acc@1) | IN-1K linear probe (acc@1) | IN-1K k-NN (acc@1) | NYU Depth (δ1) | Pascal VOC (mAP) |
| --- | --- | --- | --- | --- | --- |
| DINO ViT-S/16 ep100 | 69.32 | - | - | - | - |
| DINO ViT-S/16 ep300 | 73.88 | - | - | - | - |
| DINO ViT-B/16 ep100 | 73.49 | - | - | - | - |
| DINO ViT-B/16 ep300 | Soon | - | - | - | - |
| iBOT ViT-S/16 ep100 | 69.70 | - | - | - | - |
| iBOT ViT-S/16 ep300 | 74.32 | - | - | - | - |
| iBOT ViT-B/16 ep100 | 76.50 | - | - | - | - |
| iBOT ViT-B/16 ep300 | 78.74 | - | - | - | - |
| LeJEPA ViT-S/16 ep100 | 61.85 | - | - | - | - |
| LeJEPA ViT-S/16 ep300 | 65.99 | - | - | - | - |
| LeJEPA ViT-B/16 ep100 | 69.28 | - | - | - | - |
| LeJEPA ViT-B/16 ep300 | 72.04 | - | - | - | - |


Online probe results are logged during pre-training. Linear probe, k-NN, and downstream evaluations are coming soon.


Checkpoints

from transformers import AutoModel

# default checkpoint (teacher checkpoint after 300 epochs)
objective = "dino"  # lejepa, ibot
model_size = "s"  # b
pretrain_dataset = "in1k"  # currently only in1k is planned unless compute can be expanded.

hf_model_string = f"OK-AI/{objective}-vit{model_size}16-pretrain-{pretrain_dataset}"
model = AutoModel.from_pretrained(hf_model_string)

# alternate training checkpoints
epoch_variant = 100  # 300
state_dict_of = "student"  # `teacher` is default for dino and ibot, lejepa only has student.
model = AutoModel.from_pretrained(
    hf_model_string,
    revision=f"ep{epoch_variant}/{state_dict_of}",
)
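
The naming scheme used above can be wrapped in a small helper that also enforces the rule that LeJEPA only ships student weights. This is a sketch based solely on the strings shown in this card: `checkpoint_ref` is a hypothetical name, and mapping the default checkpoint to the `main` revision is an assumption drawn from the statement that the main branch holds the 300-epoch model.

```python
def checkpoint_ref(objective, model_size, epochs=300, weights=None, dataset="in1k"):
    """Return (repo_id, revision) for a checkpoint in this collection.

    weights: "teacher" or "student"; defaults to "teacher" for DINO and iBOT
    and to "student" for LeJEPA, which has no teacher.
    """
    if objective not in ("dino", "ibot", "lejepa"):
        raise ValueError(f"unknown objective: {objective!r}")
    if model_size not in ("s", "b"):
        raise ValueError(f"unknown model size: {model_size!r}")
    if epochs not in (100, 300):
        raise ValueError(f"unknown epoch variant: {epochs!r}")

    default_weights = "student" if objective == "lejepa" else "teacher"
    weights = weights or default_weights
    if weights not in ("teacher", "student"):
        raise ValueError(f"unknown weights: {weights!r}")
    if objective == "lejepa" and weights == "teacher":
        raise ValueError("LeJEPA is teacher-free; only student weights exist")

    repo_id = f"OK-AI/{objective}-vit{model_size}16-pretrain-{dataset}"
    if epochs == 300 and weights == default_weights:
        return repo_id, "main"  # assumed: default checkpoint lives on main
    return repo_id, f"ep{epochs}/{weights}"
```

The returned pair can be passed straight to the loader, for example `repo_id, revision = checkpoint_ref("ibot", "b", epochs=100, weights="student")` followed by `AutoModel.from_pretrained(repo_id, revision=revision)`.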

Usage

import requests

import torch

from PIL import Image
from transformers import AutoModel, AutoImageProcessor

model_id = "OK-AI/dino-vits16-pretrain-in1k"

processor = AutoImageProcessor.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Cast the float32 pixel values to bfloat16 so they match the model weights.
inputs = processor(images=image, return_tensors="pt").to(torch.bfloat16)
with torch.no_grad():
    outputs = model(**inputs)

# The model returns a dict with the following keys:
# {
#     "latent": cls_tokens[:, 0],
#     "patch_latent": patch_tokens,
#     "raw_latent": x[:, 0],
#     "last_self_attention": attn,
#     "logits": self.head(cls_tokens[:, 0]),  # kept only for compatibility; the head is always the identity here.
# }

# CLS token: use for classification, retrieval, k-NN
cls = outputs["latent"]        # (1, 384)

# Patch tokens: use for dense tasks (depth, segmentation)
patches = outputs["patch_latent"]   # (1, 196, 384)
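
The shapes in the comments above follow from the patch size: a 224×224 input split into 16×16 patches gives a 14×14 grid, hence 196 patch tokens, each of width 384 for ViT-S. The sketch below spells out that arithmetic and shows how a flat patch index maps back to a grid position, which is useful for dense tasks. It assumes the patch tokens are stored in row-major order, which this card does not state; the helper names are hypothetical, and the width of 768 for ViT-B is the standard value rather than something listed here.

```python
PATCH_SIZE = 16
EMBED_DIM = {"s": 384, "b": 768}  # ViT-S from the usage example; ViT-B is the standard width


def num_patch_tokens(image_size, patch_size=PATCH_SIZE):
    """Number of patch tokens for a square input of the given side length."""
    side = image_size // patch_size
    return side * side


def patch_index_to_grid(index, image_size=224, patch_size=PATCH_SIZE):
    """Map a flat patch index to (row, col), assuming row-major token order."""
    side = image_size // patch_size
    return divmod(index, side)


print(num_patch_tokens(224))  # global crop
print(num_patch_tokens(96))   # local crop
```

The same arithmetic applies to the multi-crop views used during training: two 224×224 global crops and six 96×96 local crops amount to 2 × 196 + 6 × 36 = 608 patch tokens per image.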

Citation

@article{Sablayrolles2018Jun,
    author = {Sablayrolles, Alexandre and Douze, Matthijs and Schmid, Cordelia and J{\'e}gou, Herv{\'e}},
    title = {{Spreading vectors for similarity search}},
    year = {2018},
    month = jun,
    doi = {10.48550/arXiv.1806.03198}
}

@article{Caron2021Apr,
    author = {Caron, Mathilde and Touvron, Hugo and Misra, Ishan and J{\'e}gou, Herv{\'e} and Mairal, Julien and Bojanowski, Piotr and Joulin, Armand},
    title = {{Emerging Properties in Self-Supervised Vision Transformers}},
    year = {2021},
    month = apr,
    doi = {10.48550/arXiv.2104.14294}
}

@article{Zhou2021Nov,
    author = {Zhou, Jinghao and Wei, Chen and Wang, Huiyu and Shen, Wei and Xie, Cihang and Yuille, Alan and Kong, Tao},
    title = {{iBOT: Image BERT Pre-Training with Online Tokenizer}},
    year = {2021},
    month = nov,
    doi = {10.48550/arXiv.2111.07832}
}

@article{Dong2022Dec,
    author = {Dong, Xiaoyi and Bao, Jianmin and Zhang, Ting and Chen, Dongdong and Gu, Shuyang and Zhang, Weiming and Yuan, Lu and Chen, Dong and Wen, Fang and Yu, Nenghai},
    title = {{CLIP Itself is a Strong Fine-tuner: Achieving 85.7{\%} and 88.0{\%} Top-1 Accuracy with ViT-B and ViT-L on ImageNet}},
    year = {2022},
    month = dec,
    doi = {10.48550/arXiv.2212.06138}
}

@article{Oquab2023Apr,
    author = {Oquab, Maxime and Darcet, Timoth{\'e}e and Moutakanni, Th{\'e}o and Vo, Huy and Szafraniec, Marc and Khalidov, Vasil and Fernandez, Pierre and Haziza, Daniel and Massa, Francisco and El-Nouby, Alaaeldin and Assran, Mahmoud and Ballas, Nicolas and Galuba, Wojciech and Howes, Russell and Huang, Po-Yao and Li, Shang-Wen and Misra, Ishan and Rabbat, Michael and Sharma, Vasu and Synnaeve, Gabriel and Xu, Hu and Jegou, Herv{\'e} and Mairal, Julien and Labatut, Patrick and Joulin, Armand and Bojanowski, Piotr},
    title = {{DINOv2: Learning Robust Visual Features without Supervision}},
    year = {2023},
    month = apr,
    doi = {10.48550/arXiv.2304.07193}
}

@article{Balestriero2025Nov,
    author = {Balestriero, Randall and LeCun, Yann},
    title = {{LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics}},
    year = {2025},
    month = nov,
    doi = {10.48550/arXiv.2511.08544}
}