LeVLJEPA: Non-Contrastive Vision-Language Pretraining
LeVLJEPA is the first fully non-contrastive end-to-end vision-language pretraining method. It learns image-text structure through cross-modal prediction with stop-gradient targets and per-modality SIGReg regularization, using no negatives, temperature, momentum encoder, or teacher-student schedule. Rather than optimizing global image-text alignment, LeVLJEPA produces a vision encoder whose dense per-token features are designed for downstream use as a frozen backbone in vision-language models and dense prediction systems.
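The cross-modal prediction with stop-gradient targets described above can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: the MSE distance, the `nn.Linear` predictors, and the names `prediction_loss`, `img_to_txt`, and `txt_to_img` are assumptions made for illustration, and the SIGReg term is omitted entirely.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED = 256  # embedding dim listed in the model summary


def prediction_loss(source, target, predictor):
    """Predict `target` from `source`; the target is detached (stop-gradient)."""
    return F.mse_loss(predictor(source), target.detach())


torch.manual_seed(0)
img_to_txt = nn.Linear(EMBED, EMBED)  # hypothetical cross-modal predictors
txt_to_img = nn.Linear(EMBED, EMBED)

z_img = torch.randn(8, EMBED, requires_grad=True)  # stand-in image embeddings
z_txt = torch.randn(8, EMBED, requires_grad=True)  # stand-in text embeddings

# Symmetric objective: each modality predicts the other one's embedding.
loss = 0.5 * (prediction_loss(z_img, z_txt, img_to_txt)
              + prediction_loss(z_txt, z_img, txt_to_img))
loss.backward()
```

Because each target is detached, gradients reach an embedding only when it acts as the predicting side. An objective of this shape can collapse to a constant embedding without an extra regularizer, which is presumably what the SIGReg term guards against.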
This checkpoint is the ViT-Base/16 encoder trained on DataComp-large for 200,000 steps (≈819M samples seen) at batch size 4,096.
Key results:
- 🏆 Strongest VLM backbone: highest accuracy on GQA, VQAv2, and POPE across two LLM families (Llama-1B, Qwen-1.5B), with the vision encoder frozen and only a bridge trained
- 🧩 Stronger dense features: outperforms CLIP and SigLIP on ADE20K and COCO-Stuff semantic segmentation with a frozen linear head
- 🎯 Background robustness: most robust to background substitution on ImageNet-9 (Mixed-Same / Mixed-Rand)
- ⚖️ Global-feature parity: on par with contrastive baselines under linear probing (73% attentive-probing top-1 on ImageNet)
- 🔧 Simple and stable: no negatives, no temperature, no momentum encoder; trains stably at DataComp-L scale
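The frozen-backbone segmentation setup in the list above can be sketched as a single 1×1 convolution over the patch-feature grid, followed by upsampling to the input resolution. The class name `LinearSegHead`, the bilinear upsampling, and the 150-class default (the ADE20K class count) are assumptions for illustration; the exact evaluation protocol behind the reported numbers is not described here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LinearSegHead(nn.Module):
    """Per-patch linear classifier, written as a 1x1 convolution."""

    def __init__(self, in_dim=768, num_classes=150):
        super().__init__()
        self.classifier = nn.Conv2d(in_dim, num_classes, kernel_size=1)

    def forward(self, feature_map, out_size):
        logits = self.classifier(feature_map)  # (B, classes, H, W)
        # Upsample patch-level logits to pixel resolution.
        return F.interpolate(logits, size=out_size, mode="bilinear", align_corners=False)


head = LinearSegHead()
# Stand-in for frozen ViT-B/16 features of a 224x224 image: a 14x14 grid of 768-d tokens.
feature_map = torch.randn(1, 768, 14, 14)
logits = head(feature_map, out_size=(224, 224))
print(logits.shape)  # torch.Size([1, 150, 224, 224])
```

Only the head's parameters would be optimized in such a setup; the encoder that produces `feature_map` stays frozen.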
Model summary
| Property | Value |
|---|---|
| Method | LeVLJEPA |
| Vision encoder | vit_base_patch16_224 (timm) |
| Text encoder | GPT-2 (12L / 12H / 768D) |
| Embedding dim | 256 |
| Pre-projection head | Linear→BN→GELU→Linear (width 2048) |
| Training objective | Cross-modal prediction + SIGReg |
| Training data | DataComp-large (≈92M pairs after link rot) |
| Training steps | 200,000 (≈819M samples seen) |
| Batch size | 4,096 |
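The sample count in the table follows directly from the step count and batch size, and dividing by the dataset size gives a rough number of passes over the data. A quick check (the variable names are illustrative, and the pass count assumes the ≈92M figure is the effective dataset size):

```python
steps = 200_000
batch_size = 4_096
dataset_pairs = 92_000_000  # approximate, from the table above

samples_seen = steps * batch_size
passes = samples_seen / dataset_pairs

print(f"{samples_seen:,} samples seen")      # 819,200,000 samples seen
print(f"{passes:.1f} passes over the data")  # 8.9 passes over the data
```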
Usage
Load both encoders with their pre-projection heads, then compute a pooled image embedding and a text embedding:
import torch
import timm
import torch.nn as nn
from PIL import Image
from torchvision import transforms
from transformers import GPT2Config, GPT2Model, AutoTokenizer
from safetensors.torch import load_file
from huggingface_hub import hf_hub_download
HIDDEN, EMBED, REPO = 768, 256, "lukaskuhndkfz/LeVLJEPA-ViT-B-DataComp-200k"
vision_encoder = timm.create_model("vit_base_patch16_224", pretrained=False, num_classes=0, dynamic_img_size=True)
vision_pre_proj = nn.Sequential(nn.Linear(HIDDEN, 2048), nn.BatchNorm1d(2048), nn.GELU(), nn.Linear(2048, EMBED))
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
text_encoder = GPT2Model(GPT2Config(n_embd=HIDDEN, n_layer=12, n_head=12, n_inner=HIDDEN * 4,
                                    vocab_size=tokenizer.vocab_size, attn_pdrop=0.0, resid_pdrop=0.0, embd_pdrop=0.0))
text_pre_proj = nn.Sequential(nn.Linear(HIDDEN, 2048), nn.BatchNorm1d(2048), nn.GELU(), nn.Linear(2048, EMBED))
vw = load_file(hf_hub_download(REPO, "vision_encoder.safetensors"))
tw = load_file(hf_hub_download(REPO, "text_encoder.safetensors"))
vision_encoder.load_state_dict({k[8:]: v for k, v in vw.items() if k.startswith("encoder.")})
vision_pre_proj.load_state_dict({k[9:]: v for k, v in vw.items() if k.startswith("pre_proj.")})
text_encoder.load_state_dict({k[8:]: v for k, v in tw.items() if k.startswith("encoder.")})
text_pre_proj.load_state_dict({k[9:]: v for k, v in tw.items() if k.startswith("pre_proj.")})
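# Put every module in eval mode. The BatchNorm1d layers in the heads raise an
# error on a batch of one if left in training mode.
vision_encoder.eval()
vision_pre_proj.eval()
text_encoder.eval()
text_pre_proj.eval()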
transform = transforms.Compose([
    transforms.Resize(224), transforms.CenterCrop(224), transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
image = transform(Image.open("image.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    image_features = vision_pre_proj(vision_encoder(image))
ids = tokenizer("a photo of a cat", add_special_tokens=False, truncation=True, max_length=76)["input_ids"] + [tokenizer.eos_token_id]
pad = 77 - len(ids)
input_ids = torch.tensor([ids + [tokenizer.pad_token_id] * pad])
attention_mask = torch.tensor([[1] * len(ids) + [0] * pad])
with torch.no_grad():
    hidden = text_encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
    # Pool the hidden state of the last non-padding token (the appended EOS).
    idx = (attention_mask.sum(1) - 1).view(-1, 1, 1).expand(-1, 1, HIDDEN)
    text_features = text_pre_proj(hidden.gather(1, idx).squeeze(1))
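The text branch above pools the hidden state of the last non-padding token (the appended EOS token) by indexing with the attention mask. The same step for a right-padded batch, as a small standalone sketch; the helper name `last_token_pool` and the toy tensors are illustrative, not part of the released code:

```python
import torch


def last_token_pool(hidden, attention_mask):
    """Select the hidden state at the last attended position of each sequence.

    Assumes right padding, i.e. all 1s in the mask precede all 0s.
    """
    last = attention_mask.sum(dim=1) - 1  # (B,) index of the last real token
    batch = torch.arange(hidden.size(0))
    return hidden[batch, last]  # (B, D)


# Toy batch: 2 sequences, 4 positions, hidden size 3.
hidden = torch.arange(24, dtype=torch.float32).view(2, 4, 3)
mask = torch.tensor([[1, 1, 1, 0],   # last real token at position 2
                     [1, 1, 0, 0]])  # last real token at position 1
pooled = last_token_pool(hidden, mask)
print(pooled.tolist())  # [[6.0, 7.0, 8.0], [15.0, 16.0, 17.0]]
```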
For dense patch features (the intended downstream use as a VLM/segmentation backbone), take the token sequence instead of the pooled output:
with torch.no_grad():
    patch_tokens = vision_encoder.forward_features(image)
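For a timm ViT with a class token, `forward_features` returns the class token followed by the patch tokens, so a 224×224 input at patch size 16 yields 1 + 14×14 = 197 tokens. A hedged sketch of dropping the prefix token and reshaping the rest into a spatial map; `tokens_to_grid` is an illustrative helper, a square grid is assumed, and a random tensor stands in for the encoder output:

```python
import math
import torch


def tokens_to_grid(tokens, num_prefix_tokens=1):
    """Reshape (B, prefix + H*W, C) ViT tokens into a (B, C, H, W) feature map."""
    patches = tokens[:, num_prefix_tokens:]  # drop class token(s)
    b, n, c = patches.shape
    side = math.isqrt(n)
    if side * side != n:
        raise ValueError(f"{n} patch tokens do not form a square grid")
    return patches.transpose(1, 2).reshape(b, c, side, side)


tokens = torch.randn(2, 197, 768)  # stand-in for forward_features output
grid = tokens_to_grid(tokens)
print(grid.shape)  # torch.Size([2, 768, 14, 14])
```

In timm, the number of prefix tokens can be read from `vision_encoder.num_prefix_tokens` instead of being hard-coded.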
Files
| File | Contents |
|---|---|
| vision_encoder.safetensors | Vision encoder (encoder.*), pre-projection head (pre_proj.*), cross-modal projector (projector.*) |
| text_encoder.safetensors | Text encoder (encoder.*), pre-projection head (pre_proj.*), cross-modal projector (projector.*) |
| config.json | Architecture and training hyperparameters |
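Each safetensors file stores several modules under key prefixes, which is why the usage snippet strips `encoder.` and `pre_proj.` by hand. A small helper makes that split explicit; `split_by_prefix` is an illustrative name, and the toy dictionary below only mimics the key layout described in the table:

```python
def split_by_prefix(state_dict, prefix):
    """Return the entries under `prefix`, with the prefix removed from each key."""
    return {k[len(prefix):]: v for k, v in state_dict.items() if k.startswith(prefix)}


# Toy stand-in for a loaded checkpoint; values would normally be tensors.
weights = {
    "encoder.cls_token": 0,
    "encoder.blocks.0.norm1.weight": 1,
    "pre_proj.0.weight": 2,
    "projector.0.weight": 3,
}

print(split_by_prefix(weights, "encoder."))
# {'cls_token': 0, 'blocks.0.norm1.weight': 1}
print(split_by_prefix(weights, "pre_proj."))
# {'0.weight': 2}
```

Using `len(prefix)` avoids the hard-coded slice offsets and keeps the helper correct for any prefix, including `projector.`.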
Citation
@article{kuhn2026levljepa,
title = {LeVLJEPA: End-to-End Vision-Language Pretraining Without Contrastive Negatives},
author = {Kuhn, Lukas and Serra, Giuseppe and Balestriero, Randall and Buettner, Florian},
journal = {arXiv preprint arXiv:XXXX.XXXXX},
year = {2026}
}
License
Released under CC BY-NC 4.0 for non-commercial use only.