LeVLJEPA: Non-Contrastive Vision-Language Pretraining


LeVLJEPA is the first fully non-contrastive end-to-end vision-language pretraining method. It learns image-text structure through cross-modal prediction with stop-gradient targets and per-modality SIGReg regularization, using no negatives, temperature, momentum encoder, or teacher-student schedule. Rather than optimizing global image-text alignment, LeVLJEPA produces a vision encoder whose dense per-token features are designed for downstream use as a frozen backbone in vision-language models and dense prediction systems.
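
The shape of such an objective can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the linear predictors, the MSE prediction loss, the `reg_weight` value, and the `gaussian_reg` stand-in (which pushes random 1D projections of each embedding toward a standard Gaussian) are all hypothetical choices made here to show where the stop-gradient and the per-modality regularizer enter.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def gaussian_reg(z, num_slices=64, seed=0):
    """Stand-in regularizer (an assumption, not the authors' SIGReg code).

    Matches the empirical characteristic function of random 1D projections
    of z to exp(-t^2 / 2), the characteristic function of N(0, 1).
    """
    gen = torch.Generator().manual_seed(seed)
    dirs = torch.randn(z.size(1), num_slices, generator=gen).to(z)
    dirs = dirs / dirs.norm(dim=0, keepdim=True)      # unit-norm directions
    t = torch.linspace(-5.0, 5.0, 17).to(z)           # evaluation points
    target = torch.exp(-0.5 * t ** 2)
    proj = (z @ dirs).unsqueeze(-1) * t               # (batch, slices, 17)
    err = (torch.cos(proj).mean(0) - target) ** 2 + torch.sin(proj).mean(0) ** 2
    return torch.trapezoid(err * target, t, dim=-1).mean()


def cross_modal_loss(z_img, z_txt, img_to_txt, txt_to_img, reg_weight=0.1):
    # Each modality predicts the other's embedding; targets are detached
    # (stop-gradient), so no negatives or momentum encoder are involved.
    predict = (F.mse_loss(img_to_txt(z_img), z_txt.detach())
               + F.mse_loss(txt_to_img(z_txt), z_img.detach()))
    # The regularizer is applied to each modality separately.
    reg = gaussian_reg(z_img) + gaussian_reg(z_txt)
    return predict + reg_weight * reg


torch.manual_seed(0)
z_img = torch.randn(32, 256, requires_grad=True)      # toy image embeddings
z_txt = torch.randn(32, 256, requires_grad=True)      # toy text embeddings
img_to_txt, txt_to_img = nn.Linear(256, 256), nn.Linear(256, 256)
loss = cross_modal_loss(z_img, z_txt, img_to_txt, txt_to_img)
loss.backward()
```

Detaching the targets means each prediction term only updates the predicting branch. The regularizer is what keeps the embeddings from collapsing to a constant in this sketch.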

This checkpoint is the ViT-Base/16 encoder trained on DataComp-large for 200,000 steps (≈819M samples seen) at batch size 4,096.

Key results:

  • 🏆 Strongest VLM backbone: highest accuracy on GQA, VQAv2, and POPE across two LLM families (Llama-1B, Qwen-1.5B), with the encoder kept frozen and only a bridge trained
  • 🧩 Stronger dense features: outperforms CLIP and SigLIP on ADE20K and COCO-Stuff semantic segmentation with a linear head on frozen features
  • 🎯 Background robustness: most robust to background substitution on ImageNet-9 (Mixed-Same / Mixed-Rand)
  • ⚖️ Global-feature parity: on par with contrastive baselines under linear probing, and 73% ImageNet top-1 under attentive probing
  • 🔧 Simple and stable: no negatives, no temperature, no momentum encoder; trains stably at DataComp-L scale

Model summary

| Property | Value |
| --- | --- |
| Method | LeVLJEPA |
| Vision encoder | vit_base_patch16_224 (timm) |
| Text encoder | GPT-2 (12L / 12H / 768D) |
| Embedding dim | 256 |
| Pre-projection head | Linear→BN→GELU→Linear (width 2048) |
| Training objective | Cross-modal prediction + SIGReg |
| Training data | DataComp-large (≈92M pairs after link rot) |
| Training steps | 200,000 (≈819M samples seen) |
| Batch size | 4,096 |
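
The "samples seen" figure follows directly from the step count and batch size. The short calculation below reproduces it and also estimates the number of passes over the data. The pass count is only approximate, because it uses the rounded ≈92M pair count.

```python
steps = 200_000
batch_size = 4_096
dataset_pairs = 92_000_000  # rounded figure (pairs remaining after link rot)

samples_seen = steps * batch_size
approx_passes = samples_seen / dataset_pairs

print(f"samples seen: {samples_seen:,}")
print(f"approximate passes over the data: {approx_passes:.1f}")
```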

Usage

Load both encoders with their pre-projection heads, then compute a pooled image embedding and a text embedding:

import torch
import timm
import torch.nn as nn
from PIL import Image
from torchvision import transforms
from transformers import GPT2Config, GPT2Model, AutoTokenizer
from safetensors.torch import load_file
from huggingface_hub import hf_hub_download

HIDDEN, EMBED, REPO = 768, 256, "lukaskuhndkfz/LeVLJEPA-ViT-B-DataComp-200k"

vision_encoder = timm.create_model("vit_base_patch16_224", pretrained=False, num_classes=0, dynamic_img_size=True)
vision_pre_proj = nn.Sequential(nn.Linear(HIDDEN, 2048), nn.BatchNorm1d(2048), nn.GELU(), nn.Linear(2048, EMBED))

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
text_encoder = GPT2Model(GPT2Config(n_embd=HIDDEN, n_layer=12, n_head=12, n_inner=HIDDEN * 4,
    vocab_size=tokenizer.vocab_size, attn_pdrop=0.0, resid_pdrop=0.0, embd_pdrop=0.0))
text_pre_proj = nn.Sequential(nn.Linear(HIDDEN, 2048), nn.BatchNorm1d(2048), nn.GELU(), nn.Linear(2048, EMBED))

vw = load_file(hf_hub_download(REPO, "vision_encoder.safetensors"))
tw = load_file(hf_hub_download(REPO, "text_encoder.safetensors"))
vision_encoder.load_state_dict({k[8:]: v for k, v in vw.items() if k.startswith("encoder.")})
vision_pre_proj.load_state_dict({k[9:]: v for k, v in vw.items() if k.startswith("pre_proj.")})
text_encoder.load_state_dict({k[8:]: v for k, v in tw.items() if k.startswith("encoder.")})
text_pre_proj.load_state_dict({k[9:]: v for k, v in tw.items() if k.startswith("pre_proj.")})
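# Put every module in eval mode. The BatchNorm1d layers in the pre-projection heads
# raise an error on a single-sample batch while in training mode.
vision_encoder.eval()
vision_pre_proj.eval()
text_encoder.eval()
text_pre_proj.eval()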

transform = transforms.Compose([
    transforms.Resize(224), transforms.CenterCrop(224), transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

image = transform(Image.open("image.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    image_features = vision_pre_proj(vision_encoder(image))

ids = tokenizer("a photo of a cat", add_special_tokens=False, truncation=True, max_length=76)["input_ids"] + [tokenizer.eos_token_id]
pad = 77 - len(ids)
input_ids = torch.tensor([ids + [tokenizer.pad_token_id] * pad])
attention_mask = torch.tensor([[1] * len(ids) + [0] * pad])
with torch.no_grad():
    hidden = text_encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
    idx = (attention_mask.sum(1) - 1).view(-1, 1, 1).expand(-1, 1, HIDDEN)
    text_features = text_pre_proj(hidden.gather(1, idx).squeeze(1))
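
The text embedding above is taken from the hidden state at the last non-padded position, which is the appended EOS token. The toy example below isolates that indexing step so it is easier to check. The helper name `last_token_pool` is hypothetical, and the sketch assumes right padding as in the snippet above.

```python
import torch


def last_token_pool(hidden, attention_mask):
    """Select the hidden state at the last non-padded position of each sequence."""
    last = attention_mask.sum(dim=1) - 1                      # (batch,)
    idx = last.view(-1, 1, 1).expand(-1, 1, hidden.size(-1))  # (batch, 1, dim)
    return hidden.gather(1, idx).squeeze(1)                   # (batch, dim)


# Toy input: 2 sequences, 4 positions, 3 features, values 0..23.
hidden = torch.arange(2 * 4 * 3, dtype=torch.float32).view(2, 4, 3)
mask = torch.tensor([[1, 1, 1, 0],    # last real token at position 2
                     [1, 1, 0, 0]])   # last real token at position 1
pooled = last_token_pool(hidden, mask)
```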

For dense patch features (the intended downstream use as a VLM/segmentation backbone), take the token sequence instead of the pooled output:

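with torch.no_grad():
    # forward_features returns (1, 197, 768) at 224x224: the class token followed by 14x14 patch tokens
    tokens = vision_encoder.forward_features(image)
    # Drop the class token to keep only the 196 patch tokens
    patch_tokens = tokens[:, vision_encoder.num_prefix_tokens:]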
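
For dense prediction, the patch tokens are usually reshaped into a spatial feature map. The sketch below shows one way to do that and attach a linear (1x1 convolution) head. It assumes the class token has already been dropped, so 196 patch tokens remain at 224x224 input. A random tensor stands in for real features, and `tokens_to_grid`, `head`, and `num_classes` are hypothetical names, not part of the released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def tokens_to_grid(patch_tokens, image_size=224, patch_size=16):
    """Reshape (batch, num_patches, dim) tokens into a (batch, dim, H, W) map."""
    b, n, d = patch_tokens.shape
    side = image_size // patch_size
    assert n == side * side, "expected patch tokens only (no class token)"
    # Tokens are in row-major order, so token i sits at (i // side, i % side).
    return patch_tokens.transpose(1, 2).reshape(b, d, side, side)


num_classes = 150                                    # e.g. the 150 ADE20K classes
head = nn.Conv2d(768, num_classes, kernel_size=1)    # linear head applied per patch

patch_tokens = torch.randn(1, 196, 768)              # stand-in for encoder output
grid = tokens_to_grid(patch_tokens)                  # (1, 768, 14, 14)
logits = F.interpolate(head(grid), size=(224, 224),
                       mode="bilinear", align_corners=False)
```

With the encoder frozen, only `head` would receive gradients in this setup.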

Files

| File | Contents |
| --- | --- |
| vision_encoder.safetensors | Vision encoder (encoder.*), pre-projection head (pre_proj.*), cross-modal projector (projector.*) |
| text_encoder.safetensors | Text encoder (encoder.*), pre-projection head (pre_proj.*), cross-modal projector (projector.*) |
| config.json | Architecture and training hyperparameters |
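
Since each safetensors file bundles several modules under key prefixes, a small helper can split a loaded state dict into one sub-dict per module. The sketch below is a hypothetical convenience, not part of the release. The keys in `demo` are illustrative placeholders, and the real key names should be read from the checkpoint itself.

```python
def split_by_prefix(state_dict, prefixes=("encoder.", "pre_proj.", "projector.")):
    """Group entries by key prefix and strip the prefix from each key."""
    groups = {p.rstrip("."): {} for p in prefixes}
    for key, value in state_dict.items():
        for p in prefixes:
            if key.startswith(p):
                groups[p.rstrip(".")][key[len(p):]] = value
                break
    return groups


# Toy dict standing in for the result of safetensors' load_file(...).
demo = {"encoder.cls_token": 0, "pre_proj.0.weight": 1, "projector.0.bias": 2}
groups = split_by_prefix(demo)
```

Each entry of `groups` can then be passed to the matching module's `load_state_dict`, which avoids hard-coding slice offsets such as `k[8:]`.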

Citation

@article{kuhn2026levljepa,
  title   = {LeVLJEPA: End-to-End Vision-Language Pretraining Without Contrastive Negatives},
  author  = {Kuhn, Lukas and Serra, Giuseppe and Balestriero, Randall and Buettner, Florian},
  journal = {arXiv preprint arXiv:XXXX.XXXXX},
  year    = {2026}
}

License

Released under CC BY-NC 4.0 for non-commercial use only.
