A Contrastive Language-Image Mamba Pretraining (CLIMP) model that uses Mamba2-1.3B as the text encoder.
| Component | Details |
|---|---|
| Vision Encoder | VMamba-Base (128-256-512-1024 dims, depths [2,2,15,2]) |
| Text Encoder | Mamba2-1.3B (AntonV/mamba2-1.3b-hf) |
| Projection Dim | 768 |
| Training Data | CC12M |
| Image Resolution | 224x224 |
| Loss | Symmetric InfoNCE (learned temperature) |
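
The loss row above can be illustrated with a minimal NumPy sketch of a symmetric InfoNCE objective. This is not the training code from the CLIMP repository. The function names, the log-space `logit_scale` parameterization (borrowed from the common CLIP convention), and the initial temperature of 0.07 are assumptions made for illustration only.

```python
import numpy as np


def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def symmetric_info_nce(image_emb, text_emb, logit_scale):
    """Symmetric InfoNCE over a batch of paired embeddings.

    image_emb, text_emb: arrays of shape (batch, dim); row i of each is a matched pair.
    logit_scale: scalar log inverse temperature (assumed parameterization;
                 in training this would be a learned parameter).
    """
    # L2-normalize so the dot product is cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Pairwise similarities scaled by the inverse temperature.
    logits = np.exp(logit_scale) * (image_emb @ text_emb.T)

    # The correct match for row i is column i (the diagonal).
    idx = np.arange(logits.shape[0])
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    batch, dim = 4, 768  # dim matches the projection dimension in the table
    img = rng.normal(size=(batch, dim))
    txt = rng.normal(size=(batch, dim))
    print(symmetric_info_nce(img, txt, logit_scale=np.log(1 / 0.07)))
```

Averaging the image-to-text and text-to-image terms is what makes the loss symmetric: swapping the two embedding arguments leaves the value unchanged.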
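```python
from models import load_climp
from data.utils import transform_image

# Load the CLIMP variant that uses the Mamba2 text encoder.
model = load_climp("mamba2")

# Image preprocessing for 224x224 inputs.
transform = transform_image(224)
```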
See the demo repository for evaluation code.
CLIMP: Contrastive Language-Image Mamba Pretraining
@article{climp2026,
title={CLIMP: Contrastive Language-Image Mamba Pretraining},
author={Shabtay, Nimrod and Zimerman, Itamar and Schwartz, Eli and Giryes, Raja},
journal={arXiv preprint arXiv:2601.06891},
year={2026}
}