

RuCLIP (Russian Contrastive Language–Image Pretraining) is a multimodal model for computing similarity between images and texts and for ranking captions and images. RuCLIP builds on a large body of work on zero-shot transfer, computer vision, natural language processing and multimodal learning.

The model was trained by the Sber AI and SberDevices teams.

  • Task: text ranking, image ranking, zero-shot image classification
  • Type: encoder
  • Num Parameters: 430M
  • Training Data Volume: 240 million text-image pairs
  • Language: Russian
  • Context Length: 77
  • Transformer Layers: 12
  • Transformer Width: 768
  • Transformer Heads: 12
  • Image Size: 224
  • Vision Layers: 24
  • Vision Width: 1024
  • Vision Patch Size: 14
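
As a quick sanity check on the vision settings above, the 224×224 input size and 14×14 patch size determine how many patch tokens the vision transformer processes per image:

```python
# Values taken from the model card above.
image_size, patch_size = 224, 14

patches_per_side = image_size // patch_size  # 224 / 14 = 16
num_patch_tokens = patches_per_side ** 2     # 16 * 16 = 256 patch tokens
```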

Usage (Github)

Install the ruclip package:

```
pip install ruclip
```

Then load the model and processor:

```python
import ruclip

clip, processor = ruclip.load("ruclip-vit-large-patch14-224", device="cuda")
```
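
The loaded model maps images and texts into a shared embedding space, and ranking is done by cosine similarity between the two. A minimal numpy sketch of that comparison step, using random vectors as stand-ins for the encoder outputs (the 768-dimensional joint space matches the transformer width listed above, and the scale factor of 100 is an assumed CLIP-style logit scale):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for encoder outputs: 1 image embedding, 3 caption embeddings.
# In practice these would come from the model's image and text encoders.
image_emb = rng.normal(size=(1, 768))
text_embs = rng.normal(size=(3, 768))

# L2-normalize so the dot product equals cosine similarity.
image_emb /= np.linalg.norm(image_emb, axis=-1, keepdims=True)
text_embs /= np.linalg.norm(text_embs, axis=-1, keepdims=True)

# Scaled similarities; softmax turns them into a distribution over captions.
logits = 100.0 * image_emb @ text_embs.T
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
best_caption = int(probs.argmax())  # index of the highest-ranked caption
```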


We have evaluated the performance on the following datasets:

| Dataset | Metric Name | Metric Result |
|---|---|---|
| Food101 | acc | 0.597 |
| CIFAR10 | acc | 0.878 |
| CIFAR100 | acc | 0.511 |
| Birdsnap | acc | 0.172 |
| SUN397 | acc | 0.484 |
| Stanford Cars | acc | 0.559 |
| DTD | acc | 0.370 |
| MNIST | acc | 0.337 |
| STL10 | acc | 0.934 |
| PCam | acc | 0.520 |
| CLEVR | acc | 0.152 |
| Rendered SST2 | acc | 0.529 |
| ImageNet | acc | 0.426 |
| FGVC Aircraft | mean-per-class | 0.046 |
| Oxford Pets | mean-per-class | 0.604 |
| Caltech101 | mean-per-class | 0.777 |
| Flowers102 | mean-per-class | 0.455 |
| HatefulMemes | roc-auc | 0.530 |
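
Zero-shot accuracies like those above are typically obtained by encoding each class name as a text prompt and predicting the class whose embedding is most similar to the image. A sketch of that top-1 accuracy computation, with a toy similarity matrix standing in for real model outputs:

```python
import numpy as np

# Toy similarity matrix: rows = images, cols = class prompts.
# In practice each entry is the cosine similarity between an image
# embedding and a class-prompt text embedding.
sims = np.array([
    [0.9, 0.1, 0.0],  # image 0: most similar to class 0
    [0.2, 0.7, 0.1],  # image 1: most similar to class 1
    [0.3, 0.4, 0.5],  # image 2: most similar to class 2
])
labels = np.array([0, 1, 0])  # ground truth; image 2 is misclassified

predictions = sims.argmax(axis=1)
accuracy = (predictions == labels).mean()  # 2 of 3 correct
```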

