
ruclip-vit-base-patch16-224

RuCLIP (Russian Contrastive Language–Image Pretraining) is a multimodal model that computes similarity between images and texts and ranks captions and pictures against each other. RuCLIP builds on a large body of work on zero-shot transfer, computer vision, natural language processing, and multimodal learning.
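As a rough illustration of what ranking by image-text similarity involves, the sketch below ranks captions against a single image embedding using cosine similarity. The vectors are invented 3-dimensional placeholders, not RuCLIP outputs, and `cosine_similarity` and `rank_captions` are hypothetical helper names rather than part of the `ruclip` package.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_captions(image_embedding, caption_embeddings):
    """Return (caption, score) pairs sorted from most to least similar."""
    scored = [
        (caption, cosine_similarity(image_embedding, emb))
        for caption, emb in caption_embeddings.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Placeholder embeddings (assumption: invented values for illustration only).
image_embedding = [0.9, 0.1, 0.2]
caption_embeddings = {
    "кошка": [0.8, 0.2, 0.1],    # "cat"
    "собака": [0.1, 0.9, 0.3],   # "dog"
    "самолёт": [0.0, 0.2, 0.9],  # "airplane"
}

ranking = rank_captions(image_embedding, caption_embeddings)
best_caption = ranking[0][0]
print(ranking)
```

The same scoring works in the other direction: fixing one caption and scoring many image embeddings gives an image ranking.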

The model was trained by the Sber AI and SberDevices teams.

  • Task: text ranking; image ranking; zero-shot image classification
  • Type: encoder
  • Num Parameters: 150M
  • Training Data Volume: 240 million text-image pairs
  • Language: Russian
  • Context Length: 77
  • Transformer Layers: 12
  • Transformer Width: 512
  • Transformer Heads: 8
  • Image Size: 224
  • Vision Layers: 12
  • Vision Width: 768
  • Vision Patch Size: 16
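The image size and patch size listed above determine how many patches the vision encoder processes. A short sketch of that arithmetic follows; the extra class token and the even split of width across attention heads are assumptions based on the standard ViT and Transformer designs, not details stated in this card.

```python
image_size = 224          # from "Image Size"
patch_size = 16           # from "Vision Patch Size"
text_width = 512          # from "Transformer Width"
text_heads = 8            # from "Transformer Heads"

# Number of non-overlapping patches along one side and in total.
patches_per_side = image_size // patch_size    # 14
num_patches = patches_per_side ** 2            # 196

# Assumption: one class token is prepended, as in the standard ViT.
vision_sequence_length = num_patches + 1       # 197

# Assumption: width is split evenly across heads (standard multi-head attention).
text_head_dim = text_width // text_heads       # 64

print(patches_per_side, num_patches, vision_sequence_length, text_head_dim)
```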

Usage Github

pip install ruclip

import ruclip
clip, processor = ruclip.load("ruclip-vit-base-patch16-224", device="cuda")

Performance

We have evaluated the performance on the following datasets:

| Dataset       | Metric         | Result |
|---------------|----------------|--------|
| Food101       | acc            | 0.552  |
| CIFAR10       | acc            | 0.810  |
| CIFAR100      | acc            | 0.496  |
| Birdsnap      | acc            | 0.117  |
| SUN397        | acc            | 0.462  |
| Stanford Cars | acc            | 0.487  |
| DTD           | acc            | 0.401  |
| MNIST         | acc            | 0.464  |
| STL10         | acc            | 0.932  |
| PCam          | acc            | 0.505  |
| CLEVR         | acc            | 0.128  |
| Rendered SST2 | acc            | 0.527  |
| ImageNet      | acc            | 0.401  |
| FGVC Aircraft | mean-per-class | 0.043  |
| Oxford Pets   | mean-per-class | 0.595  |
| Caltech101    | mean-per-class | 0.775  |
| Flowers102    | mean-per-class | 0.388  |
| HatefulMemes  | roc-auc        | 0.516  |
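The results above use both plain accuracy and mean-per-class accuracy. As a hedged illustration of how the two differ, the sketch below computes both on a tiny invented label set. The helper names and data are hypothetical, and this is not the evaluation code that produced the reported numbers.

```python
from collections import defaultdict

def accuracy(y_true, y_pred):
    """Fraction of all samples predicted correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def mean_per_class_accuracy(y_true, y_pred):
    """Average of per-class accuracies, so each class counts equally."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    per_class = [hits[c] / totals[c] for c in totals]
    return sum(per_class) / len(per_class)

# Invented, imbalanced toy data: four "cat" samples and two "dog" samples.
y_true = ["cat", "cat", "cat", "cat", "dog", "dog"]
y_pred = ["cat", "cat", "cat", "cat", "cat", "dog"]

overall = accuracy(y_true, y_pred)                   # 5 of 6 correct
balanced = mean_per_class_accuracy(y_true, y_pred)   # mean of 1.0 and 0.5
print(overall, balanced)
```

On imbalanced data the two can diverge noticeably, since plain accuracy is dominated by the frequent classes while mean-per-class accuracy weights every class equally.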

Authors
