# ruclip-vit-base-patch32-224

**RuCLIP** (**Ru**ssian **C**ontrastive **L**anguage–**I**mage **P**retraining) is a multimodal model
for computing similarity between images and texts and for ranking captions and pictures by relevance.
RuCLIP builds on a large body of work on zero-shot transfer, computer vision, natural language processing, and
multimodal learning.

The model was trained by the [Sber AI](https://github.com/sberbank-ai) and [SberDevices](https://sberdevices.ru/) teams.  
* Task: `text ranking`; `image ranking`; `zero-shot image classification`
* Type: `encoder`
* Num Parameters: `150M`
* Training Data Volume: `240 million text-image pairs`
* Language: `Russian`
* Image Size: `224`
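
The tasks listed above all reduce to comparing image and text embeddings. The following is a minimal sketch of that comparison, not the RuCLIP API: the embeddings are random placeholders standing in for encoder outputs, and the embedding size (`512`), the `logit_scale` value, and the helper names are assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def zero_shot_probs(image_emb: np.ndarray, text_emb: np.ndarray,
                    logit_scale: float = 100.0) -> np.ndarray:
    """Return per-image probabilities over the candidate texts.

    logit_scale is an illustrative temperature, not a value taken from the model.
    """
    image_emb = l2_normalize(image_emb)
    text_emb = l2_normalize(text_emb)
    logits = logit_scale * image_emb @ text_emb.T   # shape: (n_images, n_texts)
    logits -= logits.max(axis=-1, keepdims=True)    # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)


# Placeholder embeddings (assumption): in practice these would come from
# the model's text and image encoders.
rng = np.random.default_rng(0)
dim = 512  # assumed embedding size
text_emb = rng.normal(size=(3, dim))                              # 3 class prompts
image_emb = text_emb[[2, 0]] + 0.1 * rng.normal(size=(2, dim))    # 2 "images"

probs = zero_shot_probs(image_emb, text_emb)
predicted = probs.argmax(axis=-1)   # index of the best-matching prompt per image
print(predicted)
```

Sorting each row of `probs` gives a text ranking for an image; sorting each column gives an image ranking for a text.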

## Performance
We evaluated the model on the following datasets:

| dataset       | metric name    | metric result       |
|:--------------|:---------------|:--------------------|
| Food101       | acc            | 0.505               |
| CIFAR10       | acc            | 0.818               |
| CIFAR100      | acc            | 0.504               |
| Birdsnap      | acc            | 0.115               |
| SUN397        | acc            | 0.452               |
| Stanford Cars | acc            | 0.433               |
| DTD           | acc            | 0.380               |
| MNIST         | acc            | 0.447               |
| STL10         | acc            | 0.932               |
| PCam          | acc            | 0.501               |
| CLEVR         | acc            | 0.148               |
| Rendered SST2 | acc            | 0.489               |
| ImageNet      | acc            | 0.375               |
| FGVC Aircraft | mean-per-class | 0.033               |
| Oxford Pets   | mean-per-class | 0.560               |
| Caltech101    | mean-per-class | 0.786               |
| Flowers102    | mean-per-class | 0.401               |
| HatefulMemes  | roc-auc        | 0.564               |
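
The table reports both `acc` and `mean-per-class`. The sketch below shows how the two can differ on imbalanced data, assuming the standard definitions (overall accuracy versus the average of per-class recalls); the evaluation code used for the table is not shown in this card, and the labels here are toy values.

```python
import numpy as np


def accuracy(y_true, y_pred) -> float:
    """Fraction of all samples that are classified correctly."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float((y_true == y_pred).mean())


def mean_per_class_accuracy(y_true, y_pred) -> float:
    """Average of per-class accuracies, so every class counts equally."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    per_class = [(y_pred[y_true == c] == c).mean() for c in np.unique(y_true)]
    return float(np.mean(per_class))


# Toy labels (illustrative only): class 0 has four samples, class 1 has two.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 1, 0]

acc = accuracy(y_true, y_pred)                  # 5 of 6 correct
mpc = mean_per_class_accuracy(y_true, y_pred)   # (1.0 + 0.5) / 2
print(round(acc, 3), round(mpc, 3))
```

Mean-per-class accuracy is typically preferred when class sizes are uneven, since a frequent class cannot dominate the score.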