Model Card: ruCLIP
Disclaimer: The code for using the model can be found here.
Model Details
The ruCLIP model was developed by researchers at SberDevices and Sber AI, based on the original OpenAI CLIP paper.
Model Type
The model uses a ViT-B/32 Transformer architecture (initialized from the OpenAI checkpoint and frozen during training) as the image encoder and ruGPT3Small as the text encoder. The encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss.
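For illustration, here is a minimal sketch of the CLIP-style symmetric contrastive objective described above (a cross-entropy over the image-text similarity matrix). The function name, tensor names, and temperature value are illustrative assumptions, not taken from the ruCLIP training code.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    # Hypothetical sketch of a CLIP-style symmetric contrastive loss.
    # image_features, text_features: (batch, dim) outputs of the two encoders.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    # Cosine-similarity logits for every (image, text) pair in the batch.
    logits = image_features @ text_features.t() / temperature
    # Matching pairs lie on the diagonal of the similarity matrix.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_image = F.cross_entropy(logits, targets)      # image -> text direction
    loss_text = F.cross_entropy(logits.t(), targets)   # text -> image direction
    return (loss_image + loss_text) / 2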
Documents
Our Habr post.
Usage
Code for using the model is available in our repo.
from clip.evaluate.utils import (
    get_text_batch, get_image_batch, get_tokenizer,
    show_test_images, load_weights_only
)
import torch
# Load model and tokenizer
model, args = load_weights_only("ViT-B/32-small")
model = model.cuda().float().eval()
tokenizer = get_tokenizer()
# Load test images and prepare for model
images, texts = show_test_images(args)
input_ids, attention_mask = get_text_batch(["Это " + desc for desc in texts], tokenizer, args)
img_input = get_image_batch(images, args.img_transform, args)
# Call model
with torch.no_grad():
    logits_per_image, logits_per_text = model(
        img_input={"x": img_input},
        text_input={"x": input_ids, "attention_mask": attention_mask}
    )
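The returned logits can be turned into per-image probabilities over the text descriptions with a softmax; a minimal continuation of the snippet above, carrying over the same variable names:
# Probability that image i matches text description j.
probs = logits_per_image.softmax(dim=-1).cpu().numpy()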
Performance
We evaluated the model on the CIFAR10 and CIFAR100 datasets.
Zero-shot classification, CIFAR100: top-1 accuracy 0.4057; top-5 accuracy 0.6975.
Zero-shot classification, CIFAR10: top-1 accuracy 0.7803; top-5 accuracy 0.9834.
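For reference, here is a minimal sketch of how such zero-shot top-k accuracies are typically computed: each image is scored against one text prompt per class and counted as correct if the true class appears among the k highest-scoring prompts. The helper name, batching, and label handling are illustrative assumptions; only the model call signature mirrors the snippet in Usage above.
import torch

@torch.no_grad()
def zero_shot_accuracy(model, image_batches, label_batches, text_input, topk=(1, 5)):
    # Hypothetical evaluation loop over pre-built image batches and class prompts.
    # text_input: dict with tokenized prompts, one per class, as in the Usage snippet.
    correct = {k: 0 for k in topk}
    total = 0
    for img_input, labels in zip(image_batches, label_batches):
        logits_per_image, _ = model(img_input={"x": img_input}, text_input=text_input)
        # Indices of the top-k highest-scoring class prompts for each image.
        ranked = logits_per_image.topk(max(topk), dim=-1).indices
        for k in topk:
            correct[k] += (ranked[:, :k] == labels.unsqueeze(1)).any(dim=1).sum().item()
        total += labels.size(0)
    return {k: correct[k] / total for k in topk}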