visheratin/nllb-clip-large-siglip

Model Summary

NLLB-CLIP-SigLIP is a model that combines a text encoder from the NLLB model and an image encoder from the SigLIP model. This allows us to extend the model capabilities to 201 languages of the Flores-200. NLLB-CLIP sets state-of-the-art on the Crossmodal-3600 dataset by performing very well on low-resource languages. You can find more details about the model in the paper.

This version performs much better than the standard version. You can see the results here and here.

NB: There is even better version of this model available!

How to use

This model is integrated into OpenCLIP so that you can use it as any other model:

!pip install -U open_clip_torch

from open_clip import create_model_from_pretrained, get_tokenizer
from PIL import Image
import requests
import torch

model, transform = create_model_from_pretrained("nllb-clip-large-siglip", "v1", device="cuda")

tokenizer = get_tokenizer("nllb-clip-large-siglip")

class_options = ["бабочка", "butterfly", "kat"]
class_langs = ["rus_Cyrl", "eng_Latn", "afr_Latn"]

text_inputs = []
for i in range(len(class_options)):
    tokenizer.set_language(class_langs[i])
    text_inputs.append(tokenizer(class_options[i]))
text_inputs = torch.stack(text_inputs).squeeze(1).to("cuda")

image_path = "https://huggingface.co/spaces/jjourney1125/swin2sr/resolve/main/samples/butterfly.jpg"
image = Image.open(requests.get(image_path, stream=True).raw)

image_inputs = transform(image).unsqueeze(0).to("cuda")

with torch.inference_mode():
    logits_per_image, logits_per_text = model.get_logits(image_inputs, text_inputs)

print(logits_per_image.softmax(dim=-1))

Acknowledgements

I thank ML Collective for providing Google Cloud compute resources to train the OpenCLIP-compatible version of NLLB-CLIP.

visheratin
/

nllb-clip-large-siglip

Model Summary

How to use

Acknowledgements

Dataset used to train visheratin/nllb-clip-large-siglip