visheratin/nllb-siglip-mrl-large

Model Summary

NLLB-SigLIP-MRL combines a text encoder from the NLLB model with an image encoder from the SigLIP model. This extends the model's capabilities to the 201 languages of Flores-200. This version was trained with a variation of Matryoshka Representation Learning so that it can produce embeddings of sizes [32, 64, 128, 256, 512] in addition to the original 1152. Based on the benchmarks below, embeddings of sizes 256 and 512 preserve more than 90% of the full embedding quality.
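To give a rough sense of what the smaller embedding sizes save, the sketch below computes the raw storage needed for a collection of vectors at each size. It assumes embeddings are stored as float32 and ignores any index overhead; the collection size and the helper name `index_size_bytes` are hypothetical and chosen only for illustration.

```python
# Embedding sizes listed in the model summary.
EMBEDDING_SIZES = [32, 64, 128, 256, 512, 1152]
BYTES_PER_FLOAT32 = 4  # assumption: embeddings are stored as float32


def index_size_bytes(num_vectors: int, dim: int) -> int:
    """Raw size of a dense vector store, ignoring any index overhead."""
    return num_vectors * dim * BYTES_PER_FLOAT32


if __name__ == "__main__":
    num_vectors = 1_000_000  # hypothetical collection size
    full = index_size_bytes(num_vectors, 1152)
    for dim in EMBEDDING_SIZES:
        size = index_size_bytes(num_vectors, dim)
        print(f"{dim:>5} dims: {size / 1e9:.3f} GB ({size / full:.1%} of full size)")
```

Under these assumptions, storage scales linearly with the embedding size, so the choice between 256, 512, and 1152 dimensions is a direct trade-off between footprint and retrieval quality.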

The full embedding model sets a new state of the art for multilingual image and text retrieval on both XTD10 and Crossmodal-3600.

Dataset	image retrieval R@1, avg	text retrieval R@1, avg	image retrieval R@5, avg	text retrieval R@5, avg	image retrieval R@10, avg	text retrieval R@10, avg
Crossmodal-3600	0.6079	0.5741	0.8333	0.8174	0.8922	0.8816
XTD10	0.6997	0.6433	0.8988	0.8848	0.9503	0.9449
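The R@K columns report recall at K: the fraction of queries whose correct match appears among the top K retrieved items. The sketch below shows one common way to compute it from a query-by-candidate similarity matrix. It assumes exactly one correct candidate per query, located at the same index as the query; the exact evaluation protocol behind the table (for example, how scores are averaged across languages) is not described here and may differ. The toy matrix is made up and is not model output.

```python
def recall_at_k(similarity: list[list[float]], k: int) -> float:
    """Fraction of queries whose ground-truth item is in the top-k results.

    similarity[i][j] is the score between query i and candidate j.
    Assumption: the correct candidate for query i is candidate i.
    """
    hits = 0
    for query_index, scores in enumerate(similarity):
        # Candidate indices sorted by descending score.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if query_index in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy 3x3 similarity matrix (made-up numbers, not model output).
toy = [
    [0.9, 0.1, 0.3],  # query 0: correct item ranked 1st
    [0.8, 0.5, 0.2],  # query 1: correct item ranked 2nd
    [0.4, 0.6, 0.1],  # query 2: correct item ranked 3rd
]

for k in (1, 2, 3):
    print(f"R@{k} = {recall_at_k(toy, k):.4f}")
```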

How to use

Variable resolutions

To use the model with variable embedding sizes, load it through transformers and pass the desired size as `resolution`:

!pip install -U transformers open_clip_torch

from transformers import AutoModel
from PIL import Image
import requests
import torch

model = AutoModel.from_pretrained("visheratin/nllb-siglip-mrl-large", device="cpu", trust_remote_code=True)

image_path = "https://huggingface.co/spaces/jjourney1125/swin2sr/resolve/main/samples/butterfly.jpg"
image = Image.open(requests.get(image_path, stream=True).raw)

class_options = ["бабочка", "butterfly", "kat"]
class_langs = ["rus_Cyrl", "eng_Latn", "afr_Latn"]

image_logits, text_logits = model.get_logits(
    images=[image],
    texts=class_options,
    langs=class_langs,
    resolution=512  # embedding size to use; set to `None` for the original size
)

print(torch.softmax(image_logits, dim=1))
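The softmax call above turns the image-to-text logits into one probability per candidate text. As a minimal sketch of how that row can be mapped back to a label, the snippet below reimplements softmax in plain Python and picks the most likely option. The logit values are made up for illustration, and `best_label` is a hypothetical helper, not part of the model's API.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over a single row of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def best_label(logits: list[float], labels: list[str]) -> tuple[str, float]:
    """Return the label with the highest probability and that probability."""
    probs = softmax(logits)
    index = max(range(len(probs)), key=lambda i: probs[i])
    return labels[index], probs[index]


class_options = ["бабочка", "butterfly", "kat"]
example_logits = [3.5, 4.0, -2.0]  # made-up values, not real model output

label, prob = best_label(example_logits, class_options)
print(label, round(prob, 3))
```

With the real output, the input row could be taken as `image_logits[0].tolist()`, assuming `image_logits` has one row per image and one column per text.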

OpenCLIP

This model is also integrated into OpenCLIP, so you can use it like any other OpenCLIP model:

!pip install -U open_clip_torch

from open_clip import create_model_from_pretrained, get_tokenizer
from PIL import Image
import requests
import torch

model, transform = create_model_from_pretrained("nllb-clip-large-siglip", "mrl", device="cuda")

tokenizer = get_tokenizer("nllb-clip-large-siglip")

class_options = ["бабочка", "butterfly", "kat"]
class_langs = ["rus_Cyrl", "eng_Latn", "afr_Latn"]

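# Tokenize each text with its own language code set on the tokenizer.
text_inputs = []
for text, lang in zip(class_options, class_langs):
    tokenizer.set_language(lang)
    text_inputs.append(tokenizer(text))
# Stack the per-text tensors into a single batch and move it to the GPU.
text_inputs = torch.stack(text_inputs).squeeze(1).to("cuda")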

image_path = "https://huggingface.co/spaces/jjourney1125/swin2sr/resolve/main/samples/butterfly.jpg"
image = Image.open(requests.get(image_path, stream=True).raw)

image_inputs = transform(image).unsqueeze(0).to("cuda")

with torch.inference_mode():
    logits_per_image, logits_per_text = model.get_logits(image_inputs, text_inputs)

print(logits_per_image.softmax(dim=-1))

Acknowledgements

I thank ML Collective for providing Google Cloud compute resources.
