---
license: mit
language:
- ar
- kn
- ar
- ka
- af
- kk
- am
- km
- ar
- ky
- ar
- ko
- as
- lo
- az
- ml
- az
- mr
- be
- mk
- bn
- my
- bs
- nl
- bg
- ca
- 'no'
- cs
- ne
- ku
- pl
- cy
- pt
- da
- ro
- de
- ru
- el
- sa
- en
- si
- eo
- sk
- et
- sl
- eu
- sd
- fi
- so
- fr
- es
- gd
- sr
- ga
- su
- gl
- sv
- gu
- sw
- ha
- ta
- he
- te
- hi
- th
- hr
- tr
- hu
- ug
- hy
- uk
- id
- ur
- is
- vi
- it
- xh
- jv
- zh
- ja
pipeline_tag: zero-shot-image-classification
tags:
- siglip
- clip
- mexma
---

## Model Summary
MEXMA-SigLIP combines the MEXMA multilingual text encoder with an image encoder from the SigLIP model, yielding a high-performing CLIP-style model that covers 80 languages. MEXMA-SigLIP sets a new state-of-the-art on the Crossmodal-3600 dataset among models with commercial-use-friendly licenses.
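Conceptually, this is the standard dual-encoder setup: each encoder maps its input to a vector in a shared embedding space, and an image-text score is the scaled cosine similarity of the two vectors. The sketch below is a simplification rather than the model's actual code; the function name, the `logit_bias` term, and the embedding size of 768 are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def clip_style_logits(image_embeds, text_embeds, logit_scale, logit_bias=None):
    # L2-normalize both embedding sets so the dot product equals cosine similarity.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    # (num_images, num_texts) similarity matrix, scaled by a learned temperature.
    logits = logit_scale * image_embeds @ text_embeds.T
    if logit_bias is not None:
        # SigLIP-style models additionally learn an additive bias.
        logits = logits + logit_bias
    return logits

# Toy check: one "image" embedding scored against three "text" embeddings.
image_logits = clip_style_logits(torch.randn(1, 768), torch.randn(3, 768), torch.tensor(10.0))
print(image_logits.softmax(dim=-1))
```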
## How to use
```python
from transformers import AutoModel, AutoTokenizer, AutoImageProcessor
from PIL import Image
import requests
import torch

# Load the model, tokenizer, and image processor. trust_remote_code is required
# because the model ships custom modeling code.
model = AutoModel.from_pretrained("visheratin/mexma-siglip", torch_dtype=torch.bfloat16, trust_remote_code=True, optimized=True).to("cuda")
tokenizer = AutoTokenizer.from_pretrained("visheratin/mexma-siglip")
processor = AutoImageProcessor.from_pretrained("visheratin/mexma-siglip")

# Download an example image and preprocess it into pixel values.
img = Image.open(requests.get("https://static.independent.co.uk/s3fs-public/thumbnails/image/2014/03/25/12/eiffel.jpg", stream=True).raw)
img = processor(images=img, return_tensors="pt")["pixel_values"]
img = img.to(torch.bfloat16).to("cuda")

with torch.inference_mode():
    # Candidate captions in Russian ("cat"), English, and Hindi ("Eiffel Tower").
    text = tokenizer(["кошка", "a dog", "एफिल टॉवर"], return_tensors="pt", padding=True).to("cuda")
    image_logits, text_logits = model.get_logits(text["input_ids"], text["attention_mask"], img)
    probs = image_logits.softmax(dim=-1)
    print(probs)
```
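As a follow-up (not part of the original example), the probabilities can be mapped back to the candidate prompts to read off the predicted caption; `labels` below simply repeats the strings passed to the tokenizer above.

```python
# Illustrative only: map probabilities back to the candidate captions.
labels = ["кошка", "a dog", "एफिल टॉवर"]
top = probs.argmax(dim=-1)
for i, idx in enumerate(top.tolist()):
    print(f"image {i}: {labels[idx]} (p={probs[i, idx].item():.3f})")
```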
## Acknowledgements
I thank ML Collective and Lambda for providing compute resources to train the model.