---
license: mit
language:
  - af
  - am
  - ar
  - as
  - az
  - be
  - bg
  - bn
  - bs
  - ca
  - cs
  - cy
  - da
  - de
  - el
  - en
  - eo
  - es
  - et
  - eu
  - fi
  - fr
  - ga
  - gd
  - gl
  - gu
  - ha
  - he
  - hi
  - hr
  - hu
  - hy
  - id
  - is
  - it
  - ja
  - jv
  - ka
  - kk
  - km
  - kn
  - ko
  - ku
  - ky
  - lo
  - mk
  - ml
  - mr
  - my
  - ne
  - nl
  - 'no'
  - pl
  - pt
  - ro
  - ru
  - sa
  - sd
  - si
  - sk
  - sl
  - so
  - sr
  - su
  - sv
  - sw
  - ta
  - te
  - th
  - tr
  - ug
  - uk
  - ur
  - vi
  - xh
  - zh
pipeline_tag: zero-shot-image-classification
tags:
  - siglip
  - clip
  - mexma
---

## Model Summary

MEXMA-SigLIP combines the MEXMA multilingual text encoder with the image encoder from the SigLIP model, yielding a high-performance CLIP-style model that covers 80 languages. MEXMA-SigLIP sets the state of the art on the Crossmodal-3600 dataset among models with commercial-use-friendly licenses.
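At inference time, the model scores image-text pairs in the standard CLIP/SigLIP fashion: each encoder produces an embedding, the embeddings are L2-normalized, and a scaled dot product gives the similarity logits. The sketch below illustrates only that scoring step with made-up tensors; the shapes, the `logit_scale` value, and the variable names are illustrative assumptions, not the model's actual internals.

```python
import torch
import torch.nn.functional as F

# Hypothetical embeddings: 2 images and 3 candidate captions (shapes are illustrative).
image_emb = torch.randn(2, 1152)
text_emb = torch.randn(3, 1152)
logit_scale = torch.tensor(10.0)  # stand-in for a learned temperature

# L2-normalize so the dot product becomes cosine similarity.
image_emb = F.normalize(image_emb, dim=-1)
text_emb = F.normalize(text_emb, dim=-1)

# Rows: images, columns: captions; softmax turns each row into label probabilities.
image_logits = logit_scale * image_emb @ text_emb.t()
print(image_logits.softmax(dim=-1))
```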

## How to use

```python
from transformers import AutoModel, AutoTokenizer, AutoImageProcessor
from PIL import Image
import requests
import torch

# `trust_remote_code=True` is required because the model class (including `get_logits`
# and the `optimized` flag) is defined by custom code in the repository.
model = AutoModel.from_pretrained("visheratin/mexma-siglip", torch_dtype=torch.bfloat16, trust_remote_code=True, optimized=True).to("cuda")
tokenizer = AutoTokenizer.from_pretrained("visheratin/mexma-siglip")
processor = AutoImageProcessor.from_pretrained("visheratin/mexma-siglip")

# Download an example image and preprocess it into pixel values.
img = Image.open(requests.get("https://static.independent.co.uk/s3fs-public/thumbnails/image/2014/03/25/12/eiffel.jpg", stream=True).raw)
img = processor(images=img, return_tensors="pt")["pixel_values"]
img = img.to(torch.bfloat16).to("cuda")

with torch.inference_mode():
    # Candidate captions in Russian ("cat"), English, and Hindi ("Eiffel Tower").
    text = tokenizer(["кошка", "a dog", "एफिल टॉवर"], return_tensors="pt", padding=True).to("cuda")
    # Returns image-to-text and text-to-image logits.
    image_logits, text_logits = model.get_logits(text["input_ids"], text["attention_mask"], img)
    probs = image_logits.softmax(dim=-1)
    print(probs)
```
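To turn the probabilities into a prediction, take the argmax over the candidate captions. The snippet below is a small usage sketch that reuses `probs` from the example above; the `labels` list simply repeats the captions passed to the tokenizer.

```python
labels = ["кошка", "a dog", "एफिल टॉवर"]
for i, idx in enumerate(probs.argmax(dim=-1).tolist()):
    print(f"image {i}: {labels[idx]} (p={probs[i, idx].item():.3f})")
```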

## Acknowledgements

I thank ML Collective and Lambda for providing compute resources to train the model.