Model card for ViT-B-16-SigLIP-i18n-256

A SigLIP (Sigmoid loss for Language-Image Pre-training) model trained on WebLI.

This model was converted from the OpenCLIP checkpoint timm/ViT-B-16-SigLIP-i18n-256 to a Hugging Face CLIPVisionModel.

from transformers import CLIPVisionModel, CLIPImageProcessor
from PIL import Image
import requests
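url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Load the preprocessing config; assumed to ship in the same repo as the model weights
image_processor = CLIPImageProcessor.from_pretrained('ikala/ViT-B-16-SigLIP-i18n-256-hf')
inputs = image_processor(images=image, return_tensors="pt")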

vision_tower = CLIPVisionModel.from_pretrained('ikala/ViT-B-16-SigLIP-i18n-256-hf')
outputs = vision_tower(**inputs)

image_embeds = outputs.pooler_output  # pooled image embedding, shape (batch_size, hidden_size)

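One difference remains: Hugging Face's CLIPVisionModel uses the [CLS] token embedding as its pooled output, whereas SigLIP uses a global attention pooler to produce the final latent feature. The pooler_output above may therefore differ slightly from the image feature of the original SigLIP model.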
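
To make the pooling difference concrete, here is a toy sketch of the two strategies. It is written in plain Python so it runs without any dependencies, and it is not the code of either library: `cls_pool`, `attention_pool`, the token values, and the query vectors are all hypothetical names and numbers chosen for illustration. The real SigLIP pooler has learned parameters and more layers than this single-head version.

```python
import math

def cls_pool(tokens):
    """CLS-style pooling: return the first token's embedding unchanged."""
    return list(tokens[0])

def attention_pool(tokens, query):
    """Simplified single-head attention pooling with one query vector.

    Illustrative assumption only; not the actual SigLIP pooling head.
    """
    dim = len(query)
    # Scaled dot-product score between the query and every token
    scores = [sum(q * t for q, t in zip(query, tok)) / math.sqrt(dim) for tok in tokens]
    # Numerically stable softmax over the token axis
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted average of all token embeddings
    return [sum(w * tok[i] for w, tok in zip(weights, tokens)) for i in range(dim)]

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy sequence: 3 tokens, dim 2
query = [0.0, 0.0]  # a zero query gives uniform weights, i.e. plain mean pooling

print(cls_pool(tokens))               # depends on the first token only
print(attention_pool(tokens, query))  # depends on every token
```

CLS pooling reads a single position, while attention pooling mixes all positions with weights that depend on the query, so the two generally produce different vectors from the same token sequence.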
