---
tags:
- clip
- siglip
library_name: transformers
pipeline_tag: zero-shot-image-classification
license: apache-2.0
datasets:
- webli
language:
- en
---
# Model card for ViT-SO400M-14-SigLIP-384

A SigLIP (Sigmoid loss for Language-Image Pre-training) model trained on WebLI.

This model has been converted from the OpenCLIP checkpoint [timm/ViT-SO400M-14-SigLIP-384](https://huggingface.co/timm/ViT-SO400M-14-SigLIP-384) to a Hugging Face `CLIPVisionModel`.

```Python
from transformers import CLIPVisionModel, CLIPImageProcessor
from PIL import Image
import requests

model_id = 'ikala/ViT-SO400M-14-SigLIP-384-hf'
image_processor = CLIPImageProcessor.from_pretrained(model_id)
vision_tower = CLIPVisionModel.from_pretrained(model_id)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = image_processor(images=image, return_tensors="pt")

outputs = vision_tower(**inputs)
image_embeds = outputs.pooler_output  # pooled image embedding (vision-only model, so there is no image-text similarity score)
```

Note that there is still a slight difference in pooling: Hugging Face's `CLIPVisionModel` uses the `[CLS]` token embedding as the pooled output, while SigLIP uses a global attention pooler to produce the final latent feature.
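The original SigLIP attention pooler is not exposed through `CLIPVisionModel`, but the pooling difference can still be illustrated. Below is a minimal sketch, assuming the converted checkpoint above; the mean over patch tokens is only a stand-in for SigLIP's multi-head attention pooling, not an exact reproduction of it.

```Python
import torch
import requests
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

model_id = 'ikala/ViT-SO400M-14-SigLIP-384-hf'  # converted checkpoint from above
image_processor = CLIPImageProcessor.from_pretrained(model_id)
vision_tower = CLIPVisionModel.from_pretrained(model_id)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = image_processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = vision_tower(**inputs)

# CLIPVisionModel pools by taking the first ([CLS]) token and applying a layer norm.
cls_pooled = outputs.pooler_output                  # shape: (batch, hidden_dim)

# Rough SigLIP-style alternative (an assumption, not the original attention pooler):
# drop the [CLS] token and average the remaining patch tokens.
patch_tokens = outputs.last_hidden_state[:, 1:, :]  # shape: (batch, num_patches, hidden_dim)
mean_pooled = patch_tokens.mean(dim=1)              # shape: (batch, hidden_dim)
```

Either pooled tensor can be fed to a downstream head; which one works better depends on how the downstream model was trained against these features.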