metadata
license: apache-2.0
tags:
  - vision
pipeline_tag: zero-shot-image-classification

SigLIP 2 Base

SigLIP 2 extends the SigLIP pretraining objective with several prior, independently developed techniques, combined into a unified recipe that improves semantic understanding, localization, and dense features.

Intended uses

You can use the raw model for tasks like zero-shot image classification and image-text retrieval, or as a vision encoder for VLMs (and other vision tasks).
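To make the zero-shot use case concrete, here is a minimal sketch of how sigmoid-based scoring between one image embedding and several label embeddings can work. It does not load the model: the 3-dimensional embeddings, the `scale` and `bias` values, and the helper names (`l2_normalize`, `zero_shot_scores`) are all hypothetical stand-ins for what a real checkpoint would produce.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def zero_shot_scores(image_emb, text_embs, scale=10.0, bias=-5.0):
    """Score one image embedding against several label embeddings.

    Each label gets an independent sigmoid probability, so the scores
    are not expected to sum to 1 (unlike a softmax over labels).
    `scale` and `bias` are made-up placeholders for learned parameters.
    """
    img = l2_normalize(image_emb)
    scores = []
    for txt in text_embs:
        txt_n = l2_normalize(txt)
        cosine = sum(a * b for a, b in zip(img, txt_n))
        scores.append(sigmoid(scale * cosine + bias))
    return scores


# Toy embeddings standing in for real image and text encoder outputs.
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
image_emb = [0.9, 0.1, 0.0]
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

scores = zero_shot_scores(image_emb, text_embs)
best_label = labels[max(range(len(labels)), key=scores.__getitem__)]
```

The same scoring applies to image-text retrieval: rank candidate texts (or images) by their score against a query embedding instead of picking a single best label.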

Training procedure

SigLIP 2 adds the following training objectives and capabilities on top of SigLIP:

  1. Decoder loss
  2. Global-local and masked prediction loss
  3. Aspect ratio and resolution adaptability
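As an illustration of the third item, the sketch below shows one possible way to choose a patch grid that roughly preserves an image's aspect ratio under a fixed patch budget. This is not the preprocessing used to train SigLIP 2: the function name `patch_grid`, the default `patch_size` of 16, and the `max_patches` budget are assumptions chosen for the example.

```python
import math


def patch_grid(height, width, patch_size=16, max_patches=256):
    """Pick (rows, cols) of patches for an image of the given pixel size.

    The grid keeps roughly the same aspect ratio as the image and never
    exceeds `max_patches` patches in total. Hypothetical helper, not the
    actual SigLIP 2 preprocessing.
    """
    # Continuous scale factor at which the patch count equals max_patches.
    scale = math.sqrt(max_patches * patch_size**2 / (height * width))
    rows = max(1, math.floor(height * scale / patch_size))
    cols = max(1, math.floor(width * scale / patch_size))
    # Clamping to at least 1 can overshoot the budget for very elongated
    # images, so shrink the longer side until the grid fits.
    if rows * cols > max_patches:
        if rows >= cols:
            rows = max_patches // cols
        else:
            cols = max_patches // rows
    return rows, cols


def resized_shape(height, width, patch_size=16, max_patches=256):
    """Pixel size to resize to so the image splits evenly into patches."""
    rows, cols = patch_grid(height, width, patch_size, max_patches)
    return rows * patch_size, cols * patch_size


square = patch_grid(512, 512)        # square image
landscape = patch_grid(480, 640)     # 3:4 image keeps a wider grid
landscape_pixels = resized_shape(480, 640)
```

Rounding the grid down means some of the budget can go unused (the 480x640 example uses fewer than 256 patches); a real pipeline might trade a little aspect-ratio distortion for fuller use of the sequence length.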

Training data

SigLIP 2 is pre-trained on the WebLI dataset (Chen et al., 2023).

Compute

The model was trained on up to 2048 TPU-v5e chips.

Evaluation results

Evaluation of SigLIP 2 is shown below (taken from the paper).

Evaluation Table

BibTeX entry and citation info

@misc{tschannen2025siglip2multilingualvisionlanguage,
      title={SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features}, 
      author={Michael Tschannen and Alexey Gritsenko and Xiao Wang and Muhammad Ferjad Naeem and Ibrahim Alabdulmohsin and Nikhil Parthasarathy and Talfan Evans and Lucas Beyer and Ye Xia and Basil Mustafa and Olivier Hénaff and Jeremiah Harmsen and Andreas Steiner and Xiaohua Zhai},
      year={2025},
      eprint={2502.14786},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2502.14786}, 
}