---
license: cc-by-4.0
---

**Model Details**

VisMin-CLIP is a fine-tuned version of the pretrained CLIP model, designed to enhance fine-grained and compositional abilities beyond the base model. Fine-tuning was conducted with the [OpenCLIP](https://github.com/mlfoundations/open_clip) library, an open-source implementation of OpenAI's CLIP.

**Model Summary**

- Model Date: July 2024
- Model type: Vision-language Foundation Model (image+text)
- Parent Model: [openai/clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)

**Usage**

Like any OpenCLIP model, VisMin-CLIP can easily be loaded from a checkpoint:

```python
import torch
import open_clip

model_cls_name = "ViT-L-14"
checkpoint_path = "path/to/checkpoint"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Build the model and its image preprocessing transform from the fine-tuned checkpoint.
model, _, preprocess = open_clip.create_model_and_transforms(
    model_name=model_cls_name,
    pretrained=checkpoint_path,
    device=device,
)
tokenizer = open_clip.get_tokenizer(model_cls_name)
model = model.to(device).eval()
```

Once loaded, you can encode an image and a set of candidate texts to perform zero-shot image classification:

```python
import requests
import torch
from PIL import Image

# Download and preprocess an example image, and tokenize the candidate labels.
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)
image = preprocess(image).unsqueeze(0).to(device)
text = tokenizer(["a diagram", "a dog", "a cat"]).to(device)

with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
```

**Bibtex**

If you use VisMin-CLIP in your work, please cite it as follows:

```
@article{vismin2024,
  title={VisMin: Visual Minimal-Change Understanding},
  author={Awal, Rabiul and Ahmadi, Saba and Zhang, Le and Agrawal, Aishwarya},
  year={2024}
}
```
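
**Example: Scoring Minimally Different Captions**

The same encoders can be used to compare an image against captions that differ in a single detail, the kind of fine-grained distinction VisMin-CLIP is fine-tuned to capture. The sketch below is illustrative only: the captions and the COCO image are arbitrary examples, not drawn from the VisMin benchmark, and it reuses `model`, `preprocess`, `tokenizer`, and `device` from the snippets above.

```python
import requests
import torch
from PIL import Image

# Reuses `model`, `preprocess`, `tokenizer`, and `device` defined in the usage snippets above.
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = preprocess(Image.open(requests.get(url, stream=True).raw)).unsqueeze(0).to(device)

# Two captions that differ only in a small detail (here, the object count); illustrative examples.
captions = ["two cats lying on a couch", "three cats lying on a couch"]
text = tokenizer(captions).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    # Cosine similarity of the image against each caption; the correct caption should score higher.
    scores = (image_features @ text_features.T).squeeze(0)

for caption, score in zip(captions, scores.tolist()):
    print(f"{caption}: {score:.4f}")
```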