---
license: mit
train: false
inference: false
pipeline_tag: image-classification
---
## CLIP-ViT-H-14-laion2B-2bit_g16_s128-HQQ
This is a version of the ViT-H-14 vision model, based on timm's ```vit_huge_patch14_clip_224.laion2b``` and quantized to 2-bit via Half-Quadratic Quantization (HQQ): https://mobiusml.github.io/hqq_blog/
This 2-bit model achieves 0.716 zero-shot top-1 accuracy on ImageNet, outperforming a full-precision ViT-B-32 (0.664).
### Basic Usage
To run the model, install the HQQ library from https://github.com/mobiusml/hqq and use it as follows:
``` Python
from hqq.engine.timm import HQQtimm
model = HQQtimm.from_quantized("mobiuslabsgmbh/CLIP-ViT-H-14-laion2B-2bit_g16_s128-HQQ")
```
### Zero-Shot Classification
Zero-shot classification also requires the text model. Here is a complete example:
``` Python
# Install the dependencies first (notebook syntax; drop the "!" in a shell):
# !pip install open_clip_torch
# !pip install Pillow
import numpy as np
import open_clip
import requests
import torch
from hqq.engine.timm import HQQtimm
from PIL import Image

# Full-precision text encoder and preprocessing from open_clip (runs on CPU here)
orig_model, _, preprocess = open_clip.create_model_and_transforms('ViT-H-14', pretrained='laion2B-s32B-b79K')
tokenizer = open_clip.get_tokenizer('ViT-H-14')
model_text = orig_model.encode_text

# 2-bit HQQ-quantized vision encoder
model_visual = HQQtimm.from_quantized("mobiuslabsgmbh/CLIP-ViT-H-14-laion2B-2bit_g16_s128-HQQ")

###############################################################
# Add your own templates here, we provide simple ones below.
# https://github.com/openai/CLIP/blob/main/data/prompts.md for the complete list
TEMPLATES = (
    lambda c: f'itap of a {c}.',
    lambda c: f'a origami {c}.',
    lambda c: f'a bad photo of the {c}.',
    lambda c: f'a photo of the large {c}.',
    lambda c: f'a photo of the small {c}.',
    lambda c: f'a {c} in a video game.',
    lambda c: f'art of the {c}.',
)

@torch.no_grad()
def forward_image(img):
    # Preprocess a PIL image and return its L2-normalized embedding (fp16, on GPU)
    x = preprocess(img).unsqueeze(0)
    f = model_visual(x.half().cuda())
    f /= torch.norm(f, p=2, dim=-1, keepdim=True)
    return f

@torch.no_grad()
def forward_text(text_batch_list, normalize=True):
    # Encode a list of strings with the text model, then move the result to the GPU in fp16
    inputs = tokenizer(text_batch_list)
    f = model_text(inputs)
    if normalize:
        f /= torch.norm(f, p=2, dim=-1, keepdim=True)
    del inputs
    return f.half().to('cuda')

def forward_text_with_templates(text, templates=TEMPLATES, normalize=True):
    # Average the embeddings of all templates for one class name
    f = forward_text([t(text) for t in templates], normalize=False).mean(dim=0)
    if normalize:
        f /= torch.norm(f, p=2, dim=-1, keepdim=True)
    return f

def classifier_zero_shot_with_pil(img, classes):
    # One classifier vector per class; predict the class with the highest similarity
    classifiers = torch.cat([forward_text_with_templates(c).reshape([1, -1]) for c in classes], dim=0)
    img_features = forward_image(img)
    scores = torch.matmul(img_features, classifiers.T)[0].detach().cpu().numpy()
    out = classes[np.argmax(scores)]
    return out

###############################################################
#img_path_or_url = 'https://hips.hearstapps.com/hmg-prod/images/cute-photos-of-cats-looking-at-camera-1593184780.jpg' #Cat
#img_path_or_url = 'https://www.shutterstock.com/image-photo/photo-cute-golden-retriever-running-600nw-2291249193.jpg' #Dog
img_path_or_url = "https://my-sweet-usa.de/cdn/shop/products/1727.jpg" #bag of chips

img = Image.open(requests.get(img_path_or_url, stream=True).raw)
classes = ['cat', 'dog', 'car', 'tiger', 'bag of chips']
out = classifier_zero_shot_with_pil(img, classes)
print("It's a picture of a " + out) #It's a picture of a bag of chips
```
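The scoring step in the example above (average the template embeddings of each class, L2-normalize the result, then pick the class whose vector has the highest dot product with the normalized image embedding) can be tried without a GPU or a model download. The following is a minimal sketch in which plain Python lists stand in for tensors; the helper names and the 3-dimensional toy features are hypothetical and are not produced by the model.

```python
import math

def l2_normalize(vec):
    # Scale a vector to unit L2 norm
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def mean_vectors(vectors):
    # Element-wise mean over a list of equal-length vectors
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def zero_shot_scores(image_feature, template_features_per_class):
    # Cosine similarity between the image and each class's averaged template embedding
    img = l2_normalize(image_feature)
    scores = {}
    for name, template_features in template_features_per_class.items():
        classifier = l2_normalize(mean_vectors(template_features))
        scores[name] = sum(a * b for a, b in zip(img, classifier))
    return scores

# Toy stand-ins for text embeddings: two "templates" per class (hypothetical values)
toy_text_features = {
    "cat": [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
    "dog": [[0.0, 1.0, 0.0], [0.1, 0.9, 0.0]],
}
# Toy stand-in for an image embedding (hypothetical values)
toy_image_feature = [1.0, 0.2, 0.1]

scores = zero_shot_scores(toy_image_feature, toy_text_features)
predicted = max(scores, key=scores.get)
print(predicted)
```

As in the full example, the template embeddings are averaged before normalization, so every class ends up with a single unit-length classifier vector and the scores are cosine similarities.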
*Limitations*:
- Only supports single-GPU runtime.
- Doesn't support fine-tuning the linear layers.