apple/DFN2B-CLIP-ViT-L-14-39B

metadata

license: other
license_name: apple-sample-code-license
license_link: LICENSE

A CLIP (Contrastive Language-Image Pre-training) model trained on DFN-2B. Data Filtering Networks (DFNs) are small networks used to automatically filter large pools of uncurated data. This model was trained on 2B images that were filtered from a pool of 12.8B uncurated image-text pairs (CommonPool-12.8B).
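As a rough illustration of the filtering idea described above, the sketch below keeps the highest-scoring fraction of a toy pool. The per-pair scores, the `keep_top_fraction` helper, and the keep-the-top-fraction rule itself are assumptions made for illustration, not the procedure used to build DFN-2B. The only figure taken from this card is the ratio of 2B kept examples to a 12.8B pool.

```python
def keep_top_fraction(scores, fraction):
    """Return the indices of the highest-scoring items, keeping `fraction` of the pool.

    Hypothetical helper: `scores` stands in for per-pair quality scores
    that a filtering network might assign to image-text pairs.
    """
    n_keep = int(len(scores) * fraction)
    # Rank indices from highest to lowest score, then keep the first n_keep.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])


# Ratio implied by the model card: 2B examples kept out of a 12.8B pool.
keep_fraction = 2e9 / 12.8e9

# Made-up scores for a toy pool of 8 image-text pairs.
toy_scores = [0.91, 0.12, 0.55, 0.78, 0.05, 0.33, 0.64, 0.20]

# A toy fraction of 0.25 is used here so that 2 of the 8 pairs survive.
kept = keep_top_fraction(toy_scores, 0.25)

print(f"keep fraction implied by the card: {keep_fraction:.5f}")
print("kept indices in the toy pool:", kept)
```

On these numbers the filter retains roughly one pair in six of the original pool, so the choice of scoring function largely determines what the model is trained on.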

This model has been converted to PyTorch from the original JAX checkpoints from Axlearn (https://github.com/apple/axlearn). These weights are directly usable in OpenCLIP (image + text).

Model Details

Model Type: Contrastive Image-Text, Zero-Shot Image Classification.
Dataset: DFN-2B
Papers:
- Data Filtering Networks: https://arxiv.org/abs/2309.17425
Examples Seen: 39B

Model Metrics

Eval Dataset	Metric
ImageNet 1k	0.8219
Caltech-101	0.9500
CIFAR-10	0.9864
CIFAR-100	0.8934
CLEVR Counts	0.3403
CLEVR Distance	0.2321
Country211	0.3198
Describable Textures	0.6681
EuroSAT	0.6819
FGVC Aircraft	0.4829
Food-101	0.9498
GTSRB	0.6329
ImageNet Sketch	0.7043
ImageNet v2	0.7570
ImageNet-A	0.6745
ImageNet-O	0.3605
ImageNet-R	0.9184
KITTI Vehicle Distance	0.2391
MNIST	0.8745
ObjectNet	0.7477
Oxford Flowers-102	0.8784
Oxford-IIIT Pet	0.9611
Pascal VOC 2007	0.8472
PatchCamelyon	0.6418
Rendered SST2	0.5815
RESISC45	0.7300
Stanford Cars	0.9465
STL-10	0.9889
SUN397	0.7594
SVHN	0.6573
Flickr	0.8467
MSCOCO	0.5957
WinoGAViL	0.5551
iWildCam	0.1857
Camelyon17	0.6540
FMoW	0.1824
Dollar Street	0.6822
GeoDE	0.9253
Average	0.68039
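The reported average appears to be the unweighted mean of the 38 per-dataset scores in the table. The short check below recomputes it from the values copied from the table. The dictionary name `scores` and the rounding to five decimals are choices made for this sketch.

```python
# Scores copied from the table above, in the same order.
scores = {
    "ImageNet 1k": 0.8219,
    "Caltech-101": 0.9500,
    "CIFAR-10": 0.9864,
    "CIFAR-100": 0.8934,
    "CLEVR Counts": 0.3403,
    "CLEVR Distance": 0.2321,
    "Country211": 0.3198,
    "Describable Textures": 0.6681,
    "EuroSAT": 0.6819,
    "FGVC Aircraft": 0.4829,
    "Food-101": 0.9498,
    "GTSRB": 0.6329,
    "ImageNet Sketch": 0.7043,
    "ImageNet v2": 0.7570,
    "ImageNet-A": 0.6745,
    "ImageNet-O": 0.3605,
    "ImageNet-R": 0.9184,
    "KITTI Vehicle Distance": 0.2391,
    "MNIST": 0.8745,
    "ObjectNet": 0.7477,
    "Oxford Flowers-102": 0.8784,
    "Oxford-IIIT Pet": 0.9611,
    "Pascal VOC 2007": 0.8472,
    "PatchCamelyon": 0.6418,
    "Rendered SST2": 0.5815,
    "RESISC45": 0.7300,
    "Stanford Cars": 0.9465,
    "STL-10": 0.9889,
    "SUN397": 0.7594,
    "SVHN": 0.6573,
    "Flickr": 0.8467,
    "MSCOCO": 0.5957,
    "WinoGAViL": 0.5551,
    "iWildCam": 0.1857,
    "Camelyon17": 0.6540,
    "FMoW": 0.1824,
    "Dollar Street": 0.6822,
    "GeoDE": 0.9253,
}

# Unweighted mean: every dataset counts equally, regardless of its size.
mean_score = sum(scores.values()) / len(scores)
print(f"{len(scores)} datasets, mean = {mean_score:.5f}")
```

Because the mean is unweighted, low-scoring tasks such as iWildCam and FMoW pull the average down as much as ImageNet 1k pulls it up.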

Model Usage

With OpenCLIP

import torch
import torch.nn.functional as F
from urllib.request import urlopen
from PIL import Image
from open_clip import create_model_from_pretrained, get_tokenizer 

model, preprocess = create_model_from_pretrained('hf-hub:apple/DFN2B-CLIP-ViT-L-14-39B')
tokenizer = get_tokenizer('ViT-L-14')

image = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))
image = preprocess(image).unsqueeze(0)

labels_list = ["a dog", "a cat", "a donut", "a beignet"]
text = tokenizer(labels_list, context_length=model.context_length)

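with torch.no_grad(), torch.autocast(device_type="cuda", enabled=torch.cuda.is_available()):
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # The sigmoid-plus-logit_bias form applies to SigLIP-style models; for a CLIP model,
    # softmax over the scaled similarities gives the label probabilities.
    text_probs = torch.softmax(image_features @ text_features.T * model.logit_scale.exp(), dim=-1)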

zipped_list = list(zip(labels_list, [round(p.item(), 3) for p in text_probs[0]]))
print("Label probabilities: ", zipped_list)
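To make the scoring step concrete without downloading any weights, here is a minimal pure-Python sketch of a common way to turn CLIP-style embeddings into label probabilities: L2-normalize both sides, take cosine similarities, multiply by a logit scale, and apply softmax. The 3-dimensional vectors, the helper names, and the logit scale of 100.0 are made up for illustration and are not values from this model.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_probabilities(image_vec, text_vecs, logit_scale):
    """Softmax over scaled cosine similarities between one image and several texts."""
    image_vec = l2_normalize(image_vec)
    logits = []
    for text_vec in text_vecs:
        text_vec = l2_normalize(text_vec)
        cosine = sum(a * b for a, b in zip(image_vec, text_vec))
        logits.append(logit_scale * cosine)
    return softmax(logits)


# Hypothetical embeddings; a real model produces much higher-dimensional vectors.
labels = ["a dog", "a cat", "a beignet"]
image_vec = [0.2, 0.1, 0.9]
text_vecs = [[0.9, 0.1, 0.1], [0.1, 0.9, 0.2], [0.1, 0.2, 0.9]]

probs = label_probabilities(image_vec, text_vecs, logit_scale=100.0)
best = labels[probs.index(max(probs))]
print("predicted label:", best)
```

A large logit scale sharpens the distribution, so even modest differences in cosine similarity produce a confident top label.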

Citation

@article{fang2023data,
  title={Data Filtering Networks},
  author={Fang, Alex and Jose, Albin Madappally and Jain, Amit and Schmidt, Ludwig and Toshev, Alexander and Shankar, Vaishaal},
  journal={arXiv preprint arXiv:2309.17425},
  year={2023}
}