apple/DFN2B-CLIP-ViT-L-14-39B

A CLIP (Contrastive Language-Image Pre-training) model trained on DFN-2B. Data Filtering Networks (DFNs) are small networks used to automatically filter large pools of uncurated data. This model was trained on 2B images that were filtered from a pool of 12.8B uncurated image-text pairs (CommonPool-12.8B).

This model has been converted to PyTorch from the original JAX checkpoints from Axlearn (https://github.com/apple/axlearn). These weights are directly usable in OpenCLIP (image + text).
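As a rough illustration of using the weights in OpenCLIP, the sketch below runs zero-shot classification on a single image. It is not an official recipe: the Hub identifier is assumed from this page's title, the `ViT-L-14` tokenizer name is assumed to match the checkpoint, and the prompt template and the helper names `build_prompts` and `classify` are made up for the example. The scaling factor of 100 follows the common OpenCLIP convention rather than the model's learned logit scale.

```python
from typing import List

# Assumption: Hub identifier inferred from this page's title.
MODEL_ID = "hf-hub:apple/DFN2B-CLIP-ViT-L-14-39B"
# Assumption: the stock OpenCLIP ViT-L-14 tokenizer matches this checkpoint.
TOKENIZER_NAME = "ViT-L-14"


def build_prompts(labels: List[str], template: str = "a photo of a {}") -> List[str]:
    """Turn class names into text prompts (the template is illustrative only)."""
    return [template.format(label) for label in labels]


def classify(image_path: str, labels: List[str]) -> List[float]:
    """Return one probability per label for the image at image_path."""
    # Heavy dependencies are imported lazily so that build_prompts can be
    # used without torch, Pillow, or open_clip installed.
    import torch
    import torch.nn.functional as F
    from PIL import Image
    from open_clip import create_model_from_pretrained, get_tokenizer

    # Downloads the weights from the Hugging Face Hub on first use.
    model, preprocess = create_model_from_pretrained(MODEL_ID)
    tokenizer = get_tokenizer(TOKENIZER_NAME)
    model.eval()

    image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    text = tokenizer(build_prompts(labels))

    with torch.no_grad():
        # L2-normalize both embeddings so the dot product is cosine similarity.
        image_features = F.normalize(model.encode_image(image), dim=-1)
        text_features = F.normalize(model.encode_text(text), dim=-1)
        # Scale similarities and softmax over the candidate labels.
        probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

    return probs[0].tolist()
```

A call such as `classify("cat.jpg", ["cat", "dog"])` (with a hypothetical local file `cat.jpg`) would return two probabilities that sum to 1, one per label.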

Model Details

Model Type: Contrastive Image-Text, Zero-Shot Image Classification.
Dataset: DFN-2B
Papers:
- Data Filtering Networks: https://arxiv.org/abs/2309.17425
Examples Seen: 39B

Model Metrics

Eval Dataset	Metric
ImageNet 1k	0.8219
Caltech-101	0.9500
CIFAR-10	0.9864
CIFAR-100	0.8934
CLEVR Counts	0.3403
CLEVR Distance	0.2321
Country211	0.3198
Describable Textures	0.6681
EuroSAT	0.6819
FGVC Aircraft	0.4829
Food-101	0.9498
GTSRB	0.6329
ImageNet Sketch	0.7043
ImageNet v2	0.7570
ImageNet-A	0.6745
ImageNet-O	0.3605
ImageNet-R	0.9184
KITTI Vehicle Distance	0.2391
MNIST	0.8745
ObjectNet	0.7477
Oxford Flowers-102	0.8784
Oxford-IIIT Pet	0.9611
Pascal VOC 2007	0.8472
PatchCamelyon	0.6418
Rendered SST2	0.5815
RESISC45	0.7300
Stanford Cars	0.9465
STL-10	0.9889
SUN397	0.7594
SVHN	0.6573
Flickr	0.8467
MSCOCO	0.5957
WinoGAViL	0.5551
iWildCam	0.1857
Camelyon17	0.6540
FMoW	0.1824
Dollar Street	0.6822
GeoDE	0.9253
Average	0.68039
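
The Average row appears to be the unweighted mean of the 38 per-dataset scores listed above. A quick check, assuming every row carries equal weight (the scores are copied from the table in order):

```python
# Scores copied from the table above, ImageNet 1k through GeoDE.
scores = [
    0.8219, 0.9500, 0.9864, 0.8934, 0.3403, 0.2321, 0.3198, 0.6681,
    0.6819, 0.4829, 0.9498, 0.6329, 0.7043, 0.7570, 0.6745, 0.3605,
    0.9184, 0.2391, 0.8745, 0.7477, 0.8784, 0.9611, 0.8472, 0.6418,
    0.5815, 0.7300, 0.9465, 0.9889, 0.7594, 0.6573, 0.8467, 0.5957,
    0.5551, 0.1857, 0.6540, 0.1824, 0.6822, 0.9253,
]

# Unweighted mean over all evaluation datasets.
average = sum(scores) / len(scores)
print(len(scores), round(average, 5))  # 38 0.68039
```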

Citation

@article{fang2023data,
  title={Data Filtering Networks},
  author={Fang, Alex and Jose, Albin Madappally and Jain, Amit and Schmidt, Ludwig and Toshev, Alexander and Shankar, Vaishaal},
  journal={arXiv preprint arXiv:2309.17425},
  year={2023}
}
