Model Card for CLIP ViT-H/14 frozen xlm roberta large - LAION-5B

Table of Contents

  1. Model Details
  2. Uses
  3. Training Details
  4. Evaluation
  5. Acknowledgements
  6. Citation
  7. How To Get Started With the Model

Model Details

Model Description

A CLIP model with a frozen ViT-H/14 image tower and an XLM-RoBERTa large text tower, trained on LAION-5B (https://laion.ai/blog/laion-5b/) using OpenCLIP (https://github.com/mlfoundations/open_clip).

The model was trained by Romain Beaumont on the stability.ai cluster.


Uses

Direct Use

Zero-shot image classification, image and text retrieval, among others.
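
As a rough illustration of how zero-shot classification works once image and text embeddings are available, the sketch below scores one image against a set of prompts using cosine similarity followed by a softmax. The 3-dimensional embeddings are toy values, and the function names and the `scale` of 100 are assumptions for illustration, not values taken from this model.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def zero_shot_probs(image_emb, text_embs, scale=100.0):
    """Softmax over scaled cosine similarities between one image and each prompt.

    `scale` stands in for the learned logit scale (hypothetical value).
    """
    img = normalize(image_emb)
    logits = [scale * sum(a * b for a, b in zip(img, normalize(t))) for t in text_embs]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy embeddings; a real model would produce them from an image and from prompts.
image = [0.9, 0.1, 0.0]
prompts = {
    "a photo of a dog": [1.0, 0.0, 0.0],
    "une photo d'un chat": [0.0, 1.0, 0.0],
}
probs = zero_shot_probs(image, list(prompts.values()))
best = max(zip(prompts, probs), key=lambda pair: pair[1])[0]
print(best)
```

Because the text tower is multilingual, the prompts may be written in different languages; the scoring step stays the same.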

Downstream Use

Image classification and other image task fine-tuning, linear probe image classification, image generation guiding and conditioning, among others.

Training Details

Training Data

This model was trained on the full LAION-5B dataset (https://laion.ai/blog/laion-5b/).

Training Procedure

Trained with a batch size of 90k for 13B samples of LAION-5B; see https://wandb.ai/rom1504/open-clip/reports/xlm-roberta-large-unfrozen-vit-h-14-frozen--VmlldzoyOTc3ODY3
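
To put those two numbers in perspective, the short calculation below estimates the number of optimizer steps. It assumes one step per global batch and treats "90k" and "13B" as round figures, so the result is only an approximation.

```python
samples_seen = 13_000_000_000  # "13B samples" from the text above
batch_size = 90_000            # "90k" from the text above; the exact value may differ

# Assumption: one optimizer step per global batch.
steps = samples_seen / batch_size
print(f"approx. optimizer steps: {steps:,.0f}")
```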

The visual tower is a ViT-H/14; the text tower is XLM-RoBERTa large, initialized from pretrained weights.

The H/14 was initialized from https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K and kept frozen during training.


Evaluation

Evaluation was done with the code in the LAION CLIP Benchmark suite.

Testing Data, Factors & Metrics

Testing Data

Testing is performed with VTAB+ (a combination of VTAB (https://arxiv.org/abs/1910.04867) with additional robustness datasets) for classification, and with COCO and Flickr for retrieval.
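
The text above does not name the retrieval metric. Retrieval benchmarks of this kind are commonly scored with recall@k, so the sketch below assumes that metric. It also simplifies the setup to one correct candidate per query (real COCO and Flickr data have several captions per image), and the similarity matrix is made up.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose correct candidate appears in the top k.

    similarity[i][j] is the score between query i and candidate j.
    Assumption: the correct candidate for query i is candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        hits += i in top_k
    return hits / len(similarity)

# Toy text-to-image similarity matrix (3 queries, 3 candidates).
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.7, 0.6],
]
r1 = recall_at_k(sim, 1)
r2 = recall_at_k(sim, 2)
print(r1, r2)
```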


Results

The model achieves 77.0% on ImageNet-1k (vs. 78% for the English H/14); see results_xlm_roberta_large.png.

On zero-shot classification on ImageNet with translated prompts, this model reaches:

This model reaches strong results in both English and other languages.


Acknowledgements

Thanks to stability.ai for the compute used to train this model.



Citation

In addition to the forthcoming LAION-5B (https://laion.ai/blog/laion-5b/) paper, please cite:

How To Get Started With the Model
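
A minimal sketch of loading the model through OpenCLIP and embedding one image with a few prompts. The architecture name and pretrained tag below are assumptions based on OpenCLIP's naming and should be checked against `open_clip.list_pretrained()`; the `embed` helper and the file path in the usage note are hypothetical. Running it needs `open_clip_torch`, `torch`, `transformers`, and `Pillow`, and downloads the weights on first use.

```python
MODEL_NAME = "xlm-roberta-large-ViT-H-14"    # assumed OpenCLIP architecture name
PRETRAINED_TAG = "frozen_laion5b_s13b_b90k"  # assumed OpenCLIP pretrained tag

def embed(image_path, texts):
    """Return L2-normalized image and text features for one image and a list of prompts."""
    # Imported here so that defining the function does not require the heavy dependencies.
    import torch
    import open_clip
    from PIL import Image

    model, _, preprocess = open_clip.create_model_and_transforms(
        MODEL_NAME, pretrained=PRETRAINED_TAG
    )
    tokenizer = open_clip.get_tokenizer(MODEL_NAME)
    model.eval()

    image = preprocess(Image.open(image_path)).unsqueeze(0)  # add a batch dimension
    tokens = tokenizer(texts)
    with torch.no_grad():
        image_features = model.encode_image(image)
        text_features = model.encode_text(tokens)

    # Normalize so that dot products are cosine similarities.
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    return image_features, text_features
```

For example, `embed("cat.jpg", ["a photo of a cat", "ein Foto von einem Hund"])` would return one image vector and two text vectors; `image_features @ text_features.T` then gives the cosine similarities.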


