rwightman (HF staff) committed
Commit 0b91920 (1 parent: bd0c5a9)

Update README.md

Files changed (1)
  1. README.md +25 -7
README.md CHANGED

@@ -24,9 +24,9 @@ A series of CLIP ConvNeXt-XXLarge (a custom `timm` ConvNeXt size) models trained
 
 | Model | Dataset | Resolution | AugReg | Top-1 ImageNet Zero-Shot (%) |
 | ----- | ------- | ---------- | ------------ | --------- |
- | [convnext_xxlarge.laion2b_s34b_b82k-augreg](CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg) | LAION-2B | 256x256 | RRC (0.33, 1.0), RE (0.35), SD (0.1) | 79.1 |
- | [convnext_xxlarge.laion2b_s34b_b82k-augreg-rewind](CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg-rewind) | LAION-2B | 256x256 | RRC (0.3, 1.0), RE (0.4), SD (0.1) | 79.3 |
- | [convnext_xxlarge.laion2b_s34b_b82k-augreg-soup](CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg-soup) | LAION-2B | 256x256 | N/A | 79.4 |
+ | [convnext_xxlarge.laion2b_s34b_b82k-augreg](https://huggingface.co/laion/CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg) | LAION-2B | 256x256 | RRC (0.33, 1.0), RE (0.35), SD (0.1) | 79.1 |
+ | [convnext_xxlarge.laion2b_s34b_b82k-augreg-rewind](https://huggingface.co/laion/CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg-rewind) | LAION-2B | 256x256 | RRC (0.3, 1.0), RE (0.4), SD (0.1) | 79.3 |
+ | [convnext_xxlarge.laion2b_s34b_b82k-augreg-soup](https://huggingface.co/laion/CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg-soup) | LAION-2B | 256x256 | N/A | 79.4 |
 RRC = Random Resize Crop (crop pcts), RE = Random Erasing (prob), SD = Stochastic Depth (prob) -- image tower only
 
 The core training run was performed in pieces over a period of ~ 2 months. The global batch size for the core run was 81920. The last ~10% of training was re-done at a 95744 global batch size w/ higher LR and aug than original finish. The two were averaged together in a 'soup'. See more details in [Training Details](#training-details).
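The 'soup' mentioned above is a weight average of the original-finish and rewind checkpoints, in the spirit of the model-soups citation added at the end of this diff. Below is a minimal sketch of such an average; the file names and the `state_dict` checkpoint layout are assumptions (open_clip-style training checkpoints), not the exact script used for these releases.

```python
# Sketch: uniform 'soup' of two checkpoints (assumed open_clip-style, weights under 'state_dict').
import torch

ckpt_a = torch.load("finish_original.pt", map_location="cpu")["state_dict"]  # hypothetical file name
ckpt_b = torch.load("finish_rewind.pt", map_location="cpu")["state_dict"]    # hypothetical file name

# Average floating-point tensors key by key; copy any non-float entries unchanged.
soup = {
    k: (ckpt_a[k].float() + ckpt_b[k].float()) / 2 if ckpt_a[k].is_floating_point() else ckpt_a[k]
    for k in ckpt_a
}

torch.save({"state_dict": soup}, "soup.pt")
```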
@@ -111,9 +111,9 @@ Many difficulties w/ both model numerical stability and cluster stability and pe
 |233 - 249 |Booster |1024 |256 |A100 40GB | 80 |51k | 50 |amp + bf16|0.98 |
 |250 - 256 |Stability |1024 |128 |A100 40GB | 80 |27-31k | 26-30 |amp + bf16|0.98 |
 
- JUWELS Booster has 4x A100 GPU per node w/ 4x HDR-200 IB adapters per node (200Gbit/sec per GPU). Stability setup used was 8x A100 GPU per node w/ 400Gbit/sec EFA connectivity per node (~50 GBit/sec per GPU). Significant variation in training efficiency (throughput per GPU) was observed across the various configurations. The 1024 GPU configurations across both clusters were particularly prone to crashing (or very difficult to get running w/ a 'good' set of GPUs).
+ JUWELS Booster has 4x A100 GPU per node w/ 4x HDR-200 IB adapters per node (200Gbit/sec per GPU). Stability setup used was 8x A100 GPU per node w/ 400Gbit/sec EFA networking per node (50 GBit/sec per GPU). Significant variation in training efficiency (throughput per GPU) was observed across the various configurations. The 1024 GPU configurations across both clusters were particularly prone to crashing (or very difficult to get running w/ a 'good' set of GPUs).
 
- For 256x256 models, a slurm script w/ srun below for a 128 8-GPU (40GB A100) configuration:
+ A slurm srun command line below for a 128 8-GPU (40GB A100) configuration:
 
 ```
 srun --cpu_bind=v --accel-bind=gn python -m training.main \
@@ -144,12 +144,13 @@ srun --cpu_bind=v --accel-bind=gn python -m training.main \
  --report-to "tensorboard"
 ```
 
- For the rewind of last 10%, a higher global batch size of 95744 was used w/ a higher LR and slightly increased augmentation strength. The slurm srun cmd for 136 8-GPU (40GB A100) nodes:
+ For the rewind of last 10%, a higher global batch size of 95744 was used w/ a higher LR and slightly increased augmentation strength.
 
 |Checkpoint Interval |Cluster |# GPUs|# Nodes|GPU |local BS|sample/s|sample/s/gpu|precision |adam beta2 |
 |--------------------|---------|------|-------|----------|--------|--------|------------|----------|-----------|
 |231 - 256 |stability|1088 |136 |A100 40GB | 88 |32-35k | 29-32 |amp + bf16|0.98 |
 
+ The slurm srun command line for 136 8-GPU (40GB A100) nodes:
 ```
 srun --cpu_bind=v --accel-bind=gn python -m training.main \
  --save-frequency 1 \
@@ -195,7 +196,7 @@ These models achieve between 79.1 and 79.4 top-1 zero-shot accuracy on ImageNet-
 
 ![](convnext_xxlarge_zero_shot.png)
 
- Zoom:
+ A zoom-in on final 10% w/ rewind:
 
 ![](convnext_xxlarge_zero_shot_zoom.png)
 
@@ -292,3 +293,20 @@ OpenAI CLIP paper
  howpublished = {\url{https://github.com/rwightman/pytorch-image-models}}
 }
 ```
+
+ ```
+ @InProceedings{pmlr-v162-wortsman22a,
+   title = {Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time},
+   author = {Wortsman, Mitchell and Ilharco, Gabriel and Gadre, Samir Ya and Roelofs, Rebecca and Gontijo-Lopes, Raphael and Morcos, Ari S and Namkoong, Hongseok and Farhadi, Ali and Carmon, Yair and Kornblith, Simon and Schmidt, Ludwig},
+   booktitle = {Proceedings of the 39th International Conference on Machine Learning},
+   pages = {23965--23998},
+   year = {2022},
+   editor = {Chaudhuri, Kamalika and Jegelka, Stefanie and Song, Le and Szepesvari, Csaba and Niu, Gang and Sabato, Sivan},
+   volume = {162},
+   series = {Proceedings of Machine Learning Research},
+   month = {17--23 Jul},
+   publisher = {PMLR},
+   pdf = {https://proceedings.mlr.press/v162/wortsman22a/wortsman22a.pdf},
+   url = {https://proceedings.mlr.press/v162/wortsman22a.html}
+ }
+ ```
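For reference on the checkpoints listed in the table above, a minimal zero-shot usage sketch with open_clip (`pip install open_clip_torch`), assuming a release recent enough to resolve `hf-hub:` model references; the image path and text prompts are placeholders.

```python
# Sketch: load the released 'soup' checkpoint from the Hugging Face Hub via open_clip and
# score one image against a few text prompts (standard CLIP zero-shot setup).
import torch
import open_clip
from PIL import Image

MODEL = "hf-hub:laion/CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg-soup"

model, _, preprocess = open_clip.create_model_and_transforms(MODEL)
tokenizer = open_clip.get_tokenizer(MODEL)
model.eval()

image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # placeholder image path
text = tokenizer(["a diagram", "a dog", "a cat"])           # placeholder prompts

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(probs)  # one probability per prompt for the image
```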