Ross Wightman committed on
Commit 8075f1d
1 Parent(s): b6f7d8d

Update README

Files changed (1)
  1. README.md +85 -7
README.md CHANGED
@@ -6,11 +6,12 @@ license: mit
  # Table of Contents

  1. [Model Details](#model-details)
- 1. [Uses](#uses)
- 1. [Training Details](#training-details)
- 1. [Evaluation](#evaluation)
- 1. [Citation](#citation)
- 1. [How To Get Started With the Model](#how-to-get-started-with-the-model)


  # Model Details
@@ -19,6 +20,8 @@ license: mit

  A CLIP ViT L/14 model trained with the LAION-2B English subset of LAION-5B (https://laion.ai/blog/laion-5b/) using OpenCLIP (https://github.com/mlfoundations/open_clip).

  # Uses

  As per the original OpenAI CLIP models, this model is intended as a research output for research communities. We hope that this model will enable researchers to better understand and explore zero-shot, arbitrary image classification. We also hope it can be used for interdisciplinary studies of the potential impact of such models.
@@ -55,7 +58,74 @@ This model was trained with the 2 Billion sample English subset of LAION-5B (htt

  ## Training Procedure

- **TODO** - add SLURM script, hparams.

  # Evaluation

@@ -71,7 +141,15 @@ The testing is performed with VTAB+ (A combination of VTAB (https://arxiv.org/ab

  ## Results

- **TODO** - full zero-shot and retrieval benchmark results

  # Citation

  # Table of Contents

  1. [Model Details](#model-details)
+ 2. [Uses](#uses)
+ 3. [Training Details](#training-details)
+ 4. [Evaluation](#evaluation)
+ 5. [Acknowledgements](#acknowledgements)
+ 6. [Citation](#citation)
+ 7. [How To Get Started With the Model](#how-to-get-started-with-the-model)


  # Model Details

  A CLIP ViT L/14 model trained with the LAION-2B English subset of LAION-5B (https://laion.ai/blog/laion-5b/) using OpenCLIP (https://github.com/mlfoundations/open_clip).

+ Model training ('babysitting') was done by Ross Wightman on the [JUWELS Booster](https://apps.fz-juelich.de/jsc/hps/juwels/booster-overview.html) supercomputer. See acknowledgements below.
+
  # Uses

  As per the original OpenAI CLIP models, this model is intended as a research output for research communities. We hope that this model will enable researchers to better understand and explore zero-shot, arbitrary image classification. We also hope it can be used for interdisciplinary studies of the potential impact of such models.

  ## Training Procedure

+ The model was trained on 384 A100 GPUs using 200M sample 'virtual' epochs, where dataset shards were sampled with replacement. It was trained for 160 virtual epochs, for a total of 32B samples seen.
+
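As a rough, hypothetical illustration of the 'virtual' epoch idea (not the actual webdataset pipeline used by OpenCLIP): shards are drawn with replacement until a fixed sample budget is reached, rather than iterating over the dataset exactly once. The shard count below matches the `{00000..23295}.tar` pattern in the SLURM script further down; the per-shard sample count is an arbitrary assumption.

```python
import random

shards = [f"{i:05d}.tar" for i in range(23296)]  # matches {00000..23295}.tar
samples_per_shard = 10_000                       # assumed average, illustration only
virtual_epoch_samples = 200_000_000              # 200M samples per 'virtual' epoch

def virtual_epoch(seed: int):
    """Yield shard names drawn *with replacement* until the sample budget is met."""
    rng = random.Random(seed)
    seen = 0
    while seen < virtual_epoch_samples:
        shard = rng.choice(shards)  # sampling with replacement
        yield shard
        seen += samples_per_shard

print(sum(1 for _ in virtual_epoch(seed=0)), "shards drawn for one virtual epoch")
```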
+ The first 68 epochs were trained with float16 AMP and a global batch size of 79K (208 per GPU). The run initially reached epoch 75, where the loss spiked and training failed with NaN.
+
+ Romain Beaumont was training H/14 and g/14 models at the same time on the Stability cluster and hit similar instabilities. Collectively we tried restarts with:
+ * different dataset shuffle seed
+ * different LR
+ * gradient clipping
+ * modifications to the architecture
+   * Norm modifications (stable norm for final, post-embed norm for the text transformer) as per https://github.com/mlfoundations/open_clip/pull/153, thanks to Phil Wang
+   * Extra attention block norms ala Normformer (https://arxiv.org/abs/2110.09456)
+   * Scaled cosine attention ala Swin-V2 (https://arxiv.org/abs/2111.09883)
+
+ None of the above ended up working. Most blew up within the same epoch as the original run, with the exception of the architecture mods.
+ * The Normformer mods significantly altered the network such that resuming did not quickly converge to previous performance; this was abandoned, but it might be worth trying from the start.
+ * Scaled cosine attention initially looked promising and lasted until epoch 90, before the loss suddenly increased and appeared to remain 'stuck' (a standalone sketch of this attention variant follows below).
+
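For reference, the snippet below is a minimal, self-contained sketch of the scaled-cosine-attention idea from Swin-V2 mentioned in the list above (cosine-similarity logits with a learnable, clamped per-head temperature). It is illustrative only and not the open_clip implementation that was tried.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaledCosineAttention(nn.Module):
    """Attention logits are cosine similarities between queries and keys, scaled by a
    learnable, clamped per-head temperature (Swin-V2 style) instead of dot products."""

    def __init__(self, dim: int, num_heads: int = 8, max_logit_scale: float = 100.0):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.max_log_scale = math.log(max_logit_scale)
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # one learnable log-temperature per head, initialised to log(10)
        self.logit_scale = nn.Parameter(torch.full((num_heads, 1, 1), math.log(10.0)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim).permute(2, 0, 3, 1, 4)
        q, k, v = qkv.unbind(0)  # each (B, heads, N, head_dim)
        # cosine similarity of L2-normalised queries and keys, bounded in [-1, 1]
        attn = F.normalize(q, dim=-1) @ F.normalize(k, dim=-1).transpose(-2, -1)
        scale = self.logit_scale.clamp(max=self.max_log_scale).exp()
        attn = (attn * scale).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)

# quick shape check
x = torch.randn(2, 16, 512)
print(ScaledCosineAttention(dim=512, num_heads=8)(x).shape)  # torch.Size([2, 16, 512])
```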
+ In the end, restarting at epoch 69 with `float32` precision solved all instabilities, and training continued from there with a global batch size of 86K (224 per GPU). On A100 GPUs, `float32` had a minimal impact on throughput once `tf32` matmuls were enabled in PyTorch, running approximately 10% slower than `float16 AMP`. Romain similarly changed precision, but ended up using `bfloat16 AMP` to resolve his issues.
+
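The toy training step below illustrates where the precision choices discussed above enter in PyTorch (float16 AMP with loss scaling, bfloat16 AMP, or plain float32 with TF32 matmuls enabled). It is a schematic sketch, not the open_clip training loop, and the mode names are placeholders rather than open_clip's exact `--precision` values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# TF32 matmuls are what keep full float32 training within ~10% of float16 AMP on A100s
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

device = "cuda" if torch.cuda.is_available() else "cpu"
precision = "fp32"  # placeholder modes: "fp16_amp", "bf16_amp", "fp32"

model = nn.Sequential(nn.Linear(512, 512), nn.GELU(), nn.Linear(512, 512)).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
# loss scaling is only needed for float16 AMP; GradScaler is a no-op when disabled
scaler = torch.cuda.amp.GradScaler(enabled=(precision == "fp16_amp"))

x = torch.randn(32, 512, device=device)
target = torch.randn(32, 512, device=device)

optimizer.zero_grad()
if precision == "fp32":
    loss = F.mse_loss(model(x), target)  # plain float32 forward/backward, no autocast
else:
    amp_dtype = torch.bfloat16 if precision == "bf16_amp" else torch.float16
    with torch.autocast(device_type=device, dtype=amp_dtype):
        loss = F.mse_loss(model(x), target)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
print(f"{precision}: loss={loss.item():.4f}")
```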
+ ### SLURM Script
+
+ ```
+ #SBATCH --nodes=96
+ #SBATCH --gres=gpu:4
+ #SBATCH --ntasks-per-node=4
+ #SBATCH --cpus-per-task=6
+ #SBATCH --wait-all-nodes=1
+ #SBATCH --job-name=open_clip_laion2b
+
+ # load low-level libraries
+ ml purge
+ source /conda/bin/activate pytorch-112
+
+ export NCCL_ASYNC_ERROR_HANDLING=1
+ export CUDA_VISIBLE_DEVICES=0,1,2,3
+ export MASTER_PORT=12802
+
+ ### get the first node name as master address - customized for vgg slurm
+ ### e.g. master(gnodee[2-5],gnoded1) == gnodee2
+ echo "NODELIST="${SLURM_NODELIST}
+ master_addr=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
+ export MASTER_ADDR=$master_addr"i"
+ echo "MASTER_ADDR="$MASTER_ADDR
+
+ cd /home/me/open_clip
+ export PYTHONPATH="$PYTHONPATH:$PWD/src"
+
+ srun --cpu_bind=none,v --accel-bind=gn python -u src/training/main.py \
+     --save-frequency 1 \
+     --zeroshot-frequency 1 \
+     --train-data="/data/laion2B-en/{00000..23295}.tar" \
+     --train-num-samples=200000000 \
+     --warmup 10000 \
+     --lr "1e-3" \
+     --batch-size=224 \
+     --epochs=160 \
+     --workers=6 \
+     --model ViT-L-14 \
+     --name "L14-laion2B" \
+     --report-to "tensorboard" \
+     --seed 0 \
+     --precision 'fp32' \
+     --ddp-static-graph \
+     --local-loss \
+     --dataset-resampled \
+     --gather-with-grad \
+     --grad-checkpointing
+ ```
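Note: `--nodes=96` with `--ntasks-per-node=4` (one task per GPU) launches the 384 A100 workers referred to above, and the per-GPU `--batch-size=224` therefore corresponds to a global batch size of 384 × 224 = 86,016, i.e. the ~86K figure quoted in the training notes.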

  # Evaluation


  ## Results

+ The model achieves a 75.3 zero-shot top-1 accuracy on ImageNet-1k.
+
+ An initial round of benchmarks has been performed on a wider range of datasets, currently viewable at https://github.com/LAION-AI/CLIP_benchmark/blob/main/benchmark/results.ipynb
+
+ **TODO** - create table for just this model's metrics.
+
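For reference, a minimal zero-shot classification sketch using the OpenCLIP API is shown below; the `pretrained` tag, the image path, and the class prompts are illustrative assumptions (check `open_clip.list_pretrained()` for the tag matching this checkpoint).

```python
import torch
from PIL import Image
import open_clip

# NOTE: the pretrained tag below is an assumption for illustration;
# verify the correct tag for this checkpoint with open_clip.list_pretrained().
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-L-14", pretrained="laion2b_s32b_b82k"
)
model.eval()

image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # any local image
class_prompts = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
text = open_clip.tokenize(class_prompts)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(class_prompts, probs[0].tolist())))
```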
+ # Acknowledgements
+
+ We acknowledge the Gauss Centre for Supercomputing e.V. (http://gauss-centre.eu) for funding this part of the work by providing computing time through the John von Neumann Institute for Computing (NIC) on the GCS Supercomputer JUWELS Booster at Jülich Supercomputing Centre (JSC).

  # Citation