puffy310 committed on
Commit e30b686
1 Parent(s): 1155b51

Update README.md

If there are some things that contributors would like to change, that is okay.

Files changed (1)
  1. README.md +35 -26
README.md CHANGED
@@ -1,40 +1,49 @@
- # Model Card: DALL·E Mini
-
- This model is a reproduction of OpenAI’s DALL·E. Please see [this link](https://wandb.ai/dalle-mini/dalle-mini/reports/DALL-E-mini-Generate-images-from-any-text-prompt--VmlldzoyMDE4NDAy) for project-specific details. Below, we include the original DALL·E model card available on [the OpenAI github](https://github.com/openai/DALL-E/edit/master/model_card.md).
-
- ## Model Details
-
- The dVAE was developed by researchers at OpenAI to reduce the memory footprint of the transformer trained on the
- text-to-image generation task. The details involved in training the dVAE are described in [the paper][dalle_paper]. This
- model card describes the first version of the model, released in February 2021. The model consists of a convolutional
- encoder and decoder whose architectures are described [here](dall_e/encoder.py) and [here](dall_e/decoder.py), respectively.
- For questions or comments about the models or the code release, please file a Github issue.
-
- ## Model Use
-
- ### Intended Use
-
- The model is intended for others to use for training their own generative models.
-
- ### Out-of-Scope Use Cases
-
- This model is inappropriate for high-fidelity image processing applications. We also do not recommend its use as a
- general-purpose image compressor.
-
- ## Training Data
-
- The model was trained on publicly available text-image pairs collected from the internet. This data consists partly of
- [Conceptual Captions][cc] and a filtered subset of [YFCC100M][yfcc100m]. We used a subset of the filters described in
- [Sharma et al.][cc_paper] to construct this dataset; further details are described in [our paper][dalle_paper]. We will
- not be releasing the dataset.
-
- ## Performance and Limitations
-
- The heavy compression from the encoding process results in a noticeable loss of detail in the reconstructed images. This
- renders it inappropriate for applications that require fine-grained details of the image to be preserved.
-
- [dalle_paper]: https://arxiv.org/abs/2102.12092
- [cc]: https://ai.google.com/research/ConceptualCaptions
- [cc_paper]: https://www.aclweb.org/anthology/P18-1238/
- [yfcc100m]: http://projects.dfki.uni-kl.de/yfcc100m/

+ # DALL-E Mega Model Card
+
+ **This model card is a simplified version of various resources from the WandB reports.**
+
+ [Project Report](https://wandb.ai/dalle-mini/dalle-mini/reports/DALL-E-mini-Generate-images-from-any-text-prompt--VmlldzoyMDE4NDAy)
+
+ [Demo](https://huggingface.co/spaces/dalle-mini/dalle-mini)
+
+ # Model Description
+
+ DALL-E Mega is a simplified replication of the original DALL-E, trained in the open. DALL-E Mega improves on DALL-E Mini in several key ways:
+
+ * The optimizer was updated to Distributed Shampoo, which proved more efficient in a comparison of different optimizers
+
+ * A new architecture based on NormFormer and GLU variants, chosen after a comparison of transformer variants including DeepNet, Swin v2, NormFormer, Sandwich-LN, and RMSNorm with GeLU/Swish/SmeLU
+
+ * Super conditioning, which affects FID and CLIP score (see the Pareto curves)
+
+ * Improvements to the dataset guided by CLIP score exploration
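As far as the reports describe, super conditioning blends the logits of a conditioned and an unconditioned forward pass at each decoding step, in the style of classifier-free guidance; a minimal sketch (the blend rule and names here are illustrative, not the dalle-mini API):

```python
# Hedged sketch of super conditioning: at each decoding step, blend logits
# from a conditioned and an unconditioned pass. scale > 1 pushes generations
# toward the prompt (typically higher CLIP score, at some cost in FID).
def super_condition(cond_logits, uncond_logits, scale):
    return [u + scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]

super_condition([2.0, 0.5], [1.0, 1.0], scale=2.0)  # -> [3.0, 0.0]
```

With `scale=1.0` this recovers plain conditioned decoding unchanged, which is what makes the conditioning strength a tunable knob.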
+ The model uses a BART encoder to understand the text prompt; a larger BART decoder then generates VQGAN tokens, which are finally decoded into an image you can display.
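The three-stage flow described above can be sketched as follows; the function names are illustrative stand-ins for the real BART encoder, BART decoder, and VQGAN decoder, and the 16384-entry codebook size is an assumption about the VQGAN checkpoint:

```python
# Hedged sketch of the text -> image pipeline, not the dalle-mini API.
def generate(prompt, encode_text, generate_image_tokens, vqgan_decode):
    hidden = encode_text(prompt)            # BART encoder: text -> hidden states
    tokens = generate_image_tokens(hidden)  # BART decoder: hidden states -> VQGAN token ids
    return vqgan_decode(tokens)             # VQGAN decoder: token ids -> image

# Toy stand-ins, just to show the data flow end to end:
image = generate(
    "a cat",
    encode_text=lambda p: [ord(c) for c in p],
    generate_image_tokens=lambda h: [t % 16384 for t in h],  # assumed codebook size
    vqgan_decode=lambda toks: [toks],
)
```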
 
 
 
+ # Training
+
+ The simplified procedure is as follows.
+
+ * Hardware: 1 pod TPU v3-256 = 32 nodes of TPU VM v3-8 (8 TPUs per node) = 256 TPU v3 chips
+
+ * Optimizer: Distributed Shampoo
+
+ * Model partition spec: 8 model parallel x 32 data parallel
+
+ * Batch: 44 samples per model x 32 data parallel x 3 gradient accumulation steps = 4224 samples per update
+
+ * Learning rate: warmup to 0.0001 over 10,000 steps, then kept constant until plateau
+
+ * Gradient checkpointing used on each encoder/decoder layer (i.e., MHA + FFN)
+
+ * Distributed Shampoo and NormFormer optimizations have proved effective for scaling this model efficiently
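The batch arithmetic and learning-rate schedule from the bullets above can be sketched as follows; note the linear warmup shape is an assumption, since the report only specifies the target rate, the warmup length, and the constant phase:

```python
# Effective batch size per update, from the training parameters above.
effective_batch = 44 * 32 * 3   # samples/model x data parallel x grad accumulation
# effective_batch == 4224

# Possible learning-rate schedule matching the description (linear warmup
# is an assumption; only the peak, duration, and constant phase are stated).
def learning_rate(step, peak_lr=1e-4, warmup_steps=10_000):
    if step < warmup_steps:
        return peak_lr * step / warmup_steps   # ramp up to peak_lr
    return peak_lr                             # held constant until plateau
```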
+ It should also be noted that the learning rate and other parameters are sometimes adjusted on the fly.
+
+ [The Full Procedure and Technical Material](https://wandb.ai/dalle-mini/dalle-mini/reports/DALL-E-Mega-Training--VmlldzoxODMxMDI2#training-parameters)
+
+ # Limitations and Bias
+
+ Open-ended generation models tend to reproduce biases and harmful content learned from their training data. We recommend not using DALL-E Mega as an image generator for vague prompts, especially prompts related to people. We also do not recommend purposely crafting harmful prompts except for research investigation(1) or other academic purposes. More about these assumptions and biases can be found in [this document](https://docs.google.com/document/u/1/d/1C1iJYbzGN_7dfiQAjaQIbETbbzqaE3ci1Fns5TlR-UI/edit?ouid=113653653333035119417&usp=docs_home&ths=true).
+
+ 1. Even without enforced restrictions, it's just common sense: let's keep AI art comfortable and accessible for everyone.