Files changed (1)
  1. README.md +94 -40
README.md CHANGED
@@ -1,40 +1,94 @@
- # Model Card: DALL·E Mini
-
- This model is a reproduction of OpenAI’s DALL·E. Please see [this link](https://wandb.ai/dalle-mini/dalle-mini/reports/DALL-E-mini-Generate-images-from-any-text-prompt--VmlldzoyMDE4NDAy) for project-specific details. Below, we include the original DALL·E model card available on [the OpenAI GitHub](https://github.com/openai/DALL-E/edit/master/model_card.md).
-
- ## Model Details
-
- The dVAE was developed by researchers at OpenAI to reduce the memory footprint of the transformer trained on the
- text-to-image generation task. The details involved in training the dVAE are described in [the paper][dalle_paper]. This
- model card describes the first version of the model, released in February 2021. The model consists of a convolutional
- encoder and decoder whose architectures are described [here](dall_e/encoder.py) and [here](dall_e/decoder.py), respectively.
- For questions or comments about the models or the code release, please file a GitHub issue.
-
- ## Model Use
-
- ### Intended Use
-
- The model is intended for others to use for training their own generative models.
-
- ### Out-of-Scope Use Cases
-
- This model is inappropriate for high-fidelity image processing applications. We also do not recommend its use as a
- general-purpose image compressor.
-
- ## Training Data
-
- The model was trained on publicly available text-image pairs collected from the internet. This data consists partly of
- [Conceptual Captions][cc] and a filtered subset of [YFCC100M][yfcc100m]. We used a subset of the filters described in
- [Sharma et al.][cc_paper] to construct this dataset; further details are described in [our paper][dalle_paper]. We will
- not be releasing the dataset.
-
- ## Performance and Limitations
-
- The heavy compression from the encoding process results in a noticeable loss of detail in the reconstructed images. This
- renders it inappropriate for applications that require fine-grained details of the image to be preserved.
-
- [dalle_paper]: https://arxiv.org/abs/2102.12092
- [cc]: https://ai.google.com/research/ConceptualCaptions
- [cc_paper]: https://www.aclweb.org/anthology/P18-1238/
- [yfcc100m]: http://projects.dfki.uni-kl.de/yfcc100m/
-
+ eprint={2102.12092},
+ archivePrefix={arXiv},
+ primaryClass={cs.CV}
+ }
+ @misc{
+ title={Learning Transferable Visual Models From Natural Language Supervision},
+ author={Alec Radford and Jong Wook Kim and Chris Hallacy and Aditya Ramesh and Gabriel Goh and Sandhini Agarwal and Girish Sastry and Amanda Askell and Pamela Mishkin and Jack Clark and Gretchen Krueger and Ilya Sutskever},
+ year={2021},
+ eprint={2103.00020},
+ archivePrefix={arXiv},
+ primaryClass={cs.CV}
+ }
+ @misc{
+ title={Taming Transformers for High-Resolution Image Synthesis},
+ author={Patrick Esser and Robin Rombach and Björn Ommer},
+ year={2021},
+ eprint={2012.09841},
+ archivePrefix={arXiv},
+ primaryClass={cs.CV}
+ }
+ @misc{
+ title={BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension},
+ author={Mike Lewis and Yinhan Liu and Naman Goyal and Marjan Ghazvininejad and Abdelrahman Mohamed and Omer Levy and Ves Stoyanov and Luke Zettlemoyer},
+ year={2019},
+ eprint={1910.13461},
+ archivePrefix={arXiv},
+ primaryClass={cs.CL}
+ }
+ @misc{
+ title={Scalable Second Order Optimization for Deep Learning},
+ author={Rohan Anil and Vineet Gupta and Tomer Koren and Kevin Regan and Yoram Singer},
+ year={2021},
+ eprint={2002.09018},
+ archivePrefix={arXiv},
+ primaryClass={cs.LG}
+ }
+ @misc{
+ title={GLU Variants Improve Transformer},
+ author={Noam Shazeer},
+ year={2020},
+ url={https://arxiv.org/abs/2002.05202}
+ }
+ @misc{
+ title={DeepNet: Scaling Transformers to 1,000 Layers},
+ author={Wang, Hongyu and Ma, Shuming and Dong, Li and Huang, Shaohan and Zhang, Dongdong and Wei, Furu},
+ year={2022},
+ eprint={2203.00555},
+ archivePrefix={arXiv},
+ primaryClass={cs.LG}
+ }
+ @misc{
+ title={NormFormer: Improved Transformer Pretraining with Extra Normalization},
+ author={Sam Shleifer and Jason Weston and Myle Ott},
+ year={2021},
+ eprint={2110.09456},
+ archivePrefix={arXiv},
+ primaryClass={cs.CL}
+ }
+ @inproceedings{
+ title={Swin Transformer V2: Scaling Up Capacity and Resolution},
+ author={Ze Liu and Han Hu and Yutong Lin and Zhuliang Yao and Zhenda Xie and Yixuan Wei and Jia Ning and Yue Cao and Zheng Zhang and Li Dong and Furu Wei and Baining Guo},
+ booktitle={International Conference on Computer Vision and Pattern Recognition (CVPR)},
+ year={2022}
+ }
+ @misc{
+ title={CogView: Mastering Text-to-Image Generation via Transformers},
+ author={Ming Ding and Zhuoyi Yang and Wenyi Hong and Wendi Zheng and Chang Zhou and Da Yin and Junyang Lin and Xu Zou and Zhou Shao and Hongxia Yang and Jie Tang},
+ year={2021},
+ eprint={2105.13290},
+ archivePrefix={arXiv},
+ primaryClass={cs.CV}
+ }
+ @misc{
+ title={Root Mean Square Layer Normalization},
+ author={Biao Zhang and Rico Sennrich},
+ year={2019},
+ eprint={1910.07467},
+ archivePrefix={arXiv},
+ primaryClass={cs.LG}
+ }
+ @misc{
+ title={Sinkformers: Transformers with Doubly Stochastic Attention},
+ url={https://arxiv.org/abs/2110.11773},
+ author={Sander, Michael E. and Ablin, Pierre and Blondel, Mathieu and Peyré, Gabriel},
+ publisher={arXiv},
+ year={2021}
+ }
+ @misc{
+ title={Smooth activations and reproducibility in deep networks},
+ url={https://arxiv.org/abs/2010.09931},
+ author={Shamir, Gil I. and Lin, Dong and Coviello, Lorenzo},
+ publisher={arXiv},
+ year={2020}
+ }
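
For context on the dVAE described in the removed model card above: since the encoder and decoder are the released openai/DALL-E checkpoints, the detail loss called out under "Performance and Limitations" can be seen directly with an encode/decode round trip. Below is a minimal sketch, assuming the `load_model`, `map_pixels`, and `unmap_pixels` helpers and the CDN checkpoint URLs documented in the openai/DALL-E repository; the local `input.png` path is just a placeholder.

```python
# Round-trip sketch: image -> 32x32 grid of discrete dVAE codes -> reconstruction.
# Assumes the API and checkpoint URLs from the openai/DALL-E repo; input.png is a placeholder.
import torch
import torch.nn.functional as F
import torchvision.transforms as T
from PIL import Image
from dall_e import load_model, map_pixels, unmap_pixels

dev = torch.device("cpu")
enc = load_model("https://cdn.openai.com/dall-e/encoder.pkl", dev)  # convolutional encoder
dec = load_model("https://cdn.openai.com/dall-e/decoder.pkl", dev)  # convolutional decoder

# Preprocess a 256x256 RGB image into the pixel range the dVAE expects.
img = Image.open("input.png").convert("RGB").resize((256, 256))
x = map_pixels(T.ToTensor()(img).unsqueeze(0).to(dev))

# Encode to discrete codes: argmax over the per-position code logits.
z = torch.argmax(enc(x), dim=1)  # shape [1, 32, 32], values in [0, enc.vocab_size)

# Decode from one-hot codes; the reconstruction visibly loses fine detail,
# which is why the card warns against high-fidelity or compression use.
z_onehot = F.one_hot(z, num_classes=enc.vocab_size).permute(0, 3, 1, 2).float()
x_rec = unmap_pixels(torch.sigmoid(dec(z_onehot).float()[:, :3]))
T.ToPILImage(mode="RGB")(x_rec[0]).save("reconstruction.png")
```

The point of this design is the memory saving the card mentions: each 256×256 image is reduced to a 32×32 grid of 1,024 discrete tokens, so the downstream transformer models short token sequences instead of raw pixels.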