Pedro Cuenca committed
Commit
ac3e482
1 Parent(s): 90cc46a

More details in model card.

Files changed (1)
  1. README.md +37 -8
README.md CHANGED
@@ -1,10 +1,39 @@
  ## VQGAN-f16-16384
 
- Model converted to JAX from [boris/vqgan_f16_16384](https://huggingface.co/boris/vqgan_f16_16384).
-
- Model finetuned with [taming-transformers](https://github.com/CompVis/taming-transformers):
- * Training run
-   * [Part 1](https://wandb.ai/wandb/hf-flax-dalle-mini/runs/2021-07-09T15-33-11_dalle_vqgan?workspace=user-borisd13) - started from [vqgan_imagenet_f16_16384 checkpoint](https://heibox.uni-heidelberg.de/d/a7530b09fed84f80a887/) (pretrained on ImageNet)
-   * [Part 2](https://wandb.ai/wandb/hf-flax-dalle-mini/runs/2021-07-09T21-42-07_dalle_vqgan?workspace=user-borisd13) - continuation from Part 1
- * Dataset: subset of 2,268,720 images processed once originating from [Conceptual Captions 3M](https://ai.google.com/research/ConceptualCaptions/) and [OpenAI subset of YFCC100M](https://github.com/openai/CLIP/blob/main/data/yfcc100m.md)
- * Checkpoint uploaded from last artifact version (see training run)
+ ### Model Description
+
+ This is a Flax/JAX implementation of VQGAN, which learns a codebook of context-rich visual parts by leveraging both convolutional methods and transformers. It was introduced in [Taming Transformers for High-Resolution Image Synthesis](https://compvis.github.io/taming-transformers/) ([CVPR paper](https://openaccess.thecvf.com/content/CVPR2021/html/Esser_Taming_Transformers_for_High-Resolution_Image_Synthesis_CVPR_2021_paper.html)).
+
+ The model encodes images as fixed-length sequences of tokens taken from the codebook.
+
+ This version of the model uses a reduction factor `f=16` and a vocabulary of `16,384` tokens.
+
+ As an example of how the reduction factor works, images of size `256x256` are encoded to sequences of `256` tokens: `256/16 * 256/16`. Images of `512x512` would result in sequences of `1024` tokens.
+
+ ### Datasets Used for Training
+
+ * ImageNet. We didn't train this model from scratch. Instead, we started from [a checkpoint pre-trained on ImageNet](https://heibox.uni-heidelberg.de/d/a7530b09fed84f80a887/).
+ * [Conceptual Captions 3M](https://ai.google.com/research/ConceptualCaptions/) (CC3M).
+ * [OpenAI subset of YFCC100M](https://github.com/openai/CLIP/blob/main/data/yfcc100m.md).
+
+ We fine-tuned on CC3M and YFCC100M to improve the encoding quality of people and faces, which are not very well represented in ImageNet. We used a subset of 2,268,720 images from CC3M and YFCC100M for this purpose.
+
+ ### Training Process
+
+ Fine-tuning was performed in PyTorch using [taming-transformers](https://github.com/CompVis/taming-transformers). The full training process and model preparation comprised these steps:
+
+ * Pre-training on ImageNet. Previously performed. We used [this checkpoint](https://heibox.uni-heidelberg.de/d/a7530b09fed84f80a887).
+ * Fine-tuning, [Part 1](https://wandb.ai/wandb/hf-flax-dalle-mini/runs/2021-07-09T15-33-11_dalle_vqgan?workspace=user-borisd13).
+ * Fine-tuning, [Part 2](https://wandb.ai/wandb/hf-flax-dalle-mini/runs/2021-07-09T21-42-07_dalle_vqgan?workspace=user-borisd13) – continuation from Part 1. The final checkpoint was uploaded to [boris/vqgan_f16_16384](https://huggingface.co/boris/vqgan_f16_16384).
+ * Conversion to JAX, which is the model described in this card.
+
+ ### How to Use
+
+ The checkpoint can be loaded using [Suraj Patil's implementation](https://github.com/patil-suraj/vqgan-jax) of `VQModel`.
+
+ * Encoding. `coming soon`.
+ * Decoding. `coming soon`.
+
+ ### Other
+
+ This model was successfully used as part of the implementation of [DALL·E mini](https://github.com/borisdayma/dalle-mini). Our [report](https://wandb.ai/dalle-mini/dalle-mini/reports/DALL-E-mini--Vmlldzo4NjIxODA) contains more details on how to leverage it in an image encoding / generation pipeline.
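The reduction-factor arithmetic described in the updated card can be sketched as a small helper. This is illustrative only; `token_sequence_length` is a hypothetical function, not part of the model's API:

```python
def token_sequence_length(height: int, width: int, f: int = 16) -> int:
    """Number of codebook tokens produced for an image of the given size.

    With reduction factor f, the encoder maps an HxW image to an
    (H/f)x(W/f) grid of token indices, flattened into one sequence.
    Assumes H and W are multiples of f, as in the card's examples.
    """
    if height % f or width % f:
        raise ValueError("image size must be a multiple of the reduction factor")
    return (height // f) * (width // f)

# Examples from the card (f=16):
print(token_sequence_length(256, 256))  # 256
print(token_sequence_length(512, 512))  # 1024
```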