Unconditional Image Generation
Fairseq
youngsheen committed on
Commit
888a3a0
•
2 Parent(s): 0f79f06 4f691c1

add base_8k_stage1_400epoch

Files changed (1)
  1. README.md +138 -3
README.md CHANGED
@@ -1,3 +1,138 @@
1
- ---
2
- license: mit
3
- ---
1
+ ---
2
+ license: mit
3
+ datasets:
4
+ - ILSVRC/imagenet-1k
5
+ pipeline_tag: unconditional-image-generation
6
+ library_name: fairseq
7
+ ---
8
+ <h1 align="center"> Stabilize the Latent Space for Image Autoregressive Modeling: A Unified Perspective
9
+ </h1>
10
+
11
+ <div align="center">
12
+
13
+ [![arXiv](https://img.shields.io/badge/arXiv%20paper-2410.12490-b31b1b.svg)](https://arxiv.org/abs/2410.12490)
14
+ [![benchmark](https://img.shields.io/badge/Rank%204-Image%20Generation%20on%20ImageNet%20%28AR%29-32B1B4?logo=data%3Aimage%2Fsvg%2Bxml%3Bbase64%2CPHN2ZyB3aWR0aD0iNjA2IiBoZWlnaHQ9IjYwNiIgeG1sbnM9Imh0dHA6Ly93d3cudzMub3JnLzIwMDAvc3ZnIiB4bWxuczp4bGluaz0iaHR0cDovL3d3dy53My5vcmcvMTk5OS94bGluayIgb3ZlcmZsb3c9ImhpZGRlbiI%2BPGRlZnM%2BPGNsaXBQYXRoIGlkPSJjbGlwMCI%2BPHJlY3QgeD0iLTEiIHk9Ii0xIiB3aWR0aD0iNjA2IiBoZWlnaHQ9IjYwNiIvPjwvY2xpcFBhdGg%2BPC9kZWZzPjxnIGNsaXAtcGF0aD0idXJsKCNjbGlwMCkiIHRyYW5zZm9ybT0idHJhbnNsYXRlKDEgMSkiPjxyZWN0IHg9IjUyOSIgeT0iNjYiIHdpZHRoPSI1NiIgaGVpZ2h0PSI0NzMiIGZpbGw9IiM0NEYyRjYiLz48cmVjdCB4PSIxOSIgeT0iNjYiIHdpZHRoPSI1NyIgaGVpZ2h0PSI0NzMiIGZpbGw9IiM0NEYyRjYiLz48cmVjdCB4PSIyNzQiIHk9IjE1MSIgd2lkdGg9IjU3IiBoZWlnaHQ9IjMwMiIgZmlsbD0iIzQ0RjJGNiIvPjxyZWN0IHg9IjEwNCIgeT0iMTUxIiB3aWR0aD0iNTciIGhlaWdodD0iMzAyIiBmaWxsPSIjNDRGMkY2Ii8%2BPHJlY3QgeD0iNDQ0IiB5PSIxNTEiIHdpZHRoPSI1NyIgaGVpZ2h0PSIzMDIiIGZpbGw9IiM0NEYyRjYiLz48cmVjdCB4PSIzNTkiIHk9IjE3MCIgd2lkdGg9IjU2IiBoZWlnaHQ9IjI2NCIgZmlsbD0iIzQ0RjJGNiIvPjxyZWN0IHg9IjE4OCIgeT0iMTcwIiB3aWR0aD0iNTciIGhlaWdodD0iMjY0IiBmaWxsPSIjNDRGMkY2Ii8%2BPHJlY3QgeD0iNzYiIHk9IjY2IiB3aWR0aD0iNDciIGhlaWdodD0iNTciIGZpbGw9IiM0NEYyRjYiLz48cmVjdCB4PSI0ODIiIHk9IjY2IiB3aWR0aD0iNDciIGhlaWdodD0iNTciIGZpbGw9IiM0NEYyRjYiLz48cmVjdCB4PSI3NiIgeT0iNDgyIiB3aWR0aD0iNDciIGhlaWdodD0iNTciIGZpbGw9IiM0NEYyRjYiLz48cmVjdCB4PSI0ODIiIHk9IjQ4MiIgd2lkdGg9IjQ3IiBoZWlnaHQ9IjU3IiBmaWxsPSIjNDRGMkY2Ii8%2BPC9nPjwvc3ZnPg%3D%3D)](https://paperswithcode.com/sota/image-generation-on-imagenet-256x256?tag_filter=485&p=stabilize-the-latent-space-for-image)
15
+
16
+ </div>
17
+
18
+
19
+ This is the official implementation of DiGIT ([GitHub](https://github.com/DAMO-NLP-SG/DiGIT)), accepted at NeurIPS 2024. The code will be available soon.
20
+
21
+
22
+ ## Overview
23
+
24
+
25
+ We present **DiGIT**, an autoregressive generative model that performs next-token prediction in an abstract latent space derived from self-supervised learning (SSL) models. Applying K-Means clustering to the hidden states of the DINOv2 model yields a discrete tokenizer. This method significantly improves image generation on the ImageNet dataset, achieving an FID of 4.59 on class-unconditional generation and 3.39 on class-conditional generation. The model also improves image understanding, reaching a linear-probe top-1 accuracy of 80.3%.
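
The tokenization step described above can be illustrated with a minimal sketch. It assumes the K-Means centroids form a `(K, D)` array and that each image yields a set of `D`-dimensional feature vectors; the function name `assign_tokens` and the toy arrays are hypothetical and are not taken from the DiGIT code base. Each feature is mapped to the index of its nearest centroid, and that index serves as the discrete token.

```python
import numpy as np


def assign_tokens(features: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Map each feature vector to the index of its nearest centroid (squared L2).

    features:  (N, D) array of feature vectors (hypothetical stand-in for SSL hidden states)
    centroids: (K, D) array of K-Means cluster centres
    """
    # ||f - c||^2 = ||f||^2 - 2 f.c + ||c||^2. The ||f||^2 term is the same for
    # every centroid, so it does not affect the argmin and is dropped.
    scores = -2.0 * features @ centroids.T + np.sum(centroids ** 2, axis=1)
    return np.argmin(scores, axis=1)


# Toy data: 8 well-separated centroids in 4 dimensions (illustrative only).
centroids = 10.0 * np.arange(8, dtype=np.float64)[:, None] * np.ones((1, 4))
# Three features lying close to centroids 3, 0 and 5.
features = centroids[[3, 0, 5]] + 0.1

tokens = assign_tokens(features, centroids)
print(tokens.tolist())  # [3, 0, 5]
```

With real data, the centroid array would presumably come from something like `np.load("km_8k.npy")`; the exact layout of that file is an assumption here and should be checked against the official repository.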
26
+
27
+
28
+ ## Experimental Results
29
+
30
+ ### Linear-Probe Accuracy on ImageNet
31
+
32
+
33
+ | Methods | \# Tokens | Features | \# Params | Top-1 Acc. $\uparrow$ |
34
+ |-----------------------------------|-------------|----------|------------|-----------------------|
35
+ | iGPT-L | 32 $\times$ 32 | 1536 | 1362M | 60.3 |
36
+ | iGPT-XL | 64 $\times$ 64 | 3072 | 6801M | 68.7 |
37
+ | VIM+VQGAN | 32 $\times$ 32 | 1024 | 650M | 61.8 |
38
+ | VIM+dVAE | 32 $\times$ 32 | 1024 | 650M | 63.8 |
39
+ | VIM+ViT-VQGAN | 32 $\times$ 32 | 1024 | 650M | 65.1 |
40
+ | VIM+ViT-VQGAN | 32 $\times$ 32 | 2048 | 1697M | 73.2 |
41
+ | AIM | 16 $\times$ 16 | 1536 | 0.6B | 70.5 |
42
+ | **DiGIT (Ours)** | 16 $\times$ 16 | 1024 | 219M | 71.7 |
43
+ | **DiGIT (Ours)** | 16 $\times$ 16 | 1536 | 732M | **80.3** |
44
+
45
+ ### Class-Unconditional Image Generation on ImageNet (Resolution: 256 $\times$ 256)
46
+
47
+ | Type  | Methods                             | \# Params | \# Epochs | FID $\downarrow$ | IS $\uparrow$ |
48
+ |-------|-------------------------------------|----------|----------|------------------|----------------|
49
+ | GAN | BigGAN | 70M | - | 38.6 | 24.70 |
50
+ | Diff. | LDM | 395M | - | 39.1 | 22.83 |
51
+ | Diff. | ADM | 554M | - | 26.2 | 39.70 |
52
+ | MIM | MAGE | 200M | 1600 | 11.1 | 81.17 |
53
+ | MIM | MAGE | 463M | 1600 | 9.10 | 105.1 |
54
+ | MIM | MaskGIT | 227M | 300 | 20.7 | 42.08 |
55
+ | MIM | **DiGIT (+MaskGIT)** | 219M | 200 | **9.04** | **75.04** |
56
+ | AR | VQGAN | 214M | 200 | 24.38 | 30.93 |
57
+ | AR | **DiGIT (+VQGAN)** | 219M | 400 | **9.13** | **73.85** |
58
+ | AR | **DiGIT (+VQGAN)** | 732M | 200 | **4.59** | **141.29** |
59
+
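
For readers unfamiliar with the FID column, the metric is commonly defined as the Fréchet distance between two Gaussians fitted to Inception features of real and generated images. The notation below is ours and is not taken from the paper: $\mu_r, \Sigma_r$ are the mean and covariance of the real-image features, and $\mu_g, \Sigma_g$ those of the generated-image features.

$$
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)
$$

Lower values mean the two feature distributions are closer, which is why the column is marked $\downarrow$.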
60
+ ### Class-Conditional Image Generation on ImageNet (Resolution: 256 $\times$ 256)
61
+
62
+
63
+
64
+ | Type  | Methods              | \# Params | \# Epochs | FID $\downarrow$ | IS $\uparrow$ |
65
+ |-------|----------------------|----------|----------|------------------|----------------|
66
+ | GAN | BigGAN | 160M | - | 6.95 | 198.2 |
67
+ | Diff. | ADM | 554M | - | 10.94 | 101.0 |
68
+ | Diff. | LDM-4 | 400M | - | 10.56 | 103.5 |
69
+ | Diff. | DiT-XL/2 | 675M | - | 9.62 | 121.50 |
70
+ | Diff. | L-DiT-7B | 7B | - | 6.09 | 153.32 |
71
+ | MIM | CQR-Trans | 371M | 300 | 5.45 | 172.6 |
72
+ | MIM+AR | VAR | 310M | 200 | 4.64 | - |
73
+ | MIM+AR | VAR | 310M | 200 | 3.60* | 257.5* |
74
+ | MIM+AR | VAR | 600M | 250 | 2.95* | 306.1* |
75
+ | MIM | MAGVIT-v2 | 307M | 1080 | 3.65 | 200.5 |
76
+ | AR | VQVAE-2 | 13.5B | - | 31.11 | 45 |
77
+ | AR | RQ-Trans | 480M | - | 15.72 | 86.8 |
78
+ | AR | RQ-Trans | 3.8B | - | 7.55 | 134.0 |
79
+ | AR | ViTVQGAN | 650M | 360 | 11.20 | 97.2 |
80
+ | AR | ViTVQGAN | 1.7B | 360 | 5.3 | 149.9 |
81
+ | MIM | MaskGIT | 227M | 300 | 6.18 | 182.1 |
82
+ | MIM | **DiGIT (+MaskGIT)** | 219M | 200 | **4.62** | **146.19** |
83
+ | AR | VQGAN | 227M | 300 | 18.65 | 80.4 |
84
+ | AR | **DiGIT (+VQGAN)** | 219M | 200 | **4.79** | **142.87** |
85
+ | AR | **DiGIT (+VQGAN)** | 732M | 200 | **3.39** | **205.96** |
86
+
87
+ \*: VAR is trained with classifier-free guidance, whereas none of the other models are.
88
+
89
+
90
+ ## Checkpoints
91
+ The K-Means npy file and model checkpoints can be downloaded from:
92
+
93
+ | Model | Link |
94
+ |:----------:|:-----:|
95
+ | HF weights 🤗 | [Hugging Face](https://huggingface.co/DAMO-NLP-SG/DiGIT) |
96
+ | Google Drive | [Google Drive](https://drive.google.com/drive/folders/1QWc51HhnZ2G4xI7TkKRanaqXuo8WxUSI?usp=share_link) |
97
+
98
+ For the base model we use [DINOv2-base](https://dl.fbaipublicfiles.com/dinov2/dinov2_vitb14/dinov2_vitb14_reg4_pretrain.pth), and for the large model we use [DINOv2-large](https://dl.fbaipublicfiles.com/dinov2/dinov2_vitl14/dinov2_vitl14_reg4_pretrain.pth). The VQGAN is the same one used by [MAGE](https://drive.google.com/file/d/13S_unB87n6KKuuMdyMnyExW0G1kplTbP/view?usp=sharing).
99
+
100
+ ```
101
+ DiGIT
102
+ └── data/
103
+ ├── ILSVRC2012
104
+ ├── dinov2_base_short_224_l3
105
+ ├── km_8k.npy
106
+ ├── dinov2_large_short_224_l3
107
+ ├── km_16k.npy
108
+ └── outputs/
109
+ ├── base_8k_stage1
110
+ ├── ...
111
+ └── models/
112
+ ├── vqgan_jax_strongaug.ckpt
113
+ ├── dinov2_vitb14_reg4_pretrain.pth
114
+ ├── dinov2_vitl14_reg4_pretrain.pth
115
+ ```
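
After downloading, a quick check can confirm that the files named in the layout above are in place. The sketch below is a hypothetical helper, not part of the DiGIT repository. It searches recursively by file name, so it makes no assumption about how deeply each file is nested under the root directory.

```python
from pathlib import Path

# File names taken from the directory layout above.
EXPECTED_FILES = [
    "km_8k.npy",
    "km_16k.npy",
    "vqgan_jax_strongaug.ckpt",
    "dinov2_vitb14_reg4_pretrain.pth",
    "dinov2_vitl14_reg4_pretrain.pth",
]


def find_missing(root, expected=EXPECTED_FILES):
    """Return the expected file names that do not appear anywhere under `root`."""
    root = Path(root)
    # Collect the names of all regular files below root, at any depth.
    present = {p.name for p in root.rglob("*") if p.is_file()}
    return [name for name in expected if name not in present]


# Example usage (assumes the project was placed in a local "DiGIT" directory):
#   missing = find_missing("DiGIT")
#   if missing:
#       print("Missing files:", ", ".join(missing))
```

An empty list means every expected file was found; otherwise the returned names show which downloads are still needed.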
116
+
117
+ The training and inference code can be found in our GitHub repo: https://github.com/DAMO-NLP-SG/DiGIT
118
+
119
+
120
+ ## Citation
121
+
122
+ If you find our project useful, please consider starring our repo and citing our work as follows.
123
+
124
+ ```bibtex
125
+
126
+
127
+ @misc{zhu2024stabilizelatentspaceimage,
130
+ title={Stabilize the Latent Space for Image Autoregressive Modeling: A Unified Perspective},
131
+ author={Yongxin Zhu and Bocheng Li and Hang Zhang and Xin Li and Linli Xu and Lidong Bing},
132
+ year={2024},
133
+ eprint={2410.12490},
134
+ archivePrefix={arXiv},
135
+ primaryClass={cs.CV},
136
+ url={https://arxiv.org/abs/2410.12490},
137
+ }
138
+ ```