
Stabilize the Latent Space for Image Autoregressive Modeling: A Unified Perspective


This is the official implementation of DiGIT ([GitHub](https://github.com/DAMO-NLP-SG/DiGIT)), accepted at NeurIPS 2024.

Overview

We present DiGIT, an auto-regressive generative model that performs next-token prediction in an abstract latent space derived from self-supervised learning (SSL) models. By applying K-Means clustering to the hidden states of the DINOv2 model, we create a novel discrete tokenizer. This approach significantly boosts image generation performance on the ImageNet dataset, achieving an FID of 4.59 for class-unconditional generation and 3.39 for class-conditional generation. It also improves image understanding, attaining a linear-probe accuracy of 80.3%.
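
The tokenizer idea is compact enough to sketch: extract per-patch DINOv2 hidden states, cluster them with K-Means, and treat each patch's nearest-centroid index as a discrete token. The snippet below is a minimal illustration, not the released pipeline; the backbone entry point, the transformer layer (the final one is taken here, whereas the released files suggest an intermediate layer), the image preprocessing, and the K-Means settings are all assumptions.

```python
# Minimal sketch of the SSL tokenizer idea: cluster DINOv2 patch features with
# K-Means and use each patch's cluster index as a discrete token.
import numpy as np
import torch
from PIL import Image
from sklearn.cluster import MiniBatchKMeans
from torchvision import transforms

# DINOv2-base with register tokens (assumed backbone; weights download via torch.hub).
dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14_reg").eval()

preprocess = transforms.Compose([
    transforms.Resize(224),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])

@torch.no_grad()
def patch_features(image: Image.Image) -> np.ndarray:
    """Per-patch hidden states, shape (256, dim) for a 224x224 input (16x16 patches)."""
    x = preprocess(image.convert("RGB")).unsqueeze(0)     # (1, 3, 224, 224)
    feats = dinov2.get_intermediate_layers(x, n=1)[0]     # last layer; the paper's layer is an assumption
    return feats.squeeze(0).cpu().numpy()

# Fit the tokenizer: K-Means over features pooled from many training images
# (hypothetical `training_images`); 8192 centroids mirrors km_8k.npy.
# all_feats = np.concatenate([patch_features(img) for img in training_images])
# kmeans = MiniBatchKMeans(n_clusters=8192, batch_size=4096).fit(all_feats)

# Tokenize an image: each patch becomes the id of its nearest centroid.
# tokens = kmeans.predict(patch_features(some_image))     # (256,) discrete token ids
```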

Experimental Results

Linear-Probe Accuracy on ImageNet

| Methods | # Tokens | Features | # Params | Top-1 Acc. $\uparrow$ |
| --- | --- | --- | --- | --- |
| iGPT-L | 32 $\times$ 32 | 1536 | 1362M | 60.3 |
| iGPT-XL | 64 $\times$ 64 | 3072 | 6801M | 68.7 |
| VIM+VQGAN | 32 $\times$ 32 | 1024 | 650M | 61.8 |
| VIM+dVAE | 32 $\times$ 32 | 1024 | 650M | 63.8 |
| VIM+ViT-VQGAN | 32 $\times$ 32 | 1024 | 650M | 65.1 |
| VIM+ViT-VQGAN | 32 $\times$ 32 | 2048 | 1697M | 73.2 |
| AIM | 16 $\times$ 16 | 1536 | 0.6B | 70.5 |
| DiGIT (Ours) | 16 $\times$ 16 | 1024 | 219M | 71.7 |
| DiGIT (Ours) | 16 $\times$ 16 | 1536 | 732M | 80.3 |
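
For reference, linear probing trains only a linear classifier on frozen features from the model and reports top-1 accuracy, so the numbers above measure representation quality rather than classifier capacity. The sketch below illustrates the protocol in its simplest form; the feature pooling and the logistic-regression solver are assumptions, not the paper's exact recipe.

```python
# Hedged sketch of linear probing: train only a linear classifier on frozen
# features (the generative backbone is never updated) and report top-1 accuracy.
import numpy as np
from sklearn.linear_model import LogisticRegression

def linear_probe_top1(train_feats, train_labels, test_feats, test_labels):
    """feats: (N, dim) frozen, pooled features; labels: (N,) integer class ids."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train_feats, train_labels)
    return float((clf.predict(test_feats) == test_labels).mean())

# Usage with hypothetical pooled features extracted from the frozen model:
# top1 = linear_probe_top1(train_x, train_y, test_x, test_y)
```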

Class-Unconditional Image Generation on ImageNet (Resolution: 256 $\times$ 256)

| Type | Methods | # Param | # Epoch | FID $\downarrow$ | IS $\uparrow$ |
| --- | --- | --- | --- | --- | --- |
| GAN | BigGAN | 70M | - | 38.6 | 24.70 |
| Diff. | LDM | 395M | - | 39.1 | 22.83 |
| Diff. | ADM | 554M | - | 26.2 | 39.70 |
| MIM | MAGE | 200M | 1600 | 11.1 | 81.17 |
| MIM | MAGE | 463M | 1600 | 9.10 | 105.1 |
| MIM | MaskGIT | 227M | 300 | 20.7 | 42.08 |
| MIM | DiGIT (+MaskGIT) | 219M | 200 | 9.04 | 75.04 |
| AR | VQGAN | 214M | 200 | 24.38 | 30.93 |
| AR | DiGIT (+VQGAN) | 219M | 400 | 9.13 | 73.85 |
| AR | DiGIT (+VQGAN) | 732M | 200 | 4.59 | 141.29 |

Class-Conditional Image Generation on ImageNet (Resolution: 256 $\times$ 256)

| Type | Methods | # Param | # Epoch | FID $\downarrow$ | IS $\uparrow$ |
| --- | --- | --- | --- | --- | --- |
| GAN | BigGAN | 160M | - | 6.95 | 198.2 |
| Diff. | ADM | 554M | - | 10.94 | 101.0 |
| Diff. | LDM-4 | 400M | - | 10.56 | 103.5 |
| Diff. | DiT-XL/2 | 675M | - | 9.62 | 121.50 |
| Diff. | L-DiT-7B | 7B | - | 6.09 | 153.32 |
| MIM | CQR-Trans | 371M | 300 | 5.45 | 172.6 |
| MIM+AR | VAR | 310M | 200 | 4.64 | - |
| MIM+AR | VAR | 310M | 200 | 3.60* | 257.5* |
| MIM+AR | VAR | 600M | 250 | 2.95* | 306.1* |
| MIM | MAGVIT-v2 | 307M | 1080 | 3.65 | 200.5 |
| AR | VQVAE-2 | 13.5B | - | 31.11 | 45 |
| AR | RQ-Trans | 480M | - | 15.72 | 86.8 |
| AR | RQ-Trans | 3.8B | - | 7.55 | 134.0 |
| AR | ViTVQGAN | 650M | 360 | 11.20 | 97.2 |
| AR | ViTVQGAN | 1.7B | 360 | 5.3 | 149.9 |
| MIM | MaskGIT | 227M | 300 | 6.18 | 182.1 |
| MIM | DiGIT (+MaskGIT) | 219M | 200 | 4.62 | 146.19 |
| AR | VQGAN | 227M | 300 | 18.65 | 80.4 |
| AR | DiGIT (+VQGAN) | 219M | 400 | 4.79 | 142.87 |
| AR | DiGIT (+VQGAN) | 732M | 200 | 3.39 | 205.96 |

*: VAR is trained with classifier-free guidance while all the other models are not.

Checkpoints

The K-Means `.npy` files and model checkpoints can be downloaded from:

| Model | Link |
| --- | --- |
| HF weights | 🤗 Huggingface |

We use DINOv2-base for the base-size model and DINOv2-large for the large-size model. The VQGAN we use is the same as in MAGE.

DiGIT
├── data/
│   └── ILSVRC2012
│       ├── dinov2_base_short_224_l3
│       │   └── km_8k.npy
│       └── dinov2_large_short_224_l3
│           └── km_16k.npy
├── outputs/
│   ├── base_8k_stage1
│   └── ...
└── models/
    ├── vqgan_jax_strongaug.ckpt
    ├── dinov2_vitb14_reg4_pretrain.pth
    └── dinov2_vitl14_reg4_pretrain.pth

The training and inference code can be found in our [GitHub repository](https://github.com/DAMO-NLP-SG/DiGIT).
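
As a small usage sketch, the released `km_8k.npy` / `km_16k.npy` files hold the K-Means centroids used by the tokenizer. Assuming they store a plain `(K, dim)` centroid matrix (check the repo's data-loading code for the authoritative format), patch features can be mapped to token ids by nearest-centroid assignment:

```python
# Hedged sketch: map DINOv2 patch features to discrete tokens with the released
# K-Means centroids. Assumes km_8k.npy stores a plain (K, dim) centroid matrix.
import numpy as np

def load_centroids(path="data/ILSVRC2012/dinov2_base_short_224_l3/km_8k.npy"):
    return np.load(path)                            # assumed shape: (8192, dim)

def assign_tokens(feats: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Nearest-centroid (L2) assignment; feats: (num_patches, dim)."""
    # ||f - c||^2 = ||f||^2 - 2 f.c + ||c||^2 ; the ||f||^2 term is constant per patch.
    dists = -2.0 * feats @ centroids.T + (centroids ** 2).sum(axis=1)
    return dists.argmin(axis=1)                     # (num_patches,) token ids

# Usage with `patch_features` from the tokenizer sketch in the Overview:
# tokens = assign_tokens(patch_features(image), load_centroids())  # 16x16 = 256 tokens
```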

Citation

If you find our project useful, please star our repo and cite our work as follows:

@misc{zhu2024stabilize,
    title={Stabilize the Latent Space for Image Autoregressive Modeling: A Unified Perspective},
    author={Yongxin Zhu and Bocheng Li and Hang Zhang and Xin Li and Linli Xu and Lidong Bing},
    year={2024},
    eprint={2410.12490},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}