This is a PyTorch Lightning checkpoint of VQGAN, which learns a codebook of context-rich visual parts by combining convolutional methods with transformers. It was introduced in Taming Transformers for High-Resolution Image Synthesis (CVPR 2021).
The model allows images to be encoded as fixed-length sequences of tokens taken from the codebook.
This version of the model uses a reduction factor f=16 and a vocabulary of 16,384 tokens. As an example of how the reduction factor works, images of size 256x256 are encoded to sequences of 256/16 * 256/16 = 256 tokens; images of 512x512 would result in sequences of 1024 tokens.
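The token count for any resolution follows directly from the reduction factor. A minimal sketch (the `sequence_length` helper below is illustrative, not part of the codebase):

```python
# Token sequence length for a VQGAN with reduction factor f:
# the encoder downsamples each spatial dimension by f, and every
# resulting latent cell becomes one codebook index.
def sequence_length(height: int, width: int, f: int = 16) -> int:
    return (height // f) * (width // f)

print(sequence_length(256, 256))  # 256 tokens
print(sequence_length(512, 512))  # 1024 tokens
```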
The model was trained on the following datasets:
- ImageNet. We did not train this model from scratch; instead, we started from a checkpoint pre-trained on ImageNet.
- Conceptual Captions 3M (CC3M).
- OpenAI subset of YFCC100M.
We fine-tuned on CC3M and YFCC100M to improve the encoding quality of people and faces, which are not well represented in ImageNet, using a subset of 2,268,720 images from the two datasets.
Fine-tuning was performed in PyTorch using taming-transformers. The full training process and model preparation included these steps:
- Pre-training on ImageNet, previously performed; we used this checkpoint.
- Fine-tuning, Part 1.
- Fine-tuning, Part 2 – continuation from Part 1. The final checkpoint has been logged as an artifact in the training run and is the model present in this card.
- Conversion to JAX (see the JAX version of this model linked below).
The checkpoint can be loaded using PyTorch Lightning. omegaconf==2.0.0 is required to load the checkpoint.
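For reference, loading and encoding might look like the following minimal sketch using the VQModel class from taming-transformers, assuming the config and checkpoint from this repository are saved locally as model.yaml and model.ckpt (placeholder file names):

```python
import torch
from omegaconf import OmegaConf  # omegaconf==2.0.0
from taming.models.vqgan import VQModel

# Instantiate the VQGAN from its config (file name is a placeholder).
config = OmegaConf.load("model.yaml")
model = VQModel(**config.model.params)

# Restore the fine-tuned weights from the Lightning checkpoint.
state_dict = torch.load("model.ckpt", map_location="cpu")["state_dict"]
model.load_state_dict(state_dict, strict=False)
model.eval()

# Encode an image to codebook indices. A dummy tensor is used here;
# real inputs should be shaped [B, 3, H, W] and normalized to [-1, 1].
x = torch.randn(1, 3, 256, 256)
with torch.no_grad():
    _, _, (_, _, indices) = model.encode(x)
print(indices.shape)  # flattened codebook indices: 16*16 = 256 per 256x256 image
```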
Related models:
- JAX version of VQGAN, trained on the same datasets described here.
- DALL·E mini, a simplified Flax/JAX implementation of OpenAI's DALL·E.