This is a Flax/JAX implementation of VQGAN, which learns a codebook of context-rich visual parts by combining convolutional networks with transformers. It was introduced in Taming Transformers for High-Resolution Image Synthesis (CVPR 2021).
The model encodes an image as a fixed-length sequence of tokens drawn from the codebook.
This version of the model uses a reduction factor of f=16 and a vocabulary of
As an example of how the reduction factor works, images of size 256x256 are encoded to sequences of 256 tokens (256/16 × 256/16 = 16 × 16). Images of size 512x512 would result in sequences of 1024 tokens (32 × 32).
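To make the arithmetic concrete, here is a minimal Python sketch (the variable names are illustrative):

```python
# Reduction factor f = 16: each spatial dimension is divided by 16,
# so an H x W image maps to (H // 16) * (W // 16) codebook tokens.
f = 16
for size in (256, 512):
    seq_len = (size // f) ** 2
    print(f"{size}x{size} -> {seq_len} tokens")

# Output:
# 256x256 -> 256 tokens
# 512x512 -> 1024 tokens
```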
This model was ported to JAX using a checkpoint trained on ImageNet.
The checkpoint can be loaded using Suraj Patil's implementation of `VQModel` from the `vqgan-jax` repository.
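As a hedged sketch of how loading and encoding might look with that implementation: the checkpoint id below is a placeholder, and the NHWC layout, the [-1, 1] scaling, and the exact return values of `encode` are assumptions based on the original taming-transformers code, not something this card confirms.

```python
# pip install git+https://github.com/patil-suraj/vqgan-jax.git
import jax.numpy as jnp
from vqgan_jax.modeling_flax_vqgan import VQModel

# Placeholder checkpoint id; substitute the actual Hub name of this model.
model = VQModel.from_pretrained("<hub-checkpoint-id>")

# One 256x256 RGB image in NHWC layout, scaled to [-1, 1] (assumed preprocessing).
pixel_values = jnp.zeros((1, 256, 256, 3), dtype=jnp.float32)

# encode is assumed to return the quantized latents and the codebook indices.
quant_states, indices = model.encode(pixel_values)

# Flatten the 16x16 spatial grid into the fixed-length token sequence.
tokens = indices.reshape(indices.shape[0], -1)
print(tokens.shape)  # expected: (1, 256)
```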