arxiv:2406.07550

An Image is Worth 32 Tokens for Reconstruction and Generation

Published on Jun 11
· Submitted by akhaliq on Jun 12
#1 Paper of the day

Abstract

Recent advancements in generative models have highlighted the crucial role of image tokenization in the efficient synthesis of high-resolution images. Tokenization, which transforms images into latent representations, reduces computational demands compared to directly processing pixels and enhances the effectiveness and efficiency of the generation process. Prior methods, such as VQGAN, typically utilize 2D latent grids with fixed downsampling factors. However, these 2D tokenizations face challenges in managing the inherent redundancies present in images, where adjacent regions frequently display similarities. To overcome this issue, we introduce the Transformer-based 1-Dimensional Tokenizer (TiTok), an innovative approach that tokenizes images into 1D latent sequences. TiTok provides a more compact latent representation, yielding substantially more efficient and effective representations than conventional techniques. For example, a 256 x 256 x 3 image can be reduced to just 32 discrete tokens, a significant reduction from the 256 or 1024 tokens obtained by prior methods. Despite its compact nature, TiTok achieves competitive performance with state-of-the-art approaches. Specifically, using the same generator framework, TiTok attains 1.97 gFID, significantly outperforming the MaskGIT baseline by 4.21 on the ImageNet 256 x 256 benchmark. The advantages of TiTok become even more significant at higher resolutions. On the ImageNet 512 x 512 benchmark, TiTok not only outperforms the state-of-the-art diffusion model DiT-XL/2 (gFID 2.74 vs. 3.04), but also reduces the number of image tokens by 64x, leading to a 410x faster generation process. Our best-performing variant significantly surpasses DiT-XL/2 (gFID 2.13 vs. 3.04) while still generating high-quality samples 74x faster.
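
To make the shape flow concrete (a 256 x 256 x 3 image compressed into 32 discrete token ids), here is a minimal conceptual sketch in PyTorch. It is not the authors' implementation: the layer sizes, the learnable `latent_tokens`, and the plain nearest-neighbour quantization are illustrative assumptions based on the abstract's description.

```python
# Conceptual sketch of 1D tokenization: image -> 32 discrete token ids (not the paper's code).
import torch
import torch.nn as nn

class Toy1DTokenizer(nn.Module):
    def __init__(self, image_size=256, patch_size=16, dim=512,
                 num_latent_tokens=32, codebook_size=4096, code_dim=16):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2               # 256 patch tokens
        self.patch_embed = nn.Conv2d(3, dim, patch_size, patch_size)
        self.latent_tokens = nn.Parameter(torch.randn(num_latent_tokens, dim))
        self.pos_embed = nn.Parameter(torch.randn(num_patches + num_latent_tokens, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.to_code = nn.Linear(dim, code_dim)
        self.codebook = nn.Embedding(codebook_size, code_dim)       # N entries, 16 channels each

    def forward(self, images):                                      # (B, 3, 256, 256)
        B = images.shape[0]
        patches = self.patch_embed(images).flatten(2).transpose(1, 2)   # (B, 256, dim)
        latents = self.latent_tokens.expand(B, -1, -1)                  # (B, 32, dim)
        x = torch.cat([patches, latents], dim=1) + self.pos_embed
        x = self.encoder(x)
        z = self.to_code(x[:, -latents.shape[1]:])                      # keep only the 32 latent slots
        # Nearest-neighbour vector quantization: each latent -> one codebook index.
        dists = torch.cdist(z, self.codebook.weight.unsqueeze(0).expand(B, -1, -1))
        return dists.argmin(dim=-1)                                     # (B, 32) discrete token ids

tokens = Toy1DTokenizer()(torch.randn(1, 3, 256, 256))
print(tokens.shape)  # torch.Size([1, 32])
```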

Community

Can anyone see any mention of the embedding dimension for these models? I can't see it stated anywhere in the paper and they're yet to release any code 🤨

Paper author

Thanks for your interest.

The tokenizer and detokenizer are standard ViT-S/B/L (for TiTok-S/B/L, respectively). The codebook config is mentioned in "4.1 Preliminary Experiments of 1D Tokenization - Preliminary Experimental Setup", quoted as "the codebook C is configured to have N = 1024 entries with each entry a vector with 16 channels".

The final models increase the codebook size to 4096, as mentioned in "4.2 Main Experiments - Implementation Details", quoted as "In the final setting for TiTok training, the codebook is configured to N = 4096".

The code & models are currently under internal review and we will try our best to release them to the public ASAP :)

If I understand correctly, this can be used as a lossy compression system to achieve compression ratios in excess of 1000:1 (e.g. 256 x 256 x 24 bits ≈ 1.5 Mbit, but 32 tokens x 12 bits per token = 384 bits). Is this correct?

If so, have you evaluated this against other extreme lossy compression systems? I'd be very curious to see the result!

Paper author

Thanks for your interest and comments.

The compression ratio computation seems correct and reasonable to me, and it is an interesting perspective from which to view the problem! (We currently use the number of tokens to measure how compact the latent space is, but measuring in bits as you did also sounds reasonable and interesting.)

TBH I am not so familiar with lossy compression systems themselves; are there any references you could point me to for comparison against other "extreme lossy compression systems"?
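
For concreteness, the arithmetic discussed above can be worked out in a few lines of Python; the 24 bits per pixel and 12 bits per token figures are the commenter's assumptions (8-bit RGB and a 4096-entry codebook, so log2(4096) = 12 bits per index):

```python
import math

# Raw image: 256 x 256 RGB pixels at 8 bits per channel.
raw_bits = 256 * 256 * 3 * 8            # 1,572,864 bits (~192 KiB, ~1.5 Mbit)

# TiTok latent: 32 discrete tokens, each an index into a 4096-entry codebook.
bits_per_token = math.log2(4096)        # 12 bits per token
latent_bits = 32 * bits_per_token       # 384 bits

print(raw_bits / latent_bits)           # 4096.0 -> roughly a 4096:1 compression ratio
```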

This is very exciting! Using this tokenizer for a multi-modal LLM like LLaVA with such a high compression ratio would be a great solution.

Thanks for sharing this amazing paper! I have a question: what module did you use for upsampling from the output mask tokens to pixels, i.e., (H/f, W/f, D) -> (224, 224, 3)?
Thank you in advance!

Paper author

We use a small conv decoder (reusing the MaskGIT-VQGAN decoder at the decoder-finetuning stage) to upsample the mask tokens to pixels. Ideally it should not matter much, and using a simple linear layer (similar to MAE's last layer) should be fine as well.
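
As a rough illustration of the MAE-style alternative mentioned above (a single linear layer mapping each decoded token back to an f x f pixel patch), here is a minimal sketch; the shapes and names are assumptions for illustration, not the paper's actual decoder-finetuning setup, which reuses the MaskGIT-VQGAN conv decoder:

```python
import torch
import torch.nn as nn

# One decoded token per f x f patch; a single linear layer predicts that patch's pixels,
# similar in spirit to MAE's final projection layer.
B, H, W, f, D = 1, 256, 256, 16, 512
tokens = torch.randn(B, (H // f) * (W // f), D)       # (B, 256, D) decoded mask tokens

to_pixels = nn.Linear(D, f * f * 3)                   # each token -> a 16x16x3 patch
patches = to_pixels(tokens)                           # (B, 256, 768)

# Unpatchify: rearrange the per-token patch predictions back into an image grid.
patches = patches.view(B, H // f, W // f, f, f, 3)
image = patches.permute(0, 5, 1, 3, 2, 4).reshape(B, 3, H, W)
print(image.shape)                                    # torch.Size([1, 3, 256, 256])
```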

Transformers are strong enough to encode and decode. I think the impressively high compression ratio can be attributed to the high codebook usage?
The fact that only a few thousand fully utilized discrete codes are enough to describe all the images is also meaningful for VLMs.

Hi, I have some questions. The paper mentions that the proxy codes come from MaskGIT (codebook 1024x16), which is convolution-based, but TiTok has 4096 codes and is based on ViT. How does the distillation work with different architectures and codebook settings?


I have figured out the training a little bit, but I'm still wondering what the proxy codes refer to: Ze or Zq (before or after MaskGIT's quantization)?
Does the first-stage loss only include the alignment between TiTok's output and the proxy codes, without any loss between the RGB result (TiTok encoder -> TiTok decoder -> MaskGIT decoder) and the GT image?
Thank you in advance.


Lucidrains has just released an unofficial version: https://github.com/lucidrains/titok-pytorch

Hi all, thanks a lot for your interest in our work. Feel free to check out our code at https://github.com/bytedance/1d-tokenizer or play with the HF demo at https://huggingface.co/spaces/fun-research/TiTok

Hello, it looks like the model cannot reconstruct an image at a size larger than 512.

(It needs cropping; resizing won't work.)

Is this a limitation?

Paper author

Hi,

Thanks for your interest in our work. In the paper we have verified TiTok at both 256 and 512 resolution, and it works fine. If you mean arbitrary input sizes/aspect ratios for a trained TiTok, I believe that is a standalone research topic for vision transformers and beyond the scope of this paper. I have provided some references aimed at removing this ViT limitation, if you are interested:

https://arxiv.org/abs/2307.06304
https://huggingface.co/adept/fuyu-8b

Where is the example of reconstructing a 512 image?


Models citing this paper 1

Datasets citing this paper 0


Spaces citing this paper 1

Collections including this paper 16