# TC-AE: Unlocking Token Capacity for Deep Compression Autoencoders

<p align="center">
  <a href=""><img src="https://img.shields.io/badge/Paper-Arxiv-b31b1b.svg" alt="arXiv"></a>
  <a href="https://huggingface.co/inclusionAI/TC-AE/tree/main"><img src="https://img.shields.io/badge/🤗%20Hugging%20Face-Models-yellow" alt="Models"></a>
</p>

<div align="center">
  <a href="https://tliby.github.io/" target="_blank">Teng Li</a><sup>1,2,*</sup>,
  <a href="https://huang-ziyuan.github.io/" target="_blank">Ziyuan Huang</a><sup>1,*,✉</sup>,
  <a href="https://scholar.google.com/citations?user=kwDXTpAAAAAJ&hl=en" target="_blank">Cong Chen</a><sup>1,3,*</sup>,
  <a href="https://ychenl.github.io/" target="_blank">Yangfu Li</a><sup>1,4</sup>,
  <a href="https://qc-ly.github.io/" target="_blank">Yuanhuiyi Lyu</a><sup>1,5</sup>, <br>
  <a href="#" target="_blank">Dandan Zheng</a><sup>1</sup>,
  <a href="https://scholar.google.com/citations?user=Ljk2BvIAAAAJ&hl=en" target="_blank">Chunhua Shen</a><sup>3</sup>,
  <a href="https://eejzhang.people.ust.hk/" target="_blank">Jun Zhang</a><sup>2,✉</sup><br>
  <sup>1</sup>Inclusion AI, Ant Group, <sup>2</sup>HKUST, <sup>3</sup>ZJU, <sup>4</sup>ECNU, <sup>5</sup>HKUST (GZ) <br>
  <sup>*</sup>Equal contribution, <sup>✉</sup>Corresponding authors
</div>
## News

- [2026/03/30] Research paper, code, and models are released for TC-AE!
## Introduction

<p align="center">
  <img src="assets/pipeline.png" width="98%">
</p>
**TC-AE** is a novel Vision Transformer (ViT)-based tokenizer for deep image compression and visual generation. Traditional deep compression methods typically increase channel dimensions to maintain reconstruction quality at high compression ratios, but this often leads to representation collapse that degrades generative performance. TC-AE addresses this fundamental challenge from a new perspective: **optimizing the token space**, the critical bridge between pixels and latent representations. By scaling the number of tokens and enhancing their semantic structure, TC-AE achieves superior reconstruction and generation quality.

Key innovations:

- **Token Space Optimization**: the first to address representation collapse through token space optimization
- **Staged Token Compression**: decomposes the token-to-latent mapping into two stages, reducing structural information loss in the bottleneck
- **Semantic Enhancement**: incorporates self-supervised learning to produce more generation-friendly latents

🚀 In this codebase, we release:

- Pre-trained TC-AE tokenizer weights and evaluation code
- Diffusion model training and evaluation code
## Environment Setup

To set up the environment for TC-AE, follow these steps:

```shell
conda create -n tcae python=3.9
conda activate tcae
pip install -r requirements.txt
```
## Download Checkpoints

Download the pre-trained TC-AE weights and place them in the `results/` directory:

| Tokenizer | Compression Ratio | rFID | LPIPS | Pretrained Weights |
| --------- | ----------------- | ---- | ----- | ------------------ |
| TC-AE-SL | f32d128 | 0.35 | 0.060 | [🤗 Hugging Face](https://huggingface.co/inclusionAI/TC-AE/tree/main) |
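Here `f32d128` follows the usual tokenizer naming convention (an assumption on our part; consult the paper for the exact definition): 32× spatial downsampling with a 128-dimensional latent. A quick sanity check of what that implies for a 256×256 input:

```python
# Illustrative only: assumes f32d128 = 32x spatial downsampling with a
# 128-dim latent, the common convention for deep compression autoencoders.
H = W = 256      # input resolution
f, d = 32, 128   # downsampling factor, latent dimension

tokens = (H // f) * (W // f)   # 8 x 8 = 64 latent tokens
latent_floats = tokens * d     # 64 * 128 = 8192 latent values
pixel_values = H * W * 3       # 196608 RGB values

print(tokens, latent_floats, pixel_values // latent_floats)  # 64 8192 24
```

Under these assumptions, a 256×256 image is represented by 64 tokens, a 24× reduction in the number of stored values relative to raw RGB.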
## Reconstruction Evaluation

##### Image Reconstruction Demo

```shell
python tcae/script/demo_recon.py \
    --img_folder /path/to/your/images \
    --output_folder /path/to/output \
    --ckpt_path results/tcae.pt \
    --config configs/TC-AE-SL.yaml \
    --rank 0
```
##### ImageNet Evaluation

Evaluate reconstruction quality on the ImageNet validation set:

```shell
python tcae/script/eval_recon.py \
    --ckpt_path results/tcae.pt \
    --dataset_root /path/to/imagenet_val \
    --config configs/TC-AE-SL.yaml \
    --rank 0
```
## Generation Evaluation

Our DiT architecture and training pipeline are based on [RAE](https://github.com/bytetriper/RAE) and [VA-VAE](https://github.com/hustvl/LightningDiT).
##### Prepare ImageNet Latents for Training

Extract and cache latent representations from the ImageNet training set:

```shell
accelerate launch \
    --mixed_precision bf16 \
    diffusion/script/extract_features.py \
    --data_path /path/to/imagenet_train \
    --batch_size 50 \
    --tokenizer_cfg_path configs/TC-AE-SL.yaml \
    --tokenizer_ckpt_path results/tcae.pt
```

This will cache the latents to `results/cached_latents/imagenet_train_256/`.
##### Training

Train a DiT-XL model on the extracted latents:

```shell
mkdir -p results/dit
torchrun --standalone --nproc_per_node=8 \
    diffusion/script/train_dit.py \
    --config configs/DiT-XL.yaml \
    --data-path results/cached_latents/imagenet_train_256 \
    --results-dir results/dit \
    --image-size 256 \
    --precision bf16
```
##### Sampling

Generate images using the trained diffusion model:

```shell
mkdir -p results/dit/samples
torchrun --standalone --nnodes=1 --nproc_per_node=8 \
    diffusion/script/sample_ddp_dit.py \
    --config configs/DiT-XL.yaml \
    --sample-dir results/dit/samples \
    --precision bf16 \
    --label-sampling equal \
    --tokenizer_cfg_path configs/TC-AE-SL.yaml \
    --tokenizer_ckpt_path results/tcae.pt
```
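The `--label-sampling equal` flag presumably draws the same number of samples per class (our reading of the flag name; check `sample_ddp_dit.py` for the exact behavior). Conceptually, for the standard class-conditional ImageNet FID protocol of 50,000 samples over 1,000 classes:

```python
import numpy as np

# Sketch of "equal" label sampling: each of the 1000 ImageNet classes
# receives the same number of samples (50 each for a 50k-sample FID run).
# The 50k/1000 figures are the standard protocol, assumed here.
num_classes, total = 1000, 50_000
labels = np.repeat(np.arange(num_classes), total // num_classes)

counts = np.bincount(labels, minlength=num_classes)
print(labels.shape, counts.min(), counts.max())  # (50000,) 50 50
```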
##### Evaluation

Download the ImageNet reference statistics ([adm_in256_stats.npz](https://huggingface.co/jjiaweiyang/l-DeTok/commit/28ef58d254bb1bde10e331372fe542e5458f3b5f#d2h-232267)) and place the file in `results/`.

```shell
python diffusion/script/eval_dit.py \
    --generated_dir results/dit/samples/DiT-0100000-cfg-1.00-bs100-ODE-50-euler-bf16 \
    --reference_npz results/adm_in256_stats.npz \
    --batch-size 512 \
    --num-workers 8
```
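The reference `.npz` holds Gaussian statistics of Inception features, which the evaluation script presumably compares against the generated samples via FID. For reference, the Fréchet distance between two Gaussians that underlies FID can be sketched as follows (illustrative, not the repo's implementation):

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Frechet distance between two Gaussians (mu, sigma), as used in FID."""
    diff = mu1 - mu2
    # Matrix square root of the covariance product; drop tiny imaginary parts.
    covmean = linalg.sqrtm(sigma1 @ sigma2).real
    return diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean)

# Identical statistics give a distance of zero.
mu, sigma = np.zeros(4), np.eye(4)
print(round(float(frechet_distance(mu, sigma, mu, sigma)), 6))  # 0.0
```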
## Acknowledgements

The codebase is built on [HieraTok](https://arxiv.org/abs/2509.23736), [RAE](https://github.com/bytetriper/RAE), [VA-VAE](https://github.com/hustvl/LightningDiT), and [iBOT](https://github.com/bytedance/ibot). Thanks for their efforts!

## License

This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.

## Citation

```

```