devzhk committed
Commit 972a35a · Parent: 6e0fb69

Add model files

Files changed (50)
  1. .DS_Store +0 -0
  2. Dockerfile +4 -0
  3. LICENSE +21 -0
  4. README.md +163 -3
  5. assets/figs/12samples_compressed.png +3 -0
  6. assets/figs/arch.png +3 -0
  7. assets/figs/bar_mem_256.png +3 -0
  8. assets/figs/bar_mem_512.png +3 -0
  9. assets/figs/bar_speed_256.png +3 -0
  10. assets/figs/bar_speed_512.png +3 -0
  11. assets/figs/bubble_gflops_wg.png +3 -0
  12. assets/figs/bubble_gflops_wog.png +3 -0
  13. assets/figs/maskdit_arch.png +3 -0
  14. assets/figs/repo_head.png +3 -0
  15. assets/figs/sample512-set1.png +3 -0
  16. assets/imagenet_label.json +1 -0
  17. autoencoder.py +522 -0
  18. checkpoints/.DS_Store +0 -0
  19. configs/finetune/imagenet256-latent-const.yaml +49 -0
  20. configs/finetune/imagenet256-latent-cos.yaml +49 -0
  21. configs/finetune/imagenet512-latent.yaml +47 -0
  22. configs/test/maskdit-256.yaml +45 -0
  23. configs/test/maskdit-512.yaml +46 -0
  24. configs/train/imagenet256-latent.yaml +48 -0
  25. configs/train/imagenet512-latent.yaml +47 -0
  26. eval_latent.py +132 -0
  27. evaluator.py +695 -0
  28. extract_latent.py +114 -0
  29. fid.py +177 -0
  30. generate.py +91 -0
  31. licenses/LICENSE_ADM.txt +21 -0
  32. licenses/LICENSE_DIT.txt +400 -0
  33. licenses/LICENSE_EDM.txt +439 -0
  34. licenses/LICENSE_UVIT.txt +21 -0
  35. lmdb2wds.py +39 -0
  36. models/maskdit.py +781 -0
  37. sample.py +397 -0
  38. scripts/download_assets.sh +8 -0
  39. scripts/finetune_latent512.sh +14 -0
  40. scripts/prepare_latent256.sh +3 -0
  41. scripts/prepare_latent512.sh +6 -0
  42. scripts/train_latent512.sh +11 -0
  43. torch_utils/__init__.py +0 -0
  44. torch_utils/persistence.py +276 -0
  45. train.py +336 -0
  46. train_utils/datasets.py +412 -0
  47. train_utils/helper.py +69 -0
  48. train_utils/loss.py +101 -0
  49. train_wds.py +400 -0
  50. utils.py +225 -0
.DS_Store ADDED
Binary file (6.15 kB).
 
Dockerfile ADDED
@@ -0,0 +1,4 @@
+ FROM nvcr.io/nvidia/pytorch:23.03-py3
+ RUN pip install einops lmdb omegaconf wandb tqdm pyyaml accelerate
+ RUN pip install timm webdataset
+ RUN pip install diffusers["torch"] transformers
LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) [2023] [Anima-Lab]
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
README.md CHANGED
@@ -1,3 +1,163 @@
- ---
- license: mit
- ---
+ # Fast Training of Diffusion Models with Masked Transformers
+
+ Official PyTorch implementation of the TMLR 2024 paper:<br>
+ **[Fast Training of Diffusion Models with Masked Transformers](https://openreview.net/pdf?id=vTBjBtGioE)**
+ <br>
+ Hongkai Zheng*, Weili Nie*, Arash Vahdat, Anima Anandkumar <br>
+ (*Equal contribution)<br>
+
+ Abstract: *While masked transformers have been extensively explored for representation learning, their application to
+ generative learning is less explored in the vision domain. Our work is the first to exploit masked training to reduce
+ the training cost of diffusion models significantly. Specifically, we randomly mask out a high proportion (e.g., 50%) of
+ patches in diffused input images during training. For masked training, we introduce an asymmetric encoder-decoder
+ architecture consisting of a transformer encoder that operates only on unmasked patches and a lightweight transformer
+ decoder on full patches. To promote a long-range understanding of full patches, we add an auxiliary task of
+ reconstructing masked patches to the denoising score matching objective that learns the score of unmasked patches.
+ Experiments on ImageNet-256x256 and ImageNet-512x512 show that our approach achieves competitive and even better
+ generative performance than the state-of-the-art Diffusion Transformer (DiT) model, using only around 30% of its
+ original training time. Thus, our method shows a promising way of efficiently training large transformer-based diffusion
+ models without sacrificing the generative performance.*
+
+ <div align='center'>
+ <img src="assets/figs/repo_head.png" alt="Architecture" width="900" height="500" style="display: block;"/>
+ </div>
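
To make the masked-training recipe described in the abstract concrete, here is a minimal, illustrative sketch of MAE-style random patch masking and the combined objective (denoising score matching on unmasked patches plus an auxiliary reconstruction loss on masked patches). Names such as `mask_ratio` and `lambda_rec` are placeholders; the actual implementation lives in `models/maskdit.py` and `train_utils/loss.py`.

```python
# Illustrative sketch only -- not the repo's implementation (see models/maskdit.py, train_utils/loss.py).
import torch

def random_patch_mask(num_patches: int, batch_size: int, mask_ratio: float = 0.5, device="cpu"):
    """MAE-style random masking: returns kept-patch indices and a binary mask (1 = masked)."""
    num_keep = int(num_patches * (1 - mask_ratio))
    noise = torch.rand(batch_size, num_patches, device=device)  # random score per patch
    ids_shuffle = torch.argsort(noise, dim=1)                   # random permutation of patches
    ids_keep = ids_shuffle[:, :num_keep]                        # patches fed to the encoder
    mask = torch.ones(batch_size, num_patches, device=device)
    mask.scatter_(1, ids_keep, 0.0)                             # 0 = kept, 1 = masked
    return ids_keep, mask

def masked_training_loss(score_pred, score_target, recon_pred, patch_target, mask, lambda_rec=0.1):
    """Denoising score-matching loss on unmasked patches + auxiliary reconstruction on masked patches.
    `lambda_rec` is a hypothetical weight for the auxiliary term."""
    unmasked = (1.0 - mask).unsqueeze(-1)
    masked = mask.unsqueeze(-1)
    dsm = ((score_pred - score_target) ** 2 * unmasked).sum() / unmasked.sum().clamp(min=1)
    rec = ((recon_pred - patch_target) ** 2 * masked).sum() / masked.sum().clamp(min=1)
    return dsm + lambda_rec * rec
```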
+
+ ## Requirements
+
+ - Training MaskDiT on ImageNet-256x256 takes around 260 hours with 8 A100 GPUs to perform 2M updates with a batch size of 1024.
+ - Training MaskDiT on ImageNet-512x512 takes around 210 A100 GPU days to perform 1M updates with a batch size of 1024.
+ - At least one high-end GPU is needed for sampling.
+ - A [Dockerfile](Dockerfile) is provided for the exact software environment.
+
+ ## Efficiency
+
+ Our MaskDiT uses Automatic Mixed Precision (AMP) by default. We also include MaskDiT without AMP (Ours_ft32) for
+ reference.
+
+ ### Training speed
+
+ <img src="assets/figs/bar_speed_256.png" width=45% style="display: inline-block;"><img src="assets/figs/bar_speed_512.png" width=46% style="display: inline-block;">
+
+ ### GPU memory
+
+ <img src="assets/figs/bar_mem_256.png" width=45% style="display: inline-block;"><img src="assets/figs/bar_mem_512.png" width=44.3% style="display: inline-block;">
+
+ ## Pretrained Models
+ We provide pretrained MaskDiT models for ImageNet256 and ImageNet512 in the following table. For FID with guidance, the guidance scale is set to 1.5 by default.
+ | Guidance | Resolution | FID   | Model |
+ | :------- | :--------- | :---- | :---- |
+ | Yes      | 256x256    | 2.28  | [imagenet256-guidance.pt](checkpoints/imagenet256_with_guidance.pt) |
+ | No       | 256x256    | 5.69  | [imagenet256-conditional.pt](checkpoints/imagenet256_without_guidance.pt) |
+ | Yes      | 512x512    | 2.50  | [imagenet512-guidance.pt](checkpoints/imagenet512_with_guidance.pt) |
+ | No       | 512x512    | 10.79 | [imagenet512-conditional.pt](checkpoints/imagenet512_without_guidance.pt) |
+
+ ## Generate from pretrained models
+
+ To generate samples from the provided checkpoints, run
+
+ ```bash
+ python3 generate.py --config configs/test/maskdit-512.yaml --ckpt_path [path to checkpoints] --class_idx [class index from 0-999] --cfg_scale [guidance scale]
+ ```
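
The `--class_idx` argument uses the standard ImageNet-1k class ordering. The mapping added in `assets/imagenet_label.json` can be used to look up an index by name; the helper below is only a convenience sketch (the function name is ours, not part of the repo).

```python
import json

def find_class_idx(query: str, label_path: str = "assets/imagenet_label.json") -> int:
    """Return the ImageNet class index whose human-readable label contains `query`."""
    with open(label_path) as f:
        labels = json.load(f)  # {"0": ["n01440764", "tench"], ...}
    for idx, (wnid, name) in labels.items():
        if query.lower() in name.lower():
            return int(idx)
    raise KeyError(f"No ImageNet class matching {query!r}")

# Example: find_class_idx("golden_retriever") -> 207, then pass --class_idx 207 to generate.py
```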
+
+ <img src="assets/figs/12samples_compressed.png" title="Generated samples from MaskDiT 256x256." width="850" style="display: block; margin: 0 auto;"/>
+ <p align='center'> Generated samples from MaskDiT 256x256. Upper panel: without CFG. Lower panel: with CFG (scale=1.5).
+ </p>
+
+ <img src="assets/figs/imagenet512.png" title="Generated samples from MaskDiT 512x512." width="850" style="display: block; margin: 0 auto;"/>
+ <p align='center'> Generated samples from MaskDiT 512x512 with CFG (scale=1.5).
+ </p>
+
+ ## Prepare dataset
+
+ We first encode the ImageNet dataset into latent space with a pre-trained VAE. You can download the ImageNet-256x256
+ and ImageNet-512x512 datasets that have already been encoded into latent space by running
+
+ ```bash
+ bash scripts/download_assets.sh
+ ```
+
+ `extract_latent.py` was used to encode ImageNet into latents.
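
For intuition, latent extraction with a pre-trained VAE looks roughly like the sketch below, using the `diffusers` `AutoencoderKL` that the Dockerfile installs. The exact VAE checkpoint and the 0.18215 latent scaling are assumptions borrowed from common latent-diffusion practice, not a description of `extract_latent.py`.

```python
import torch
from diffusers.models import AutoencoderKL

device = "cuda" if torch.cuda.is_available() else "cpu"
# Assumed VAE checkpoint; extract_latent.py may use a different one.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-ema").to(device).eval()

@torch.no_grad()
def encode_to_latent(images: torch.Tensor) -> torch.Tensor:
    """images: (B, 3, H, W) float tensor in [-1, 1] -> latents: (B, 4, H/8, W/8)."""
    posterior = vae.encode(images.to(device)).latent_dist
    return posterior.sample() * 0.18215  # latent scaling convention of latent-diffusion models (assumption here)
```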
+
+ ### LMDB to Webdataset
+
+ When training on ImageNet-256x256, we store our data in LMDB format. When training on ImageNet-512x512, we
+ use [webdataset](https://github.com/webdataset/webdataset) for faster IO performance. To convert an LMDB dataset into a
+ webdataset, run
+
+ ```bash
+ python3 lmdb2wds.py --datadir [path to lmdb] --outdir [path to save webdataset] --resolution [latent resolution] --num_channels [number of latent channels]
+ ```
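
For reference, webdataset shards of latents can be streamed roughly as follows; the shard pattern and per-sample key names are placeholders, so check the files that `lmdb2wds.py` actually writes.

```python
import numpy as np
import webdataset as wds

# Shard pattern and field names are hypothetical; inspect the output of lmdb2wds.py for the real keys.
dataset = (
    wds.WebDataset("imagenet512-latent-{000000..000127}.tar")
    .decode()                       # default decoders handle .npy and .cls entries
    .to_tuple("latent.npy", "cls")  # yields (latent array, class label) pairs
)

for latent, label in dataset:
    print(np.asarray(latent).shape, int(label))
    break
```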
+
+ ## Train
+
+ ### ImageNet-256x256
+
+ First, train MaskDiT with a 50% mask ratio and AMP enabled.
+
+ ```bash
+ accelerate launch --multi_gpu train.py --config configs/train/imagenet256-latent.yaml
+ ```
+
+ Then finetune with unmasking.
+
+ ```bash
+ accelerate launch --multi_gpu train.py --config configs/finetune/imagenet256-latent-const.yaml --ckpt_path [path to checkpoint] --use_ckpt_path False --use_strict_load False --no_amp
+ ```
+
+ ### ImageNet-512x512
+
+ Train MaskDiT with a 50% mask ratio and AMP enabled. Here is an example of a 4-node training script.
+
+ ```bash
+ bash scripts/train_latent512.sh
+ ```
+
+ Then finetune with unmasking.
+
+ ```bash
+ bash scripts/finetune_latent512.sh
+ ```
+
+ ## Evaluation
+
+ ### FID evaluation
+
+ To compute the FID of a pretrained model, run
+
+ ```bash
+ accelerate launch --multi_gpu eval_latent.py --config configs/test/maskdit-256.yaml --ckpt [path to the pretrained model] --cfg_scale [guidance scale]
+ ```
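
For background, FID compares Inception-feature statistics of generated and reference images; a minimal version of the number that `eval_latent.py` and `fid.py` report is sketched below (illustration only, not their code).

```python
# Minimal FID from two sets of Inception features: ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^(1/2)).
import numpy as np
from scipy import linalg

def fid_from_features(feats_real: np.ndarray, feats_fake: np.ndarray) -> float:
    mu1, sigma1 = feats_real.mean(axis=0), np.cov(feats_real, rowvar=False)
    mu2, sigma2 = feats_fake.mean(axis=0), np.cov(feats_fake, rowvar=False)
    diff = mu1 - mu2
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)  # matrix square root of the covariance product
    covmean = covmean.real                                  # drop tiny imaginary parts from numerics
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```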
+
+ ### Full evaluation
+
+ First, download the reference batch from the [ADM repo](https://github.com/openai/guided-diffusion/tree/main/evaluations)
+ directly. You can also use `download_assets.py` by running
+
+ ```bash
+ python3 download_assets.py --name imagenet256 --dest [destination directory]
+ ```
+
+ Then use the evaluator `evaluator.py`
+ from the [ADM repo](https://github.com/openai/guided-diffusion/tree/main/evaluations), or `fid.py`
+ from the [EDM repo](https://github.com/NVlabs/edm), to evaluate the generated samples.
+
+ ### Citation
+
+ ```
+ @inproceedings{Zheng2024MaskDiT,
+     title={Fast Training of Diffusion Models with Masked Transformers},
+     author={Zheng, Hongkai and Nie, Weili and Vahdat, Arash and Anandkumar, Anima},
+     booktitle = {Transactions on Machine Learning Research (TMLR)},
+     year={2024}
+ }
+ ```
+
+ ## Acknowledgements
+
+ Thanks to the open-source codebases [DiT](https://github.com/facebookresearch/DiT), [MAE](https://github.com/facebookresearch/mae), [U-ViT](https://github.com/baofff/U-ViT), [ADM](https://github.com/openai/guided-diffusion), and [EDM](https://github.com/NVlabs/edm), on which our codebase is built.
assets/figs/12samples_compressed.png ADDED

Git LFS Details

  • SHA256: 827525f00c9ec3de222f20082efc9e78f058cb80aae48712113a22df9819f24c
  • Pointer size: 132 Bytes
  • Size of remote file: 3.59 MB
assets/figs/arch.png ADDED

Git LFS Details

  • SHA256: 958ae1cf735092a826f0c0b227b2b3628066375b8b136af78485071565da59a5
  • Pointer size: 132 Bytes
  • Size of remote file: 3.13 MB
assets/figs/bar_mem_256.png ADDED

Git LFS Details

  • SHA256: a06b76bcaf818307c4c287569b354352f1d2d7e671f417df4d198b13108685c1
  • Pointer size: 130 Bytes
  • Size of remote file: 27.9 kB
assets/figs/bar_mem_512.png ADDED

Git LFS Details

  • SHA256: de707af00e56ef9bc9c42bf7066f79723a7bcc03c7b4a930d7d14971e95c00ac
  • Pointer size: 130 Bytes
  • Size of remote file: 74.2 kB
assets/figs/bar_speed_256.png ADDED

Git LFS Details

  • SHA256: 0e6a49fd1d7b9bf50b2e092f3ceceab5b87fa7487537887a1fdcb92a242e43f3
  • Pointer size: 130 Bytes
  • Size of remote file: 25.3 kB
assets/figs/bar_speed_512.png ADDED

Git LFS Details

  • SHA256: d68bc409e3232cf2075df6bdf5183959ad8269da2d6576b995b419967a3dee23
  • Pointer size: 130 Bytes
  • Size of remote file: 72.4 kB
assets/figs/bubble_gflops_wg.png ADDED

Git LFS Details

  • SHA256: 10f764e67f101e23a35d8ea989e6c97080a817102d3913a3b886baf638b8cd20
  • Pointer size: 131 Bytes
  • Size of remote file: 318 kB
assets/figs/bubble_gflops_wog.png ADDED

Git LFS Details

  • SHA256: 209d48bcf148572491e3004f488a6709acf3def862d8b071421e5609d5009c99
  • Pointer size: 131 Bytes
  • Size of remote file: 323 kB
assets/figs/maskdit_arch.png ADDED

Git LFS Details

  • SHA256: f0d4cda130e6c3ebfa39abd33685824e369149e2ed1fbc6d22cddb06f7192d41
  • Pointer size: 132 Bytes
  • Size of remote file: 2.09 MB
assets/figs/repo_head.png ADDED

Git LFS Details

  • SHA256: fd3d4db8cd7d8147ecaa07bf11b69ddae97ec935d876e07e078e2166fba404f6
  • Pointer size: 132 Bytes
  • Size of remote file: 3.21 MB
assets/figs/sample512-set1.png ADDED

Git LFS Details

  • SHA256: 21abf3b5ba37f436dc2baf5ce7003533e2e6fa62a7361e2478e68b3d7f3772be
  • Pointer size: 132 Bytes
  • Size of remote file: 4.4 MB
assets/imagenet_label.json ADDED
@@ -0,0 +1 @@
+ {"0": ["n01440764", "tench"], "1": ["n01443537", "goldfish"], "2": ["n01484850", "great_white_shark"], "3": ["n01491361", "tiger_shark"], "4": ["n01494475", "hammerhead"], "5": ["n01496331", "electric_ray"], "6": ["n01498041", "stingray"], "7": ["n01514668", "cock"], "8": ["n01514859", "hen"], "9": ["n01518878", "ostrich"], "10": ["n01530575", "brambling"], "11": ["n01531178", "goldfinch"], "12": ["n01532829", "house_finch"], "13": ["n01534433", "junco"], "14": ["n01537544", "indigo_bunting"], "15": ["n01558993", "robin"], "16": ["n01560419", "bulbul"], "17": ["n01580077", "jay"], "18": ["n01582220", "magpie"], "19": ["n01592084", "chickadee"], "20": ["n01601694", "water_ouzel"], "21": ["n01608432", "kite"], "22": ["n01614925", "bald_eagle"], "23": ["n01616318", "vulture"], "24": ["n01622779", "great_grey_owl"], "25": ["n01629819", "European_fire_salamander"], "26": ["n01630670", "common_newt"], "27": ["n01631663", "eft"], "28": ["n01632458", "spotted_salamander"], "29": ["n01632777", "axolotl"], "30": ["n01641577", "bullfrog"], "31": ["n01644373", "tree_frog"], "32": ["n01644900", "tailed_frog"], "33": ["n01664065", "loggerhead"], "34": ["n01665541", "leatherback_turtle"], "35": ["n01667114", "mud_turtle"], "36": ["n01667778", "terrapin"], "37": ["n01669191", "box_turtle"], "38": ["n01675722", "banded_gecko"], "39": ["n01677366", "common_iguana"], "40": ["n01682714", "American_chameleon"], "41": ["n01685808", "whiptail"], "42": ["n01687978", "agama"], "43": ["n01688243", "frilled_lizard"], "44": ["n01689811", "alligator_lizard"], "45": ["n01692333", "Gila_monster"], "46": ["n01693334", "green_lizard"], "47": ["n01694178", "African_chameleon"], "48": ["n01695060", "Komodo_dragon"], "49": ["n01697457", "African_crocodile"], "50": ["n01698640", "American_alligator"], "51": ["n01704323", "triceratops"], "52": ["n01728572", "thunder_snake"], "53": ["n01728920", "ringneck_snake"], "54": ["n01729322", "hognose_snake"], "55": ["n01729977", "green_snake"], "56": ["n01734418", "king_snake"], "57": ["n01735189", "garter_snake"], "58": ["n01737021", "water_snake"], "59": ["n01739381", "vine_snake"], "60": ["n01740131", "night_snake"], "61": ["n01742172", "boa_constrictor"], "62": ["n01744401", "rock_python"], "63": ["n01748264", "Indian_cobra"], "64": ["n01749939", "green_mamba"], "65": ["n01751748", "sea_snake"], "66": ["n01753488", "horned_viper"], "67": ["n01755581", "diamondback"], "68": ["n01756291", "sidewinder"], "69": ["n01768244", "trilobite"], "70": ["n01770081", "harvestman"], "71": ["n01770393", "scorpion"], "72": ["n01773157", "black_and_gold_garden_spider"], "73": ["n01773549", "barn_spider"], "74": ["n01773797", "garden_spider"], "75": ["n01774384", "black_widow"], "76": ["n01774750", "tarantula"], "77": ["n01775062", "wolf_spider"], "78": ["n01776313", "tick"], "79": ["n01784675", "centipede"], "80": ["n01795545", "black_grouse"], "81": ["n01796340", "ptarmigan"], "82": ["n01797886", "ruffed_grouse"], "83": ["n01798484", "prairie_chicken"], "84": ["n01806143", "peacock"], "85": ["n01806567", "quail"], "86": ["n01807496", "partridge"], "87": ["n01817953", "African_grey"], "88": ["n01818515", "macaw"], "89": ["n01819313", "sulphur-crested_cockatoo"], "90": ["n01820546", "lorikeet"], "91": ["n01824575", "coucal"], "92": ["n01828970", "bee_eater"], "93": ["n01829413", "hornbill"], "94": ["n01833805", "hummingbird"], "95": ["n01843065", "jacamar"], "96": ["n01843383", "toucan"], "97": ["n01847000", "drake"], "98": ["n01855032", "red-breasted_merganser"], "99": ["n01855672", "goose"], 
"100": ["n01860187", "black_swan"], "101": ["n01871265", "tusker"], "102": ["n01872401", "echidna"], "103": ["n01873310", "platypus"], "104": ["n01877812", "wallaby"], "105": ["n01882714", "koala"], "106": ["n01883070", "wombat"], "107": ["n01910747", "jellyfish"], "108": ["n01914609", "sea_anemone"], "109": ["n01917289", "brain_coral"], "110": ["n01924916", "flatworm"], "111": ["n01930112", "nematode"], "112": ["n01943899", "conch"], "113": ["n01944390", "snail"], "114": ["n01945685", "slug"], "115": ["n01950731", "sea_slug"], "116": ["n01955084", "chiton"], "117": ["n01968897", "chambered_nautilus"], "118": ["n01978287", "Dungeness_crab"], "119": ["n01978455", "rock_crab"], "120": ["n01980166", "fiddler_crab"], "121": ["n01981276", "king_crab"], "122": ["n01983481", "American_lobster"], "123": ["n01984695", "spiny_lobster"], "124": ["n01985128", "crayfish"], "125": ["n01986214", "hermit_crab"], "126": ["n01990800", "isopod"], "127": ["n02002556", "white_stork"], "128": ["n02002724", "black_stork"], "129": ["n02006656", "spoonbill"], "130": ["n02007558", "flamingo"], "131": ["n02009229", "little_blue_heron"], "132": ["n02009912", "American_egret"], "133": ["n02011460", "bittern"], "134": ["n02012849", "crane"], "135": ["n02013706", "limpkin"], "136": ["n02017213", "European_gallinule"], "137": ["n02018207", "American_coot"], "138": ["n02018795", "bustard"], "139": ["n02025239", "ruddy_turnstone"], "140": ["n02027492", "red-backed_sandpiper"], "141": ["n02028035", "redshank"], "142": ["n02033041", "dowitcher"], "143": ["n02037110", "oystercatcher"], "144": ["n02051845", "pelican"], "145": ["n02056570", "king_penguin"], "146": ["n02058221", "albatross"], "147": ["n02066245", "grey_whale"], "148": ["n02071294", "killer_whale"], "149": ["n02074367", "dugong"], "150": ["n02077923", "sea_lion"], "151": ["n02085620", "Chihuahua"], "152": ["n02085782", "Japanese_spaniel"], "153": ["n02085936", "Maltese_dog"], "154": ["n02086079", "Pekinese"], "155": ["n02086240", "Shih-Tzu"], "156": ["n02086646", "Blenheim_spaniel"], "157": ["n02086910", "papillon"], "158": ["n02087046", "toy_terrier"], "159": ["n02087394", "Rhodesian_ridgeback"], "160": ["n02088094", "Afghan_hound"], "161": ["n02088238", "basset"], "162": ["n02088364", "beagle"], "163": ["n02088466", "bloodhound"], "164": ["n02088632", "bluetick"], "165": ["n02089078", "black-and-tan_coonhound"], "166": ["n02089867", "Walker_hound"], "167": ["n02089973", "English_foxhound"], "168": ["n02090379", "redbone"], "169": ["n02090622", "borzoi"], "170": ["n02090721", "Irish_wolfhound"], "171": ["n02091032", "Italian_greyhound"], "172": ["n02091134", "whippet"], "173": ["n02091244", "Ibizan_hound"], "174": ["n02091467", "Norwegian_elkhound"], "175": ["n02091635", "otterhound"], "176": ["n02091831", "Saluki"], "177": ["n02092002", "Scottish_deerhound"], "178": ["n02092339", "Weimaraner"], "179": ["n02093256", "Staffordshire_bullterrier"], "180": ["n02093428", "American_Staffordshire_terrier"], "181": ["n02093647", "Bedlington_terrier"], "182": ["n02093754", "Border_terrier"], "183": ["n02093859", "Kerry_blue_terrier"], "184": ["n02093991", "Irish_terrier"], "185": ["n02094114", "Norfolk_terrier"], "186": ["n02094258", "Norwich_terrier"], "187": ["n02094433", "Yorkshire_terrier"], "188": ["n02095314", "wire-haired_fox_terrier"], "189": ["n02095570", "Lakeland_terrier"], "190": ["n02095889", "Sealyham_terrier"], "191": ["n02096051", "Airedale"], "192": ["n02096177", "cairn"], "193": ["n02096294", "Australian_terrier"], "194": ["n02096437", 
"Dandie_Dinmont"], "195": ["n02096585", "Boston_bull"], "196": ["n02097047", "miniature_schnauzer"], "197": ["n02097130", "giant_schnauzer"], "198": ["n02097209", "standard_schnauzer"], "199": ["n02097298", "Scotch_terrier"], "200": ["n02097474", "Tibetan_terrier"], "201": ["n02097658", "silky_terrier"], "202": ["n02098105", "soft-coated_wheaten_terrier"], "203": ["n02098286", "West_Highland_white_terrier"], "204": ["n02098413", "Lhasa"], "205": ["n02099267", "flat-coated_retriever"], "206": ["n02099429", "curly-coated_retriever"], "207": ["n02099601", "golden_retriever"], "208": ["n02099712", "Labrador_retriever"], "209": ["n02099849", "Chesapeake_Bay_retriever"], "210": ["n02100236", "German_short-haired_pointer"], "211": ["n02100583", "vizsla"], "212": ["n02100735", "English_setter"], "213": ["n02100877", "Irish_setter"], "214": ["n02101006", "Gordon_setter"], "215": ["n02101388", "Brittany_spaniel"], "216": ["n02101556", "clumber"], "217": ["n02102040", "English_springer"], "218": ["n02102177", "Welsh_springer_spaniel"], "219": ["n02102318", "cocker_spaniel"], "220": ["n02102480", "Sussex_spaniel"], "221": ["n02102973", "Irish_water_spaniel"], "222": ["n02104029", "kuvasz"], "223": ["n02104365", "schipperke"], "224": ["n02105056", "groenendael"], "225": ["n02105162", "malinois"], "226": ["n02105251", "briard"], "227": ["n02105412", "kelpie"], "228": ["n02105505", "komondor"], "229": ["n02105641", "Old_English_sheepdog"], "230": ["n02105855", "Shetland_sheepdog"], "231": ["n02106030", "collie"], "232": ["n02106166", "Border_collie"], "233": ["n02106382", "Bouvier_des_Flandres"], "234": ["n02106550", "Rottweiler"], "235": ["n02106662", "German_shepherd"], "236": ["n02107142", "Doberman"], "237": ["n02107312", "miniature_pinscher"], "238": ["n02107574", "Greater_Swiss_Mountain_dog"], "239": ["n02107683", "Bernese_mountain_dog"], "240": ["n02107908", "Appenzeller"], "241": ["n02108000", "EntleBucher"], "242": ["n02108089", "boxer"], "243": ["n02108422", "bull_mastiff"], "244": ["n02108551", "Tibetan_mastiff"], "245": ["n02108915", "French_bulldog"], "246": ["n02109047", "Great_Dane"], "247": ["n02109525", "Saint_Bernard"], "248": ["n02109961", "Eskimo_dog"], "249": ["n02110063", "malamute"], "250": ["n02110185", "Siberian_husky"], "251": ["n02110341", "dalmatian"], "252": ["n02110627", "affenpinscher"], "253": ["n02110806", "basenji"], "254": ["n02110958", "pug"], "255": ["n02111129", "Leonberg"], "256": ["n02111277", "Newfoundland"], "257": ["n02111500", "Great_Pyrenees"], "258": ["n02111889", "Samoyed"], "259": ["n02112018", "Pomeranian"], "260": ["n02112137", "chow"], "261": ["n02112350", "keeshond"], "262": ["n02112706", "Brabancon_griffon"], "263": ["n02113023", "Pembroke"], "264": ["n02113186", "Cardigan"], "265": ["n02113624", "toy_poodle"], "266": ["n02113712", "miniature_poodle"], "267": ["n02113799", "standard_poodle"], "268": ["n02113978", "Mexican_hairless"], "269": ["n02114367", "timber_wolf"], "270": ["n02114548", "white_wolf"], "271": ["n02114712", "red_wolf"], "272": ["n02114855", "coyote"], "273": ["n02115641", "dingo"], "274": ["n02115913", "dhole"], "275": ["n02116738", "African_hunting_dog"], "276": ["n02117135", "hyena"], "277": ["n02119022", "red_fox"], "278": ["n02119789", "kit_fox"], "279": ["n02120079", "Arctic_fox"], "280": ["n02120505", "grey_fox"], "281": ["n02123045", "tabby"], "282": ["n02123159", "tiger_cat"], "283": ["n02123394", "Persian_cat"], "284": ["n02123597", "Siamese_cat"], "285": ["n02124075", "Egyptian_cat"], "286": ["n02125311", "cougar"], "287": 
["n02127052", "lynx"], "288": ["n02128385", "leopard"], "289": ["n02128757", "snow_leopard"], "290": ["n02128925", "jaguar"], "291": ["n02129165", "lion"], "292": ["n02129604", "tiger"], "293": ["n02130308", "cheetah"], "294": ["n02132136", "brown_bear"], "295": ["n02133161", "American_black_bear"], "296": ["n02134084", "ice_bear"], "297": ["n02134418", "sloth_bear"], "298": ["n02137549", "mongoose"], "299": ["n02138441", "meerkat"], "300": ["n02165105", "tiger_beetle"], "301": ["n02165456", "ladybug"], "302": ["n02167151", "ground_beetle"], "303": ["n02168699", "long-horned_beetle"], "304": ["n02169497", "leaf_beetle"], "305": ["n02172182", "dung_beetle"], "306": ["n02174001", "rhinoceros_beetle"], "307": ["n02177972", "weevil"], "308": ["n02190166", "fly"], "309": ["n02206856", "bee"], "310": ["n02219486", "ant"], "311": ["n02226429", "grasshopper"], "312": ["n02229544", "cricket"], "313": ["n02231487", "walking_stick"], "314": ["n02233338", "cockroach"], "315": ["n02236044", "mantis"], "316": ["n02256656", "cicada"], "317": ["n02259212", "leafhopper"], "318": ["n02264363", "lacewing"], "319": ["n02268443", "dragonfly"], "320": ["n02268853", "damselfly"], "321": ["n02276258", "admiral"], "322": ["n02277742", "ringlet"], "323": ["n02279972", "monarch"], "324": ["n02280649", "cabbage_butterfly"], "325": ["n02281406", "sulphur_butterfly"], "326": ["n02281787", "lycaenid"], "327": ["n02317335", "starfish"], "328": ["n02319095", "sea_urchin"], "329": ["n02321529", "sea_cucumber"], "330": ["n02325366", "wood_rabbit"], "331": ["n02326432", "hare"], "332": ["n02328150", "Angora"], "333": ["n02342885", "hamster"], "334": ["n02346627", "porcupine"], "335": ["n02356798", "fox_squirrel"], "336": ["n02361337", "marmot"], "337": ["n02363005", "beaver"], "338": ["n02364673", "guinea_pig"], "339": ["n02389026", "sorrel"], "340": ["n02391049", "zebra"], "341": ["n02395406", "hog"], "342": ["n02396427", "wild_boar"], "343": ["n02397096", "warthog"], "344": ["n02398521", "hippopotamus"], "345": ["n02403003", "ox"], "346": ["n02408429", "water_buffalo"], "347": ["n02410509", "bison"], "348": ["n02412080", "ram"], "349": ["n02415577", "bighorn"], "350": ["n02417914", "ibex"], "351": ["n02422106", "hartebeest"], "352": ["n02422699", "impala"], "353": ["n02423022", "gazelle"], "354": ["n02437312", "Arabian_camel"], "355": ["n02437616", "llama"], "356": ["n02441942", "weasel"], "357": ["n02442845", "mink"], "358": ["n02443114", "polecat"], "359": ["n02443484", "black-footed_ferret"], "360": ["n02444819", "otter"], "361": ["n02445715", "skunk"], "362": ["n02447366", "badger"], "363": ["n02454379", "armadillo"], "364": ["n02457408", "three-toed_sloth"], "365": ["n02480495", "orangutan"], "366": ["n02480855", "gorilla"], "367": ["n02481823", "chimpanzee"], "368": ["n02483362", "gibbon"], "369": ["n02483708", "siamang"], "370": ["n02484975", "guenon"], "371": ["n02486261", "patas"], "372": ["n02486410", "baboon"], "373": ["n02487347", "macaque"], "374": ["n02488291", "langur"], "375": ["n02488702", "colobus"], "376": ["n02489166", "proboscis_monkey"], "377": ["n02490219", "marmoset"], "378": ["n02492035", "capuchin"], "379": ["n02492660", "howler_monkey"], "380": ["n02493509", "titi"], "381": ["n02493793", "spider_monkey"], "382": ["n02494079", "squirrel_monkey"], "383": ["n02497673", "Madagascar_cat"], "384": ["n02500267", "indri"], "385": ["n02504013", "Indian_elephant"], "386": ["n02504458", "African_elephant"], "387": ["n02509815", "lesser_panda"], "388": ["n02510455", "giant_panda"], "389": ["n02514041", 
"barracouta"], "390": ["n02526121", "eel"], "391": ["n02536864", "coho"], "392": ["n02606052", "rock_beauty"], "393": ["n02607072", "anemone_fish"], "394": ["n02640242", "sturgeon"], "395": ["n02641379", "gar"], "396": ["n02643566", "lionfish"], "397": ["n02655020", "puffer"], "398": ["n02666196", "abacus"], "399": ["n02667093", "abaya"], "400": ["n02669723", "academic_gown"], "401": ["n02672831", "accordion"], "402": ["n02676566", "acoustic_guitar"], "403": ["n02687172", "aircraft_carrier"], "404": ["n02690373", "airliner"], "405": ["n02692877", "airship"], "406": ["n02699494", "altar"], "407": ["n02701002", "ambulance"], "408": ["n02704792", "amphibian"], "409": ["n02708093", "analog_clock"], "410": ["n02727426", "apiary"], "411": ["n02730930", "apron"], "412": ["n02747177", "ashcan"], "413": ["n02749479", "assault_rifle"], "414": ["n02769748", "backpack"], "415": ["n02776631", "bakery"], "416": ["n02777292", "balance_beam"], "417": ["n02782093", "balloon"], "418": ["n02783161", "ballpoint"], "419": ["n02786058", "Band_Aid"], "420": ["n02787622", "banjo"], "421": ["n02788148", "bannister"], "422": ["n02790996", "barbell"], "423": ["n02791124", "barber_chair"], "424": ["n02791270", "barbershop"], "425": ["n02793495", "barn"], "426": ["n02794156", "barometer"], "427": ["n02795169", "barrel"], "428": ["n02797295", "barrow"], "429": ["n02799071", "baseball"], "430": ["n02802426", "basketball"], "431": ["n02804414", "bassinet"], "432": ["n02804610", "bassoon"], "433": ["n02807133", "bathing_cap"], "434": ["n02808304", "bath_towel"], "435": ["n02808440", "bathtub"], "436": ["n02814533", "beach_wagon"], "437": ["n02814860", "beacon"], "438": ["n02815834", "beaker"], "439": ["n02817516", "bearskin"], "440": ["n02823428", "beer_bottle"], "441": ["n02823750", "beer_glass"], "442": ["n02825657", "bell_cote"], "443": ["n02834397", "bib"], "444": ["n02835271", "bicycle-built-for-two"], "445": ["n02837789", "bikini"], "446": ["n02840245", "binder"], "447": ["n02841315", "binoculars"], "448": ["n02843684", "birdhouse"], "449": ["n02859443", "boathouse"], "450": ["n02860847", "bobsled"], "451": ["n02865351", "bolo_tie"], "452": ["n02869837", "bonnet"], "453": ["n02870880", "bookcase"], "454": ["n02871525", "bookshop"], "455": ["n02877765", "bottlecap"], "456": ["n02879718", "bow"], "457": ["n02883205", "bow_tie"], "458": ["n02892201", "brass"], "459": ["n02892767", "brassiere"], "460": ["n02894605", "breakwater"], "461": ["n02895154", "breastplate"], "462": ["n02906734", "broom"], "463": ["n02909870", "bucket"], "464": ["n02910353", "buckle"], "465": ["n02916936", "bulletproof_vest"], "466": ["n02917067", "bullet_train"], "467": ["n02927161", "butcher_shop"], "468": ["n02930766", "cab"], "469": ["n02939185", "caldron"], "470": ["n02948072", "candle"], "471": ["n02950826", "cannon"], "472": ["n02951358", "canoe"], "473": ["n02951585", "can_opener"], "474": ["n02963159", "cardigan"], "475": ["n02965783", "car_mirror"], "476": ["n02966193", "carousel"], "477": ["n02966687", "carpenter's_kit"], "478": ["n02971356", "carton"], "479": ["n02974003", "car_wheel"], "480": ["n02977058", "cash_machine"], "481": ["n02978881", "cassette"], "482": ["n02979186", "cassette_player"], "483": ["n02980441", "castle"], "484": ["n02981792", "catamaran"], "485": ["n02988304", "CD_player"], "486": ["n02992211", "cello"], "487": ["n02992529", "cellular_telephone"], "488": ["n02999410", "chain"], "489": ["n03000134", "chainlink_fence"], "490": ["n03000247", "chain_mail"], "491": ["n03000684", "chain_saw"], "492": ["n03014705", 
"chest"], "493": ["n03016953", "chiffonier"], "494": ["n03017168", "chime"], "495": ["n03018349", "china_cabinet"], "496": ["n03026506", "Christmas_stocking"], "497": ["n03028079", "church"], "498": ["n03032252", "cinema"], "499": ["n03041632", "cleaver"], "500": ["n03042490", "cliff_dwelling"], "501": ["n03045698", "cloak"], "502": ["n03047690", "clog"], "503": ["n03062245", "cocktail_shaker"], "504": ["n03063599", "coffee_mug"], "505": ["n03063689", "coffeepot"], "506": ["n03065424", "coil"], "507": ["n03075370", "combination_lock"], "508": ["n03085013", "computer_keyboard"], "509": ["n03089624", "confectionery"], "510": ["n03095699", "container_ship"], "511": ["n03100240", "convertible"], "512": ["n03109150", "corkscrew"], "513": ["n03110669", "cornet"], "514": ["n03124043", "cowboy_boot"], "515": ["n03124170", "cowboy_hat"], "516": ["n03125729", "cradle"], "517": ["n03126707", "crane"], "518": ["n03127747", "crash_helmet"], "519": ["n03127925", "crate"], "520": ["n03131574", "crib"], "521": ["n03133878", "Crock_Pot"], "522": ["n03134739", "croquet_ball"], "523": ["n03141823", "crutch"], "524": ["n03146219", "cuirass"], "525": ["n03160309", "dam"], "526": ["n03179701", "desk"], "527": ["n03180011", "desktop_computer"], "528": ["n03187595", "dial_telephone"], "529": ["n03188531", "diaper"], "530": ["n03196217", "digital_clock"], "531": ["n03197337", "digital_watch"], "532": ["n03201208", "dining_table"], "533": ["n03207743", "dishrag"], "534": ["n03207941", "dishwasher"], "535": ["n03208938", "disk_brake"], "536": ["n03216828", "dock"], "537": ["n03218198", "dogsled"], "538": ["n03220513", "dome"], "539": ["n03223299", "doormat"], "540": ["n03240683", "drilling_platform"], "541": ["n03249569", "drum"], "542": ["n03250847", "drumstick"], "543": ["n03255030", "dumbbell"], "544": ["n03259280", "Dutch_oven"], "545": ["n03271574", "electric_fan"], "546": ["n03272010", "electric_guitar"], "547": ["n03272562", "electric_locomotive"], "548": ["n03290653", "entertainment_center"], "549": ["n03291819", "envelope"], "550": ["n03297495", "espresso_maker"], "551": ["n03314780", "face_powder"], "552": ["n03325584", "feather_boa"], "553": ["n03337140", "file"], "554": ["n03344393", "fireboat"], "555": ["n03345487", "fire_engine"], "556": ["n03347037", "fire_screen"], "557": ["n03355925", "flagpole"], "558": ["n03372029", "flute"], "559": ["n03376595", "folding_chair"], "560": ["n03379051", "football_helmet"], "561": ["n03384352", "forklift"], "562": ["n03388043", "fountain"], "563": ["n03388183", "fountain_pen"], "564": ["n03388549", "four-poster"], "565": ["n03393912", "freight_car"], "566": ["n03394916", "French_horn"], "567": ["n03400231", "frying_pan"], "568": ["n03404251", "fur_coat"], "569": ["n03417042", "garbage_truck"], "570": ["n03424325", "gasmask"], "571": ["n03425413", "gas_pump"], "572": ["n03443371", "goblet"], "573": ["n03444034", "go-kart"], "574": ["n03445777", "golf_ball"], "575": ["n03445924", "golfcart"], "576": ["n03447447", "gondola"], "577": ["n03447721", "gong"], "578": ["n03450230", "gown"], "579": ["n03452741", "grand_piano"], "580": ["n03457902", "greenhouse"], "581": ["n03459775", "grille"], "582": ["n03461385", "grocery_store"], "583": ["n03467068", "guillotine"], "584": ["n03476684", "hair_slide"], "585": ["n03476991", "hair_spray"], "586": ["n03478589", "half_track"], "587": ["n03481172", "hammer"], "588": ["n03482405", "hamper"], "589": ["n03483316", "hand_blower"], "590": ["n03485407", "hand-held_computer"], "591": ["n03485794", "handkerchief"], "592": ["n03492542", 
"hard_disc"], "593": ["n03494278", "harmonica"], "594": ["n03495258", "harp"], "595": ["n03496892", "harvester"], "596": ["n03498962", "hatchet"], "597": ["n03527444", "holster"], "598": ["n03529860", "home_theater"], "599": ["n03530642", "honeycomb"], "600": ["n03532672", "hook"], "601": ["n03534580", "hoopskirt"], "602": ["n03535780", "horizontal_bar"], "603": ["n03538406", "horse_cart"], "604": ["n03544143", "hourglass"], "605": ["n03584254", "iPod"], "606": ["n03584829", "iron"], "607": ["n03590841", "jack-o'-lantern"], "608": ["n03594734", "jean"], "609": ["n03594945", "jeep"], "610": ["n03595614", "jersey"], "611": ["n03598930", "jigsaw_puzzle"], "612": ["n03599486", "jinrikisha"], "613": ["n03602883", "joystick"], "614": ["n03617480", "kimono"], "615": ["n03623198", "knee_pad"], "616": ["n03627232", "knot"], "617": ["n03630383", "lab_coat"], "618": ["n03633091", "ladle"], "619": ["n03637318", "lampshade"], "620": ["n03642806", "laptop"], "621": ["n03649909", "lawn_mower"], "622": ["n03657121", "lens_cap"], "623": ["n03658185", "letter_opener"], "624": ["n03661043", "library"], "625": ["n03662601", "lifeboat"], "626": ["n03666591", "lighter"], "627": ["n03670208", "limousine"], "628": ["n03673027", "liner"], "629": ["n03676483", "lipstick"], "630": ["n03680355", "Loafer"], "631": ["n03690938", "lotion"], "632": ["n03691459", "loudspeaker"], "633": ["n03692522", "loupe"], "634": ["n03697007", "lumbermill"], "635": ["n03706229", "magnetic_compass"], "636": ["n03709823", "mailbag"], "637": ["n03710193", "mailbox"], "638": ["n03710637", "maillot"], "639": ["n03710721", "maillot"], "640": ["n03717622", "manhole_cover"], "641": ["n03720891", "maraca"], "642": ["n03721384", "marimba"], "643": ["n03724870", "mask"], "644": ["n03729826", "matchstick"], "645": ["n03733131", "maypole"], "646": ["n03733281", "maze"], "647": ["n03733805", "measuring_cup"], "648": ["n03742115", "medicine_chest"], "649": ["n03743016", "megalith"], "650": ["n03759954", "microphone"], "651": ["n03761084", "microwave"], "652": ["n03763968", "military_uniform"], "653": ["n03764736", "milk_can"], "654": ["n03769881", "minibus"], "655": ["n03770439", "miniskirt"], "656": ["n03770679", "minivan"], "657": ["n03773504", "missile"], "658": ["n03775071", "mitten"], "659": ["n03775546", "mixing_bowl"], "660": ["n03776460", "mobile_home"], "661": ["n03777568", "Model_T"], "662": ["n03777754", "modem"], "663": ["n03781244", "monastery"], "664": ["n03782006", "monitor"], "665": ["n03785016", "moped"], "666": ["n03786901", "mortar"], "667": ["n03787032", "mortarboard"], "668": ["n03788195", "mosque"], "669": ["n03788365", "mosquito_net"], "670": ["n03791053", "motor_scooter"], "671": ["n03792782", "mountain_bike"], "672": ["n03792972", "mountain_tent"], "673": ["n03793489", "mouse"], "674": ["n03794056", "mousetrap"], "675": ["n03796401", "moving_van"], "676": ["n03803284", "muzzle"], "677": ["n03804744", "nail"], "678": ["n03814639", "neck_brace"], "679": ["n03814906", "necklace"], "680": ["n03825788", "nipple"], "681": ["n03832673", "notebook"], "682": ["n03837869", "obelisk"], "683": ["n03838899", "oboe"], "684": ["n03840681", "ocarina"], "685": ["n03841143", "odometer"], "686": ["n03843555", "oil_filter"], "687": ["n03854065", "organ"], "688": ["n03857828", "oscilloscope"], "689": ["n03866082", "overskirt"], "690": ["n03868242", "oxcart"], "691": ["n03868863", "oxygen_mask"], "692": ["n03871628", "packet"], "693": ["n03873416", "paddle"], "694": ["n03874293", "paddlewheel"], "695": ["n03874599", "padlock"], "696": 
["n03876231", "paintbrush"], "697": ["n03877472", "pajama"], "698": ["n03877845", "palace"], "699": ["n03884397", "panpipe"], "700": ["n03887697", "paper_towel"], "701": ["n03888257", "parachute"], "702": ["n03888605", "parallel_bars"], "703": ["n03891251", "park_bench"], "704": ["n03891332", "parking_meter"], "705": ["n03895866", "passenger_car"], "706": ["n03899768", "patio"], "707": ["n03902125", "pay-phone"], "708": ["n03903868", "pedestal"], "709": ["n03908618", "pencil_box"], "710": ["n03908714", "pencil_sharpener"], "711": ["n03916031", "perfume"], "712": ["n03920288", "Petri_dish"], "713": ["n03924679", "photocopier"], "714": ["n03929660", "pick"], "715": ["n03929855", "pickelhaube"], "716": ["n03930313", "picket_fence"], "717": ["n03930630", "pickup"], "718": ["n03933933", "pier"], "719": ["n03935335", "piggy_bank"], "720": ["n03937543", "pill_bottle"], "721": ["n03938244", "pillow"], "722": ["n03942813", "ping-pong_ball"], "723": ["n03944341", "pinwheel"], "724": ["n03947888", "pirate"], "725": ["n03950228", "pitcher"], "726": ["n03954731", "plane"], "727": ["n03956157", "planetarium"], "728": ["n03958227", "plastic_bag"], "729": ["n03961711", "plate_rack"], "730": ["n03967562", "plow"], "731": ["n03970156", "plunger"], "732": ["n03976467", "Polaroid_camera"], "733": ["n03976657", "pole"], "734": ["n03977966", "police_van"], "735": ["n03980874", "poncho"], "736": ["n03982430", "pool_table"], "737": ["n03983396", "pop_bottle"], "738": ["n03991062", "pot"], "739": ["n03992509", "potter's_wheel"], "740": ["n03995372", "power_drill"], "741": ["n03998194", "prayer_rug"], "742": ["n04004767", "printer"], "743": ["n04005630", "prison"], "744": ["n04008634", "projectile"], "745": ["n04009552", "projector"], "746": ["n04019541", "puck"], "747": ["n04023962", "punching_bag"], "748": ["n04026417", "purse"], "749": ["n04033901", "quill"], "750": ["n04033995", "quilt"], "751": ["n04037443", "racer"], "752": ["n04039381", "racket"], "753": ["n04040759", "radiator"], "754": ["n04041544", "radio"], "755": ["n04044716", "radio_telescope"], "756": ["n04049303", "rain_barrel"], "757": ["n04065272", "recreational_vehicle"], "758": ["n04067472", "reel"], "759": ["n04069434", "reflex_camera"], "760": ["n04070727", "refrigerator"], "761": ["n04074963", "remote_control"], "762": ["n04081281", "restaurant"], "763": ["n04086273", "revolver"], "764": ["n04090263", "rifle"], "765": ["n04099969", "rocking_chair"], "766": ["n04111531", "rotisserie"], "767": ["n04116512", "rubber_eraser"], "768": ["n04118538", "rugby_ball"], "769": ["n04118776", "rule"], "770": ["n04120489", "running_shoe"], "771": ["n04125021", "safe"], "772": ["n04127249", "safety_pin"], "773": ["n04131690", "saltshaker"], "774": ["n04133789", "sandal"], "775": ["n04136333", "sarong"], "776": ["n04141076", "sax"], "777": ["n04141327", "scabbard"], "778": ["n04141975", "scale"], "779": ["n04146614", "school_bus"], "780": ["n04147183", "schooner"], "781": ["n04149813", "scoreboard"], "782": ["n04152593", "screen"], "783": ["n04153751", "screw"], "784": ["n04154565", "screwdriver"], "785": ["n04162706", "seat_belt"], "786": ["n04179913", "sewing_machine"], "787": ["n04192698", "shield"], "788": ["n04200800", "shoe_shop"], "789": ["n04201297", "shoji"], "790": ["n04204238", "shopping_basket"], "791": ["n04204347", "shopping_cart"], "792": ["n04208210", "shovel"], "793": ["n04209133", "shower_cap"], "794": ["n04209239", "shower_curtain"], "795": ["n04228054", "ski"], "796": ["n04229816", "ski_mask"], "797": ["n04235860", "sleeping_bag"], "798": 
["n04238763", "slide_rule"], "799": ["n04239074", "sliding_door"], "800": ["n04243546", "slot"], "801": ["n04251144", "snorkel"], "802": ["n04252077", "snowmobile"], "803": ["n04252225", "snowplow"], "804": ["n04254120", "soap_dispenser"], "805": ["n04254680", "soccer_ball"], "806": ["n04254777", "sock"], "807": ["n04258138", "solar_dish"], "808": ["n04259630", "sombrero"], "809": ["n04263257", "soup_bowl"], "810": ["n04264628", "space_bar"], "811": ["n04265275", "space_heater"], "812": ["n04266014", "space_shuttle"], "813": ["n04270147", "spatula"], "814": ["n04273569", "speedboat"], "815": ["n04275548", "spider_web"], "816": ["n04277352", "spindle"], "817": ["n04285008", "sports_car"], "818": ["n04286575", "spotlight"], "819": ["n04296562", "stage"], "820": ["n04310018", "steam_locomotive"], "821": ["n04311004", "steel_arch_bridge"], "822": ["n04311174", "steel_drum"], "823": ["n04317175", "stethoscope"], "824": ["n04325704", "stole"], "825": ["n04326547", "stone_wall"], "826": ["n04328186", "stopwatch"], "827": ["n04330267", "stove"], "828": ["n04332243", "strainer"], "829": ["n04335435", "streetcar"], "830": ["n04336792", "stretcher"], "831": ["n04344873", "studio_couch"], "832": ["n04346328", "stupa"], "833": ["n04347754", "submarine"], "834": ["n04350905", "suit"], "835": ["n04355338", "sundial"], "836": ["n04355933", "sunglass"], "837": ["n04356056", "sunglasses"], "838": ["n04357314", "sunscreen"], "839": ["n04366367", "suspension_bridge"], "840": ["n04367480", "swab"], "841": ["n04370456", "sweatshirt"], "842": ["n04371430", "swimming_trunks"], "843": ["n04371774", "swing"], "844": ["n04372370", "switch"], "845": ["n04376876", "syringe"], "846": ["n04380533", "table_lamp"], "847": ["n04389033", "tank"], "848": ["n04392985", "tape_player"], "849": ["n04398044", "teapot"], "850": ["n04399382", "teddy"], "851": ["n04404412", "television"], "852": ["n04409515", "tennis_ball"], "853": ["n04417672", "thatch"], "854": ["n04418357", "theater_curtain"], "855": ["n04423845", "thimble"], "856": ["n04428191", "thresher"], "857": ["n04429376", "throne"], "858": ["n04435653", "tile_roof"], "859": ["n04442312", "toaster"], "860": ["n04443257", "tobacco_shop"], "861": ["n04447861", "toilet_seat"], "862": ["n04456115", "torch"], "863": ["n04458633", "totem_pole"], "864": ["n04461696", "tow_truck"], "865": ["n04462240", "toyshop"], "866": ["n04465501", "tractor"], "867": ["n04467665", "trailer_truck"], "868": ["n04476259", "tray"], "869": ["n04479046", "trench_coat"], "870": ["n04482393", "tricycle"], "871": ["n04483307", "trimaran"], "872": ["n04485082", "tripod"], "873": ["n04486054", "triumphal_arch"], "874": ["n04487081", "trolleybus"], "875": ["n04487394", "trombone"], "876": ["n04493381", "tub"], "877": ["n04501370", "turnstile"], "878": ["n04505470", "typewriter_keyboard"], "879": ["n04507155", "umbrella"], "880": ["n04509417", "unicycle"], "881": ["n04515003", "upright"], "882": ["n04517823", "vacuum"], "883": ["n04522168", "vase"], "884": ["n04523525", "vault"], "885": ["n04525038", "velvet"], "886": ["n04525305", "vending_machine"], "887": ["n04532106", "vestment"], "888": ["n04532670", "viaduct"], "889": ["n04536866", "violin"], "890": ["n04540053", "volleyball"], "891": ["n04542943", "waffle_iron"], "892": ["n04548280", "wall_clock"], "893": ["n04548362", "wallet"], "894": ["n04550184", "wardrobe"], "895": ["n04552348", "warplane"], "896": ["n04553703", "washbasin"], "897": ["n04554684", "washer"], "898": ["n04557648", "water_bottle"], "899": ["n04560804", "water_jug"], "900": 
["n04562935", "water_tower"], "901": ["n04579145", "whiskey_jug"], "902": ["n04579432", "whistle"], "903": ["n04584207", "wig"], "904": ["n04589890", "window_screen"], "905": ["n04590129", "window_shade"], "906": ["n04591157", "Windsor_tie"], "907": ["n04591713", "wine_bottle"], "908": ["n04592741", "wing"], "909": ["n04596742", "wok"], "910": ["n04597913", "wooden_spoon"], "911": ["n04599235", "wool"], "912": ["n04604644", "worm_fence"], "913": ["n04606251", "wreck"], "914": ["n04612504", "yawl"], "915": ["n04613696", "yurt"], "916": ["n06359193", "web_site"], "917": ["n06596364", "comic_book"], "918": ["n06785654", "crossword_puzzle"], "919": ["n06794110", "street_sign"], "920": ["n06874185", "traffic_light"], "921": ["n07248320", "book_jacket"], "922": ["n07565083", "menu"], "923": ["n07579787", "plate"], "924": ["n07583066", "guacamole"], "925": ["n07584110", "consomme"], "926": ["n07590611", "hot_pot"], "927": ["n07613480", "trifle"], "928": ["n07614500", "ice_cream"], "929": ["n07615774", "ice_lolly"], "930": ["n07684084", "French_loaf"], "931": ["n07693725", "bagel"], "932": ["n07695742", "pretzel"], "933": ["n07697313", "cheeseburger"], "934": ["n07697537", "hotdog"], "935": ["n07711569", "mashed_potato"], "936": ["n07714571", "head_cabbage"], "937": ["n07714990", "broccoli"], "938": ["n07715103", "cauliflower"], "939": ["n07716358", "zucchini"], "940": ["n07716906", "spaghetti_squash"], "941": ["n07717410", "acorn_squash"], "942": ["n07717556", "butternut_squash"], "943": ["n07718472", "cucumber"], "944": ["n07718747", "artichoke"], "945": ["n07720875", "bell_pepper"], "946": ["n07730033", "cardoon"], "947": ["n07734744", "mushroom"], "948": ["n07742313", "Granny_Smith"], "949": ["n07745940", "strawberry"], "950": ["n07747607", "orange"], "951": ["n07749582", "lemon"], "952": ["n07753113", "fig"], "953": ["n07753275", "pineapple"], "954": ["n07753592", "banana"], "955": ["n07754684", "jackfruit"], "956": ["n07760859", "custard_apple"], "957": ["n07768694", "pomegranate"], "958": ["n07802026", "hay"], "959": ["n07831146", "carbonara"], "960": ["n07836838", "chocolate_sauce"], "961": ["n07860988", "dough"], "962": ["n07871810", "meat_loaf"], "963": ["n07873807", "pizza"], "964": ["n07875152", "potpie"], "965": ["n07880968", "burrito"], "966": ["n07892512", "red_wine"], "967": ["n07920052", "espresso"], "968": ["n07930864", "cup"], "969": ["n07932039", "eggnog"], "970": ["n09193705", "alp"], "971": ["n09229709", "bubble"], "972": ["n09246464", "cliff"], "973": ["n09256479", "coral_reef"], "974": ["n09288635", "geyser"], "975": ["n09332890", "lakeside"], "976": ["n09399592", "promontory"], "977": ["n09421951", "sandbar"], "978": ["n09428293", "seashore"], "979": ["n09468604", "valley"], "980": ["n09472597", "volcano"], "981": ["n09835506", "ballplayer"], "982": ["n10148035", "groom"], "983": ["n10565667", "scuba_diver"], "984": ["n11879895", "rapeseed"], "985": ["n11939491", "daisy"], "986": ["n12057211", "yellow_lady's_slipper"], "987": ["n12144580", "corn"], "988": ["n12267677", "acorn"], "989": ["n12620546", "hip"], "990": ["n12768682", "buckeye"], "991": ["n12985857", "coral_fungus"], "992": ["n12998815", "agaric"], "993": ["n13037406", "gyromitra"], "994": ["n13040303", "stinkhorn"], "995": ["n13044778", "earthstar"], "996": ["n13052670", "hen-of-the-woods"], "997": ["n13054560", "bolete"], "998": ["n13133613", "ear"], "999": ["n15075141", "toilet_tissue"]}
autoencoder.py ADDED
@@ -0,0 +1,522 @@
1
+ # Borrowed from U-ViT: https://github.com/baofff/U-ViT/blob/main/libs/autoencoder.py.
2
+ # The original code is licensed under MIT License, which is can be found at licenses/LICENSE_UVIT.txt.
3
+
4
+ import torch
5
+ import torch.nn as nn
6
+ import numpy as np
7
+ from einops import rearrange
8
+
9
+
10
+ class LinearAttention(nn.Module):
11
+ def __init__(self, dim, heads=4, dim_head=32):
12
+ super().__init__()
13
+ self.heads = heads
14
+ hidden_dim = dim_head * heads
15
+ self.to_qkv = nn.Conv2d(dim, hidden_dim * 3, 1, bias = False)
16
+ self.to_out = nn.Conv2d(hidden_dim, dim, 1)
17
+
18
+ def forward(self, x):
19
+ b, c, h, w = x.shape
20
+ qkv = self.to_qkv(x)
21
+ q, k, v = rearrange(qkv, 'b (qkv heads c) h w -> qkv b heads c (h w)', heads = self.heads, qkv=3)
22
+ k = k.softmax(dim=-1)
23
+ context = torch.einsum('bhdn,bhen->bhde', k, v)
24
+ out = torch.einsum('bhde,bhdn->bhen', context, q)
25
+ out = rearrange(out, 'b heads c (h w) -> b (heads c) h w', heads=self.heads, h=h, w=w)
26
+ return self.to_out(out)
27
+
28
+
29
+ def nonlinearity(x):
30
+ # swish
31
+ return x*torch.sigmoid(x)
32
+
33
+
34
+ def Normalize(in_channels, num_groups=32):
35
+ return torch.nn.GroupNorm(num_groups=num_groups, num_channels=in_channels, eps=1e-6, affine=True)
36
+
37
+
38
+ class Upsample(nn.Module):
39
+ def __init__(self, in_channels, with_conv):
40
+ super().__init__()
41
+ self.with_conv = with_conv
42
+ if self.with_conv:
43
+ self.conv = torch.nn.Conv2d(in_channels,
44
+ in_channels,
45
+ kernel_size=3,
46
+ stride=1,
47
+ padding=1)
48
+
49
+ def forward(self, x):
50
+ x = torch.nn.functional.interpolate(x, scale_factor=2.0, mode="nearest")
51
+ if self.with_conv:
52
+ x = self.conv(x)
53
+ return x
54
+
55
+
56
+ class Downsample(nn.Module):
57
+ def __init__(self, in_channels, with_conv):
58
+ super().__init__()
59
+ self.with_conv = with_conv
60
+ if self.with_conv:
61
+ # no asymmetric padding in torch conv, must do it ourselves
62
+ self.conv = torch.nn.Conv2d(in_channels,
63
+ in_channels,
64
+ kernel_size=3,
65
+ stride=2,
66
+ padding=0)
67
+
68
+ def forward(self, x):
69
+ if self.with_conv:
70
+ pad = (0,1,0,1)
71
+ x = torch.nn.functional.pad(x, pad, mode="constant", value=0)
72
+ x = self.conv(x)
73
+ else:
74
+ x = torch.nn.functional.avg_pool2d(x, kernel_size=2, stride=2)
75
+ return x
76
+
77
+
78
+ class ResnetBlock(nn.Module):
79
+ def __init__(self, *, in_channels, out_channels=None, conv_shortcut=False,
80
+ dropout, temb_channels=512):
81
+ super().__init__()
82
+ self.in_channels = in_channels
83
+ out_channels = in_channels if out_channels is None else out_channels
84
+ self.out_channels = out_channels
85
+ self.use_conv_shortcut = conv_shortcut
86
+
87
+ self.norm1 = Normalize(in_channels)
88
+ self.conv1 = torch.nn.Conv2d(in_channels,
89
+ out_channels,
90
+ kernel_size=3,
91
+ stride=1,
92
+ padding=1)
93
+ if temb_channels > 0:
94
+ self.temb_proj = torch.nn.Linear(temb_channels,
95
+ out_channels)
96
+ self.norm2 = Normalize(out_channels)
97
+ self.dropout = torch.nn.Dropout(dropout)
98
+ self.conv2 = torch.nn.Conv2d(out_channels,
99
+ out_channels,
100
+                                      kernel_size=3,
+                                      stride=1,
+                                      padding=1)
+         if self.in_channels != self.out_channels:
+             if self.use_conv_shortcut:
+                 self.conv_shortcut = torch.nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=1, padding=1)
+             else:
+                 self.nin_shortcut = torch.nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=1, padding=0)
+
+     def forward(self, x, temb):
+         h = x
+         h = self.norm1(h)
+         h = nonlinearity(h)
+         h = self.conv1(h)
+
+         if temb is not None:
+             h = h + self.temb_proj(nonlinearity(temb))[:, :, None, None]
+
+         h = self.norm2(h)
+         h = nonlinearity(h)
+         h = self.dropout(h)
+         h = self.conv2(h)
+
+         if self.in_channels != self.out_channels:
+             if self.use_conv_shortcut:
+                 x = self.conv_shortcut(x)
+             else:
+                 x = self.nin_shortcut(x)
+
+         return x + h
+
+
+ class LinAttnBlock(LinearAttention):
+     """to match AttnBlock usage"""
+     def __init__(self, in_channels):
+         super().__init__(dim=in_channels, heads=1, dim_head=in_channels)
+
+
+ class AttnBlock(nn.Module):
+     def __init__(self, in_channels):
+         super().__init__()
+         self.in_channels = in_channels
+
+         self.norm = Normalize(in_channels)
+         self.q = torch.nn.Conv2d(in_channels, in_channels, kernel_size=1, stride=1, padding=0)
+         self.k = torch.nn.Conv2d(in_channels, in_channels, kernel_size=1, stride=1, padding=0)
+         self.v = torch.nn.Conv2d(in_channels, in_channels, kernel_size=1, stride=1, padding=0)
+         self.proj_out = torch.nn.Conv2d(in_channels, in_channels, kernel_size=1, stride=1, padding=0)
+
+     def forward(self, x):
+         h_ = x
+         h_ = self.norm(h_)
+         q = self.q(h_)
+         k = self.k(h_)
+         v = self.v(h_)
+
+         # compute attention
+         b, c, h, w = q.shape
+         q = q.reshape(b, c, h * w)
+         q = q.permute(0, 2, 1)      # b,hw,c
+         k = k.reshape(b, c, h * w)  # b,c,hw
+         w_ = torch.bmm(q, k)        # b,hw,hw    w[b,i,j] = sum_c q[b,i,c] k[b,c,j]
+         w_ = w_ * (int(c) ** (-0.5))
+         w_ = torch.nn.functional.softmax(w_, dim=2)
+
+         # attend to values
+         v = v.reshape(b, c, h * w)
+         w_ = w_.permute(0, 2, 1)    # b,hw,hw (first hw of k, second of q)
+         h_ = torch.bmm(v, w_)       # b,c,hw (hw of q)   h_[b,c,j] = sum_i v[b,c,i] w_[b,i,j]
+         h_ = h_.reshape(b, c, h, w)
+
+         h_ = self.proj_out(h_)
+
+         return x + h_
+
+
+ def make_attn(in_channels, attn_type="vanilla"):
+     assert attn_type in ["vanilla", "linear", "none"], f'attn_type {attn_type} unknown'
+     print(f"making attention of type '{attn_type}' with {in_channels} in_channels")
+     if attn_type == "vanilla":
+         return AttnBlock(in_channels)
+     elif attn_type == "none":
+         return nn.Identity(in_channels)
+     else:
+         return LinAttnBlock(in_channels)
+
+
+ class Encoder(nn.Module):
+     def __init__(self, *, ch, out_ch, ch_mult=(1, 2, 4, 8), num_res_blocks,
+                  attn_resolutions, dropout=0.0, resamp_with_conv=True, in_channels,
+                  resolution, z_channels, double_z=True, use_linear_attn=False, attn_type="vanilla",
+                  **ignore_kwargs):
+         super().__init__()
+         if use_linear_attn: attn_type = "linear"
+         self.ch = ch
+         self.temb_ch = 0
+         self.num_resolutions = len(ch_mult)
+         self.num_res_blocks = num_res_blocks
+         self.resolution = resolution
+         self.in_channels = in_channels
+
+         # downsampling
+         self.conv_in = torch.nn.Conv2d(in_channels, self.ch, kernel_size=3, stride=1, padding=1)
+
+         curr_res = resolution
+         in_ch_mult = (1,) + tuple(ch_mult)
+         self.in_ch_mult = in_ch_mult
+         self.down = nn.ModuleList()
+         for i_level in range(self.num_resolutions):
+             block = nn.ModuleList()
+             attn = nn.ModuleList()
+             block_in = ch * in_ch_mult[i_level]
+             block_out = ch * ch_mult[i_level]
+             for i_block in range(self.num_res_blocks):
+                 block.append(ResnetBlock(in_channels=block_in, out_channels=block_out,
+                                          temb_channels=self.temb_ch, dropout=dropout))
+                 block_in = block_out
+                 if curr_res in attn_resolutions:
+                     attn.append(make_attn(block_in, attn_type=attn_type))
+             down = nn.Module()
+             down.block = block
+             down.attn = attn
+             if i_level != self.num_resolutions - 1:
+                 down.downsample = Downsample(block_in, resamp_with_conv)
+                 curr_res = curr_res // 2
+             self.down.append(down)
+
+         # middle
+         self.mid = nn.Module()
+         self.mid.block_1 = ResnetBlock(in_channels=block_in, out_channels=block_in,
+                                        temb_channels=self.temb_ch, dropout=dropout)
+         self.mid.attn_1 = make_attn(block_in, attn_type=attn_type)
+         self.mid.block_2 = ResnetBlock(in_channels=block_in, out_channels=block_in,
+                                        temb_channels=self.temb_ch, dropout=dropout)
+
+         # end
+         self.norm_out = Normalize(block_in)
+         self.conv_out = torch.nn.Conv2d(block_in, 2 * z_channels if double_z else z_channels,
+                                         kernel_size=3, stride=1, padding=1)
+
+     def forward(self, x):
+         # timestep embedding
+         temb = None
+
+         # downsampling
+         hs = [self.conv_in(x)]
+         for i_level in range(self.num_resolutions):
+             for i_block in range(self.num_res_blocks):
+                 h = self.down[i_level].block[i_block](hs[-1], temb)
+                 if len(self.down[i_level].attn) > 0:
+                     h = self.down[i_level].attn[i_block](h)
+                 hs.append(h)
+             if i_level != self.num_resolutions - 1:
+                 hs.append(self.down[i_level].downsample(hs[-1]))
+
+         # middle
+         h = hs[-1]
+         h = self.mid.block_1(h, temb)
+         h = self.mid.attn_1(h)
+         h = self.mid.block_2(h, temb)
+
+         # end
+         h = self.norm_out(h)
+         h = nonlinearity(h)
+         h = self.conv_out(h)
+         return h
+
+
+ class Decoder(nn.Module):
+     def __init__(self, *, ch, out_ch, ch_mult=(1, 2, 4, 8), num_res_blocks,
+                  attn_resolutions, dropout=0.0, resamp_with_conv=True, in_channels,
+                  resolution, z_channels, give_pre_end=False, tanh_out=False, use_linear_attn=False,
+                  attn_type="vanilla", **ignorekwargs):
+         super().__init__()
+         if use_linear_attn: attn_type = "linear"
+         self.ch = ch
+         self.temb_ch = 0
+         self.num_resolutions = len(ch_mult)
+         self.num_res_blocks = num_res_blocks
+         self.resolution = resolution
+         self.in_channels = in_channels
+         self.give_pre_end = give_pre_end
+         self.tanh_out = tanh_out
+
+         # compute in_ch_mult, block_in and curr_res at lowest res
+         in_ch_mult = (1,) + tuple(ch_mult)
+         block_in = ch * ch_mult[self.num_resolutions - 1]
+         curr_res = resolution // 2 ** (self.num_resolutions - 1)
+         self.z_shape = (1, z_channels, curr_res, curr_res)
+         print("Working with z of shape {} = {} dimensions.".format(
+             self.z_shape, np.prod(self.z_shape)))
+
+         # z to block_in
+         self.conv_in = torch.nn.Conv2d(z_channels, block_in, kernel_size=3, stride=1, padding=1)
+
+         # middle
+         self.mid = nn.Module()
+         self.mid.block_1 = ResnetBlock(in_channels=block_in, out_channels=block_in,
+                                        temb_channels=self.temb_ch, dropout=dropout)
+         self.mid.attn_1 = make_attn(block_in, attn_type=attn_type)
+         self.mid.block_2 = ResnetBlock(in_channels=block_in, out_channels=block_in,
+                                        temb_channels=self.temb_ch, dropout=dropout)
+
+         # upsampling
+         self.up = nn.ModuleList()
+         for i_level in reversed(range(self.num_resolutions)):
+             block = nn.ModuleList()
+             attn = nn.ModuleList()
+             block_out = ch * ch_mult[i_level]
+             for i_block in range(self.num_res_blocks + 1):
+                 block.append(ResnetBlock(in_channels=block_in, out_channels=block_out,
+                                          temb_channels=self.temb_ch, dropout=dropout))
+                 block_in = block_out
+                 if curr_res in attn_resolutions:
+                     attn.append(make_attn(block_in, attn_type=attn_type))
+             up = nn.Module()
+             up.block = block
+             up.attn = attn
+             if i_level != 0:
+                 up.upsample = Upsample(block_in, resamp_with_conv)
+                 curr_res = curr_res * 2
+             self.up.insert(0, up)  # prepend to get consistent order
+
+         # end
+         self.norm_out = Normalize(block_in)
+         self.conv_out = torch.nn.Conv2d(block_in, out_ch, kernel_size=3, stride=1, padding=1)
+
+     def forward(self, z):
+         # assert z.shape[1:] == self.z_shape[1:]
+         self.last_z_shape = z.shape
+
+         # timestep embedding
+         temb = None
+
+         # z to block_in
+         h = self.conv_in(z)
+
+         # middle
+         h = self.mid.block_1(h, temb)
+         h = self.mid.attn_1(h)
+         h = self.mid.block_2(h, temb)
+
+         # upsampling
+         for i_level in reversed(range(self.num_resolutions)):
+             for i_block in range(self.num_res_blocks + 1):
+                 h = self.up[i_level].block[i_block](h, temb)
+                 if len(self.up[i_level].attn) > 0:
+                     h = self.up[i_level].attn[i_block](h)
+             if i_level != 0:
+                 h = self.up[i_level].upsample(h)
+
+         # end
+         if self.give_pre_end:
+             return h
+
+         h = self.norm_out(h)
+         h = nonlinearity(h)
+         h = self.conv_out(h)
+         if self.tanh_out:
+             h = torch.tanh(h)
+         return h
+
+
+ class FrozenAutoencoderKL(nn.Module):
+     def __init__(self, ddconfig, embed_dim, pretrained_path, scale_factor=0.18215):
+         super().__init__()
+         print(f'Create autoencoder with scale_factor={scale_factor}')
+         self.encoder = Encoder(**ddconfig)
+         self.decoder = Decoder(**ddconfig)
+         assert ddconfig["double_z"]
+         self.quant_conv = torch.nn.Conv2d(2 * ddconfig["z_channels"], 2 * embed_dim, 1)
+         self.post_quant_conv = torch.nn.Conv2d(embed_dim, ddconfig["z_channels"], 1)
+         self.embed_dim = embed_dim
+         self.scale_factor = scale_factor
+         m, u = self.load_state_dict(torch.load(pretrained_path, map_location='cpu'))
+         assert len(m) == 0 and len(u) == 0
+         self.eval()
+         self.requires_grad_(False)
+
+     def encode_moments(self, x):
+         h = self.encoder(x)
+         moments = self.quant_conv(h)
+         return moments
+
+     def sample(self, moments):
+         mean, logvar = torch.chunk(moments, 2, dim=1)
+         logvar = torch.clamp(logvar, -30.0, 20.0)
+         std = torch.exp(0.5 * logvar)
+         z = mean + std * torch.randn_like(mean)
+         z = self.scale_factor * z
+         return z
+
+     def encode(self, x):
+         moments = self.encode_moments(x)
+         z = self.sample(moments)
+         return z
+
+     def decode(self, z):
+         z = (1. / self.scale_factor) * z
+         z = self.post_quant_conv(z)
+         dec = self.decoder(z)
+         return dec
+
+     def forward(self, inputs, fn):
+         if fn == 'encode_moments':
+             return self.encode_moments(inputs)
+         elif fn == 'encode':
+             return self.encode(inputs)
+         elif fn == 'decode':
+             return self.decode(inputs)
+         else:
+             raise NotImplementedError
+
+
+ def get_model(pretrained_path, scale_factor=0.18215):
+     ddconfig = dict(
+         double_z=True,
+         z_channels=4,
+         resolution=256,
+         in_channels=3,
+         out_ch=3,
+         ch=128,
+         ch_mult=[1, 2, 4, 4],
+         num_res_blocks=2,
+         attn_resolutions=[],
+         dropout=0.0
+     )
+     return FrozenAutoencoderKL(ddconfig, 4, pretrained_path, scale_factor)
+
+
+ def main():
+     import torchvision.transforms as transforms
+     from torchvision.utils import save_image
+     import os
+     from PIL import Image
+
+     model = get_model('assets/stable-diffusion/autoencoder_kl.pth')
+     device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
+     model = model.to(device)
+
+     scale_factor = 0.18215
+     T = transforms.Compose([transforms.Resize(256), transforms.CenterCrop(256), transforms.ToTensor()])
+     path = 'imgs'
+     fnames = os.listdir(path)
+     for fname in fnames:
+         p = os.path.join(path, fname)
+         img = Image.open(p)
+         img = T(img)
+         img = img * 2. - 1
+         img = img[None, ...]
+         img = img.to(device)
+
+         # with torch.cuda.amp.autocast():
+         #     moments = model.encode_moments(img)
+         #     mean, logvar = torch.chunk(moments, 2, dim=1)
+         #     logvar = torch.clamp(logvar, -30.0, 20.0)
+         #     std = torch.exp(0.5 * logvar)
+         #     zs = [(mean + std * torch.randn_like(mean)) * scale_factor for _ in range(4)]
+         #     recons = [model.decode(z) for z in zs]
+
+         with torch.cuda.amp.autocast():
+             print('test encode & decode')
+             recons = [model.decode(model.encode(img)) for _ in range(4)]
+
+         out = torch.cat([img, *recons], dim=0)
+         out = (out + 1) * 0.5
+         save_image(out, f'recons_{fname}')
+
+
+ if __name__ == "__main__":
+     main()
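Note: a minimal round-trip sketch of the frozen KL-VAE added above, using only the `get_model` / `fn`-dispatch interface it defines. The checkpoint path is a placeholder and must point to wherever the Stable Diffusion `autoencoder_kl.pth` weights were downloaded.

    import torch
    from autoencoder import get_model

    vae = get_model('assets/stable-diffusion/autoencoder_kl.pth').cuda()  # placeholder path

    x = torch.rand(1, 3, 256, 256, device='cuda') * 2 - 1   # images are expected in [-1, 1]
    with torch.no_grad():
        z = vae(x, fn='encode')       # (1, 4, 32, 32) latent, already scaled by 0.18215
        x_rec = vae(z, fn='decode')   # reconstruction in [-1, 1]
    print(z.shape, x_rec.shape)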
checkpoints/.DS_Store ADDED
Binary file (6.15 kB). View file
 
configs/finetune/imagenet256-latent-const.yaml ADDED
@@ -0,0 +1,49 @@
+ data:
+   dataset: imagenet256-latent
+   category: lmdb
+   resolution: 32
+   num_channels: 4
+   random_flip: True
+   root: ../data/imagenet256
+   feat_path: None
+
+ model:
+   precond: edm
+   model_type: DiT-XL/2
+   in_size: 32
+   in_channels: 4
+   num_classes: 1000
+   use_decoder: True
+   ext_feature_dim: 0
+   pad_cls_token: False
+   mask_ratio: 0.0
+   mask_ratio_fn: constant
+   mask_ratio_min: 0
+   mae_loss_coef: 0.1
+   class_dropout_prob: 0.1
+
+ train:
+   tf32: True
+   amp: False
+   batchsize: 64  # batchsize per GPU
+   grad_accum: 1
+   epochs: 1000
+   lr: 0.00005
+   lr_rampup_kimg: 0
+   xflip: False
+   max_num_steps: 100_000
+
+ eval:  # FID evaluation
+   cfg_scales: [1.5]
+   batchsize: 50
+   ref_path: assets/fid_stats/fid_stats_imagenet256_guided_diffusion.npz
+
+ log:
+   log_every: 500
+   ckpt_every: 12_500
+   tag: finetune-const
+
+ wandb:
+   entity: MaskDiT
+   project: MaskDiT-ImageNet256-latent-finetune
+   group: finetune-const
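Note: the training and evaluation entry points load these YAML files with OmegaConf (see eval_latent.py further down) and read the nested keys directly; a minimal sketch of how such a config is consumed:

    from omegaconf import OmegaConf

    config = OmegaConf.load('configs/finetune/imagenet256-latent-const.yaml')
    print(config.model.model_type)   # DiT-XL/2
    print(config.model.mask_ratio)   # 0.0 -> no masking during this finetuning stage
    print(config.train.batchsize)    # per-GPU batch size (64)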
configs/finetune/imagenet256-latent-cos.yaml ADDED
@@ -0,0 +1,49 @@
+ data:
+   dataset: imagenet256-latent
+   category: lmdb
+   resolution: 32
+   num_channels: 4
+   random_flip: True
+   root: ../data/imagenet256
+   feat_path: None
+
+ model:
+   precond: edm
+   model_type: DiT-XL/2
+   in_size: 32
+   in_channels: 4
+   num_classes: 1000
+   use_decoder: True
+   ext_feature_dim: 0
+   pad_cls_token: False
+   mask_ratio: 0.5
+   mask_ratio_fn: cos4
+   mask_ratio_min: 0
+   mae_loss_coef: 0.1
+   class_dropout_prob: 0.1
+
+ train:
+   tf32: True
+   amp: False
+   batchsize: 64  # batchsize per GPU
+   grad_accum: 1
+   epochs: 1000
+   lr: 0.00005
+   lr_rampup_kimg: 0
+   xflip: False
+   max_num_steps: 100_000
+
+ eval:  # FID evaluation
+   cfg_scales: [1.5]
+   batchsize: 50
+   ref_path: assets/fid_stats/fid_stats_imagenet256_guided_diffusion.npz
+
+ log:
+   log_every: 500
+   ckpt_every: 12_500
+   tag: finetune-cos
+
+ wandb:
+   entity: MaskDiT
+   project: MaskDiT-ImageNet256-latent-finetune
+   group: finetune-cos
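Note: for a 256x256 image the KL-VAE latent is 4x32x32; with the patch size 2 implied by the "DiT-XL/2" model name (an assumption taken from the name, not from this file), the backbone sees 16x16 = 256 tokens, so a mask ratio of 0.5 removes 128 of them from the encoder. A small sanity-check sketch:

    in_size, patch_size = 32, 2                 # latent resolution / assumed DiT-XL/2 patch size
    num_tokens = (in_size // patch_size) ** 2
    mask_ratio = 0.5
    print(num_tokens, int(num_tokens * mask_ratio))  # 256 tokens, 128 masked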
configs/finetune/imagenet512-latent.yaml ADDED
@@ -0,0 +1,47 @@
+ data:
+   dataset: imagenet512-latent
+   category: lmdb
+   resolution: 64
+   num_channels: 4
+   root: ../data/imagenet512-wds
+   total_num: 1281167
+
+ model:
+   precond: edm
+   model_type: DiT-XL/2
+   in_size: 64
+   in_channels: 4
+   num_classes: 1000
+   use_decoder: True
+   ext_feature_dim: 0
+   pad_cls_token: False
+   mask_ratio: 0.0
+   mask_ratio_fn: constant
+   mask_ratio_min: 0
+   mae_loss_coef: 0.1
+   class_dropout_prob: 0.1
+
+ train:
+   tf32: True
+   amp: False
+   batchsize: 16  # batchsize per GPU
+   grad_accum: 1
+   epochs: 2000
+   lr: 0.00005
+   lr_rampup_kimg: 0
+   xflip: False
+   max_num_steps: 50_000
+
+ eval:  # FID evaluation
+   batchsize: 50
+   ref_path: assets/fid_stats/VIRTUAL_imagenet512.npz
+
+ log:
+   log_every: 100
+   ckpt_every: 10_000
+   tag: finetune-4n-wds
+
+ wandb:
+   entity: MaskDiT
+   project: MaskDiT-ImageNet512-latent-finetune
+   group: finetune-wds-4nodes
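Note: the batch size here is per GPU, so the effective global batch depends on the launch topology. With the 4-node setup suggested by the tag and 8 GPUs per node (both assumptions about the launch, not stated in this file), it would be:

    per_gpu, grad_accum = 16, 1
    nodes, gpus_per_node = 4, 8          # assumed topology
    print(per_gpu * grad_accum * nodes * gpus_per_node)  # 512 samples per optimizer step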
configs/test/maskdit-256.yaml ADDED
@@ -0,0 +1,45 @@
+ data:
+   dataset: imagenet256-latent
+   category: lmdb
+   resolution: 32
+   num_channels: 4
+   root: /imagenet_256_latent_lmdb
+   total_num: 1281167
+
+ model:
+   precond: edm
+   model_type: DiT-XL/2
+   in_size: 32
+   in_channels: 4
+   num_classes: 1000
+   use_decoder: True
+   ext_feature_dim: 0
+   pad_cls_token: False
+   mask_ratio: 0.5
+   cond_mask_ratio: 0
+   mae_loss_coef: 0.1
+   class_dropout_prob: 0.1
+
+ train:
+   tf32: False
+   amp: True
+   batchsize: 32  # batchsize per GPU
+   grad_accum: 1
+   epochs: 2800
+   lr: 0.0001
+   lr_rampup_kimg: 0
+   xflip: False
+
+ eval:  # FID evaluation
+   batchsize: 50
+   ref_path: assets/fid_stats/fid_stats_imagenet256_guided_diffusion.npz
+
+ log:
+   log_every: 500
+   ckpt_every: 50_000
+   tag: baseline
+
+ wandb:
+   entity: MaskDiT
+   project: MaskDiT-ImageNet256-latent
+   group: baseline
configs/test/maskdit-512.yaml ADDED
@@ -0,0 +1,46 @@
+ data:
+   dataset: imagenet512-latent
+   category: webdataset
+   resolution: 64
+   num_channels: 4
+   root: ../data/imagenet-wds
+   total_num: 1281167
+
+ model:
+   precond: edm
+   model_type: DiT-XL/2
+   in_size: 64
+   in_channels: 4
+   num_classes: 1000
+   use_decoder: True
+   ext_feature_dim: 0
+   pad_cls_token: False
+   mask_ratio: 0.5
+   cond_mask_ratio: 0
+   mae_loss_coef: 0.1
+   class_dropout_prob: 0.1
+
+ train:
+   tf32: False
+   amp: True
+   batchsize: 32  # batchsize per GPU
+   grad_accum: 1
+   epochs: 2800
+   lr: 0.0001
+   lr_rampup_kimg: 0
+   xflip: False
+   max_num_steps: 2000000
+
+ eval:  # FID evaluation
+   batchsize: 50
+   ref_path: assets/fid_stats/VIRTUAL_imagenet512.npz
+
+ log:
+   log_every: 500
+   ckpt_every: 50_000
+   tag: pretrain
+
+ wandb:
+   entity: MaskDiT
+   project: MaskDiT-ImageNet256-latent
+   group: pretrain
configs/train/imagenet256-latent.yaml ADDED
@@ -0,0 +1,48 @@
+ data:
+   dataset: imagenet256-latent
+   category: lmdb
+   resolution: 32
+   num_channels: 4
+   random_flip: True
+   root: ../data/imagenet256
+   feat_path: None
+
+ model:
+   precond: edm
+   model_type: DiT-XL/2
+   in_size: 32
+   in_channels: 4
+   num_classes: 1000
+   use_decoder: True
+   ext_feature_dim: 0
+   pad_cls_token: False
+   mask_ratio: 0.5
+   mask_ratio_fn: constant
+   mask_ratio_min: 0
+   mae_loss_coef: 0.1
+   class_dropout_prob: 0.1
+
+ train:
+   tf32: False
+   amp: True
+   batchsize: 128  # batchsize per GPU
+   grad_accum: 1
+   epochs: 2800
+   lr: 0.0001
+   lr_rampup_kimg: 0
+   xflip: False
+   max_num_steps: 2000000
+
+ eval:  # FID evaluation
+   batchsize: 50
+   ref_path: assets/fid_stats/fid_stats_imagenet256_guided_diffusion.npz
+
+ log:
+   log_every: 500
+   ckpt_every: 50_000
+   tag: pretrain
+
+ wandb:
+   entity: MaskDiT
+   project: MaskDiT-ImageNet256-latent-train
+   group: pretrain
configs/train/imagenet512-latent.yaml ADDED
@@ -0,0 +1,47 @@
+ data:
+   dataset: imagenet512-latent
+   category: webdataset
+   resolution: 64
+   num_channels: 4
+   root: ../data/imagenet512-wds
+   total_num: 1281167
+
+ model:
+   precond: edm
+   model_type: DiT-XL/2
+   in_size: 64
+   in_channels: 4
+   num_classes: 1000
+   use_decoder: True
+   ext_feature_dim: 0
+   pad_cls_token: False
+   mask_ratio: 0.5
+   mask_ratio_fn: constant
+   mask_ratio_min: 0
+   mae_loss_coef: 0.1
+   class_dropout_prob: 0.1
+
+ train:
+   tf32: False
+   amp: True
+   batchsize: 32  # batchsize per GPU
+   grad_accum: 1
+   epochs: 2000
+   lr: 0.0001
+   lr_rampup_kimg: 0
+   xflip: False
+   max_num_steps: 2000000
+
+ eval:  # FID evaluation
+   batchsize: 50
+   ref_path: assets/fid_stats/VIRTUAL_imagenet512.npz
+
+ log:
+   log_every: 100
+   ckpt_every: 25_000
+   tag: pretrain-4nodes
+
+ wandb:
+   entity: MaskDiT
+   project: MaskDiT-ImageNet512-latent-train
+   group: pretrain-4nodes
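Note: the in_size: 64 and in_channels: 4 used at 512x512 follow from the KL-VAE in autoencoder.py, whose ch_mult=[1, 2, 4, 4] gives three stride-2 stages (a total downsampling factor of 8) and 4 latent channels, so a 512x512 RGB image becomes a 4x64x64 latent:

    image_res, vae_downsample, z_channels = 512, 8, 4
    latent_res = image_res // vae_downsample
    print((z_channels, latent_res, latent_res))  # (4, 64, 64)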
eval_latent.py ADDED
@@ -0,0 +1,132 @@
+ # MIT License
+
+ # Copyright (c) [2023] [Anima-Lab]
+
+ from argparse import ArgumentParser
+ import os
+ from collections import OrderedDict
+ from omegaconf import OmegaConf
+
+ import torch
+
+ import accelerate
+
+ from fid import calc
+ from models.maskdit import Precond_models
+ from sample import generate_with_net
+ from utils import dist, mprint, get_ckpt_paths, Logger, parse_int_list, parse_float_none
+
+
+ # ------------------------------------------------------------
+ # Training Helper Function
+
+ @torch.no_grad()
+ def update_ema(ema_model, model, decay=0.9999):
+     """
+     Step the EMA model towards the current model.
+     """
+     ema_params = OrderedDict(ema_model.named_parameters())
+     model_params = OrderedDict(model.named_parameters())
+
+     for name, param in model_params.items():
+         # TODO: Consider applying only to params that require_grad to avoid small numerical changes of pos_embed
+         ema_params[name].mul_(decay).add_(param.data, alpha=1 - decay)
+
+
+ def requires_grad(model, flag=True):
+     """
+     Set requires_grad flag for all parameters in a model.
+     """
+     for p in model.parameters():
+         p.requires_grad = flag
+
+ # ------------------------------------------------------------
+
+
+ def eval_fn(model, args, device, rank, size):
+     generate_with_net(args, model, device, rank, size)
+     dist.barrier()
+     fid = calc(args.outdir, args.ref_path, args.num_expected, args.global_seed, args.fid_batch_size)
+     mprint(f'{args.num_expected} samples generated and saved in {args.outdir}')
+     mprint(f'guidance: {args.cfg_scale} FID: {fid}')
+     dist.barrier()
+     return fid
+
+
+ def eval_loop(args):
+     config = OmegaConf.load(args.config)
+     accelerator = accelerate.Accelerator()
+
+     device = accelerator.device
+     size = accelerator.num_processes
+     rank = accelerator.process_index
+     print(f'world_size: {size}, rank: {rank}')
+     experiment_dir = args.exp_dir
+
+     if accelerator.is_main_process:
+         logger = Logger(file_name=f'{experiment_dir}/log_eval.txt', file_mode="a+", should_flush=True)
+         # setup wandb
+
+     model = Precond_models[config.model.precond](
+         img_resolution=config.model.in_size,
+         img_channels=config.model.in_channels,
+         num_classes=config.model.num_classes,
+         model_type=config.model.model_type,
+         use_decoder=config.model.use_decoder,
+         mae_loss_coef=config.model.mae_loss_coef,
+         pad_cls_token=config.model.pad_cls_token,
+     ).to(device)
+     # Note that parameter initialization is done within the model constructor
+     model.eval()
+     mprint(f"{config.model.model_type} ((use_decoder: {config.model.use_decoder})) Model Parameters: {sum(p.numel() for p in model.parameters()):,}")
+     mprint(f'extras: {model.model.extras}, cls_token: {model.model.cls_token}')
+
+     # model = torch.compile(model)
+     # Load checkpoints
+     mprint('start evaluating...')
+
+     args.outdir = os.path.join(experiment_dir, 'fid', f'edm-steps{args.num_steps}_cfg{args.cfg_scale}')
+     os.makedirs(args.outdir, exist_ok=True)
+     ckpt = torch.load(args.ckpt, map_location=device)
+     model.load_state_dict(ckpt['ema'])
+     fid = eval_fn(model, args, device, rank, size)
+     mprint(f'FID: {fid}')
+
+     if accelerator.is_main_process:
+         logger.close()
+     accelerator.end_training()
+
+
+ if __name__ == '__main__':
+     parser = ArgumentParser('training parameters')
+     # basic config
+     parser.add_argument('--config', type=str, required=True, help='path to config file')
+
+     # training
+     parser.add_argument("--exp_dir", type=str, required=True, help='The exp directory to evaluate, it must contain a checkpoints folder')
+     parser.add_argument('--ckpt', type=str, required=True, help='path to the checkpoint')
+
+     # sampling
+     parser.add_argument('--seeds', type=parse_int_list, default='100000-149999', help='Random seeds (e.g. 1,2,5-10)')
+     parser.add_argument('--subdirs', action='store_true', help='Create subdirectory for every 1000 seeds')
+     parser.add_argument('--class_idx', type=int, default=None, help='Class label [default: random]')
+     parser.add_argument('--max_batch_size', type=int, default=50, help='Maximum batch size per GPU during sampling, must be a factor of 50k if torch.compile is used')
+     parser.add_argument("--cfg_scale", type=parse_float_none, default=None, help='None = no guidance, by default = 4.0')
+
+     parser.add_argument('--num_steps', type=int, default=40, help='Number of sampling steps')
+     parser.add_argument('--S_churn', type=int, default=0, help='Stochasticity strength')
+     parser.add_argument('--solver', type=str, default=None, choices=['euler', 'heun'], help='Ablate ODE solver')
+     parser.add_argument('--discretization', type=str, default=None, choices=['vp', 've', 'iddpm', 'edm'], help='Ablate ODE solver')
+     parser.add_argument('--schedule', type=str, default=None, choices=['vp', 've', 'linear'], help='Ablate noise schedule sigma(t)')
+     parser.add_argument('--scaling', type=str, default=None, choices=['vp', 'none'], help='Ablate signal scaling s(t)')
+     parser.add_argument('--pretrained_path', type=str, default='assets/stable_diffusion/autoencoder_kl.pth', help='Autoencoder ckpt')
+
+     parser.add_argument('--ref_path', type=str, default='assets/fid_stats/VIRTUAL_imagenet512.npz', help='Dataset reference statistics')
+     parser.add_argument('--num_expected', type=int, default=50000, help='Number of images to use')
+     parser.add_argument("--global_seed", type=int, default=0)
+     parser.add_argument('--fid_batch_size', type=int, default=128, help='Maximum batch size per GPU')
+
+     args = parser.parse_args()
+
+     torch.backends.cudnn.benchmark = True
+     eval_loop(args)
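Note: since eval_loop builds on accelerate.Accelerator, the script is normally started through accelerate. A hypothetical multi-GPU invocation using only the flags defined above (paths and experiment names are placeholders):

    accelerate launch --multi_gpu eval_latent.py \
        --config configs/test/maskdit-256.yaml \
        --exp_dir results/maskdit256 \
        --ckpt results/maskdit256/checkpoints/ckpt.pt \
        --cfg_scale 1.5 --num_steps 40 \
        --ref_path assets/fid_stats/fid_stats_imagenet256_guided_diffusion.npz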
evaluator.py ADDED
@@ -0,0 +1,695 @@
+ # MIT License
+
+ # Copyright (c) [2023] [Anima-Lab]
+
+ # This code is adapted from https://github.com/openai/guided-diffusion/blob/main/evaluations/evaluator.py.
+ # The original code is licensed under a MIT License, which can be found at licenses/LICENSE_ADM.txt.
+
+
+ import argparse
+ import io
+ import os
+ import random
+ import warnings
+ import zipfile
+ from abc import ABC, abstractmethod
+ from contextlib import contextmanager
+ from functools import partial
+ from multiprocessing import cpu_count
+ from multiprocessing.pool import ThreadPool
+ from typing import Iterable, Optional, Tuple
+
+ import numpy as np
+ import requests
+ import tensorflow.compat.v1 as tf
+ from scipy import linalg
+ from tqdm.auto import tqdm
+ from PIL import Image
+
+
+ INCEPTION_V3_URL = "https://openaipublic.blob.core.windows.net/diffusion/jul-2021/ref_batches/classify_image_graph_def.pb"
+ INCEPTION_V3_PATH = "classify_image_graph_def.pb"
+
+ FID_POOL_NAME = "pool_3:0"
+ FID_SPATIAL_NAME = "mixed_6/conv:0"
+
+
+ def get_all_files(path):
+     path_list = []
+     for root, dirs, files in os.walk(path):
+         for file in files:
+             path_list.append(os.path.join(root, file))
+     return path_list
+
+
+ def isimg(filename):
+     if filename.endswith(".png") or filename.endswith(".jpg"):
+         return True
+     else:
+         return False
+
+
+ def png2npz(img_dir):
+     img_list = []
+     file_list = get_all_files(img_dir)
+     for filename in file_list:
+         if isimg(filename):
+             filepath = filename
+             img = np.asarray(Image.open(filepath).convert('RGB'))
+             img_list.append(img)
+     imgs = np.stack(img_list, axis=0)
+     npz_dir = os.path.join('tmp', 'fid')
+     os.makedirs(npz_dir, exist_ok=True)
+     npz_path = os.path.join(npz_dir, 'imgs.npz')
+     np.savez(npz_path, imgs)
+     return npz_path
+
+
+ def main():
+     parser = argparse.ArgumentParser()
+     parser.add_argument("ref_batch", help="path to reference batch npz file")
+     parser.add_argument("sample_batch", help="path to sample batch npz file")
+     args = parser.parse_args()
+
+     config = tf.ConfigProto(
+         allow_soft_placement=True  # allows DecodeJpeg to run on CPU in Inception graph
+     )
+     config.gpu_options.allow_growth = True
+     evaluator = Evaluator(tf.Session(config=config))
+
+     print("warming up TensorFlow...")
+     # This will cause TF to print a bunch of verbose stuff now rather
+     # than after the next print(), to help prevent confusion.
+     evaluator.warmup()
+
+     print("computing reference batch activations...")
+     ref_acts = evaluator.read_activations(args.ref_batch)
+     print("computing/reading reference batch statistics...")
+     ref_stats, ref_stats_spatial = evaluator.read_statistics(args.ref_batch, ref_acts)
+
+     if os.path.isdir(args.sample_batch):
+         sample_batch = png2npz(args.sample_batch)
+     else:
+         sample_batch = args.sample_batch
+
+     print("computing sample batch activations...")
+     sample_acts = evaluator.read_activations(sample_batch)
+     print("computing/reading sample batch statistics...")
+     sample_stats, sample_stats_spatial = evaluator.read_statistics(sample_batch, sample_acts)
+
+     print("Computing evaluations...")
+     print("Inception Score:", evaluator.compute_inception_score(sample_acts[0]))
+     print("FID:", sample_stats.frechet_distance(ref_stats))
+     print("sFID:", sample_stats_spatial.frechet_distance(ref_stats_spatial))
+     prec, recall = evaluator.compute_prec_recall(ref_acts[0], sample_acts[0])
+     print("Precision:", prec)
+     print("Recall:", recall)
+
+
+ class InvalidFIDException(Exception):
+     pass
+
+
+ class FIDStatistics:
+     def __init__(self, mu: np.ndarray, sigma: np.ndarray):
+         self.mu = mu
+         self.sigma = sigma
+
+     def frechet_distance(self, other, eps=1e-6):
+         """
+         Compute the Frechet distance between two sets of statistics.
+         """
+         # https://github.com/bioinf-jku/TTUR/blob/73ab375cdf952a12686d9aa7978567771084da42/fid.py#L132
+         mu1, sigma1 = self.mu, self.sigma
+         mu2, sigma2 = other.mu, other.sigma
+
+         mu1 = np.atleast_1d(mu1)
+         mu2 = np.atleast_1d(mu2)
+
+         sigma1 = np.atleast_2d(sigma1)
+         sigma2 = np.atleast_2d(sigma2)
+
+         assert (
+             mu1.shape == mu2.shape
+         ), f"Training and test mean vectors have different lengths: {mu1.shape}, {mu2.shape}"
+         assert (
+             sigma1.shape == sigma2.shape
+         ), f"Training and test covariances have different dimensions: {sigma1.shape}, {sigma2.shape}"
+
+         diff = mu1 - mu2
+
+         # product might be almost singular
+         covmean, _ = linalg.sqrtm(sigma1.dot(sigma2), disp=False)
+         if not np.isfinite(covmean).all():
+             msg = (
+                 "fid calculation produces singular product; adding %s to diagonal of cov estimates"
+                 % eps
+             )
+             warnings.warn(msg)
+             offset = np.eye(sigma1.shape[0]) * eps
+             covmean = linalg.sqrtm((sigma1 + offset).dot(sigma2 + offset))
+
+         # numerical error might give slight imaginary component
+         if np.iscomplexobj(covmean):
+             if not np.allclose(np.diagonal(covmean).imag, 0, atol=1e-3):
+                 m = np.max(np.abs(covmean.imag))
+                 raise ValueError("Imaginary component {}".format(m))
+             covmean = covmean.real
+
+         tr_covmean = np.trace(covmean)
+
+         return diff.dot(diff) + np.trace(sigma1) + np.trace(sigma2) - 2 * tr_covmean
+
+
+ class Evaluator:
+     def __init__(
+         self,
+         session,
+         batch_size=64,
+         softmax_batch_size=512,
+     ):
+         self.sess = session
+         self.batch_size = batch_size
+         self.softmax_batch_size = softmax_batch_size
+         self.manifold_estimator = ManifoldEstimator(session)
+         with self.sess.graph.as_default():
+             self.image_input = tf.placeholder(tf.float32, shape=[None, None, None, 3])
+             self.softmax_input = tf.placeholder(tf.float32, shape=[None, 2048])
+             self.pool_features, self.spatial_features = _create_feature_graph(self.image_input)
+             self.softmax = _create_softmax_graph(self.softmax_input)
+
+     def warmup(self):
+         self.compute_activations(np.zeros([1, 8, 64, 64, 3]))
+
+     def read_activations(self, npz_path: str) -> Tuple[np.ndarray, np.ndarray]:
+         with open_npz_array(npz_path, "arr_0") as reader:
+             return self.compute_activations(reader.read_batches(self.batch_size))
+
+     def compute_activations(self, batches: Iterable[np.ndarray]) -> Tuple[np.ndarray, np.ndarray]:
+         """
+         Compute image features for downstream evals.
+         :param batches: a iterator over NHWC numpy arrays in [0, 255].
+         :return: a tuple of numpy arrays of shape [N x X], where X is a feature
+                  dimension. The tuple is (pool_3, spatial).
+         """
+         preds = []
+         spatial_preds = []
+         for batch in tqdm(batches):
+             batch = batch.astype(np.float32)
+             pred, spatial_pred = self.sess.run(
+                 [self.pool_features, self.spatial_features], {self.image_input: batch}
+             )
+             preds.append(pred.reshape([pred.shape[0], -1]))
+             spatial_preds.append(spatial_pred.reshape([spatial_pred.shape[0], -1]))
+         return (
+             np.concatenate(preds, axis=0),
+             np.concatenate(spatial_preds, axis=0),
+         )
+
+     def read_statistics(
+         self, npz_path: str, activations: Tuple[np.ndarray, np.ndarray]
+     ) -> Tuple[FIDStatistics, FIDStatistics]:
+         obj = np.load(npz_path)
+         if "mu" in list(obj.keys()):
+             return FIDStatistics(obj["mu"], obj["sigma"]), FIDStatistics(
+                 obj["mu_s"], obj["sigma_s"]
+             )
+         return tuple(self.compute_statistics(x) for x in activations)
+
+     def compute_statistics(self, activations: np.ndarray) -> FIDStatistics:
+         mu = np.mean(activations, axis=0)
+         sigma = np.cov(activations, rowvar=False)
+         return FIDStatistics(mu, sigma)
+
+     def compute_inception_score(self, activations: np.ndarray, split_size: int = 5000) -> float:
+         softmax_out = []
+         for i in range(0, len(activations), self.softmax_batch_size):
+             acts = activations[i : i + self.softmax_batch_size]
+             softmax_out.append(self.sess.run(self.softmax, feed_dict={self.softmax_input: acts}))
+         preds = np.concatenate(softmax_out, axis=0)
+         # https://github.com/openai/improved-gan/blob/4f5d1ec5c16a7eceb206f42bfc652693601e1d5c/inception_score/model.py#L46
+         scores = []
+         for i in range(0, len(preds), split_size):
+             part = preds[i : i + split_size]
+             kl = part * (np.log(part) - np.log(np.expand_dims(np.mean(part, 0), 0)))
+             kl = np.mean(np.sum(kl, 1))
+             scores.append(np.exp(kl))
+         return float(np.mean(scores))
+
+     def compute_prec_recall(
+         self, activations_ref: np.ndarray, activations_sample: np.ndarray
+     ) -> Tuple[float, float]:
+         radii_1 = self.manifold_estimator.manifold_radii(activations_ref)
+         radii_2 = self.manifold_estimator.manifold_radii(activations_sample)
+         pr = self.manifold_estimator.evaluate_pr(
+             activations_ref, radii_1, activations_sample, radii_2
+         )
+         return (float(pr[0][0]), float(pr[1][0]))
+
+
+ class ManifoldEstimator:
+     """
+     A helper for comparing manifolds of feature vectors.
+     Adapted from https://github.com/kynkaat/improved-precision-and-recall-metric/blob/f60f25e5ad933a79135c783fcda53de30f42c9b9/precision_recall.py#L57
+     """
+
+     def __init__(
+         self,
+         session,
+         row_batch_size=10000,
+         col_batch_size=10000,
+         nhood_sizes=(3,),
+         clamp_to_percentile=None,
+         eps=1e-5,
+     ):
+         """
+         Estimate the manifold of given feature vectors.
+         :param session: the TensorFlow session.
+         :param row_batch_size: row batch size to compute pairwise distances
+                                (parameter to trade-off between memory usage and performance).
+         :param col_batch_size: column batch size to compute pairwise distances.
+         :param nhood_sizes: number of neighbors used to estimate the manifold.
+         :param clamp_to_percentile: prune hyperspheres that have radius larger than
+                                     the given percentile.
+         :param eps: small number for numerical stability.
+         """
+         self.distance_block = DistanceBlock(session)
+         self.row_batch_size = row_batch_size
+         self.col_batch_size = col_batch_size
+         self.nhood_sizes = nhood_sizes
+         self.num_nhoods = len(nhood_sizes)
+         self.clamp_to_percentile = clamp_to_percentile
+         self.eps = eps
+
+     def warmup(self):
+         feats, radii = (
+             np.zeros([1, 2048], dtype=np.float32),
+             np.zeros([1, 1], dtype=np.float32),
+         )
+         self.evaluate_pr(feats, radii, feats, radii)
+
+     def manifold_radii(self, features: np.ndarray) -> np.ndarray:
+         num_images = len(features)
+
+         # Estimate manifold of features by calculating distances to k-NN of each sample.
+         radii = np.zeros([num_images, self.num_nhoods], dtype=np.float32)
+         distance_batch = np.zeros([self.row_batch_size, num_images], dtype=np.float32)
+         seq = np.arange(max(self.nhood_sizes) + 1, dtype=np.int32)
+
+         for begin1 in range(0, num_images, self.row_batch_size):
+             end1 = min(begin1 + self.row_batch_size, num_images)
+             row_batch = features[begin1:end1]
+
+             for begin2 in range(0, num_images, self.col_batch_size):
+                 end2 = min(begin2 + self.col_batch_size, num_images)
+                 col_batch = features[begin2:end2]
+
+                 # Compute distances between batches.
+                 distance_batch[
+                     0 : end1 - begin1, begin2:end2
+                 ] = self.distance_block.pairwise_distances(row_batch, col_batch)
+
+             # Find the k-nearest neighbor from the current batch.
+             radii[begin1:end1, :] = np.concatenate(
+                 [
+                     x[:, self.nhood_sizes]
+                     for x in _numpy_partition(distance_batch[0 : end1 - begin1, :], seq, axis=1)
+                 ],
+                 axis=0,
+             )
+
+         if self.clamp_to_percentile is not None:
+             max_distances = np.percentile(radii, self.clamp_to_percentile, axis=0)
+             radii[radii > max_distances] = 0
+         return radii
+
+     def evaluate(self, features: np.ndarray, radii: np.ndarray, eval_features: np.ndarray):
+         """
+         Evaluate if new feature vectors are at the manifold.
+         """
+         num_eval_images = eval_features.shape[0]
+         num_ref_images = radii.shape[0]
+         distance_batch = np.zeros([self.row_batch_size, num_ref_images], dtype=np.float32)
+         batch_predictions = np.zeros([num_eval_images, self.num_nhoods], dtype=np.int32)
+         max_realism_score = np.zeros([num_eval_images], dtype=np.float32)
+         nearest_indices = np.zeros([num_eval_images], dtype=np.int32)
+
+         for begin1 in range(0, num_eval_images, self.row_batch_size):
+             end1 = min(begin1 + self.row_batch_size, num_eval_images)
+             feature_batch = eval_features[begin1:end1]
+
+             for begin2 in range(0, num_ref_images, self.col_batch_size):
+                 end2 = min(begin2 + self.col_batch_size, num_ref_images)
+                 ref_batch = features[begin2:end2]
+
+                 distance_batch[
+                     0 : end1 - begin1, begin2:end2
+                 ] = self.distance_block.pairwise_distances(feature_batch, ref_batch)
+
+             # From the minibatch of new feature vectors, determine if they are in the estimated manifold.
+             # If a feature vector is inside a hypersphere of some reference sample, then
+             # the new sample lies at the estimated manifold.
+             # The radii of the hyperspheres are determined from distances of neighborhood size k.
+             samples_in_manifold = distance_batch[0 : end1 - begin1, :, None] <= radii
+             batch_predictions[begin1:end1] = np.any(samples_in_manifold, axis=1).astype(np.int32)
+
+             max_realism_score[begin1:end1] = np.max(
+                 radii[:, 0] / (distance_batch[0 : end1 - begin1, :] + self.eps), axis=1
+             )
+             nearest_indices[begin1:end1] = np.argmin(distance_batch[0 : end1 - begin1, :], axis=1)
+
+         return {
+             "fraction": float(np.mean(batch_predictions)),
+             "batch_predictions": batch_predictions,
+             "max_realisim_score": max_realism_score,
+             "nearest_indices": nearest_indices,
+         }
+
+     def evaluate_pr(
+         self,
+         features_1: np.ndarray,
+         radii_1: np.ndarray,
+         features_2: np.ndarray,
+         radii_2: np.ndarray,
+     ) -> Tuple[np.ndarray, np.ndarray]:
+         """
+         Evaluate precision and recall efficiently.
+         :param features_1: [N1 x D] feature vectors for reference batch.
+         :param radii_1: [N1 x K1] radii for reference vectors.
+         :param features_2: [N2 x D] feature vectors for the other batch.
+         :param radii_2: [N x K2] radii for other vectors.
+         :return: a tuple of arrays for (precision, recall):
+                  - precision: an np.ndarray of length K1
+                  - recall: an np.ndarray of length K2
+         """
+         # np.bool was removed in NumPy >= 1.24; the builtin bool is equivalent here.
+         features_1_status = np.zeros([len(features_1), radii_2.shape[1]], dtype=bool)
+         features_2_status = np.zeros([len(features_2), radii_1.shape[1]], dtype=bool)
+         for begin_1 in range(0, len(features_1), self.row_batch_size):
+             end_1 = begin_1 + self.row_batch_size
+             batch_1 = features_1[begin_1:end_1]
+             for begin_2 in range(0, len(features_2), self.col_batch_size):
+                 end_2 = begin_2 + self.col_batch_size
+                 batch_2 = features_2[begin_2:end_2]
+                 batch_1_in, batch_2_in = self.distance_block.less_thans(
+                     batch_1, radii_1[begin_1:end_1], batch_2, radii_2[begin_2:end_2]
+                 )
+                 features_1_status[begin_1:end_1] |= batch_1_in
+                 features_2_status[begin_2:end_2] |= batch_2_in
+         return (
+             np.mean(features_2_status.astype(np.float64), axis=0),
+             np.mean(features_1_status.astype(np.float64), axis=0),
+         )
+
+
+ class DistanceBlock:
+     """
+     Calculate pairwise distances between vectors.
+     Adapted from https://github.com/kynkaat/improved-precision-and-recall-metric/blob/f60f25e5ad933a79135c783fcda53de30f42c9b9/precision_recall.py#L34
+     """
+
+     def __init__(self, session):
+         self.session = session
+
+         # Initialize TF graph to calculate pairwise distances.
+         with session.graph.as_default():
+             self._features_batch1 = tf.placeholder(tf.float32, shape=[None, None])
+             self._features_batch2 = tf.placeholder(tf.float32, shape=[None, None])
+             distance_block_16 = _batch_pairwise_distances(
+                 tf.cast(self._features_batch1, tf.float16),
+                 tf.cast(self._features_batch2, tf.float16),
+             )
+             self.distance_block = tf.cond(
+                 tf.reduce_all(tf.math.is_finite(distance_block_16)),
+                 lambda: tf.cast(distance_block_16, tf.float32),
+                 lambda: _batch_pairwise_distances(self._features_batch1, self._features_batch2),
+             )
+
+             # Extra logic for less thans.
+             self._radii1 = tf.placeholder(tf.float32, shape=[None, None])
+             self._radii2 = tf.placeholder(tf.float32, shape=[None, None])
+             dist32 = tf.cast(self.distance_block, tf.float32)[..., None]
+             self._batch_1_in = tf.math.reduce_any(dist32 <= self._radii2, axis=1)
+             self._batch_2_in = tf.math.reduce_any(dist32 <= self._radii1[:, None], axis=0)
+
+     def pairwise_distances(self, U, V):
+         """
+         Evaluate pairwise distances between two batches of feature vectors.
+         """
+         return self.session.run(
+             self.distance_block,
+             feed_dict={self._features_batch1: U, self._features_batch2: V},
+         )
+
+     def less_thans(self, batch_1, radii_1, batch_2, radii_2):
+         return self.session.run(
+             [self._batch_1_in, self._batch_2_in],
+             feed_dict={
+                 self._features_batch1: batch_1,
+                 self._features_batch2: batch_2,
+                 self._radii1: radii_1,
+                 self._radii2: radii_2,
+             },
+         )
+
+
+ def _batch_pairwise_distances(U, V):
+     """
+     Compute pairwise distances between two batches of feature vectors.
+     """
+     with tf.variable_scope("pairwise_dist_block"):
+         # Squared norms of each row in U and V.
+         norm_u = tf.reduce_sum(tf.square(U), 1)
+         norm_v = tf.reduce_sum(tf.square(V), 1)
+
+         # norm_u as a column and norm_v as a row vectors.
+         norm_u = tf.reshape(norm_u, [-1, 1])
+         norm_v = tf.reshape(norm_v, [1, -1])
+
+         # Pairwise squared Euclidean distances.
+         D = tf.maximum(norm_u - 2 * tf.matmul(U, V, False, True) + norm_v, 0.0)
+
+     return D
+
+
+ class NpzArrayReader(ABC):
+     @abstractmethod
+     def read_batch(self, batch_size: int) -> Optional[np.ndarray]:
+         pass
+
+     @abstractmethod
+     def remaining(self) -> int:
+         pass
+
+     def read_batches(self, batch_size: int) -> Iterable[np.ndarray]:
+         def gen_fn():
+             while True:
+                 batch = self.read_batch(batch_size)
+                 if batch is None:
+                     break
+                 yield batch
+
+         rem = self.remaining()
+         num_batches = rem // batch_size + int(rem % batch_size != 0)
+         return BatchIterator(gen_fn, num_batches)
+
+
+ class BatchIterator:
+     def __init__(self, gen_fn, length):
+         self.gen_fn = gen_fn
+         self.length = length
+
+     def __len__(self):
+         return self.length
+
+     def __iter__(self):
+         return self.gen_fn()
+
+
+ class StreamingNpzArrayReader(NpzArrayReader):
+     def __init__(self, arr_f, shape, dtype):
+         self.arr_f = arr_f
+         self.shape = shape
+         self.dtype = dtype
+         self.idx = 0
+
+     def read_batch(self, batch_size: int) -> Optional[np.ndarray]:
+         if self.idx >= self.shape[0]:
+             return None
+
+         bs = min(batch_size, self.shape[0] - self.idx)
+         self.idx += bs
+
+         if self.dtype.itemsize == 0:
+             return np.ndarray([bs, *self.shape[1:]], dtype=self.dtype)
+
+         read_count = bs * np.prod(self.shape[1:])
+         read_size = int(read_count * self.dtype.itemsize)
+         data = _read_bytes(self.arr_f, read_size, "array data")
+         return np.frombuffer(data, dtype=self.dtype).reshape([bs, *self.shape[1:]])
+
+     def remaining(self) -> int:
+         return max(0, self.shape[0] - self.idx)
+
+
+ class MemoryNpzArrayReader(NpzArrayReader):
+     def __init__(self, arr):
+         self.arr = arr
+         self.idx = 0
+
+     @classmethod
+     def load(cls, path: str, arr_name: str):
+         with open(path, "rb") as f:
+             arr = np.load(f)[arr_name]
+         return cls(arr)
+
+     def read_batch(self, batch_size: int) -> Optional[np.ndarray]:
+         if self.idx >= self.arr.shape[0]:
+             return None
+
+         res = self.arr[self.idx : self.idx + batch_size]
+         self.idx += batch_size
+         return res
+
+     def remaining(self) -> int:
+         return max(0, self.arr.shape[0] - self.idx)
+
+
+ @contextmanager
+ def open_npz_array(path: str, arr_name: str) -> NpzArrayReader:
+     with _open_npy_file(path, arr_name) as arr_f:
+         version = np.lib.format.read_magic(arr_f)
+         if version == (1, 0):
+             header = np.lib.format.read_array_header_1_0(arr_f)
+         elif version == (2, 0):
+             header = np.lib.format.read_array_header_2_0(arr_f)
+         else:
+             yield MemoryNpzArrayReader.load(path, arr_name)
+             return
+         shape, fortran, dtype = header
+         if fortran or dtype.hasobject:
+             yield MemoryNpzArrayReader.load(path, arr_name)
+         else:
+             yield StreamingNpzArrayReader(arr_f, shape, dtype)
+
+
+ def _read_bytes(fp, size, error_template="ran out of data"):
+     """
+     Copied from: https://github.com/numpy/numpy/blob/fb215c76967739268de71aa4bda55dd1b062bc2e/numpy/lib/format.py#L788-L886
+     Read from file-like object until size bytes are read.
+     Raises ValueError if EOF is encountered before size bytes are read.
+     Non-blocking objects only supported if they derive from io objects.
+     Required as e.g. ZipExtFile in python 2.6 can return less data than
+     requested.
+     """
+     data = bytes()
+     while True:
+         # io files (default in python3) return None or raise on
+         # would-block, python2 file will truncate, probably nothing can be
+         # done about that. note that regular files can't be non-blocking
+         try:
+             r = fp.read(size - len(data))
+             data += r
+             if len(r) == 0 or len(data) == size:
+                 break
+         except io.BlockingIOError:
+             pass
+     if len(data) != size:
+         msg = "EOF: reading %s, expected %d bytes got %d"
+         raise ValueError(msg % (error_template, size, len(data)))
+     else:
+         return data
+
+
+ @contextmanager
+ def _open_npy_file(path: str, arr_name: str):
+     with open(path, "rb") as f:
+         with zipfile.ZipFile(f, "r") as zip_f:
+             if f"{arr_name}.npy" not in zip_f.namelist():
+                 raise ValueError(f"missing {arr_name} in npz file")
+             with zip_f.open(f"{arr_name}.npy", "r") as arr_f:
+                 yield arr_f
+
+
+ def _download_inception_model():
+     if os.path.exists(INCEPTION_V3_PATH):
+         return
+     print("downloading InceptionV3 model...")
+     with requests.get(INCEPTION_V3_URL, stream=True) as r:
+         r.raise_for_status()
+         tmp_path = INCEPTION_V3_PATH + ".tmp"
+         with open(tmp_path, "wb") as f:
+             for chunk in tqdm(r.iter_content(chunk_size=8192)):
+                 f.write(chunk)
+         os.rename(tmp_path, INCEPTION_V3_PATH)
+
+
+ def _create_feature_graph(input_batch):
+     _download_inception_model()
+     prefix = f"{random.randrange(2**32)}_{random.randrange(2**32)}"
+     with open(INCEPTION_V3_PATH, "rb") as f:
+         graph_def = tf.GraphDef()
+         graph_def.ParseFromString(f.read())
+     pool3, spatial = tf.import_graph_def(
+         graph_def,
+         input_map={f"ExpandDims:0": input_batch},
+         return_elements=[FID_POOL_NAME, FID_SPATIAL_NAME],
+         name=prefix,
+     )
+     _update_shapes(pool3)
+     spatial = spatial[..., :7]
+     return pool3, spatial
+
+
+ def _create_softmax_graph(input_batch):
+     _download_inception_model()
+     prefix = f"{random.randrange(2**32)}_{random.randrange(2**32)}"
+     with open(INCEPTION_V3_PATH, "rb") as f:
+         graph_def = tf.GraphDef()
+         graph_def.ParseFromString(f.read())
+     (matmul,) = tf.import_graph_def(
+         graph_def, return_elements=[f"softmax/logits/MatMul"], name=prefix
+     )
+     w = matmul.inputs[1]
+     logits = tf.matmul(input_batch, w)
+     return tf.nn.softmax(logits)
+
+
+ def _update_shapes(pool3):
+     # https://github.com/bioinf-jku/TTUR/blob/73ab375cdf952a12686d9aa7978567771084da42/fid.py#L50-L63
+     ops = pool3.graph.get_operations()
+     for op in ops:
+         for o in op.outputs:
+             shape = o.get_shape()
+             if shape._dims is not None:  # pylint: disable=protected-access
+                 # shape = [s.value for s in shape]  # TF 1.x
+                 shape = [s for s in shape]  # TF 2.x
+                 new_shape = []
+                 for j, s in enumerate(shape):
+                     if s == 1 and j == 0:
+                         new_shape.append(None)
+                     else:
+                         new_shape.append(s)
+                 o.__dict__["_shape_val"] = tf.TensorShape(new_shape)
+     return pool3
+
+
+ def _numpy_partition(arr, kth, **kwargs):
+     num_workers = min(cpu_count(), len(arr))
+     chunk_size = len(arr) // num_workers
+     extra = len(arr) % num_workers
+
+     start_idx = 0
+     batches = []
+     for i in range(num_workers):
+         size = chunk_size + (1 if i < extra else 0)
+         batches.append(arr[start_idx : start_idx + size])
+         start_idx += size
+
+     with ThreadPool(num_workers) as pool:
+         return list(pool.map(partial(np.partition, kth=kth, **kwargs), batches))
+
+
+ if __name__ == "__main__":
+     main()
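Note: this script reports Inception Score, FID, sFID, Precision, and Recall against an ADM-style reference batch. It takes two positional arguments; the sample batch may be either an .npz file or a directory of PNG/JPG samples, which png2npz packs into an .npz first. A hypothetical invocation (the reference-batch filename and sample directory are placeholders):

    python evaluator.py assets/fid_stats/VIRTUAL_imagenet256_labeled.npz results/maskdit256/fid/edm-steps40_cfg1.5/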
extract_latent.py ADDED
@@ -0,0 +1,114 @@
+ import argparse
+ import os
+ import time
+
+ import lmdb
+ import torch
+ from torch import nn
+
+ import torchvision.transforms as transforms
+ from torch.utils.data import DataLoader
+
+ from autoencoder import get_model
+ from train_utils.datasets import imagenet_lmdb_dataset
+
+
+ def main():
+     parser = argparse.ArgumentParser()
+     parser.add_argument('--data_name', default='imagenet', type=str)
+     parser.add_argument('--data_dir', default='../datasets', type=str)
+     parser.add_argument('--ckpt', default='assets/vae/autoencoder_kl.pth', type=str, help='checkpoint path')
+     parser.add_argument('--resolution', default=512, type=int)
+     parser.add_argument('--batch_size', default=128, type=int)
+     parser.add_argument('--split', default='train', type=str)
+     parser.add_argument('--xflip', action='store_true')
+     parser.add_argument('--outdir', type=str, default='../data/imagenet512-latent', help='output directory')
+     args = parser.parse_args()
+
+     assert args.split in ['train', 'val']
+
+     transform = transforms.Compose([
+         transforms.ToTensor(),
+         transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])
+     ])
+
+     dataset = imagenet_lmdb_dataset(root=f'{args.data_dir}/{args.split}',
+                                     transform=transform, resolution=args.resolution)
+
+     print(f'data size: {len(dataset)}')
+
+     model = get_model(args.ckpt)
+     print(f'load vae weights from autoencoder_kl.pth')
+     model = nn.DataParallel(model)
+     device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
+     model.to(device)
+
+     def extract_feature():
+         outdir = f'{args.data_name}_{args.resolution}_latent_lmdb'
+         target_db_dir = os.path.join(args.outdir, outdir, args.split)
+         os.makedirs(target_db_dir, exist_ok=True)
+         target_env = lmdb.open(target_db_dir, map_size=pow(2, 40), readahead=False)
+
+         dataset_loader = DataLoader(dataset, batch_size=args.batch_size, shuffle=False, drop_last=False,
+                                     num_workers=8, pin_memory=True, persistent_workers=True)
+
+         idx = 0
+         begin = time.time()
+         print('start...')
+         for batch in dataset_loader:
+             img, label = batch
+             assert img.min() >= -1 and img.max() <= 1
+
+             img = img.to(device)
+             moments = model(img, fn='encode_moments')
+             assert moments.shape[-1] == (args.resolution // 8)
+
+             moments = moments.detach().cpu().numpy()
+             label = label.detach().cpu().numpy()
+
+             with target_env.begin(write=True) as target_txn:
+                 for moment, lb in zip(moments, label):
+                     target_txn.put(f'z-{str(idx)}'.encode('utf-8'), moment)
+                     target_txn.put(f'y-{str(idx)}'.encode('utf-8'), str(lb).encode('utf-8'))
+                     idx += 1
+
+             if idx % 5120 == 0:
+                 cur_time = time.time()
+                 print(f'saved {idx} files with {cur_time - begin}s elapsed')
+                 begin = time.time()
+
+         # idx = 1_281_167
+         if args.xflip:
+             print('starting to store the xflip latents')
+             begin = time.time()
+             for batch in dataset_loader:
+                 img, label = batch
+                 assert img.min() >= -1 and img.max() <= 1
+
+                 img = img.to(device)
+                 moments = model(img.flip(dims=[-1]), fn='encode_moments')
+
+                 moments = moments.detach().cpu().numpy()
+                 label = label.detach().cpu().numpy()
+
+                 with target_env.begin(write=True) as target_txn:
+                     for moment, lb in zip(moments, label):
+                         target_txn.put(f'z-{str(idx)}'.encode('utf-8'), moment)
+                         target_txn.put(f'y-{str(idx)}'.encode('utf-8'), str(lb).encode('utf-8'))
+                         idx += 1
+
+                 if idx % 10000 == 0:
+                     cur_time = time.time()
+                     print(f'saved {idx} files with {cur_time - begin}s elapsed')
+                     begin = time.time()
+
+         with target_env.begin(write=True) as target_txn:
+             target_txn.put('length'.encode('utf-8'), str(idx).encode('utf-8'))
+
+         print(f'[finished] saved {idx} files')
+
+     extract_feature()
+
+
+ if __name__ == "__main__":
+     main()
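Note: the script writes one LMDB record per image, keyed as z-{idx} (the 2*4-channel encoder moments) and y-{idx} (the class label as a string), plus a final length entry. A hypothetical extraction run and a read-back sketch follow; the data paths are placeholders, and the float32 dtype of the stored moments is an assumption based on the VAE running in full precision here.

    python extract_latent.py --data_dir ../data/imagenet --resolution 256 --split train --outdir ../data/imagenet256-latent

    import lmdb
    import numpy as np

    env = lmdb.open('../data/imagenet256-latent/imagenet_256_latent_lmdb/train', readonly=True, lock=False)
    with env.begin() as txn:
        buf = txn.get(b'z-0')
        label = int(txn.get(b'y-0').decode('utf-8'))
        # 8 channels = mean and logvar of the 4-channel latent; 32 = 256 // 8
        moments = np.frombuffer(buf, dtype=np.float32).reshape(8, 32, 32)
    print(moments.shape, label)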
fid.py ADDED
@@ -0,0 +1,177 @@
+ # MIT License
+
+ # Copyright (c) [2023] [Anima-Lab]
+
+ # This code is adapted from https://github.com/NVlabs/edm/blob/main/fid.py.
+ # The original code is licensed under a Creative Commons
+ # Attribution-NonCommercial-ShareAlike 4.0 International License, which can be found at licenses/LICENSE_EDM.txt.
+
+ """Script for calculating Frechet Inception Distance (FID)."""
+ import argparse
+ from multiprocessing import Process
+
+ import click
+ import tqdm
+ import pickle
+ import numpy as np
+ import scipy.linalg
+ import torch
+ import torch.distributed as dist
+ from torch.utils.data import DataLoader
+
+ from utils import *
+ from train_utils.datasets import ImageFolderDataset
+
+
+ #----------------------------------------------------------------------------
+
+ def calculate_inception_stats(
+     image_path, num_expected=None, seed=0, max_batch_size=64,
+     num_workers=3, prefetch_factor=2, device=torch.device('cuda'),
+ ):
+     # Rank 0 goes first.
+     if dist.get_rank() != 0:
+         dist.barrier()
+
+     # Load Inception-v3 model.
+     # This is a direct PyTorch translation of http://download.tensorflow.org/models/image/imagenet/inception-2015-12-05.tgz
+     detector_url = 'https://api.ngc.nvidia.com/v2/models/nvidia/research/stylegan3/versions/1/files/metrics/inception-2015-12-05.pkl'
+     mprint('Loading Inception-v3 model...')
+     detector_kwargs = dict(return_features=True)
+     feature_dim = 2048
+     with open(detector_url, 'rb') as f:
+         detector_net = pickle.load(f).to(device)
+
+     # List images.
+     mprint(f'Loading images from "{image_path}"...')
+     dataset_obj = ImageFolderDataset(path=image_path, max_size=num_expected, random_seed=seed)
+     if num_expected is not None and len(dataset_obj) < num_expected:
+         raise click.ClickException(f'Found {len(dataset_obj)} images, but expected at least {num_expected}')
+     if len(dataset_obj) < 2:
+         raise click.ClickException(f'Found {len(dataset_obj)} images, but need at least 2 to compute statistics')
+
+     # Other ranks follow.
+     if dist.get_rank() == 0:
+         dist.barrier()
+
+     # Divide images into batches.
+     num_batches = ((len(dataset_obj) - 1) // (max_batch_size * dist.get_world_size()) + 1) * dist.get_world_size()
+     all_batches = torch.arange(len(dataset_obj)).tensor_split(num_batches)
+     rank_batches = all_batches[dist.get_rank() :: dist.get_world_size()]
+     data_loader = DataLoader(dataset_obj, batch_sampler=rank_batches, num_workers=num_workers, prefetch_factor=prefetch_factor)
+
+     # Accumulate statistics.
+     mprint(f'Calculating statistics for {len(dataset_obj)} images...')
+     mu = torch.zeros([feature_dim], dtype=torch.float64, device=device)
+     sigma = torch.zeros([feature_dim, feature_dim], dtype=torch.float64, device=device)
+     for images, _labels in tqdm.tqdm(data_loader, unit='batch', disable=(dist.get_rank() != 0)):
+         dist.barrier()
+         if images.shape[0] == 0:
+             continue
+         if images.shape[1] == 1:
+             images = images.repeat([1, 3, 1, 1])
+         features = detector_net(images.to(device), **detector_kwargs).to(torch.float64)
+         mu += features.sum(0)
+         sigma += features.T @ features
+
+     # Calculate grand totals.
+     dist.all_reduce(mu)
+     dist.all_reduce(sigma)
+     mu /= len(dataset_obj)
+     sigma -= mu.ger(mu) * len(dataset_obj)
+     sigma /= len(dataset_obj) - 1
+     return mu.cpu().numpy(), sigma.cpu().numpy()
+
+ #----------------------------------------------------------------------------
+
+ def calculate_fid_from_inception_stats(mu, sigma, mu_ref, sigma_ref):
+     m = np.square(mu - mu_ref).sum()
+     s, _ = scipy.linalg.sqrtm(np.dot(sigma, sigma_ref), disp=False)
+     fid = m + np.trace(sigma + sigma_ref - s * 2)
+     return float(np.real(fid))
+
93
+ #----------------------------------------------------------------------------
94
+
95
+
96
+ def calc(image_path, ref_path, num_expected, seed, batch):
97
+ """Calculate FID for a given set of images."""
98
+ if dist.get_rank() == 0:
99
+ logger = Logger(file_name=f'{image_path}/log_fid.txt', file_mode="a+", should_flush=True)
100
+
101
+ mprint(f'Loading dataset reference statistics from "{ref_path}"...')
102
+ ref = None
103
+ if dist.get_rank() == 0:
104
+ assert ref_path.endswith('.npz')
105
+ ref = dict(np.load(ref_path))
106
+
107
+ mu, sigma = calculate_inception_stats(image_path=image_path, num_expected=num_expected, seed=seed, max_batch_size=batch)
108
+ mprint('Calculating FID...')
109
+ fid = None
110
+ if dist.get_rank() == 0:
111
+ fid = calculate_fid_from_inception_stats(mu, sigma, ref['mu'], ref['sigma'])
112
+ print(f'{fid:g}')
113
+
114
+ dist.barrier()
115
+ if dist.get_rank() == 0:
116
+ logger.close()
117
+
118
+ return fid
119
+
120
+ #----------------------------------------------------------------------------
121
+
122
+
123
+ def ref(dataset_path, dest_path, batch):
124
+ """Calculate dataset reference statistics needed by 'calc'."""
125
+
126
+ mu, sigma = calculate_inception_stats(image_path=dataset_path, max_batch_size=batch)
127
+ mprint(f'Saving dataset reference statistics to "{dest_path}"...')
128
+ if dist.get_rank() == 0:
129
+ if os.path.dirname(dest_path):
130
+ os.makedirs(os.path.dirname(dest_path), exist_ok=True)
131
+ np.savez(dest_path, mu=mu, sigma=sigma)
132
+
133
+ dist.barrier()
134
+ mprint('Done.')
135
+
136
+
137
+ if __name__ == '__main__':
138
+ parser = argparse.ArgumentParser('fid parameters')
139
+
140
+ # ddp
141
+ parser.add_argument('--num_proc_node', type=int, default=1, help='The number of nodes in multi node env.')
142
+ parser.add_argument('--num_process_per_node', type=int, default=1, help='number of gpus')
143
+ parser.add_argument('--node_rank', type=int, default=0, help='The index of node.')
144
+ parser.add_argument('--local_rank', type=int, default=0, help='rank of process in the node')
145
+ parser.add_argument('--master_address', type=str, default='localhost', help='address for master')
146
+
147
+ # fid
+     parser.add_argument('--mode', type=str, required=True, choices=['calc', 'ref'], help='Calculate FID or store reference statistics')
+     parser.add_argument('--image_path', type=str, required=True, help='Path to the images')
+     parser.add_argument('--ref_path', type=str, default='assets/fid_stats/fid_stats_imagenet256_guided_diffusion.npz', help='Dataset reference statistics')
+     parser.add_argument('--num_expected', type=int, default=50000, help='Number of images to use')
+     parser.add_argument('--seed', type=int, default=0, help='Random seed for selecting the images')
+     parser.add_argument('--batch', type=int, default=64, help='Maximum batch size per GPU')
+
+     args = parser.parse_args()
+     args.global_size = args.num_proc_node * args.num_process_per_node
+     size = args.num_process_per_node
+
+     # Select the worker function according to the requested mode.
+     if args.mode == 'calc':
+         func = lambda args: calc(args.image_path, args.ref_path, args.num_expected, args.seed, args.batch)
+     else:
+         func = lambda args: ref(args.image_path, args.ref_path, args.batch)
+
+     if size > 1:
+         processes = []
+         for rank in range(size):
+             args.local_rank = rank
+             args.global_rank = rank + args.node_rank * args.num_process_per_node
+             p = Process(target=init_processes, args=(func, args))
+             p.start()
+             processes.append(p)
+
+         for p in processes:
+             p.join()
+     else:
+         print('Single GPU run')
+         assert args.global_size == 1 and args.local_rank == 0
+         args.global_rank = 0
+         init_processes(func, args)
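calculate_fid_from_inception_stats implements FID = ||mu - mu_ref||^2 + Tr(sigma + sigma_ref - 2 (sigma sigma_ref)^(1/2)). A quick sanity check (not part of this commit; toy feature dimensions instead of the real 2048-d Inception features, assuming it is run from the repo root with its dependencies installed) is that identical statistics give a distance of roughly zero:

import numpy as np
from fid import calculate_fid_from_inception_stats

rng = np.random.default_rng(0)
feats = rng.standard_normal((1000, 16))  # toy features; the real script uses 2048-d Inception features
mu, sigma = feats.mean(axis=0), np.cov(feats, rowvar=False)
print(calculate_fid_from_inception_stats(mu, sigma, mu, sigma))  # ~0 up to numerical error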
generate.py ADDED
@@ -0,0 +1,91 @@
+ # MIT License
+
+ # Copyright (c) [2023] [Anima-Lab]
+
+
+ from argparse import ArgumentParser
+ import os
+ import json
+ from omegaconf import OmegaConf
+
+ import torch
+ from models.maskdit import Precond_models
+
+ from sample import generate_with_net
+ from utils import parse_float_none, parse_int_list, init_processes
+
+
+ def generate(args):
+     rank = args.global_rank
+     size = args.global_size
+     config = OmegaConf.load(args.config)
+     label_dict = json.load(open(args.label_dict, 'r'))
+     class_label = label_dict[str(args.class_idx)][1]
+     print(f'start sampling class {class_label}...')
+     device = torch.device('cuda')
+     # setup directory
+     sample_dir = os.path.join(args.results_dir, class_label)
+     os.makedirs(sample_dir, exist_ok=True)
+     args.outdir = sample_dir
+     # setup model
+     model = Precond_models[config.model.precond](
+         img_resolution=config.model.in_size,
+         img_channels=config.model.in_channels,
+         num_classes=config.model.num_classes,
+         model_type=config.model.model_type,
+         use_decoder=config.model.use_decoder,
+         mae_loss_coef=config.model.mae_loss_coef,
+         pad_cls_token=config.model.pad_cls_token,
+         use_encoder_feat=config.model.self_cond,
+     ).to(device)
+
+     model.eval()
+     print(f"{config.model.model_type} (use_decoder: {config.model.use_decoder}) Model Parameters: {sum(p.numel() for p in model.parameters()):,}")
+     print(f'extras: {model.model.extras}, cls_token: {model.model.cls_token}')
+
+     model = torch.compile(model)
+     ckpt = torch.load(args.ckpt_path, map_location=device)
+     model.load_state_dict(ckpt['ema'])
+     generate_with_net(args, model, device, rank, size)
+
+     print(f'sampling class {class_label} done!')
+
+
+ if __name__ == '__main__':
+     parser = ArgumentParser('Sample from a trained model')
+     # basic config
+     parser.add_argument('--config', type=str, required=True, help='path to config file')
+     parser.add_argument('--label_dict', type=str, default='assets/imagenet_label.json', help='path to label dict')
+     parser.add_argument("--results_dir", type=str, default="samples", help='path to save samples')
+     parser.add_argument('--ckpt_path', type=str, default=None, help='path to ckpt')
+
+     # sampling
+     parser.add_argument('--seeds', type=parse_int_list, default='100-131', help='Random seeds (e.g. 1,2,5-10)')
+     parser.add_argument('--subdirs', action='store_true', help='Create subdirectory for every 1000 seeds')
+     parser.add_argument('--class_idx', type=int, default=None, help='Class label [default: random]')
+     parser.add_argument("--cfg_scale", type=parse_float_none, default=None, help='None = no guidance, by default = 4.0')
+
+     parser.add_argument('--num_steps', type=int, default=40, help='Number of sampling steps')
+     parser.add_argument('--S_churn', type=int, default=0, help='Stochasticity strength')
+     parser.add_argument('--solver', type=str, default=None, choices=['euler', 'heun'], help='Ablate ODE solver')
+     parser.add_argument('--discretization', type=str, default=None, choices=['vp', 've', 'iddpm', 'edm'], help='Ablate time step discretization')
+     parser.add_argument('--schedule', type=str, default=None, choices=['vp', 've', 'linear'], help='Ablate noise schedule sigma(t)')
+     parser.add_argument('--scaling', type=str, default=None, choices=['vp', 'none'], help='Ablate signal scaling s(t)')
+     parser.add_argument('--pretrained_path', type=str, default='assets/autoencoder_kl.pth', help='Autoencoder ckpt')
+
+     parser.add_argument('--max_batch_size', type=int, default=32, help='Maximum batch size per GPU during sampling')
+     parser.add_argument('--num_expected', type=int, default=32, help='Number of images to use')
+     parser.add_argument("--global_seed", type=int, default=0)
+     parser.add_argument('--fid_batch_size', type=int, default=32, help='Maximum batch size')
+
+     # ddp
+     parser.add_argument('--num_proc_node', type=int, default=1, help='The number of nodes in multi node env.')
+     parser.add_argument('--num_process_per_node', type=int, default=1, help='number of gpus')
+     parser.add_argument('--node_rank', type=int, default=0, help='The index of node.')
+     parser.add_argument('--local_rank', type=int, default=0, help='rank of process in the node')
+     parser.add_argument('--master_address', type=str, default='localhost', help='address for master')
+     args = parser.parse_args()
+     args.global_rank = 0
+     args.local_rank = 0
+     args.global_size = 1
+     init_processes(generate, args)
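generate.py resolves --class_idx into a human-readable folder name through assets/imagenet_label.json. A small sketch of that lookup (not part of this commit; the JSON is assumed to follow the common ImageNet index format {"207": ["wnid", "class_name"], ...}, which is what label_dict[str(idx)][1] above implies):

import json

with open('assets/imagenet_label.json', 'r') as f:
    label_dict = json.load(f)

class_idx = 207  # hypothetical class index
print(label_dict[str(class_idx)][1])  # the name generate.py uses as the sample directory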
licenses/LICENSE_ADM.txt ADDED
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2021 OpenAI
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
licenses/LICENSE_DIT.txt ADDED
@@ -0,0 +1,400 @@
1
+
2
+ Attribution-NonCommercial 4.0 International
3
+
4
+ =======================================================================
5
+
6
+ Creative Commons Corporation ("Creative Commons") is not a law firm and
7
+ does not provide legal services or legal advice. Distribution of
8
+ Creative Commons public licenses does not create a lawyer-client or
9
+ other relationship. Creative Commons makes its licenses and related
10
+ information available on an "as-is" basis. Creative Commons gives no
11
+ warranties regarding its licenses, any material licensed under their
12
+ terms and conditions, or any related information. Creative Commons
13
+ disclaims all liability for damages resulting from their use to the
14
+ fullest extent possible.
15
+
16
+ Using Creative Commons Public Licenses
17
+
18
+ Creative Commons public licenses provide a standard set of terms and
19
+ conditions that creators and other rights holders may use to share
20
+ original works of authorship and other material subject to copyright
21
+ and certain other rights specified in the public license below. The
22
+ following considerations are for informational purposes only, are not
23
+ exhaustive, and do not form part of our licenses.
24
+
25
+ Considerations for licensors: Our public licenses are
26
+ intended for use by those authorized to give the public
27
+ permission to use material in ways otherwise restricted by
28
+ copyright and certain other rights. Our licenses are
29
+ irrevocable. Licensors should read and understand the terms
30
+ and conditions of the license they choose before applying it.
31
+ Licensors should also secure all rights necessary before
32
+ applying our licenses so that the public can reuse the
33
+ material as expected. Licensors should clearly mark any
34
+ material not subject to the license. This includes other CC-
35
+ licensed material, or material used under an exception or
36
+ limitation to copyright. More considerations for licensors:
37
+ wiki.creativecommons.org/Considerations_for_licensors
38
+
39
+ Considerations for the public: By using one of our public
40
+ licenses, a licensor grants the public permission to use the
41
+ licensed material under specified terms and conditions. If
42
+ the licensor's permission is not necessary for any reason--for
43
+ example, because of any applicable exception or limitation to
44
+ copyright--then that use is not regulated by the license. Our
45
+ licenses grant only permissions under copyright and certain
46
+ other rights that a licensor has authority to grant. Use of
47
+ the licensed material may still be restricted for other
48
+ reasons, including because others have copyright or other
49
+ rights in the material. A licensor may make special requests,
50
+ such as asking that all changes be marked or described.
51
+ Although not required by our licenses, you are encouraged to
52
+ respect those requests where reasonable. More_considerations
53
+ for the public:
54
+ wiki.creativecommons.org/Considerations_for_licensees
55
+
56
+ =======================================================================
57
+
58
+ Creative Commons Attribution-NonCommercial 4.0 International Public
59
+ License
60
+
61
+ By exercising the Licensed Rights (defined below), You accept and agree
62
+ to be bound by the terms and conditions of this Creative Commons
63
+ Attribution-NonCommercial 4.0 International Public License ("Public
64
+ License"). To the extent this Public License may be interpreted as a
65
+ contract, You are granted the Licensed Rights in consideration of Your
66
+ acceptance of these terms and conditions, and the Licensor grants You
67
+ such rights in consideration of benefits the Licensor receives from
68
+ making the Licensed Material available under these terms and
69
+ conditions.
70
+
71
+ Section 1 -- Definitions.
72
+
73
+ a. Adapted Material means material subject to Copyright and Similar
74
+ Rights that is derived from or based upon the Licensed Material
75
+ and in which the Licensed Material is translated, altered,
76
+ arranged, transformed, or otherwise modified in a manner requiring
77
+ permission under the Copyright and Similar Rights held by the
78
+ Licensor. For purposes of this Public License, where the Licensed
79
+ Material is a musical work, performance, or sound recording,
80
+ Adapted Material is always produced where the Licensed Material is
81
+ synched in timed relation with a moving image.
82
+
83
+ b. Adapter's License means the license You apply to Your Copyright
84
+ and Similar Rights in Your contributions to Adapted Material in
85
+ accordance with the terms and conditions of this Public License.
86
+
87
+ c. Copyright and Similar Rights means copyright and/or similar rights
88
+ closely related to copyright including, without limitation,
89
+ performance, broadcast, sound recording, and Sui Generis Database
90
+ Rights, without regard to how the rights are labeled or
91
+ categorized. For purposes of this Public License, the rights
92
+ specified in Section 2(b)(1)-(2) are not Copyright and Similar
93
+ Rights.
94
+ d. Effective Technological Measures means those measures that, in the
95
+ absence of proper authority, may not be circumvented under laws
96
+ fulfilling obligations under Article 11 of the WIPO Copyright
97
+ Treaty adopted on December 20, 1996, and/or similar international
98
+ agreements.
99
+
100
+ e. Exceptions and Limitations means fair use, fair dealing, and/or
101
+ any other exception or limitation to Copyright and Similar Rights
102
+ that applies to Your use of the Licensed Material.
103
+
104
+ f. Licensed Material means the artistic or literary work, database,
105
+ or other material to which the Licensor applied this Public
106
+ License.
107
+
108
+ g. Licensed Rights means the rights granted to You subject to the
109
+ terms and conditions of this Public License, which are limited to
110
+ all Copyright and Similar Rights that apply to Your use of the
111
+ Licensed Material and that the Licensor has authority to license.
112
+
113
+ h. Licensor means the individual(s) or entity(ies) granting rights
114
+ under this Public License.
115
+
116
+ i. NonCommercial means not primarily intended for or directed towards
117
+ commercial advantage or monetary compensation. For purposes of
118
+ this Public License, the exchange of the Licensed Material for
119
+ other material subject to Copyright and Similar Rights by digital
120
+ file-sharing or similar means is NonCommercial provided there is
121
+ no payment of monetary compensation in connection with the
122
+ exchange.
123
+
124
+ j. Share means to provide material to the public by any means or
125
+ process that requires permission under the Licensed Rights, such
126
+ as reproduction, public display, public performance, distribution,
127
+ dissemination, communication, or importation, and to make material
128
+ available to the public including in ways that members of the
129
+ public may access the material from a place and at a time
130
+ individually chosen by them.
131
+
132
+ k. Sui Generis Database Rights means rights other than copyright
133
+ resulting from Directive 96/9/EC of the European Parliament and of
134
+ the Council of 11 March 1996 on the legal protection of databases,
135
+ as amended and/or succeeded, as well as other essentially
136
+ equivalent rights anywhere in the world.
137
+
138
+ l. You means the individual or entity exercising the Licensed Rights
139
+ under this Public License. Your has a corresponding meaning.
140
+
141
+ Section 2 -- Scope.
142
+
143
+ a. License grant.
144
+
145
+ 1. Subject to the terms and conditions of this Public License,
146
+ the Licensor hereby grants You a worldwide, royalty-free,
147
+ non-sublicensable, non-exclusive, irrevocable license to
148
+ exercise the Licensed Rights in the Licensed Material to:
149
+
150
+ a. reproduce and Share the Licensed Material, in whole or
151
+ in part, for NonCommercial purposes only; and
152
+
153
+ b. produce, reproduce, and Share Adapted Material for
154
+ NonCommercial purposes only.
155
+
156
+ 2. Exceptions and Limitations. For the avoidance of doubt, where
157
+ Exceptions and Limitations apply to Your use, this Public
158
+ License does not apply, and You do not need to comply with
159
+ its terms and conditions.
160
+
161
+ 3. Term. The term of this Public License is specified in Section
162
+ 6(a).
163
+
164
+ 4. Media and formats; technical modifications allowed. The
165
+ Licensor authorizes You to exercise the Licensed Rights in
166
+ all media and formats whether now known or hereafter created,
167
+ and to make technical modifications necessary to do so. The
168
+ Licensor waives and/or agrees not to assert any right or
169
+ authority to forbid You from making technical modifications
170
+ necessary to exercise the Licensed Rights, including
171
+ technical modifications necessary to circumvent Effective
172
+ Technological Measures. For purposes of this Public License,
173
+ simply making modifications authorized by this Section 2(a)
174
+ (4) never produces Adapted Material.
175
+
176
+ 5. Downstream recipients.
177
+
178
+ a. Offer from the Licensor -- Licensed Material. Every
179
+ recipient of the Licensed Material automatically
180
+ receives an offer from the Licensor to exercise the
181
+ Licensed Rights under the terms and conditions of this
182
+ Public License.
183
+
184
+ b. No downstream restrictions. You may not offer or impose
185
+ any additional or different terms or conditions on, or
186
+ apply any Effective Technological Measures to, the
187
+ Licensed Material if doing so restricts exercise of the
188
+ Licensed Rights by any recipient of the Licensed
189
+ Material.
190
+
191
+ 6. No endorsement. Nothing in this Public License constitutes or
192
+ may be construed as permission to assert or imply that You
193
+ are, or that Your use of the Licensed Material is, connected
194
+ with, or sponsored, endorsed, or granted official status by,
195
+ the Licensor or others designated to receive attribution as
196
+ provided in Section 3(a)(1)(A)(i).
197
+
198
+ b. Other rights.
199
+
200
+ 1. Moral rights, such as the right of integrity, are not
201
+ licensed under this Public License, nor are publicity,
202
+ privacy, and/or other similar personality rights; however, to
203
+ the extent possible, the Licensor waives and/or agrees not to
204
+ assert any such rights held by the Licensor to the limited
205
+ extent necessary to allow You to exercise the Licensed
206
+ Rights, but not otherwise.
207
+
208
+ 2. Patent and trademark rights are not licensed under this
209
+ Public License.
210
+
211
+ 3. To the extent possible, the Licensor waives any right to
212
+ collect royalties from You for the exercise of the Licensed
213
+ Rights, whether directly or through a collecting society
214
+ under any voluntary or waivable statutory or compulsory
215
+ licensing scheme. In all other cases the Licensor expressly
216
+ reserves any right to collect such royalties, including when
217
+ the Licensed Material is used other than for NonCommercial
218
+ purposes.
219
+
220
+ Section 3 -- License Conditions.
221
+
222
+ Your exercise of the Licensed Rights is expressly made subject to the
223
+ following conditions.
224
+
225
+ a. Attribution.
226
+
227
+ 1. If You Share the Licensed Material (including in modified
228
+ form), You must:
229
+
230
+ a. retain the following if it is supplied by the Licensor
231
+ with the Licensed Material:
232
+
233
+ i. identification of the creator(s) of the Licensed
234
+ Material and any others designated to receive
235
+ attribution, in any reasonable manner requested by
236
+ the Licensor (including by pseudonym if
237
+ designated);
238
+
239
+ ii. a copyright notice;
240
+
241
+ iii. a notice that refers to this Public License;
242
+
243
+ iv. a notice that refers to the disclaimer of
244
+ warranties;
245
+
246
+ v. a URI or hyperlink to the Licensed Material to the
247
+ extent reasonably practicable;
248
+
249
+ b. indicate if You modified the Licensed Material and
250
+ retain an indication of any previous modifications; and
251
+
252
+ c. indicate the Licensed Material is licensed under this
253
+ Public License, and include the text of, or the URI or
254
+ hyperlink to, this Public License.
255
+
256
+ 2. You may satisfy the conditions in Section 3(a)(1) in any
257
+ reasonable manner based on the medium, means, and context in
258
+ which You Share the Licensed Material. For example, it may be
259
+ reasonable to satisfy the conditions by providing a URI or
260
+ hyperlink to a resource that includes the required
261
+ information.
262
+
263
+ 3. If requested by the Licensor, You must remove any of the
264
+ information required by Section 3(a)(1)(A) to the extent
265
+ reasonably practicable.
266
+
267
+ 4. If You Share Adapted Material You produce, the Adapter's
268
+ License You apply must not prevent recipients of the Adapted
269
+ Material from complying with this Public License.
270
+
271
+ Section 4 -- Sui Generis Database Rights.
272
+
273
+ Where the Licensed Rights include Sui Generis Database Rights that
274
+ apply to Your use of the Licensed Material:
275
+
276
+ a. for the avoidance of doubt, Section 2(a)(1) grants You the right
277
+ to extract, reuse, reproduce, and Share all or a substantial
278
+ portion of the contents of the database for NonCommercial purposes
279
+ only;
280
+
281
+ b. if You include all or a substantial portion of the database
282
+ contents in a database in which You have Sui Generis Database
283
+ Rights, then the database in which You have Sui Generis Database
284
+ Rights (but not its individual contents) is Adapted Material; and
285
+
286
+ c. You must comply with the conditions in Section 3(a) if You Share
287
+ all or a substantial portion of the contents of the database.
288
+
289
+ For the avoidance of doubt, this Section 4 supplements and does not
290
+ replace Your obligations under this Public License where the Licensed
291
+ Rights include other Copyright and Similar Rights.
292
+
293
+ Section 5 -- Disclaimer of Warranties and Limitation of Liability.
294
+
295
+ a. UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE
296
+ EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS
297
+ AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF
298
+ ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS,
299
+ IMPLIED, STATUTORY, OR OTHER. THIS INCLUDES, WITHOUT LIMITATION,
300
+ WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR
301
+ PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS,
302
+ ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT
303
+ KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT
304
+ ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU.
305
+
306
+ b. TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE
307
+ TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION,
308
+ NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT,
309
+ INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES,
310
+ COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR
311
+ USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN
312
+ ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR
313
+ DAMAGES. WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR
314
+ IN PART, THIS LIMITATION MAY NOT APPLY TO YOU.
315
+
316
+ c. The disclaimer of warranties and limitation of liability provided
317
+ above shall be interpreted in a manner that, to the extent
318
+ possible, most closely approximates an absolute disclaimer and
319
+ waiver of all liability.
320
+
321
+ Section 6 -- Term and Termination.
322
+
323
+ a. This Public License applies for the term of the Copyright and
324
+ Similar Rights licensed here. However, if You fail to comply with
325
+ this Public License, then Your rights under this Public License
326
+ terminate automatically.
327
+
328
+ b. Where Your right to use the Licensed Material has terminated under
329
+ Section 6(a), it reinstates:
330
+
331
+ 1. automatically as of the date the violation is cured, provided
332
+ it is cured within 30 days of Your discovery of the
333
+ violation; or
334
+
335
+ 2. upon express reinstatement by the Licensor.
336
+
337
+ For the avoidance of doubt, this Section 6(b) does not affect any
338
+ right the Licensor may have to seek remedies for Your violations
339
+ of this Public License.
340
+
341
+ c. For the avoidance of doubt, the Licensor may also offer the
342
+ Licensed Material under separate terms or conditions or stop
343
+ distributing the Licensed Material at any time; however, doing so
344
+ will not terminate this Public License.
345
+
346
+ d. Sections 1, 5, 6, 7, and 8 survive termination of this Public
347
+ License.
348
+
349
+ Section 7 -- Other Terms and Conditions.
350
+
351
+ a. The Licensor shall not be bound by any additional or different
352
+ terms or conditions communicated by You unless expressly agreed.
353
+
354
+ b. Any arrangements, understandings, or agreements regarding the
355
+ Licensed Material not stated herein are separate from and
356
+ independent of the terms and conditions of this Public License.
357
+
358
+ Section 8 -- Interpretation.
359
+
360
+ a. For the avoidance of doubt, this Public License does not, and
361
+ shall not be interpreted to, reduce, limit, restrict, or impose
362
+ conditions on any use of the Licensed Material that could lawfully
363
+ be made without permission under this Public License.
364
+
365
+ b. To the extent possible, if any provision of this Public License is
366
+ deemed unenforceable, it shall be automatically reformed to the
367
+ minimum extent necessary to make it enforceable. If the provision
368
+ cannot be reformed, it shall be severed from this Public License
369
+ without affecting the enforceability of the remaining terms and
370
+ conditions.
371
+
372
+ c. No term or condition of this Public License will be waived and no
373
+ failure to comply consented to unless expressly agreed to by the
374
+ Licensor.
375
+
376
+ d. Nothing in this Public License constitutes or may be interpreted
377
+ as a limitation upon, or waiver of, any privileges and immunities
378
+ that apply to the Licensor or You, including from the legal
379
+ processes of any jurisdiction or authority.
380
+
381
+ =======================================================================
382
+
383
+ Creative Commons is not a party to its public
384
+ licenses. Notwithstanding, Creative Commons may elect to apply one of
385
+ its public licenses to material it publishes and in those instances
386
+ will be considered the “Licensor.” The text of the Creative Commons
387
+ public licenses is dedicated to the public domain under the CC0 Public
388
+ Domain Dedication. Except for the limited purpose of indicating that
389
+ material is shared under a Creative Commons public license or as
390
+ otherwise permitted by the Creative Commons policies published at
391
+ creativecommons.org/policies, Creative Commons does not authorize the
392
+ use of the trademark "Creative Commons" or any other trademark or logo
393
+ of Creative Commons without its prior written consent including,
394
+ without limitation, in connection with any unauthorized modifications
395
+ to any of its public licenses or any other arrangements,
396
+ understandings, or agreements concerning use of licensed material. For
397
+ the avoidance of doubt, this paragraph does not form part of the
398
+ public licenses.
399
+
400
+ Creative Commons may be contacted at creativecommons.org.
licenses/LICENSE_EDM.txt ADDED
@@ -0,0 +1,439 @@
1
+ Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
2
+
3
+ Attribution-NonCommercial-ShareAlike 4.0 International
4
+
5
+ =======================================================================
6
+
7
+ Creative Commons Corporation ("Creative Commons") is not a law firm and
8
+ does not provide legal services or legal advice. Distribution of
9
+ Creative Commons public licenses does not create a lawyer-client or
10
+ other relationship. Creative Commons makes its licenses and related
11
+ information available on an "as-is" basis. Creative Commons gives no
12
+ warranties regarding its licenses, any material licensed under their
13
+ terms and conditions, or any related information. Creative Commons
14
+ disclaims all liability for damages resulting from their use to the
15
+ fullest extent possible.
16
+
17
+ Using Creative Commons Public Licenses
18
+
19
+ Creative Commons public licenses provide a standard set of terms and
20
+ conditions that creators and other rights holders may use to share
21
+ original works of authorship and other material subject to copyright
22
+ and certain other rights specified in the public license below. The
23
+ following considerations are for informational purposes only, are not
24
+ exhaustive, and do not form part of our licenses.
25
+
26
+ Considerations for licensors: Our public licenses are
27
+ intended for use by those authorized to give the public
28
+ permission to use material in ways otherwise restricted by
29
+ copyright and certain other rights. Our licenses are
30
+ irrevocable. Licensors should read and understand the terms
31
+ and conditions of the license they choose before applying it.
32
+ Licensors should also secure all rights necessary before
33
+ applying our licenses so that the public can reuse the
34
+ material as expected. Licensors should clearly mark any
35
+ material not subject to the license. This includes other CC-
36
+ licensed material, or material used under an exception or
37
+ limitation to copyright. More considerations for licensors:
38
+ wiki.creativecommons.org/Considerations_for_licensors
39
+
40
+ Considerations for the public: By using one of our public
41
+ licenses, a licensor grants the public permission to use the
42
+ licensed material under specified terms and conditions. If
43
+ the licensor's permission is not necessary for any reason--for
44
+ example, because of any applicable exception or limitation to
45
+ copyright--then that use is not regulated by the license. Our
46
+ licenses grant only permissions under copyright and certain
47
+ other rights that a licensor has authority to grant. Use of
48
+ the licensed material may still be restricted for other
49
+ reasons, including because others have copyright or other
50
+ rights in the material. A licensor may make special requests,
51
+ such as asking that all changes be marked or described.
52
+ Although not required by our licenses, you are encouraged to
53
+ respect those requests where reasonable. More considerations
54
+ for the public:
55
+ wiki.creativecommons.org/Considerations_for_licensees
56
+
57
+ =======================================================================
58
+
59
+ Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International
60
+ Public License
61
+
62
+ By exercising the Licensed Rights (defined below), You accept and agree
63
+ to be bound by the terms and conditions of this Creative Commons
64
+ Attribution-NonCommercial-ShareAlike 4.0 International Public License
65
+ ("Public License"). To the extent this Public License may be
66
+ interpreted as a contract, You are granted the Licensed Rights in
67
+ consideration of Your acceptance of these terms and conditions, and the
68
+ Licensor grants You such rights in consideration of benefits the
69
+ Licensor receives from making the Licensed Material available under
70
+ these terms and conditions.
71
+
72
+
73
+ Section 1 -- Definitions.
74
+
75
+ a. Adapted Material means material subject to Copyright and Similar
76
+ Rights that is derived from or based upon the Licensed Material
77
+ and in which the Licensed Material is translated, altered,
78
+ arranged, transformed, or otherwise modified in a manner requiring
79
+ permission under the Copyright and Similar Rights held by the
80
+ Licensor. For purposes of this Public License, where the Licensed
81
+ Material is a musical work, performance, or sound recording,
82
+ Adapted Material is always produced where the Licensed Material is
83
+ synched in timed relation with a moving image.
84
+
85
+ b. Adapter's License means the license You apply to Your Copyright
86
+ and Similar Rights in Your contributions to Adapted Material in
87
+ accordance with the terms and conditions of this Public License.
88
+
89
+ c. BY-NC-SA Compatible License means a license listed at
90
+ creativecommons.org/compatiblelicenses, approved by Creative
91
+ Commons as essentially the equivalent of this Public License.
92
+
93
+ d. Copyright and Similar Rights means copyright and/or similar rights
94
+ closely related to copyright including, without limitation,
95
+ performance, broadcast, sound recording, and Sui Generis Database
96
+ Rights, without regard to how the rights are labeled or
97
+ categorized. For purposes of this Public License, the rights
98
+ specified in Section 2(b)(1)-(2) are not Copyright and Similar
99
+ Rights.
100
+
101
+ e. Effective Technological Measures means those measures that, in the
102
+ absence of proper authority, may not be circumvented under laws
103
+ fulfilling obligations under Article 11 of the WIPO Copyright
104
+ Treaty adopted on December 20, 1996, and/or similar international
105
+ agreements.
106
+
107
+ f. Exceptions and Limitations means fair use, fair dealing, and/or
108
+ any other exception or limitation to Copyright and Similar Rights
109
+ that applies to Your use of the Licensed Material.
110
+
111
+ g. License Elements means the license attributes listed in the name
112
+ of a Creative Commons Public License. The License Elements of this
113
+ Public License are Attribution, NonCommercial, and ShareAlike.
114
+
115
+ h. Licensed Material means the artistic or literary work, database,
116
+ or other material to which the Licensor applied this Public
117
+ License.
118
+
119
+ i. Licensed Rights means the rights granted to You subject to the
120
+ terms and conditions of this Public License, which are limited to
121
+ all Copyright and Similar Rights that apply to Your use of the
122
+ Licensed Material and that the Licensor has authority to license.
123
+
124
+ j. Licensor means the individual(s) or entity(ies) granting rights
125
+ under this Public License.
126
+
127
+ k. NonCommercial means not primarily intended for or directed towards
128
+ commercial advantage or monetary compensation. For purposes of
129
+ this Public License, the exchange of the Licensed Material for
130
+ other material subject to Copyright and Similar Rights by digital
131
+ file-sharing or similar means is NonCommercial provided there is
132
+ no payment of monetary compensation in connection with the
133
+ exchange.
134
+
135
+ l. Share means to provide material to the public by any means or
136
+ process that requires permission under the Licensed Rights, such
137
+ as reproduction, public display, public performance, distribution,
138
+ dissemination, communication, or importation, and to make material
139
+ available to the public including in ways that members of the
140
+ public may access the material from a place and at a time
141
+ individually chosen by them.
142
+
143
+ m. Sui Generis Database Rights means rights other than copyright
144
+ resulting from Directive 96/9/EC of the European Parliament and of
145
+ the Council of 11 March 1996 on the legal protection of databases,
146
+ as amended and/or succeeded, as well as other essentially
147
+ equivalent rights anywhere in the world.
148
+
149
+ n. You means the individual or entity exercising the Licensed Rights
150
+ under this Public License. Your has a corresponding meaning.
151
+
152
+
153
+ Section 2 -- Scope.
154
+
155
+ a. License grant.
156
+
157
+ 1. Subject to the terms and conditions of this Public License,
158
+ the Licensor hereby grants You a worldwide, royalty-free,
159
+ non-sublicensable, non-exclusive, irrevocable license to
160
+ exercise the Licensed Rights in the Licensed Material to:
161
+
162
+ a. reproduce and Share the Licensed Material, in whole or
163
+ in part, for NonCommercial purposes only; and
164
+
165
+ b. produce, reproduce, and Share Adapted Material for
166
+ NonCommercial purposes only.
167
+
168
+ 2. Exceptions and Limitations. For the avoidance of doubt, where
169
+ Exceptions and Limitations apply to Your use, this Public
170
+ License does not apply, and You do not need to comply with
171
+ its terms and conditions.
172
+
173
+ 3. Term. The term of this Public License is specified in Section
174
+ 6(a).
175
+
176
+ 4. Media and formats; technical modifications allowed. The
177
+ Licensor authorizes You to exercise the Licensed Rights in
178
+ all media and formats whether now known or hereafter created,
179
+ and to make technical modifications necessary to do so. The
180
+ Licensor waives and/or agrees not to assert any right or
181
+ authority to forbid You from making technical modifications
182
+ necessary to exercise the Licensed Rights, including
183
+ technical modifications necessary to circumvent Effective
184
+ Technological Measures. For purposes of this Public License,
185
+ simply making modifications authorized by this Section 2(a)
186
+ (4) never produces Adapted Material.
187
+
188
+ 5. Downstream recipients.
189
+
190
+ a. Offer from the Licensor -- Licensed Material. Every
191
+ recipient of the Licensed Material automatically
192
+ receives an offer from the Licensor to exercise the
193
+ Licensed Rights under the terms and conditions of this
194
+ Public License.
195
+
196
+ b. Additional offer from the Licensor -- Adapted Material.
197
+ Every recipient of Adapted Material from You
198
+ automatically receives an offer from the Licensor to
199
+ exercise the Licensed Rights in the Adapted Material
200
+ under the conditions of the Adapter's License You apply.
201
+
202
+ c. No downstream restrictions. You may not offer or impose
203
+ any additional or different terms or conditions on, or
204
+ apply any Effective Technological Measures to, the
205
+ Licensed Material if doing so restricts exercise of the
206
+ Licensed Rights by any recipient of the Licensed
207
+ Material.
208
+
209
+ 6. No endorsement. Nothing in this Public License constitutes or
210
+ may be construed as permission to assert or imply that You
211
+ are, or that Your use of the Licensed Material is, connected
212
+ with, or sponsored, endorsed, or granted official status by,
213
+ the Licensor or others designated to receive attribution as
214
+ provided in Section 3(a)(1)(A)(i).
215
+
216
+ b. Other rights.
217
+
218
+ 1. Moral rights, such as the right of integrity, are not
219
+ licensed under this Public License, nor are publicity,
220
+ privacy, and/or other similar personality rights; however, to
221
+ the extent possible, the Licensor waives and/or agrees not to
222
+ assert any such rights held by the Licensor to the limited
223
+ extent necessary to allow You to exercise the Licensed
224
+ Rights, but not otherwise.
225
+
226
+ 2. Patent and trademark rights are not licensed under this
227
+ Public License.
228
+
229
+ 3. To the extent possible, the Licensor waives any right to
230
+ collect royalties from You for the exercise of the Licensed
231
+ Rights, whether directly or through a collecting society
232
+ under any voluntary or waivable statutory or compulsory
233
+ licensing scheme. In all other cases the Licensor expressly
234
+ reserves any right to collect such royalties, including when
235
+ the Licensed Material is used other than for NonCommercial
236
+ purposes.
237
+
238
+
239
+ Section 3 -- License Conditions.
240
+
241
+ Your exercise of the Licensed Rights is expressly made subject to the
242
+ following conditions.
243
+
244
+ a. Attribution.
245
+
246
+ 1. If You Share the Licensed Material (including in modified
247
+ form), You must:
248
+
249
+ a. retain the following if it is supplied by the Licensor
250
+ with the Licensed Material:
251
+
252
+ i. identification of the creator(s) of the Licensed
253
+ Material and any others designated to receive
254
+ attribution, in any reasonable manner requested by
255
+ the Licensor (including by pseudonym if
256
+ designated);
257
+
258
+ ii. a copyright notice;
259
+
260
+ iii. a notice that refers to this Public License;
261
+
262
+ iv. a notice that refers to the disclaimer of
263
+ warranties;
264
+
265
+ v. a URI or hyperlink to the Licensed Material to the
266
+ extent reasonably practicable;
267
+
268
+ b. indicate if You modified the Licensed Material and
269
+ retain an indication of any previous modifications; and
270
+
271
+ c. indicate the Licensed Material is licensed under this
272
+ Public License, and include the text of, or the URI or
273
+ hyperlink to, this Public License.
274
+
275
+ 2. You may satisfy the conditions in Section 3(a)(1) in any
276
+ reasonable manner based on the medium, means, and context in
277
+ which You Share the Licensed Material. For example, it may be
278
+ reasonable to satisfy the conditions by providing a URI or
279
+ hyperlink to a resource that includes the required
280
+ information.
281
+ 3. If requested by the Licensor, You must remove any of the
282
+ information required by Section 3(a)(1)(A) to the extent
283
+ reasonably practicable.
284
+
285
+ b. ShareAlike.
286
+
287
+ In addition to the conditions in Section 3(a), if You Share
288
+ Adapted Material You produce, the following conditions also apply.
289
+
290
+ 1. The Adapter's License You apply must be a Creative Commons
291
+ license with the same License Elements, this version or
292
+ later, or a BY-NC-SA Compatible License.
293
+
294
+ 2. You must include the text of, or the URI or hyperlink to, the
295
+ Adapter's License You apply. You may satisfy this condition
296
+ in any reasonable manner based on the medium, means, and
297
+ context in which You Share Adapted Material.
298
+
299
+ 3. You may not offer or impose any additional or different terms
300
+ or conditions on, or apply any Effective Technological
301
+ Measures to, Adapted Material that restrict exercise of the
302
+ rights granted under the Adapter's License You apply.
303
+
304
+
305
+ Section 4 -- Sui Generis Database Rights.
306
+
307
+ Where the Licensed Rights include Sui Generis Database Rights that
308
+ apply to Your use of the Licensed Material:
309
+
310
+ a. for the avoidance of doubt, Section 2(a)(1) grants You the right
311
+ to extract, reuse, reproduce, and Share all or a substantial
312
+ portion of the contents of the database for NonCommercial purposes
313
+ only;
314
+
315
+ b. if You include all or a substantial portion of the database
316
+ contents in a database in which You have Sui Generis Database
317
+ Rights, then the database in which You have Sui Generis Database
318
+ Rights (but not its individual contents) is Adapted Material,
319
+ including for purposes of Section 3(b); and
320
+
321
+ c. You must comply with the conditions in Section 3(a) if You Share
322
+ all or a substantial portion of the contents of the database.
323
+
324
+ For the avoidance of doubt, this Section 4 supplements and does not
325
+ replace Your obligations under this Public License where the Licensed
326
+ Rights include other Copyright and Similar Rights.
327
+
328
+
329
+ Section 5 -- Disclaimer of Warranties and Limitation of Liability.
330
+
331
+ a. UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE
332
+ EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS
333
+ AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF
334
+ ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS,
335
+ IMPLIED, STATUTORY, OR OTHER. THIS INCLUDES, WITHOUT LIMITATION,
336
+ WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR
337
+ PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS,
338
+ ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT
339
+ KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT
340
+ ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU.
341
+
342
+ b. TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE
343
+ TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION,
344
+ NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT,
345
+ INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES,
346
+ COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR
347
+ USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN
348
+ ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR
349
+ DAMAGES. WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR
350
+ IN PART, THIS LIMITATION MAY NOT APPLY TO YOU.
351
+
352
+ c. The disclaimer of warranties and limitation of liability provided
353
+ above shall be interpreted in a manner that, to the extent
354
+ possible, most closely approximates an absolute disclaimer and
355
+ waiver of all liability.
356
+
357
+
358
+ Section 6 -- Term and Termination.
359
+
360
+ a. This Public License applies for the term of the Copyright and
361
+ Similar Rights licensed here. However, if You fail to comply with
362
+ this Public License, then Your rights under this Public License
363
+ terminate automatically.
364
+
365
+ b. Where Your right to use the Licensed Material has terminated under
366
+ Section 6(a), it reinstates:
367
+
368
+ 1. automatically as of the date the violation is cured, provided
369
+ it is cured within 30 days of Your discovery of the
370
+ violation; or
371
+
372
+ 2. upon express reinstatement by the Licensor.
373
+
374
+ For the avoidance of doubt, this Section 6(b) does not affect any
375
+ right the Licensor may have to seek remedies for Your violations
376
+ of this Public License.
377
+
378
+ c. For the avoidance of doubt, the Licensor may also offer the
379
+ Licensed Material under separate terms or conditions or stop
380
+ distributing the Licensed Material at any time; however, doing so
381
+ will not terminate this Public License.
382
+
383
+ d. Sections 1, 5, 6, 7, and 8 survive termination of this Public
384
+ License.
385
+
386
+
387
+ Section 7 -- Other Terms and Conditions.
388
+
389
+ a. The Licensor shall not be bound by any additional or different
390
+ terms or conditions communicated by You unless expressly agreed.
391
+
392
+ b. Any arrangements, understandings, or agreements regarding the
393
+ Licensed Material not stated herein are separate from and
394
+ independent of the terms and conditions of this Public License.
395
+
396
+
397
+ Section 8 -- Interpretation.
398
+
399
+ a. For the avoidance of doubt, this Public License does not, and
400
+ shall not be interpreted to, reduce, limit, restrict, or impose
401
+ conditions on any use of the Licensed Material that could lawfully
402
+ be made without permission under this Public License.
403
+
404
+ b. To the extent possible, if any provision of this Public License is
405
+ deemed unenforceable, it shall be automatically reformed to the
406
+ minimum extent necessary to make it enforceable. If the provision
407
+ cannot be reformed, it shall be severed from this Public License
408
+ without affecting the enforceability of the remaining terms and
409
+ conditions.
410
+
411
+ c. No term or condition of this Public License will be waived and no
412
+ failure to comply consented to unless expressly agreed to by the
413
+ Licensor.
414
+
415
+ d. Nothing in this Public License constitutes or may be interpreted
416
+ as a limitation upon, or waiver of, any privileges and immunities
417
+ that apply to the Licensor or You, including from the legal
418
+ processes of any jurisdiction or authority.
419
+
420
+ =======================================================================
421
+
422
+ Creative Commons is not a party to its public
423
+ licenses. Notwithstanding, Creative Commons may elect to apply one of
424
+ its public licenses to material it publishes and in those instances
425
+ will be considered the "Licensor." The text of the Creative Commons
426
+ public licenses is dedicated to the public domain under the CC0 Public
427
+ Domain Dedication. Except for the limited purpose of indicating that
428
+ material is shared under a Creative Commons public license or as
429
+ otherwise permitted by the Creative Commons policies published at
430
+ creativecommons.org/policies, Creative Commons does not authorize the
431
+ use of the trademark "Creative Commons" or any other trademark or logo
432
+ of Creative Commons without its prior written consent including,
433
+ without limitation, in connection with any unauthorized modifications
434
+ to any of its public licenses or any other arrangements,
435
+ understandings, or agreements concerning use of licensed material. For
436
+ the avoidance of doubt, this paragraph does not form part of the
437
+ public licenses.
438
+
439
+ Creative Commons may be contacted at creativecommons.org.
licenses/LICENSE_UVIT.txt ADDED
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2022 Fan Bao
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
lmdb2wds.py ADDED
@@ -0,0 +1,39 @@
1
+ '''
2
+ This file implements the direction conversion from the latent ImageNet dataset to WebDataset.
3
+ '''
4
+ import os
5
+ from argparse import ArgumentParser
6
+ from tqdm import tqdm
7
+ import numpy as np
8
+ import pickle
9
+
10
+ import webdataset as wds
11
+
12
+ from train_utils.datasets import ImageNetLatentDataset
13
+
14
+
15
+ def convert2wds(args):
16
+ os.makedirs(args.outdir, exist_ok=True)
17
+ wds_path = os.path.join(args.outdir, f'latent_imagenet_512_{args.split}-%04d.tar')
18
+ dataset = ImageNetLatentDataset(args.datadir, resolution=args.resolution, num_channels=args.num_channels, split=args.split)
19
+
20
+ with wds.ShardWriter(wds_path, maxcount=args.maxcount, maxsize=args.maxsize) as sink:
21
+ for i in tqdm(range(len(dataset)), dynamic_ncols=True):
22
+ if i % args.maxcount == 0:
23
+ print(f'writing to the {i // args.maxcount}th shard')
24
+ img, label = dataset[i] # C, H, W
25
+ label = np.argmax(label) # int
26
+ sink.write({'__key__': f'{i:07d}', 'latent': pickle.dumps(img), 'cls': label})
27
+
28
+
29
+ if __name__ == "__main__":
30
+ parser = ArgumentParser('Convert the latent imagenet dataset to WebDataset')
31
+ parser.add_argument('--maxcount', type=int, default=10010, help='max number of entries per shard')
32
+ parser.add_argument('--maxsize', type=int, default=10 ** 10, help='max size per shard')
33
+ parser.add_argument('--outdir', type=str, default='latent_imagenet_wds', help='path to save the converted dataset')
34
+ parser.add_argument('--datadir', type=str, default='latent_imagenet', help='path to the latent imagenet dataset')
35
+ parser.add_argument('--resolution', type=int, default=64, help='image resolution')
36
+ parser.add_argument('--num_channels', type=int, default=8, help='number of image channels')
37
+ parser.add_argument('--split', type=str, default='train', help='split of the dataset')
38
+ args = parser.parse_args()
39
+ convert2wds(args)
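Note: a minimal standalone sketch (not part of the commit) of how the shards written by convert2wds could be read back. The shard path and brace range below are assumptions; the 'latent' and 'cls' keys match the writer above.

import pickle
import webdataset as wds

shards = 'latent_imagenet_wds/latent_imagenet_512_train-{0000..0127}.tar'  # assumed shard range
for sample in wds.WebDataset(shards):
    latent = pickle.loads(sample['latent'])  # (C, H, W) latent array, as pickled by convert2wds
    label = int(sample['cls'])               # integer class id
    break                                    # inspect only the first sample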
models/maskdit.py ADDED
@@ -0,0 +1,781 @@
1
+ # MIT License
2
+
3
+ # Copyright (c) [2023] [Anima-Lab]
4
+
5
+ # This code is adapted from https://github.com/facebookresearch/mae/blob/main/models_mae.py
6
+ # and https://github.com/facebookresearch/DiT/blob/main/models.py.
7
+ # The original code is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License,
8
+ # which can be found at licenses/LICENSE_MAE.txt and licenses/LICENSE_DIT.txt.
9
+
10
+
11
+ import torch
12
+ import torch.nn as nn
13
+ import numpy as np
14
+ import math
15
+ from functools import partial
16
+ from timm.models.vision_transformer import PatchEmbed, Attention, Mlp
17
+
18
+
19
+ def modulate(x, shift, scale):
20
+ return x * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
21
+
22
+
23
+ #################################################################################
24
+ # Embedding Layers for Timesteps and Class Labels #
25
+ #################################################################################
26
+
27
+ class TimestepEmbedder(nn.Module):
28
+ """
29
+ Embeds scalar timesteps into vector representations.
30
+ """
31
+
32
+ def __init__(self, hidden_size, frequency_embedding_size=256):
33
+ super().__init__()
34
+ self.mlp = nn.Sequential(
35
+ nn.Linear(frequency_embedding_size, hidden_size, bias=True),
36
+ nn.SiLU(),
37
+ nn.Linear(hidden_size, hidden_size, bias=True),
38
+ )
39
+ self.frequency_embedding_size = frequency_embedding_size
40
+
41
+ @staticmethod
42
+ def timestep_embedding(t, dim, max_period=10000):
43
+ """
44
+ Create sinusoidal timestep embeddings.
45
+ :param t: a 1-D Tensor of N indices, one per batch element.
46
+ These may be fractional.
47
+ :param dim: the dimension of the output.
48
+ :param max_period: controls the minimum frequency of the embeddings.
49
+ :return: an (N, D) Tensor of positional embeddings.
50
+ """
51
+ # https://github.com/openai/glide-text2im/blob/main/glide_text2im/nn.py
52
+ half = dim // 2
53
+ freqs = torch.exp(
54
+ -math.log(max_period) * torch.arange(start=0, end=half, dtype=torch.float32) / half
55
+ ).to(device=t.device)
56
+ args = t[:, None].float() * freqs[None]
57
+ embedding = torch.cat([torch.cos(args), torch.sin(args)], dim=-1)
58
+ if dim % 2:
59
+ embedding = torch.cat([embedding, torch.zeros_like(embedding[:, :1])], dim=-1)
60
+ return embedding
61
+
62
+ def forward(self, t):
63
+ t_freq = self.timestep_embedding(t, self.frequency_embedding_size)
64
+ t_emb = self.mlp(t_freq)
65
+ return t_emb
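As a quick standalone sanity check of the shapes produced by the embedder above (the batch and hidden sizes are illustrative):

import torch
from models.maskdit import TimestepEmbedder

t = torch.tensor([0.5, 1.0, 80.0])                        # arbitrary (possibly fractional) noise levels
freqs = TimestepEmbedder.timestep_embedding(t, dim=256)   # (3, 256): first half cosines, second half sines
emb = TimestepEmbedder(hidden_size=1152)(t)               # (3, 1152) after the two-layer MLP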
66
+
67
+
68
+ class LabelEmbedder(nn.Module):
69
+ """
70
+ Embeds (one-hot) class labels into vector representations via a linear layer. (Unlike the original DiT, label dropout for classifier-free guidance is not applied inside this module.)
71
+ """
72
+
73
+ def __init__(self, num_classes, hidden_size, dropout_prob):
74
+ super().__init__()
75
+ self.embedding_table = nn.Linear(num_classes, hidden_size, bias=False)
76
+ self.num_classes = num_classes
77
+ self.dropout_prob = dropout_prob
78
+
79
+ def forward(self, y):
80
+ embeddings = self.embedding_table(y)
81
+ return embeddings
82
+
83
+
84
+ #################################################################################
85
+ # Token Masking and Unmasking #
86
+ #################################################################################
87
+
88
+ def get_mask(batch, length, mask_ratio, device):
89
+ """
90
+ Get the binary mask for the input sequence.
91
+ Args:
92
+ - batch: batch size
93
+ - length: sequence length
94
+ - mask_ratio: ratio of tokens to mask
95
+ return:
96
+ mask_dict with following keys:
97
+ - mask: binary mask, 0 is keep, 1 is remove
98
+ - ids_keep: indices of tokens to keep
99
+ - ids_restore: indices to restore the original order
100
+ """
101
+ len_keep = int(length * (1 - mask_ratio))
102
+ noise = torch.rand(batch, length, device=device) # noise in [0, 1]
103
+ ids_shuffle = torch.argsort(noise, dim=1) # ascend: small is keep, large is remove
104
+ ids_restore = torch.argsort(ids_shuffle, dim=1)
105
+ # keep the first subset
106
+ ids_keep = ids_shuffle[:, :len_keep]
107
+
108
+ mask = torch.ones([batch, length], device=device)
109
+ mask[:, :len_keep] = 0
110
+ mask = torch.gather(mask, dim=1, index=ids_restore)
111
+ return {'mask': mask,
112
+ 'ids_keep': ids_keep,
113
+ 'ids_restore': ids_restore}
114
+
115
+
116
+ def mask_out_token(x, ids_keep):
117
+ """
118
+ Keep only the tokens specified by ids_keep and drop (mask out) the rest.
119
+ Args:
120
+ - x: input sequence, [N, L, D]
121
+ - ids_keep: indices of tokens to keep
122
+ return:
123
+ - x_masked: masked sequence
124
+ """
125
+ N, L, D = x.shape # batch, length, dim
126
+ x_masked = torch.gather(x, dim=1, index=ids_keep.unsqueeze(-1).repeat(1, 1, D))
127
+ return x_masked
128
+
129
+
130
+ def mask_tokens(x, mask_ratio):
131
+ """
132
+ Perform per-sample random masking by per-sample shuffling.
133
+ Per-sample shuffling is done by argsort random noise.
134
+ x: [N, L, D], sequence
135
+ """
136
+ N, L, D = x.shape # batch, length, dim
137
+ len_keep = int(L * (1 - mask_ratio))
138
+
139
+ noise = torch.rand(N, L, device=x.device) # noise in [0, 1]
140
+
141
+ # sort noise for each sample
142
+ ids_shuffle = torch.argsort(noise, dim=1) # ascend: small is keep, large is remove
143
+ ids_restore = torch.argsort(ids_shuffle, dim=1)
144
+
145
+ # keep the first subset
146
+ ids_keep = ids_shuffle[:, :len_keep]
147
+ x_masked = torch.gather(x, dim=1, index=ids_keep.unsqueeze(-1).repeat(1, 1, D))
148
+
149
+ # generate the binary mask: 0 is keep, 1 is remove
150
+ mask = torch.ones([N, L], device=x.device)
151
+ mask[:, :len_keep] = 0
152
+ mask = torch.gather(mask, dim=1, index=ids_restore)
153
+
154
+ return x_masked, mask, ids_restore
155
+
156
+
157
+ def unmask_tokens(x, ids_restore, mask_token, extras=0):
158
+ # x: [N, T, D] if extras == 0 (i.e., no cls token) else x: [N, T+1, D]
159
+ mask_tokens = mask_token.repeat(x.shape[0], ids_restore.shape[1] + extras - x.shape[1], 1)
160
+ x_ = torch.cat([x[:, extras:, :], mask_tokens], dim=1) # no cls token
161
+ x_ = torch.gather(x_, dim=1, index=ids_restore.unsqueeze(-1).repeat(1, 1, x.shape[2])) # unshuffle
162
+ x = torch.cat([x[:, :extras, :], x_], dim=1) # append cls token
163
+ return x
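A small standalone sketch of how the masking utilities above fit together (tensor sizes are illustrative):

import torch
from models.maskdit import get_mask, mask_out_token, unmask_tokens

x = torch.randn(2, 16, 8)                                       # (N, L, D) token sequence
md = get_mask(batch=2, length=16, mask_ratio=0.5, device=x.device)
x_vis = mask_out_token(x, md['ids_keep'])                       # (2, 8, 8): only the kept tokens
mask_token = torch.zeros(1, 1, 8)                               # placeholder for dropped positions
x_full = unmask_tokens(x_vis, md['ids_restore'], mask_token)    # (2, 16, 8): kept tokens restored to place
assert (md['mask'].sum(dim=1) == 8).all()                       # exactly half of the tokens are masked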
164
+
165
+
166
+ #################################################################################
167
+ # Core DiT Model #
168
+ #################################################################################
169
+
170
+ class DiTBlock(nn.Module):
171
+ """
172
+ A DiT block with adaptive layer norm zero (adaLN-Zero) conditioning.
173
+ """
174
+
175
+ def __init__(self, hidden_size, c_emb_size, num_heads, mlp_ratio=4.0, **block_kwargs):
176
+ super().__init__()
177
+ self.norm1 = nn.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6)
178
+ self.attn = Attention(hidden_size, num_heads=num_heads, qkv_bias=True, **block_kwargs)
179
+ self.norm2 = nn.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6)
180
+ mlp_hidden_dim = int(hidden_size * mlp_ratio)
181
+ approx_gelu = lambda: nn.GELU(approximate="tanh")
182
+ self.mlp = Mlp(in_features=hidden_size, hidden_features=mlp_hidden_dim, act_layer=approx_gelu, drop=0)
183
+ self.adaLN_modulation = nn.Sequential(
184
+ nn.SiLU(),
185
+ nn.Linear(c_emb_size, 6 * hidden_size, bias=True)
186
+ )
187
+
188
+ def forward(self, x, c):
189
+ shift_msa, scale_msa, gate_msa, shift_mlp, scale_mlp, gate_mlp = self.adaLN_modulation(c).chunk(6, dim=1)
190
+ x = x + gate_msa.unsqueeze(1) * self.attn(modulate(self.norm1(x), shift_msa, scale_msa))
191
+ x = x + gate_mlp.unsqueeze(1) * self.mlp(modulate(self.norm2(x), shift_mlp, scale_mlp))
192
+ return x
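A standalone shape sketch of the adaLN-Zero conditioning in DiTBlock (the sizes are illustrative):

import torch
from models.maskdit import DiTBlock

block = DiTBlock(384, 384, num_heads=6)   # hidden size, conditioning dim, heads
x = torch.randn(2, 256, 384)              # token sequence
c = torch.randn(2, 384)                   # pooled conditioning (timestep + label embedding)
y = block(x, c)                           # (2, 256, 384); shift/scale/gate come from a 6-way chunk of adaLN_modulation(c)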
193
+
194
+
195
+ class DecoderLayer(nn.Module):
196
+ """
197
+ The decoder input layer of MaskDiT: projects encoder features to the decoder hidden size with adaLN conditioning.
198
+ """
199
+
200
+ def __init__(self, hidden_size, decoder_hidden_size):
201
+ super().__init__()
202
+ self.norm_decoder = nn.LayerNorm(hidden_size, elementwise_affine=False, eps=1e-6)
203
+ self.linear = nn.Linear(hidden_size, decoder_hidden_size, bias=True)
204
+ self.adaLN_modulation = nn.Sequential(
205
+ nn.SiLU(),
206
+ nn.Linear(hidden_size, 2 * hidden_size, bias=True)
207
+ )
208
+
209
+ def forward(self, x, c):
210
+ shift, scale = self.adaLN_modulation(c).chunk(2, dim=1)
211
+ x = modulate(self.norm_decoder(x), shift, scale)
212
+ x = self.linear(x)
213
+ return x
214
+
215
+
216
+ class FinalLayer(nn.Module):
217
+ """
218
+ The final layer of DiT.
219
+ """
220
+
221
+ def __init__(self, final_hidden_size, c_emb_size, patch_size, out_channels):
222
+ super().__init__()
223
+ self.norm_final = nn.LayerNorm(final_hidden_size, elementwise_affine=False, eps=1e-6)
224
+ self.linear = nn.Linear(final_hidden_size, patch_size * patch_size * out_channels, bias=True)
225
+ self.adaLN_modulation = nn.Sequential(
226
+ nn.SiLU(),
227
+ nn.Linear(c_emb_size, 2 * final_hidden_size, bias=True)
228
+ )
229
+
230
+ def forward(self, x, c):
231
+ shift, scale = self.adaLN_modulation(c).chunk(2, dim=1)
232
+ x = modulate(self.norm_final(x), shift, scale)
233
+ x = self.linear(x)
234
+ return x
235
+
236
+
237
+ class DiT(nn.Module):
238
+ """
239
+ Diffusion model with a Transformer backbone.
240
+ """
241
+
242
+ def __init__(
243
+ self,
244
+ input_size=32,
245
+ patch_size=2,
246
+ in_channels=4,
247
+ hidden_size=1152,
248
+ depth=28,
249
+ num_heads=16,
250
+ mlp_ratio=4.0,
251
+ class_dropout_prob=0.1,
252
+ num_classes=1000, # 0 = unconditional
253
+ learn_sigma=False,
254
+ use_decoder=False, # decide whether to add a lightweight DiT decoder
255
+ mae_loss_coef=0, # 0 = no mae loss
256
+ pad_cls_token=False, # decide whether to use cls_token as the mask token for the decoder
257
+ direct_cls_token=False, # decide whether to pass cls_token directly to the decoder (False = not passed to the decoder)
258
+ ext_feature_dim=0, # decide whether to condition on external features (0 = no features)
259
+ use_encoder_feat=False, # decide whether to condition on the encoder output feature
260
+ norm_layer=partial(nn.LayerNorm, eps=1e-6), # normalize the encoder output feature
261
+ ):
262
+ super().__init__()
263
+ self.learn_sigma = learn_sigma
264
+ self.in_channels = in_channels
265
+ self.out_channels = in_channels * 2 if learn_sigma else in_channels
266
+ self.patch_size = patch_size
267
+ self.num_heads = num_heads
268
+ self.class_dropout_prob = class_dropout_prob
269
+ self.num_classes = num_classes
270
+ self.use_decoder = use_decoder
271
+ self.mae_loss_coef = mae_loss_coef
272
+ self.pad_cls_token = pad_cls_token
273
+ self.direct_cls_token = direct_cls_token
274
+ self.ext_feature_dim = ext_feature_dim
275
+ self.use_encoder_feat = use_encoder_feat
276
+ self.feat_norm = norm_layer(hidden_size, elementwise_affine=False)
277
+
278
+ self.x_embedder = PatchEmbed(input_size, patch_size, in_channels, hidden_size)
279
+ self.t_embedder = TimestepEmbedder(hidden_size)
280
+ self.y_embedder = LabelEmbedder(num_classes, hidden_size, class_dropout_prob) if num_classes else None
281
+ num_patches = self.x_embedder.num_patches
282
+
283
+ self.cls_token = None
284
+ self.extras = 0
285
+ self.decoder_extras = 0
286
+ if self.pad_cls_token:
287
+ self.cls_token = nn.Parameter(torch.zeros(1, 1, hidden_size))
288
+ self.extras = 1
289
+ self.decoder_extras = 1
290
+
291
+ self.feat_embedder = None
292
+ if self.ext_feature_dim > 0:
293
+ self.feat_embedder = nn.Linear(self.ext_feature_dim, hidden_size, bias=True)
294
+
295
+ # Will use fixed sin-cos embedding:
296
+ self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + self.extras, hidden_size), requires_grad=False)
297
+
298
+ self.blocks = nn.ModuleList([
299
+ DiTBlock(hidden_size, hidden_size, num_heads, mlp_ratio=mlp_ratio) for _ in range(depth)
300
+ ])
301
+
302
+ self.decoder_pos_embed = None
303
+ self.decoder_layer = None
304
+ self.decoder_blocks = None
305
+ self.mask_token = None
306
+ self.cls_token_embedder = None
307
+ self.enc_feat_embedder = None
308
+ final_hidden_size = hidden_size
309
+ if self.use_decoder:
310
+ decoder_hidden_size = 512
311
+ decoder_depth = 8
312
+ decoder_num_heads = 16
313
+ if not self.direct_cls_token:
314
+ self.decoder_extras = 0
315
+ self.decoder_pos_embed = nn.Parameter(
316
+ torch.zeros(1, num_patches + self.decoder_extras, decoder_hidden_size),
317
+ requires_grad=False)
318
+ self.decoder_layer = DecoderLayer(hidden_size, decoder_hidden_size)
319
+ self.decoder_blocks = nn.ModuleList([
320
+ DiTBlock(decoder_hidden_size, hidden_size, decoder_num_heads, mlp_ratio=mlp_ratio) for _ in
321
+ range(decoder_depth)
322
+ ])
323
+ if self.mae_loss_coef > 0:
324
+ self.mask_token = nn.Parameter(torch.zeros(1, 1, decoder_hidden_size)) # Similar to MAE
325
+ if self.pad_cls_token:
326
+ self.cls_token_embedder = nn.Linear(hidden_size, hidden_size, bias=True)
327
+ if self.use_encoder_feat:
328
+ self.enc_feat_embedder = nn.Linear(hidden_size, hidden_size, bias=True)
329
+ final_hidden_size = decoder_hidden_size
330
+
331
+ self.final_layer = FinalLayer(final_hidden_size, hidden_size, patch_size, self.out_channels)
332
+ self.initialize_weights()
333
+
334
+ def initialize_weights(self):
335
+ # Initialize transformer layers:
336
+ def _basic_init(module):
337
+ if isinstance(module, nn.Linear):
338
+ nn.init.xavier_uniform_(module.weight)
339
+ if module.bias is not None:
340
+ nn.init.constant_(module.bias, 0)
341
+
342
+ self.apply(_basic_init)
343
+
344
+ # Initialize (and freeze) pos_embed by sin-cos embedding:
345
+ pos_embed = get_2d_sincos_pos_embed(self.pos_embed.shape[-1], int(self.x_embedder.num_patches ** 0.5),
346
+ cls_token=self.pad_cls_token, extra_tokens=self.extras)
347
+ self.pos_embed.data.copy_(torch.from_numpy(pos_embed).float().unsqueeze(0))
348
+
349
+ # Initialize patch_embed like nn.Linear (instead of nn.Conv2d):
350
+ w = self.x_embedder.proj.weight.data
351
+ nn.init.xavier_uniform_(w.view([w.shape[0], -1]))
352
+ nn.init.constant_(self.x_embedder.proj.bias, 0)
353
+
354
+ # Initialize label embedding table:
355
+ if self.y_embedder is not None:
356
+ nn.init.normal_(self.y_embedder.embedding_table.weight, std=0.02)
357
+
358
+ # Initialize timestep embedding MLP:
359
+ nn.init.normal_(self.t_embedder.mlp[0].weight, std=0.02)
360
+ nn.init.normal_(self.t_embedder.mlp[2].weight, std=0.02)
361
+
362
+ # Initialize external feature embedding:
363
+ if self.feat_embedder is not None:
364
+ nn.init.normal_(self.feat_embedder.weight, std=0.02)
365
+
366
+ # Initialize cls token
367
+ if self.cls_token is not None:
368
+ nn.init.normal_(self.cls_token, std=.02)
369
+
370
+ # Initialize cls_token embedding:
371
+ if self.cls_token_embedder is not None:
372
+ nn.init.normal_(self.cls_token_embedder.weight, std=0.02)
373
+
374
+ # Zero-out adaLN modulation layers in DiT blocks:
375
+ for block in self.blocks:
376
+ nn.init.constant_(block.adaLN_modulation[-1].weight, 0)
377
+ nn.init.constant_(block.adaLN_modulation[-1].bias, 0)
378
+
379
+ # Zero-out output layers:
380
+ nn.init.constant_(self.final_layer.adaLN_modulation[-1].weight, 0)
381
+ nn.init.constant_(self.final_layer.adaLN_modulation[-1].bias, 0)
382
+ nn.init.constant_(self.final_layer.linear.weight, 0)
383
+ nn.init.constant_(self.final_layer.linear.bias, 0)
384
+
385
+ # --------------------------- decoder initialization ---------------------------
386
+ # Initialize (and freeze) decoder_pos_embed by sin-cos embedding:
387
+ if self.decoder_pos_embed is not None:
388
+ pos_embed = get_2d_sincos_pos_embed(self.decoder_pos_embed.shape[-1],
389
+ int(self.x_embedder.num_patches ** 0.5),
390
+ cls_token=self.pad_cls_token, extra_tokens=self.decoder_extras)
391
+ self.decoder_pos_embed.data.copy_(torch.from_numpy(pos_embed).float().unsqueeze(0))
392
+
393
+ # Initialize mask token
394
+ if self.mae_loss_coef > 0 and self.mask_token is not None:
395
+ nn.init.normal_(self.mask_token, std=.02)
396
+
397
+ # Zero-out adaLN modulation layers in DiT decoder blocks:
398
+ if self.decoder_blocks is not None:
399
+ for block in self.decoder_blocks:
400
+ nn.init.constant_(block.adaLN_modulation[-1].weight, 0)
401
+ nn.init.constant_(block.adaLN_modulation[-1].bias, 0)
402
+
403
+ # Zero-out decoder layers (TODO: initialized the same way as the final layer; unclear whether this is ideal)
404
+ if self.decoder_layer is not None:
405
+ nn.init.constant_(self.decoder_layer.adaLN_modulation[-1].weight, 0)
406
+ nn.init.constant_(self.decoder_layer.adaLN_modulation[-1].bias, 0)
407
+ nn.init.constant_(self.decoder_layer.linear.weight, 0)
408
+ nn.init.constant_(self.decoder_layer.linear.bias, 0)
409
+ # ------------------------------------------------------------------------------
410
+
411
+ def unpatchify(self, x):
412
+ """
413
+ x: (N, L, patch_size**2 * C)
414
+ imgs: (N, C, H, W)
415
+ """
416
+ c = self.out_channels
417
+ p = self.x_embedder.patch_size[0]
418
+ h = w = int(x.shape[1] ** 0.5)
419
+ assert h * w == x.shape[1]
420
+
421
+ x = x.reshape(shape=(x.shape[0], h, w, p, p, c))
422
+ x = torch.einsum('nhwpqc->nchpwq', x)
423
+ imgs = x.reshape(shape=(x.shape[0], c, h * p, w * p))
424
+ return imgs
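For reference, a standalone shape check of unpatchify (the small config and sizes below are illustrative):

import torch
from models.maskdit import DiT_S_2

net = DiT_S_2(input_size=32, in_channels=4, num_classes=0)
tokens = torch.randn(1, 16 * 16, 2 * 2 * 4)   # (N, L, patch_size**2 * C) with L = (32 / 2) ** 2
imgs = net.unpatchify(tokens)                 # (1, 4, 32, 32), inverting the PatchEmbed tokenization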
425
+
426
+ def encode(self, x, t, y, mask_ratio=0, mask_dict=None, feat=None):
427
+ '''
428
+ Encode x and (t, y, feat) into a latent representation.
429
+ Return:
430
+ x_feat: feature
431
+ mask_dict with keys: 'mask', 'ids_keep', 'ids_restore'
432
+ '''
433
+ x = self.x_embedder(x) + self.pos_embed[:, self.extras:, :] # (N, T, D), where T = H * W / patch_size ** 2
434
+ if mask_ratio > 0 and mask_dict is None:
435
+ mask_dict = get_mask(x.shape[0], x.shape[1], mask_ratio, device=x.device)
436
+ if mask_ratio > 0:
437
+ ids_keep = mask_dict['ids_keep']
438
+ x = mask_out_token(x, ids_keep)
439
+ # append cls token
440
+ if self.cls_token is not None:
441
+ cls_token = self.cls_token + self.pos_embed[:, :self.extras, :]
442
+ cls_tokens = cls_token.expand(x.shape[0], -1, -1)
443
+ x = torch.cat((cls_tokens, x), dim=1)
444
+ t = self.t_embedder(t) # (N, D)
445
+ c = t
446
+ if self.y_embedder is not None:
447
+ y = self.y_embedder(y) # (N, D)
448
+ c = c + y # (N, D)
449
+ assert (self.feat_embedder is None) or (self.enc_feat_embedder is None)
450
+ if self.feat_embedder is not None:
451
+ assert feat.shape[-1] == self.ext_feature_dim
452
+ feat_embed = self.feat_embedder(feat) # (N, D)
453
+ c = c + feat_embed # (N, D)
454
+ if self.enc_feat_embedder is not None and feat is not None:
455
+ assert feat.shape[-1] == c.shape[-1]
456
+ feat_embed = self.enc_feat_embedder(feat) # (N, D)
457
+ c = c + feat_embed # (N, D)
458
+
459
+ for block in self.blocks:
460
+ x = block(x, c) # (N, T, D)
461
+
462
+ x_feat = x[:, self.extras:, :].mean(dim=1) # global pool without cls token
463
+ x_feat = self.feat_norm(x_feat)
464
+ return x_feat, mask_dict
465
+
466
+
467
+ def forward_encoder(self, x, t, y, mask_ratio=0, mask_dict=None, feat=None, train=True):
468
+ '''
469
+ Encode x and (t, y, feat) into a latent representation.
470
+ Return:
471
+ - out_enc: dict, containing the following keys: x, x_feat
472
+ - c: the conditional embedding
473
+ '''
474
+ out_enc = dict()
475
+ x = self.x_embedder(x) + self.pos_embed[:, self.extras:, :] # (N, T, D), where T = H * W / patch_size ** 2
476
+ if mask_ratio > 0 and mask_dict is None:
477
+ mask_dict = get_mask(x.shape[0], x.shape[1], mask_ratio=mask_ratio, device=x.device)
478
+
479
+ if mask_ratio > 0:
480
+ ids_keep = mask_dict['ids_keep']
481
+ ids_restore = mask_dict['ids_restore']
482
+ if train:
483
+ x = mask_out_token(x, ids_keep)
484
+
485
+ # append cls token
486
+ if self.cls_token is not None:
487
+ cls_token = self.cls_token + self.pos_embed[:, :self.extras, :]
488
+ cls_tokens = cls_token.expand(x.shape[0], -1, -1)
489
+ x = torch.cat((cls_tokens, x), dim=1)
490
+
491
+ t = self.t_embedder(t) # (N, D)
492
+ c = t
493
+ if self.y_embedder is not None:
494
+ y = self.y_embedder(y) # (N, D)
495
+ c = c + y # (N, D)
496
+ assert (self.feat_embedder is None) or (self.enc_feat_embedder is None)
497
+ if self.feat_embedder is not None:
498
+ assert feat.shape[-1] == self.ext_feature_dim
499
+ feat_embed = self.feat_embedder(feat) # (N, D)
500
+ c = c + feat_embed # (N, D)
501
+ if self.enc_feat_embedder is not None and feat is not None:
502
+ assert feat.shape[-1] == c.shape[-1]
503
+ feat_embed = self.enc_feat_embedder(feat) # (N, D)
504
+ c = c + feat_embed # (N, D)
505
+ for block in self.blocks:
506
+ x = block(x, c) # (N, T, D)
507
+ out_enc['x'] = x
508
+
509
+ return out_enc, c, mask_dict
510
+
511
+ def forward(self, x, t, y, mask_ratio=0, mask_dict=None, feat=None):
512
+ """
513
+ Forward pass of DiT.
514
+ x: (N, C, H, W) tensor of spatial inputs (images or latent representations of images)
515
+ t: (N,) tensor of diffusion timesteps
516
+ y: (N,) tensor of class labels
517
+ """
518
+ if not self.training and self.use_encoder_feat:
519
+ feat, _ = self.encode(x, t, y, feat=feat)
520
+ out, c, mask_dict = self.forward_encoder(x, t, y, mask_ratio=mask_ratio, mask_dict=mask_dict, feat=feat, train=self.training)
521
+ if mask_ratio > 0:
522
+ ids_keep = mask_dict['ids_keep']
523
+ ids_restore = mask_dict['ids_restore']
524
+ out['mask'] = mask_dict['mask']
525
+ else:
526
+ ids_keep = ids_restore = None
527
+ x = out['x']
528
+ # Pass to a DiT decoder (if available)
529
+ if self.use_decoder:
530
+ if self.cls_token_embedder is not None:
531
+ # cls_token_output = x[:, :self.extras, :].squeeze(dim=1).detach().clone() # stop gradient
532
+ cls_token_output = x[:, :self.extras, :].squeeze(dim=1)
533
+ cls_token_embed = self.cls_token_embedder(self.feat_norm(cls_token_output)) # normalize cls token
534
+ c = c + cls_token_embed # pad cls_token output's embedding as feature conditioning
535
+
536
+ assert self.decoder_layer is not None
537
+ diff_extras = self.extras - self.decoder_extras
538
+ x = self.decoder_layer(x[:, diff_extras:, :], c) # remove cls token (if necessary)
539
+ if self.training and mask_ratio > 0:
540
+ mask_token = self.mask_token
541
+ if mask_token is None:
542
+ mask_token = torch.zeros(1, 1, x.shape[2]).to(x) # concat zeros to match shape
543
+ x = unmask_tokens(x, ids_restore, mask_token, extras=self.decoder_extras)
544
+ assert self.decoder_pos_embed is not None
545
+ x = x + self.decoder_pos_embed
546
+ assert self.decoder_blocks is not None
547
+ for block in self.decoder_blocks:
548
+ x = block(x, c) # (N, T, D)
549
+
550
+ x = self.final_layer(x, c) # (N, T or T+1, patch_size ** 2 * out_channels)
551
+ if not self.use_decoder and (self.training and mask_ratio > 0):
552
+ mask_token = torch.zeros(1, 1, x.shape[2]).to(x) # concat zeros to match shape
553
+ x = unmask_tokens(x, ids_restore, mask_token, extras=self.extras)
554
+ x = x[:, self.decoder_extras:, :] # remove cls token (if necessary)
555
+ x = self.unpatchify(x) # (N, out_channels, H, W)
556
+ out['x'] = x
557
+ return out
558
+
559
+ def forward_with_cfg(self, x, t, y, cfg_scale, feat=None, **model_kwargs):
560
+ """
561
+ Forward pass of DiT, but also batches the unconditional forward pass for classifier-free guidance.
562
+ """
563
+ # https://github.com/openai/glide-text2im/blob/main/notebooks/text2im.ipynb
564
+ out = dict()
565
+
566
+ # Setup classifier-free guidance
567
+ x = torch.cat([x, x], 0)
568
+ y_null = torch.zeros_like(y)
569
+ y = torch.cat([y, y_null], 0)
570
+ if feat is not None:
571
+ feat = torch.cat([feat, feat], 0)
572
+
573
+ half = x[: len(x) // 2]
574
+ combined = torch.cat([half, half], dim=0)
575
+ assert self.num_classes and y is not None
576
+ model_out = self.forward(combined, t, y, feat=feat)['x']
577
+ # By default, classifier-free guidance is applied to all channels here. The original DiT applies it to only
578
+ # the first three channels for exact reproducibility with its released samples.
579
+ # That behavior can be restored by commenting out the following line and uncommenting the line after it.
580
+ eps, rest = model_out[:, :self.in_channels], model_out[:, self.in_channels:]
581
+ # eps, rest = model_out[:, :3], model_out[:, 3:]
582
+ cond_eps, uncond_eps = torch.split(eps, len(eps) // 2, dim=0)
583
+ half_eps = uncond_eps + cfg_scale * (cond_eps - uncond_eps)
584
+ half_rest = rest[: len(rest) // 2]
585
+ x = torch.cat([half_eps, half_rest], dim=1)
586
+ out['x'] = x
587
+ return out
588
+
589
+
590
+ #################################################################################
591
+ # Sine/Cosine Positional Embedding Functions #
592
+ #################################################################################
593
+ # https://github.com/facebookresearch/mae/blob/main/util/pos_embed.py
594
+
595
+ def get_2d_sincos_pos_embed(embed_dim, grid_size, cls_token=False, extra_tokens=1):
596
+ """
597
+ grid_size: int of the grid height and width
598
+ return:
599
+ pos_embed: [grid_size*grid_size, embed_dim] or [1+grid_size*grid_size, embed_dim] (w/ or w/o cls_token)
600
+ """
601
+ grid_h = np.arange(grid_size, dtype=np.float32)
602
+ grid_w = np.arange(grid_size, dtype=np.float32)
603
+ grid = np.meshgrid(grid_w, grid_h) # here w goes first
604
+ grid = np.stack(grid, axis=0)
605
+
606
+ grid = grid.reshape([2, 1, grid_size, grid_size])
607
+ pos_embed = get_2d_sincos_pos_embed_from_grid(embed_dim, grid)
608
+ if cls_token and extra_tokens > 0:
609
+ pos_embed = np.concatenate([np.zeros([extra_tokens, embed_dim]), pos_embed], axis=0)
610
+ return pos_embed
611
+
612
+
613
+ def get_2d_sincos_pos_embed_from_grid(embed_dim, grid):
614
+ assert embed_dim % 2 == 0
615
+
616
+ # use half of dimensions to encode grid_h
617
+ emb_h = get_1d_sincos_pos_embed_from_grid(embed_dim // 2, grid[0]) # (H*W, D/2)
618
+ emb_w = get_1d_sincos_pos_embed_from_grid(embed_dim // 2, grid[1]) # (H*W, D/2)
619
+
620
+ emb = np.concatenate([emb_h, emb_w], axis=1) # (H*W, D)
621
+ return emb
622
+
623
+
624
+ def get_1d_sincos_pos_embed_from_grid(embed_dim, pos):
625
+ """
626
+ embed_dim: output dimension for each position
627
+ pos: a list of positions to be encoded: size (M,)
628
+ out: (M, D)
629
+ """
630
+ assert embed_dim % 2 == 0
631
+ omega = np.arange(embed_dim // 2, dtype=np.float64)
632
+ omega /= embed_dim / 2.
633
+ omega = 1. / 10000 ** omega # (D/2,)
634
+
635
+ pos = pos.reshape(-1) # (M,)
636
+ out = np.einsum('m,d->md', pos, omega) # (M, D/2), outer product
637
+
638
+ emb_sin = np.sin(out) # (M, D/2)
639
+ emb_cos = np.cos(out) # (M, D/2)
640
+
641
+ emb = np.concatenate([emb_sin, emb_cos], axis=1) # (M, D)
642
+ return emb
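A standalone sketch of how these helpers build the frozen positional table (the grid size is an assumed example):

from models.maskdit import get_2d_sincos_pos_embed

pos = get_2d_sincos_pos_embed(embed_dim=384, grid_size=16)         # (256, 384), no extra tokens
pos_cls = get_2d_sincos_pos_embed(embed_dim=384, grid_size=16,
                                  cls_token=True, extra_tokens=1)  # (257, 384), first row is zeros for the cls slot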
643
+
644
+
645
+ #################################################################################
646
+ # DiT Configs #
647
+ #################################################################################
648
+
649
+ def DiT_H_2(**kwargs):
650
+ return DiT(depth=32, hidden_size=1280, patch_size=2, num_heads=16, **kwargs)
651
+
652
+
653
+ def DiT_H_4(**kwargs):
654
+ return DiT(depth=32, hidden_size=1280, patch_size=4, num_heads=16, **kwargs)
655
+
656
+
657
+ def DiT_H_8(**kwargs):
658
+ return DiT(depth=32, hidden_size=1280, patch_size=8, num_heads=16, **kwargs)
659
+
660
+
661
+ def DiT_XL_2(**kwargs):
662
+ return DiT(depth=28, hidden_size=1152, patch_size=2, num_heads=16, **kwargs)
663
+
664
+
665
+ def DiT_XL_4(**kwargs):
666
+ return DiT(depth=28, hidden_size=1152, patch_size=4, num_heads=16, **kwargs)
667
+
668
+
669
+ def DiT_XL_8(**kwargs):
670
+ return DiT(depth=28, hidden_size=1152, patch_size=8, num_heads=16, **kwargs)
671
+
672
+
673
+ def DiT_L_2(**kwargs):
674
+ return DiT(depth=24, hidden_size=1024, patch_size=2, num_heads=16, **kwargs)
675
+
676
+
677
+ def DiT_L_4(**kwargs):
678
+ return DiT(depth=24, hidden_size=1024, patch_size=4, num_heads=16, **kwargs)
679
+
680
+
681
+ def DiT_L_8(**kwargs):
682
+ return DiT(depth=24, hidden_size=1024, patch_size=8, num_heads=16, **kwargs)
683
+
684
+
685
+ def DiT_B_2(**kwargs):
686
+ return DiT(depth=12, hidden_size=768, patch_size=2, num_heads=12, **kwargs)
687
+
688
+
689
+ def DiT_B_4(**kwargs):
690
+ return DiT(depth=12, hidden_size=768, patch_size=4, num_heads=12, **kwargs)
691
+
692
+
693
+ def DiT_B_8(**kwargs):
694
+ return DiT(depth=12, hidden_size=768, patch_size=8, num_heads=12, **kwargs)
695
+
696
+
697
+ def DiT_S_2(**kwargs):
698
+ return DiT(depth=12, hidden_size=384, patch_size=2, num_heads=6, **kwargs)
699
+
700
+
701
+ def DiT_S_4(**kwargs):
702
+ return DiT(depth=12, hidden_size=384, patch_size=4, num_heads=6, **kwargs)
703
+
704
+
705
+ def DiT_S_8(**kwargs):
706
+ return DiT(depth=12, hidden_size=384, patch_size=8, num_heads=6, **kwargs)
707
+
708
+
709
+ DiT_models = {
710
+ 'DiT-H/2': DiT_H_2, 'DiT-H/4': DiT_H_4, 'DiT-H/8': DiT_H_8,
711
+ 'DiT-XL/2': DiT_XL_2, 'DiT-XL/4': DiT_XL_4, 'DiT-XL/8': DiT_XL_8,
712
+ 'DiT-L/2': DiT_L_2, 'DiT-L/4': DiT_L_4, 'DiT-L/8': DiT_L_8,
713
+ 'DiT-B/2': DiT_B_2, 'DiT-B/4': DiT_B_4, 'DiT-B/8': DiT_B_8,
714
+ 'DiT-S/2': DiT_S_2, 'DiT-S/4': DiT_S_4, 'DiT-S/8': DiT_S_8,
715
+ }
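A hedged usage sketch of the model registry above (the chosen config, batch, and labels are illustrative only):

import torch
from models.maskdit import DiT_models

model = DiT_models['DiT-S/2'](input_size=32, in_channels=4, num_classes=1000)
x = torch.randn(2, 4, 32, 32)                                                     # latent batch
t = torch.rand(2)                                                                 # noise levels
y = torch.nn.functional.one_hot(torch.tensor([3, 7]), num_classes=1000).float()   # one-hot labels
out = model(x, t, y, mask_ratio=0.5)                                              # training-style call with 50% masking
print(out['x'].shape, out['mask'].shape)  # torch.Size([2, 4, 32, 32]) torch.Size([2, 256])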
716
+
717
+
718
+ # ----------------------------------------------------------------------------
719
+ # Improved preconditioning proposed in the paper "Elucidating the Design
720
+ # Space of Diffusion-Based Generative Models" (EDM).
721
+
722
+ class EDMPrecond(nn.Module):
723
+ def __init__(self,
724
+ img_resolution, # Image resolution.
725
+ img_channels, # Number of color channels.
726
+ num_classes=0, # Number of class labels, 0 = unconditional.
727
+ sigma_min=0, # Minimum supported noise level.
728
+ sigma_max=float('inf'), # Maximum supported noise level.
729
+ sigma_data=0.5, # Expected standard deviation of the training data.
730
+ model_type='DiT-B/2', # Class name of the underlying model.
731
+ **model_kwargs, # Keyword arguments for the underlying model.
732
+ ):
733
+ super().__init__()
734
+ self.img_resolution = img_resolution
735
+ self.img_channels = img_channels
736
+ self.num_classes = num_classes
737
+ self.sigma_min = sigma_min
738
+ self.sigma_max = sigma_max
739
+ self.sigma_data = sigma_data
740
+ self.model = DiT_models[model_type](input_size=img_resolution, in_channels=img_channels,
741
+ num_classes=num_classes, **model_kwargs)
742
+
743
+ def encode(self, x, sigma, class_labels=None, **model_kwargs):
744
+
745
+ sigma = sigma.to(x.dtype).reshape(-1, 1, 1, 1)
746
+ class_labels = None if self.num_classes == 0 else \
747
+ torch.zeros([x.shape[0], self.num_classes], device=x.device) if class_labels is None else \
748
+ class_labels.to(x.dtype).reshape(-1, self.num_classes)
749
+
750
+ c_in = 1 / (self.sigma_data ** 2 + sigma ** 2).sqrt()
751
+ c_noise = sigma.log() / 4
752
+
753
+ feat, mask_dict = self.model.encode((c_in * x).to(x.dtype), c_noise.flatten(), y=class_labels, **model_kwargs)
754
+ return feat
755
+
756
+ def forward(self, x, sigma, class_labels=None, cfg_scale=None, **model_kwargs):
757
+ model_fn = self.model if cfg_scale is None else partial(self.model.forward_with_cfg, cfg_scale=cfg_scale)
758
+
759
+ sigma = sigma.to(x.dtype).reshape(-1, 1, 1, 1)
760
+ class_labels = None if self.num_classes == 0 else \
761
+ torch.zeros([x.shape[0], self.num_classes], device=x.device) if class_labels is None else \
762
+ class_labels.to(x.dtype).reshape(-1, self.num_classes)
763
+
764
+ c_skip = self.sigma_data ** 2 / (sigma ** 2 + self.sigma_data ** 2)
765
+ c_out = sigma * self.sigma_data / (sigma ** 2 + self.sigma_data ** 2).sqrt()
766
+ c_in = 1 / (self.sigma_data ** 2 + sigma ** 2).sqrt()
767
+ c_noise = sigma.log() / 4
768
+
769
+ model_out = model_fn((c_in * x).to(x.dtype), c_noise.flatten(), y=class_labels, **model_kwargs)
770
+ F_x = model_out['x']
771
+ D_x = c_skip * x + c_out * F_x
772
+ model_out['x'] = D_x
773
+ return model_out
774
+
775
+ def round_sigma(self, sigma):
776
+ return torch.as_tensor(sigma)
777
+
778
+
779
+ Precond_models = {
780
+ 'edm': EDMPrecond
781
+ }
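A standalone sketch of the EDM preconditioning wrapper in use (model size and inputs are illustrative):

import torch
from models.maskdit import Precond_models

net = Precond_models['edm'](img_resolution=32, img_channels=4, num_classes=1000, model_type='DiT-S/2')
net.eval()
x = torch.randn(2, 4, 32, 32)
sigma = torch.full((2,), 1.5)
labels = torch.nn.functional.one_hot(torch.tensor([1, 2]), num_classes=1000).float()
with torch.no_grad():
    D_x = net(x, sigma, class_labels=labels)['x']  # denoised estimate c_skip * x + c_out * F_x, same shape as x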
sample.py ADDED
@@ -0,0 +1,397 @@
1
+ # MIT License
2
+
3
+ # Copyright (c) [2023] [Anima-Lab]
4
+
5
+
6
+ # This code is adapted from https://github.com/NVlabs/edm/blob/main/generate.py.
7
+ # The original code is licensed under a Creative Commons
8
+ # Attribution-NonCommercial-ShareAlike 4.0 International License, which can be found at licenses/LICENSE_EDM.txt.
9
+
10
+ import argparse
11
+ import random
12
+
13
+ import PIL.Image
14
+ import lmdb
15
+ import numpy as np
16
+
17
+ import torch
18
+ import torch.distributed as dist
19
+ from torch.multiprocessing import Process
20
+ from tqdm import tqdm
21
+
22
+ from models.maskdit import Precond_models, DiT_models
23
+ from utils import *
24
+ import autoencoder
25
+
26
+
27
+ # ----------------------------------------------------------------------------
28
+ # Proposed EDM sampler (Algorithm 2).
29
+
30
+ def edm_sampler(
31
+ net, latents, class_labels=None, cfg_scale=None, feat=None, randn_like=torch.randn_like,
32
+ num_steps=18, sigma_min=0.002, sigma_max=80, rho=7,
33
+ S_churn=0, S_min=0, S_max=float('inf'), S_noise=1,
34
+ ):
35
+ # Adjust noise levels based on what's supported by the network.
36
+ sigma_min = max(sigma_min, net.sigma_min)
37
+ sigma_max = min(sigma_max, net.sigma_max)
38
+
39
+ # Time step discretization.
40
+ step_indices = torch.arange(num_steps, dtype=torch.float64, device=latents.device)
41
+ t_steps = (sigma_max ** (1 / rho) + step_indices / (num_steps - 1) * (
42
+ sigma_min ** (1 / rho) - sigma_max ** (1 / rho))) ** rho
43
+ t_steps = torch.cat([net.round_sigma(t_steps), torch.zeros_like(t_steps[:1])]) # t_N = 0
44
+
45
+ # Main sampling loop.
46
+ x_next = latents.to(torch.float64) * t_steps[0]
47
+ for i, (t_cur, t_next) in enumerate(zip(t_steps[:-1], t_steps[1:])): # 0, ..., N-1
48
+ x_cur = x_next
49
+
50
+ # Increase noise temporarily.
51
+ gamma = min(S_churn / num_steps, np.sqrt(2) - 1) if S_min <= t_cur <= S_max else 0
52
+ t_hat = net.round_sigma(t_cur + gamma * t_cur)
53
+ x_hat = x_cur + (t_hat ** 2 - t_cur ** 2).sqrt() * S_noise * randn_like(x_cur)
54
+
55
+ # Euler step.
56
+ denoised = net(x_hat.float(), t_hat, class_labels, cfg_scale, feat=feat)['x'].to(torch.float64)
57
+ d_cur = (x_hat - denoised) / t_hat
58
+ x_next = x_hat + (t_next - t_hat) * d_cur
59
+
60
+ # Apply 2nd order correction.
61
+ if i < num_steps - 1:
62
+ denoised = net(x_next.float(), t_next, class_labels, cfg_scale, feat=feat)['x'].to(torch.float64)
63
+ d_prime = (x_next - denoised) / t_next
64
+ x_next = x_hat + (t_next - t_hat) * (0.5 * d_cur + 0.5 * d_prime)
65
+
66
+ return x_next
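The time steps above follow the EDM (Karras et al.) rho-schedule; as a standalone illustration of the discretization:

import torch
num_steps, sigma_min, sigma_max, rho = 18, 0.002, 80.0, 7
i = torch.arange(num_steps, dtype=torch.float64)
t_steps = (sigma_max ** (1 / rho) + i / (num_steps - 1) * (sigma_min ** (1 / rho) - sigma_max ** (1 / rho))) ** rho
# t_steps decreases monotonically from sigma_max (80.0) to sigma_min (0.002); the sampler appends a final t_N = 0.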
67
+
68
+
69
+ # ----------------------------------------------------------------------------
70
+ # Generalized ablation sampler, representing the superset of all sampling
71
+ # methods discussed in the paper.
72
+
73
+ def ablation_sampler(
74
+ net, latents, class_labels=None, cfg_scale=None, feat=None, randn_like=torch.randn_like,
75
+ num_steps=18, sigma_min=None, sigma_max=None, rho=7,
76
+ solver='heun', discretization='edm', schedule='linear', scaling='none',
77
+ epsilon_s=1e-3, C_1=0.001, C_2=0.008, M=1000, alpha=1,
78
+ S_churn=0, S_min=0, S_max=float('inf'), S_noise=1,
79
+ ):
80
+ assert solver in ['euler', 'heun']
81
+ assert discretization in ['vp', 've', 'iddpm', 'edm']
82
+ assert schedule in ['vp', 've', 'linear']
83
+ assert scaling in ['vp', 'none']
84
+
85
+ # Helper functions for VP & VE noise level schedules.
86
+ vp_sigma = lambda beta_d, beta_min: lambda t: (np.e ** (0.5 * beta_d * (t ** 2) + beta_min * t) - 1) ** 0.5
87
+ vp_sigma_deriv = lambda beta_d, beta_min: lambda t: 0.5 * (beta_min + beta_d * t) * (sigma(t) + 1 / sigma(t))
88
+ vp_sigma_inv = lambda beta_d, beta_min: lambda sigma: ((beta_min ** 2 + 2 * beta_d * (
89
+ sigma ** 2 + 1).log()).sqrt() - beta_min) / beta_d
90
+ ve_sigma = lambda t: t.sqrt()
91
+ ve_sigma_deriv = lambda t: 0.5 / t.sqrt()
92
+ ve_sigma_inv = lambda sigma: sigma ** 2
93
+
94
+ # Select default noise level range based on the specified time step discretization.
95
+ if sigma_min is None:
96
+ vp_def = vp_sigma(beta_d=19.1, beta_min=0.1)(t=epsilon_s)
97
+ sigma_min = {'vp': vp_def, 've': 0.02, 'iddpm': 0.002, 'edm': 0.002}[discretization]
98
+ if sigma_max is None:
99
+ vp_def = vp_sigma(beta_d=19.1, beta_min=0.1)(t=1)
100
+ sigma_max = {'vp': vp_def, 've': 100, 'iddpm': 81, 'edm': 80}[discretization]
101
+
102
+ # Adjust noise levels based on what's supported by the network.
103
+ sigma_min = max(sigma_min, net.sigma_min)
104
+ sigma_max = min(sigma_max, net.sigma_max)
105
+
106
+ # Compute corresponding betas for VP.
107
+ vp_beta_d = 2 * (np.log(sigma_min ** 2 + 1) / epsilon_s - np.log(sigma_max ** 2 + 1)) / (epsilon_s - 1)
108
+ vp_beta_min = np.log(sigma_max ** 2 + 1) - 0.5 * vp_beta_d
109
+
110
+ # Define time steps in terms of noise level.
111
+ step_indices = torch.arange(num_steps, dtype=torch.float64, device=latents.device)
112
+ if discretization == 'vp':
113
+ orig_t_steps = 1 + step_indices / (num_steps - 1) * (epsilon_s - 1)
114
+ sigma_steps = vp_sigma(vp_beta_d, vp_beta_min)(orig_t_steps)
115
+ elif discretization == 've':
116
+ orig_t_steps = (sigma_max ** 2) * ((sigma_min ** 2 / sigma_max ** 2) ** (step_indices / (num_steps - 1)))
117
+ sigma_steps = ve_sigma(orig_t_steps)
118
+ elif discretization == 'iddpm':
119
+ u = torch.zeros(M + 1, dtype=torch.float64, device=latents.device)
120
+ alpha_bar = lambda j: (0.5 * np.pi * j / M / (C_2 + 1)).sin() ** 2
121
+ for j in torch.arange(M, 0, -1, device=latents.device): # M, ..., 1
122
+ u[j - 1] = ((u[j] ** 2 + 1) / (alpha_bar(j - 1) / alpha_bar(j)).clip(min=C_1) - 1).sqrt()
123
+ u_filtered = u[torch.logical_and(u >= sigma_min, u <= sigma_max)]
124
+ sigma_steps = u_filtered[((len(u_filtered) - 1) / (num_steps - 1) * step_indices).round().to(torch.int64)]
125
+ else:
126
+ assert discretization == 'edm'
127
+ sigma_steps = (sigma_max ** (1 / rho) + step_indices / (num_steps - 1) * (
128
+ sigma_min ** (1 / rho) - sigma_max ** (1 / rho))) ** rho
129
+
130
+ # Define noise level schedule.
131
+ if schedule == 'vp':
132
+ sigma = vp_sigma(vp_beta_d, vp_beta_min)
133
+ sigma_deriv = vp_sigma_deriv(vp_beta_d, vp_beta_min)
134
+ sigma_inv = vp_sigma_inv(vp_beta_d, vp_beta_min)
135
+ elif schedule == 've':
136
+ sigma = ve_sigma
137
+ sigma_deriv = ve_sigma_deriv
138
+ sigma_inv = ve_sigma_inv
139
+ else:
140
+ assert schedule == 'linear'
141
+ sigma = lambda t: t
142
+ sigma_deriv = lambda t: 1
143
+ sigma_inv = lambda sigma: sigma
144
+
145
+ # Define scaling schedule.
146
+ if scaling == 'vp':
147
+ s = lambda t: 1 / (1 + sigma(t) ** 2).sqrt()
148
+ s_deriv = lambda t: -sigma(t) * sigma_deriv(t) * (s(t) ** 3)
149
+ else:
150
+ assert scaling == 'none'
151
+ s = lambda t: 1
152
+ s_deriv = lambda t: 0
153
+
154
+ # Compute final time steps based on the corresponding noise levels.
155
+ t_steps = sigma_inv(net.round_sigma(sigma_steps))
156
+ t_steps = torch.cat([t_steps, torch.zeros_like(t_steps[:1])]) # t_N = 0
157
+
158
+ # Main sampling loop.
159
+ t_next = t_steps[0]
160
+ x_next = latents.to(torch.float64) * (sigma(t_next) * s(t_next))
161
+ for i, (t_cur, t_next) in enumerate(zip(t_steps[:-1], t_steps[1:])): # 0, ..., N-1
162
+ x_cur = x_next
163
+
164
+ # Increase noise temporarily.
165
+ gamma = min(S_churn / num_steps, np.sqrt(2) - 1) if S_min <= sigma(t_cur) <= S_max else 0
166
+ t_hat = sigma_inv(net.round_sigma(sigma(t_cur) + gamma * sigma(t_cur)))
167
+ x_hat = s(t_hat) / s(t_cur) * x_cur + (sigma(t_hat) ** 2 - sigma(t_cur) ** 2).clip(min=0).sqrt() * s(
168
+ t_hat) * S_noise * randn_like(x_cur)
169
+
170
+ # Euler step.
171
+ h = t_next - t_hat
172
+ denoised = net(x_hat.float() / s(t_hat), sigma(t_hat), class_labels, cfg_scale, feat=feat)['x'].to(torch.float64)
173
+ d_cur = (sigma_deriv(t_hat) / sigma(t_hat) + s_deriv(t_hat) / s(t_hat)) * x_hat - sigma_deriv(t_hat) * s(
174
+ t_hat) / sigma(t_hat) * denoised
175
+ x_prime = x_hat + alpha * h * d_cur
176
+ t_prime = t_hat + alpha * h
177
+
178
+ # Apply 2nd order correction.
179
+ if solver == 'euler' or i == num_steps - 1:
180
+ x_next = x_hat + h * d_cur
181
+ else:
182
+ assert solver == 'heun'
183
+ denoised = net(x_prime.float() / s(t_prime), sigma(t_prime), class_labels, cfg_scale, feat=feat)['x'].to(torch.float64)
184
+ d_prime = (sigma_deriv(t_prime) / sigma(t_prime) + s_deriv(t_prime) / s(t_prime)) * x_prime - sigma_deriv(
185
+ t_prime) * s(t_prime) / sigma(t_prime) * denoised
186
+ x_next = x_hat + h * ((1 - 1 / (2 * alpha)) * d_cur + 1 / (2 * alpha) * d_prime)
187
+
188
+ return x_next
189
+
190
+
191
+ # ----------------------------------------------------------------------------
192
+
193
+ def retrieve_n_features(batch_size, feat_path, feat_dim, num_classes, device, split='train', sample_mode='rand_full'):
194
+ env = lmdb.open(os.path.join(feat_path, split), readonly=True, lock=False, create=False)
195
+
196
+ # Start a new read transaction
197
+ with env.begin() as txn:
198
+ # Read all images in one single transaction, with one lock
199
+ # We could split this up into multiple transactions if needed
200
+ length = int(txn.get('length'.encode('utf-8')).decode('utf-8'))
201
+ if sample_mode == 'rand_full':
202
+ image_ids = random.sample(range(length // 2), batch_size)
203
+ image_ids_y = image_ids
204
+ elif sample_mode == 'rand_repeat':
205
+ image_ids = random.sample(range(length // 2), 1) * batch_size
206
+ image_ids_y = image_ids
207
+ elif sample_mode == 'rand_y':
208
+ image_ids = random.sample(range(length // 2), 1) * batch_size
209
+ image_ids_y = random.sample(range(length // 2), batch_size)
210
+ else:
211
+ raise NotImplementedError
212
+ features, labels = [], []
213
+ for image_id, image_id_y in zip(image_ids, image_ids_y):
214
+ feat_bytes = txn.get(f'feat-{str(image_id)}'.encode('utf-8'))
215
+ y_bytes = txn.get(f'y-{str(image_id_y)}'.encode('utf-8'))
216
+ feat = np.frombuffer(feat_bytes, dtype=np.float32).reshape([feat_dim]).copy()
217
+ y = int(y_bytes.decode('utf-8'))
218
+ features.append(feat)
219
+ labels.append(y)
220
+ features = torch.from_numpy(np.stack(features)).to(device)
221
+ labels = torch.from_numpy(np.array(labels)).to(device)
222
+ class_labels = torch.zeros([batch_size, num_classes], device=device)
223
+ if num_classes > 0:
224
+ class_labels = torch.eye(num_classes, device=device)[labels]
225
+ assert features.shape[0] == class_labels.shape[0] == batch_size
226
+ return features, class_labels
227
+
228
+
229
+
230
+ @torch.no_grad()
231
+ def generate_with_net(args, net, device, rank, size):
232
+ seeds = args.seeds
233
+ num_batches = ((len(seeds) - 1) // (args.max_batch_size * size) + 1) * size
234
+ all_batches = torch.as_tensor(seeds).tensor_split(num_batches)
235
+ rank_batches = all_batches[rank:: size]
236
+
237
+ net.eval()
238
+
239
+ # Setup sampler
240
+ sampler_kwargs = dict(num_steps=args.num_steps, S_churn=args.S_churn,
241
+ solver=args.solver, discretization=args.discretization,
242
+ schedule=args.schedule, scaling=args.scaling)
243
+ sampler_kwargs = {key: value for key, value in sampler_kwargs.items() if value is not None}
244
+ have_ablation_kwargs = any(x in sampler_kwargs for x in ['solver', 'discretization', 'schedule', 'scaling'])
245
+ sampler_fn = ablation_sampler if have_ablation_kwargs else edm_sampler
246
+ mprint(f"sampler_kwargs: {sampler_kwargs}, \nsampler fn: {sampler_fn.__name__}")
247
+ # Setup autoencoder
248
+ vae = autoencoder.get_model(args.pretrained_path).to(device)
249
+
250
+ # generate images
251
+ mprint(f'Generating {len(seeds)} images to "{args.outdir}"...')
252
+ for batch_seeds in tqdm(rank_batches, unit='batch', disable=(rank != 0)):
253
+ dist.barrier()
254
+ batch_size = len(batch_seeds)
255
+ if batch_size == 0:
256
+ continue
257
+
258
+ # Pick latents and labels.
259
+ rnd = StackedRandomGenerator(device, batch_seeds)
260
+ latents = rnd.randn([batch_size, net.img_channels, net.img_resolution, net.img_resolution], device=device)
261
+ class_labels = torch.zeros([batch_size, net.num_classes], device=device)
262
+ if net.num_classes:
263
+ class_labels = torch.eye(net.num_classes, device=device)[
264
+ rnd.randint(net.num_classes, size=[batch_size], device=device)]
265
+ if args.class_idx is not None:
266
+ class_labels[:, :] = 0
267
+ class_labels[:, args.class_idx] = 1
268
+
269
+ # retrieve features from training set [support random only]
270
+ feat = None
271
+
272
+ # Generate images.
273
+ def recur_decode(z):
274
+ try:
275
+ return vae.decode(z)
276
+ except: # halve the batch for the VAE decoder (two forward passes) when an occasional OOM happens
277
+ assert z.shape[2] % 2 == 0
278
+ z1, z2 = z.tensor_split(2)
279
+ return torch.cat([recur_decode(z1), recur_decode(z2)])
280
+
281
+ with torch.no_grad():
282
+ z = sampler_fn(net, latents.float(), class_labels.float(), randn_like=rnd.randn_like,
283
+ cfg_scale=args.cfg_scale, feat=feat, **sampler_kwargs).float()
284
+ images = recur_decode(z)
285
+
286
+ # Save images.
287
+ images_np = images.add_(1).mul(127.5).clamp_(0, 255).to(torch.uint8).permute(0, 2, 3, 1).cpu().numpy()
288
+ # images_np = (images * 127.5 + 128).clip(0, 255).to(torch.uint8).permute(0, 2, 3, 1).cpu().numpy()
289
+ for seed, image_np in zip(batch_seeds, images_np):
290
+ image_dir = os.path.join(args.outdir, f'{seed - seed % 1000:06d}') if args.subdirs else args.outdir
291
+ os.makedirs(image_dir, exist_ok=True)
292
+ image_path = os.path.join(image_dir, f'{seed:06d}.png')
293
+ if image_np.shape[2] == 1:
294
+ PIL.Image.fromarray(image_np[:, :, 0], 'L').save(image_path)
295
+ else:
296
+ PIL.Image.fromarray(image_np, 'RGB').save(image_path)
297
+
298
+
299
+ def generate(args):
300
+ device = torch.device("cuda")
301
+
302
+ mprint(f'cfg_scale: {args.cfg_scale}')
303
+ if args.global_rank == 0:
304
+ os.makedirs(args.outdir, exist_ok=True)
305
+ logger = Logger(file_name=f'{args.outdir}/log.txt', file_mode="a+", should_flush=True)
306
+
307
+ # Create model:
308
+ net = Precond_models[args.precond](
309
+ img_resolution=args.image_size,
310
+ img_channels=args.image_channels,
311
+ num_classes=args.num_classes,
312
+ model_type=args.model_type,
313
+ use_decoder=args.use_decoder,
314
+ mae_loss_coef=args.mae_loss_coef,
315
+ pad_cls_token=args.pad_cls_token,
316
+ ext_feature_dim=args.ext_feature_dim
317
+ ).to(device)
318
+ mprint(
319
+ f"{args.model_type} (use_decoder: {args.use_decoder}) Model Parameters: {sum(p.numel() for p in net.parameters()):,}")
320
+
321
+ # Load checkpoints
322
+ ckpt = torch.load(args.ckpt_path, map_location=device)
323
+ net.load_state_dict(ckpt['ema'])
324
+ mprint(f'Load weights from {args.ckpt_path}')
325
+
326
+ generate_with_net(args, net, device, args.global_rank, args.global_size)
327
+
328
+ # Done.
329
+ cleanup()
330
+ if args.global_rank == 0:
331
+ logger.close()
332
+
333
+
334
+ if __name__ == '__main__':
335
+ parser = argparse.ArgumentParser('sampling parameters')
336
+
337
+ # ddp
338
+ parser.add_argument('--num_proc_node', type=int, default=1, help='The number of nodes in multi node env.')
339
+ parser.add_argument('--num_process_per_node', type=int, default=1, help='number of gpus')
340
+ parser.add_argument('--node_rank', type=int, default=0, help='The index of node.')
341
+ parser.add_argument('--local_rank', type=int, default=0, help='rank of process in the node')
342
+ parser.add_argument('--master_address', type=str, default='localhost', help='address for master')
343
+
344
+ # sampling
345
+ parser.add_argument("--feat_path", type=str, default='')
346
+ parser.add_argument("--ext_feature_dim", type=int, default=0)
347
+ parser.add_argument('--ckpt_path', type=str, required=True, help='Network pickle filename')
348
+ parser.add_argument('--outdir', type=str, required=True, help='sampling results save filename')
349
+ parser.add_argument('--seeds', type=parse_int_list, default='0-63', help='Random seeds (e.g. 1,2,5-10)')
350
+ parser.add_argument('--subdirs', action='store_true', help='Create subdirectory for every 1000 seeds')
351
+ parser.add_argument('--class_idx', type=int, default=None, help='Class label [default: random]')
352
+ parser.add_argument('--max_batch_size', type=int, default=64, help='Maximum batch size per GPU')
353
+
354
+ parser.add_argument("--cfg_scale", type=parse_float_none, default=None, help='None = no guidance, by default = 4.0')
355
+
356
+ parser.add_argument('--num_steps', type=int, default=18, help='Number of sampling steps')
357
+ parser.add_argument('--S_churn', type=int, default=0, help='Stochasticity strength')
358
+ parser.add_argument('--solver', type=str, default=None, choices=['euler', 'heun'], help='Ablate ODE solver')
359
+ parser.add_argument('--discretization', type=str, default=None, choices=['vp', 've', 'iddpm', 'edm'],
360
+ help='Ablate time step discretization {t_i}')
361
+ parser.add_argument('--schedule', type=str, default=None, choices=['vp', 've', 'linear'],
362
+ help='Ablate noise schedule sigma(t)')
363
+ parser.add_argument('--scaling', type=str, default=None, choices=['vp', 'none'], help='Ablate signal scaling s(t)')
364
+ parser.add_argument('--pretrained_path', type=str, default='assets/stable_diffusion/autoencoder_kl.pth',
365
+ help='Autoencoder ckpt')
366
+
367
+ # model
368
+ parser.add_argument("--image_size", type=int, default=32)
369
+ parser.add_argument("--image_channels", type=int, default=4)
370
+ parser.add_argument("--num_classes", type=int, default=1000, help='0 means unconditional')
371
+ parser.add_argument("--model_type", type=str, choices=list(DiT_models.keys()), default="DiT-XL/2")
372
+ parser.add_argument('--precond', type=str, choices=['vp', 've', 'edm'], default='edm', help='precond train & loss')
373
+ parser.add_argument("--use_decoder", type=str2bool, default=False)
374
+ parser.add_argument("--pad_cls_token", type=str2bool, default=False)
375
+ parser.add_argument('--mae_loss_coef', type=float, default=0, help='0 means no MAE loss')
376
+ parser.add_argument('--sample_mode', type=str, default='rand_full', help='[rand_full, rand_repeat, rand_y]')
377
+
378
+ args = parser.parse_args()
379
+ args.global_size = args.num_proc_node * args.num_process_per_node
380
+ size = args.num_process_per_node
381
+
382
+ if size > 1:
383
+ processes = []
384
+ for rank in range(size):
385
+ args.local_rank = rank
386
+ args.global_rank = rank + args.node_rank * args.num_process_per_node
387
+ p = Process(target=init_processes, args=(generate, args))
388
+ p.start()
389
+ processes.append(p)
390
+
391
+ for p in processes:
392
+ p.join()
393
+ else:
394
+ print('Single GPU run')
395
+ assert args.global_size == 1 and args.local_rank == 0
396
+ args.global_rank = 0
397
+ init_processes(generate, args)
scripts/download_assets.sh ADDED
@@ -0,0 +1,8 @@
1
+ # download pretrained VAE
2
+ python3 download_assets.py --name vae --dest assets/stable_diffusion
3
+
4
+ # download ImageNet256 training set
5
+ python3 download_assets.py --name imagenet256-latent-lmdb --dest ../data/imagenet256
6
+
7
+ # download ImageNet512 training set
8
+ python3 download_assets.py --name imagenet512-latent-wds --dest ../data/imagenet512-wds
scripts/finetune_latent512.sh ADDED
@@ -0,0 +1,14 @@
1
+ accelerate launch \
2
+ --main_process_ip $MASTER_ADDR \
3
+ --main_process_port $MASTER_PORT \
4
+ --num_machines 4 \
5
+ --machine_rank $NODE_RANK \
6
+ --num_processes 32 \
7
+ train_wds.py \
8
+ --config configs/finetune/imagenet512-latent.yaml \
9
+ --resample \
10
+ --ckpt_path checkpoints/1050000.pt \
11
+ --use_ckpt_path False --use_strict_load False \
12
+ --no_amp
13
+
14
+
scripts/prepare_latent256.sh ADDED
@@ -0,0 +1,3 @@
1
+ # Encode ImageNet 256x256 into latent space
2
+
3
+ python3 extract_latent.py --resolution 256 --ckpt assets/vae/autoencoder_kl.pth --batch_size 64 --outdir ../data/imagenet256-latent
scripts/prepare_latent512.sh ADDED
@@ -0,0 +1,6 @@
1
+ # Encode ImageNet 512x512 into latent space
2
+
3
+ python3 extract_latent.py --resolution 512 --ckpt assets/vae/autoencoder_kl.pth --batch_size 64 --outdir ../data/imagenet512-latent
4
+
5
+ # Convert lmdb to webdataset
6
+ python3 lmdb2wds.py --maxcount 10010 --datadir ../data/imagenet512-latent --outdir ../data/imagenet512-latent-wds --resolution 64 --num_channels 8
scripts/train_latent512.sh ADDED
@@ -0,0 +1,11 @@
1
+ accelerate launch \
2
+ --main_process_ip $MASTER_ADDR \
3
+ --main_process_port $MASTER_PORT \
4
+ --num_machines 4 \
5
+ --machine_rank $NODE_RANK \
6
+ --num_processes 32 \
7
+ train_wds.py \
8
+ --config configs/train/imagenet512-latent.yaml \
9
+ --resample
10
+
11
+
torch_utils/__init__.py ADDED
File without changes
torch_utils/persistence.py ADDED
@@ -0,0 +1,276 @@
1
+ # MIT License
2
+
3
+ # Copyright (c) [2023] [Anima-Lab]
4
+
5
+ # This code is adapted from https://github.com/NVlabs/edm/blob/main/torch_utils/persistence.py.
6
+ # The original code is licensed under a Creative Commons
7
+ # Attribution-NonCommercial-ShareAlike 4.0 International License, which can be found at licenses/LICENSE_EDM.txt.
8
+
9
+
10
+ """Facilities for pickling Python code alongside other data.
11
+
12
+ The pickled code is automatically imported into a separate Python module
13
+ during unpickling. This way, any previously exported pickles will remain
14
+ usable even if the original code is no longer available, or if the current
15
+ version of the code is not consistent with what was originally pickled."""
16
+
17
+ import sys
18
+ import pickle
19
+ import io
20
+ import inspect
21
+ import copy
22
+ import uuid
23
+ import types
24
+
25
+
26
+ #----------------------------------------------------------------------------
27
+
28
+ class EasyDict(dict):
29
+ """Convenience class that behaves like a dict but allows access with the attribute syntax."""
30
+
31
+ def __getattr__(self, name):
32
+ try:
33
+ return self[name]
34
+ except KeyError:
35
+ raise AttributeError(name)
36
+
37
+ def __setattr__(self, name, value):
38
+ self[name] = value
39
+
40
+ def __delattr__(self, name):
41
+ del self[name]
42
+
43
+ #----------------------------------------------------------------------------
44
+
45
+ _version = 6 # internal version number
46
+ _decorators = set() # {decorator_class, ...}
47
+ _import_hooks = [] # [hook_function, ...]
48
+ _module_to_src_dict = dict() # {module: src, ...}
49
+ _src_to_module_dict = dict() # {src: module, ...}
50
+
51
+ #----------------------------------------------------------------------------
52
+
53
+ def persistent_class(orig_class):
54
+ r"""Class decorator that extends a given class to save its source code
55
+ when pickled.
56
+
57
+ Example:
58
+
59
+ from torch_utils import persistence
60
+
61
+ @persistence.persistent_class
62
+ class MyNetwork(torch.nn.Module):
63
+ def __init__(self, num_inputs, num_outputs):
64
+ super().__init__()
65
+ self.fc = MyLayer(num_inputs, num_outputs)
66
+ ...
67
+
68
+ @persistence.persistent_class
69
+ class MyLayer(torch.nn.Module):
70
+ ...
71
+
72
+ When pickled, any instance of `MyNetwork` and `MyLayer` will save its
73
+ source code alongside other internal state (e.g., parameters, buffers,
74
+ and submodules). This way, any previously exported pickle will remain
75
+ usable even if the class definitions have been modified or are no
76
+ longer available.
77
+
78
+ The decorator saves the source code of the entire Python module
79
+ containing the decorated class. It does *not* save the source code of
80
+ any imported modules. Thus, the imported modules must be available
81
+ during unpickling, also including `torch_utils.persistence` itself.
82
+
83
+ It is ok to call functions defined in the same module from the
84
+ decorated class. However, if the decorated class depends on other
85
+ classes defined in the same module, they must be decorated as well.
86
+ This is illustrated in the above example in the case of `MyLayer`.
87
+
88
+ It is also possible to employ the decorator just-in-time before
89
+ calling the constructor. For example:
90
+
91
+ cls = MyLayer
92
+ if want_to_make_it_persistent:
93
+ cls = persistence.persistent_class(cls)
94
+ layer = cls(num_inputs, num_outputs)
95
+
96
+ As an additional feature, the decorator also keeps track of the
97
+ arguments that were used to construct each instance of the decorated
98
+ class. The arguments can be queried via `obj.init_args` and
99
+ `obj.init_kwargs`, and they are automatically pickled alongside other
100
+ object state. This feature can be disabled on a per-instance basis
101
+ by setting `self._record_init_args = False` in the constructor.
102
+
103
+ A typical use case is to first unpickle a previous instance of a
104
+ persistent class, and then upgrade it to use the latest version of
105
+ the source code:
106
+
107
+ with open('old_pickle.pkl', 'rb') as f:
108
+ old_net = pickle.load(f)
109
+ new_net = MyNetwork(*old_obj.init_args, **old_obj.init_kwargs)
110
+ misc.copy_params_and_buffers(old_net, new_net, require_all=True)
111
+ """
112
+ assert isinstance(orig_class, type)
113
+ if is_persistent(orig_class):
114
+ return orig_class
115
+
116
+ assert orig_class.__module__ in sys.modules
117
+ orig_module = sys.modules[orig_class.__module__]
118
+ orig_module_src = _module_to_src(orig_module)
119
+
120
+ class Decorator(orig_class):
121
+ _orig_module_src = orig_module_src
122
+ _orig_class_name = orig_class.__name__
123
+
124
+ def __init__(self, *args, **kwargs):
125
+ super().__init__(*args, **kwargs)
126
+ record_init_args = getattr(self, '_record_init_args', True)
127
+ self._init_args = copy.deepcopy(args) if record_init_args else None
128
+ self._init_kwargs = copy.deepcopy(kwargs) if record_init_args else None
129
+ assert orig_class.__name__ in orig_module.__dict__
130
+ _check_pickleable(self.__reduce__())
131
+
132
+ @property
133
+ def init_args(self):
134
+ assert self._init_args is not None
135
+ return copy.deepcopy(self._init_args)
136
+
137
+ @property
138
+ def init_kwargs(self):
139
+ assert self._init_kwargs is not None
140
+ return EasyDict(copy.deepcopy(self._init_kwargs))
141
+
142
+ def __reduce__(self):
143
+ fields = list(super().__reduce__())
144
+ fields += [None] * max(3 - len(fields), 0)
145
+ if fields[0] is not _reconstruct_persistent_obj:
146
+ meta = dict(type='class', version=_version, module_src=self._orig_module_src, class_name=self._orig_class_name, state=fields[2])
147
+ fields[0] = _reconstruct_persistent_obj # reconstruct func
148
+ fields[1] = (meta,) # reconstruct args
149
+ fields[2] = None # state dict
150
+ return tuple(fields)
151
+
152
+ Decorator.__name__ = orig_class.__name__
153
+ Decorator.__module__ = orig_class.__module__
154
+ _decorators.add(Decorator)
155
+ return Decorator
156
+
157
+ #----------------------------------------------------------------------------
158
+
159
+ def is_persistent(obj):
160
+ r"""Test whether the given object or class is persistent, i.e.,
161
+ whether it will save its source code when pickled.
162
+ """
163
+ try:
164
+ if obj in _decorators:
165
+ return True
166
+ except TypeError:
167
+ pass
168
+ return type(obj) in _decorators # pylint: disable=unidiomatic-typecheck
169
+
170
+ #----------------------------------------------------------------------------
171
+
172
+ def import_hook(hook):
173
+ r"""Register an import hook that is called whenever a persistent object
174
+ is being unpickled. A typical use case is to patch the pickled source
175
+ code to avoid errors and inconsistencies when the API of some imported
176
+ module has changed.
177
+
178
+ The hook should have the following signature:
179
+
180
+ hook(meta) -> modified meta
181
+
182
+ `meta` is an instance of `EasyDict` with the following fields:
183
+
184
+ type: Type of the persistent object, e.g. `'class'`.
185
+ version: Internal version number of `torch_utils.persistence`.
186
+ module_src: Original source code of the Python module.
187
+ class_name: Class name in the original Python module.
188
+ state: Internal state of the object.
189
+
190
+ Example:
191
+
192
+ @persistence.import_hook
193
+ def wreck_my_network(meta):
194
+ if meta.class_name == 'MyNetwork':
195
+ print('MyNetwork is being imported. I will wreck it!')
196
+ meta.module_src = meta.module_src.replace("True", "False")
197
+ return meta
198
+ """
199
+ assert callable(hook)
200
+ _import_hooks.append(hook)
201
+
202
+ #----------------------------------------------------------------------------
203
+
204
+ def _reconstruct_persistent_obj(meta):
205
+ r"""Hook that is called internally by the `pickle` module to unpickle
206
+ a persistent object.
207
+ """
208
+ meta = EasyDict(meta)
209
+ meta.state = EasyDict(meta.state)
210
+ for hook in _import_hooks:
211
+ meta = hook(meta)
212
+ assert meta is not None
213
+
214
+ assert meta.version == _version
215
+ module = _src_to_module(meta.module_src)
216
+
217
+ assert meta.type == 'class'
218
+ orig_class = module.__dict__[meta.class_name]
219
+ decorator_class = persistent_class(orig_class)
220
+ obj = decorator_class.__new__(decorator_class)
221
+
222
+ setstate = getattr(obj, '__setstate__', None)
223
+ if callable(setstate):
224
+ setstate(meta.state) # pylint: disable=not-callable
225
+ else:
226
+ obj.__dict__.update(meta.state)
227
+ return obj
228
+
229
+ #----------------------------------------------------------------------------
230
+
231
+ def _module_to_src(module):
232
+ r"""Query the source code of a given Python module.
233
+ """
234
+ src = _module_to_src_dict.get(module, None)
235
+ if src is None:
236
+ src = inspect.getsource(module)
237
+ _module_to_src_dict[module] = src
238
+ _src_to_module_dict[src] = module
239
+ return src
240
+
241
+ def _src_to_module(src):
242
+ r"""Get or create a Python module for the given source code.
243
+ """
244
+ module = _src_to_module_dict.get(src, None)
245
+ if module is None:
246
+ module_name = "_imported_module_" + uuid.uuid4().hex
247
+ module = types.ModuleType(module_name)
248
+ sys.modules[module_name] = module
249
+ _module_to_src_dict[module] = src
250
+ _src_to_module_dict[src] = module
251
+ exec(src, module.__dict__) # pylint: disable=exec-used
252
+ return module
253
+
254
+ #----------------------------------------------------------------------------
255
+
256
+ def _check_pickleable(obj):
257
+ r"""Check that the given object is pickleable, raising an exception if
258
+ it is not. This function is expected to be considerably more efficient
259
+ than actually pickling the object.
260
+ """
261
+ def recurse(obj):
262
+ if isinstance(obj, (list, tuple, set)):
263
+ return [recurse(x) for x in obj]
264
+ if isinstance(obj, dict):
265
+ return [[recurse(x), recurse(y)] for x, y in obj.items()]
266
+ if isinstance(obj, (str, int, float, bool, bytes, bytearray)):
267
+ return None # Python primitive types are pickleable.
268
+ if f'{type(obj).__module__}.{type(obj).__name__}' in ['numpy.ndarray', 'torch.Tensor', 'torch.nn.parameter.Parameter']:
269
+ return None # NumPy arrays and PyTorch tensors are pickleable.
270
+ if is_persistent(obj):
271
+ return None # Persistent objects are pickleable, by virtue of the constructor check.
272
+ return obj
273
+ with io.BytesIO() as f:
274
+ pickle.dump(recurse(obj), f)
275
+
276
+ #----------------------------------------------------------------------------
train.py ADDED
@@ -0,0 +1,336 @@
1
+ # MIT License
2
+
3
+ # Copyright (c) [2023] [Anima-Lab]
4
+ '''
5
+ Training MaskDiT on a latent dataset in LMDB format. Used for the experiments on ImageNet 256x256.
6
+ '''
7
+
8
+ import argparse
9
+ import os.path
10
+ from copy import deepcopy
11
+ from time import time
12
+ from omegaconf import OmegaConf
13
+
14
+ import apex
15
+ import torch
16
+ import accelerate
17
+
18
+ from torch.utils.data import DataLoader
19
+
20
+ from fid import calc
21
+ from models.maskdit import Precond_models
22
+ from train_utils.loss import Losses
23
+ from train_utils.datasets import ImageNetLatentDataset
24
+
25
+ from train_utils.helper import get_mask_ratio_fn, requires_grad, update_ema, unwrap_model
26
+
27
+ from sample import generate_with_net
28
+ from utils import dist, mprint, get_latest_ckpt, Logger, sample, \
29
+ str2bool, parse_str_none, parse_int_list, parse_float_none
30
+
31
+
32
+ # ------------------------------------------------------------
33
+
34
+
35
+ def train_loop(args):
36
+ # load configuration
37
+ config = OmegaConf.load(args.config)
38
+
39
+ if not args.no_amp:
40
+ config.train.amp = 'fp16'
41
+ else:
42
+ config.train.amp = 'no'
43
+
44
+ if config.train.tf32:
45
+ torch.backends.cudnn.allow_tf32 = True
46
+ torch.set_float32_matmul_precision('high')
47
+
48
+ accelerator = accelerate.Accelerator(mixed_precision=config.train.amp,
49
+ gradient_accumulation_steps=config.train.grad_accum,
50
+ log_with='wandb')
51
+ # setup wandb
52
+ if args.use_wandb:
53
+ wandb_init_kwargs = {
54
+ 'entity': config.wandb.entity,
55
+ 'project': config.wandb.project,
56
+ 'group': config.wandb.group,
57
+ }
58
+ accelerator.init_trackers(config.wandb.project, config=OmegaConf.to_container(config), init_kwargs=wandb_init_kwargs)
59
+
60
+ mprint('start training...')
61
+ size = accelerator.num_processes
62
+ rank = accelerator.process_index
63
+
64
+ print(f'global_rank: {rank}, global_size: {size}')
65
+ device = accelerator.device
66
+
67
+ seed = args.global_seed
68
+ torch.manual_seed(seed)
69
+
70
+ mprint(f"enable_amp: {not args.no_amp}, TF32: {config.train.tf32}")
71
+ # Select batch size per GPU
72
+ num_accumulation_rounds = config.train.grad_accum
73
+ micro_batch = config.train.batchsize
74
+ batch_gpu_total = micro_batch * num_accumulation_rounds
75
+ global_batch_size = batch_gpu_total * size
76
+ mprint(f"Global batchsize: {global_batch_size}, batchsize per GPU: {batch_gpu_total}, micro_batch: {micro_batch}.")
77
+
78
+ class_dropout_prob = config.model.class_dropout_prob
79
+ log_every = config.log.log_every
80
+ ckpt_every = config.log.ckpt_every
81
+
82
+ mask_ratio_fn = get_mask_ratio_fn(config.model.mask_ratio_fn, config.model.mask_ratio, config.model.mask_ratio_min)
83
+
84
+ # Setup an experiment folder
85
+ model_name = config.model.model_type.replace("/", "-") # e.g., DiT-XL/2 --> DiT-XL-2 (for naming folders)
86
+ data_name = config.data.dataset
87
+ if args.ckpt_path is not None and args.use_ckpt_path: # use the existing exp path (mainly used for fine-tuning)
88
+ checkpoint_dir = os.path.dirname(args.ckpt_path)
89
+ experiment_dir = os.path.dirname(checkpoint_dir)
90
+ exp_name = os.path.basename(experiment_dir)
91
+ else: # start a new exp path (and resume from the latest checkpoint if possible)
92
+ cond_gen = 'cond' if config.model.num_classes else 'uncond'
93
+ exp_name = f'{model_name}-{config.model.precond}-{data_name}-{cond_gen}-m{config.model.mask_ratio}-de{int(config.model.use_decoder)}' \
94
+ f'-mae{config.model.mae_loss_coef}-bs-{global_batch_size}-lr{config.train.lr}{config.log.tag}'
95
+ experiment_dir = f"{args.results_dir}/{exp_name}"
96
+ checkpoint_dir = f"{experiment_dir}/checkpoints" # Stores saved model checkpoints
97
+ os.makedirs(checkpoint_dir, exist_ok=True)
98
+ if args.ckpt_path is None:
99
+ args.ckpt_path = get_latest_ckpt(checkpoint_dir) # Resumes from the latest checkpoint if it exists
100
+ mprint(f"Experiment directory created at {experiment_dir}")
101
+
102
+ if accelerator.is_main_process:
103
+ logger = Logger(file_name=f'{experiment_dir}/log.txt', file_mode="a+", should_flush=True)
104
+
105
+ mprint(f"Experiment directory created at {experiment_dir}")
106
+ # Setup dataset
107
+ dataset = ImageNetLatentDataset(
108
+ config.data.root, resolution=config.data.resolution,
109
+ num_channels=config.data.num_channels, xflip=config.train.xflip,
110
+ feat_path=config.data.feat_path, feat_dim=config.model.ext_feature_dim)
111
+
112
+ loader = DataLoader(
113
+ dataset, batch_size=batch_gpu_total, shuffle=False,
114
+ num_workers=args.num_workers,
115
+ pin_memory=True, persistent_workers=True,
116
+ drop_last=True
117
+ )
118
+ mprint(f"Dataset contains {len(dataset):,} images ({config.data.root})")
119
+
120
+ steps_per_epoch = len(dataset) // global_batch_size
121
+ mprint(f"{steps_per_epoch} steps per epoch")
122
+
123
+ model = Precond_models[config.model.precond](
124
+ img_resolution=config.model.in_size,
125
+ img_channels=config.model.in_channels,
126
+ num_classes=config.model.num_classes,
127
+ model_type=config.model.model_type,
128
+ use_decoder=config.model.use_decoder,
129
+ mae_loss_coef=config.model.mae_loss_coef,
130
+ pad_cls_token=config.model.pad_cls_token
131
+ ).to(device)
132
+
133
+ # Note that parameter initialization is done within the model constructor
134
+ ema = deepcopy(model).to(device) # Create an EMA of the model for use after training
135
+ requires_grad(ema, False)
136
+
137
+ mprint(f"{config.model.model_type} ((use_decoder: {config.model.use_decoder})) Model Parameters: {sum(p.numel() for p in model.parameters()):,}")
138
+ mprint(f'extras: {model.model.extras}, cls_token: {model.model.cls_token}')
139
+
140
+ # Setup optimizer (we used default Adam betas=(0.9, 0.999) and a constant learning rate of 1e-4 in our paper):
141
+ optimizer = apex.optimizers.FusedAdam(model.parameters(), lr=config.train.lr, adam_w_mode=True, weight_decay=0)
142
+
143
+ # Load checkpoints
144
+ train_steps_start = 0
145
+ epoch_start = 0
146
+
147
+ if args.ckpt_path is not None:
148
+ ckpt = torch.load(args.ckpt_path, map_location=device)
149
+ model.load_state_dict(ckpt['model'], strict=args.use_strict_load)
150
+ ema.load_state_dict(ckpt['ema'], strict=args.use_strict_load)
151
+ mprint(f'Load weights from {args.ckpt_path}')
152
+ if args.use_strict_load:
153
+ optimizer.load_state_dict(ckpt['opt'])
154
+ for state in optimizer.state.values():
155
+ for k, v in state.items():
156
+ if isinstance(v, torch.Tensor):
157
+ state[k] = v.cuda()
158
+ mprint(f'Load optimizer state..')
159
+ train_steps_start = int(os.path.basename(args.ckpt_path).split('.pt')[0])
160
+ epoch_start = train_steps_start // steps_per_epoch
161
+ mprint(f"train_steps_start: {train_steps_start}")
162
+ del ckpt # conserve memory
163
+
164
+ # FID evaluation for the loaded weights
165
+ if args.enable_eval:
166
+ start_time = time()
167
+ args.outdir = os.path.join(experiment_dir, 'fid', f'edm-steps{args.num_steps}-ckpt{train_steps_start}_cfg{args.cfg_scale}')
168
+ os.makedirs(args.outdir, exist_ok=True)
169
+ generate_with_net(args, ema, device, rank, size)
170
+ dist.barrier()
171
+ fid = calc(args.outdir, config.eval.ref_path, args.num_expected, args.global_seed, args.fid_batch_size)
172
+ mprint(f"time for fid calc: {time() - start_time}")
173
+ if args.use_wandb:
174
+ accelerator.log({f'eval/fid': fid}, step=train_steps_start)
175
+ mprint(f'guidance: {args.cfg_scale} FID: {fid}')
176
+ dist.barrier()
177
+
178
+ model, optimizer, loader = accelerator.prepare(model, optimizer, loader)
179
+ model = torch.compile(model)
180
+
181
+ # Setup loss
182
+ loss_fn = Losses[config.model.precond]()
183
+
184
+ # Prepare models for training:
185
+ if args.ckpt_path is None:
186
+ assert train_steps_start == 0
187
+ raw_model = unwrap_model(model)
188
+ update_ema(ema, raw_model, decay=0) # Ensure EMA is initialized with synced weights
189
+ model.train() # important! This enables embedding dropout for classifier-free guidance
190
+ ema.eval() # EMA model should always be in eval mode
191
+
192
+ # Variables for monitoring/logging purposes:
193
+ train_steps = train_steps_start
194
+ log_steps = 0
195
+ running_loss = 0
196
+ start_time = time()
197
+ mprint(f"Training for {config.train.epochs} epochs...")
198
+ for epoch in range(epoch_start, config.train.epochs):
199
+ mprint(f"Beginning epoch {epoch}...")
200
+ for x, cond in loader:
201
+ x = x.to(device)
202
+ y = cond.to(device)
203
+ x = sample(x)
204
+ # Accumulate gradients.
205
+ loss_batch = 0
206
+ model.zero_grad(set_to_none=True)
207
+ curr_mask_ratio = mask_ratio_fn((train_steps - train_steps_start) / config.train.max_num_steps)
208
+ if class_dropout_prob > 0:
209
+ y = y * (torch.rand([y.shape[0], 1], device=device) >= class_dropout_prob)
210
+
211
+ for round_idx in range(num_accumulation_rounds):
212
+ x_ = x[round_idx * micro_batch: (round_idx + 1) * micro_batch]
213
+ y_ = y[round_idx * micro_batch: (round_idx + 1) * micro_batch]
214
+
215
+ with accelerator.accumulate(model):
216
+ loss = loss_fn(net=model, images=x_, labels=y_,
217
+ mask_ratio=curr_mask_ratio,
218
+ mae_loss_coef=config.model.mae_loss_coef)
219
+ loss_mean = loss.mean()
220
+ accelerator.backward(loss_mean)
221
+
222
+ # Update weights with lr warmup.
223
+ lr_cur = config.train.lr * min(train_steps * global_batch_size / max(config.train.lr_rampup_kimg * 1000, 1e-8), 1)
224
+ for g in optimizer.param_groups:
225
+ g['lr'] = lr_cur
226
+ optimizer.step()
227
+ loss_batch += loss_mean.item()
228
+
229
+ raw_model = unwrap_model(model)
230
+ update_ema(ema, model.module)
231
+
232
+ # Log loss values:
233
+ running_loss += loss_batch
234
+ log_steps += 1
235
+ train_steps += 1
236
+ if train_steps > (train_steps_start + config.train.max_num_steps):
237
+ break
238
+ if train_steps % log_every == 0:
239
+ # Measure training speed:
240
+ torch.cuda.synchronize()
241
+ end_time = time()
242
+ steps_per_sec = log_steps / (end_time - start_time)
243
+ # Reduce loss history over all processes:
244
+ avg_loss = torch.tensor(running_loss / log_steps, device=device)
245
+ dist.all_reduce(avg_loss, op=dist.ReduceOp.SUM)
246
+ avg_loss = avg_loss.item() / size
247
+ mprint(f"(step={train_steps:07d}) Train Loss: {avg_loss:.4f}, Train Steps/Sec: {steps_per_sec:.2f}")
248
+ mprint(f'Peak GPU memory usage: {torch.cuda.max_memory_allocated() / 1024 ** 3:.2f} GB')
249
+ mprint(f'Reserved GPU memory: {torch.cuda.memory_reserved() / 1024 ** 3:.2f} GB')
250
+
251
+ if args.use_wandb:
252
+ accelerator.log({f'train/loss': avg_loss, 'train/lr': lr_cur}, step=train_steps)
253
+ # Reset monitoring variables:
254
+ running_loss = 0
255
+ log_steps = 0
256
+ start_time = time()
257
+
258
+ # Save checkpoint:
259
+ if train_steps % ckpt_every == 0 and train_steps > train_steps_start:
260
+ if rank == 0:
261
+ checkpoint = {
262
+ "model": raw_model.state_dict(),
263
+ "ema": ema.state_dict(),
264
+ "opt": optimizer.state_dict(),
265
+ "args": args
266
+ }
267
+ checkpoint_path = f"{checkpoint_dir}/{train_steps:07d}.pt"
268
+ torch.save(checkpoint, checkpoint_path)
269
+ mprint(f"Saved checkpoint to {checkpoint_path}")
270
+ del checkpoint # conserve memory
271
+ dist.barrier()
272
+
273
+ # FID evaluation during training
274
+ if args.enable_eval:
275
+ start_time = time()
276
+ args.outdir = os.path.join(experiment_dir, 'fid', f'edm-steps{args.num_steps}-ckpt{train_steps}_cfg{args.cfg_scale}')
277
+ os.makedirs(args.outdir, exist_ok=True)
278
+ generate_with_net(args, ema, device, rank, size)
279
+
280
+ dist.barrier()
281
+ fid = calc(args.outdir, args.ref_path, args.num_expected, args.global_seed, args.fid_batch_size)
282
+ mprint(f"time for fid calc: {time() - start_time}, fid: {fid}")
283
+ if args.use_wandb:
284
+ accelerator.log({f'eval/fid': fid}, step=train_steps)
285
+ mprint(f'Guidance: {args.cfg_scale}, FID: {fid}')
286
+ dist.barrier()
287
+ start_time = time()
288
+
289
+ if accelerator.is_main_process:
290
+ logger.close()
291
+ accelerator.end_training()
292
+
293
+
294
+ if __name__ == '__main__':
295
+ parser = argparse.ArgumentParser('training parameters')
296
+ # basic config
297
+ parser.add_argument('--config', type=str, required=True, help='path to config file')
298
+
299
+ # training
300
+ parser.add_argument("--results_dir", type=str, default="results")
301
+ parser.add_argument("--ckpt_path", type=parse_str_none, default=None)
302
+
303
+ parser.add_argument("--global_seed", type=int, default=0)
304
+ parser.add_argument("--num_workers", type=int, default=4)
305
+ parser.add_argument('--no_amp', action='store_true', help="Disable automatic mixed precision.")
306
+
307
+ parser.add_argument("--use_wandb", action='store_true', help='enable wandb logging')
308
+ parser.add_argument("--use_ckpt_path", type=str2bool, default=True)
309
+ parser.add_argument("--use_strict_load", type=str2bool, default=True)
310
+ parser.add_argument("--tag", type=str, default='')
311
+
312
+ # sampling
313
+ parser.add_argument('--enable_eval', action='store_true', help='enable fid calc during training')
314
+ parser.add_argument('--seeds', type=parse_int_list, default='0-49999', help='Random seeds (e.g. 1,2,5-10)')
315
+ parser.add_argument('--subdirs', action='store_true', help='Create subdirectory for every 1000 seeds')
316
+ parser.add_argument('--class_idx', type=int, default=None, help='Class label [default: random]')
317
+ parser.add_argument('--max_batch_size', type=int, default=50, help='Maximum batch size per GPU during sampling, must be a factor of 50k if torch.compile is used')
318
+
319
+ parser.add_argument("--cfg_scale", type=parse_float_none, default=None, help='None = no guidance, by default = 4.0')
320
+
321
+ parser.add_argument('--num_steps', type=int, default=40, help='Number of sampling steps')
322
+ parser.add_argument('--S_churn', type=int, default=0, help='Stochasticity strength')
323
+ parser.add_argument('--solver', type=str, default=None, choices=['euler', 'heun'], help='Ablate ODE solver')
324
+ parser.add_argument('--discretization', type=str, default=None, choices=['vp', 've', 'iddpm', 'edm'], help='Ablate ODE solver')
325
+ parser.add_argument('--schedule', type=str, default=None, choices=['vp', 've', 'linear'], help='Ablate noise schedule sigma(t)')
326
+ parser.add_argument('--scaling', type=str, default=None, choices=['vp', 'none'], help='Ablate signal scaling s(t)')
327
+ parser.add_argument('--pretrained_path', type=str, default='assets/stable_diffusion/autoencoder_kl.pth', help='Autoencoder ckpt')
328
+
329
+ parser.add_argument('--ref_path', type=str, default='assets/fid_stats/fid_stats_imagenet256_guided_diffusion.npz', help='Dataset reference statistics')
330
+ parser.add_argument('--num_expected', type=int, default=50000, help='Number of images to use')
331
+ parser.add_argument('--fid_batch_size', type=int, default=64, help='Maximum batch size per GPU')
332
+
333
+ args = parser.parse_args()
334
+
335
+ torch.backends.cudnn.benchmark = True
336
+ train_loop(args)
train_utils/datasets.py ADDED
@@ -0,0 +1,412 @@
1
+ # MIT License
2
+
3
+ # Copyright (c) [2023] [Anima-Lab]
4
+
5
+
6
+ import io
7
+ import os
8
+ import json
9
+ import zipfile
10
+
11
+ import lmdb
12
+ import numpy as np
13
+ from PIL import Image
14
+ import torch
15
+ from torchvision.datasets import ImageFolder, VisionDataset
16
+
17
+
18
+
19
+ def center_crop_arr(pil_image, image_size):
20
+ """
21
+ Center cropping implementation from ADM.
22
+ https://github.com/openai/guided-diffusion/blob/8fb3ad9197f16bbc40620447b2742e13458d2831/guided_diffusion/image_datasets.py#L126
23
+ """
24
+ while min(*pil_image.size) >= 2 * image_size:
25
+ pil_image = pil_image.resize(
26
+ tuple(x // 2 for x in pil_image.size), resample=Image.BOX
27
+ )
28
+
29
+ scale = image_size / min(*pil_image.size)
30
+ pil_image = pil_image.resize(
31
+ tuple(round(x * scale) for x in pil_image.size), resample=Image.BICUBIC
32
+ )
33
+
34
+ arr = np.array(pil_image)
35
+ crop_y = (arr.shape[0] - image_size) // 2
36
+ crop_x = (arr.shape[1] - image_size) // 2
37
+ return Image.fromarray(arr[crop_y: crop_y + image_size, crop_x: crop_x + image_size])
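+ # Illustrative walk-through of the cropping above (numbers are made up, not taken
+ # from the dataset): a 1024x768 input is halved with BOX resampling to 512x384
+ # (its short side is now < 2*image_size for image_size=256), BICUBIC-rescaled by
+ # 256/384 to roughly 341x256, and finally center-cropped to 256x256.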
38
+
39
+
40
+ ################################################################################
41
+ # ImageNet - LMDB
42
+ ###############################################################################
43
+
44
+ def lmdb_loader(path, lmdb_data, resolution):
45
+ # In-memory binary streams
46
+ with lmdb_data.begin(write=False, buffers=True) as txn:
47
+ bytedata = txn.get(path.encode('ascii'))
48
+ img = Image.open(io.BytesIO(bytedata)).convert('RGB')
49
+ arr = center_crop_arr(img, resolution)
50
+ # arr = arr.astype(np.float32) / 127.5 - 1
51
+ # arr = np.transpose(arr, [2, 0, 1]) # CHW
52
+ return arr
53
+
54
+
55
+ def imagenet_lmdb_dataset(
56
+ root,
57
+ transform=None, target_transform=None,
58
+ resolution=256):
59
+ """
60
+ You can create this dataloader using:
61
+ train_data = imagenet_lmdb_dataset(traindir, transform=train_transform)
62
+ valid_data = imagenet_lmdb_dataset(validdir, transform=val_transform)
63
+ """
64
+
65
+ if root.endswith('/'):
66
+ root = root[:-1]
67
+ pt_path = os.path.join(
68
+ root + '_faster_imagefolder.lmdb.pt')
69
+ lmdb_path = os.path.join(
70
+ root + '_faster_imagefolder.lmdb')
71
+ if os.path.isfile(pt_path) and os.path.isdir(lmdb_path):
72
+ print('Loading pt {} and lmdb {}'.format(pt_path, lmdb_path))
73
+ data_set = torch.load(pt_path)
74
+ else:
75
+ data_set = ImageFolder(
76
+ root, None, None, None)
77
+ torch.save(data_set, pt_path, pickle_protocol=4)
78
+ print('Saving pt to {}'.format(pt_path))
79
+ print('Building lmdb to {}'.format(lmdb_path))
80
+ env = lmdb.open(lmdb_path, map_size=1e12)
81
+ with env.begin(write=True) as txn:
82
+ for path, class_index in data_set.imgs:
83
+ with open(path, 'rb') as f:
84
+ data = f.read()
85
+ txn.put(path.encode('ascii'), data)
86
+
87
+ lmdb_dataset = ImageLMDB(lmdb_path, transform, target_transform, resolution, data_set.imgs, data_set.class_to_idx, data_set.classes)
88
+ return lmdb_dataset
89
+
90
+
91
+ ################################################################################
92
+ # ImageNet Dataset class- LMDB
93
+ ###############################################################################
94
+
95
+ class ImageLMDB(VisionDataset):
96
+ """
97
+ A data loader for ImageNet LMDB dataset, which is faster than the original ImageFolder.
98
+ """
99
+ def __init__(self, root, transform=None, target_transform=None,
100
+ resolution=256, samples=None, class_to_idx=None, classes=None):
101
+ super().__init__(root, transform=transform,
102
+ target_transform=target_transform)
103
+ self.root = root
104
+ self.resolution = resolution
105
+ self.samples = samples
106
+ self.class_to_idx = class_to_idx
107
+ self.classes = classes
108
+
109
+ def __getitem__(self, index: int):
110
+ path, target = self.samples[index]
111
+
112
+ # load image from path
113
+ if not hasattr(self, 'txn'):
114
+ self.open_db()
115
+ bytedata = self.txn.get(path.encode('ascii'))
116
+ img = Image.open(io.BytesIO(bytedata)).convert('RGB')
117
+ arr = center_crop_arr(img, self.resolution)
118
+ if self.transform is not None:
119
+ arr = self.transform(arr)
120
+ if self.target_transform is not None:
121
+ target = self.target_transform(target)
122
+ return arr, target
123
+
124
+ def __len__(self) -> int:
125
+ return len(self.samples)
126
+
127
+ def open_db(self):
128
+ self.env = lmdb.open(self.root, readonly=True, max_readers=256, lock=False, readahead=False, meminit=False)
129
+ self.txn = self.env.begin(write=False, buffers=True)
130
+
131
+
132
+
133
+ ################################################################################
134
+ # ImageNet - LMDB - latent space
135
+ ###############################################################################
136
+
137
+
138
+
139
+ # ----------------------------------------------------------------------------
140
+ # Abstract base class for datasets.
141
+
142
+ class Dataset(torch.utils.data.Dataset):
143
+ def __init__(self,
144
+ name, # Name of the dataset.
145
+ raw_shape, # Shape of the raw image data (NCHW).
146
+ max_size=None, # Artificially limit the size of the dataset. None = no limit. Applied before xflip.
147
+ label_dim=1000, # Ensure specific number of classes
148
+ xflip=False, # Artificially double the size of the dataset via x-flips. Applied after max_size.
149
+ random_seed=0, # Random seed to use when applying max_size.
150
+ ):
151
+ self._name = name
152
+ self._raw_shape = list(raw_shape)
153
+ self._label_dim = label_dim
154
+ self._label_shape = None
155
+
156
+ # Apply max_size.
157
+ self._raw_idx = np.arange(self._raw_shape[0], dtype=np.int64)
158
+ if (max_size is not None) and (self._raw_idx.size > max_size):
159
+ np.random.RandomState(random_seed % (1 << 31)).shuffle(self._raw_idx)
160
+ self._raw_idx = np.sort(self._raw_idx[:max_size])
161
+
162
+ # Apply xflip. (Assume the dataset already contains the same number of xflipped samples)
163
+ if xflip:
164
+ self._raw_idx = np.concatenate([self._raw_idx, self._raw_idx + self._raw_shape[0]])
165
+
166
+ def close(self): # to be overridden by subclass
167
+ pass
168
+
169
+ def _load_raw_data(self, raw_idx): # to be overridden by subclass
170
+ raise NotImplementedError
171
+
172
+ def __getstate__(self):
173
+ return dict(self.__dict__, _raw_labels=None)
174
+
175
+ def __del__(self):
176
+ try:
177
+ self.close()
178
+ except:
179
+ pass
180
+
181
+ def __len__(self):
182
+ return self._raw_idx.size
183
+
184
+ def __getitem__(self, idx):
185
+ raw_idx = self._raw_idx[idx]
186
+ image, cond = self._load_raw_data(raw_idx)
187
+ assert isinstance(image, np.ndarray)
188
+ if isinstance(cond, list): # [label, feature]
189
+ cond[0] = self._get_onehot(cond[0])
190
+ else: # label
191
+ cond = self._get_onehot(cond)
192
+ return image.copy(), cond
193
+
194
+ def _get_onehot(self, label):
195
+ if isinstance(label, int) or label.dtype == np.int64:
196
+ onehot = np.zeros(self.label_shape, dtype=np.float32)
197
+ onehot[label] = 1
198
+ label = onehot
199
+ assert isinstance(label, np.ndarray)
200
+ return label.copy()
201
+
202
+ @property
203
+ def name(self):
204
+ return self._name
205
+
206
+ @property
207
+ def image_shape(self):
208
+ return list(self._raw_shape[1:])
209
+
210
+ @property
211
+ def num_channels(self):
212
+ assert len(self.image_shape) == 3 # CHW
213
+ return self.image_shape[0]
214
+
215
+ @property
216
+ def resolution(self):
217
+ assert len(self.image_shape) == 3 # CHW
218
+ assert self.image_shape[1] == self.image_shape[2]
219
+ return self.image_shape[1]
220
+
221
+ @property
222
+ def label_shape(self):
223
+ if self._label_shape is None:
224
+ self._label_shape = [self._label_dim]
225
+ return list(self._label_shape)
226
+
227
+ @property
228
+ def label_dim(self):
229
+ assert len(self.label_shape) == 1
230
+ return self.label_shape[0]
231
+
232
+ @property
233
+ def has_labels(self):
234
+ return True
235
+
236
+
237
+ # ----------------------------------------------------------------------------
238
+ # Dataset subclass that loads latent images recursively from the specified lmdb file.
239
+
240
+ class ImageNetLatentDataset(Dataset):
241
+ def __init__(self,
242
+ path, # Path to directory or zip.
243
+ resolution=32, # Ensure specific resolution, default 32.
244
+ num_channels=4, # Ensure specific number of channels, default 4.
245
+ split='train', # train or val split
246
+ feat_path=None, # Path to features lmdb file (only works when feat_cond=True)
247
+ feat_dim=0, # feature dim
248
+ **super_kwargs, # Additional arguments for the Dataset base class.
249
+ ):
250
+ self._path = os.path.join(path, split)
251
+ self.feat_dim = feat_dim
252
+ if not hasattr(self, 'txn'):
253
+ self.open_lmdb()
254
+ self.feat_txn = None
255
+ if feat_path is not None and os.path.isdir(feat_path):
256
+ assert self.feat_dim > 0
257
+ self._feat_path = os.path.join(feat_path, split)
258
+ self.open_feat_lmdb()
259
+
260
+ length = int(self.txn.get('length'.encode('utf-8')).decode('utf-8'))
261
+ name = os.path.basename(path)
262
+ raw_shape = [length, num_channels, resolution, resolution] # 1281167 x 4 x 32 x 32
263
+ if raw_shape[2] != resolution or raw_shape[3] != resolution:
264
+ raise IOError('Image files do not match the specified resolution')
265
+
266
+ super().__init__(name=name, raw_shape=raw_shape, **super_kwargs)
267
+
268
+ def open_lmdb(self):
269
+ self.env = lmdb.open(self._path, readonly=True, lock=False, create=False)
270
+ self.txn = self.env.begin(write=False)
271
+
272
+ def open_feat_lmdb(self):
273
+ self.feat_env = lmdb.open(self._feat_path, readonly=True, lock=False, create=False)
274
+ self.feat_txn = self.feat_env.begin(write=False)
275
+
276
+ def _load_raw_data(self, idx):
277
+ if not hasattr(self, 'txn'):
278
+ self.open_lmdb()
279
+
280
+ z_bytes = self.txn.get(f'z-{str(idx)}'.encode('utf-8'))
281
+ y_bytes = self.txn.get(f'y-{str(idx)}'.encode('utf-8'))
282
+ z = np.frombuffer(z_bytes, dtype=np.float32).reshape([-1, self.resolution, self.resolution]).copy()
283
+ y = int(y_bytes.decode('utf-8'))
284
+
285
+ cond = y
286
+ if self.feat_txn is not None:
287
+ feat_bytes = self.feat_txn.get(f'feat-{str(idx)}'.encode('utf-8'))
288
+ feat_y_bytes = self.feat_txn.get(f'y-{str(idx)}'.encode('utf-8'))
289
+ feat = np.frombuffer(feat_bytes, dtype=np.float32).reshape([self.feat_dim]).copy()
290
+ feat_y = int(feat_y_bytes.decode('utf-8'))
291
+ assert y == feat_y, 'Ordering mismatch between txn and feat_txn!'
292
+ cond = [y, feat]
293
+
294
+ return z, cond
295
+
296
+ def close(self):
297
+ try:
298
+ if self.env is not None:
299
+ self.env.close()
300
+ if self.feat_env is not None:
301
+ self.feat_env.close()
302
+ finally:
303
+ self.env = None
304
+ self.feat_env = None
305
+
306
+
307
+ # ----------------------------------------------------------------------------
308
+ # Dataset subclass that loads images recursively from the specified directory or zip file.
309
+
310
+ class ImageFolderDataset(Dataset):
311
+ def __init__(self,
312
+ path, # Path to directory or zip.
313
+ resolution=None, # Ensure specific resolution, None = highest available.
314
+ use_labels=False, # Enable conditioning labels? False = label dimension is zero.
315
+ **super_kwargs, # Additional arguments for the Dataset base class.
316
+ ):
317
+ self._path = path
318
+ self._zipfile = None
319
+ self._raw_labels = None
320
+ self._use_labels = use_labels
321
+
322
+ if os.path.isdir(self._path):
323
+ self._type = 'dir'
324
+ self._all_fnames = {os.path.relpath(os.path.join(root, fname), start=self._path) for root, _dirs, files in
325
+ os.walk(self._path) for fname in files}
326
+ elif self._file_ext(self._path) == '.zip':
327
+ self._type = 'zip'
328
+ self._all_fnames = set(self._get_zipfile().namelist())
329
+ else:
330
+ raise IOError('Path must point to a directory or zip')
331
+
332
+ Image.init()
333
+ self._image_fnames = sorted(fname for fname in self._all_fnames if self._file_ext(fname) in Image.EXTENSION)
334
+ if len(self._image_fnames) == 0:
335
+ raise IOError('No image files found in the specified path')
336
+
337
+ name = os.path.splitext(os.path.basename(self._path))[0]
338
+ raw_shape = [len(self._image_fnames)] + list(self._load_raw_image(0).shape)
339
+ if resolution is not None and (raw_shape[2] != resolution or raw_shape[3] != resolution):
340
+ raise IOError('Image files do not match the specified resolution')
341
+ super().__init__(name=name, raw_shape=raw_shape, **super_kwargs)
342
+
343
+ @staticmethod
344
+ def _file_ext(fname):
345
+ return os.path.splitext(fname)[1].lower()
346
+
347
+ def _get_zipfile(self):
348
+ assert self._type == 'zip'
349
+ if self._zipfile is None:
350
+ self._zipfile = zipfile.ZipFile(self._path)
351
+ return self._zipfile
352
+
353
+ def _open_file(self, fname):
354
+ if self._type == 'dir':
355
+ return open(os.path.join(self._path, fname), 'rb')
356
+ if self._type == 'zip':
357
+ return self._get_zipfile().open(fname, 'r')
358
+ return None
359
+
360
+ def close(self):
361
+ try:
362
+ if self._zipfile is not None:
363
+ self._zipfile.close()
364
+ finally:
365
+ self._zipfile = None
366
+
367
+ def __getstate__(self):
368
+ return dict(super().__getstate__(), _zipfile=None)
369
+
370
+ def _load_raw_data(self, raw_idx):
371
+ image = self._load_raw_image(raw_idx)
372
+ assert image.dtype == np.uint8
373
+ label = self._get_raw_labels()[raw_idx]
374
+ return image, label
375
+
376
+ def _load_raw_image(self, raw_idx):
377
+ fname = self._image_fnames[raw_idx]
378
+ with self._open_file(fname) as f:
379
+ image = np.array(Image.open(f))
380
+ if image.ndim == 2:
381
+ image = image[:, :, np.newaxis] # HW => HWC
382
+ image = image.transpose(2, 0, 1) # HWC => CHW
383
+ return image
384
+
385
+ def _get_raw_labels(self):
386
+ if self._raw_labels is None:
387
+ self._raw_labels = self._load_raw_labels() if self._use_labels else None
388
+ if self._raw_labels is None:
389
+ self._raw_labels = np.zeros([self._raw_shape[0], 0], dtype=np.float32)
390
+ assert isinstance(self._raw_labels, np.ndarray)
391
+ assert self._raw_labels.shape[0] == self._raw_shape[0]
392
+ assert self._raw_labels.dtype in [np.float32, np.int64]
393
+ if self._raw_labels.dtype == np.int64:
394
+ assert self._raw_labels.ndim == 1
395
+ assert np.all(self._raw_labels >= 0)
396
+ return self._raw_labels
397
+
398
+ def _load_raw_labels(self):
399
+ fname = 'dataset.json'
400
+ if fname not in self._all_fnames:
401
+ return None
402
+ with self._open_file(fname) as f:
403
+ labels = json.load(f)['labels']
404
+ if labels is None:
405
+ return None
406
+ labels = dict(labels)
407
+ labels = [labels[fname.replace('\\', '/')] for fname in self._image_fnames]
408
+ labels = np.array(labels)
409
+ labels = labels.astype({1: np.int64, 2: np.float32}[labels.ndim])
410
+ return labels
411
+
412
+ # ----------------------------------------------------------------------------
train_utils/helper.py ADDED
@@ -0,0 +1,69 @@
1
+ # MIT License
2
+
3
+ # Copyright (c) [2023] [Anima-Lab]
4
+ from collections import OrderedDict
5
+ import torch
6
+ import numpy as np
7
+
8
+
9
+ def get_mask_ratio_fn(name='constant', ratio_scale=0.5, ratio_min=0.0):
10
+ if name == 'cosine2':
11
+ return lambda x: (ratio_scale - ratio_min) * np.cos(np.pi * x / 2) ** 2 + ratio_min
12
+ elif name == 'cosine3':
13
+ return lambda x: (ratio_scale - ratio_min) * np.cos(np.pi * x / 2) ** 3 + ratio_min
14
+ elif name == 'cosine4':
15
+ return lambda x: (ratio_scale - ratio_min) * np.cos(np.pi * x / 2) ** 4 + ratio_min
16
+ elif name == 'cosine5':
17
+ return lambda x: (ratio_scale - ratio_min) * np.cos(np.pi * x / 2) ** 5 + ratio_min
18
+ elif name == 'cosine6':
19
+ return lambda x: (ratio_scale - ratio_min) * np.cos(np.pi * x / 2) ** 6 + ratio_min
20
+ elif name == 'exp':
21
+ return lambda x: (ratio_scale - ratio_min) * np.exp(-x * 7) + ratio_min
22
+ elif name == 'linear':
23
+ return lambda x: (ratio_scale - ratio_min) * x + ratio_min
24
+ elif name == 'constant':
25
+ return lambda x: ratio_scale
26
+ else:
27
+ raise ValueError('Unknown mask ratio function: {}'.format(name))
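+ # Rough sketch of how the schedules above behave, where x is the training
+ # progress in [0, 1] (see the call site in train.py):
+ #   fn = get_mask_ratio_fn('cosine2', ratio_scale=0.5, ratio_min=0.1)
+ #   fn(0.0) -> 0.5   # starts at ratio_scale
+ #   fn(1.0) -> 0.1   # decays to ratio_min
+ #   get_mask_ratio_fn('constant', 0.5)(x) -> 0.5 for every x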
28
+
29
+
30
+ def get_one_hot(labels, num_classes=1000):
31
+ one_hot = torch.zeros(labels.shape[0], num_classes, device=labels.device)
32
+ one_hot.scatter_(1, labels.view(-1, 1), 1)
33
+ return one_hot
34
+
35
+
36
+ def requires_grad(model, flag=True):
37
+ """
38
+ Set requires_grad flag for all parameters in a model.
39
+ """
40
+ for p in model.parameters():
41
+ p.requires_grad = flag
42
+
43
+
44
+ # ------------------------------------------------------------
45
+ # Training Helper Function
46
+
47
+ @torch.no_grad()
48
+ def update_ema(ema_model, model, decay=0.9999):
49
+ """
50
+ Step the EMA model towards the current model.
51
+ """
52
+ ema_params = OrderedDict(ema_model.named_parameters())
53
+ model_params = OrderedDict(model.named_parameters())
54
+
55
+ for name, param in model_params.items():
56
+ if param.requires_grad:
57
+ ema_name = name.replace('_orig_mod.', '')
58
+ ema_params[ema_name].mul_(decay).add_(param.data, alpha=1 - decay)
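+ # i.e. ema = decay * ema + (1 - decay) * param for every trainable parameter;
+ # stripping the '_orig_mod.' prefix keeps names aligned when `model` has been
+ # wrapped by torch.compile.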
59
+
60
+
61
+ def unwrap_model(model):
62
+ """
63
+ Unwrap a model from any distributed or compiled wrappers.
64
+ """
65
+ if isinstance(model, torch._dynamo.eval_frame.OptimizedModule):
66
+ model = model._orig_mod
67
+ if isinstance(model, (torch.nn.parallel.DistributedDataParallel, torch.nn.DataParallel)):
68
+ model = model.module
69
+ return model
train_utils/loss.py ADDED
@@ -0,0 +1,101 @@
1
+ # MIT License
2
+
3
+ # Copyright (c) [2023] [Anima-Lab]
4
+
5
+ # This code is adapted from https://github.com/NVlabs/edm/blob/main/training/loss.py.
6
+ # The original code is licensed under a Creative Commons
7
+ # Attribution-NonCommercial-ShareAlike 4.0 International License, which can be found at licenses/LICENSE_EDM.txt.
8
+
9
+ """Loss functions used in the paper
10
+ "Elucidating the Design Space of Diffusion-Based Generative Models"."""
11
+
12
+ import torch
13
+ import torch.nn.functional as F
14
+
15
+ from utils import *
16
+ from train_utils.helper import unwrap_model
17
+
18
+
19
+ # Improved loss function proposed in the paper "Elucidating the Design Space
20
+ # of Diffusion-Based Generative Models" (EDM).
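+ # In short, the preconditioned denoiser D is trained with the sigma-weighted objective
+ #   L = E_{sigma, n} [ lambda(sigma) * || D(y + n; sigma) - y ||^2 ],
+ # where n ~ N(0, sigma^2 I), log(sigma) ~ N(P_mean, P_std^2), and
+ #   lambda(sigma) = (sigma^2 + sigma_data^2) / (sigma * sigma_data)^2
+ # is the `weight` computed in __call__ below.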
21
+
22
+ class EDMLoss:
23
+ def __init__(self, P_mean=-1.2, P_std=1.2, sigma_data=0.5):
24
+ self.P_mean = P_mean
25
+ self.P_std = P_std
26
+ self.sigma_data = sigma_data
27
+
28
+ def __call__(self, net,
29
+ images,
30
+ labels=None,
31
+ mask_ratio=0,
32
+ mae_loss_coef=0,
33
+ feat=None, augment_pipe=None):
34
+ # sample x_t
35
+ rnd_normal = torch.randn([images.shape[0], 1, 1, 1], device=images.device)
36
+ sigma = (rnd_normal * self.P_std + self.P_mean).exp()
37
+ weight = (sigma ** 2 + self.sigma_data ** 2) / (sigma * self.sigma_data) ** 2
38
+ y, augment_labels = augment_pipe(images) if augment_pipe is not None else (images, None)
39
+ n = torch.randn_like(y) * sigma
40
+
41
+ model_out = net(y + n, sigma, labels, mask_ratio=mask_ratio, mask_dict=None, feat=feat)
42
+ D_yn = model_out['x']
43
+ assert D_yn.shape == y.shape
44
+ loss = weight * ((D_yn - y) ** 2) # (N, C, H, W)
45
+ if mask_ratio > 0:
46
+ assert net.training and 'mask' in model_out
47
+ loss = F.avg_pool2d(loss.mean(dim=1), net.module.model.patch_size).flatten(1) # (N, L)
48
+ unmask = 1 - model_out['mask']
49
+ loss = (loss * unmask).sum(dim=1) / unmask.sum(dim=1) # (N)
50
+ assert loss.ndim == 1
51
+ if mae_loss_coef > 0:
52
+ loss += mae_loss_coef * mae_loss(net.module, y + n, D_yn, 1 - unmask)
53
+ else:
54
+ loss = mean_flat(loss) # (N)
55
+
56
+ raw_net = unwrap_model(net)
57
+ if mask_ratio == 0.0 and raw_net.model.mask_token is not None:
58
+ loss += 0 * torch.sum(raw_net.model.mask_token)
59
+ assert loss.ndim == 1
60
+ return loss
61
+
62
+
63
+ # ----------------------------------------------------------------------------
64
+
65
+
66
+ Losses = {
67
+ 'edm': EDMLoss
68
+ }
69
+
70
+
71
+ # ----------------------------------------------------------------------------
72
+
73
+ def patchify(imgs, patch_size=2, num_channels=4):
74
+ """
75
+ imgs: (N, C, H, W)
76
+ x: (N, L, patch_size**2 * C)
77
+ """
78
+ p, c = patch_size, num_channels
79
+ assert imgs.shape[2] == imgs.shape[3] and imgs.shape[2] % p == 0
80
+
81
+ h = w = imgs.shape[2] // p
82
+ x = imgs.reshape(shape=(imgs.shape[0], c, h, p, w, p))
83
+ x = torch.einsum('nchpwq->nhwpqc', x)
84
+ x = x.reshape(shape=(imgs.shape[0], h * w, p**2 * c))
85
+ return x
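+ # e.g. with patch_size=2 and num_channels=4 (the latent-space defaults here), a
+ # batch of (N, 4, 32, 32) latents becomes (N, 16*16, 2*2*4) = (N, 256, 16) tokens.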
86
+
87
+
88
+ def mae_loss(net, target, pred, mask, norm_pix_loss=True):
89
+ target = patchify(target, net.model.patch_size, net.model.out_channels)
90
+ pred = patchify(pred, net.model.patch_size, net.model.out_channels)
91
+ if norm_pix_loss:
92
+ mean = target.mean(dim=-1, keepdim=True)
93
+ var = target.var(dim=-1, keepdim=True)
94
+ target = (target - mean) / (var + 1.e-6)**.5
95
+
96
+ loss = (pred - target) ** 2
97
+ loss = loss.mean(dim=-1) # [N, L], mean loss per patch
98
+
99
+ loss = (loss * mask).sum(dim=1) / mask.sum(dim=1) # mean loss on removed patches, (N)
100
+ assert loss.ndim == 1
101
+ return loss
train_wds.py ADDED
@@ -0,0 +1,400 @@
1
+ # MIT License
2
+
3
+ # Copyright (c) [2023] [Anima-Lab]
4
+ '''
5
+ Training MaskDiT on a latent dataset in WebDataset format. Used for the experiments on ImageNet 512x512.
6
+ '''
7
+
8
+ import argparse
9
+ import os.path
10
+ from copy import deepcopy
11
+ from time import time
12
+ from omegaconf import OmegaConf
13
+ import pickle
14
+ from itertools import islice
15
+
16
+ import apex
17
+ import torch
18
+ import webdataset as wds
19
+
20
+ import accelerate
21
+
22
+ from fid import calc
23
+ from models.maskdit import Precond_models
24
+ from train_utils.loss import Losses
25
+
26
+ from train_utils.helper import get_mask_ratio_fn, get_one_hot, requires_grad, update_ema, unwrap_model
27
+
28
+ from sample import generate_with_net
29
+ from utils import dist, mprint, get_latest_ckpt, Logger, sample, \
30
+ str2bool, parse_str_none, parse_int_list, parse_float_none
31
+
32
+
33
+ # ------------------------------------------------------------
34
+ # WebDataset Helper Function
35
+ def nodesplitter(src, group=None):
36
+ rank, world_size, worker, num_workers = wds.utils.pytorch_worker_info()
37
+ if world_size > 1:
38
+ for s in islice(src, rank, None, world_size):
39
+ yield s
40
+ else:
41
+ for s in src:
42
+ yield s
43
+
44
+
45
+ def get_file_paths(dir):
46
+ return [os.path.join(dir, file) for file in os.listdir(dir)]
47
+
48
+
49
+ def split_by_proc(data_list, global_rank, total_size):
50
+ '''
51
+ Evenly split the data_list into total_size parts and return the part indexed by global_rank.
52
+ '''
53
+ assert len(data_list) >= total_size
54
+ assert global_rank < total_size
55
+ return data_list[global_rank::total_size]
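+ # e.g. with 8 shard files and total_size=4, global_rank=1 receives files [1, 5].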
56
+
57
+
58
+ def decode_data(item):
59
+ output = {}
60
+ img = pickle.loads(item['latent'])
61
+ output['latent'] = img
62
+ label = int(item['cls'].decode('utf-8'))
63
+ output['label'] = label
64
+ return output
65
+
66
+
67
+ def make_loader(root, mode='train', batch_size=32,
68
+ num_workers=4, cache_dir=None,
69
+ resampled=False, world_size=1, total_num=1281167,
70
+ bufsize=1000, initial=100):
71
+ data_list = get_file_paths(root)
72
+ num_batches_in_total = total_num // (batch_size * world_size)
73
+ if resampled:
74
+ repeat = True
75
+ splitter = False
76
+ else:
77
+ repeat = False
78
+ splitter = nodesplitter
79
+ dataset = (
80
+ wds.WebDataset(
81
+ data_list,
82
+ cache_dir=cache_dir,
83
+ repeat=repeat,
84
+ resampled=resampled,
85
+ handler=wds.handlers.warn_and_stop,
86
+ nodesplitter=splitter,
87
+ )
88
+ .shuffle(bufsize, initial=initial)
89
+ .map(decode_data, handler=wds.handlers.warn_and_stop)
90
+ .to_tuple('latent label')
91
+ .batched(batch_size, partial=False)
92
+ )
93
+
94
+ loader = wds.WebLoader(dataset, batch_size=None, num_workers=num_workers, shuffle=False, persistent_workers=True)
95
+ if resampled:
96
+ loader = loader.with_epoch(num_batches_in_total)
97
+ return loader
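+ # Note on the two modes above: with resampled=True, shards are sampled with
+ # replacement and `.with_epoch(num_batches_in_total)` defines a nominal epoch of
+ # total_num // (batch_size * world_size) batches; otherwise `nodesplitter` hands
+ # whole shards to each rank and the natural end of the shards ends the epoch.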
98
+
99
+
100
+ # ------------------------------------------------------------
101
+
102
+
103
+ def train_loop(args):
104
+ # load configuration
105
+ config = OmegaConf.load(args.config)
106
+ if not args.no_amp:
107
+ config.train.amp = 'fp16'
108
+ else:
109
+ config.train.amp = 'no'
110
+ if config.train.tf32:
111
+ torch.set_float32_matmul_precision('high')
112
+
113
+ accelerator = accelerate.Accelerator(mixed_precision=config.train.amp,
114
+ gradient_accumulation_steps=config.train.grad_accum,
115
+ log_with='wandb')
116
+ # setup wandb
117
+ if args.use_wandb:
118
+ wandb_init_kwargs = {
119
+ 'entity': config.wandb.entity,
120
+ 'project': config.wandb.project,
121
+ 'group': config.wandb.group,
122
+ }
123
+ accelerator.init_trackers(config.wandb.project, config=OmegaConf.to_container(config), init_kwargs=wandb_init_kwargs)
124
+
125
+ mprint('start training...')
126
+ size = accelerator.num_processes
127
+ rank = accelerator.process_index
128
+
129
+ print(f'global_rank: {rank}, global_size: {size}')
130
+ device = accelerator.device
131
+
132
+ seed = args.global_seed
133
+ torch.manual_seed(seed)
134
+
135
+ mprint(f"enable_amp: {not args.no_amp}, TF32: {config.train.tf32}")
136
+ # Select batch size per GPU
137
+ num_accumulation_rounds = config.train.grad_accum
138
+
139
+ micro_batch = config.train.batchsize
140
+ batch_gpu_total = micro_batch * num_accumulation_rounds
141
+ global_batch_size = batch_gpu_total * size
142
+ mprint(f"Global batchsize: {global_batch_size}, batchsize per GPU: {batch_gpu_total}, micro_batch: {micro_batch}.")
143
+
144
+ class_dropout_prob = config.model.class_dropout_prob
145
+ log_every = config.log.log_every
146
+ ckpt_every = config.log.ckpt_every
147
+
148
+ mask_ratio_fn = get_mask_ratio_fn(config.model.mask_ratio_fn, config.model.mask_ratio, config.model.mask_ratio_min)
149
+
150
+ # Setup an experiment folder
151
+ model_name = config.model.model_type.replace("/", "-") # e.g., DiT-XL/2 --> DiT-XL-2 (for naming folders)
152
+ data_name = config.data.dataset
153
+ if args.ckpt_path is not None and args.use_ckpt_path: # use the existing exp path (mainly used for fine-tuning)
154
+ checkpoint_dir = os.path.dirname(args.ckpt_path)
155
+ experiment_dir = os.path.dirname(checkpoint_dir)
156
+ exp_name = os.path.basename(experiment_dir)
157
+ else: # start a new exp path (and resume from the latest checkpoint if possible)
158
+ cond_gen = 'cond' if config.model.num_classes else 'uncond'
159
+ exp_name = f'{model_name}-{config.model.precond}-{data_name}-{cond_gen}-m{config.model.mask_ratio}-de{int(config.model.use_decoder)}' \
160
+ f'-mae{config.model.mae_loss_coef}-bs-{global_batch_size}-lr{config.train.lr}{config.log.tag}'
161
+ experiment_dir = f"{args.results_dir}/{exp_name}"
162
+ checkpoint_dir = f"{experiment_dir}/checkpoints" # Stores saved model checkpoints
163
+ os.makedirs(checkpoint_dir, exist_ok=True)
164
+ if args.ckpt_path is None:
165
+ args.ckpt_path = get_latest_ckpt(checkpoint_dir) # Resumes from the latest checkpoint if it exists
166
+
167
+ if accelerator.is_main_process:
168
+ logger = Logger(file_name=f'{experiment_dir}/log.txt', file_mode="a+", should_flush=True)
169
+ mprint(f"Experiment directory created at {experiment_dir}")
170
+
171
+ # Setup dataset
172
+ loader = make_loader(config.data.root,
173
+ mode='train',
174
+ batch_size=batch_gpu_total,
175
+ num_workers=args.num_workers,
176
+ resampled=args.resample,
177
+ world_size=size,
178
+ total_num=config.data.total_num)
179
+
180
+ steps_per_epoch = config.data.total_num // global_batch_size
181
+ mprint(f"{steps_per_epoch} steps per epoch")
182
+
183
+ model = Precond_models[config.model.precond](
184
+ img_resolution=config.model.in_size,
185
+ img_channels=config.model.in_channels,
186
+ num_classes=config.model.num_classes,
187
+ model_type=config.model.model_type,
188
+ use_decoder=config.model.use_decoder,
189
+ mae_loss_coef=config.model.mae_loss_coef,
190
+ pad_cls_token=config.model.pad_cls_token
191
+ ).to(device)
192
+ # Note that parameter initialization is done within the model constructor
193
+ ema = deepcopy(model) # Create an EMA of the model for use after training
194
+ requires_grad(ema, False)
195
+ ema = ema.to(device)
196
+
197
+ mprint(f"{config.model.model_type} ((use_decoder: {config.model.use_decoder})) Model Parameters: {sum(p.numel() for p in model.parameters()):,}")
198
+ mprint(f'extras: {model.model.extras}, cls_token: {model.model.cls_token}')
199
+
200
+ # Setup optimizer (we used default Adam betas=(0.9, 0.999) and a constant learning rate of 1e-4 in our paper):
201
+ optimizer = apex.optimizers.FusedAdam(model.parameters(), lr=config.train.lr, adam_w_mode=True, weight_decay=0)
202
+ # optimizer = torch.optim.Adam(model.parameters(), lr=config.train.lr)
203
+
204
+ # Load checkpoints
205
+ train_steps_start = 0
206
+ epoch_start = 0
207
+
208
+ if args.ckpt_path is not None:
209
+ ckpt = torch.load(args.ckpt_path, map_location=device)
210
+ model.load_state_dict(ckpt['model'], strict=args.use_strict_load)
211
+ ema.load_state_dict(ckpt['ema'], strict=args.use_strict_load)
212
+ mprint(f'Load weights from {args.ckpt_path}')
213
+ if args.use_strict_load:
214
+ optimizer.load_state_dict(ckpt['opt'])
215
+ for state in optimizer.state.values():
216
+ for k, v in state.items():
217
+ if isinstance(v, torch.Tensor):
218
+ state[k] = v.cuda()
219
+ mprint(f'Load optimizer state..')
220
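+ # Checkpoint filenames encode the global training step (e.g. 0050000.pt), so the resumed step counter is parsed from the path.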
+ train_steps_start = int(os.path.basename(args.ckpt_path).split('.pt')[0])
221
+ epoch_start = train_steps_start // steps_per_epoch
222
+ mprint(f"train_steps_start: {train_steps_start}")
223
+ del ckpt # conserve memory
224
+
225
+ # FID evaluation for the loaded weights
226
+ if args.enable_eval:
227
+ start_time = time()
228
+ args.outdir = os.path.join(experiment_dir, 'fid', f'edm-steps{args.num_steps}-ckpt{train_steps_start}_cfg{args.cfg_scale}')
229
+ os.makedirs(args.outdir, exist_ok=True)
230
+ generate_with_net(args, ema, device, rank, size)
231
+
232
+ dist.barrier()
233
+ fid = calc(args.outdir, config.eval.ref_path, args.num_expected, args.global_seed, args.fid_batch_size)
234
+ mprint(f"time for fid calc: {time() - start_time}")
235
+ if args.use_wandb:
236
+ accelerator.log({"eval/fid": fid}, step=train_steps_start)
237
+ mprint(f'guidance: {args.cfg_scale} FID: {fid}')
238
+ dist.barrier()
239
+
240
+ model, optimizer = accelerator.prepare(model, optimizer)
241
+ model = torch.compile(model)
242
+
243
+ # Setup loss
244
+ loss_fn = Losses[config.model.precond]()
245
+
246
+ # Prepare models for training:
247
+ if args.ckpt_path is None:
248
+ assert train_steps_start == 0
249
+ raw_model = unwrap_model(model)
250
+ update_ema(ema, raw_model, decay=0) # Ensure EMA is initialized with synced weights
251
+ model.train() # important! This enables embedding dropout for classifier-free guidance
252
+ ema.eval() # EMA model should always be in eval mode
253
+
254
+ # Variables for monitoring/logging purposes:
255
+ train_steps = train_steps_start
256
+ log_steps = 0
257
+ running_loss = 0
258
+ start_time = time()
259
+ mprint(f"Training for {config.train.epochs} epochs...")
260
+ for epoch in range(epoch_start, config.train.epochs):
261
+ mprint(f"Beginning epoch {epoch}...")
262
+ for x, cond in loader:
263
+ x = x.to(device)
264
+ y = cond.to(device)
265
+
266
+ y = get_one_hot(y, num_classes=config.model.num_classes)
267
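+ # x holds pre-extracted VAE posterior moments; sample() in utils.py draws a scaled latent from them.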
+ x = sample(x)
268
+
269
+ loss_batch = 0
270
+ model.zero_grad(set_to_none=True)
271
+ curr_mask_ratio = mask_ratio_fn((train_steps - train_steps_start) / config.train.max_num_steps)
272
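+ # Label dropout for classifier-free guidance: zero the one-hot labels of a random subset of samples so the model also learns the unconditional distribution.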
+ if class_dropout_prob > 0:
273
+ y = y * (torch.rand([y.shape[0], 1], device=device) >= class_dropout_prob)
274
+
275
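+ # Gradient accumulation: the per-GPU batch is sliced into micro-batches; accelerator.accumulate() defers gradient synchronization until all rounds are processed.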
+ for round_idx in range(num_accumulation_rounds):
276
+ x_ = x[round_idx * micro_batch:(round_idx + 1) * micro_batch]
277
+ y_ = y[round_idx * micro_batch:(round_idx + 1) * micro_batch]
278
+
279
+ with accelerator.accumulate(model):
280
+ loss = loss_fn(net=model, images=x_, labels=y_,
281
+ mask_ratio=curr_mask_ratio,
282
+ mae_loss_coef=config.model.mae_loss_coef)
283
+ loss_mean = loss.mean()
284
+ accelerator.backward(loss_mean)
285
+ # Update weights with lr warmup.
286
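+ # Linear warmup: lr ramps from 0 to config.train.lr over the first lr_rampup_kimg thousand training images, then stays constant.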
+ lr_cur = config.train.lr * min(train_steps * global_batch_size / max(config.train.lr_rampup_kimg * 1000, 1), 1)
287
+ for g in optimizer.param_groups:
288
+ g['lr'] = lr_cur
289
+ optimizer.step()
290
+ loss_batch = loss_mean.item()
291
+
292
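+ # Update the EMA copy of the weights after each optimizer step; the EMA model is the one used for FID sampling.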
+ raw_model = unwrap_model(model)
293
+ update_ema(ema, raw_model)
294
+
295
+ # Log loss values:
296
+ running_loss += loss_batch
297
+ log_steps += 1
298
+ train_steps += 1
299
+ if train_steps > (train_steps_start + config.train.max_num_steps):
300
+ break
301
+ if train_steps % log_every == 0:
302
+ # Measure training speed:
303
+ torch.cuda.synchronize()
304
+ end_time = time()
305
+ steps_per_sec = log_steps / (end_time - start_time)
306
+ # Reduce loss history over all processes:
307
+ avg_loss = torch.tensor(running_loss / log_steps, device=device)
308
+
309
+ dist.all_reduce(avg_loss, op=dist.ReduceOp.SUM)
310
+ avg_loss = avg_loss.item() / size
311
+ mprint(f"(step={train_steps:07d}) Train Loss: {avg_loss:.4f}, Train Steps/Sec: {steps_per_sec:.2f}")
312
+ mprint(f'Peak GPU memory usage: {torch.cuda.max_memory_allocated() / 1024 ** 3:.2f} GB')
313
+ mprint(f'Reserved GPU memory: {torch.cuda.memory_reserved() / 1024 ** 3:.2f} GB')
314
+ if args.use_wandb:
315
+ accelerator.log({"train/loss": avg_loss, "train/lr": lr_cur}, step=train_steps)
316
+ # Reset monitoring variables:
317
+ running_loss = 0
318
+ log_steps = 0
319
+ start_time = time()
320
+
321
+ # Save checkpoint:
322
+ if train_steps % ckpt_every == 0 and train_steps > train_steps_start:
323
+ if accelerator.is_main_process:
324
+ checkpoint = {
325
+ "model": raw_model.state_dict(),
326
+ "ema": ema.state_dict(),
327
+ "opt": optimizer.state_dict(),
328
+ "args": args
329
+ }
330
+ checkpoint_path = f"{checkpoint_dir}/{train_steps:07d}.pt"
331
+ torch.save(checkpoint, checkpoint_path)
332
+ mprint(f"Saved checkpoint to {checkpoint_path}")
333
+ del checkpoint # conserve memory
334
+ dist.barrier()
335
+
336
+ # FID evaluation during training
337
+ if args.enable_eval:
338
+ start_time = time()
339
+ args.outdir = os.path.join(experiment_dir, 'fid', f'edm-steps{args.num_steps}-ckpt{train_steps}_cfg{args.cfg_scale}')
340
+ os.makedirs(args.outdir, exist_ok=True)
341
+ generate_with_net(args, ema, device, rank, size)
342
+
343
+ dist.barrier()
344
+ fid = calc(args.outdir, args.ref_path, args.num_expected, args.global_seed, args.fid_batch_size)
345
+ mprint(f"time for fid calc: {time() - start_time}, fid: {fid}")
346
+ if args.use_wandb:
347
+ accelerator.log({"eval/fid": fid}, step=train_steps)
348
+ mprint(f'Guidance: {args.cfg_scale}, FID: {fid}')
349
+ dist.barrier()
350
+ start_time = time()
351
+
352
+ if accelerator.is_main_process:
353
+ logger.close()
354
+ accelerator.end_training()
355
+
356
+
357
+
358
+ if __name__ == '__main__':
359
+ parser = argparse.ArgumentParser('training parameters')
360
+ # basic config
361
+ parser.add_argument('--config', type=str, required=True, help='path to config file')
362
+ # training
363
+ parser.add_argument("--results_dir", type=str, default="results")
364
+ parser.add_argument("--ckpt_path", type=parse_str_none, default=None)
365
+
366
+ parser.add_argument("--global_seed", type=int, default=0)
367
+ parser.add_argument("--num_workers", type=int, default=4)
368
+ parser.add_argument('--no_amp', action='store_true', help="Disable automatic mixed precision.")
369
+
370
+ parser.add_argument("--use_wandb", action='store_true', help='enable wandb logging')
371
+ parser.add_argument("--use_ckpt_path", type=str2bool, default=True)
372
+ parser.add_argument("--use_strict_load", type=str2bool, default=True)
373
+ parser.add_argument("--tag", type=str, default='')
374
+ parser.add_argument("--resample", action='store_true', help='enable shard resample')
375
+
376
+ # sampling
377
+ parser.add_argument('--enable_eval', action='store_true', help='enable fid calc during training')
378
+ parser.add_argument('--seeds', type=parse_int_list, default='100000-104999', help='Random seeds (e.g. 1,2,5-10)')
379
+ parser.add_argument('--subdirs', action='store_true', help='Create subdirectory for every 1000 seeds')
380
+ parser.add_argument('--class_idx', type=int, default=None, help='Class label [default: random]')
381
+ parser.add_argument('--max_batch_size', type=int, default=25, help='Maximum batch size per GPU during sampling, must be a factor of 50k if torch.compile is used')
382
+
383
+ parser.add_argument("--cfg_scale", type=parse_float_none, default=None, help='None = no guidance, by default = 4.0')
384
+
385
+ parser.add_argument('--num_steps', type=int, default=40, help='Number of sampling steps')
386
+ parser.add_argument('--S_churn', type=int, default=0, help='Stochasticity strength')
387
+ parser.add_argument('--solver', type=str, default=None, choices=['euler', 'heun'], help='Ablate ODE solver')
388
+ parser.add_argument('--discretization', type=str, default=None, choices=['vp', 've', 'iddpm', 'edm'], help='Ablate time step discretization')
389
+ parser.add_argument('--schedule', type=str, default=None, choices=['vp', 've', 'linear'], help='Ablate noise schedule sigma(t)')
390
+ parser.add_argument('--scaling', type=str, default=None, choices=['vp', 'none'], help='Ablate signal scaling s(t)')
391
+ parser.add_argument('--pretrained_path', type=str, default='assets/stable_diffusion/autoencoder_kl.pth', help='Autoencoder ckpt')
392
+
393
+ parser.add_argument('--ref_path', type=str, default='assets/fid_stats/VIRTUAL_imagenet512.npz', help='Dataset reference statistics')
394
+ parser.add_argument('--num_expected', type=int, default=5000, help='Number of images to use')
395
+ parser.add_argument('--fid_batch_size', type=int, default=64, help='Maximum batch size per GPU')
396
+
397
+ args = parser.parse_args()
398
+
399
+ torch.backends.cudnn.benchmark = True
400
+ train_loop(args)
utils.py ADDED
@@ -0,0 +1,225 @@
1
+ # MIT License
2
+
3
+ # Copyright (c) [2023] [Anima-Lab]
4
+
5
+ # This code is adapted from https://github.com/NVlabs/edm/blob/main/generate.py.
6
+ # The original code is licensed under a Creative Commons
7
+ # Attribution-NonCommercial-ShareAlike 4.0 International License, which can be found at licenses/LICENSE_EDM.txt.
8
+
9
+
10
+ import os
11
+ import re
12
+ import sys
13
+ import contextlib
14
+
15
+ import torch
16
+ import torch.distributed as dist
17
+
18
+
19
+ #----------------------------------------------------------------------------
20
+ # Get the latest checkpoint from the save dir
21
+
22
+ def get_latest_ckpt(dir):
23
+ latest_id = -1
24
+ for file in os.listdir(dir):
25
+ if file.endswith('.pt'):
26
+ m = re.search(r'(\d+)\.pt', file)
27
+ if m:
28
+ ckpt_id = int(m.group(1))
29
+ latest_id = max(latest_id, ckpt_id)
30
+ if latest_id == -1:
31
+ return None
32
+ else:
33
+ ckpt_path = os.path.join(dir, f'{latest_id:07d}.pt')
34
+ return ckpt_path
35
+
36
+
37
+ def get_ckpt_paths(dir, id_min, id_max):
38
+ ckpt_dict = {}
39
+ for file in os.listdir(dir):
40
+ if file.endswith('.pt'):
41
+ m = re.search(r'(\d+)\.pt', file)
42
+ if m:
43
+ ckpt_id = int(m.group(1))
44
+ if id_min <= ckpt_id <= id_max:
45
+ ckpt_dict[ckpt_id] = os.path.join(dir, f'{ckpt_id:07d}.pt')
46
+ return ckpt_dict
47
+
48
+
49
+ #----------------------------------------------------------------------------
50
+ # Take the mean over all non-batch dimensions.
51
+
52
+ def mean_flat(tensor):
53
+ return tensor.mean(dim=list(range(1, tensor.ndim)))
54
+
55
+
56
+ #----------------------------------------------------------------------------
57
+ # Convert latent (mean, logvar) to latent variable (inherited from autoencoder.py)
58
+
59
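+ # moments is the concatenated (mean, logvar) of the VAE posterior; a latent is drawn via the reparameterization trick
+ # z = mean + std * eps and multiplied by the Stable Diffusion latent scale factor (0.18215 by default).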
+ def sample(moments, scale_factor=0.18215):
60
+ mean, logvar = torch.chunk(moments, 2, dim=1)
61
+ logvar = torch.clamp(logvar, -30.0, 20.0)
62
+ std = torch.exp(0.5 * logvar)
63
+ z = mean + std * torch.randn_like(mean)
64
+ z = scale_factor * z
65
+ return z
66
+
67
+
68
+ #----------------------------------------------------------------------------
69
+ # Context manager for easily enabling/disabling DistributedDataParallel
70
+ # synchronization.
71
+
72
+ @contextlib.contextmanager
73
+ def ddp_sync(module, sync):
74
+ assert isinstance(module, torch.nn.Module)
75
+ if sync or not isinstance(module, torch.nn.parallel.DistributedDataParallel):
76
+ yield
77
+ else:
78
+ with module.no_sync():
79
+ yield
80
+
81
+ #----------------------------------------------------------------------------
82
+ # Distributed training helper functions
83
+
84
+ def init_processes(fn, args):
85
+ """ Initialize the distributed environment. """
86
+ os.environ['MASTER_ADDR'] = args.master_address
87
+ os.environ['MASTER_PORT'] = '6020'
88
+ print(f'MASTER_ADDR = {os.environ["MASTER_ADDR"]}')
89
+ print(f'MASTER_PORT = {os.environ["MASTER_PORT"]}')
90
+ torch.cuda.set_device(args.local_rank)
91
+ dist.init_process_group(backend='nccl', init_method='env://', rank=args.global_rank, world_size=args.global_size)
92
+ fn(args)
93
+ if args.global_size > 1:
94
+ cleanup()
95
+
96
+
97
+ def mprint(*args, **kwargs):
98
+ """
99
+ Print only from rank 0.
100
+ """
101
+ if dist.get_rank() == 0:
102
+ print(*args, **kwargs)
103
+
104
+
105
+ def cleanup():
106
+ """
107
+ End DDP training.
108
+ """
109
+ dist.barrier()
110
+ mprint("Done!")
111
+ dist.barrier()
112
+ dist.destroy_process_group()
113
+
114
+
115
+ #----------------------------------------------------------------------------
116
+ # Wrapper for torch.Generator that allows specifying a different random seed
117
+ # for each sample in a minibatch.
118
+
119
+ class StackedRandomGenerator:
120
+ def __init__(self, device, seeds):
121
+ super().__init__()
122
+ self.generators = [torch.Generator(device).manual_seed(int(seed) % (1 << 32)) for seed in seeds]
123
+
124
+ def randn(self, size, **kwargs):
125
+ assert size[0] == len(self.generators)
126
+ return torch.stack([torch.randn(size[1:], generator=gen, **kwargs) for gen in self.generators])
127
+
128
+ def randn_like(self, input):
129
+ return self.randn(input.shape, dtype=input.dtype, layout=input.layout, device=input.device)
130
+
131
+ def randint(self, *args, size, **kwargs):
132
+ assert size[0] == len(self.generators)
133
+ return torch.stack([torch.randint(*args, size=size[1:], generator=gen, **kwargs) for gen in self.generators])
134
+
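+ # Usage sketch (values hypothetical):
+ #   rnd = StackedRandomGenerator(device, seeds=[0, 1, 2])
+ #   latents = rnd.randn([3, 4, 32, 32], device=device)
+ # Each row of latents is reproducible from its own seed, independent of batch composition.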
135
+
136
+ #----------------------------------------------------------------------------
137
+ # Parse a comma separated list of numbers or ranges and return a list of ints.
138
+ # Example: '1,2,5-10' returns [1, 2, 5, 6, 7, 8, 9, 10]
139
+
140
+ def parse_int_list(s):
141
+ if isinstance(s, list): return s
142
+ ranges = []
143
+ range_re = re.compile(r'^(\d+)-(\d+)$')
144
+ for p in s.split(','):
145
+ m = range_re.match(p)
146
+ if m:
147
+ ranges.extend(range(int(m.group(1)), int(m.group(2))+1))
148
+ else:
149
+ ranges.append(int(p))
150
+ return ranges
151
+
152
+ # Parse 'None' to None and others to float value
153
+ def parse_float_none(s):
154
+ assert isinstance(s, str)
155
+ return None if s == 'None' else float(s)
156
+
157
+ # Parse 'None' to None and others to str
158
+ def parse_str_none(s):
159
+ assert isinstance(s, str)
160
+ return None if s == 'None' else s
161
+
162
+ # Parse 'true' to True
163
+ def str2bool(s):
164
+ return s.lower() in ['true', '1', 'yes']
165
+
166
+
167
+ #----------------------------------------------------------------------------
168
+ # logging info.
169
+ class Logger(object):
170
+ """
171
+ Redirect stderr to stdout, optionally print stdout to a file,
172
+ and optionally force flushing on both stdout and the file.
173
+ """
174
+
175
+ def __init__(self, file_name=None, file_mode="w", should_flush=True):
176
+ self.file = None
177
+
178
+ if file_name is not None:
179
+ self.file = open(file_name, file_mode)
180
+
181
+ self.should_flush = should_flush
182
+ self.stdout = sys.stdout
183
+ self.stderr = sys.stderr
184
+
185
+ sys.stdout = self
186
+ sys.stderr = self
187
+
188
+ def __enter__(self):
189
+ return self
190
+
191
+ def __exit__(self, exc_type, exc_value, traceback):
192
+ self.close()
193
+
194
+ def write(self, text):
195
+ """Write text to stdout (and a file) and optionally flush."""
196
+ if len(text) == 0: # workaround for a bug in VSCode debugger: sys.stdout.write(''); sys.stdout.flush() => crash
197
+ return
198
+
199
+ if self.file is not None:
200
+ self.file.write(text)
201
+
202
+ self.stdout.write(text)
203
+
204
+ if self.should_flush:
205
+ self.flush()
206
+
207
+ def flush(self):
208
+ """Flush written text to both stdout and a file, if open."""
209
+ if self.file is not None:
210
+ self.file.flush()
211
+
212
+ self.stdout.flush()
213
+
214
+ def close(self):
215
+ """Flush, close possible files, and remove stdout/stderr mirroring."""
216
+ self.flush()
217
+
218
+ # if using multiple loggers, prevent closing in wrong order
219
+ if sys.stdout is self:
220
+ sys.stdout = self.stdout
221
+ if sys.stderr is self:
222
+ sys.stderr = self.stderr
223
+
224
+ if self.file is not None:
225
+ self.file.close()