StoryVisualization Task

#193
by Anyou - opened
README.md DELETED
@@ -1,207 +0,0 @@
1
- ---
2
- license: creativeml-openrail-m
3
- tags:
4
- - stable-diffusion
5
- - stable-diffusion-diffusers
6
- - text-to-image
7
- inference: true
8
- extra_gated_prompt: |-
9
- This model is open access and available to all, with a CreativeML OpenRAIL-M license further specifying rights and usage.
10
- The CreativeML OpenRAIL License specifies:
11
-
12
- 1. You can't use the model to deliberately produce nor share illegal or harmful outputs or content
13
- 2. CompVis claims no rights on the outputs you generate, you are free to use them and are accountable for their use which must not go against the provisions set in the license
14
- 3. You may re-distribute the weights and use the model commercially and/or as a service. If you do, please be aware you have to include the same use restrictions as the ones in the license and share a copy of the CreativeML OpenRAIL-M to all your users (please read the license entirely and carefully)
15
- Please read the full license carefully here: https://huggingface.co/spaces/CompVis/stable-diffusion-license
16
-
17
- extra_gated_heading: Please read the LICENSE to access this model
18
- ---
19
-
20
- # Stable Diffusion v1-5 Model Card
21
-
22
- Stable Diffusion is a latent text-to-image diffusion model capable of generating photo-realistic images given any text input.
23
- For more information about how Stable Diffusion functions, please have a look at [🤗's Stable Diffusion blog](https://huggingface.co/blog/stable_diffusion).
24
-
25
- The **Stable-Diffusion-v1-5** checkpoint was initialized with the weights of the [Stable-Diffusion-v1-2](https://huggingface.co/CompVis/stable-diffusion-v1-2)
26
- checkpoint and subsequently fine-tuned on 595k steps at resolution 512x512 on "laion-aesthetics v2 5+" and 10% dropping of the text-conditioning to improve [classifier-free guidance sampling](https://arxiv.org/abs/2207.12598).
27
-
28
- You can use this both with the [🧨Diffusers library](https://github.com/huggingface/diffusers) and the [RunwayML GitHub repository](https://github.com/runwayml/stable-diffusion).
29
-
30
- ### Diffusers
31
- ```py
32
- from diffusers import StableDiffusionPipeline
33
- import torch
34
-
35
- model_id = "runwayml/stable-diffusion-v1-5"
36
- pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
37
- pipe = pipe.to("cuda")
38
-
39
- prompt = "a photo of an astronaut riding a horse on mars"
40
- image = pipe(prompt).images[0]
41
-
42
- image.save("astronaut_rides_horse.png")
43
- ```
44
- For more detailed instructions, use-cases and examples in JAX, follow the instructions [here](https://github.com/huggingface/diffusers#text-to-image-generation-with-stable-diffusion).
45
-
46
- ### Original GitHub Repository
47
-
48
- 1. Download the weights
49
- - [v1-5-pruned-emaonly.ckpt](https://huggingface.co/runwayml/stable-diffusion-v1-5/resolve/main/v1-5-pruned-emaonly.ckpt) - 4.27GB, ema-only weight. uses less VRAM - suitable for inference
50
- - [v1-5-pruned.ckpt](https://huggingface.co/runwayml/stable-diffusion-v1-5/resolve/main/v1-5-pruned.ckpt) - 7.7GB, ema+non-ema weights. uses more VRAM - suitable for fine-tuning
51
-
52
- 2. Follow instructions [here](https://github.com/runwayml/stable-diffusion).
53
-
54
- ## Model Details
55
- - **Developed by:** Robin Rombach, Patrick Esser
56
- - **Model type:** Diffusion-based text-to-image generation model
57
- - **Language(s):** English
58
- **License:** [The CreativeML OpenRAIL M license](https://huggingface.co/spaces/CompVis/stable-diffusion-license) is an [Open RAIL M license](https://www.licenses.ai/blog/2022/8/18/naming-convention-of-responsible-ai-licenses), adapted from the work that [BigScience](https://bigscience.huggingface.co/) and [the RAIL Initiative](https://www.licenses.ai/) are jointly carrying out in the area of responsible AI licensing. See also [the article about the BLOOM Open RAIL license](https://bigscience.huggingface.co/blog/the-bigscience-rail-license) on which our license is based.
59
- - **Model Description:** This is a model that can be used to generate and modify images based on text prompts. It is a [Latent Diffusion Model](https://arxiv.org/abs/2112.10752) that uses a fixed, pretrained text encoder ([CLIP ViT-L/14](https://arxiv.org/abs/2103.00020)) as suggested in the [Imagen paper](https://arxiv.org/abs/2205.11487).
60
- - **Resources for more information:** [GitHub Repository](https://github.com/CompVis/stable-diffusion), [Paper](https://arxiv.org/abs/2112.10752).
61
- - **Cite as:**
62
-
63
- @InProceedings{Rombach_2022_CVPR,
64
- author = {Rombach, Robin and Blattmann, Andreas and Lorenz, Dominik and Esser, Patrick and Ommer, Bj\"orn},
65
- title = {High-Resolution Image Synthesis With Latent Diffusion Models},
66
- booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
67
- month = {June},
68
- year = {2022},
69
- pages = {10684-10695}
70
- }
71
-
72
- # Uses
73
-
74
- ## Direct Use
75
- The model is intended for research purposes only. Possible research areas and
76
- tasks include
77
-
78
- - Safe deployment of models which have the potential to generate harmful content.
79
- - Probing and understanding the limitations and biases of generative models.
80
- - Generation of artworks and use in design and other artistic processes.
81
- - Applications in educational or creative tools.
82
- - Research on generative models.
83
-
84
- Excluded uses are described below.
85
-
86
- ### Misuse, Malicious Use, and Out-of-Scope Use
87
- _Note: This section is taken from the [DALLE-MINI model card](https://huggingface.co/dalle-mini/dalle-mini), but applies in the same way to Stable Diffusion v1_.
88
-
89
-
90
- The model should not be used to intentionally create or disseminate images that create hostile or alienating environments for people. This includes generating images that people would foreseeably find disturbing, distressing, or offensive; or content that propagates historical or current stereotypes.
91
-
92
- #### Out-of-Scope Use
93
- The model was not trained to be factual or true representations of people or events, and therefore using the model to generate such content is out-of-scope for the abilities of this model.
94
-
95
- #### Misuse and Malicious Use
96
- Using the model to generate content that is cruel to individuals is a misuse of this model. This includes, but is not limited to:
97
-
98
- - Generating demeaning, dehumanizing, or otherwise harmful representations of people or their environments, cultures, religions, etc.
99
- - Intentionally promoting or propagating discriminatory content or harmful stereotypes.
100
- - Impersonating individuals without their consent.
101
- - Sexual content without consent of the people who might see it.
102
- - Mis- and disinformation
103
- - Representations of egregious violence and gore
104
- - Sharing of copyrighted or licensed material in violation of its terms of use.
105
- - Sharing content that is an alteration of copyrighted or licensed material in violation of its terms of use.
106
-
107
- ## Limitations and Bias
108
-
109
- ### Limitations
110
-
111
- - The model does not achieve perfect photorealism
112
- - The model cannot render legible text
113
- - The model does not perform well on more difficult tasks which involve compositionality, such as rendering an image corresponding to “A red cube on top of a blue sphere”
114
- - Faces and people in general may not be generated properly.
115
- - The model was trained mainly with English captions and will not work as well in other languages.
116
- - The autoencoding part of the model is lossy
117
- - The model was trained on a large-scale dataset
118
- [LAION-5B](https://laion.ai/blog/laion-5b/) which contains adult material
119
- and is not fit for product use without additional safety mechanisms and
120
- considerations.
121
- - No additional measures were used to deduplicate the dataset. As a result, we observe some degree of memorization for images that are duplicated in the training data.
122
- The training data can be searched at [https://rom1504.github.io/clip-retrieval/](https://rom1504.github.io/clip-retrieval/) to possibly assist in the detection of memorized images.
123
-
124
- ### Bias
125
-
126
- While the capabilities of image generation models are impressive, they can also reinforce or exacerbate social biases.
127
- Stable Diffusion v1 was trained on subsets of [LAION-2B(en)](https://laion.ai/blog/laion-5b/),
128
- which consists of images that are primarily limited to English descriptions.
129
- Texts and images from communities and cultures that use other languages are likely to be insufficiently accounted for.
130
- This affects the overall output of the model, as white and western cultures are often set as the default. Further, the
131
- ability of the model to generate content with non-English prompts is significantly worse than with English-language prompts.
132
-
133
- ### Safety Module
134
-
135
- The intended use of this model is with the [Safety Checker](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/safety_checker.py) in Diffusers.
136
- This checker works by checking model outputs against known hard-coded NSFW concepts.
137
- The concepts are intentionally hidden to reduce the likelihood of reverse-engineering this filter.
138
- Specifically, the checker compares the class probability of harmful concepts in the embedding space of the `CLIPTextModel` *after generation* of the images.
139
- The concepts are passed into the model with the generated image and compared to a hand-engineered weight for each NSFW concept.
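A minimal sketch of using the checker through 🧨Diffusers (it is loaded together with the pipeline by default, and its per-image verdict is exposed on the pipeline output):

```py
from diffusers import StableDiffusionPipeline

# The safety checker ships with this checkpoint and is loaded by default.
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
out = pipe("a photo of an astronaut riding a horse on mars")

image = out.images[0]
print(out.nsfw_content_detected)  # one boolean per generated image; flagged images are returned blacked out
```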
140
-
141
-
142
- ## Training
143
-
144
- **Training Data**
145
- The model developers used the following dataset for training the model:
146
-
147
- - LAION-2B (en) and subsets thereof (see next section)
148
-
149
- **Training Procedure**
150
- Stable Diffusion v1-5 is a latent diffusion model which combines an autoencoder with a diffusion model that is trained in the latent space of the autoencoder. During training,
151
-
152
- - Images are encoded through an encoder, which turns images into latent representations. The autoencoder uses a relative downsampling factor of 8 and maps images of shape H x W x 3 to latents of shape H/f x W/f x 4
153
- - Text prompts are encoded through a ViT-L/14 text-encoder.
154
- - The non-pooled output of the text encoder is fed into the UNet backbone of the latent diffusion model via cross-attention.
155
- The loss is a reconstruction objective between the noise that was added to the latent and the prediction made by the UNet (a minimal sketch of this step follows below).
156
-
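A minimal sketch of one such training step, assembled from the Diffusers components this checkpoint ships with (the random batch and short prompt are placeholders, not the original training data or code):

```py
import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL, UNet2DConditionModel, DDPMScheduler
from transformers import CLIPTextModel, CLIPTokenizer

repo = "runwayml/stable-diffusion-v1-5"
vae = AutoencoderKL.from_pretrained(repo, subfolder="vae")
unet = UNet2DConditionModel.from_pretrained(repo, subfolder="unet")
text_encoder = CLIPTextModel.from_pretrained(repo, subfolder="text_encoder")
tokenizer = CLIPTokenizer.from_pretrained(repo, subfolder="tokenizer")
noise_scheduler = DDPMScheduler.from_pretrained(repo, subfolder="scheduler")

images = torch.randn(1, 3, 512, 512)  # stand-in for a training batch scaled to [-1, 1]
ids = tokenizer(["a caption"], padding="max_length",
                max_length=tokenizer.model_max_length, return_tensors="pt").input_ids

latents = vae.encode(images).latent_dist.sample() * 0.18215      # 4 x 64 x 64 latents (f = 8)
noise = torch.randn_like(latents)
t = torch.randint(0, noise_scheduler.config.num_train_timesteps, (latents.shape[0],))
noisy_latents = noise_scheduler.add_noise(latents, noise, t)
cond = text_encoder(ids)[0]                                       # non-pooled CLIP text output
pred = unet(noisy_latents, t, encoder_hidden_states=cond).sample
loss = F.mse_loss(pred, noise)                                    # reconstruction objective on the added noise
```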
157
- Currently six Stable Diffusion checkpoints are provided, which were trained as follows.
158
- - [`stable-diffusion-v1-1`](https://huggingface.co/CompVis/stable-diffusion-v1-1): 237,000 steps at resolution `256x256` on [laion2B-en](https://huggingface.co/datasets/laion/laion2B-en).
159
- 194,000 steps at resolution `512x512` on [laion-high-resolution](https://huggingface.co/datasets/laion/laion-high-resolution) (170M examples from LAION-5B with resolution `>= 1024x1024`).
160
- - [`stable-diffusion-v1-2`](https://huggingface.co/CompVis/stable-diffusion-v1-2): Resumed from `stable-diffusion-v1-1`.
161
- 515,000 steps at resolution `512x512` on "laion-improved-aesthetics" (a subset of laion2B-en,
162
- filtered to images with an original size `>= 512x512`, estimated aesthetics score `> 5.0`, and an estimated watermark probability `< 0.5`. The watermark estimate is from the LAION-5B metadata, the aesthetics score is estimated using an [improved aesthetics estimator](https://github.com/christophschuhmann/improved-aesthetic-predictor)).
163
- - [`stable-diffusion-v1-3`](https://huggingface.co/CompVis/stable-diffusion-v1-3): Resumed from `stable-diffusion-v1-2` - 195,000 steps at resolution `512x512` on "laion-improved-aesthetics" and 10 % dropping of the text-conditioning to improve [classifier-free guidance sampling](https://arxiv.org/abs/2207.12598).
164
- - [`stable-diffusion-v1-4`](https://huggingface.co/CompVis/stable-diffusion-v1-4) Resumed from `stable-diffusion-v1-2` - 225,000 steps at resolution `512x512` on "laion-aesthetics v2 5+" and 10 % dropping of the text-conditioning to improve [classifier-free guidance sampling](https://arxiv.org/abs/2207.12598).
165
- - [`stable-diffusion-v1-5`](https://huggingface.co/runwayml/stable-diffusion-v1-5) Resumed from `stable-diffusion-v1-2` - 595,000 steps at resolution `512x512` on "laion-aesthetics v2 5+" and 10 % dropping of the text-conditioning to improve [classifier-free guidance sampling](https://arxiv.org/abs/2207.12598).
166
- [`stable-diffusion-inpainting`](https://huggingface.co/runwayml/stable-diffusion-inpainting) Resumed from `stable-diffusion-v1-5` - then 440,000 steps of inpainting training at resolution 512x512 on “laion-aesthetics v2 5+” and 10% dropping of the text-conditioning. For inpainting, the UNet has 5 additional input channels (4 for the encoded masked-image and 1 for the mask itself) whose weights were zero-initialized after restoring the non-inpainting checkpoint. During training, we generate synthetic masks and, in 25% of the cases, mask everything.
167
-
168
- - **Hardware:** 32 x 8 x A100 GPUs
169
- - **Optimizer:** AdamW
170
- - **Gradient Accumulations**: 2
171
- - **Batch:** 32 x 8 x 2 x 4 = 2048
172
- - **Learning rate:** warmup to 0.0001 for 10,000 steps and then kept constant
173
-
174
- ## Evaluation Results
175
- Evaluations with different classifier-free guidance scales (1.5, 2.0, 3.0, 4.0,
176
- 5.0, 6.0, 7.0, 8.0) and 50 PNDM/PLMS sampling
177
- steps show the relative improvements of the checkpoints:
178
-
179
- ![pareto](https://huggingface.co/CompVis/stable-diffusion/resolve/main/v1-1-to-v1-5.png)
180
-
181
- Evaluated at 512x512 resolution using 50 PLMS steps and 10000 random prompts from the COCO2017 validation set. Not optimized for FID scores.
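A minimal sketch of such a sweep with 🧨Diffusers (illustrative only: a single prompt instead of the 10000 COCO prompts; the checkpoint's default scheduler is already the PNDM/PLMS sampler referred to above):

```py
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "a photo of an astronaut riding a horse on mars"
for scale in [1.5, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]:
    image = pipe(prompt, guidance_scale=scale, num_inference_steps=50).images[0]
    image.save("astronaut_cfg_{}.png".format(scale))
```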
182
- ## Environmental Impact
183
-
184
- **Stable Diffusion v1** **Estimated Emissions**
185
- Based on that information, we estimate the following CO2 emissions using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700). The hardware, runtime, cloud provider, and compute region were utilized to estimate the carbon impact.
186
-
187
- - **Hardware Type:** A100 PCIe 40GB
188
- - **Hours used:** 150000
189
- - **Cloud Provider:** AWS
190
- - **Compute Region:** US-east
191
- - **Carbon Emitted (Power consumption x Time x Carbon produced based on location of power grid):** 11250 kg CO2 eq.
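As a rough plausibility check (assuming a ~250 W average draw per A100 PCIe 40GB and a grid intensity of about 0.3 kg CO2eq/kWh, neither of which is stated above): 150,000 h x 0.25 kW x 0.3 kg CO2eq/kWh ≈ 11,250 kg CO2 eq., which matches the reported figure.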
192
-
193
-
194
- ## Citation
195
-
196
- ```bibtex
197
- @InProceedings{Rombach_2022_CVPR,
198
- author = {Rombach, Robin and Blattmann, Andreas and Lorenz, Dominik and Esser, Patrick and Ommer, Bj\"orn},
199
- title = {High-Resolution Image Synthesis With Latent Diffusion Models},
200
- booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
201
- month = {June},
202
- year = {2022},
203
- pages = {10684-10695}
204
- }
205
- ```
206
-
207
- *This model card was written by: Robin Rombach and Patrick Esser and is based on the [DALL-E Mini model card](https://huggingface.co/dalle-mini/dalle-mini).*
__init__.py ADDED
Binary file (2 Bytes).
 
config.yaml ADDED
@@ -0,0 +1,63 @@
1
+ # device
2
+ mode: sample # train sample
3
+ gpu_ids: [3] # gpu ids
4
+ batch_size: 1 # batch size each item denotes one story
5
+ num_workers: 4 # number of workers
6
+ num_cpu_cores: -1 # number of cpu cores
7
+ seed: 0 # random seed
8
+ ckpt_dir: /root/lihui/StoryVisualization/save_ckpt_epoch5_new # checkpoint directory
9
+ run_name: ARLDM # name for this run
10
+
11
+ # task
12
+ dataset: pororo # pororo flintstones vistsis vistdii
13
+ task: visualization # continuation visualization
14
+
15
+ # train
16
+ init_lr: 1e-5 # initial learning rate
17
+ warmup_epochs: 1 # warmup epochs
18
+ max_epochs: 5 #50 # max epochs
19
+ train_model_file: /root/lihui/StoryVisualization/save_ckpt_3last50/ARLDM/last.ckpt # model file for resume, none for train from scratch
20
+ freeze_clip: True #False # whether to freeze clip
21
+ freeze_blip: True #False # whether to freeze blip
22
+ freeze_resnet: True #False # whether to freeze resnet
23
+
24
+ # sample
25
+ test_model_file: /root/lihui/StoryVisualization/save_ckpt_3last50/ARLDM/last.ckpt # model file for test
26
+ calculate_fid: True # whether to calculate FID scores
27
+ scheduler: ddim # ddim pndm
28
+ guidance_scale: 6 # guidance scale
29
+ num_inference_steps: 250 # number of inference steps
30
+ sample_output_dir: /root/lihui/StoryVisualization/save_samples_128_epoch50 # output directory
31
+
32
+ pororo:
33
+ hdf5_file: /root/lihui/StoryVisualization/pororo.h5
34
+ max_length: 85
35
+ new_tokens: [ "pororo", "loopy", "eddy", "harry", "poby", "tongtong", "crong", "rody", "petty" ]
36
+ clip_embedding_tokens: 49416
37
+ blip_embedding_tokens: 30530
38
+
39
+ flintstones:
40
+ hdf5_file: /path/to/flintstones.h5
41
+ max_length: 91
42
+ new_tokens: [ "fred", "barney", "wilma", "betty", "pebbles", "dino", "slate" ]
43
+ clip_embedding_tokens: 49412
44
+ blip_embedding_tokens: 30525
45
+
46
+ vistsis:
47
+ hdf5_file: /path/to/vist.h5
48
+ max_length: 100
49
+ clip_embedding_tokens: 49408
50
+ blip_embedding_tokens: 30524
51
+
52
+ vistdii:
53
+ hdf5_file: /path/to/vist.h5
54
+ max_length: 65
55
+ clip_embedding_tokens: 49408
56
+ blip_embedding_tokens: 30524
57
+
58
+ hydra:
59
+ run:
60
+ dir: .
61
+ output_subdir: null
62
+ hydra/job_logging: disabled
63
+ hydra/hydra_logging: disabled
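A minimal sketch of how the per-dataset blocks above are typically consumed, assuming plain OmegaConf loading (the project itself is driven through a Hydra entry point):

```python
from omegaconf import OmegaConf

cfg = OmegaConf.load("config.yaml")
ds = cfg.get(cfg.dataset)            # e.g. the pororo block when dataset: pororo
print(cfg.task, cfg.mode)            # visualization / sample with the values above
print(ds.hdf5_file, ds.max_length)   # per-dataset settings used by the dataset classes below
```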
data_script/flintstones_hdf5.py ADDED
@@ -0,0 +1,51 @@
1
+ import argparse
2
+ import json
3
+ import os
4
+ import pickle
5
+
6
+ import cv2
7
+ import h5py
8
+ import numpy as np
9
+ from tqdm import tqdm
10
+
11
+
12
+ def main(args):
13
+ splits = json.load(open(os.path.join(args.data_dir, 'train-val-test_split.json'), 'r'))
14
+ train_ids, val_ids, test_ids = splits["train"], splits["val"], splits["test"]
15
+ followings = pickle.load(open(os.path.join(args.data_dir, 'following_cache4.pkl'), 'rb'))
16
+ annotations = json.load(open(os.path.join(args.data_dir, 'flintstones_annotations_v1-0.json')))
17
+ descriptions = dict()
18
+ for sample in annotations:
19
+ descriptions[sample["globalID"]] = sample["description"]
20
+
21
+ f = h5py.File(args.save_path, "w")
22
+ for subset, ids in {'train': train_ids, 'val': val_ids, 'test': test_ids}.items():
23
+ ids = [i for i in ids if i in followings and len(followings[i]) == 4]
24
+ length = len(ids)
25
+
26
+ group = f.create_group(subset)
27
+ images = list()
28
+ for i in range(5):
29
+ images.append(
30
+ group.create_dataset('image{}'.format(i), (length,), dtype=h5py.vlen_dtype(np.dtype('uint8'))))
31
+ text = group.create_dataset('text', (length,), dtype=h5py.string_dtype(encoding='utf-8'))
32
+ for i, item in enumerate(tqdm(ids, leave=True, desc="saveh5")):
33
+ globalIDs = [item] + followings[item]
34
+ txt = list()
35
+ for j, globalID in enumerate(globalIDs):
36
+ img = np.load(os.path.join(args.data_dir, 'video_frames_sampled', '{}.npy'.format(globalID)))
37
+ img = np.concatenate(img, axis=0).astype(np.uint8)
38
+ img = cv2.imencode('.png', img)[1].tobytes()
39
+ img = np.frombuffer(img, np.uint8)
40
+ images[j][i] = img
41
+ txt.append(descriptions[globalID])
42
+ text[i] = '|'.join([t.replace('\n', '').replace('\t', '').strip() for t in txt])
43
+ f.close()
44
+
45
+
46
+ if __name__ == '__main__':
47
+ parser = argparse.ArgumentParser(description='arguments for flintstones hdf5 file saving')
48
+ parser.add_argument('--data_dir', type=str, required=True, help='flintstones data directory')
49
+ parser.add_argument('--save_path', type=str, required=True, help='path to save hdf5')
50
+ args = parser.parse_args()
51
+ main(args)
data_script/pororo_hdf5.py ADDED
@@ -0,0 +1,83 @@
1
+ import argparse
2
+ import os
3
+
4
+ import cv2
5
+ import h5py
6
+ import numpy as np
7
+ from PIL import Image
8
+ from tqdm import tqdm
9
+
10
+
11
+ def main(args):
12
+ # Load descriptions.npy with numpy.load; the file stores a Python dict, so item() converts the loaded 0-d array back into a dict.
13
+ # os.path.join joins the path components,
14
+ # using args.data_dir as the base directory with 'descriptions.npy' appended to it.
15
+ # allow_pickle=True allows loading files that contain pickled Python objects;
16
+ # encoding='latin1' reads the file with the Latin-1 encoding.
17
+ descriptions = np.load(os.path.join(args.data_dir, 'descriptions.npy'), allow_pickle=True, encoding='latin1').item()
18
+ # imgs_list holds the paths of the image files,
19
+ # and followings_list holds the follow-up frames associated with each image.
20
+ imgs_list = np.load(os.path.join(args.data_dir, 'img_cache4.npy'), encoding='latin1')
21
+ followings_list = np.load(os.path.join(args.data_dir, 'following_cache4.npy'))
22
+ # Load train_seen_unseen_ids.npy with numpy.load.
23
+ # The file contains three numpy arrays - train_ids, val_ids and test_ids - the ID lists of the train, validation and test splits.
24
+ # Tuple unpacking loads all three arrays and assigns them to the corresponding variables in one statement.
25
+ train_ids, val_ids, test_ids = np.load(os.path.join(args.data_dir, 'train_seen_unseen_ids.npy'), allow_pickle=True)
26
+ # Sort each ID list in ascending order.
27
+ train_ids = np.sort(train_ids)
28
+ val_ids = np.sort(val_ids)
29
+ test_ids = np.sort(test_ids)
30
+
31
+ # Create a new HDF5 file named args.save_path.
32
+ # h5py.File creates the file object in write mode ("w");
33
+ # the processed image and text data are stored in this file.
34
+ f = h5py.File(args.save_path, "w")
35
+ for subset, ids in {'train': train_ids, 'val': val_ids, 'test': test_ids}.items():
36
+ length = len(ids)
37
+
38
+ # Create one group per split (train, val and test).
39
+ # Each split gets five datasets named 'image0' ... 'image4', corresponding to the current image and the four images that follow it.
40
+ # Purpose: store each image together with its related frames in one HDF5 file, organized so that later reading and processing is straightforward.
41
+ group = f.create_group(subset)
42
+ # Build the list images and append the five HDF5 dataset objects image0-image4 in order (each dataset has one entry per ID).
43
+ images = list()
44
+ # Create five image datasets for the current split.
45
+ # Each dataset uses vlen_dtype(np.dtype('uint8')) as its element type and is added to the current group.
46
+ # vlen_dtype(np.dtype('uint8')) denotes a variable-length array of unsigned 8-bit integers.
47
+ for i in range(5):
48
+ images.append(
49
+ group.create_dataset('image{}'.format(i), (length,), dtype=h5py.vlen_dtype(np.dtype('uint8'))))
50
+ # Create the dataset text for the captions of the images in this split; it stores UTF-8 strings and is added to the current group.
51
+ text = group.create_dataset('text', (length,), dtype=h5py.string_dtype(encoding='utf-8'))
52
+ # Iterate over every item in the current split and write its data into the HDF5 file.
53
+ for i, item in enumerate(tqdm(ids, leave=True, desc="saveh5")):
54
+ # Collect the paths of all images related to the current item into img_paths.
55
+ # imgs_list stores the path of every image;
56
+ # followings_list stores the paths of the four images that follow each image.
57
+ img_paths = [str(imgs_list[item])[2:-1]] + [str(followings_list[item][i])[2:-1] for i in range(4)]
58
+ # Open every image in img_paths and convert it to an RGB PIL image.
59
+ imgs = [Image.open(os.path.join(args.data_dir, img_path)).convert('RGB') for img_path in img_paths]
60
+ # Convert each PIL image to a numpy array,
61
+ for j, img in enumerate(imgs):
62
+ img = np.array(img).astype(np.uint8)
63
+ # encode it as PNG bytes with OpenCV,
64
+ img = cv2.imencode('.png', img)[1].tobytes()
65
+ # turn the bytes back into a numpy uint8 buffer,
66
+ img = np.frombuffer(img, np.uint8)
67
+ # and store it in the dataset (from the images list) that corresponds to this frame position.
68
+ images[j][i] = img
69
+ # Collect the file names (without the .png suffix) of all related images into tgt_img_ids.
70
+ tgt_img_ids = [str(img_path).replace('.png', '') for img_path in img_paths]
71
+ # Look up the caption of each target image by its file name and collect them into txt.
72
+ txt = [descriptions[tgt_img_id][0] for tgt_img_id in tgt_img_ids]
73
+ # Join all captions in txt into one '|'-separated string, stripping "\n" and "\t" characters, and store it in the text dataset.
74
+ text[i] = '|'.join([t.replace('\n', '').replace('\t', '').strip() for t in txt])
75
+ f.close()
76
+
77
+
78
+ if __name__ == '__main__':
79
+ parser = argparse.ArgumentParser(description='arguments for pororo hdf5 file saving')
80
+ parser.add_argument('--data_dir', type=str, required=True, help='pororo data directory')
81
+ parser.add_argument('--save_path', type=str, required=True, help='path to save hdf5')
82
+ args = parser.parse_args()
83
+ main(args)
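A minimal sketch of reading one story back from the HDF5 file written above (the path is an example; the decoding mirrors the cv2.imdecode calls in the dataset classes below):

```python
import cv2
import h5py

with h5py.File("pororo.h5", "r") as f:                            # example path
    split = f["train"]
    captions = split["text"][0].decode("utf-8").split("|")        # five '|'-separated captions
    frames = []
    for i in range(5):
        buf = split["image{}".format(i)][0]                       # variable-length uint8 PNG bytes
        frames.append(cv2.imdecode(buf, cv2.IMREAD_COLOR))        # decoded BGR frame strip
    print(len(captions), [fr.shape for fr in frames])
```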
data_script/vist_hdf5.py ADDED
@@ -0,0 +1,111 @@
1
+ import argparse
2
+ import json
3
+ import os
4
+
5
+ import cv2
6
+ import h5py
7
+ import numpy as np
8
+ from PIL import Image
9
+ from tqdm import tqdm
10
+
11
+
12
+ def main(args):
13
+ train_data = json.load(open(os.path.join(args.sis_json_dir, 'train.story-in-sequence.json')))
14
+ val_data = json.load(open(os.path.join(args.sis_json_dir, 'val.story-in-sequence.json')))
15
+ test_data = json.load(open(os.path.join(args.sis_json_dir, 'test.story-in-sequence.json')))
16
+
17
+ prefix = ["train", "val", "test"]
18
+ whole_album = {}
19
+ for i, data in enumerate([train_data, val_data, test_data]):
20
+ album_mapping = {}
21
+ for annot_new in data["annotations"]:
22
+ annot = annot_new[0]
23
+ assert len(annot_new) == 1
24
+ if annot['story_id'] not in album_mapping:
25
+ album_mapping[annot['story_id']] = {"flickr_id": [annot['photo_flickr_id']],
26
+ "sis": [annot['original_text']],
27
+ "length": 1}
28
+ else:
29
+ album_mapping[annot['story_id']]["flickr_id"].append(annot['photo_flickr_id'])
30
+ album_mapping[annot['story_id']]["sis"].append(
31
+ annot['original_text'])
32
+ album_mapping[annot['story_id']]["length"] += 1
33
+ whole_album[prefix[i]] = album_mapping
34
+
35
+ for p in prefix:
36
+ deletables = []
37
+ for story_id, story in whole_album[p].items():
38
+ if story['length'] != 5:
39
+ print("deleting {}".format(story_id))
40
+ deletables.append(story_id)
41
+ continue
42
+ d = [os.path.exists(os.path.join(args.img_dir, "{}.jpg".format(_))) for _ in story["flickr_id"]]
43
+ if sum(d) < 5:
44
+ print("deleting {}".format(story_id))
45
+ deletables.append(story_id)
46
+ else:
47
+ pass
48
+ for i in deletables:
49
+ del whole_album[p][i]
50
+
51
+ train_data = json.load(open(os.path.join(args.dii_json_dir, 'train.description-in-isolation.json')))
52
+ val_data = json.load(open(os.path.join(args.dii_json_dir, 'val.description-in-isolation.json')))
53
+ test_data = json.load(open(os.path.join(args.dii_json_dir, 'test.description-in-isolation.json')))
54
+
55
+ flickr_id2text = {}
56
+ for i, data in enumerate([train_data, val_data, test_data]):
57
+ for l in data['annotations']:
58
+ assert len(l) == 1
59
+ if l[0]['photo_flickr_id'] in flickr_id2text:
60
+ flickr_id2text[l[0]['photo_flickr_id']] = \
61
+ max([flickr_id2text[l[0]['photo_flickr_id']], l[0]['original_text']], key=len)
62
+ else:
63
+ flickr_id2text[l[0]['photo_flickr_id']] = l[0]['original_text']
64
+
65
+ for p in prefix:
66
+ deletables = []
67
+ for story_id, story in whole_album[p].items():
68
+ story['dii'] = []
69
+ for i, flickr_id in enumerate(story['flickr_id']):
70
+ if flickr_id not in flickr_id2text:
71
+ print("{} not found in story {}".format(flickr_id, story_id))
72
+ deletables.append(story_id)
73
+ break
74
+ story['dii'].append(flickr_id2text[flickr_id])
75
+ for i in deletables:
76
+ del whole_album[p][i]
77
+
78
+ f = h5py.File(args.save_path, "w")
79
+ for p in prefix:
80
+ group = f.create_group(p)
81
+ story_dict = whole_album[p]
82
+ length = len(story_dict)
83
+ images = list()
84
+ for i in range(5):
85
+ images.append(
86
+ group.create_dataset('image{}'.format(i), (length,), dtype=h5py.vlen_dtype(np.dtype('uint8'))))
87
+ sis = group.create_dataset('sis', (length,), dtype=h5py.string_dtype(encoding='utf-8'))
88
+ dii = group.create_dataset('dii', (length,), dtype=h5py.string_dtype(encoding='utf-8'))
89
+ for i, (story_id, story) in enumerate(tqdm(story_dict.items(), leave=True, desc="saveh5")):
90
+ imgs = [Image.open('{}/{}.jpg'.format(args.img_dir, flickr_id)).convert('RGB') for flickr_id in
91
+ story['flickr_id']]
92
+ for j, img in enumerate(imgs):
93
+ img = np.array(img).astype(np.uint8)
94
+ img = cv2.imencode('.png', img)[1].tobytes()
95
+ img = np.frombuffer(img, np.uint8)
96
+ images[j][i] = img
97
+ sis[i] = '|'.join([t.replace('\n', '').replace('\t', '').strip() for t in story['sis']])
98
+ txt_dii = [t.replace('\n', '').replace('\t', '').strip() for t in story['dii']]
99
+ txt_dii = sorted(set(txt_dii), key=txt_dii.index)
100
+ dii[i] = '|'.join(txt_dii)
101
+ f.close()
102
+
103
+
104
+ if __name__ == '__main__':
105
+ parser = argparse.ArgumentParser(description='arguments for vist hdf5 file saving')
106
+ parser.add_argument('--sis_json_dir', type=str, required=True, help='sis json file directory')
107
+ parser.add_argument('--dii_json_dir', type=str, required=True, help='dii json file directory')
108
+ parser.add_argument('--img_dir', type=str, required=True, help='image file directory')
109
+ parser.add_argument('--save_path', type=str, required=True, help='path to save hdf5')
110
+ args = parser.parse_args()
111
+ main(args)
data_script/vist_img_download.py ADDED
@@ -0,0 +1,61 @@
1
+ import json
2
+ import requests
3
+ from io import BytesIO
4
+ from PIL import Image
5
+ from tqdm import tqdm
6
+ from multiprocessing import Process
7
+ import os
8
+ import argparse
9
+
10
+
11
+ def download_subprocess(dii, save_dir):
12
+ for image in tqdm(dii):
13
+ key, value = image.popitem()
14
+ try:
15
+ img_data = requests.get(value).content
16
+ img = Image.open(BytesIO(img_data)).convert('RGB')
17
+ h = img.size[0]
18
+ w = img.size[1]
19
+ if min(h, w) > 512:
20
+ img = img.resize((int(h / (w / 512)), 512) if h > w else (512, int(w / (h / 512))))
21
+ img.save('{}/{}.jpg'.format(save_dir, key))
22
+ except:
23
+ print(key, value)
24
+
25
+
26
+ def main(args):
27
+ train_data = json.load(open(os.path.join(args.json_dir, 'train.description-in-isolation.json')))
28
+ val_data = json.load(open(os.path.join(args.json_dir, 'val.description-in-isolation.json')))
29
+ test_data = json.load(open(os.path.join(args.json_dir, 'test.description-in-isolation.json')))
30
+ dii = []
31
+ for subset in [train_data, val_data, test_data]:
32
+ for image in subset["images"]:
33
+ try:
34
+ dii.append({image['id']: image['url_o']})
35
+ except:
36
+ dii.append({image['id']: image['url_m']})
37
+
38
+ dii = [image for image in dii if not os.path.exists('{}/{}.jpg'.format(args.img_dir, list(image)[0]))]
39
+ print('total images: {}'.format(len(dii)))
40
+
41
+ def splitlist(inlist, chunksize):
42
+ return [inlist[x:x + chunksize] for x in range(0, len(inlist), chunksize)]
43
+
44
+ dii_splitted = splitlist(dii, int((len(dii) / args.num_process)))
45
+ process_list = []
46
+ for dii_sub_list in dii_splitted:
47
+ p = Process(target=download_subprocess, args=(dii_sub_list, args.img_dir))
48
+ process_list.append(p)
49
+ p.daemon = True
50
+ p.start()
51
+ for p in process_list:
52
+ p.join()
53
+
54
+
55
+ if __name__ == "__main__":
56
+ parser = argparse.ArgumentParser(description='arguments for vist images downloading')
57
+ parser.add_argument('--json_dir', type=str, required=True, help='dii json file directory')
58
+ parser.add_argument('--img_dir', type=str, required=True, help='images saving directory')
59
+ parser.add_argument('--num_process', type=int, default=32)
60
+ args = parser.parse_args()
61
+ main(args)
datasets/flintstones.py ADDED
@@ -0,0 +1,93 @@
1
+ import random
2
+
3
+ import cv2
4
+ import h5py
5
+ import numpy as np
6
+ import torch
7
+ from torch.utils.data import Dataset
8
+ from torchvision import transforms
9
+ from transformers import CLIPTokenizer
10
+
11
+ from models.blip_override.blip import init_tokenizer
12
+
13
+
14
+ class StoryDataset(Dataset):
15
+ """
16
+ A custom subset class for the LRW (includes train, val, test) subset
17
+ """
18
+
19
+ def __init__(self, subset, args):
20
+ super(StoryDataset, self).__init__()
21
+ self.args = args
22
+
23
+ self.h5_file = args.get(args.dataset).hdf5_file
24
+ self.subset = subset
25
+
26
+ self.augment = transforms.Compose([
27
+ transforms.ToPILImage(),
28
+ transforms.Resize([512, 512]),
29
+ transforms.ToTensor(),
30
+ transforms.Normalize([0.5], [0.5])
31
+ ])
32
+ self.dataset = args.dataset
33
+ self.max_length = args.get(args.dataset).max_length
34
+ self.clip_tokenizer = CLIPTokenizer.from_pretrained('runwayml/stable-diffusion-v1-5', subfolder="tokenizer")
35
+ self.blip_tokenizer = init_tokenizer()
36
+ msg = self.clip_tokenizer.add_tokens(list(args.get(args.dataset).new_tokens))
37
+ print("clip {} new tokens added".format(msg))
38
+ msg = self.blip_tokenizer.add_tokens(list(args.get(args.dataset).new_tokens))
39
+ print("blip {} new tokens added".format(msg))
40
+
41
+ self.blip_image_processor = transforms.Compose([
42
+ transforms.ToPILImage(),
43
+ transforms.Resize([224, 224]),
44
+ transforms.ToTensor(),
45
+ transforms.Normalize([0.48145466, 0.4578275, 0.40821073], [0.26862954, 0.26130258, 0.27577711])
46
+ ])
47
+
48
+ def open_h5(self):
49
+ h5 = h5py.File(self.h5_file, "r")
50
+ self.h5 = h5[self.subset]
51
+
52
+ def __getitem__(self, index):
53
+ if not hasattr(self, 'h5'):
54
+ self.open_h5()
55
+
56
+ images = list()
57
+ for i in range(5):
58
+ im = self.h5['image{}'.format(i)][index]
59
+ im = cv2.imdecode(im, cv2.IMREAD_COLOR)
60
+ idx = random.randint(0, 4)
61
+ images.append(im[idx * 128: (idx + 1) * 128])
62
+
63
+ source_images = torch.stack([self.blip_image_processor(im) for im in images])
64
+ images = images[1:] if self.args.task == 'continuation' else images
65
+ images = torch.stack([self.augment(im) for im in images]) \
66
+ if self.subset in ['train', 'val'] else torch.from_numpy(np.array(images)).permute(0, 3, 1, 2)
67
+
68
+ texts = self.h5['text'][index].decode('utf-8').split('|')
69
+
70
+ # tokenize caption using default tokenizer
71
+ tokenized = self.clip_tokenizer(
72
+ texts[1:] if self.args.task == 'continuation' else texts,
73
+ padding="max_length",
74
+ max_length=self.max_length,
75
+ truncation=False,
76
+ return_tensors="pt",
77
+ )
78
+ captions, attention_mask = tokenized['input_ids'], tokenized['attention_mask']
79
+
80
+ tokenized = self.blip_tokenizer(
81
+ texts,
82
+ padding="max_length",
83
+ max_length=self.max_length,
84
+ truncation=False,
85
+ return_tensors="pt",
86
+ )
87
+ source_caption, source_attention_mask = tokenized['input_ids'], tokenized['attention_mask']
88
+ return images, captions, attention_mask, source_images, source_caption, source_attention_mask
89
+
90
+ def __len__(self):
91
+ if not hasattr(self, 'h5'):
92
+ self.open_h5()
93
+ return len(self.h5['text'])
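A minimal sketch of wrapping this dataset in a DataLoader, assuming the OmegaConf config shown earlier (the real training script may batch, shuffle and collate differently):

```python
from omegaconf import OmegaConf
from torch.utils.data import DataLoader

from datasets.flintstones import StoryDataset

cfg = OmegaConf.load("config.yaml")
cfg.dataset = "flintstones"   # select the flintstones block of the config
# assumes cfg.flintstones.hdf5_file points at the .h5 produced by data_script/flintstones_hdf5.py
train_set = StoryDataset("train", cfg)
loader = DataLoader(train_set, batch_size=cfg.batch_size,
                    num_workers=cfg.num_workers, shuffle=True)

images, captions, attention_mask, source_images, source_caption, source_attention_mask = next(iter(loader))
print(images.shape, captions.shape)   # e.g. [B, 5, 3, 512, 512] and [B, 5, max_length]
```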
datasets/pororo.py ADDED
@@ -0,0 +1,144 @@
1
+ import copy
2
+ import os
3
+ import random
4
+ from PIL import Image
5
+ import cv2
6
+ import h5py
7
+ import numpy as np
8
+ import torch
9
+ from torch.utils.data import Dataset
10
+ from torchvision import transforms
11
+ from transformers import CLIPTokenizer
12
+
13
+ from models.blip_override.blip import init_tokenizer
14
+
15
+
16
+ class StoryDataset(Dataset):
17
+ """
18
+ A custom subset class for the LRW (includes train, val, test) subset
19
+ """
20
+ # Constructor of the StoryDataset class.
21
+ def __init__(self, subset, args):
22
+ # Call the parent Dataset initializer so the class inherits all of Dataset's methods and attributes.
23
+ super(StoryDataset, self).__init__()
24
+ # args carries the remaining parameters for the class as a namespace-like config object.
25
+ self.args = args
26
+ # Path of the HDF5 file that stores the images and texts of the train, validation and test splits.
27
+ # args.get(args.dataset) fetches the parameter block of the selected dataset from the args object.
28
+ self.h5_file = args.get(args.dataset).hdf5_file
29
+ # subset selects which split to read (train, val or test).
30
+ self.subset = subset
31
+
32
+ # A transform pipeline that preprocesses images: convert to PIL, resize, convert to a tensor and normalize.
33
+ self.augment = transforms.Compose([
34
+ transforms.ToPILImage(),
35
+ # transforms.Resize([256, 256]),
36
+ transforms.Resize([512, 512]),
37
+ transforms.ToTensor(),
38
+ transforms.Normalize([0.5], [0.5])
39
+ ])
40
+ # Name of the dataset in use.
41
+ self.dataset = args.dataset
42
+ # Maximum caption length; captions are padded to this length during tokenization.
43
+ self.max_length = args.get(args.dataset).max_length
44
+ # A CLIP tokenizer used to tokenize the captions.
45
+ self.clip_tokenizer = CLIPTokenizer.from_pretrained('runwayml/stable-diffusion-v1-5', subfolder="tokenizer")
46
+ # A custom (BLIP) tokenizer used for the text input.
47
+ self.blip_tokenizer = init_tokenizer()
48
+ msg = self.clip_tokenizer.add_tokens(list(args.get(args.dataset).new_tokens))
49
+ print("clip {} new tokens added".format(msg))
50
+ msg = self.blip_tokenizer.add_tokens(list(args.get(args.dataset).new_tokens))
51
+ print("blip {} new tokens added".format(msg))
52
+
53
+ # A transform pipeline for the source images: convert to PIL, resize, convert to a tensor and normalize.
54
+ self.blip_image_processor = transforms.Compose([
55
+ transforms.ToPILImage(),
56
+ transforms.Resize([224, 224]),
57
+ transforms.ToTensor(),
58
+ transforms.Normalize([0.48145466, 0.4578275, 0.40821073], [0.26862954, 0.26130258, 0.27577711])
59
+ ])
60
+
61
+ # Open the HDF5 file that backs this dataset.
62
+ def open_h5(self):
63
+ h5 = h5py.File(self.h5_file, "r")
64
+ self.h5 = h5[self.subset]
65
+
66
+ # Fetch one sample by index.
67
+
68
+ # Each image is run through the augmentation transform,
69
+ # then the captions are tokenized
70
+ # with both the CLIP tokenizer and the BLIP tokenizer,
71
+ # and finally the processed images, captions and attention masks are returned.
72
+ def __getitem__(self, index):
73
+ # First make sure the HDF5 file is open via open_h5().
74
+ if not hasattr(self, 'h5'):
75
+ self.open_h5()
76
+ #index = 1
77
+ images = list()
78
+ for i in range(5):
79
+ # Read one group of images and the matching text from the HDF5 file.
80
+ im = self.h5['image{}'.format(i)][index]
81
+ # print(im)
82
+ # pil_img = Image.fromarray(im)
83
+ # # save the image
84
+ # pil_img.save(os.path.join('/root/lihui/StoryVisualization/ori_test_images', '{:04d}.png'.format(i)))
85
+ # Decode each image from its PNG bytes.
86
+ im = cv2.imdecode(im, cv2.IMREAD_COLOR)
87
+ # Randomly pick one 128-pixel-high slice of the frame strip.
88
+ idx = random.randint(0, im.shape[0] / 128 - 1)
89
+ # Append the sliced frame to the images list.
90
+ images.append(im[idx * 128: (idx + 1) * 128])
91
+ # Deep copy so the originals do not change when images is modified later.
92
+ ori_images = copy.deepcopy(images)
93
+ # Save the original test images (left commented out below).
94
+
95
+ # for i, im in enumerate(images):
96
+ # file_path = '/root/lihui/StoryVisualization/ori_test_images/group{:02d}_image{:02d}.png'.format(index + 1,
97
+ # i + 1)
98
+ # cv2.imwrite(file_path, im)
99
+ # Preprocess the images for BLIP and stack them into a tensor.
100
+ source_images = torch.stack([self.blip_image_processor(im) for im in images])
101
+ # For the continuation task, drop the first image from images.
102
+ images = images[1:] if self.args.task == 'continuation' else images
103
+ # For the train/val splits, augment every image in images and convert it to a tensor;
104
+ # otherwise convert the images list to a tensor via numpy.array and permute the dimensions.
105
+ images = torch.stack([self.augment(im) for im in images]) \
106
+ if self.subset in ['train', 'val'] else torch.from_numpy(np.array(images)).permute(0, 3, 1, 2)
107
+ ######################
108
+ # Read the text at the current index and decode it as UTF-8.
109
+ texts = self.h5['text'][index].decode('utf-8').split('|')
110
+ # print(f"index: {index}")
111
+ # for text in texts:
112
+ # print(f"texts: {text}")
113
+
114
+ # tokenize caption using default tokenizer
115
+ tokenized = self.clip_tokenizer(
116
+ texts[1:] if self.args.task == 'continuation' else texts,
117
+ padding="max_length",
118
+ max_length=self.max_length,
119
+ truncation=False,
120
+ return_tensors="pt",
121
+ )
122
+ captions, attention_mask = tokenized['input_ids'], tokenized['attention_mask']
123
+
124
+ tokenized = self.blip_tokenizer(
125
+ texts,
126
+ padding="max_length",
127
+ max_length=self.max_length,
128
+ truncation=False,
129
+ return_tensors="pt",
130
+ )
131
+ source_caption, source_attention_mask = tokenized['input_ids'], tokenized['attention_mask']
132
+ return images, captions, attention_mask, source_images, source_caption, source_attention_mask, texts, ori_images
133
+
134
+ # Return the number of samples in the dataset.
135
+ # For the test split only 1 sample is returned here (the commented-out variant returned 100); otherwise the full split length is returned.
136
+ def __len__(self):
137
+ if not hasattr(self, 'h5'):
138
+ self.open_h5()
139
+ if self.subset == 'test':
140
+ #print('')
141
+ return 1
142
+ # if self.subset == 'test':
143
+ # return 100
144
+ return len(self.h5['text'])
datasets/vistdii.py ADDED
@@ -0,0 +1,94 @@
1
+ import cv2
2
+ import h5py
3
+ import numpy as np
4
+ import torch
5
+ from torch.utils.data import Dataset
6
+ from torchvision import transforms
7
+ from transformers import CLIPTokenizer
8
+
9
+ from models.blip_override.blip import init_tokenizer
10
+
11
+
12
+ class StoryDataset(Dataset):
13
+ """
14
+ A custom subset class for the LRW (includes train, val, test) subset
15
+ """
16
+
17
+ def __init__(self, subset, args):
18
+ super(StoryDataset, self).__init__()
19
+ self.args = args
20
+
21
+ self.h5_file = args.get(args.dataset).hdf5_file
22
+ self.subset = subset
23
+
24
+ self.augment = transforms.Compose([
25
+ transforms.ToPILImage(),
26
+ transforms.Resize(512),
27
+ transforms.RandomCrop(512) if self.subset == 'train' else transforms.CenterCrop(512),
28
+ transforms.ToTensor(),
29
+ transforms.Normalize([0.5], [0.5])
30
+ ]) if self.subset in ['train', 'val'] else transforms.Compose([
31
+ transforms.ToPILImage(),
32
+ transforms.Resize(64),
33
+ transforms.CenterCrop(64)
34
+ ])
35
+
36
+ self.dataset = args.dataset
37
+ self.max_length = args.get(args.dataset).max_length
38
+ self.clip_tokenizer = CLIPTokenizer.from_pretrained('runwayml/stable-diffusion-v1-5', subfolder="tokenizer")
39
+ self.blip_tokenizer = init_tokenizer()
40
+
41
+ self.blip_image_processor = transforms.Compose([
42
+ transforms.ToPILImage(),
43
+ transforms.Resize(224),
44
+ transforms.RandomCrop(224) if self.subset == 'train' else transforms.CenterCrop(224),
45
+ transforms.ToTensor(),
46
+ transforms.Normalize([0.48145466, 0.4578275, 0.40821073], [0.26862954, 0.26130258, 0.27577711])
47
+ ])
48
+
49
+ def open_h5(self):
50
+ h5 = h5py.File(self.h5_file, "r")
51
+ self.h5 = h5[self.subset]
52
+
53
+ def __getitem__(self, index):
54
+ if not hasattr(self, 'h5'):
55
+ self.open_h5()
56
+
57
+ images = list()
58
+ for i in range(5):
59
+ im = self.h5['image{}'.format(i)][index]
60
+ im = cv2.imdecode(im, cv2.IMREAD_COLOR)
61
+ images.append(im)
62
+
63
+ source_images = torch.stack([self.blip_image_processor(im) for im in images])
64
+ images = images[1:] if self.args.task == 'continuation' else images
65
+ images = [self.augment(im) for im in images]
66
+ images = torch.stack(images) if self.subset in ['train', 'val'] \
67
+ else torch.from_numpy(np.array([np.array(im) for im in images])).permute(0, 3, 1, 2)
68
+
69
+ texts = self.h5['dii'][index].decode('utf-8').split('|')
70
+
71
+ # tokenize caption using default tokenizer
72
+ tokenized = self.clip_tokenizer(
73
+ texts[1:] if self.args.task == 'continuation' else texts,
74
+ padding="max_length",
75
+ max_length=self.max_length,
76
+ truncation=False,
77
+ return_tensors="pt",
78
+ )
79
+ captions, attention_mask = tokenized['input_ids'], tokenized['attention_mask']
80
+
81
+ tokenized = self.blip_tokenizer(
82
+ texts,
83
+ padding="max_length",
84
+ max_length=self.max_length,
85
+ truncation=False,
86
+ return_tensors="pt",
87
+ )
88
+ source_caption, source_attention_mask = tokenized['input_ids'], tokenized['attention_mask']
89
+ return images, captions, attention_mask, source_images, source_caption, source_attention_mask
90
+
91
+ def __len__(self):
92
+ if not hasattr(self, 'h5'):
93
+ self.open_h5()
94
+ return len(self.h5['dii'])
datasets/vistsis.py ADDED
@@ -0,0 +1,94 @@
1
+ import cv2
2
+ import h5py
3
+ import numpy as np
4
+ import torch
5
+ from torch.utils.data import Dataset
6
+ from torchvision import transforms
7
+ from transformers import CLIPTokenizer
8
+
9
+ from models.blip_override.blip import init_tokenizer
10
+
11
+
12
+ class StoryDataset(Dataset):
13
+ """
14
+ A custom subset class for the LRW (includes train, val, test) subset
15
+ """
16
+
17
+ def __init__(self, subset, args):
18
+ super(StoryDataset, self).__init__()
19
+ self.args = args
20
+
21
+ self.h5_file = args.get(args.dataset).hdf5_file
22
+ self.subset = subset
23
+
24
+ self.augment = transforms.Compose([
25
+ transforms.ToPILImage(),
26
+ transforms.Resize(512),
27
+ transforms.RandomCrop(512) if self.subset == 'train' else transforms.CenterCrop(512),
28
+ transforms.ToTensor(),
29
+ transforms.Normalize([0.5], [0.5])
30
+ ]) if self.subset in ['train', 'val'] else transforms.Compose([
31
+ transforms.ToPILImage(),
32
+ transforms.Resize(64),
33
+ transforms.CenterCrop(64)
34
+ ])
35
+
36
+ self.dataset = args.dataset
37
+ self.max_length = args.get(args.dataset).max_length
38
+ self.clip_tokenizer = CLIPTokenizer.from_pretrained('runwayml/stable-diffusion-v1-5', subfolder="tokenizer")
39
+ self.blip_tokenizer = init_tokenizer()
40
+
41
+ self.blip_image_processor = transforms.Compose([
42
+ transforms.ToPILImage(),
43
+ transforms.Resize(224),
44
+ transforms.RandomCrop(224) if self.subset == 'train' else transforms.CenterCrop(224),
45
+ transforms.ToTensor(),
46
+ transforms.Normalize([0.48145466, 0.4578275, 0.40821073], [0.26862954, 0.26130258, 0.27577711])
47
+ ])
48
+
49
+ def open_h5(self):
50
+ h5 = h5py.File(self.h5_file, "r")
51
+ self.h5 = h5[self.subset]
52
+
53
+ def __getitem__(self, index):
54
+ if not hasattr(self, 'h5'):
55
+ self.open_h5()
56
+
57
+ images = list()
58
+ for i in range(5):
59
+ im = self.h5['image{}'.format(i)][index]
60
+ im = cv2.imdecode(im, cv2.IMREAD_COLOR)
61
+ images.append(im)
62
+
63
+ source_images = torch.stack([self.blip_image_processor(im) for im in images])
64
+ images = images[1:] if self.args.task == 'continuation' else images
65
+ images = [self.augment(im) for im in images]
66
+ images = torch.stack(images) if self.subset in ['train', 'val'] \
67
+ else torch.from_numpy(np.array([np.array(im) for im in images])).permute(0, 3, 1, 2)
68
+
69
+ texts = self.h5['sis'][index].decode('utf-8').split('|')
70
+
71
+ # tokenize caption using default tokenizer
72
+ tokenized = self.clip_tokenizer(
73
+ texts[1:] if self.args.task == 'continuation' else texts,
74
+ padding="max_length",
75
+ max_length=self.max_length,
76
+ truncation=False,
77
+ return_tensors="pt",
78
+ )
79
+ captions, attention_mask = tokenized['input_ids'], tokenized['attention_mask']
80
+
81
+ tokenized = self.blip_tokenizer(
82
+ texts,
83
+ padding="max_length",
84
+ max_length=self.max_length,
85
+ truncation=False,
86
+ return_tensors="pt",
87
+ )
88
+ source_caption, source_attention_mask = tokenized['input_ids'], tokenized['attention_mask']
89
+ return images, captions, attention_mask, source_images, source_caption, source_attention_mask
90
+
91
+ def __len__(self):
92
+ if not hasattr(self, 'h5'):
93
+ self.open_h5()
94
+ return len(self.h5['sis'])
environment.yml ADDED
@@ -0,0 +1,271 @@
1
+ name: story
2
+ channels:
3
+ - pytorch
4
+ - nvidia
5
+ - defaults
6
+ dependencies:
7
+ - _libgcc_mutex=0.1=main
8
+ - _openmp_mutex=5.1=1_gnu
9
+ - blas=1.0=mkl
10
+ - brotlipy=0.7.0=py38h27cfd23_1003
11
+ - bzip2=1.0.8=h7b6447c_0
12
+ - ca-certificates=2023.01.10=h06a4308_0
13
+ - certifi=2022.12.7=py38h06a4308_0
14
+ - cffi=1.15.1=py38h5eee18b_3
15
+ - cryptography=39.0.1=py38h9ce1e76_0
16
+ - cuda-cudart=11.7.99=0
17
+ - cuda-cupti=11.7.101=0
18
+ - cuda-libraries=11.7.1=0
19
+ - cuda-nvrtc=11.7.99=0
20
+ - cuda-nvtx=11.7.91=0
21
+ - cuda-runtime=11.7.1=0
22
+ - ffmpeg=4.3=hf484d3e_0
23
+ - flit-core=3.8.0=py38h06a4308_0
24
+ - freetype=2.12.1=h4a9f257_0
25
+ - giflib=5.2.1=h5eee18b_3
26
+ - gmp=6.2.1=h295c915_3
27
+ - gnutls=3.6.15=he1e5248_0
28
+ - idna=3.4=py38h06a4308_0
29
+ - intel-openmp=2021.4.0=h06a4308_3561
30
+ - jpeg=9e=h5eee18b_1
31
+ - lame=3.100=h7b6447c_0
32
+ - lcms2=2.12=h3be6417_0
33
+ - ld_impl_linux-64=2.38=h1181459_1
34
+ - lerc=3.0=h295c915_0
35
+ - libcublas=11.10.3.66=0
36
+ - libcufft=10.7.2.124=h4fbf590_0
37
+ - libcufile=1.6.0.25=0
38
+ - libcurand=10.3.2.56=0
39
+ - libcusolver=11.4.0.1=0
40
+ - libcusparse=11.7.4.91=0
41
+ - libdeflate=1.17=h5eee18b_0
42
+ - libffi=3.4.2=h6a678d5_6
43
+ - libgcc-ng=11.2.0=h1234567_1
44
+ - libgomp=11.2.0=h1234567_1
45
+ - libiconv=1.16=h7f8727e_2
46
+ - libidn2=2.3.2=h7f8727e_0
47
+ - libnpp=11.7.4.75=0
48
+ - libnvjpeg=11.8.0.2=0
49
+ - libpng=1.6.39=h5eee18b_0
50
+ - libstdcxx-ng=11.2.0=h1234567_1
51
+ - libtasn1=4.19.0=h5eee18b_0
52
+ - libtiff=4.5.0=h6a678d5_2
53
+ - libunistring=0.9.10=h27cfd23_0
54
+ - libwebp=1.2.4=h11a3e52_1
55
+ - libwebp-base=1.2.4=h5eee18b_1
56
+ - lz4-c=1.9.4=h6a678d5_0
57
+ - mkl=2021.4.0=h06a4308_640
58
+ - mkl-service=2.4.0=py38h7f8727e_0
59
+ - mkl_fft=1.3.1=py38hd3c417c_0
60
+ - mkl_random=1.2.2=py38h51133e4_0
61
+ - ncurses=6.4=h6a678d5_0
62
+ - nettle=3.7.3=hbbd107a_1
63
+ - numpy-base=1.23.5=py38h31eccc5_0
64
+ - openh264=2.1.1=h4ff587b_0
65
+ - openssl=1.1.1t=h7f8727e_0
66
+ - pip=23.0.1=py38h06a4308_0
67
+ - pycparser=2.21=pyhd3eb1b0_0
68
+ - pyopenssl=23.0.0=py38h06a4308_0
69
+ - pysocks=1.7.1=py38h06a4308_0
70
+ - python=3.8.16=h7a1cb2a_3
71
+ - pytorch=1.13.1=py3.8_cuda11.7_cudnn8.5.0_0
72
+ - pytorch-cuda=11.7=h778d358_3
73
+ - pytorch-mutex=1.0=cuda
74
+ - readline=8.2=h5eee18b_0
75
+ - six=1.16.0=pyhd3eb1b0_1
76
+ - sqlite=3.41.1=h5eee18b_0
77
+ - tk=8.6.12=h1ccaba5_0
78
+ - typing_extensions=4.4.0=py38h06a4308_0
79
+ - urllib3=1.26.15=py38h06a4308_0
80
+ - wheel=0.38.4=py38h06a4308_0
81
+ - xz=5.2.10=h5eee18b_1
82
+ - zlib=1.2.13=h5eee18b_0
83
+ - zstd=1.5.4=hc292b87_0
84
+ - pip:
85
+ - absl-py==1.4.0
86
+ - accelerate==0.17.1
87
+ - aiofiles==23.1.0
88
+ - aiohttp==3.8.4
89
+ - aiosignal==1.3.1
90
+ - altair==4.2.2
91
+ - antlr4-python3-runtime==4.9.3
92
+ - anyio==3.6.2
93
+ - appdirs==1.4.4
94
+ - argon2-cffi==21.3.0
95
+ - argon2-cffi-bindings==21.2.0
96
+ - arrow==1.2.3
97
+ - asttokens==2.2.1
98
+ - async-timeout==4.0.2
99
+ - attrs==22.2.0
100
+ - backcall==0.2.0
101
+ - beautifulsoup4==4.11.2
102
+ - bleach==6.0.0
103
+ - cachetools==5.3.0
104
+ - chardet==5.1.0
105
+ - charset-normalizer==3.1.0
106
+ - click==8.1.3
107
+ - comm==0.1.2
108
+ - contourpy==1.0.7
109
+ - cycler==0.11.0
110
+ - debugpy==1.6.6
111
+ - decorator==5.1.1
112
+ - defusedxml==0.7.1
113
+ - diffusers==0.9.0
114
+ - docker-pycreds==0.4.0
115
+ - entrypoints==0.4
116
+ - executing==1.2.0
117
+ - fastapi==0.95.0
118
+ - fastjsonschema==2.16.3
119
+ - ffmpy==0.3.0
120
+ - filelock==3.10.0
121
+ - fire==0.5.0
122
+ - flatbuffers==23.3.3
123
+ - fonttools==4.39.3
124
+ - fqdn==1.5.1
125
+ - frozenlist==1.3.3
126
+ - fsspec==2023.3.0
127
+ - ftfy==6.1.1
128
+ - gitdb==4.0.10
129
+ - gitpython==3.1.31
130
+ - google-auth==2.16.2
131
+ - google-auth-oauthlib==0.4.6
132
+ - gradio==3.24.1
133
+ - gradio-client==0.0.5
134
+ - grpcio==1.51.3
135
+ - h11==0.14.0
136
+ - h5py==3.8.0
137
+ - httpcore==0.16.3
138
+ - httpx==0.23.3
139
+ - huggingface-hub==0.13.2
140
+ - hydra-core==1.3.2
141
+ - importlib-metadata==6.1.0
142
+ - importlib-resources==5.12.0
143
+ - ipykernel==6.21.3
144
+ - ipython==8.11.0
145
+ - ipython-genutils==0.2.0
146
+ - ipywidgets==8.0.4
147
+ - isoduration==20.11.0
148
+ - jedi==0.18.2
149
+ - jinja2==3.1.2
150
+ - jsonpointer==2.3
151
+ - jsonschema==4.17.3
152
+ - jupyter==1.0.0
153
+ - jupyter-client==8.0.3
154
+ - jupyter-console==6.6.3
155
+ - jupyter-core==5.3.0
156
+ - jupyter-events==0.6.3
157
+ - jupyter-server==2.5.0
158
+ - jupyter-server-terminals==0.4.4
159
+ - jupyterlab-pygments==0.2.2
160
+ - jupyterlab-widgets==3.0.5
161
+ - kiwisolver==1.4.4
162
+ - lightning-bolts==0.5.0
163
+ - linkify-it-py==2.0.0
164
+ - lora-diffusion==0.1.7
165
+ - markdown==3.4.1
166
+ - markdown-it-py==2.2.0
167
+ - markupsafe==2.1.2
168
+ - matplotlib==3.7.1
169
+ - matplotlib-inline==0.1.6
170
+ - mdit-py-plugins==0.3.3
171
+ - mdurl==0.1.2
172
+ - mediapipe==0.9.1.0
173
+ - mistune==2.0.5
174
+ - multidict==6.0.4
175
+ - nbclassic==0.5.3
176
+ - nbclient==0.7.2
177
+ - nbconvert==7.2.10
178
+ - nbformat==5.7.3
179
+ - nest-asyncio==1.5.6
180
+ - notebook==6.5.3
181
+ - notebook-shim==0.2.2
182
+ - numpy==1.24.2
183
+ - oauthlib==3.2.2
184
+ - omegaconf==2.3.0
185
+ - opencv-contrib-python==4.7.0.72
186
+ - opencv-python==4.7.0.72
187
+ - orjson==3.8.9
188
+ - packaging==23.0
189
+ - pandas==1.5.3
190
+ - pandocfilters==1.5.0
191
+ - parso==0.8.3
192
+ - pathtools==0.1.2
193
+ - pexpect==4.8.0
194
+ - pickleshare==0.7.5
195
+ - pillow==9.4.0
196
+ - pkgutil-resolve-name==1.3.10
197
+ - platformdirs==3.1.1
198
+ - prometheus-client==0.16.0
199
+ - prompt-toolkit==3.0.38
200
+ - protobuf==3.20.1
201
+ - psutil==5.9.4
202
+ - ptyprocess==0.7.0
203
+ - pure-eval==0.2.2
204
+ - pyasn1==0.4.8
205
+ - pyasn1-modules==0.2.8
206
+ - pydantic==1.10.7
207
+ - pydeprecate==0.3.2
208
+ - pydub==0.25.1
209
+ - pygments==2.14.0
210
+ - pyparsing==3.0.9
211
+ - pyrsistent==0.19.3
212
+ - python-dateutil==2.8.2
213
+ - python-json-logger==2.0.7
214
+ - python-multipart==0.0.6
215
+ - pytorch-lightning==1.6.5
216
+ - pytz==2023.3
217
+ - pyyaml==6.0
218
+ - pyzmq==25.0.1
219
+ - qtconsole==5.4.1
220
+ - qtpy==2.3.0
221
+ - regex==2022.10.31
222
+ - requests==2.28.2
223
+ - requests-oauthlib==1.3.1
224
+ - rfc3339-validator==0.1.4
225
+ - rfc3986==1.5.0
226
+ - rfc3986-validator==0.1.1
227
+ - rsa==4.9
228
+ - safetensors==0.3.0
229
+ - scipy==1.10.1
230
+ - semantic-version==2.10.0
231
+ - send2trash==1.8.0
232
+ - sentry-sdk==1.17.0
233
+ - setproctitle==1.3.2
234
+ - setuptools==59.5.0
235
+ - smmap==5.0.0
236
+ - sniffio==1.3.0
237
+ - soupsieve==2.4
238
+ - stack-data==0.6.2
239
+ - starlette==0.26.1
240
+ - tensorboard==2.12.0
241
+ - tensorboard-data-server==0.7.0
242
+ - tensorboard-plugin-wit==1.8.1
243
+ - termcolor==2.2.0
244
+ - terminado==0.17.1
245
+ - timm==0.6.12
246
+ - tinycss2==1.2.1
247
+ - tokenizers==0.13.2
248
+ - toolz==0.12.0
249
+ - torch==1.9.0
250
+ - torchaudio==0.9.0
251
+ - torchmetrics==0.11.4
252
+ - torchvision==0.10.0+cu111
253
+ - tornado==6.2
254
+ - tqdm==4.65.0
255
+ - traitlets==5.9.0
256
+ - transformers==4.28.1
257
+ - typing-extensions==4.5.0
258
+ - uc-micro-py==1.0.1
259
+ - uri-template==1.2.0
260
+ - uvicorn==0.21.1
261
+ - wandb==0.14.0
262
+ - wcwidth==0.2.6
263
+ - webcolors==1.12
264
+ - webencodings==0.5.1
265
+ - websocket-client==1.5.1
266
+ - websockets==11.0
267
+ - werkzeug==2.2.3
268
+ - widgetsnbextension==4.0.5
269
+ - yarl==1.8.2
270
+ - zipp==3.15.0
271
+ prefix: /root/anaconda3/envs/story
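The environment above should be reproducible with `conda env create -f environment.yml` followed by `conda activate story`; the `prefix:` line is machine-specific and can be removed before sharing.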
fid_utils.py ADDED
@@ -0,0 +1,41 @@
1
+ import numpy as np
2
+ from scipy import linalg
3
+
4
+
5
+ def calculate_frechet_distance(mu1, sigma1, mu2, sigma2, eps=1e-6):
6
+ mu1 = np.atleast_1d(mu1)
7
+ mu2 = np.atleast_1d(mu2)
8
+
9
+ sigma1 = np.atleast_2d(sigma1)
10
+ sigma2 = np.atleast_2d(sigma2)
11
+
12
+ assert mu1.shape == mu2.shape, 'Training and test mean vectors have different lengths'
13
+ assert sigma1.shape == sigma2.shape, 'Training and test covariances have different dimensions'
14
+
15
+ diff = mu1 - mu2
16
+
17
+ # Product might be almost singular
18
+ covmean, _ = linalg.sqrtm(sigma1.dot(sigma2), disp=False)
19
+ if not np.isfinite(covmean).all():
20
+ print('fid calculation produces singular product; adding %s to diagonal of cov estimates' % eps)
21
+ offset = np.eye(sigma1.shape[0]) * eps
22
+ covmean = linalg.sqrtm((sigma1 + offset).dot(sigma2 + offset))
23
+
24
+ # Numerical error might give slight imaginary component
25
+ if np.iscomplexobj(covmean):
26
+ if not np.allclose(np.diagonal(covmean).imag, 0, atol=1e-3):
27
+ m = np.max(np.abs(covmean.imag))
28
+ raise ValueError('Imaginary component {}'.format(m))
29
+ covmean = covmean.real
30
+
31
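+ # Frechet distance: ||mu1 - mu2||^2 + Tr(sigma1) + Tr(sigma2) - 2 * Tr(sqrt(sigma1 @ sigma2))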
+ return diff.dot(diff) + np.trace(sigma1) + np.trace(sigma2) - 2 * np.trace(covmean)
32
+
33
+
34
+ def calculate_fid_given_features(feature1, feature2):
35
+ mu1 = np.mean(feature1, axis=0)
36
+ sigma1 = np.cov(feature1, rowvar=False)
37
+ mu2 = np.mean(feature2, axis=0)
38
+ sigma2 = np.cov(feature2, rowvar=False)
39
+ fid_value = calculate_frechet_distance(mu1, sigma1, mu2, sigma2)
40
+
41
+ return fid_value
main.py ADDED
@@ -0,0 +1,537 @@
1
+ import inspect
2
+ import os
3
+
4
+ import cv2
5
+ import hydra
6
+ import numpy as np
7
+ import pytorch_lightning as pl
8
+ import torch
9
+ import torch.nn.functional as F
10
+ import torch.utils.checkpoint
11
+ from PIL import Image
12
+ from diffusers import AutoencoderKL, DDPMScheduler, LMSDiscreteScheduler, PNDMScheduler, DDIMScheduler
13
+ from omegaconf import DictConfig
14
+ from pl_bolts.optimizers.lr_scheduler import LinearWarmupCosineAnnealingLR
15
+ from pytorch_lightning.callbacks import ModelCheckpoint, LearningRateMonitor
16
+ from pytorch_lightning.loggers import TensorBoardLogger
17
+ from pytorch_lightning.strategies import DDPStrategy
18
+ from torch import nn
19
+ from torch.utils.data import DataLoader
20
+ from torchvision import transforms
21
+ from transformers import CLIPTokenizer, CLIPTextModel
22
+
23
+ from fid_utils import calculate_fid_given_features
24
+ from lora_diffusion import monkeypatch_or_replace_lora, tune_lora_scale
25
+
26
+ from models.blip_override.blip import blip_feature_extractor, init_tokenizer
27
+ from models.diffusers_override.unet_2d_condition import UNet2DConditionModel
28
+ from models.inception import InceptionV3
29
+ unet_target_replace_module = {"CrossAttention", "Attention", "GEGLU"}
30
+ #!/usr/bin/env python3
31
+ from transformers import CLIPProcessor
32
+ import transformers
33
+ from PIL import Image
34
+ import PIL.Image
35
+ import numpy as np
36
+ import torchvision.transforms as tvtrans
37
+ import requests
38
+ from io import BytesIO
39
+
40
+ class LightningDataset(pl.LightningDataModule):
41
+ def __init__(self, args: DictConfig):
42
+ super(LightningDataset, self).__init__()
43
+ self.kwargs = {"num_workers": args.num_workers, "persistent_workers": True if args.num_workers > 0 else False,
44
+ "pin_memory": True}
45
+ self.args = args
46
+
47
+ def setup(self, stage="fit"):
48
+ if self.args.dataset == "pororo":
49
+ import datasets.pororo as data
50
+ elif self.args.dataset == 'flintstones':
51
+ import datasets.flintstones as data
52
+ elif self.args.dataset == 'vistsis':
53
+ import datasets.vistsis as data
54
+ elif self.args.dataset == 'vistdii':
55
+ import datasets.vistdii as data
56
+ else:
57
+ raise ValueError("Unknown dataset: {}".format(self.args.dataset))
58
+ if stage == "fit":
59
+ self.train_data = data.StoryDataset("train", self.args)
60
+ self.val_data = data.StoryDataset("val", self.args)
61
+ if stage == "test":
62
+ self.test_data = data.StoryDataset("test", self.args)
63
+
64
+ def train_dataloader(self):
65
+ if not hasattr(self, 'trainloader'):
66
+ self.trainloader = DataLoader(self.train_data, batch_size=self.args.batch_size, shuffle=True, **self.kwargs)
67
+ return self.trainloader
68
+
69
+ def val_dataloader(self):
70
+ return DataLoader(self.val_data, batch_size=self.args.batch_size, shuffle=False, **self.kwargs)
71
+
72
+ def test_dataloader(self):
73
+ return DataLoader(self.test_data, batch_size=self.args.batch_size, shuffle=False, **self.kwargs)
74
+
75
+ def predict_dataloader(self):
76
+ return DataLoader(self.test_data, batch_size=self.args.batch_size, shuffle=False, **self.kwargs)
77
+
78
+ def get_length_of_train_dataloader(self):
79
+ if not hasattr(self, 'trainloader'):
80
+ self.trainloader = DataLoader(self.train_data, batch_size=self.args.batch_size, shuffle=True, **self.kwargs)
81
+ return len(self.trainloader)
82
+
83
+
84
+ class ARLDM(pl.LightningModule):
85
+ def __init__(self, args: DictConfig, steps_per_epoch=1):
86
+ super(ARLDM, self).__init__()
87
+ self.args = args
88
+ self.steps_per_epoch = steps_per_epoch
89
+ """
90
+ Configurations
91
+ """
92
+ self.task = args.task
93
+
94
+ if args.mode == 'sample':
95
+ if args.scheduler == "pndm":
96
+ self.scheduler = PNDMScheduler(beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear",
97
+ skip_prk_steps=True)
98
+ elif args.scheduler == "ddim":
99
+ self.scheduler = DDIMScheduler(beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear",
100
+ clip_sample=False, set_alpha_to_one=True)
101
+ else:
102
+ raise ValueError("Scheduler not supported")
103
+ self.fid_augment = transforms.Compose([
104
+ transforms.Resize([64, 64]),
105
+ transforms.ToTensor(),
106
+ transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
107
+ ])
108
+ block_idx = InceptionV3.BLOCK_INDEX_BY_DIM[2048]
109
+ self.inception = InceptionV3([block_idx])
110
+
111
+ self.clip_tokenizer = CLIPTokenizer.from_pretrained('runwayml/stable-diffusion-v1-5', subfolder="tokenizer")
112
+ ##############################
113
+ #self.clip_tokenizer.save_pretrained('/root/lihui/StoryVisualization/save_pretrained/tokenizer')
114
+ self.blip_tokenizer = init_tokenizer()
115
+ self.blip_image_processor = transforms.Compose([
116
+ transforms.Resize([224, 224]),
117
+ transforms.ToTensor(),
118
+ transforms.Normalize([0.48145466, 0.4578275, 0.40821073], [0.26862954, 0.26130258, 0.27577711])
119
+ ])
120
+ self.max_length = args.get(args.dataset).max_length
121
+
122
+ blip_image_null_token = self.blip_image_processor(
123
+ Image.fromarray(np.zeros((224, 224, 3), dtype=np.uint8))).unsqueeze(0).float()
124
+ clip_text_null_token = self.clip_tokenizer([""], padding="max_length", max_length=self.max_length,
125
+ return_tensors="pt").input_ids
126
+ blip_text_null_token = self.blip_tokenizer([""], padding="max_length", max_length=self.max_length,
127
+ return_tensors="pt").input_ids
128
+
129
+ self.register_buffer('clip_text_null_token', clip_text_null_token)
130
+ self.register_buffer('blip_text_null_token', blip_text_null_token)
131
+ self.register_buffer('blip_image_null_token', blip_image_null_token)
132
+
133
+ self.text_encoder = CLIPTextModel.from_pretrained('runwayml/stable-diffusion-v1-5',
134
+ subfolder="text_encoder")
135
+ ############################################
136
+ #self.text_encoder.save_pretrained('/root/lihui/StoryVisualization/save_pretrained/text_encoder')
137
+ self.text_encoder.resize_token_embeddings(args.get(args.dataset).clip_embedding_tokens)
138
+ # resize_position_embeddings
139
+ old_embeddings = self.text_encoder.text_model.embeddings.position_embedding
140
+ new_embeddings = self.text_encoder._get_resized_embeddings(old_embeddings, self.max_length)
141
+ self.text_encoder.text_model.embeddings.position_embedding = new_embeddings
142
+ self.text_encoder.config.max_position_embeddings = self.max_length
143
+ self.text_encoder.max_position_embeddings = self.max_length
144
+ self.text_encoder.text_model.embeddings.position_ids = torch.arange(self.max_length).expand((1, -1))
145
+
146
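+ # Learned embeddings that tag each context token by modality (0 = text caption, 1 = BLIP image-text feature) and by source-frame index (up to 5 frames)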
+ self.modal_type_embeddings = nn.Embedding(2, 768)
147
+ self.time_embeddings = nn.Embedding(5, 768)
148
+ self.mm_encoder = blip_feature_extractor(
149
+ # pretrained='https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_large.pth',
150
+ pretrained='/root/lihui/StoryVisualization/save_pretrained/model_large.pth',
151
+ image_size=224, vit='large')#, local_files_only=True)
152
+ self.mm_encoder.text_encoder.resize_token_embeddings(args.get(args.dataset).blip_embedding_tokens)
153
+
154
+ self.vae = AutoencoderKL.from_pretrained('runwayml/stable-diffusion-v1-5', subfolder="vae")
155
+ self.unet = UNet2DConditionModel.from_pretrained('runwayml/stable-diffusion-v1-5', subfolder="unet")
156
+
157
+ self.noise_scheduler = DDPMScheduler(beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear",
158
+ num_train_timesteps=1000)
159
+ # monkeypatch_or_replace_lora(
160
+ # self.unet,
161
+ # torch.load("lora/example_loras/analog_svd_rank4.safetensors"),
162
+ # r=4,
163
+ # target_replace_module=unet_target_replace_module,
164
+ # )
165
+ #
166
+ # tune_lora_scale(self.unet, 1.00)
167
+ #tune_lora_scale(self.text_encoder, 1.00)
168
+
169
+ # torch.manual_seed(0)
170
+ ###################################
171
+ #self.vae.save_pretrained('/root/lihui/StoryVisualization/save_pretrained/vae')
172
+ #self.unet.save_pretrained('/root/lihui/StoryVisualization/save_pretrained/unet')
173
+
174
+ # Freeze vae and unet
175
+ self.freeze_params(self.vae.parameters())
176
+ if args.freeze_resnet:
177
+ self.freeze_params([p for n, p in self.unet.named_parameters() if "attentions" not in n])
178
+
179
+ if args.freeze_blip and hasattr(self, "mm_encoder"):
180
+ self.freeze_params(self.mm_encoder.parameters())
181
+ self.unfreeze_params(self.mm_encoder.text_encoder.embeddings.word_embeddings.parameters())
182
+
183
+ if args.freeze_clip and hasattr(self, "text_encoder"):
184
+ self.freeze_params(self.text_encoder.parameters())
185
+ self.unfreeze_params(self.text_encoder.text_model.embeddings.token_embedding.parameters())
186
+
187
+ @staticmethod
188
+ def freeze_params(params):
189
+ for param in params:
190
+ param.requires_grad = False
191
+
192
+ @staticmethod
193
+ def unfreeze_params(params):
194
+ for param in params:
195
+ param.requires_grad = True
196
+
197
+ def configure_optimizers(self):
198
+ optimizer = torch.optim.AdamW(self.parameters(), lr=self.args.init_lr, weight_decay=1e-4) # optim_bits=8
199
+ scheduler = LinearWarmupCosineAnnealingLR(optimizer,
200
+ warmup_epochs=self.args.warmup_epochs * self.steps_per_epoch,
201
+ max_epochs=self.args.max_epochs * self.steps_per_epoch)
202
+ optim_dict = {
203
+ 'optimizer': optimizer,
204
+ 'lr_scheduler': {
205
+ 'scheduler': scheduler, # The LR scheduler instance (required)
206
+ 'interval': 'step', # The unit of the scheduler's step size
207
+ }
208
+ }
209
+ return optim_dict
210
+
211
+ def forward(self, batch):
212
+ if self.args.freeze_clip and hasattr(self, "text_encoder"):
213
+ self.text_encoder.eval()
214
+ if self.args.freeze_blip and hasattr(self, "mm_encoder"):
215
+ self.mm_encoder.eval()
216
+ images, captions, attention_mask, source_images, source_caption, source_attention_mask, texts, ori_images = batch
217
+ B, V, S = captions.shape
218
+ src_V = V + 1 if self.task == 'continuation' else V
219
+ images = torch.flatten(images, 0, 1)
220
+ captions = torch.flatten(captions, 0, 1)
221
+ attention_mask = torch.flatten(attention_mask, 0, 1)
222
+ source_images = torch.flatten(source_images, 0, 1)
223
+ source_caption = torch.flatten(source_caption, 0, 1)
224
+ source_attention_mask = torch.flatten(source_attention_mask, 0, 1)
225
+ # 1 is not masked, 0 is masked
226
+
227
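+ # Randomly drop the conditioning for ~10% of frames; their embeddings are replaced with null-token embeddings below, enabling classifier-free guidance at sampling time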
+ classifier_free_idx = np.random.rand(B * V) < 0.1
228
+
229
+ caption_embeddings = self.text_encoder(captions, attention_mask).last_hidden_state # B * V, S, D
230
+ source_embeddings = self.mm_encoder(source_images, source_caption, source_attention_mask,
231
+ mode='multimodal').reshape(B, src_V * S, -1)
232
+ source_embeddings = source_embeddings.repeat_interleave(V, dim=0)
233
+ caption_embeddings[classifier_free_idx] = \
234
+ self.text_encoder(self.clip_text_null_token).last_hidden_state[0]
235
+ source_embeddings[classifier_free_idx] = \
236
+ self.mm_encoder(self.blip_image_null_token, self.blip_text_null_token, attention_mask=None,
237
+ mode='multimodal')[0].repeat(src_V, 1)
238
+ caption_embeddings += self.modal_type_embeddings(torch.tensor(0, device=self.device))
239
+ source_embeddings += self.modal_type_embeddings(torch.tensor(1, device=self.device))
240
+ source_embeddings += self.time_embeddings(
241
+ torch.arange(src_V, device=self.device).repeat_interleave(S, dim=0))
242
+ encoder_hidden_states = torch.cat([caption_embeddings, source_embeddings], dim=1)
243
+
244
+ attention_mask = torch.cat(
245
+ [attention_mask, source_attention_mask.reshape(B, src_V * S).repeat_interleave(V, dim=0)], dim=1)
246
+ attention_mask = ~(attention_mask.bool()) # B * V, (src_V + 1) * S
247
+ attention_mask[classifier_free_idx] = False
248
+
249
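+ # Upper-triangular mask over frames (True = masked): each frame may only attend to the context of frames that precede it (auto-regressive conditioning)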
+ # B, V, V, S
250
+ square_mask = torch.triu(torch.ones((V, V), device=self.device)).bool()
251
+ square_mask = square_mask.unsqueeze(0).unsqueeze(-1).expand(B, V, V, S)
252
+ square_mask = square_mask.reshape(B * V, V * S)
253
+ attention_mask[:, -V * S:] = torch.logical_or(square_mask, attention_mask[:, -V * S:])
254
+
255
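+ # Encode target frames into the VAE latent space; 0.18215 is the Stable Diffusion v1 latent scaling factor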
+ latents = self.vae.encode(images).latent_dist.sample()
256
+ latents = latents * 0.18215
257
+
258
+ noise = torch.randn(latents.shape, device=self.device)
259
+ bsz = latents.shape[0]
260
+ timesteps = torch.randint(0, self.noise_scheduler.num_train_timesteps, (bsz,), device=self.device).long()
261
+ noisy_latents = self.noise_scheduler.add_noise(latents, noise, timesteps)
262
+
263
+ noise_pred = self.unet(noisy_latents, timesteps, encoder_hidden_states, attention_mask).sample
264
+ loss = F.mse_loss(noise_pred, noise, reduction="none").mean([1, 2, 3]).mean()
265
+ return loss
266
+
267
+ def sample(self, batch):
268
+ original_images, captions, attention_mask, source_images, source_caption, source_attention_mask, texts, ori_test_images = batch
269
+ B, V, S = captions.shape
270
+ src_V = V + 1 if self.task == 'continuation' else V
271
+ original_images = torch.flatten(original_images, 0, 1)
272
+ captions = torch.flatten(captions, 0, 1)
273
+ attention_mask = torch.flatten(attention_mask, 0, 1)
274
+ source_images = torch.flatten(source_images, 0, 1)
275
+ source_caption = torch.flatten(source_caption, 0, 1)
276
+ source_attention_mask = torch.flatten(source_attention_mask, 0, 1)
277
+
278
+ caption_embeddings = self.text_encoder(captions, attention_mask).last_hidden_state # B * V, S, D
279
+ source_embeddings = self.mm_encoder(source_images, source_caption, source_attention_mask,
280
+ mode='multimodal').reshape(B, src_V * S, -1)
281
+ caption_embeddings += self.modal_type_embeddings(torch.tensor(0, device=self.device))
282
+ source_embeddings += self.modal_type_embeddings(torch.tensor(1, device=self.device))
283
+ source_embeddings += self.time_embeddings(
284
+ torch.arange(src_V, device=self.device).repeat_interleave(S, dim=0))
285
+ source_embeddings = source_embeddings.repeat_interleave(V, dim=0)
286
+ encoder_hidden_states = torch.cat([caption_embeddings, source_embeddings], dim=1)
287
+
288
+ attention_mask = torch.cat(
289
+ [attention_mask, source_attention_mask.reshape(B, src_V * S).repeat_interleave(V, dim=0)], dim=1)
290
+ attention_mask = ~(attention_mask.bool()) # B * V, (src_V + 1) * S
291
+ # B, V, V, S
292
+ square_mask = torch.triu(torch.ones((V, V), device=self.device)).bool()
293
+ square_mask = square_mask.unsqueeze(0).unsqueeze(-1).expand(B, V, V, S)
294
+ square_mask = square_mask.reshape(B * V, V * S)
295
+ attention_mask[:, -V * S:] = torch.logical_or(square_mask, attention_mask[:, -V * S:])
296
+
297
+ uncond_caption_embeddings = self.text_encoder(self.clip_text_null_token).last_hidden_state
298
+ uncond_source_embeddings = self.mm_encoder(self.blip_image_null_token, self.blip_text_null_token,
299
+ attention_mask=None, mode='multimodal').repeat(1, src_V, 1)
300
+ uncond_caption_embeddings += self.modal_type_embeddings(torch.tensor(0, device=self.device))
301
+ uncond_source_embeddings += self.modal_type_embeddings(torch.tensor(1, device=self.device))
302
+ uncond_source_embeddings += self.time_embeddings(
303
+ torch.arange(src_V, device=self.device).repeat_interleave(S, dim=0))
304
+ uncond_embeddings = torch.cat([uncond_caption_embeddings, uncond_source_embeddings], dim=1)
305
+ uncond_embeddings = uncond_embeddings.expand(B * V, -1, -1)
306
+
307
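+ # Stack unconditional and conditional contexts so a single U-Net forward pass evaluates both classifier-free guidance branches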
+ encoder_hidden_states = torch.cat([uncond_embeddings, encoder_hidden_states])
308
+ uncond_attention_mask = torch.zeros((B * V, (src_V + 1) * S), device=self.device).bool()
309
+ uncond_attention_mask[:, -V * S:] = square_mask
310
+ attention_mask = torch.cat([uncond_attention_mask, attention_mask], dim=0)
311
+
312
+ attention_mask = attention_mask.reshape(2, B, V, (src_V + 1) * S)
313
+ images = list()
314
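+ # Generate the V frames auto-regressively: each new frame is re-encoded by the BLIP multimodal encoder and written back into the conditioning for later frames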
+ for i in range(V):
315
+ encoder_hidden_states = encoder_hidden_states.reshape(2, B, V, (src_V + 1) * S, -1)
316
+ new_image = self.diffusion(encoder_hidden_states[:, :, i].reshape(2 * B, (src_V + 1) * S, -1),
317
+ attention_mask[:, :, i].reshape(2 * B, (src_V + 1) * S),
318
+ 512, 512, self.args.num_inference_steps, self.args.guidance_scale, 0.0)
319
+ images += new_image
320
+
321
+ new_image = torch.stack([self.blip_image_processor(im) for im in new_image]).to(self.device)
322
+
323
+ new_embedding = self.mm_encoder(new_image, # B,C,H,W
324
+ source_caption.reshape(B, src_V, S)[:, i + src_V - V],
325
+ source_attention_mask.reshape(B, src_V, S)[:, i + src_V - V],
326
+ mode='multimodal') # B, S, D
327
+ new_embedding = new_embedding.repeat_interleave(V, dim=0)
328
+ new_embedding += self.modal_type_embeddings(torch.tensor(1, device=self.device))
329
+ new_embedding += self.time_embeddings(torch.tensor(i + src_V - V, device=self.device))
330
+
331
+ encoder_hidden_states = encoder_hidden_states[1].reshape(B * V, (src_V + 1) * S, -1)
332
+ encoder_hidden_states[:, (i + 1 + src_V - V) * S:(i + 2 + src_V - V) * S] = new_embedding
333
+ encoder_hidden_states = torch.cat([uncond_embeddings, encoder_hidden_states])
334
+
335
+ return original_images, images, texts, ori_test_images
336
+
337
+
338
+ def training_step(self, batch, batch_idx):
339
+ loss = self(batch)
340
+ self.log('loss/train_loss', loss, on_step=True, on_epoch=False, sync_dist=True, prog_bar=True)
341
+ return loss
342
+
343
+ def validation_step(self, batch, batch_idx):
344
+ loss = self(batch)
345
+ self.log('loss/val_loss', loss, on_step=False, on_epoch=True, sync_dist=True, prog_bar=True)
346
+
347
+ def predict_step(self, batch, batch_idx, dataloader_idx=0):
348
+ original_images, images, texts, ori_test_images = self.sample(batch)
349
+ if self.args.calculate_fid:
350
+ original_images = original_images.cpu().numpy().astype('uint8')
351
+ original_images = [Image.fromarray(im, 'RGB') for im in original_images]
352
+
353
+ # ori_test_images = torch.stack(ori_test_images).cpu().numpy().astype('uint8')
354
+ # ori_test_images = [Image.fromarray(im, 'RGB') for im in ori_test_images]
355
+ ori = self.inception_feature(original_images).cpu().numpy()
356
+ gen = self.inception_feature(images).cpu().numpy()
357
+ else:
358
+ ori = None
359
+ gen = None
360
+
361
+ return images, ori, gen, ori_test_images, texts
362
+
363
+ def diffusion(self, encoder_hidden_states, attention_mask, height, width, num_inference_steps, guidance_scale, eta):
364
+ latents = torch.randn((encoder_hidden_states.shape[0] // 2, self.unet.in_channels, height // 8, width // 8),
365
+ device=self.device)
366
+
367
+ # set timesteps
368
+ accepts_offset = "offset" in set(inspect.signature(self.scheduler.set_timesteps).parameters.keys())
369
+ extra_set_kwargs = {}
370
+ if accepts_offset:
371
+ extra_set_kwargs["offset"] = 1
372
+
373
+ self.scheduler.set_timesteps(num_inference_steps, **extra_set_kwargs)
374
+
375
+ # if we use LMSDiscreteScheduler, let's make sure latents are multiplied by sigmas
376
+ if isinstance(self.scheduler, LMSDiscreteScheduler):
377
+ latents = latents * self.scheduler.sigmas[0]
378
+
379
+ accepts_eta = "eta" in set(inspect.signature(self.scheduler.step).parameters.keys())
380
+ extra_step_kwargs = {}
381
+ if accepts_eta:
382
+ extra_step_kwargs["eta"] = eta
383
+
384
+ for i, t in enumerate(self.scheduler.timesteps):
385
+ # expand the latents if we are doing classifier free guidance
386
+ latent_model_input = torch.cat([latents] * 2)
387
+
388
+ # noise_pred = self.unet(latent_model_input, t, encoder_hidden_states).sample
389
+ noise_pred = self.unet(latent_model_input, t, encoder_hidden_states, attention_mask).sample
390
+
391
+ # perform guidance
392
+ noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
393
+ noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)
394
+
395
+ # compute the previous noisy sample x_t -> x_t-1
396
+ latents = self.scheduler.step(noise_pred, t, latents, **extra_step_kwargs).prev_sample
397
+
398
+ # scale and decode the image latents with vae
399
+ latents = 1 / 0.18215 * latents
400
+ image = self.vae.decode(latents).sample
401
+
402
+ image = (image / 2 + 0.5).clamp(0, 1)
403
+ image = image.cpu().permute(0, 2, 3, 1).numpy()
404
+
405
+ return self.numpy_to_pil(image)
406
+
407
+ @staticmethod
408
+ def numpy_to_pil(images):
409
+ """
410
+ Convert a numpy image or a batch of images to a PIL image.
411
+ """
412
+ if images.ndim == 3:
413
+ images = images[None, ...]
414
+ images = (images * 255).round().astype("uint8")
415
+ pil_images = [Image.fromarray(image, 'RGB') for image in images]
416
+
417
+ return pil_images
418
+
419
+ def inception_feature(self, images):
420
+ images = torch.stack([self.fid_augment(image) for image in images])
421
+ images = images.type(torch.FloatTensor).to(self.device)
422
+ images = (images + 1) / 2
423
+ images = F.interpolate(images, size=(299, 299), mode='bilinear', align_corners=False)
424
+ pred = self.inception(images)[0]
425
+
426
+ if pred.shape[2] != 1 or pred.shape[3] != 1:
427
+ pred = F.adaptive_avg_pool2d(pred, output_size=(1, 1))
428
+ return pred.reshape(-1, 2048)
429
+
430
+
431
+ def train(args: DictConfig) -> None:
432
+ dataloader = LightningDataset(args)
433
+ dataloader.setup('fit')
434
+ # dataloader.
435
+ model = ARLDM(args, steps_per_epoch=dataloader.get_length_of_train_dataloader())
436
+
437
+ logger = TensorBoardLogger(save_dir=os.path.join(args.ckpt_dir, args.run_name), name='log', default_hp_metric=False)
438
+
439
+ checkpoint_callback = ModelCheckpoint(
440
+ dirpath=os.path.join(args.ckpt_dir, args.run_name),
441
+ save_top_k=0,
442
+ save_last=True
443
+ )
444
+
445
+ lr_monitor = LearningRateMonitor(logging_interval='step')
446
+
447
+ callback_list = [lr_monitor, checkpoint_callback]
448
+
449
+ trainer = pl.Trainer(
450
+ accelerator='gpu',
451
+ devices=args.gpu_ids,
452
+ max_epochs=args.max_epochs,
453
+ benchmark=True,
454
+ logger=logger,
455
+ log_every_n_steps=1,
456
+ callbacks=callback_list,
457
+ strategy=DDPStrategy(find_unused_parameters=False)
458
+ )
459
+ trainer.fit(model, dataloader, ckpt_path=args.train_model_file)
460
+
461
+
462
+ def sample(args: DictConfig) -> None:
463
+
464
+ assert args.test_model_file is not None, "test_model_file cannot be None"
465
+ assert args.gpu_ids == 1 or len(args.gpu_ids) == 1, "Only one GPU is supported in test mode"
466
+ dataloader = LightningDataset(args)
467
+ dataloader.setup('test')
468
+ model = ARLDM.load_from_checkpoint(args.test_model_file, args=args, strict=False)
469
+
470
+ predictor = pl.Trainer(
471
+ accelerator='gpu',
472
+ devices=args.gpu_ids,
473
+ max_epochs=-1,
474
+ benchmark=True
475
+ )
476
+ predictions = predictor.predict(model, dataloader)
477
+ images = [elem for sublist in predictions for elem in sublist[0]]
478
+ ori_images = [elem for sublist in predictions for elem in sublist[3]]
479
+ ori_test_images = list()
480
+ if not os.path.exists(args.sample_output_dir):
481
+ try:
482
+ os.mkdir(args.sample_output_dir)
483
+ except OSError:
484
+ pass
485
+
486
+ text_list = [elem for sublist in predictions for elem in sublist[4]]
487
+ ################################
488
+ # print(f"index: {index}")
489
+ num_images = len(images)
490
+ num_groups = (num_images + 4) // 5 # total number of 5-image story groups needed
491
+
492
+ for g in range(num_groups):
493
+ print('Story {}:'.format(g + 1)) # print the story (group) index
494
+ start_index = g * 5 # start index of the current group
495
+ end_index = min(start_index + 5, num_images) # end index of the current group
496
+ for i in range(start_index, end_index):
497
+ print(text_list[i]) # print the corresponding caption text
498
+ images[i].save(
499
+ os.path.join(args.sample_output_dir, 'group{:02d}_image{:02d}.png'.format(g + 1, i - start_index + 1)))
500
+ # ori_images[i] = ori_images[i]
501
+ ori_images_pil = Image.fromarray(np.uint8(ori_images[i].detach().cpu().squeeze().float().numpy())).convert("RGB")
502
+ ori_test_images.append(ori_images_pil)
503
+ ori_images_pil.save(
504
+ os.path.join('/root/lihui/StoryVisualization/ori_test_images_epoch10', 'group{:02d}_image{:02d}.png'.format(g + 1, i - start_index + 1)))
505
+ # for i, im in enumerate(ori_images):
506
+ # file_path = '/root/lihui/StoryVisualization/ori_test_images/image{}.png'.format(i)
507
+ # cv2.imwrite(file_path, im)
508
+
509
+
510
+ if args.calculate_fid:
511
+ ori = np.array([elem for sublist in predictions for elem in sublist[1]])
512
+ gen = np.array([elem for sublist in predictions for elem in sublist[2]])
513
+ fid = calculate_fid_given_features(ori, gen)
514
+ print('FID: {}'.format(fid))
515
+
516
+
517
+
518
+
519
+
520
+ @hydra.main(config_path=".", config_name="config")
521
+ def main(args: DictConfig) -> None:
522
+ pl.seed_everything(args.seed)
523
+ if args.num_cpu_cores > 0:
524
+ torch.set_num_threads(args.num_cpu_cores)
525
+
526
+ if args.mode == 'train':
527
+ ############################
528
+ train(args)
529
+ elif args.mode == 'sample':
530
+ # dataloader = LightningDataset(args)
531
+ # dataloader.setup('test')
532
+ sample(args)
533
+
534
+
535
+
536
+ if __name__ == '__main__':
537
+ main()
models/blip_override/blip.py ADDED
@@ -0,0 +1,240 @@
1
+ '''
2
+ * Copyright (c) 2022, salesforce.com, inc.
3
+ * All rights reserved.
4
+ * SPDX-License-Identifier: BSD-3-Clause
5
+ * For full license text, see LICENSE.txt file in the repo root or https://opensource.org/licenses/BSD-3-Clause
6
+ * By Junnan Li
7
+ '''
8
+ import warnings
9
+
10
+ warnings.filterwarnings("ignore")
11
+
12
+ from .vit import VisionTransformer, interpolate_pos_embed
13
+ from .med import BertModel, BertLMHeadModel
14
+ from transformers import BertTokenizer, BertConfig
15
+
16
+ import torch
17
+ from torch import nn
18
+
19
+ import os
20
+ from urllib.parse import urlparse
21
+ from timm.models.hub import download_cached_file
22
+
23
+
24
+ class BLIP_Base(nn.Module):
25
+ def __init__(self,
26
+ med_config='models/blip_override/med_config.json',
27
+ image_size=224,
28
+ vit='base',
29
+ vit_grad_ckpt=False,
30
+ vit_ckpt_layer=0,
31
+ ):
32
+ """
33
+ Args:
34
+ med_config (str): path for the mixture of encoder-decoder model's configuration file
35
+ image_size (int): input image size
36
+ vit (str): model size of vision transformer
37
+ """
38
+ super().__init__()
39
+
40
+ self.visual_encoder, vision_width = create_vit(vit, image_size, vit_grad_ckpt, vit_ckpt_layer)
41
+ self.tokenizer = init_tokenizer()
42
+ med_config = BertConfig.from_json_file(med_config)
43
+ med_config.encoder_width = vision_width
44
+ self.text_encoder = BertModel(config=med_config, add_pooling_layer=False)
45
+
46
+ def forward(self, image, text, attention_mask, mode):
47
+ assert mode in ['image', 'text', 'multimodal'], "mode parameter must be image, text, or multimodal"
48
+ if mode == 'image':
49
+ # return image features
50
+ image_embeds = self.visual_encoder(image)
51
+ return image_embeds
52
+
53
+ elif mode == 'text':
54
+ # return text features
55
+ text_output = self.text_encoder(text, attention_mask=attention_mask, return_dict=True, mode='text')
56
+ return text_output.last_hidden_state
57
+
58
+ elif mode == 'multimodal':
59
+ # return multimodal features
60
+ image_embeds = self.visual_encoder(image)
61
+ image_atts = torch.ones(image_embeds.size()[:-1], dtype=torch.long).to(image.device)
62
+
63
+ text[:, 0] = self.tokenizer.enc_token_id
64
+ output = self.text_encoder(text,
65
+ attention_mask=attention_mask,
66
+ encoder_hidden_states=image_embeds,
67
+ encoder_attention_mask=image_atts,
68
+ return_dict=True,
69
+ )
70
+ return output.last_hidden_state
71
+
72
+
73
+ class BLIP_Decoder(nn.Module):
74
+ def __init__(self,
75
+ med_config='models/blip_override/med_config.json',
76
+ image_size=384,
77
+ vit='base',
78
+ vit_grad_ckpt=False,
79
+ vit_ckpt_layer=0,
80
+ prompt='a picture of ',
81
+ ):
82
+ """
83
+ Args:
84
+ med_config (str): path for the mixture of encoder-decoder model's configuration file
85
+ image_size (int): input image size
86
+ vit (str): model size of vision transformer
87
+ """
88
+ super().__init__()
89
+
90
+ self.visual_encoder, vision_width = create_vit(vit, image_size, vit_grad_ckpt, vit_ckpt_layer)
91
+ self.tokenizer = init_tokenizer()
92
+ med_config = BertConfig.from_json_file(med_config)
93
+ med_config.encoder_width = vision_width
94
+ self.text_decoder = BertLMHeadModel(config=med_config)
95
+
96
+ self.prompt = prompt
97
+ self.prompt_length = len(self.tokenizer(self.prompt).input_ids) - 1
98
+
99
+ def forward(self, image, caption):
100
+
101
+ image_embeds = self.visual_encoder(image)
102
+ image_atts = torch.ones(image_embeds.size()[:-1], dtype=torch.long).to(image.device)
103
+
104
+ text = self.tokenizer(caption, padding='longest', truncation=True, max_length=40, return_tensors="pt").to(
105
+ image.device)
106
+
107
+ text.input_ids[:, 0] = self.tokenizer.bos_token_id
108
+
109
+ decoder_targets = text.input_ids.masked_fill(text.input_ids == self.tokenizer.pad_token_id, -100)
110
+ decoder_targets[:, :self.prompt_length] = -100
111
+
112
+ decoder_output = self.text_decoder(text.input_ids,
113
+ attention_mask=text.attention_mask,
114
+ encoder_hidden_states=image_embeds,
115
+ encoder_attention_mask=image_atts,
116
+ labels=decoder_targets,
117
+ return_dict=True,
118
+ )
119
+ loss_lm = decoder_output.loss
120
+
121
+ return loss_lm
122
+
123
+ def generate(self, image, sample=False, num_beams=3, max_length=30, min_length=10, top_p=0.9,
124
+ repetition_penalty=1.0):
125
+ image_embeds = self.visual_encoder(image)
126
+
127
+ if not sample:
128
+ image_embeds = image_embeds.repeat_interleave(num_beams, dim=0)
129
+
130
+ image_atts = torch.ones(image_embeds.size()[:-1], dtype=torch.long).to(image.device)
131
+ model_kwargs = {"encoder_hidden_states": image_embeds, "encoder_attention_mask": image_atts}
132
+
133
+ prompt = [self.prompt] * image.size(0)
134
+ input_ids = self.tokenizer(prompt, return_tensors="pt").input_ids.to(image.device)
135
+ input_ids[:, 0] = self.tokenizer.bos_token_id
136
+ input_ids = input_ids[:, :-1]
137
+
138
+ if sample:
139
+ # nucleus sampling
140
+ outputs = self.text_decoder.generate(input_ids=input_ids,
141
+ max_length=max_length,
142
+ min_length=min_length,
143
+ do_sample=True,
144
+ top_p=top_p,
145
+ num_return_sequences=1,
146
+ eos_token_id=self.tokenizer.sep_token_id,
147
+ pad_token_id=self.tokenizer.pad_token_id,
148
+ repetition_penalty=1.1,
149
+ **model_kwargs)
150
+ else:
151
+ # beam search
152
+ outputs = self.text_decoder.generate(input_ids=input_ids,
153
+ max_length=max_length,
154
+ min_length=min_length,
155
+ num_beams=num_beams,
156
+ eos_token_id=self.tokenizer.sep_token_id,
157
+ pad_token_id=self.tokenizer.pad_token_id,
158
+ repetition_penalty=repetition_penalty,
159
+ **model_kwargs)
160
+
161
+ captions = []
162
+ for output in outputs:
163
+ caption = self.tokenizer.decode(output, skip_special_tokens=True)
164
+ captions.append(caption[len(self.prompt):])
165
+ return captions
166
+
167
+
168
+ def blip_decoder(pretrained='', **kwargs):
169
+ model = BLIP_Decoder(**kwargs)
170
+ if pretrained:
171
+ model, msg = load_checkpoint(model, pretrained)
172
+ assert (len(msg.missing_keys) == 0)
173
+ return model
174
+
175
+
176
+ def blip_feature_extractor(pretrained='', **kwargs):
177
+ model = BLIP_Base(**kwargs)
178
+ if pretrained:
179
+ model, msg = load_checkpoint(model, pretrained)
180
+ assert (len(msg.missing_keys) == 0)
181
+ return model
182
+
183
+
184
+ def init_tokenizer():
185
+ tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
186
+ tokenizer.add_special_tokens({'bos_token': '[DEC]'})
187
+ tokenizer.add_special_tokens({'additional_special_tokens': ['[ENC]']})
188
+ tokenizer.enc_token_id = tokenizer.additional_special_tokens_ids[0]
189
+ return tokenizer
190
+
191
+
192
+ def create_vit(vit, image_size, use_grad_checkpointing=False, ckpt_layer=0, drop_path_rate=0):
193
+ assert vit in ['base', 'large'], "vit parameter must be base or large"
194
+ assert use_grad_checkpointing is False, 'grad checkpointing is not supported yet'
195
+ if vit == 'base':
196
+ vision_width = 768
197
+ visual_encoder = VisionTransformer(img_size=image_size, patch_size=16, embed_dim=vision_width, depth=12,
198
+ num_heads=12, use_grad_checkpointing=use_grad_checkpointing,
199
+ ckpt_layer=ckpt_layer,
200
+ drop_path_rate=0 or drop_path_rate
201
+ )
202
+ elif vit == 'large':
203
+ vision_width = 1024
204
+ visual_encoder = VisionTransformer(img_size=image_size, patch_size=16, embed_dim=vision_width, depth=24,
205
+ num_heads=16, use_grad_checkpointing=use_grad_checkpointing,
206
+ ckpt_layer=ckpt_layer,
207
+ drop_path_rate=0.1 or drop_path_rate
208
+ )
209
+ return visual_encoder, vision_width
210
+
211
+
212
+ def is_url(url_or_filename):
213
+ parsed = urlparse(url_or_filename)
214
+ return parsed.scheme in ("http", "https")
215
+
216
+
217
+ def load_checkpoint(model, url_or_filename):
218
+ if is_url(url_or_filename):
219
+ cached_file = download_cached_file(url_or_filename, check_hash=False, progress=True)
220
+ checkpoint = torch.load(cached_file, map_location='cpu')
221
+ elif os.path.isfile(url_or_filename):
222
+ checkpoint = torch.load(url_or_filename, map_location='cpu')
223
+ else:
224
+ raise RuntimeError('checkpoint url or path is invalid')
225
+
226
+ state_dict = checkpoint['model']
227
+
228
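+ # Interpolate the ViT position embeddings so checkpoints trained at a different image resolution can be loaded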
+ state_dict['visual_encoder.pos_embed'] = interpolate_pos_embed(state_dict['visual_encoder.pos_embed'],
229
+ model.visual_encoder)
230
+ if 'visual_encoder_m.pos_embed' in model.state_dict().keys():
231
+ state_dict['visual_encoder_m.pos_embed'] = interpolate_pos_embed(state_dict['visual_encoder_m.pos_embed'],
232
+ model.visual_encoder_m)
233
+ for key in model.state_dict().keys():
234
+ if key in state_dict.keys():
235
+ if state_dict[key].shape != model.state_dict()[key].shape:
236
+ del state_dict[key]
237
+
238
+ msg = model.load_state_dict(state_dict, strict=False)
239
+ print('load checkpoint from %s' % url_or_filename)
240
+ return model, msg
models/blip_override/med.py ADDED
@@ -0,0 +1,955 @@
1
+ '''
2
+ * Copyright (c) 2022, salesforce.com, inc.
3
+ * All rights reserved.
4
+ * SPDX-License-Identifier: BSD-3-Clause
5
+ * For full license text, see LICENSE.txt file in the repo root or https://opensource.org/licenses/BSD-3-Clause
6
+ * By Junnan Li
7
+ * Based on huggingface code base
8
+ * https://github.com/huggingface/transformers/blob/v4.15.0/src/transformers/models/bert
9
+ '''
10
+
11
+ import math
12
+ import os
13
+ import warnings
14
+ from dataclasses import dataclass
15
+ from typing import Optional, Tuple
16
+
17
+ import torch
18
+ from torch import Tensor, device, dtype, nn
19
+ import torch.utils.checkpoint
20
+ from torch import nn
21
+ from torch.nn import CrossEntropyLoss
22
+ import torch.nn.functional as F
23
+
24
+ from transformers.activations import ACT2FN
25
+ from transformers.file_utils import (
26
+ ModelOutput,
27
+ )
28
+ from transformers.modeling_outputs import (
29
+ BaseModelOutputWithPastAndCrossAttentions,
30
+ BaseModelOutputWithPoolingAndCrossAttentions,
31
+ CausalLMOutputWithCrossAttentions,
32
+ MaskedLMOutput,
33
+ MultipleChoiceModelOutput,
34
+ NextSentencePredictorOutput,
35
+ QuestionAnsweringModelOutput,
36
+ SequenceClassifierOutput,
37
+ TokenClassifierOutput,
38
+ )
39
+ from transformers.modeling_utils import (
40
+ PreTrainedModel,
41
+ apply_chunking_to_forward,
42
+ find_pruneable_heads_and_indices,
43
+ prune_linear_layer,
44
+ )
45
+ from transformers.utils import logging
46
+ from transformers.models.bert.configuration_bert import BertConfig
47
+
48
+
49
+ logger = logging.get_logger(__name__)
50
+
51
+
52
+ class BertEmbeddings(nn.Module):
53
+ """Construct the embeddings from word and position embeddings."""
54
+
55
+ def __init__(self, config):
56
+ super().__init__()
57
+ self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id)
58
+ self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
59
+
60
+ # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load
61
+ # any TensorFlow checkpoint file
62
+ self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
63
+ self.dropout = nn.Dropout(config.hidden_dropout_prob)
64
+
65
+ # position_ids (1, len position emb) is contiguous in memory and exported when serialized
66
+ self.register_buffer("position_ids", torch.arange(config.max_position_embeddings).expand((1, -1)))
67
+ self.position_embedding_type = getattr(config, "position_embedding_type", "absolute")
68
+
69
+ self.config = config
70
+
71
+ def forward(
72
+ self, input_ids=None, position_ids=None, inputs_embeds=None, past_key_values_length=0
73
+ ):
74
+ if input_ids is not None:
75
+ input_shape = input_ids.size()
76
+ else:
77
+ input_shape = inputs_embeds.size()[:-1]
78
+
79
+ seq_length = input_shape[1]
80
+
81
+ if position_ids is None:
82
+ position_ids = self.position_ids[:, past_key_values_length : seq_length + past_key_values_length]
83
+
84
+ if inputs_embeds is None:
85
+ inputs_embeds = self.word_embeddings(input_ids)
86
+
87
+ embeddings = inputs_embeds
88
+
89
+ if self.position_embedding_type == "absolute":
90
+ position_embeddings = self.position_embeddings(position_ids)
91
+ embeddings += position_embeddings
92
+ embeddings = self.LayerNorm(embeddings)
93
+ embeddings = self.dropout(embeddings)
94
+ return embeddings
95
+
96
+
97
+ class BertSelfAttention(nn.Module):
98
+ def __init__(self, config, is_cross_attention):
99
+ super().__init__()
100
+ self.config = config
101
+ if config.hidden_size % config.num_attention_heads != 0 and not hasattr(config, "embedding_size"):
102
+ raise ValueError(
103
+ "The hidden size (%d) is not a multiple of the number of attention "
104
+ "heads (%d)" % (config.hidden_size, config.num_attention_heads)
105
+ )
106
+
107
+ self.num_attention_heads = config.num_attention_heads
108
+ self.attention_head_size = int(config.hidden_size / config.num_attention_heads)
109
+ self.all_head_size = self.num_attention_heads * self.attention_head_size
110
+
111
+ self.query = nn.Linear(config.hidden_size, self.all_head_size)
112
+ if is_cross_attention:
113
+ self.key = nn.Linear(config.encoder_width, self.all_head_size)
114
+ self.value = nn.Linear(config.encoder_width, self.all_head_size)
115
+ else:
116
+ self.key = nn.Linear(config.hidden_size, self.all_head_size)
117
+ self.value = nn.Linear(config.hidden_size, self.all_head_size)
118
+
119
+ self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
120
+ self.position_embedding_type = getattr(config, "position_embedding_type", "absolute")
121
+ if self.position_embedding_type == "relative_key" or self.position_embedding_type == "relative_key_query":
122
+ self.max_position_embeddings = config.max_position_embeddings
123
+ self.distance_embedding = nn.Embedding(2 * config.max_position_embeddings - 1, self.attention_head_size)
124
+ self.save_attention = False
125
+
126
+ def save_attn_gradients(self, attn_gradients):
127
+ self.attn_gradients = attn_gradients
128
+
129
+ def get_attn_gradients(self):
130
+ return self.attn_gradients
131
+
132
+ def save_attention_map(self, attention_map):
133
+ self.attention_map = attention_map
134
+
135
+ def get_attention_map(self):
136
+ return self.attention_map
137
+
138
+ def transpose_for_scores(self, x):
139
+ new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)
140
+ x = x.view(*new_x_shape)
141
+ return x.permute(0, 2, 1, 3)
142
+
143
+ def forward(
144
+ self,
145
+ hidden_states,
146
+ attention_mask=None,
147
+ head_mask=None,
148
+ encoder_hidden_states=None,
149
+ encoder_attention_mask=None,
150
+ past_key_value=None,
151
+ output_attentions=False,
152
+ ):
153
+ mixed_query_layer = self.query(hidden_states)
154
+
155
+ # If this is instantiated as a cross-attention module, the keys
156
+ # and values come from an encoder; the attention mask needs to be
157
+ # such that the encoder's padding tokens are not attended to.
158
+ is_cross_attention = encoder_hidden_states is not None
159
+
160
+ if is_cross_attention:
161
+ key_layer = self.transpose_for_scores(self.key(encoder_hidden_states))
162
+ value_layer = self.transpose_for_scores(self.value(encoder_hidden_states))
163
+ attention_mask = encoder_attention_mask
164
+ elif past_key_value is not None:
165
+ key_layer = self.transpose_for_scores(self.key(hidden_states))
166
+ value_layer = self.transpose_for_scores(self.value(hidden_states))
167
+ key_layer = torch.cat([past_key_value[0], key_layer], dim=2)
168
+ value_layer = torch.cat([past_key_value[1], value_layer], dim=2)
169
+ else:
170
+ key_layer = self.transpose_for_scores(self.key(hidden_states))
171
+ value_layer = self.transpose_for_scores(self.value(hidden_states))
172
+
173
+ query_layer = self.transpose_for_scores(mixed_query_layer)
174
+
175
+ past_key_value = (key_layer, value_layer)
176
+
177
+ # Take the dot product between "query" and "key" to get the raw attention scores.
178
+ attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
179
+
180
+ if self.position_embedding_type == "relative_key" or self.position_embedding_type == "relative_key_query":
181
+ seq_length = hidden_states.size()[1]
182
+ position_ids_l = torch.arange(seq_length, dtype=torch.long, device=hidden_states.device).view(-1, 1)
183
+ position_ids_r = torch.arange(seq_length, dtype=torch.long, device=hidden_states.device).view(1, -1)
184
+ distance = position_ids_l - position_ids_r
185
+ positional_embedding = self.distance_embedding(distance + self.max_position_embeddings - 1)
186
+ positional_embedding = positional_embedding.to(dtype=query_layer.dtype) # fp16 compatibility
187
+
188
+ if self.position_embedding_type == "relative_key":
189
+ relative_position_scores = torch.einsum("bhld,lrd->bhlr", query_layer, positional_embedding)
190
+ attention_scores = attention_scores + relative_position_scores
191
+ elif self.position_embedding_type == "relative_key_query":
192
+ relative_position_scores_query = torch.einsum("bhld,lrd->bhlr", query_layer, positional_embedding)
193
+ relative_position_scores_key = torch.einsum("bhrd,lrd->bhlr", key_layer, positional_embedding)
194
+ attention_scores = attention_scores + relative_position_scores_query + relative_position_scores_key
195
+
196
+ attention_scores = attention_scores / math.sqrt(self.attention_head_size)
197
+ if attention_mask is not None:
198
+ # Apply the attention mask is (precomputed for all layers in BertModel forward() function)
199
+ attention_scores = attention_scores + attention_mask
200
+
201
+ # Normalize the attention scores to probabilities.
202
+ attention_probs = nn.Softmax(dim=-1)(attention_scores)
203
+
204
+ if is_cross_attention and self.save_attention:
205
+ self.save_attention_map(attention_probs)
206
+ attention_probs.register_hook(self.save_attn_gradients)
207
+
208
+ # This is actually dropping out entire tokens to attend to, which might
209
+ # seem a bit unusual, but is taken from the original Transformer paper.
210
+ attention_probs_dropped = self.dropout(attention_probs)
211
+
212
+ # Mask heads if we want to
213
+ if head_mask is not None:
214
+ attention_probs_dropped = attention_probs_dropped * head_mask
215
+
216
+ context_layer = torch.matmul(attention_probs_dropped, value_layer)
217
+
218
+ context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
219
+ new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
220
+ context_layer = context_layer.view(*new_context_layer_shape)
221
+
222
+ outputs = (context_layer, attention_probs) if output_attentions else (context_layer,)
223
+
224
+ outputs = outputs + (past_key_value,)
225
+ return outputs
226
+
227
+
228
+ class BertSelfOutput(nn.Module):
229
+ def __init__(self, config):
230
+ super().__init__()
231
+ self.dense = nn.Linear(config.hidden_size, config.hidden_size)
232
+ self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
233
+ self.dropout = nn.Dropout(config.hidden_dropout_prob)
234
+
235
+ def forward(self, hidden_states, input_tensor):
236
+ hidden_states = self.dense(hidden_states)
237
+ hidden_states = self.dropout(hidden_states)
238
+ hidden_states = self.LayerNorm(hidden_states + input_tensor)
239
+ return hidden_states
240
+
241
+
242
+ class BertAttention(nn.Module):
243
+ def __init__(self, config, is_cross_attention=False):
244
+ super().__init__()
245
+ self.self = BertSelfAttention(config, is_cross_attention)
246
+ self.output = BertSelfOutput(config)
247
+ self.pruned_heads = set()
248
+
249
+ def prune_heads(self, heads):
250
+ if len(heads) == 0:
251
+ return
252
+ heads, index = find_pruneable_heads_and_indices(
253
+ heads, self.self.num_attention_heads, self.self.attention_head_size, self.pruned_heads
254
+ )
255
+
256
+ # Prune linear layers
257
+ self.self.query = prune_linear_layer(self.self.query, index)
258
+ self.self.key = prune_linear_layer(self.self.key, index)
259
+ self.self.value = prune_linear_layer(self.self.value, index)
260
+ self.output.dense = prune_linear_layer(self.output.dense, index, dim=1)
261
+
262
+ # Update hyper params and store pruned heads
263
+ self.self.num_attention_heads = self.self.num_attention_heads - len(heads)
264
+ self.self.all_head_size = self.self.attention_head_size * self.self.num_attention_heads
265
+ self.pruned_heads = self.pruned_heads.union(heads)
266
+
267
+ def forward(
268
+ self,
269
+ hidden_states,
270
+ attention_mask=None,
271
+ head_mask=None,
272
+ encoder_hidden_states=None,
273
+ encoder_attention_mask=None,
274
+ past_key_value=None,
275
+ output_attentions=False,
276
+ ):
277
+ self_outputs = self.self(
278
+ hidden_states,
279
+ attention_mask,
280
+ head_mask,
281
+ encoder_hidden_states,
282
+ encoder_attention_mask,
283
+ past_key_value,
284
+ output_attentions,
285
+ )
286
+ attention_output = self.output(self_outputs[0], hidden_states)
287
+ outputs = (attention_output,) + self_outputs[1:] # add attentions if we output them
288
+ return outputs
289
+
290
+
291
+ class BertIntermediate(nn.Module):
292
+ def __init__(self, config):
293
+ super().__init__()
294
+ self.dense = nn.Linear(config.hidden_size, config.intermediate_size)
295
+ if isinstance(config.hidden_act, str):
296
+ self.intermediate_act_fn = ACT2FN[config.hidden_act]
297
+ else:
298
+ self.intermediate_act_fn = config.hidden_act
299
+
300
+ def forward(self, hidden_states):
301
+ hidden_states = self.dense(hidden_states)
302
+ hidden_states = self.intermediate_act_fn(hidden_states)
303
+ return hidden_states
304
+
305
+
306
+ class BertOutput(nn.Module):
307
+ def __init__(self, config):
308
+ super().__init__()
309
+ self.dense = nn.Linear(config.intermediate_size, config.hidden_size)
310
+ self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
311
+ self.dropout = nn.Dropout(config.hidden_dropout_prob)
312
+
313
+ def forward(self, hidden_states, input_tensor):
314
+ hidden_states = self.dense(hidden_states)
315
+ hidden_states = self.dropout(hidden_states)
316
+ hidden_states = self.LayerNorm(hidden_states + input_tensor)
317
+ return hidden_states
318
+
319
+
320
+ class BertLayer(nn.Module):
321
+ def __init__(self, config, layer_num):
322
+ super().__init__()
323
+ self.config = config
324
+ self.chunk_size_feed_forward = config.chunk_size_feed_forward
325
+ self.seq_len_dim = 1
326
+ self.attention = BertAttention(config)
327
+ self.layer_num = layer_num
328
+ if self.config.add_cross_attention:
329
+ self.crossattention = BertAttention(config, is_cross_attention=self.config.add_cross_attention)
330
+ self.intermediate = BertIntermediate(config)
331
+ self.output = BertOutput(config)
332
+
333
+ def forward(
334
+ self,
335
+ hidden_states,
336
+ attention_mask=None,
337
+ head_mask=None,
338
+ encoder_hidden_states=None,
339
+ encoder_attention_mask=None,
340
+ past_key_value=None,
341
+ output_attentions=False,
342
+ mode=None,
343
+ ):
344
+ # decoder uni-directional self-attention cached key/values tuple is at positions 1,2
345
+ self_attn_past_key_value = past_key_value[:2] if past_key_value is not None else None
346
+ self_attention_outputs = self.attention(
347
+ hidden_states,
348
+ attention_mask,
349
+ head_mask,
350
+ output_attentions=output_attentions,
351
+ past_key_value=self_attn_past_key_value,
352
+ )
353
+ attention_output = self_attention_outputs[0]
354
+
355
+ outputs = self_attention_outputs[1:-1]
356
+ present_key_value = self_attention_outputs[-1]
357
+
358
+ if mode=='multimodal':
359
+ assert encoder_hidden_states is not None, "encoder_hidden_states must be given for cross-attention layers"
360
+
361
+ cross_attention_outputs = self.crossattention(
362
+ attention_output,
363
+ attention_mask,
364
+ head_mask,
365
+ encoder_hidden_states,
366
+ encoder_attention_mask,
367
+ output_attentions=output_attentions,
368
+ )
369
+ attention_output = cross_attention_outputs[0]
370
+ outputs = outputs + cross_attention_outputs[1:-1] # add cross attentions if we output attention weights
371
+ layer_output = apply_chunking_to_forward(
372
+ self.feed_forward_chunk, self.chunk_size_feed_forward, self.seq_len_dim, attention_output
373
+ )
374
+ outputs = (layer_output,) + outputs
375
+
376
+ outputs = outputs + (present_key_value,)
377
+
378
+ return outputs
379
+
380
+ def feed_forward_chunk(self, attention_output):
381
+ intermediate_output = self.intermediate(attention_output)
382
+ layer_output = self.output(intermediate_output, attention_output)
383
+ return layer_output
384
+
385
+
386
+ class BertEncoder(nn.Module):
387
+ def __init__(self, config):
388
+ super().__init__()
389
+ self.config = config
390
+ self.layer = nn.ModuleList([BertLayer(config,i) for i in range(config.num_hidden_layers)])
391
+ self.gradient_checkpointing = False
392
+
393
+ def forward(
394
+ self,
395
+ hidden_states,
396
+ attention_mask=None,
397
+ head_mask=None,
398
+ encoder_hidden_states=None,
399
+ encoder_attention_mask=None,
400
+ past_key_values=None,
401
+ use_cache=None,
402
+ output_attentions=False,
403
+ output_hidden_states=False,
404
+ return_dict=True,
405
+ mode='multimodal',
406
+ ):
407
+ all_hidden_states = () if output_hidden_states else None
408
+ all_self_attentions = () if output_attentions else None
409
+ all_cross_attentions = () if output_attentions and self.config.add_cross_attention else None
410
+
411
+ next_decoder_cache = () if use_cache else None
412
+
413
+ for i in range(self.config.num_hidden_layers):
414
+ layer_module = self.layer[i]
415
+ if output_hidden_states:
416
+ all_hidden_states = all_hidden_states + (hidden_states,)
417
+
418
+ layer_head_mask = head_mask[i] if head_mask is not None else None
419
+ past_key_value = past_key_values[i] if past_key_values is not None else None
420
+
421
+ if self.gradient_checkpointing and self.training:
422
+
423
+ if use_cache:
424
+ logger.warn(
425
+ "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
426
+ )
427
+ use_cache = False
428
+
429
+ def create_custom_forward(module):
430
+ def custom_forward(*inputs):
431
+ return module(*inputs, past_key_value, output_attentions)
432
+
433
+ return custom_forward
434
+
435
+ layer_outputs = torch.utils.checkpoint.checkpoint(
436
+ create_custom_forward(layer_module),
437
+ hidden_states,
438
+ attention_mask,
439
+ layer_head_mask,
440
+ encoder_hidden_states,
441
+ encoder_attention_mask,
442
+ mode=mode,
443
+ )
444
+ else:
445
+ layer_outputs = layer_module(
446
+ hidden_states,
447
+ attention_mask,
448
+ layer_head_mask,
449
+ encoder_hidden_states,
450
+ encoder_attention_mask,
451
+ past_key_value,
452
+ output_attentions,
453
+ mode=mode,
454
+ )
455
+
456
+ hidden_states = layer_outputs[0]
457
+ if use_cache:
458
+ next_decoder_cache += (layer_outputs[-1],)
459
+ if output_attentions:
460
+ all_self_attentions = all_self_attentions + (layer_outputs[1],)
461
+
462
+ if output_hidden_states:
463
+ all_hidden_states = all_hidden_states + (hidden_states,)
464
+
465
+ if not return_dict:
466
+ return tuple(
467
+ v
468
+ for v in [
469
+ hidden_states,
470
+ next_decoder_cache,
471
+ all_hidden_states,
472
+ all_self_attentions,
473
+ all_cross_attentions,
474
+ ]
475
+ if v is not None
476
+ )
477
+ return BaseModelOutputWithPastAndCrossAttentions(
478
+ last_hidden_state=hidden_states,
479
+ past_key_values=next_decoder_cache,
480
+ hidden_states=all_hidden_states,
481
+ attentions=all_self_attentions,
482
+ cross_attentions=all_cross_attentions,
483
+ )
484
+
485
+
486
+ class BertPooler(nn.Module):
487
+ def __init__(self, config):
488
+ super().__init__()
489
+ self.dense = nn.Linear(config.hidden_size, config.hidden_size)
490
+ self.activation = nn.Tanh()
491
+
492
+ def forward(self, hidden_states):
493
+ # We "pool" the model by simply taking the hidden state corresponding
494
+ # to the first token.
495
+ first_token_tensor = hidden_states[:, 0]
496
+ pooled_output = self.dense(first_token_tensor)
497
+ pooled_output = self.activation(pooled_output)
498
+ return pooled_output
499
+
500
+
501
+ class BertPredictionHeadTransform(nn.Module):
502
+ def __init__(self, config):
503
+ super().__init__()
504
+ self.dense = nn.Linear(config.hidden_size, config.hidden_size)
505
+ if isinstance(config.hidden_act, str):
506
+ self.transform_act_fn = ACT2FN[config.hidden_act]
507
+ else:
508
+ self.transform_act_fn = config.hidden_act
509
+ self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
510
+
511
+ def forward(self, hidden_states):
512
+ hidden_states = self.dense(hidden_states)
513
+ hidden_states = self.transform_act_fn(hidden_states)
514
+ hidden_states = self.LayerNorm(hidden_states)
515
+ return hidden_states
516
+
517
+
518
+ class BertLMPredictionHead(nn.Module):
519
+ def __init__(self, config):
520
+ super().__init__()
521
+ self.transform = BertPredictionHeadTransform(config)
522
+
523
+ # The output weights are the same as the input embeddings, but there is
524
+ # an output-only bias for each token.
525
+ self.decoder = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
526
+
527
+ self.bias = nn.Parameter(torch.zeros(config.vocab_size))
528
+
529
+ # Need a link between the two variables so that the bias is correctly resized with `resize_token_embeddings`
530
+ self.decoder.bias = self.bias
531
+
532
+ def forward(self, hidden_states):
533
+ hidden_states = self.transform(hidden_states)
534
+ hidden_states = self.decoder(hidden_states)
535
+ return hidden_states
536
+
537
+
538
+ class BertOnlyMLMHead(nn.Module):
539
+ def __init__(self, config):
540
+ super().__init__()
541
+ self.predictions = BertLMPredictionHead(config)
542
+
543
+ def forward(self, sequence_output):
544
+ prediction_scores = self.predictions(sequence_output)
545
+ return prediction_scores
546
+
547
+
548
+ class BertPreTrainedModel(PreTrainedModel):
549
+ """
550
+ An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
551
+ models.
552
+ """
553
+
554
+ config_class = BertConfig
555
+ base_model_prefix = "bert"
556
+ _keys_to_ignore_on_load_missing = [r"position_ids"]
557
+
558
+ def _init_weights(self, module):
559
+ """ Initialize the weights """
560
+ if isinstance(module, (nn.Linear, nn.Embedding)):
561
+ # Slightly different from the TF version which uses truncated_normal for initialization
562
+ # cf https://github.com/pytorch/pytorch/pull/5617
563
+ module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
564
+ elif isinstance(module, nn.LayerNorm):
565
+ module.bias.data.zero_()
566
+ module.weight.data.fill_(1.0)
567
+ if isinstance(module, nn.Linear) and module.bias is not None:
568
+ module.bias.data.zero_()
569
+
570
+
571
+ class BertModel(BertPreTrainedModel):
572
+ """
573
+ The model can behave as an encoder (with only self-attention) as well as a decoder, in which case a layer of
574
+ cross-attention is added between the self-attention layers, following the architecture described in `Attention is
575
+ all you need <https://arxiv.org/abs/1706.03762>`__ by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit,
576
+ Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin.
577
+ To be used as a decoder, the model needs to be initialized with the :obj:`is_decoder` argument and :obj:`add_cross_attention` set to :obj:`True`; an :obj:`encoder_hidden_states` is then expected as an
578
+ input to the forward pass.
579
+ """
580
+
581
+ def __init__(self, config, add_pooling_layer=True):
582
+ super().__init__(config)
583
+ self.config = config
584
+
585
+ self.embeddings = BertEmbeddings(config)
586
+
587
+ self.encoder = BertEncoder(config)
588
+
589
+ self.pooler = BertPooler(config) if add_pooling_layer else None
590
+
591
+ self.init_weights()
592
+
593
+
594
+ def get_input_embeddings(self):
595
+ return self.embeddings.word_embeddings
596
+
597
+ def set_input_embeddings(self, value):
598
+ self.embeddings.word_embeddings = value
599
+
600
+ def _prune_heads(self, heads_to_prune):
601
+ """
602
+ Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base
603
+ class PreTrainedModel
604
+ """
605
+ for layer, heads in heads_to_prune.items():
606
+ self.encoder.layer[layer].attention.prune_heads(heads)
607
+
608
+
609
+ def get_extended_attention_mask(self, attention_mask: Tensor, input_shape: Tuple[int], device: device, is_decoder: bool) -> Tensor:
610
+ """
611
+ Makes broadcastable attention and causal masks so that future and masked tokens are ignored.
612
+
613
+ Arguments:
614
+ attention_mask (:obj:`torch.Tensor`):
615
+ Mask with ones indicating tokens to attend to, zeros for tokens to ignore.
616
+ input_shape (:obj:`Tuple[int]`):
617
+ The shape of the input to the model.
618
+ device: (:obj:`torch.device`):
619
+ The device of the input to the model.
620
+
621
+ Returns:
622
+ :obj:`torch.Tensor` The extended attention mask, with the same dtype as :obj:`attention_mask.dtype`.
623
+ """
624
+ # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]
625
+ # ourselves in which case we just need to make it broadcastable to all heads.
626
+ if attention_mask.dim() == 3:
627
+ extended_attention_mask = attention_mask[:, None, :, :]
628
+ elif attention_mask.dim() == 2:
629
+ # Provided a padding mask of dimensions [batch_size, seq_length]
630
+ # - if the model is a decoder, apply a causal mask in addition to the padding mask
631
+ # - if the model is an encoder, make the mask broadcastable to [batch_size, num_heads, seq_length, seq_length]
632
+ if is_decoder:
633
+ batch_size, seq_length = input_shape
634
+
635
+ seq_ids = torch.arange(seq_length, device=device)
636
+ causal_mask = seq_ids[None, None, :].repeat(batch_size, seq_length, 1) <= seq_ids[None, :, None]
637
+ # in case past_key_values are used we need to add a prefix ones mask to the causal mask
638
+ # causal and attention masks must have same type with pytorch version < 1.3
639
+ causal_mask = causal_mask.to(attention_mask.dtype)
640
+
641
+ if causal_mask.shape[1] < attention_mask.shape[1]:
642
+ prefix_seq_len = attention_mask.shape[1] - causal_mask.shape[1]
643
+ causal_mask = torch.cat(
644
+ [
645
+ torch.ones((batch_size, seq_length, prefix_seq_len), device=device, dtype=causal_mask.dtype),
646
+ causal_mask,
647
+ ],
648
+ axis=-1,
649
+ )
650
+
651
+ extended_attention_mask = causal_mask[:, None, :, :] * attention_mask[:, None, None, :]
652
+ else:
653
+ extended_attention_mask = attention_mask[:, None, None, :]
654
+ else:
655
+ raise ValueError(
656
+ "Wrong shape for input_ids (shape {}) or attention_mask (shape {})".format(
657
+ input_shape, attention_mask.shape
658
+ )
659
+ )
660
+
661
+ # Since attention_mask is 1.0 for positions we want to attend and 0.0 for
662
+ # masked positions, this operation will create a tensor which is 0.0 for
663
+ # positions we want to attend and -10000.0 for masked positions.
664
+ # Since we are adding it to the raw scores before the softmax, this is
665
+ # effectively the same as removing these entirely.
666
+ extended_attention_mask = extended_attention_mask.to(dtype=self.dtype) # fp16 compatibility
667
+ extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0
668
+ return extended_attention_mask
669
+
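For readers skimming `get_extended_attention_mask`: the returned mask is additive, not multiplicative, so it can simply be summed onto the raw attention scores. A tiny worked example of the non-decoder path, with made-up shapes:

```py
import torch

# Batch of 1, sequence length 4, last position is padding.
attention_mask = torch.tensor([[1, 1, 1, 0]])
extended = attention_mask[:, None, None, :].float()   # [batch, 1, 1, seq] -> broadcasts over heads and queries
additive = (1.0 - extended) * -10000.0                 # 0.0 where attended, -10000.0 where masked
scores = torch.zeros(1, 1, 4, 4)                       # dummy attention scores
probs = (scores + additive).softmax(dim=-1)
print(probs[0, 0, 0])                                  # ~[0.333, 0.333, 0.333, 0.000]
```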
670
+ def forward(
671
+ self,
672
+ input_ids=None,
673
+ attention_mask=None,
674
+ position_ids=None,
675
+ head_mask=None,
676
+ inputs_embeds=None,
677
+ encoder_embeds=None,
678
+ encoder_hidden_states=None,
679
+ encoder_attention_mask=None,
680
+ past_key_values=None,
681
+ use_cache=None,
682
+ output_attentions=None,
683
+ output_hidden_states=None,
684
+ return_dict=None,
685
+ is_decoder=False,
686
+ mode='multimodal',
687
+ ):
688
+ r"""
689
+ encoder_hidden_states (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
690
+ Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if
691
+ the model is configured as a decoder.
692
+ encoder_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
693
+ Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in
694
+ the cross-attention if the model is configured as a decoder. Mask values selected in ``[0, 1]``:
695
+ - 1 for tokens that are **not masked**,
696
+ - 0 for tokens that are **masked**.
697
+ past_key_values (:obj:`tuple(tuple(torch.FloatTensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):
698
+ Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.
699
+ If :obj:`past_key_values` are used, the user can optionally input only the last :obj:`decoder_input_ids`
700
+ (those that don't have their past key value states given to this model) of shape :obj:`(batch_size, 1)`
701
+ instead of all :obj:`decoder_input_ids` of shape :obj:`(batch_size, sequence_length)`.
702
+ use_cache (:obj:`bool`, `optional`):
703
+ If set to :obj:`True`, :obj:`past_key_values` key value states are returned and can be used to speed up
704
+ decoding (see :obj:`past_key_values`).
705
+ """
706
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
707
+ output_hidden_states = (
708
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
709
+ )
710
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
711
+
712
+ if is_decoder:
713
+ use_cache = use_cache if use_cache is not None else self.config.use_cache
714
+ else:
715
+ use_cache = False
716
+
717
+ if input_ids is not None and inputs_embeds is not None:
718
+ raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
719
+ elif input_ids is not None:
720
+ input_shape = input_ids.size()
721
+ batch_size, seq_length = input_shape
722
+ device = input_ids.device
723
+ elif inputs_embeds is not None:
724
+ input_shape = inputs_embeds.size()[:-1]
725
+ batch_size, seq_length = input_shape
726
+ device = inputs_embeds.device
727
+ elif encoder_embeds is not None:
728
+ input_shape = encoder_embeds.size()[:-1]
729
+ batch_size, seq_length = input_shape
730
+ device = encoder_embeds.device
731
+ else:
732
+ raise ValueError("You have to specify either input_ids or inputs_embeds or encoder_embeds")
733
+
734
+ # past_key_values_length
735
+ past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0
736
+
737
+ if attention_mask is None:
738
+ attention_mask = torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)
739
+
740
+ # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]
741
+ # ourselves in which case we just need to make it broadcastable to all heads.
742
+ extended_attention_mask: torch.Tensor = self.get_extended_attention_mask(attention_mask, input_shape,
743
+ device, is_decoder)
744
+
745
+ # If a 2D or 3D attention mask is provided for the cross-attention
746
+ # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]
747
+ if encoder_hidden_states is not None:
748
+ if type(encoder_hidden_states) == list:
749
+ encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states[0].size()
750
+ else:
751
+ encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.size()
752
+ encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)
753
+
754
+ if type(encoder_attention_mask) == list:
755
+ encoder_extended_attention_mask = [self.invert_attention_mask(mask) for mask in encoder_attention_mask]
756
+ elif encoder_attention_mask is None:
757
+ encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)
758
+ encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)
759
+ else:
760
+ encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)
761
+ else:
762
+ encoder_extended_attention_mask = None
763
+
764
+ # Prepare head mask if needed
765
+ # 1.0 in head_mask indicate we keep the head
766
+ # attention_probs has shape bsz x n_heads x N x N
767
+ # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]
768
+ # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]
769
+ head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)
770
+
771
+ if encoder_embeds is None:
772
+ embedding_output = self.embeddings(
773
+ input_ids=input_ids,
774
+ position_ids=position_ids,
775
+ inputs_embeds=inputs_embeds,
776
+ past_key_values_length=past_key_values_length,
777
+ )
778
+ else:
779
+ embedding_output = encoder_embeds
780
+
781
+ encoder_outputs = self.encoder(
782
+ embedding_output,
783
+ attention_mask=extended_attention_mask,
784
+ head_mask=head_mask,
785
+ encoder_hidden_states=encoder_hidden_states,
786
+ encoder_attention_mask=encoder_extended_attention_mask,
787
+ past_key_values=past_key_values,
788
+ use_cache=use_cache,
789
+ output_attentions=output_attentions,
790
+ output_hidden_states=output_hidden_states,
791
+ return_dict=return_dict,
792
+ mode=mode,
793
+ )
794
+ sequence_output = encoder_outputs[0]
795
+ pooled_output = self.pooler(sequence_output) if self.pooler is not None else None
796
+
797
+ if not return_dict:
798
+ return (sequence_output, pooled_output) + encoder_outputs[1:]
799
+
800
+ return BaseModelOutputWithPoolingAndCrossAttentions(
801
+ last_hidden_state=sequence_output,
802
+ pooler_output=pooled_output,
803
+ past_key_values=encoder_outputs.past_key_values,
804
+ hidden_states=encoder_outputs.hidden_states,
805
+ attentions=encoder_outputs.attentions,
806
+ cross_attentions=encoder_outputs.cross_attentions,
807
+ )
808
+
809
+
810
+
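The class above mirrors the stock `transformers` BERT encoder/decoder wiring, with the extra `mode` and `encoder_embeds` hooks. As a rough usage sketch, the equivalent cross-attention call on the stock `BertModel` (not this override) looks like this:

```py
import torch
from transformers import BertConfig, BertModel

config = BertConfig(is_decoder=True, add_cross_attention=True)
model = BertModel(config)                               # randomly initialized, shapes only

input_ids = torch.randint(0, config.vocab_size, (1, 5))
image_feats = torch.randn(1, 10, config.hidden_size)    # plays the role of encoder_hidden_states
out = model(
    input_ids=input_ids,
    encoder_hidden_states=image_feats,
    encoder_attention_mask=torch.ones(1, 10, dtype=torch.long),
)
print(out.last_hidden_state.shape)                       # torch.Size([1, 5, 768])
```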
811
+ class BertLMHeadModel(BertPreTrainedModel):
812
+
813
+ _keys_to_ignore_on_load_unexpected = [r"pooler"]
814
+ _keys_to_ignore_on_load_missing = [r"position_ids", r"predictions.decoder.bias"]
815
+
816
+ def __init__(self, config):
817
+ super().__init__(config)
818
+
819
+ self.bert = BertModel(config, add_pooling_layer=False)
820
+ self.cls = BertOnlyMLMHead(config)
821
+
822
+ self.init_weights()
823
+
824
+ def get_output_embeddings(self):
825
+ return self.cls.predictions.decoder
826
+
827
+ def set_output_embeddings(self, new_embeddings):
828
+ self.cls.predictions.decoder = new_embeddings
829
+
830
+ def forward(
831
+ self,
832
+ input_ids=None,
833
+ attention_mask=None,
834
+ position_ids=None,
835
+ head_mask=None,
836
+ inputs_embeds=None,
837
+ encoder_hidden_states=None,
838
+ encoder_attention_mask=None,
839
+ labels=None,
840
+ past_key_values=None,
841
+ use_cache=None,
842
+ output_attentions=None,
843
+ output_hidden_states=None,
844
+ return_dict=None,
845
+ return_logits=False,
846
+ is_decoder=True,
847
+ reduction='mean',
848
+ mode='multimodal',
849
+ ):
850
+ r"""
851
+ encoder_hidden_states (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
852
+ Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if
853
+ the model is configured as a decoder.
854
+ encoder_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
855
+ Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in
856
+ the cross-attention if the model is configured as a decoder. Mask values selected in ``[0, 1]``:
857
+ - 1 for tokens that are **not masked**,
858
+ - 0 for tokens that are **masked**.
859
+ labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
860
+ Labels for computing the left-to-right language modeling loss (next word prediction). Indices should be in
861
+ ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring) Tokens with indices set to ``-100`` are
862
+ ignored (masked), the loss is only computed for the tokens with labels in ``[0, ..., config.vocab_size]``
863
+ past_key_values (:obj:`tuple(tuple(torch.FloatTensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):
864
+ Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.
865
+ If :obj:`past_key_values` are used, the user can optionally input only the last :obj:`decoder_input_ids`
866
+ (those that don't have their past key value states given to this model) of shape :obj:`(batch_size, 1)`
867
+ instead of all :obj:`decoder_input_ids` of shape :obj:`(batch_size, sequence_length)`.
868
+ use_cache (:obj:`bool`, `optional`):
869
+ If set to :obj:`True`, :obj:`past_key_values` key value states are returned and can be used to speed up
870
+ decoding (see :obj:`past_key_values`).
871
+ Returns:
872
+ Example::
873
+ >>> from transformers import BertTokenizer, BertLMHeadModel, BertConfig
874
+ >>> import torch
875
+ >>> tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
876
+ >>> config = BertConfig.from_pretrained("bert-base-cased")
877
+ >>> model = BertLMHeadModel.from_pretrained('bert-base-cased', config=config)
878
+ >>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
879
+ >>> outputs = model(**inputs)
880
+ >>> prediction_logits = outputs.logits
881
+ """
882
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
883
+ if labels is not None:
884
+ use_cache = False
885
+
886
+ outputs = self.bert(
887
+ input_ids,
888
+ attention_mask=attention_mask,
889
+ position_ids=position_ids,
890
+ head_mask=head_mask,
891
+ inputs_embeds=inputs_embeds,
892
+ encoder_hidden_states=encoder_hidden_states,
893
+ encoder_attention_mask=encoder_attention_mask,
894
+ past_key_values=past_key_values,
895
+ use_cache=use_cache,
896
+ output_attentions=output_attentions,
897
+ output_hidden_states=output_hidden_states,
898
+ return_dict=return_dict,
899
+ is_decoder=is_decoder,
900
+ mode=mode,
901
+ )
902
+
903
+ sequence_output = outputs[0]
904
+ prediction_scores = self.cls(sequence_output)
905
+
906
+ if return_logits:
907
+ return prediction_scores[:, :-1, :].contiguous()
908
+
909
+ lm_loss = None
910
+ if labels is not None:
911
+ # we are doing next-token prediction; shift prediction scores and input ids by one
912
+ shifted_prediction_scores = prediction_scores[:, :-1, :].contiguous()
913
+ labels = labels[:, 1:].contiguous()
914
+ loss_fct = CrossEntropyLoss(reduction=reduction, label_smoothing=0.1)
915
+ lm_loss = loss_fct(shifted_prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))
916
+ if reduction=='none':
917
+ lm_loss = lm_loss.view(prediction_scores.size(0),-1).sum(1)
918
+
919
+ if not return_dict:
920
+ output = (prediction_scores,) + outputs[2:]
921
+ return ((lm_loss,) + output) if lm_loss is not None else output
922
+
923
+ return CausalLMOutputWithCrossAttentions(
924
+ loss=lm_loss,
925
+ logits=prediction_scores,
926
+ past_key_values=outputs.past_key_values,
927
+ hidden_states=outputs.hidden_states,
928
+ attentions=outputs.attentions,
929
+ cross_attentions=outputs.cross_attentions,
930
+ )
931
+
932
+ def prepare_inputs_for_generation(self, input_ids, past=None, attention_mask=None, **model_kwargs):
933
+ input_shape = input_ids.shape
934
+ # if model is used as a decoder in encoder-decoder model, the decoder attention mask is created on the fly
935
+ if attention_mask is None:
936
+ attention_mask = input_ids.new_ones(input_shape)
937
+
938
+ # cut decoder_input_ids if past is used
939
+ if past is not None:
940
+ input_ids = input_ids[:, -1:]
941
+
942
+ return {
943
+ "input_ids": input_ids,
944
+ "attention_mask": attention_mask,
945
+ "past_key_values": past,
946
+ "encoder_hidden_states": model_kwargs.get("encoder_hidden_states", None),
947
+ "encoder_attention_mask": model_kwargs.get("encoder_attention_mask", None),
948
+ "is_decoder": True,
949
+ }
950
+
951
+ def _reorder_cache(self, past, beam_idx):
952
+ reordered_past = ()
953
+ for layer_past in past:
954
+ reordered_past += (tuple(past_state.index_select(0, beam_idx) for past_state in layer_past),)
955
+ return reordered_past
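The loss in `BertLMHeadModel.forward` is a standard next-token objective: logits at position t are scored against the label at position t+1, with label smoothing 0.1. A toy sketch of the same shifting (random tensors, not real model outputs):

```py
import torch
from torch.nn import CrossEntropyLoss

vocab_size = 11                                       # toy vocabulary
logits = torch.randn(2, 6, vocab_size)                # [batch, seq, vocab], stand-in for prediction_scores
labels = torch.randint(0, vocab_size, (2, 6))

shifted_logits = logits[:, :-1, :].contiguous()       # position t predicts token t+1
shifted_labels = labels[:, 1:].contiguous()
loss_fct = CrossEntropyLoss(reduction='mean', label_smoothing=0.1)
loss = loss_fct(shifted_logits.view(-1, vocab_size), shifted_labels.view(-1))
print(loss.item())
```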
models/blip_override/med_config.json ADDED
@@ -0,0 +1,21 @@
1
+ {
2
+ "architectures": [
3
+ "BertModel"
4
+ ],
5
+ "attention_probs_dropout_prob": 0.1,
6
+ "hidden_act": "gelu",
7
+ "hidden_dropout_prob": 0.1,
8
+ "hidden_size": 768,
9
+ "initializer_range": 0.02,
10
+ "intermediate_size": 3072,
11
+ "layer_norm_eps": 1e-12,
12
+ "max_position_embeddings": 512,
13
+ "model_type": "bert",
14
+ "num_attention_heads": 12,
15
+ "num_hidden_layers": 12,
16
+ "pad_token_id": 0,
17
+ "type_vocab_size": 2,
18
+ "vocab_size": 30524,
19
+ "encoder_width": 768,
20
+ "add_cross_attention": true
21
+ }
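For reference, a config file like this one is normally consumed through `BertConfig.from_json_file` (upstream BLIP loads its `med_config.json` the same way); keys that `BertConfig` does not know natively, such as `encoder_width`, end up as plain attributes on the config object:

```py
from transformers import BertConfig

# Path is the file added in this PR.
med_config = BertConfig.from_json_file("models/blip_override/med_config.json")
print(med_config.hidden_size, med_config.add_cross_attention)   # 768 True
print(med_config.encoder_width)                                  # 768 (kept as a plain attribute)
```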
models/blip_override/vit.py ADDED
@@ -0,0 +1,302 @@
1
+ '''
2
+ * Copyright (c) 2022, salesforce.com, inc.
3
+ * All rights reserved.
4
+ * SPDX-License-Identifier: BSD-3-Clause
5
+ * For full license text, see LICENSE.txt file in the repo root or https://opensource.org/licenses/BSD-3-Clause
6
+ * By Junnan Li
7
+ * Based on timm code base
8
+ * https://github.com/rwightman/pytorch-image-models/tree/master/timm
9
+ '''
10
+
11
+ import torch
12
+ import torch.nn as nn
13
+ import torch.nn.functional as F
14
+ from functools import partial
15
+
16
+ from timm.models.vision_transformer import _cfg, PatchEmbed, resize_pos_embed
17
+ from timm.models.registry import register_model
18
+ from timm.models.layers import trunc_normal_, DropPath
19
+ from timm.models.helpers import named_apply, adapt_input_conv
20
+
21
+
22
+ class Mlp(nn.Module):
23
+ """ MLP as used in Vision Transformer, MLP-Mixer and related networks
24
+ """
25
+
26
+ def __init__(self, in_features, hidden_features=None, out_features=None, act_layer=nn.GELU, drop=0.):
27
+ super().__init__()
28
+ out_features = out_features or in_features
29
+ hidden_features = hidden_features or in_features
30
+ self.fc1 = nn.Linear(in_features, hidden_features)
31
+ self.act = act_layer()
32
+ self.fc2 = nn.Linear(hidden_features, out_features)
33
+ self.drop = nn.Dropout(drop)
34
+
35
+ def forward(self, x):
36
+ x = self.fc1(x)
37
+ x = self.act(x)
38
+ x = self.drop(x)
39
+ x = self.fc2(x)
40
+ x = self.drop(x)
41
+ return x
42
+
43
+
44
+ class Attention(nn.Module):
45
+ def __init__(self, dim, num_heads=8, qkv_bias=False, qk_scale=None, attn_drop=0., proj_drop=0.):
46
+ super().__init__()
47
+ self.num_heads = num_heads
48
+ head_dim = dim // num_heads
49
+ # NOTE scale factor was wrong in my original version, can set manually to be compat with prev weights
50
+ self.scale = qk_scale or head_dim ** -0.5
51
+ self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
52
+ self.attn_drop = nn.Dropout(attn_drop)
53
+ self.proj = nn.Linear(dim, dim)
54
+ self.proj_drop = nn.Dropout(proj_drop)
55
+ self.attn_gradients = None
56
+ self.attention_map = None
57
+
58
+ def save_attn_gradients(self, attn_gradients):
59
+ self.attn_gradients = attn_gradients
60
+
61
+ def get_attn_gradients(self):
62
+ return self.attn_gradients
63
+
64
+ def save_attention_map(self, attention_map):
65
+ self.attention_map = attention_map
66
+
67
+ def get_attention_map(self):
68
+ return self.attention_map
69
+
70
+ def forward(self, x, register_hook=False):
71
+ B, N, C = x.shape
72
+ qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
73
+ q, k, v = qkv[0], qkv[1], qkv[2] # make torchscript happy (cannot use tensor as tuple)
74
+
75
+ attn = (q @ k.transpose(-2, -1)) * self.scale
76
+ attn = attn.softmax(dim=-1)
77
+ attn = self.attn_drop(attn)
78
+
79
+ if register_hook:
80
+ self.save_attention_map(attn)
81
+ attn.register_hook(self.save_attn_gradients)
82
+
83
+ x = (attn @ v).transpose(1, 2).reshape(B, N, C)
84
+ x = self.proj(x)
85
+ x = self.proj_drop(x)
86
+ return x
87
+
88
+
89
+ class Block(nn.Module):
90
+
91
+ def __init__(self, dim, num_heads, mlp_ratio=4., qkv_bias=False, qk_scale=None, drop=0., attn_drop=0.,
92
+ drop_path=0., act_layer=nn.GELU, norm_layer=nn.LayerNorm, use_grad_checkpointing=False):
93
+ super().__init__()
94
+ self.norm1 = norm_layer(dim)
95
+ self.attn = Attention(
96
+ dim, num_heads=num_heads, qkv_bias=qkv_bias, qk_scale=qk_scale, attn_drop=attn_drop, proj_drop=drop)
97
+ # NOTE: drop path for stochastic depth, we shall see if this is better than dropout here
98
+ self.drop_path = DropPath(drop_path) if drop_path > 0. else nn.Identity()
99
+ self.norm2 = norm_layer(dim)
100
+ mlp_hidden_dim = int(dim * mlp_ratio)
101
+ self.mlp = Mlp(in_features=dim, hidden_features=mlp_hidden_dim, act_layer=act_layer, drop=drop)
102
+
103
+ def forward(self, x, register_hook=False):
104
+ x = x + self.drop_path(self.attn(self.norm1(x), register_hook=register_hook))
105
+ x = x + self.drop_path(self.mlp(self.norm2(x)))
106
+ return x
107
+
108
+
109
+ class VisionTransformer(nn.Module):
110
+ """ Vision Transformer
111
+ A PyTorch impl of : `An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale` -
112
+ https://arxiv.org/abs/2010.11929
113
+ """
114
+
115
+ def __init__(self, img_size=224, patch_size=16, in_chans=3, num_classes=1000, embed_dim=768, depth=12,
116
+ num_heads=12, mlp_ratio=4., qkv_bias=True, qk_scale=None, representation_size=None,
117
+ drop_rate=0., attn_drop_rate=0., drop_path_rate=0., norm_layer=None,
118
+ use_grad_checkpointing=False, ckpt_layer=0):
119
+ """
120
+ Args:
121
+ img_size (int, tuple): input image size
122
+ patch_size (int, tuple): patch size
123
+ in_chans (int): number of input channels
124
+ num_classes (int): number of classes for classification head
125
+ embed_dim (int): embedding dimension
126
+ depth (int): depth of transformer
127
+ num_heads (int): number of attention heads
128
+ mlp_ratio (int): ratio of mlp hidden dim to embedding dim
129
+ qkv_bias (bool): enable bias for qkv if True
130
+ qk_scale (float): override default qk scale of head_dim ** -0.5 if set
131
+ representation_size (Optional[int]): enable and set representation layer (pre-logits) to this value if set
132
+ drop_rate (float): dropout rate
133
+ attn_drop_rate (float): attention dropout rate
134
+ drop_path_rate (float): stochastic depth rate
135
+ norm_layer: (nn.Module): normalization layer
136
+ """
137
+ super().__init__()
138
+ self.num_features = self.embed_dim = embed_dim # num_features for consistency with other models
139
+ norm_layer = norm_layer or partial(nn.LayerNorm, eps=1e-6)
140
+
141
+ self.patch_embed = PatchEmbed(
142
+ img_size=img_size, patch_size=patch_size, in_chans=in_chans, embed_dim=embed_dim)
143
+
144
+ num_patches = self.patch_embed.num_patches
145
+
146
+ self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
147
+ self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
148
+ self.pos_drop = nn.Dropout(p=drop_rate)
149
+
150
+ dpr = [x.item() for x in torch.linspace(0, drop_path_rate, depth)] # stochastic depth decay rule
151
+ self.blocks = nn.ModuleList([
152
+ Block(
153
+ dim=embed_dim, num_heads=num_heads, mlp_ratio=mlp_ratio, qkv_bias=qkv_bias, qk_scale=qk_scale,
154
+ drop=drop_rate, attn_drop=attn_drop_rate, drop_path=dpr[i], norm_layer=norm_layer,
155
+ use_grad_checkpointing=(use_grad_checkpointing and i >= depth - ckpt_layer)
156
+ )
157
+ for i in range(depth)])
158
+ self.norm = norm_layer(embed_dim)
159
+
160
+ trunc_normal_(self.pos_embed, std=.02)
161
+ trunc_normal_(self.cls_token, std=.02)
162
+ self.apply(self._init_weights)
163
+
164
+ def _init_weights(self, m):
165
+ if isinstance(m, nn.Linear):
166
+ trunc_normal_(m.weight, std=.02)
167
+ if isinstance(m, nn.Linear) and m.bias is not None:
168
+ nn.init.constant_(m.bias, 0)
169
+ elif isinstance(m, nn.LayerNorm):
170
+ nn.init.constant_(m.bias, 0)
171
+ nn.init.constant_(m.weight, 1.0)
172
+
173
+ @torch.jit.ignore
174
+ def no_weight_decay(self):
175
+ return {'pos_embed', 'cls_token'}
176
+
177
+ def forward(self, x, register_blk=-1):
178
+ B = x.shape[0]
179
+ x = self.patch_embed(x)
180
+
181
+ cls_tokens = self.cls_token.expand(B, -1, -1) # stole cls_tokens impl from Phil Wang, thanks
182
+ x = torch.cat((cls_tokens, x), dim=1)
183
+
184
+ x = x + self.pos_embed[:, :x.size(1), :]
185
+ x = self.pos_drop(x)
186
+
187
+ for i, blk in enumerate(self.blocks):
188
+ x = blk(x, register_blk == i)
189
+ x = self.norm(x)
190
+
191
+ return x
192
+
193
+ @torch.jit.ignore()
194
+ def load_pretrained(self, checkpoint_path, prefix=''):
195
+ _load_weights(self, checkpoint_path, prefix)
196
+
197
+
198
+ @torch.no_grad()
199
+ def _load_weights(model: VisionTransformer, checkpoint_path: str, prefix: str = ''):
200
+ """ Load weights from .npz checkpoints for official Google Brain Flax implementation
201
+ """
202
+ import numpy as np
203
+
204
+ def _n2p(w, t=True):
205
+ if w.ndim == 4 and w.shape[0] == w.shape[1] == w.shape[2] == 1:
206
+ w = w.flatten()
207
+ if t:
208
+ if w.ndim == 4:
209
+ w = w.transpose([3, 2, 0, 1])
210
+ elif w.ndim == 3:
211
+ w = w.transpose([2, 0, 1])
212
+ elif w.ndim == 2:
213
+ w = w.transpose([1, 0])
214
+ return torch.from_numpy(w)
215
+
216
+ w = np.load(checkpoint_path)
217
+ if not prefix and 'opt/target/embedding/kernel' in w:
218
+ prefix = 'opt/target/'
219
+
220
+ if hasattr(model.patch_embed, 'backbone'):
221
+ # hybrid
222
+ backbone = model.patch_embed.backbone
223
+ stem_only = not hasattr(backbone, 'stem')
224
+ stem = backbone if stem_only else backbone.stem
225
+ stem.conv.weight.copy_(adapt_input_conv(stem.conv.weight.shape[1], _n2p(w[f'{prefix}conv_root/kernel'])))
226
+ stem.norm.weight.copy_(_n2p(w[f'{prefix}gn_root/scale']))
227
+ stem.norm.bias.copy_(_n2p(w[f'{prefix}gn_root/bias']))
228
+ if not stem_only:
229
+ for i, stage in enumerate(backbone.stages):
230
+ for j, block in enumerate(stage.blocks):
231
+ bp = f'{prefix}block{i + 1}/unit{j + 1}/'
232
+ for r in range(3):
233
+ getattr(block, f'conv{r + 1}').weight.copy_(_n2p(w[f'{bp}conv{r + 1}/kernel']))
234
+ getattr(block, f'norm{r + 1}').weight.copy_(_n2p(w[f'{bp}gn{r + 1}/scale']))
235
+ getattr(block, f'norm{r + 1}').bias.copy_(_n2p(w[f'{bp}gn{r + 1}/bias']))
236
+ if block.downsample is not None:
237
+ block.downsample.conv.weight.copy_(_n2p(w[f'{bp}conv_proj/kernel']))
238
+ block.downsample.norm.weight.copy_(_n2p(w[f'{bp}gn_proj/scale']))
239
+ block.downsample.norm.bias.copy_(_n2p(w[f'{bp}gn_proj/bias']))
240
+ embed_conv_w = _n2p(w[f'{prefix}embedding/kernel'])
241
+ else:
242
+ embed_conv_w = adapt_input_conv(
243
+ model.patch_embed.proj.weight.shape[1], _n2p(w[f'{prefix}embedding/kernel']))
244
+ model.patch_embed.proj.weight.copy_(embed_conv_w)
245
+ model.patch_embed.proj.bias.copy_(_n2p(w[f'{prefix}embedding/bias']))
246
+ model.cls_token.copy_(_n2p(w[f'{prefix}cls'], t=False))
247
+ pos_embed_w = _n2p(w[f'{prefix}Transformer/posembed_input/pos_embedding'], t=False)
248
+ if pos_embed_w.shape != model.pos_embed.shape:
249
+ pos_embed_w = resize_pos_embed( # resize pos embedding when different size from pretrained weights
250
+ pos_embed_w, model.pos_embed, getattr(model, 'num_tokens', 1), model.patch_embed.grid_size)
251
+ model.pos_embed.copy_(pos_embed_w)
252
+ model.norm.weight.copy_(_n2p(w[f'{prefix}Transformer/encoder_norm/scale']))
253
+ model.norm.bias.copy_(_n2p(w[f'{prefix}Transformer/encoder_norm/bias']))
254
+ # if isinstance(model.head, nn.Linear) and model.head.bias.shape[0] == w[f'{prefix}head/bias'].shape[-1]:
255
+ # model.head.weight.copy_(_n2p(w[f'{prefix}head/kernel']))
256
+ # model.head.bias.copy_(_n2p(w[f'{prefix}head/bias']))
257
+ # if isinstance(getattr(model.pre_logits, 'fc', None), nn.Linear) and f'{prefix}pre_logits/bias' in w:
258
+ # model.pre_logits.fc.weight.copy_(_n2p(w[f'{prefix}pre_logits/kernel']))
259
+ # model.pre_logits.fc.bias.copy_(_n2p(w[f'{prefix}pre_logits/bias']))
260
+ for i, block in enumerate(model.blocks.children()):
261
+ block_prefix = f'{prefix}Transformer/encoderblock_{i}/'
262
+ mha_prefix = block_prefix + 'MultiHeadDotProductAttention_1/'
263
+ block.norm1.weight.copy_(_n2p(w[f'{block_prefix}LayerNorm_0/scale']))
264
+ block.norm1.bias.copy_(_n2p(w[f'{block_prefix}LayerNorm_0/bias']))
265
+ block.attn.qkv.weight.copy_(torch.cat([
266
+ _n2p(w[f'{mha_prefix}{n}/kernel'], t=False).flatten(1).T for n in ('query', 'key', 'value')]))
267
+ block.attn.qkv.bias.copy_(torch.cat([
268
+ _n2p(w[f'{mha_prefix}{n}/bias'], t=False).reshape(-1) for n in ('query', 'key', 'value')]))
269
+ block.attn.proj.weight.copy_(_n2p(w[f'{mha_prefix}out/kernel']).flatten(1))
270
+ block.attn.proj.bias.copy_(_n2p(w[f'{mha_prefix}out/bias']))
271
+ for r in range(2):
272
+ getattr(block.mlp, f'fc{r + 1}').weight.copy_(_n2p(w[f'{block_prefix}MlpBlock_3/Dense_{r}/kernel']))
273
+ getattr(block.mlp, f'fc{r + 1}').bias.copy_(_n2p(w[f'{block_prefix}MlpBlock_3/Dense_{r}/bias']))
274
+ block.norm2.weight.copy_(_n2p(w[f'{block_prefix}LayerNorm_2/scale']))
275
+ block.norm2.bias.copy_(_n2p(w[f'{block_prefix}LayerNorm_2/bias']))
276
+
277
+
278
+ def interpolate_pos_embed(pos_embed_checkpoint, visual_encoder):
279
+ # interpolate position embedding
280
+ embedding_size = pos_embed_checkpoint.shape[-1]
281
+ num_patches = visual_encoder.patch_embed.num_patches
282
+ num_extra_tokens = visual_encoder.pos_embed.shape[-2] - num_patches
283
+ # height (== width) for the checkpoint position embedding
284
+ orig_size = int((pos_embed_checkpoint.shape[-2] - num_extra_tokens) ** 0.5)
285
+ # height (== width) for the new position embedding
286
+ new_size = int(num_patches ** 0.5)
287
+
288
+ if orig_size != new_size:
289
+ # class_token and dist_token are kept unchanged
290
+ extra_tokens = pos_embed_checkpoint[:, :num_extra_tokens]
291
+ # only the position tokens are interpolated
292
+ pos_tokens = pos_embed_checkpoint[:, num_extra_tokens:]
293
+ pos_tokens = pos_tokens.reshape(-1, orig_size, orig_size, embedding_size).permute(0, 3, 1, 2)
294
+ pos_tokens = torch.nn.functional.interpolate(
295
+ pos_tokens, size=(new_size, new_size), mode='bicubic', align_corners=False)
296
+ pos_tokens = pos_tokens.permute(0, 2, 3, 1).flatten(1, 2)
297
+ new_pos_embed = torch.cat((extra_tokens, pos_tokens), dim=1)
298
+ print('reshape position embedding from %d to %d' % (orig_size ** 2, new_size ** 2))
299
+
300
+ return new_pos_embed
301
+ else:
302
+ return pos_embed_checkpoint
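`interpolate_pos_embed` only touches the patch tokens; the cls/extra tokens are carried over unchanged. A standalone sketch of the resize step with toy sizes (a 14x14 grid resized to 16x16, small embedding dim):

```py
import torch
import torch.nn.functional as F

dim, old, new = 8, 14, 16
pos_embed = torch.randn(1, 1 + old * old, dim)          # checkpoint pos_embed: 1 cls token + patch tokens
cls_tok, patch_tok = pos_embed[:, :1], pos_embed[:, 1:]
patch_tok = patch_tok.reshape(1, old, old, dim).permute(0, 3, 1, 2)
patch_tok = F.interpolate(patch_tok, size=(new, new), mode='bicubic', align_corners=False)
patch_tok = patch_tok.permute(0, 2, 3, 1).flatten(1, 2)
new_pos_embed = torch.cat((cls_tok, patch_tok), dim=1)
print(new_pos_embed.shape)                               # torch.Size([1, 257, 8])
```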
models/diffusers_override/attention.py ADDED
@@ -0,0 +1,669 @@
1
+ # Copyright 2022 The HuggingFace Team. All rights reserved.
2
+ #
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+ #
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+ #
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
+ import math
15
+ from dataclasses import dataclass
16
+ from typing import Optional
17
+
18
+ import torch
19
+ import torch.nn.functional as F
20
+ from torch import nn
21
+
22
+ from diffusers.configuration_utils import ConfigMixin, register_to_config
23
+ from diffusers.modeling_utils import ModelMixin
24
+ from diffusers.models.embeddings import ImagePositionalEmbeddings
25
+ from diffusers.utils import BaseOutput
26
+ from diffusers.utils.import_utils import is_xformers_available
27
+
28
+
29
+ @dataclass
30
+ class Transformer2DModelOutput(BaseOutput):
31
+ """
32
+ Args:
33
+ sample (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)` or `(batch size, num_vector_embeds - 1, num_latent_pixels)` if [`Transformer2DModel`] is discrete):
34
+ Hidden states conditioned on `encoder_hidden_states` input. If discrete, returns probability distributions
35
+ for the unnoised latent pixels.
36
+ """
37
+
38
+ sample: torch.FloatTensor
39
+
40
+
41
+ if is_xformers_available():
42
+ import xformers
43
+ import xformers.ops
44
+ else:
45
+ xformers = None
46
+
47
+
48
+ class Transformer2DModel(ModelMixin, ConfigMixin):
49
+ """
50
+ Transformer model for image-like data. Takes either discrete (classes of vector embeddings) or continuous (actual
51
+ embeddings) inputs.
52
+
53
+ When input is continuous: First, project the input (aka embedding) and reshape to b, t, d. Then apply standard
54
+ transformer action. Finally, reshape to image.
55
+
56
+ When input is discrete: First, input (classes of latent pixels) is converted to embeddings and has positional
57
+ embeddings applied, see `ImagePositionalEmbeddings`. Then apply standard transformer action. Finally, predict
58
+ classes of unnoised image.
59
+
60
+ Note that it is assumed one of the input classes is the masked latent pixel. The predicted classes of the unnoised
61
+ image do not contain a prediction for the masked pixel as the unnoised image cannot be masked.
62
+
63
+ Parameters:
64
+ num_attention_heads (`int`, *optional*, defaults to 16): The number of heads to use for multi-head attention.
65
+ attention_head_dim (`int`, *optional*, defaults to 88): The number of channels in each head.
66
+ in_channels (`int`, *optional*):
67
+ Pass if the input is continuous. The number of channels in the input and output.
68
+ num_layers (`int`, *optional*, defaults to 1): The number of layers of Transformer blocks to use.
69
+ dropout (`float`, *optional*, defaults to 0.1): The dropout probability to use.
70
+ cross_attention_dim (`int`, *optional*): The number of context dimensions to use.
71
+ sample_size (`int`, *optional*): Pass if the input is discrete. The width of the latent images.
72
+ Note that this is fixed at training time as it is used for learning a number of position embeddings. See
73
+ `ImagePositionalEmbeddings`.
74
+ num_vector_embeds (`int`, *optional*):
75
+ Pass if the input is discrete. The number of classes of the vector embeddings of the latent pixels.
76
+ Includes the class for the masked latent pixel.
77
+ activation_fn (`str`, *optional*, defaults to `"geglu"`): Activation function to be used in feed-forward.
78
+ num_embeds_ada_norm ( `int`, *optional*): Pass if at least one of the norm_layers is `AdaLayerNorm`.
79
+ The number of diffusion steps used during training. Note that this is fixed at training time as it is used
80
+ to learn a number of embeddings that are added to the hidden states. During inference, you can denoise for
81
+ up to but not more than steps than `num_embeds_ada_norm`.
82
+ attention_bias (`bool`, *optional*):
83
+ Configure if the TransformerBlocks' attention should contain a bias parameter.
84
+ """
85
+
86
+ @register_to_config
87
+ def __init__(
88
+ self,
89
+ num_attention_heads: int = 16,
90
+ attention_head_dim: int = 88,
91
+ in_channels: Optional[int] = None,
92
+ num_layers: int = 1,
93
+ dropout: float = 0.0,
94
+ norm_num_groups: int = 32,
95
+ cross_attention_dim: Optional[int] = None,
96
+ attention_bias: bool = False,
97
+ sample_size: Optional[int] = None,
98
+ num_vector_embeds: Optional[int] = None,
99
+ activation_fn: str = "geglu",
100
+ num_embeds_ada_norm: Optional[int] = None,
101
+ ):
102
+ super().__init__()
103
+ self.num_attention_heads = num_attention_heads
104
+ self.attention_head_dim = attention_head_dim
105
+ inner_dim = num_attention_heads * attention_head_dim
106
+
107
+ # 1. Transformer2DModel can process both standard continous images of shape `(batch_size, num_channels, width, height)` as well as quantized image embeddings of shape `(batch_size, num_image_vectors)`
108
+ # Define whether input is continuous or discrete depending on configuration
109
+ self.is_input_continuous = in_channels is not None
110
+ self.is_input_vectorized = num_vector_embeds is not None
111
+
112
+ if self.is_input_continuous and self.is_input_vectorized:
113
+ raise ValueError(
114
+ f"Cannot define both `in_channels`: {in_channels} and `num_vector_embeds`: {num_vector_embeds}. Make"
115
+ " sure that either `in_channels` or `num_vector_embeds` is None."
116
+ )
117
+ elif not self.is_input_continuous and not self.is_input_vectorized:
118
+ raise ValueError(
119
+ f"Has to define either `in_channels`: {in_channels} or `num_vector_embeds`: {num_vector_embeds}. Make"
120
+ " sure that either `in_channels` or `num_vector_embeds` is not None."
121
+ )
122
+
123
+ # 2. Define input layers
124
+ if self.is_input_continuous:
125
+ self.in_channels = in_channels
126
+
127
+ self.norm = torch.nn.GroupNorm(num_groups=norm_num_groups, num_channels=in_channels, eps=1e-6, affine=True)
128
+ self.proj_in = nn.Conv2d(in_channels, inner_dim, kernel_size=1, stride=1, padding=0)
129
+ elif self.is_input_vectorized:
130
+ assert sample_size is not None, "Transformer2DModel over discrete input must provide sample_size"
131
+ assert num_vector_embeds is not None, "Transformer2DModel over discrete input must provide num_embed"
132
+
133
+ self.height = sample_size
134
+ self.width = sample_size
135
+ self.num_vector_embeds = num_vector_embeds
136
+ self.num_latent_pixels = self.height * self.width
137
+
138
+ self.latent_image_embedding = ImagePositionalEmbeddings(
139
+ num_embed=num_vector_embeds, embed_dim=inner_dim, height=self.height, width=self.width
140
+ )
141
+
142
+ # 3. Define transformers blocks
143
+ self.transformer_blocks = nn.ModuleList(
144
+ [
145
+ BasicTransformerBlock(
146
+ inner_dim,
147
+ num_attention_heads,
148
+ attention_head_dim,
149
+ dropout=dropout,
150
+ cross_attention_dim=cross_attention_dim,
151
+ activation_fn=activation_fn,
152
+ num_embeds_ada_norm=num_embeds_ada_norm,
153
+ attention_bias=attention_bias,
154
+ )
155
+ for d in range(num_layers)
156
+ ]
157
+ )
158
+
159
+ # 4. Define output layers
160
+ if self.is_input_continuous:
161
+ self.proj_out = nn.Conv2d(inner_dim, in_channels, kernel_size=1, stride=1, padding=0)
162
+ elif self.is_input_vectorized:
163
+ self.norm_out = nn.LayerNorm(inner_dim)
164
+ self.out = nn.Linear(inner_dim, self.num_vector_embeds - 1)
165
+
166
+ def _set_attention_slice(self, slice_size):
167
+ for block in self.transformer_blocks:
168
+ block._set_attention_slice(slice_size)
169
+
170
+ def forward(self, hidden_states, encoder_hidden_states=None, encoder_attention_mask=None, timestep=None,
171
+ return_dict: bool = True):
172
+ """
173
+ Args:
174
+ hidden_states ( When discrete, `torch.LongTensor` of shape `(batch size, num latent pixels)`.
175
+ When continous, `torch.FloatTensor` of shape `(batch size, channel, height, width)`): Input
176
+ hidden_states
177
+ encoder_hidden_states ( `torch.LongTensor` of shape `(batch size, context dim)`, *optional*):
178
+ Conditional embeddings for cross attention layer. If not given, cross-attention defaults to
179
+ self-attention.
180
+ encoder_attention_mask ( `torch.LongTensor` of shape `(batch size, context)`, *optional*):
181
+ Attention mask for cross attention layer.
182
+ timestep ( `torch.long`, *optional*):
183
+ Optional timestep to be applied as an embedding in AdaLayerNorm's. Used to indicate denoising step.
184
+ return_dict (`bool`, *optional*, defaults to `True`):
185
+ Whether or not to return a [`models.unet_2d_condition.UNet2DConditionOutput`] instead of a plain tuple.
186
+
187
+ Returns:
188
+ [`~models.attention.Transformer2DModelOutput`] or `tuple`: [`~models.attention.Transformer2DModelOutput`]
189
+ if `return_dict` is True, otherwise a `tuple`. When returning a tuple, the first element is the sample
190
+ tensor.
191
+ """
192
+ # 1. Input
193
+ if self.is_input_continuous:
194
+ batch, channel, height, weight = hidden_states.shape
195
+ residual = hidden_states
196
+ hidden_states = self.norm(hidden_states)
197
+ hidden_states = self.proj_in(hidden_states)
198
+ inner_dim = hidden_states.shape[1]
199
+ hidden_states = hidden_states.permute(0, 2, 3, 1).reshape(batch, height * weight, inner_dim)
200
+ elif self.is_input_vectorized:
201
+ hidden_states = self.latent_image_embedding(hidden_states)
202
+
203
+ # 2. Blocks
204
+ for block in self.transformer_blocks:
205
+ hidden_states = block(hidden_states, context=encoder_hidden_states, mask=encoder_attention_mask,
206
+ timestep=timestep)
207
+
208
+ # 3. Output
209
+ if self.is_input_continuous:
210
+ hidden_states = hidden_states.reshape(batch, height, weight, inner_dim).permute(0, 3, 1, 2)
211
+ hidden_states = self.proj_out(hidden_states)
212
+ output = hidden_states + residual
213
+ elif self.is_input_vectorized:
214
+ hidden_states = self.norm_out(hidden_states)
215
+ logits = self.out(hidden_states)
216
+ # (batch, self.num_vector_embeds - 1, self.num_latent_pixels)
217
+ logits = logits.permute(0, 2, 1)
218
+
219
+ # log(p(x_0))
220
+ output = F.log_softmax(logits.double(), dim=1).float()
221
+
222
+ if not return_dict:
223
+ return (output,)
224
+
225
+ return Transformer2DModelOutput(sample=output)
226
+
227
+ def _set_use_memory_efficient_attention_xformers(self, use_memory_efficient_attention_xformers: bool):
228
+ for block in self.transformer_blocks:
229
+ block._set_use_memory_efficient_attention_xformers(use_memory_efficient_attention_xformers)
230
+
231
+
232
+ class AttentionBlock(nn.Module):
233
+ """
234
+ An attention block that allows spatial positions to attend to each other. Originally ported from here, but adapted
235
+ to the N-d case.
236
+ https://github.com/hojonathanho/diffusion/blob/1e0dceb3b3495bbe19116a5e1b3596cd0706c543/diffusion_tf/models/unet.py#L66.
237
+ Uses three q, k, v linear layers to compute attention.
238
+
239
+ Parameters:
240
+ channels (`int`): The number of channels in the input and output.
241
+ num_head_channels (`int`, *optional*):
242
+ The number of channels in each head. If None, then `num_heads` = 1.
243
+ norm_num_groups (`int`, *optional*, defaults to 32): The number of groups to use for group norm.
244
+ rescale_output_factor (`float`, *optional*, defaults to 1.0): The factor to rescale the output by.
245
+ eps (`float`, *optional*, defaults to 1e-5): The epsilon value to use for group norm.
246
+ """
247
+
248
+ def __init__(
249
+ self,
250
+ channels: int,
251
+ num_head_channels: Optional[int] = None,
252
+ norm_num_groups: int = 32,
253
+ rescale_output_factor: float = 1.0,
254
+ eps: float = 1e-5,
255
+ ):
256
+ super().__init__()
257
+ self.channels = channels
258
+
259
+ self.num_heads = channels // num_head_channels if num_head_channels is not None else 1
260
+ self.num_head_size = num_head_channels
261
+ self.group_norm = nn.GroupNorm(num_channels=channels, num_groups=norm_num_groups, eps=eps, affine=True)
262
+
263
+ # define q,k,v as linear layers
264
+ self.query = nn.Linear(channels, channels)
265
+ self.key = nn.Linear(channels, channels)
266
+ self.value = nn.Linear(channels, channels)
267
+
268
+ self.rescale_output_factor = rescale_output_factor
269
+ self.proj_attn = nn.Linear(channels, channels, bias=True)
270
+
271
+ def transpose_for_scores(self, projection: torch.Tensor) -> torch.Tensor:
272
+ new_projection_shape = projection.size()[:-1] + (self.num_heads, -1)
273
+ # move heads to 2nd position (B, T, H * D) -> (B, T, H, D) -> (B, H, T, D)
274
+ new_projection = projection.view(new_projection_shape).permute(0, 2, 1, 3)
275
+ return new_projection
276
+
277
+ def forward(self, hidden_states):
278
+ residual = hidden_states
279
+ batch, channel, height, width = hidden_states.shape
280
+
281
+ # norm
282
+ hidden_states = self.group_norm(hidden_states)
283
+
284
+ hidden_states = hidden_states.view(batch, channel, height * width).transpose(1, 2)
285
+
286
+ # proj to q, k, v
287
+ query_proj = self.query(hidden_states)
288
+ key_proj = self.key(hidden_states)
289
+ value_proj = self.value(hidden_states)
290
+
291
+ # transpose
292
+ query_states = self.transpose_for_scores(query_proj)
293
+ key_states = self.transpose_for_scores(key_proj)
294
+ value_states = self.transpose_for_scores(value_proj)
295
+
296
+ # get scores
297
+ scale = 1 / math.sqrt(math.sqrt(self.channels / self.num_heads))
298
+ attention_scores = torch.matmul(query_states * scale, key_states.transpose(-1, -2) * scale) # TODO: use baddmm
299
+ attention_probs = torch.softmax(attention_scores.float(), dim=-1).type(attention_scores.dtype)
300
+
301
+ # compute attention output
302
+ hidden_states = torch.matmul(attention_probs, value_states)
303
+
304
+ hidden_states = hidden_states.permute(0, 2, 1, 3).contiguous()
305
+ new_hidden_states_shape = hidden_states.size()[:-2] + (self.channels,)
306
+ hidden_states = hidden_states.view(new_hidden_states_shape)
307
+
308
+ # compute next hidden_states
309
+ hidden_states = self.proj_attn(hidden_states)
310
+ hidden_states = hidden_states.transpose(-1, -2).reshape(batch, channel, height, width)
311
+
312
+ # res connect and rescale
313
+ hidden_states = (hidden_states + residual) / self.rescale_output_factor
314
+ return hidden_states
315
+
316
+
317
+ class BasicTransformerBlock(nn.Module):
318
+ r"""
319
+ A basic Transformer block.
320
+
321
+ Parameters:
322
+ dim (`int`): The number of channels in the input and output.
323
+ num_attention_heads (`int`): The number of heads to use for multi-head attention.
324
+ attention_head_dim (`int`): The number of channels in each head.
325
+ dropout (`float`, *optional*, defaults to 0.0): The dropout probability to use.
326
+ cross_attention_dim (`int`, *optional*): The size of the context vector for cross attention.
327
+ activation_fn (`str`, *optional*, defaults to `"geglu"`): Activation function to be used in feed-forward.
328
+ num_embeds_ada_norm (:
329
+ obj: `int`, *optional*): The number of diffusion steps used during training. See `Transformer2DModel`.
330
+ attention_bias (:
331
+ obj: `bool`, *optional*, defaults to `False`): Configure if the attentions should contain a bias parameter.
332
+ """
333
+
334
+ def __init__(
335
+ self,
336
+ dim: int,
337
+ num_attention_heads: int,
338
+ attention_head_dim: int,
339
+ dropout=0.0,
340
+ cross_attention_dim: Optional[int] = None,
341
+ activation_fn: str = "geglu",
342
+ num_embeds_ada_norm: Optional[int] = None,
343
+ attention_bias: bool = False,
344
+ ):
345
+ super().__init__()
346
+ self.attn1 = CrossAttention(
347
+ query_dim=dim,
348
+ heads=num_attention_heads,
349
+ dim_head=attention_head_dim,
350
+ dropout=dropout,
351
+ bias=attention_bias,
352
+ ) # is a self-attention
353
+ self.ff = FeedForward(dim, dropout=dropout, activation_fn=activation_fn)
354
+ self.attn2 = CrossAttention(
355
+ query_dim=dim,
356
+ cross_attention_dim=cross_attention_dim,
357
+ heads=num_attention_heads,
358
+ dim_head=attention_head_dim,
359
+ dropout=dropout,
360
+ bias=attention_bias,
361
+ ) # is self-attn if context is none
362
+
363
+ # layer norms
364
+ self.use_ada_layer_norm = num_embeds_ada_norm is not None
365
+ if self.use_ada_layer_norm:
366
+ self.norm1 = AdaLayerNorm(dim, num_embeds_ada_norm)
367
+ self.norm2 = AdaLayerNorm(dim, num_embeds_ada_norm)
368
+ else:
369
+ self.norm1 = nn.LayerNorm(dim)
370
+ self.norm2 = nn.LayerNorm(dim)
371
+ self.norm3 = nn.LayerNorm(dim)
372
+
373
+ def _set_attention_slice(self, slice_size):
374
+ self.attn1._slice_size = slice_size
375
+ self.attn2._slice_size = slice_size
376
+
377
+ def _set_use_memory_efficient_attention_xformers(self, use_memory_efficient_attention_xformers: bool):
378
+ if not is_xformers_available():
379
+ print("Here is how to install it")
380
+ raise ModuleNotFoundError(
381
+ "Refer to https://github.com/facebookresearch/xformers for more information on how to install"
382
+ " xformers",
383
+ name="xformers",
384
+ )
385
+ elif not torch.cuda.is_available():
386
+ raise ValueError(
387
+ "torch.cuda.is_available() should be True but is False. xformers' memory efficient attention is only"
388
+ " available for GPU "
389
+ )
390
+ else:
391
+ try:
392
+ # Make sure we can run the memory efficient attention
393
+ _ = xformers.ops.memory_efficient_attention(
394
+ torch.randn((1, 2, 40), device="cuda"),
395
+ torch.randn((1, 2, 40), device="cuda"),
396
+ torch.randn((1, 2, 40), device="cuda"),
397
+ )
398
+ except Exception as e:
399
+ raise e
400
+ self.attn1._use_memory_efficient_attention_xformers = use_memory_efficient_attention_xformers
401
+ self.attn2._use_memory_efficient_attention_xformers = use_memory_efficient_attention_xformers
402
+
403
+ def forward(self, hidden_states, context=None, mask=None, timestep=None):
404
+ # 1. Self-Attention
405
+ norm_hidden_states = (
406
+ self.norm1(hidden_states, timestep) if self.use_ada_layer_norm else self.norm1(hidden_states)
407
+ )
408
+ hidden_states = self.attn1(norm_hidden_states) + hidden_states
409
+
410
+ # 2. Cross-Attention
411
+ norm_hidden_states = (
412
+ self.norm2(hidden_states, timestep) if self.use_ada_layer_norm else self.norm2(hidden_states)
413
+ )
414
+ hidden_states = self.attn2(norm_hidden_states, context=context, mask=mask) + hidden_states
415
+
416
+ # 3. Feed-forward
417
+ hidden_states = self.ff(self.norm3(hidden_states)) + hidden_states
418
+
419
+ return hidden_states
420
+
421
+
422
+ class CrossAttention(nn.Module):
423
+ r"""
424
+ A cross attention layer.
425
+
426
+ Parameters:
427
+ query_dim (`int`): The number of channels in the query.
428
+ cross_attention_dim (`int`, *optional*):
429
+ The number of channels in the context. If not given, defaults to `query_dim`.
430
+ heads (`int`, *optional*, defaults to 8): The number of heads to use for multi-head attention.
431
+ dim_head (`int`, *optional*, defaults to 64): The number of channels in each head.
432
+ dropout (`float`, *optional*, defaults to 0.0): The dropout probability to use.
433
+ bias (`bool`, *optional*, defaults to False):
434
+ Set to `True` for the query, key, and value linear layers to contain a bias parameter.
435
+ """
436
+
437
+ def __init__(
438
+ self,
439
+ query_dim: int,
440
+ cross_attention_dim: Optional[int] = None,
441
+ heads: int = 8,
442
+ dim_head: int = 64,
443
+ dropout: float = 0.0,
444
+ bias=False,
445
+ ):
446
+ super().__init__()
447
+ inner_dim = dim_head * heads
448
+ cross_attention_dim = cross_attention_dim if cross_attention_dim is not None else query_dim
449
+
450
+ self.scale = dim_head ** -0.5
451
+ self.heads = heads
452
+ # for slice_size > 0 the attention score computation
453
+ # is split across the batch axis to save memory
454
+ # You can set slice_size with `set_attention_slice`
455
+ self._slice_size = None
456
+ self._use_memory_efficient_attention_xformers = False
457
+
458
+ self.to_q = nn.Linear(query_dim, inner_dim, bias=bias)
459
+ self.to_k = nn.Linear(cross_attention_dim, inner_dim, bias=bias)
460
+ self.to_v = nn.Linear(cross_attention_dim, inner_dim, bias=bias)
461
+
462
+ self.to_out = nn.ModuleList([])
463
+ self.to_out.append(nn.Linear(inner_dim, query_dim))
464
+ self.to_out.append(nn.Dropout(dropout))
465
+
466
+ def reshape_heads_to_batch_dim(self, tensor):
467
+ batch_size, seq_len, dim = tensor.shape
468
+ head_size = self.heads
469
+ tensor = tensor.reshape(batch_size, seq_len, head_size, dim // head_size)
470
+ tensor = tensor.permute(0, 2, 1, 3).reshape(batch_size * head_size, seq_len, dim // head_size)
471
+ return tensor
472
+
473
+ def reshape_batch_dim_to_heads(self, tensor):
474
+ batch_size, seq_len, dim = tensor.shape
475
+ head_size = self.heads
476
+ tensor = tensor.reshape(batch_size // head_size, head_size, seq_len, dim)
477
+ tensor = tensor.permute(0, 2, 1, 3).reshape(batch_size // head_size, seq_len, dim * head_size)
478
+ return tensor
479
+
480
+ def forward(self, hidden_states, context=None, mask=None):
481
+ batch_size, sequence_length, _ = hidden_states.shape
482
+
483
+ query = self.to_q(hidden_states)
484
+ context = context if context is not None else hidden_states
485
+ key = self.to_k(context)
486
+ value = self.to_v(context)
487
+
488
+ dim = query.shape[-1]
489
+
490
+ query = self.reshape_heads_to_batch_dim(query)
491
+ key = self.reshape_heads_to_batch_dim(key)
492
+ value = self.reshape_heads_to_batch_dim(value)
493
+ mask = mask.repeat_interleave(self.heads, dim=0).unsqueeze(1) if mask is not None else None
494
+
495
+ # attention, what we cannot get enough of
496
+ if self._use_memory_efficient_attention_xformers:
497
+ hidden_states = self._memory_efficient_attention_xformers(query, key, value)
498
+ else:
499
+ if self._slice_size is None or query.shape[0] // self._slice_size == 1:
500
+ hidden_states = self._attention(query, key, value, mask)
501
+ else:
502
+ assert mask is None, "masking is not supported for sliced attention"
503
+ hidden_states = self._sliced_attention(query, key, value, sequence_length, dim)
504
+
505
+ # linear proj
506
+ hidden_states = self.to_out[0](hidden_states)
507
+ # dropout
508
+ hidden_states = self.to_out[1](hidden_states)
509
+ return hidden_states
510
+
511
+ def _attention(self, query, key, value, mask):
512
+ # TODO: use baddbmm for better performance
513
+ if query.device.type == "mps":
514
+ # Better performance on mps (~20-25%)
515
+ attention_scores = torch.einsum("b i d, b j d -> b i j", query, key) * self.scale
516
+ else:
517
+ attention_scores = torch.matmul(query, key.transpose(-1, -2)) * self.scale
518
+ attention_scores = attention_scores.masked_fill(mask.expand(attention_scores.shape), value=float("-inf")) \
519
+ if mask is not None else attention_scores
520
+ attention_probs = attention_scores.softmax(dim=-1)
521
+ # compute attention output
522
+
523
+ if query.device.type == "mps":
524
+ hidden_states = torch.einsum("b i j, b j d -> b i d", attention_probs, value)
525
+ else:
526
+ hidden_states = torch.matmul(attention_probs, value)
527
+
528
+ # reshape hidden_states
529
+ hidden_states = self.reshape_batch_dim_to_heads(hidden_states)
530
+ return hidden_states
531
+
532
+ def _sliced_attention(self, query, key, value, sequence_length, dim):
533
+ batch_size_attention = query.shape[0]
534
+ hidden_states = torch.zeros(
535
+ (batch_size_attention, sequence_length, dim // self.heads), device=query.device, dtype=query.dtype
536
+ )
537
+ slice_size = self._slice_size if self._slice_size is not None else hidden_states.shape[0]
538
+ for i in range(hidden_states.shape[0] // slice_size):
539
+ start_idx = i * slice_size
540
+ end_idx = (i + 1) * slice_size
541
+ if query.device.type == "mps":
542
+ # Better performance on mps (~20-25%)
543
+ attn_slice = (
544
+ torch.einsum("b i d, b j d -> b i j", query[start_idx:end_idx], key[start_idx:end_idx])
545
+ * self.scale
546
+ )
547
+ else:
548
+ attn_slice = (
549
+ torch.matmul(query[start_idx:end_idx], key[start_idx:end_idx].transpose(1, 2)) * self.scale
550
+ ) # TODO: use baddbmm for better performance
551
+ attn_slice = attn_slice.softmax(dim=-1)
552
+ if query.device.type == "mps":
553
+ attn_slice = torch.einsum("b i j, b j d -> b i d", attn_slice, value[start_idx:end_idx])
554
+ else:
555
+ attn_slice = torch.matmul(attn_slice, value[start_idx:end_idx])
556
+
557
+ hidden_states[start_idx:end_idx] = attn_slice
558
+
559
+ # reshape hidden_states
560
+ hidden_states = self.reshape_batch_dim_to_heads(hidden_states)
561
+ return hidden_states
562
+
563
+ def _memory_efficient_attention_xformers(self, query, key, value):
564
+ hidden_states = xformers.ops.memory_efficient_attention(query, key, value, attn_bias=None)
565
+ hidden_states = self.reshape_batch_dim_to_heads(hidden_states)
566
+ return hidden_states
567
+
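For reference, here is a standalone sketch of what `reshape_heads_to_batch_dim`, `_attention`, and `reshape_batch_dim_to_heads` above compute together, with the slicing, masking, and xformers branches stripped out. Function names and tensor sizes are assumptions chosen for the example, not part of this file.

```py
import torch


def split_heads(t: torch.Tensor, heads: int) -> torch.Tensor:
    b, n, d = t.shape
    return t.reshape(b, n, heads, d // heads).permute(0, 2, 1, 3).reshape(b * heads, n, d // heads)


def merge_heads(t: torch.Tensor, heads: int) -> torch.Tensor:
    bh, n, d = t.shape
    return t.reshape(bh // heads, heads, n, d).permute(0, 2, 1, 3).reshape(bh // heads, n, d * heads)


def attention(query, key, value, heads):
    q, k, v = (split_heads(t, heads) for t in (query, key, value))
    scores = torch.matmul(q, k.transpose(-1, -2)) * q.shape[-1] ** -0.5  # scale = dim_head ** -0.5
    return merge_heads(torch.matmul(scores.softmax(dim=-1), v), heads)


q = k = v = torch.randn(2, 16, 64)
print(attention(q, k, v, heads=8).shape)  # torch.Size([2, 16, 64])
```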
568
+
569
+ class FeedForward(nn.Module):
570
+ r"""
571
+ A feed-forward layer.
572
+
573
+ Parameters:
574
+ dim (`int`): The number of channels in the input.
575
+ dim_out (`int`, *optional*): The number of channels in the output. If not given, defaults to `dim`.
576
+ mult (`int`, *optional*, defaults to 4): The multiplier to use for the hidden dimension.
577
+ dropout (`float`, *optional*, defaults to 0.0): The dropout probability to use.
578
+ activation_fn (`str`, *optional*, defaults to `"geglu"`): Activation function to be used in feed-forward.
579
+ """
580
+
581
+ def __init__(
582
+ self,
583
+ dim: int,
584
+ dim_out: Optional[int] = None,
585
+ mult: int = 4,
586
+ dropout: float = 0.0,
587
+ activation_fn: str = "geglu",
588
+ ):
589
+ super().__init__()
590
+ inner_dim = int(dim * mult)
591
+ dim_out = dim_out if dim_out is not None else dim
592
+
593
+ if activation_fn == "geglu":
594
+ geglu = GEGLU(dim, inner_dim)
595
+ elif activation_fn == "geglu-approximate":
596
+ geglu = ApproximateGELU(dim, inner_dim)
597
+
598
+ self.net = nn.ModuleList([])
599
+ # project in
600
+ self.net.append(geglu)
601
+ # project dropout
602
+ self.net.append(nn.Dropout(dropout))
603
+ # project out
604
+ self.net.append(nn.Linear(inner_dim, dim_out))
605
+
606
+ def forward(self, hidden_states):
607
+ for module in self.net:
608
+ hidden_states = module(hidden_states)
609
+ return hidden_states
610
+
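With the default `activation_fn="geglu"`, the `FeedForward` stack above is a gated projection to `mult * dim` hidden channels, dropout, and a linear projection back down. The sketch below reproduces that composition with plain `nn.Linear` layers and an inline gate; `proj_in`, `proj_out`, and `gated` are illustrative stand-ins for the modules defined in this file, and the sizes are arbitrary.

```py
import torch
import torch.nn.functional as F
from torch import nn

dim, mult, dropout = 64, 4, 0.0
inner_dim = dim * mult

proj_in = nn.Linear(dim, inner_dim * 2)   # GEGLU-style projection: value half + gate half
proj_out = nn.Linear(inner_dim, dim)      # project back to the block width


def gated(x):
    value, gate = proj_in(x).chunk(2, dim=-1)
    return value * F.gelu(gate)


x = torch.randn(2, 16, dim)
y = proj_out(F.dropout(gated(x), p=dropout))
print(y.shape)   # torch.Size([2, 16, 64])
```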
611
+
612
+ # feedforward
613
+ class GEGLU(nn.Module):
614
+ r"""
615
+ A variant of the gated linear unit activation function from https://arxiv.org/abs/2002.05202.
616
+
617
+ Parameters:
618
+ dim_in (`int`): The number of channels in the input.
619
+ dim_out (`int`): The number of channels in the output.
620
+ """
621
+
622
+ def __init__(self, dim_in: int, dim_out: int):
623
+ super().__init__()
624
+ self.proj = nn.Linear(dim_in, dim_out * 2)
625
+
626
+ def gelu(self, gate):
627
+ if gate.device.type != "mps":
628
+ return F.gelu(gate)
629
+ # mps: gelu is not implemented for float16
630
+ return F.gelu(gate.to(dtype=torch.float32)).to(dtype=gate.dtype)
631
+
632
+ def forward(self, hidden_states):
633
+ hidden_states, gate = self.proj(hidden_states).chunk(2, dim=-1)
634
+ return hidden_states * self.gelu(gate)
635
+
636
+
637
+ class ApproximateGELU(nn.Module):
638
+ """
639
+ The approximate form of Gaussian Error Linear Unit (GELU)
640
+
641
+ For more details, see section 2: https://arxiv.org/abs/1606.08415
642
+ """
643
+
644
+ def __init__(self, dim_in: int, dim_out: int):
645
+ super().__init__()
646
+ self.proj = nn.Linear(dim_in, dim_out)
647
+
648
+ def forward(self, x):
649
+ x = self.proj(x)
650
+ return x * torch.sigmoid(1.702 * x)
651
+
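The sigmoid form used above, `x * sigmoid(1.702 * x)`, is a cheap approximation of the exact GELU. A quick numeric check (illustrative only) shows the two stay within a few hundredths of each other on small inputs:

```py
import torch
import torch.nn.functional as F

x = torch.linspace(-3, 3, 7)
approx = x * torch.sigmoid(1.702 * x)
print(torch.max(torch.abs(approx - F.gelu(x))).item())   # small, on the order of 1e-2
```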
652
+
653
+ class AdaLayerNorm(nn.Module):
654
+ """
655
+ Norm layer modified to incorporate timestep embeddings.
656
+ """
657
+
658
+ def __init__(self, embedding_dim, num_embeddings):
659
+ super().__init__()
660
+ self.emb = nn.Embedding(num_embeddings, embedding_dim)
661
+ self.silu = nn.SiLU()
662
+ self.linear = nn.Linear(embedding_dim, embedding_dim * 2)
663
+ self.norm = nn.LayerNorm(embedding_dim, elementwise_affine=False)
664
+
665
+ def forward(self, x, timestep):
666
+ emb = self.linear(self.silu(self.emb(timestep)))
667
+ scale, shift = torch.chunk(emb, 2)
668
+ x = self.norm(x) * (1 + scale) + shift
669
+ return x
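As a rough illustration of the adaptive layer norm above: a timestep id is embedded, mapped through SiLU and a linear layer to a scale and a shift, and applied on top of a `LayerNorm` with no learnable affine of its own. The sketch below uses made-up sizes and broadcasts the conditioning per batch item; it is a simplified stand-in, not the exact behavior of the class.

```py
import torch
import torch.nn.functional as F
from torch import nn

embedding_dim, num_embeddings = 16, 1000
emb = nn.Embedding(num_embeddings, embedding_dim)
to_scale_shift = nn.Linear(embedding_dim, embedding_dim * 2)
norm = nn.LayerNorm(embedding_dim, elementwise_affine=False)

x = torch.randn(2, 8, embedding_dim)          # (batch, tokens, channels)
timestep = torch.tensor([10, 500])            # one diffusion timestep per batch item
scale, shift = to_scale_shift(F.silu(emb(timestep))).chunk(2, dim=-1)
out = norm(x) * (1 + scale[:, None]) + shift[:, None]   # broadcast over tokens
print(out.shape)                               # torch.Size([2, 8, 16])
```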
models/diffusers_override/unet_2d_blocks.py ADDED
@@ -0,0 +1,1602 @@
1
+ # Copyright 2022 The HuggingFace Team. All rights reserved.
2
+ #
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+ #
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+ #
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
+ import numpy as np
15
+ import torch
16
+ from torch import nn
17
+
18
+ from .attention import AttentionBlock, Transformer2DModel
19
+ from diffusers.models.resnet import Downsample2D, FirDownsample2D, FirUpsample2D, ResnetBlock2D, Upsample2D
20
+
21
+
22
+ def get_down_block(
23
+ down_block_type,
24
+ num_layers,
25
+ in_channels,
26
+ out_channels,
27
+ temb_channels,
28
+ add_downsample,
29
+ resnet_eps,
30
+ resnet_act_fn,
31
+ attn_num_head_channels,
32
+ resnet_groups=None,
33
+ cross_attention_dim=None,
34
+ downsample_padding=None,
35
+ ):
36
+ down_block_type = down_block_type[7:] if down_block_type.startswith("UNetRes") else down_block_type
37
+ if down_block_type == "DownBlock2D":
38
+ return DownBlock2D(
39
+ num_layers=num_layers,
40
+ in_channels=in_channels,
41
+ out_channels=out_channels,
42
+ temb_channels=temb_channels,
43
+ add_downsample=add_downsample,
44
+ resnet_eps=resnet_eps,
45
+ resnet_act_fn=resnet_act_fn,
46
+ resnet_groups=resnet_groups,
47
+ downsample_padding=downsample_padding,
48
+ )
49
+ elif down_block_type == "AttnDownBlock2D":
50
+ return AttnDownBlock2D(
51
+ num_layers=num_layers,
52
+ in_channels=in_channels,
53
+ out_channels=out_channels,
54
+ temb_channels=temb_channels,
55
+ add_downsample=add_downsample,
56
+ resnet_eps=resnet_eps,
57
+ resnet_act_fn=resnet_act_fn,
58
+ resnet_groups=resnet_groups,
59
+ downsample_padding=downsample_padding,
60
+ attn_num_head_channels=attn_num_head_channels,
61
+ )
62
+ elif down_block_type == "CrossAttnDownBlock2D":
63
+ if cross_attention_dim is None:
64
+ raise ValueError("cross_attention_dim must be specified for CrossAttnDownBlock2D")
65
+ return CrossAttnDownBlock2D(
66
+ num_layers=num_layers,
67
+ in_channels=in_channels,
68
+ out_channels=out_channels,
69
+ temb_channels=temb_channels,
70
+ add_downsample=add_downsample,
71
+ resnet_eps=resnet_eps,
72
+ resnet_act_fn=resnet_act_fn,
73
+ resnet_groups=resnet_groups,
74
+ downsample_padding=downsample_padding,
75
+ cross_attention_dim=cross_attention_dim,
76
+ attn_num_head_channels=attn_num_head_channels,
77
+ )
78
+ elif down_block_type == "SkipDownBlock2D":
79
+ return SkipDownBlock2D(
80
+ num_layers=num_layers,
81
+ in_channels=in_channels,
82
+ out_channels=out_channels,
83
+ temb_channels=temb_channels,
84
+ add_downsample=add_downsample,
85
+ resnet_eps=resnet_eps,
86
+ resnet_act_fn=resnet_act_fn,
87
+ downsample_padding=downsample_padding,
88
+ )
89
+ elif down_block_type == "AttnSkipDownBlock2D":
90
+ return AttnSkipDownBlock2D(
91
+ num_layers=num_layers,
92
+ in_channels=in_channels,
93
+ out_channels=out_channels,
94
+ temb_channels=temb_channels,
95
+ add_downsample=add_downsample,
96
+ resnet_eps=resnet_eps,
97
+ resnet_act_fn=resnet_act_fn,
98
+ downsample_padding=downsample_padding,
99
+ attn_num_head_channels=attn_num_head_channels,
100
+ )
101
+ elif down_block_type == "DownEncoderBlock2D":
102
+ return DownEncoderBlock2D(
103
+ num_layers=num_layers,
104
+ in_channels=in_channels,
105
+ out_channels=out_channels,
106
+ add_downsample=add_downsample,
107
+ resnet_eps=resnet_eps,
108
+ resnet_act_fn=resnet_act_fn,
109
+ resnet_groups=resnet_groups,
110
+ downsample_padding=downsample_padding,
111
+ )
112
+ elif down_block_type == "AttnDownEncoderBlock2D":
113
+ return AttnDownEncoderBlock2D(
114
+ num_layers=num_layers,
115
+ in_channels=in_channels,
116
+ out_channels=out_channels,
117
+ add_downsample=add_downsample,
118
+ resnet_eps=resnet_eps,
119
+ resnet_act_fn=resnet_act_fn,
120
+ resnet_groups=resnet_groups,
121
+ downsample_padding=downsample_padding,
122
+ attn_num_head_channels=attn_num_head_channels,
123
+ )
124
+ raise ValueError(f"{down_block_type} does not exist.")
125
+
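A small sketch of the dispatch pattern used by `get_down_block` (and `get_up_block` below): each entry of the UNet config's block-type tuple is passed to the factory, which strips an optional "UNetRes" prefix and matches on the remaining class name. The config values here are hypothetical examples, not taken from this repository.

```py
def resolve_block_type(block_type: str) -> str:
    # mirrors the prefix-stripping done at the top of get_down_block / get_up_block
    return block_type[7:] if block_type.startswith("UNetRes") else block_type


down_block_types = ("CrossAttnDownBlock2D", "CrossAttnDownBlock2D", "UNetResDownBlock2D")
print([resolve_block_type(t) for t in down_block_types])
# ['CrossAttnDownBlock2D', 'CrossAttnDownBlock2D', 'DownBlock2D']
```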
126
+
127
+ def get_up_block(
128
+ up_block_type,
129
+ num_layers,
130
+ in_channels,
131
+ out_channels,
132
+ prev_output_channel,
133
+ temb_channels,
134
+ add_upsample,
135
+ resnet_eps,
136
+ resnet_act_fn,
137
+ attn_num_head_channels,
138
+ resnet_groups=None,
139
+ cross_attention_dim=None,
140
+ ):
141
+ up_block_type = up_block_type[7:] if up_block_type.startswith("UNetRes") else up_block_type
142
+ if up_block_type == "UpBlock2D":
143
+ return UpBlock2D(
144
+ num_layers=num_layers,
145
+ in_channels=in_channels,
146
+ out_channels=out_channels,
147
+ prev_output_channel=prev_output_channel,
148
+ temb_channels=temb_channels,
149
+ add_upsample=add_upsample,
150
+ resnet_eps=resnet_eps,
151
+ resnet_act_fn=resnet_act_fn,
152
+ resnet_groups=resnet_groups,
153
+ )
154
+ elif up_block_type == "CrossAttnUpBlock2D":
155
+ if cross_attention_dim is None:
156
+ raise ValueError("cross_attention_dim must be specified for CrossAttnUpBlock2D")
157
+ return CrossAttnUpBlock2D(
158
+ num_layers=num_layers,
159
+ in_channels=in_channels,
160
+ out_channels=out_channels,
161
+ prev_output_channel=prev_output_channel,
162
+ temb_channels=temb_channels,
163
+ add_upsample=add_upsample,
164
+ resnet_eps=resnet_eps,
165
+ resnet_act_fn=resnet_act_fn,
166
+ resnet_groups=resnet_groups,
167
+ cross_attention_dim=cross_attention_dim,
168
+ attn_num_head_channels=attn_num_head_channels,
169
+ )
170
+ elif up_block_type == "AttnUpBlock2D":
171
+ return AttnUpBlock2D(
172
+ num_layers=num_layers,
173
+ in_channels=in_channels,
174
+ out_channels=out_channels,
175
+ prev_output_channel=prev_output_channel,
176
+ temb_channels=temb_channels,
177
+ add_upsample=add_upsample,
178
+ resnet_eps=resnet_eps,
179
+ resnet_act_fn=resnet_act_fn,
180
+ resnet_groups=resnet_groups,
181
+ attn_num_head_channels=attn_num_head_channels,
182
+ )
183
+ elif up_block_type == "SkipUpBlock2D":
184
+ return SkipUpBlock2D(
185
+ num_layers=num_layers,
186
+ in_channels=in_channels,
187
+ out_channels=out_channels,
188
+ prev_output_channel=prev_output_channel,
189
+ temb_channels=temb_channels,
190
+ add_upsample=add_upsample,
191
+ resnet_eps=resnet_eps,
192
+ resnet_act_fn=resnet_act_fn,
193
+ )
194
+ elif up_block_type == "AttnSkipUpBlock2D":
195
+ return AttnSkipUpBlock2D(
196
+ num_layers=num_layers,
197
+ in_channels=in_channels,
198
+ out_channels=out_channels,
199
+ prev_output_channel=prev_output_channel,
200
+ temb_channels=temb_channels,
201
+ add_upsample=add_upsample,
202
+ resnet_eps=resnet_eps,
203
+ resnet_act_fn=resnet_act_fn,
204
+ attn_num_head_channels=attn_num_head_channels,
205
+ )
206
+ elif up_block_type == "UpDecoderBlock2D":
207
+ return UpDecoderBlock2D(
208
+ num_layers=num_layers,
209
+ in_channels=in_channels,
210
+ out_channels=out_channels,
211
+ add_upsample=add_upsample,
212
+ resnet_eps=resnet_eps,
213
+ resnet_act_fn=resnet_act_fn,
214
+ resnet_groups=resnet_groups,
215
+ )
216
+ elif up_block_type == "AttnUpDecoderBlock2D":
217
+ return AttnUpDecoderBlock2D(
218
+ num_layers=num_layers,
219
+ in_channels=in_channels,
220
+ out_channels=out_channels,
221
+ add_upsample=add_upsample,
222
+ resnet_eps=resnet_eps,
223
+ resnet_act_fn=resnet_act_fn,
224
+ resnet_groups=resnet_groups,
225
+ attn_num_head_channels=attn_num_head_channels,
226
+ )
227
+ raise ValueError(f"{up_block_type} does not exist.")
228
+
229
+
230
+ class UNetMidBlock2D(nn.Module):
231
+ def __init__(
232
+ self,
233
+ in_channels: int,
234
+ temb_channels: int,
235
+ dropout: float = 0.0,
236
+ num_layers: int = 1,
237
+ resnet_eps: float = 1e-6,
238
+ resnet_time_scale_shift: str = "default",
239
+ resnet_act_fn: str = "swish",
240
+ resnet_groups: int = 32,
241
+ resnet_pre_norm: bool = True,
242
+ attn_num_head_channels=1,
243
+ attention_type="default",
244
+ output_scale_factor=1.0,
245
+ **kwargs,
246
+ ):
247
+ super().__init__()
248
+
249
+ self.attention_type = attention_type
250
+ resnet_groups = resnet_groups if resnet_groups is not None else min(in_channels // 4, 32)
251
+
252
+ # there is always at least one resnet
253
+ resnets = [
254
+ ResnetBlock2D(
255
+ in_channels=in_channels,
256
+ out_channels=in_channels,
257
+ temb_channels=temb_channels,
258
+ eps=resnet_eps,
259
+ groups=resnet_groups,
260
+ dropout=dropout,
261
+ time_embedding_norm=resnet_time_scale_shift,
262
+ non_linearity=resnet_act_fn,
263
+ output_scale_factor=output_scale_factor,
264
+ pre_norm=resnet_pre_norm,
265
+ )
266
+ ]
267
+ attentions = []
268
+
269
+ for _ in range(num_layers):
270
+ attentions.append(
271
+ AttentionBlock(
272
+ in_channels,
273
+ num_head_channels=attn_num_head_channels,
274
+ rescale_output_factor=output_scale_factor,
275
+ eps=resnet_eps,
276
+ norm_num_groups=resnet_groups,
277
+ )
278
+ )
279
+ resnets.append(
280
+ ResnetBlock2D(
281
+ in_channels=in_channels,
282
+ out_channels=in_channels,
283
+ temb_channels=temb_channels,
284
+ eps=resnet_eps,
285
+ groups=resnet_groups,
286
+ dropout=dropout,
287
+ time_embedding_norm=resnet_time_scale_shift,
288
+ non_linearity=resnet_act_fn,
289
+ output_scale_factor=output_scale_factor,
290
+ pre_norm=resnet_pre_norm,
291
+ )
292
+ )
293
+
294
+ self.attentions = nn.ModuleList(attentions)
295
+ self.resnets = nn.ModuleList(resnets)
296
+
297
+ def forward(self, hidden_states, temb=None, encoder_states=None):
298
+ hidden_states = self.resnets[0](hidden_states, temb)
299
+ for attn, resnet in zip(self.attentions, self.resnets[1:]):
300
+ if self.attention_type == "default":
301
+ hidden_states = attn(hidden_states)
302
+ else:
303
+ hidden_states = attn(hidden_states, encoder_states)
304
+ hidden_states = resnet(hidden_states, temb)
305
+
306
+ return hidden_states
307
+
308
+
309
+ class UNetMidBlock2DCrossAttn(nn.Module):
310
+ def __init__(
311
+ self,
312
+ in_channels: int,
313
+ temb_channels: int,
314
+ dropout: float = 0.0,
315
+ num_layers: int = 1,
316
+ resnet_eps: float = 1e-6,
317
+ resnet_time_scale_shift: str = "default",
318
+ resnet_act_fn: str = "swish",
319
+ resnet_groups: int = 32,
320
+ resnet_pre_norm: bool = True,
321
+ attn_num_head_channels=1,
322
+ attention_type="default",
323
+ output_scale_factor=1.0,
324
+ cross_attention_dim=1280,
325
+ **kwargs,
326
+ ):
327
+ super().__init__()
328
+
329
+ self.attention_type = attention_type
330
+ self.attn_num_head_channels = attn_num_head_channels
331
+ resnet_groups = resnet_groups if resnet_groups is not None else min(in_channels // 4, 32)
332
+
333
+ # there is always at least one resnet
334
+ resnets = [
335
+ ResnetBlock2D(
336
+ in_channels=in_channels,
337
+ out_channels=in_channels,
338
+ temb_channels=temb_channels,
339
+ eps=resnet_eps,
340
+ groups=resnet_groups,
341
+ dropout=dropout,
342
+ time_embedding_norm=resnet_time_scale_shift,
343
+ non_linearity=resnet_act_fn,
344
+ output_scale_factor=output_scale_factor,
345
+ pre_norm=resnet_pre_norm,
346
+ )
347
+ ]
348
+ attentions = []
349
+
350
+ for _ in range(num_layers):
351
+ attentions.append(
352
+ Transformer2DModel(
353
+ attn_num_head_channels,
354
+ in_channels // attn_num_head_channels,
355
+ in_channels=in_channels,
356
+ num_layers=1,
357
+ cross_attention_dim=cross_attention_dim,
358
+ norm_num_groups=resnet_groups,
359
+ )
360
+ )
361
+ resnets.append(
362
+ ResnetBlock2D(
363
+ in_channels=in_channels,
364
+ out_channels=in_channels,
365
+ temb_channels=temb_channels,
366
+ eps=resnet_eps,
367
+ groups=resnet_groups,
368
+ dropout=dropout,
369
+ time_embedding_norm=resnet_time_scale_shift,
370
+ non_linearity=resnet_act_fn,
371
+ output_scale_factor=output_scale_factor,
372
+ pre_norm=resnet_pre_norm,
373
+ )
374
+ )
375
+
376
+ self.attentions = nn.ModuleList(attentions)
377
+ self.resnets = nn.ModuleList(resnets)
378
+
379
+ def set_attention_slice(self, slice_size):
380
+ if slice_size is not None and self.attn_num_head_channels % slice_size != 0:
381
+ raise ValueError(
382
+ f"Make sure slice_size {slice_size} is a divisor of "
383
+ f"the number of heads used in cross_attention {self.attn_num_head_channels}"
384
+ )
385
+ if slice_size is not None and slice_size > self.attn_num_head_channels:
386
+ raise ValueError(
387
+ f"Chunk_size {slice_size} has to be smaller or equal to "
388
+ f"the number of heads used in cross_attention {self.attn_num_head_channels}"
389
+ )
390
+
391
+ for attn in self.attentions:
392
+ attn._set_attention_slice(slice_size)
393
+
394
+ def set_use_memory_efficient_attention_xformers(self, use_memory_efficient_attention_xformers: bool):
395
+ for attn in self.attentions:
396
+ attn._set_use_memory_efficient_attention_xformers(use_memory_efficient_attention_xformers)
397
+
398
+ def forward(self, hidden_states, temb=None, encoder_hidden_states=None, encoder_attention_mask=None):
399
+ hidden_states = self.resnets[0](hidden_states, temb)
400
+ for attn, resnet in zip(self.attentions, self.resnets[1:]):
401
+ hidden_states = attn(hidden_states, encoder_hidden_states, encoder_attention_mask).sample
402
+ hidden_states = resnet(hidden_states, temb)
403
+
404
+ return hidden_states
405
+
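The mid-block forward above always runs one resnet first, then alternates attention and resnet for `num_layers` pairs, all conditioned on the time embedding. The toy module below only reproduces that control flow; the convolution and identity layers are stand-ins for `ResnetBlock2D` and `Transformer2DModel`, and the shapes are arbitrary.

```py
import torch
from torch import nn


class ToyMidBlock(nn.Module):
    def __init__(self, channels: int, num_layers: int = 1):
        super().__init__()
        # num_layers attention/resnet pairs, plus the leading resnet
        self.resnets = nn.ModuleList([nn.Conv2d(channels, channels, 3, padding=1) for _ in range(num_layers + 1)])
        self.attentions = nn.ModuleList([nn.Identity() for _ in range(num_layers)])

    def forward(self, hidden_states, temb=None):
        hidden_states = self.resnets[0](hidden_states)
        for attn, resnet in zip(self.attentions, self.resnets[1:]):
            hidden_states = resnet(attn(hidden_states))
        return hidden_states


print(ToyMidBlock(8)(torch.randn(1, 8, 16, 16)).shape)  # torch.Size([1, 8, 16, 16])
```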
406
+
407
+ class AttnDownBlock2D(nn.Module):
408
+ def __init__(
409
+ self,
410
+ in_channels: int,
411
+ out_channels: int,
412
+ temb_channels: int,
413
+ dropout: float = 0.0,
414
+ num_layers: int = 1,
415
+ resnet_eps: float = 1e-6,
416
+ resnet_time_scale_shift: str = "default",
417
+ resnet_act_fn: str = "swish",
418
+ resnet_groups: int = 32,
419
+ resnet_pre_norm: bool = True,
420
+ attn_num_head_channels=1,
421
+ attention_type="default",
422
+ output_scale_factor=1.0,
423
+ downsample_padding=1,
424
+ add_downsample=True,
425
+ ):
426
+ super().__init__()
427
+ resnets = []
428
+ attentions = []
429
+
430
+ self.attention_type = attention_type
431
+
432
+ for i in range(num_layers):
433
+ in_channels = in_channels if i == 0 else out_channels
434
+ resnets.append(
435
+ ResnetBlock2D(
436
+ in_channels=in_channels,
437
+ out_channels=out_channels,
438
+ temb_channels=temb_channels,
439
+ eps=resnet_eps,
440
+ groups=resnet_groups,
441
+ dropout=dropout,
442
+ time_embedding_norm=resnet_time_scale_shift,
443
+ non_linearity=resnet_act_fn,
444
+ output_scale_factor=output_scale_factor,
445
+ pre_norm=resnet_pre_norm,
446
+ )
447
+ )
448
+ attentions.append(
449
+ AttentionBlock(
450
+ out_channels,
451
+ num_head_channels=attn_num_head_channels,
452
+ rescale_output_factor=output_scale_factor,
453
+ eps=resnet_eps,
454
+ norm_num_groups=resnet_groups,
455
+ )
456
+ )
457
+
458
+ self.attentions = nn.ModuleList(attentions)
459
+ self.resnets = nn.ModuleList(resnets)
460
+
461
+ if add_downsample:
462
+ self.downsamplers = nn.ModuleList(
463
+ [
464
+ Downsample2D(
465
+ in_channels, use_conv=True, out_channels=out_channels, padding=downsample_padding, name="op"
466
+ )
467
+ ]
468
+ )
469
+ else:
470
+ self.downsamplers = None
471
+
472
+ def forward(self, hidden_states, temb=None):
473
+ output_states = ()
474
+
475
+ for resnet, attn in zip(self.resnets, self.attentions):
476
+ hidden_states = resnet(hidden_states, temb)
477
+ hidden_states = attn(hidden_states)
478
+ output_states += (hidden_states,)
479
+
480
+ if self.downsamplers is not None:
481
+ for downsampler in self.downsamplers:
482
+ hidden_states = downsampler(hidden_states)
483
+
484
+ output_states += (hidden_states,)
485
+
486
+ return hidden_states, output_states
487
+
488
+
489
+ class CrossAttnDownBlock2D(nn.Module):
490
+ def __init__(
491
+ self,
492
+ in_channels: int,
493
+ out_channels: int,
494
+ temb_channels: int,
495
+ dropout: float = 0.0,
496
+ num_layers: int = 1,
497
+ resnet_eps: float = 1e-6,
498
+ resnet_time_scale_shift: str = "default",
499
+ resnet_act_fn: str = "swish",
500
+ resnet_groups: int = 32,
501
+ resnet_pre_norm: bool = True,
502
+ attn_num_head_channels=1,
503
+ cross_attention_dim=1280,
504
+ attention_type="default",
505
+ output_scale_factor=1.0,
506
+ downsample_padding=1,
507
+ add_downsample=True,
508
+ ):
509
+ super().__init__()
510
+ resnets = []
511
+ attentions = []
512
+
513
+ self.attention_type = attention_type
514
+ self.attn_num_head_channels = attn_num_head_channels
515
+
516
+ for i in range(num_layers):
517
+ in_channels = in_channels if i == 0 else out_channels
518
+ resnets.append(
519
+ ResnetBlock2D(
520
+ in_channels=in_channels,
521
+ out_channels=out_channels,
522
+ temb_channels=temb_channels,
523
+ eps=resnet_eps,
524
+ groups=resnet_groups,
525
+ dropout=dropout,
526
+ time_embedding_norm=resnet_time_scale_shift,
527
+ non_linearity=resnet_act_fn,
528
+ output_scale_factor=output_scale_factor,
529
+ pre_norm=resnet_pre_norm,
530
+ )
531
+ )
532
+ attentions.append(
533
+ Transformer2DModel(
534
+ attn_num_head_channels,
535
+ out_channels // attn_num_head_channels,
536
+ in_channels=out_channels,
537
+ num_layers=1,
538
+ cross_attention_dim=cross_attention_dim,
539
+ norm_num_groups=resnet_groups,
540
+ )
541
+ )
542
+ self.attentions = nn.ModuleList(attentions)
543
+ self.resnets = nn.ModuleList(resnets)
544
+
545
+ if add_downsample:
546
+ self.downsamplers = nn.ModuleList(
547
+ [
548
+ Downsample2D(
549
+ in_channels, use_conv=True, out_channels=out_channels, padding=downsample_padding, name="op"
550
+ )
551
+ ]
552
+ )
553
+ else:
554
+ self.downsamplers = None
555
+
556
+ self.gradient_checkpointing = False
557
+
558
+ def set_attention_slice(self, slice_size):
559
+ if slice_size is not None and self.attn_num_head_channels % slice_size != 0:
560
+ raise ValueError(
561
+ f"Make sure slice_size {slice_size} is a divisor of "
562
+ f"the number of heads used in cross_attention {self.attn_num_head_channels}"
563
+ )
564
+ if slice_size is not None and slice_size > self.attn_num_head_channels:
565
+ raise ValueError(
566
+ f"Chunk_size {slice_size} has to be smaller or equal to "
567
+ f"the number of heads used in cross_attention {self.attn_num_head_channels}"
568
+ )
569
+
570
+ for attn in self.attentions:
571
+ attn._set_attention_slice(slice_size)
572
+
573
+ def set_use_memory_efficient_attention_xformers(self, use_memory_efficient_attention_xformers: bool):
574
+ for attn in self.attentions:
575
+ attn._set_use_memory_efficient_attention_xformers(use_memory_efficient_attention_xformers)
576
+
577
+ def forward(self, hidden_states, temb=None, encoder_hidden_states=None, encoder_attention_mask=None):
578
+ output_states = ()
579
+
580
+ for resnet, attn in zip(self.resnets, self.attentions):
581
+ if self.training and self.gradient_checkpointing:
582
+
583
+ def create_custom_forward(module, return_dict=None):
584
+ def custom_forward(*inputs):
585
+ if return_dict is not None:
586
+ return module(*inputs, return_dict=return_dict)
587
+ else:
588
+ return module(*inputs)
589
+
590
+ return custom_forward
591
+
592
+ hidden_states = torch.utils.checkpoint.checkpoint(create_custom_forward(resnet), hidden_states, temb)
593
+ hidden_states = torch.utils.checkpoint.checkpoint(
594
+ create_custom_forward(attn, return_dict=False), hidden_states, encoder_hidden_states,
595
+ encoder_attention_mask
596
+ )[0]
597
+ else:
598
+ hidden_states = resnet(hidden_states, temb)
599
+ hidden_states = attn(hidden_states, encoder_hidden_states=encoder_hidden_states,
600
+ encoder_attention_mask=encoder_attention_mask).sample
601
+
602
+ output_states += (hidden_states,)
603
+
604
+ if self.downsamplers is not None:
605
+ for downsampler in self.downsamplers:
606
+ hidden_states = downsampler(hidden_states)
607
+
608
+ output_states += (hidden_states,)
609
+
610
+ return hidden_states, output_states
611
+
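When `gradient_checkpointing` is enabled, the forward above wraps each submodule in a plain function so `torch.utils.checkpoint` can recompute its activations during the backward pass instead of storing them. A minimal, self-contained sketch of that pattern with a toy layer and assumed sizes:

```py
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

layer = nn.Sequential(nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 32))


def create_custom_forward(module):
    def custom_forward(*inputs):
        return module(*inputs)
    return custom_forward


x = torch.randn(4, 32, requires_grad=True)
out = checkpoint(create_custom_forward(layer), x)   # activations recomputed on backward
out.sum().backward()
print(x.grad.shape)   # torch.Size([4, 32])
```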
612
+
613
+ class DownBlock2D(nn.Module):
614
+ def __init__(
615
+ self,
616
+ in_channels: int,
617
+ out_channels: int,
618
+ temb_channels: int,
619
+ dropout: float = 0.0,
620
+ num_layers: int = 1,
621
+ resnet_eps: float = 1e-6,
622
+ resnet_time_scale_shift: str = "default",
623
+ resnet_act_fn: str = "swish",
624
+ resnet_groups: int = 32,
625
+ resnet_pre_norm: bool = True,
626
+ output_scale_factor=1.0,
627
+ add_downsample=True,
628
+ downsample_padding=1,
629
+ ):
630
+ super().__init__()
631
+ resnets = []
632
+
633
+ for i in range(num_layers):
634
+ in_channels = in_channels if i == 0 else out_channels
635
+ resnets.append(
636
+ ResnetBlock2D(
637
+ in_channels=in_channels,
638
+ out_channels=out_channels,
639
+ temb_channels=temb_channels,
640
+ eps=resnet_eps,
641
+ groups=resnet_groups,
642
+ dropout=dropout,
643
+ time_embedding_norm=resnet_time_scale_shift,
644
+ non_linearity=resnet_act_fn,
645
+ output_scale_factor=output_scale_factor,
646
+ pre_norm=resnet_pre_norm,
647
+ )
648
+ )
649
+
650
+ self.resnets = nn.ModuleList(resnets)
651
+
652
+ if add_downsample:
653
+ self.downsamplers = nn.ModuleList(
654
+ [
655
+ Downsample2D(
656
+ in_channels, use_conv=True, out_channels=out_channels, padding=downsample_padding, name="op"
657
+ )
658
+ ]
659
+ )
660
+ else:
661
+ self.downsamplers = None
662
+
663
+ self.gradient_checkpointing = False
664
+
665
+ def forward(self, hidden_states, temb=None):
666
+ output_states = ()
667
+
668
+ for resnet in self.resnets:
669
+ if self.training and self.gradient_checkpointing:
670
+
671
+ def create_custom_forward(module):
672
+ def custom_forward(*inputs):
673
+ return module(*inputs)
674
+
675
+ return custom_forward
676
+
677
+ hidden_states = torch.utils.checkpoint.checkpoint(create_custom_forward(resnet), hidden_states, temb)
678
+ else:
679
+ hidden_states = resnet(hidden_states, temb)
680
+
681
+ output_states += (hidden_states,)
682
+
683
+ if self.downsamplers is not None:
684
+ for downsampler in self.downsamplers:
685
+ hidden_states = downsampler(hidden_states)
686
+
687
+ output_states += (hidden_states,)
688
+
689
+ return hidden_states, output_states
690
+
691
+
692
+ class DownEncoderBlock2D(nn.Module):
693
+ def __init__(
694
+ self,
695
+ in_channels: int,
696
+ out_channels: int,
697
+ dropout: float = 0.0,
698
+ num_layers: int = 1,
699
+ resnet_eps: float = 1e-6,
700
+ resnet_time_scale_shift: str = "default",
701
+ resnet_act_fn: str = "swish",
702
+ resnet_groups: int = 32,
703
+ resnet_pre_norm: bool = True,
704
+ output_scale_factor=1.0,
705
+ add_downsample=True,
706
+ downsample_padding=1,
707
+ ):
708
+ super().__init__()
709
+ resnets = []
710
+
711
+ for i in range(num_layers):
712
+ in_channels = in_channels if i == 0 else out_channels
713
+ resnets.append(
714
+ ResnetBlock2D(
715
+ in_channels=in_channels,
716
+ out_channels=out_channels,
717
+ temb_channels=None,
718
+ eps=resnet_eps,
719
+ groups=resnet_groups,
720
+ dropout=dropout,
721
+ time_embedding_norm=resnet_time_scale_shift,
722
+ non_linearity=resnet_act_fn,
723
+ output_scale_factor=output_scale_factor,
724
+ pre_norm=resnet_pre_norm,
725
+ )
726
+ )
727
+
728
+ self.resnets = nn.ModuleList(resnets)
729
+
730
+ if add_downsample:
731
+ self.downsamplers = nn.ModuleList(
732
+ [
733
+ Downsample2D(
734
+ in_channels, use_conv=True, out_channels=out_channels, padding=downsample_padding, name="op"
735
+ )
736
+ ]
737
+ )
738
+ else:
739
+ self.downsamplers = None
740
+
741
+ def forward(self, hidden_states):
742
+ for resnet in self.resnets:
743
+ hidden_states = resnet(hidden_states, temb=None)
744
+
745
+ if self.downsamplers is not None:
746
+ for downsampler in self.downsamplers:
747
+ hidden_states = downsampler(hidden_states)
748
+
749
+ return hidden_states
750
+
751
+
752
+ class AttnDownEncoderBlock2D(nn.Module):
753
+ def __init__(
754
+ self,
755
+ in_channels: int,
756
+ out_channels: int,
757
+ dropout: float = 0.0,
758
+ num_layers: int = 1,
759
+ resnet_eps: float = 1e-6,
760
+ resnet_time_scale_shift: str = "default",
761
+ resnet_act_fn: str = "swish",
762
+ resnet_groups: int = 32,
763
+ resnet_pre_norm: bool = True,
764
+ attn_num_head_channels=1,
765
+ output_scale_factor=1.0,
766
+ add_downsample=True,
767
+ downsample_padding=1,
768
+ ):
769
+ super().__init__()
770
+ resnets = []
771
+ attentions = []
772
+
773
+ for i in range(num_layers):
774
+ in_channels = in_channels if i == 0 else out_channels
775
+ resnets.append(
776
+ ResnetBlock2D(
777
+ in_channels=in_channels,
778
+ out_channels=out_channels,
779
+ temb_channels=None,
780
+ eps=resnet_eps,
781
+ groups=resnet_groups,
782
+ dropout=dropout,
783
+ time_embedding_norm=resnet_time_scale_shift,
784
+ non_linearity=resnet_act_fn,
785
+ output_scale_factor=output_scale_factor,
786
+ pre_norm=resnet_pre_norm,
787
+ )
788
+ )
789
+ attentions.append(
790
+ AttentionBlock(
791
+ out_channels,
792
+ num_head_channels=attn_num_head_channels,
793
+ rescale_output_factor=output_scale_factor,
794
+ eps=resnet_eps,
795
+ norm_num_groups=resnet_groups,
796
+ )
797
+ )
798
+
799
+ self.attentions = nn.ModuleList(attentions)
800
+ self.resnets = nn.ModuleList(resnets)
801
+
802
+ if add_downsample:
803
+ self.downsamplers = nn.ModuleList(
804
+ [
805
+ Downsample2D(
806
+ in_channels, use_conv=True, out_channels=out_channels, padding=downsample_padding, name="op"
807
+ )
808
+ ]
809
+ )
810
+ else:
811
+ self.downsamplers = None
812
+
813
+ def forward(self, hidden_states):
814
+ for resnet, attn in zip(self.resnets, self.attentions):
815
+ hidden_states = resnet(hidden_states, temb=None)
816
+ hidden_states = attn(hidden_states)
817
+
818
+ if self.downsamplers is not None:
819
+ for downsampler in self.downsamplers:
820
+ hidden_states = downsampler(hidden_states)
821
+
822
+ return hidden_states
823
+
824
+
825
+ class AttnSkipDownBlock2D(nn.Module):
826
+ def __init__(
827
+ self,
828
+ in_channels: int,
829
+ out_channels: int,
830
+ temb_channels: int,
831
+ dropout: float = 0.0,
832
+ num_layers: int = 1,
833
+ resnet_eps: float = 1e-6,
834
+ resnet_time_scale_shift: str = "default",
835
+ resnet_act_fn: str = "swish",
836
+ resnet_pre_norm: bool = True,
837
+ attn_num_head_channels=1,
838
+ attention_type="default",
839
+ output_scale_factor=np.sqrt(2.0),
840
+ downsample_padding=1,
841
+ add_downsample=True,
842
+ ):
843
+ super().__init__()
844
+ self.attentions = nn.ModuleList([])
845
+ self.resnets = nn.ModuleList([])
846
+
847
+ self.attention_type = attention_type
848
+
849
+ for i in range(num_layers):
850
+ in_channels = in_channels if i == 0 else out_channels
851
+ self.resnets.append(
852
+ ResnetBlock2D(
853
+ in_channels=in_channels,
854
+ out_channels=out_channels,
855
+ temb_channels=temb_channels,
856
+ eps=resnet_eps,
857
+ groups=min(in_channels // 4, 32),
858
+ groups_out=min(out_channels // 4, 32),
859
+ dropout=dropout,
860
+ time_embedding_norm=resnet_time_scale_shift,
861
+ non_linearity=resnet_act_fn,
862
+ output_scale_factor=output_scale_factor,
863
+ pre_norm=resnet_pre_norm,
864
+ )
865
+ )
866
+ self.attentions.append(
867
+ AttentionBlock(
868
+ out_channels,
869
+ num_head_channels=attn_num_head_channels,
870
+ rescale_output_factor=output_scale_factor,
871
+ eps=resnet_eps,
872
+ )
873
+ )
874
+
875
+ if add_downsample:
876
+ self.resnet_down = ResnetBlock2D(
877
+ in_channels=out_channels,
878
+ out_channels=out_channels,
879
+ temb_channels=temb_channels,
880
+ eps=resnet_eps,
881
+ groups=min(out_channels // 4, 32),
882
+ dropout=dropout,
883
+ time_embedding_norm=resnet_time_scale_shift,
884
+ non_linearity=resnet_act_fn,
885
+ output_scale_factor=output_scale_factor,
886
+ pre_norm=resnet_pre_norm,
887
+ use_in_shortcut=True,
888
+ down=True,
889
+ kernel="fir",
890
+ )
891
+ self.downsamplers = nn.ModuleList([FirDownsample2D(in_channels, out_channels=out_channels)])
892
+ self.skip_conv = nn.Conv2d(3, out_channels, kernel_size=(1, 1), stride=(1, 1))
893
+ else:
894
+ self.resnet_down = None
895
+ self.downsamplers = None
896
+ self.skip_conv = None
897
+
898
+ def forward(self, hidden_states, temb=None, skip_sample=None):
899
+ output_states = ()
900
+
901
+ for resnet, attn in zip(self.resnets, self.attentions):
902
+ hidden_states = resnet(hidden_states, temb)
903
+ hidden_states = attn(hidden_states)
904
+ output_states += (hidden_states,)
905
+
906
+ if self.downsamplers is not None:
907
+ hidden_states = self.resnet_down(hidden_states, temb)
908
+ for downsampler in self.downsamplers:
909
+ skip_sample = downsampler(skip_sample)
910
+
911
+ hidden_states = self.skip_conv(skip_sample) + hidden_states
912
+
913
+ output_states += (hidden_states,)
914
+
915
+ return hidden_states, output_states, skip_sample
916
+
917
+
918
+ class SkipDownBlock2D(nn.Module):
919
+ def __init__(
920
+ self,
921
+ in_channels: int,
922
+ out_channels: int,
923
+ temb_channels: int,
924
+ dropout: float = 0.0,
925
+ num_layers: int = 1,
926
+ resnet_eps: float = 1e-6,
927
+ resnet_time_scale_shift: str = "default",
928
+ resnet_act_fn: str = "swish",
929
+ resnet_pre_norm: bool = True,
930
+ output_scale_factor=np.sqrt(2.0),
931
+ add_downsample=True,
932
+ downsample_padding=1,
933
+ ):
934
+ super().__init__()
935
+ self.resnets = nn.ModuleList([])
936
+
937
+ for i in range(num_layers):
938
+ in_channels = in_channels if i == 0 else out_channels
939
+ self.resnets.append(
940
+ ResnetBlock2D(
941
+ in_channels=in_channels,
942
+ out_channels=out_channels,
943
+ temb_channels=temb_channels,
944
+ eps=resnet_eps,
945
+ groups=min(in_channels // 4, 32),
946
+ groups_out=min(out_channels // 4, 32),
947
+ dropout=dropout,
948
+ time_embedding_norm=resnet_time_scale_shift,
949
+ non_linearity=resnet_act_fn,
950
+ output_scale_factor=output_scale_factor,
951
+ pre_norm=resnet_pre_norm,
952
+ )
953
+ )
954
+
955
+ if add_downsample:
956
+ self.resnet_down = ResnetBlock2D(
957
+ in_channels=out_channels,
958
+ out_channels=out_channels,
959
+ temb_channels=temb_channels,
960
+ eps=resnet_eps,
961
+ groups=min(out_channels // 4, 32),
962
+ dropout=dropout,
963
+ time_embedding_norm=resnet_time_scale_shift,
964
+ non_linearity=resnet_act_fn,
965
+ output_scale_factor=output_scale_factor,
966
+ pre_norm=resnet_pre_norm,
967
+ use_in_shortcut=True,
968
+ down=True,
969
+ kernel="fir",
970
+ )
971
+ self.downsamplers = nn.ModuleList([FirDownsample2D(in_channels, out_channels=out_channels)])
972
+ self.skip_conv = nn.Conv2d(3, out_channels, kernel_size=(1, 1), stride=(1, 1))
973
+ else:
974
+ self.resnet_down = None
975
+ self.downsamplers = None
976
+ self.skip_conv = None
977
+
978
+ def forward(self, hidden_states, temb=None, skip_sample=None):
979
+ output_states = ()
980
+
981
+ for resnet in self.resnets:
982
+ hidden_states = resnet(hidden_states, temb)
983
+ output_states += (hidden_states,)
984
+
985
+ if self.downsamplers is not None:
986
+ hidden_states = self.resnet_down(hidden_states, temb)
987
+ for downsampler in self.downsamplers:
988
+ skip_sample = downsampler(skip_sample)
989
+
990
+ hidden_states = self.skip_conv(skip_sample) + hidden_states
991
+
992
+ output_states += (hidden_states,)
993
+
994
+ return hidden_states, output_states, skip_sample
995
+
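In the `Skip*DownBlock2D` classes above, a running image-resolution `skip_sample` is downsampled alongside the features and mixed back in through a 1x1 convolution. The sketch below only illustrates the shapes involved; `F.avg_pool2d` is an assumed stand-in for `FirDownsample2D`, and the channel counts are made up.

```py
import torch
import torch.nn.functional as F
from torch import nn

out_channels = 8
skip_conv = nn.Conv2d(3, out_channels, kernel_size=1)

hidden = torch.randn(1, out_channels, 16, 16)
skip = torch.randn(1, 3, 32, 32)              # RGB-resolution skip image

skip = F.avg_pool2d(skip, 2)                  # stand-in for the FIR downsampler
hidden = skip_conv(skip) + hidden             # project skip to feature channels and add
print(hidden.shape, skip.shape)               # torch.Size([1, 8, 16, 16]) torch.Size([1, 3, 16, 16])
```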
996
+
997
+ class AttnUpBlock2D(nn.Module):
998
+ def __init__(
999
+ self,
1000
+ in_channels: int,
1001
+ prev_output_channel: int,
1002
+ out_channels: int,
1003
+ temb_channels: int,
1004
+ dropout: float = 0.0,
1005
+ num_layers: int = 1,
1006
+ resnet_eps: float = 1e-6,
1007
+ resnet_time_scale_shift: str = "default",
1008
+ resnet_act_fn: str = "swish",
1009
+ resnet_groups: int = 32,
1010
+ resnet_pre_norm: bool = True,
1011
+ attention_type="default",
1012
+ attn_num_head_channels=1,
1013
+ output_scale_factor=1.0,
1014
+ add_upsample=True,
1015
+ ):
1016
+ super().__init__()
1017
+ resnets = []
1018
+ attentions = []
1019
+
1020
+ self.attention_type = attention_type
1021
+
1022
+ for i in range(num_layers):
1023
+ res_skip_channels = in_channels if (i == num_layers - 1) else out_channels
1024
+ resnet_in_channels = prev_output_channel if i == 0 else out_channels
1025
+
1026
+ resnets.append(
1027
+ ResnetBlock2D(
1028
+ in_channels=resnet_in_channels + res_skip_channels,
1029
+ out_channels=out_channels,
1030
+ temb_channels=temb_channels,
1031
+ eps=resnet_eps,
1032
+ groups=resnet_groups,
1033
+ dropout=dropout,
1034
+ time_embedding_norm=resnet_time_scale_shift,
1035
+ non_linearity=resnet_act_fn,
1036
+ output_scale_factor=output_scale_factor,
1037
+ pre_norm=resnet_pre_norm,
1038
+ )
1039
+ )
1040
+ attentions.append(
1041
+ AttentionBlock(
1042
+ out_channels,
1043
+ num_head_channels=attn_num_head_channels,
1044
+ rescale_output_factor=output_scale_factor,
1045
+ eps=resnet_eps,
1046
+ norm_num_groups=resnet_groups,
1047
+ )
1048
+ )
1049
+
1050
+ self.attentions = nn.ModuleList(attentions)
1051
+ self.resnets = nn.ModuleList(resnets)
1052
+
1053
+ if add_upsample:
1054
+ self.upsamplers = nn.ModuleList([Upsample2D(out_channels, use_conv=True, out_channels=out_channels)])
1055
+ else:
1056
+ self.upsamplers = None
1057
+
1058
+ def forward(self, hidden_states, res_hidden_states_tuple, temb=None):
1059
+ for resnet, attn in zip(self.resnets, self.attentions):
1060
+ # pop res hidden states
1061
+ res_hidden_states = res_hidden_states_tuple[-1]
1062
+ res_hidden_states_tuple = res_hidden_states_tuple[:-1]
1063
+ hidden_states = torch.cat([hidden_states, res_hidden_states], dim=1)
1064
+
1065
+ hidden_states = resnet(hidden_states, temb)
1066
+ hidden_states = attn(hidden_states)
1067
+
1068
+ if self.upsamplers is not None:
1069
+ for upsampler in self.upsamplers:
1070
+ hidden_states = upsampler(hidden_states)
1071
+
1072
+ return hidden_states
1073
+
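All of the up blocks consume the skip connections the same way as the forward above: the tuple of down-block outputs is popped from the end (last in, first out) and concatenated with the current features along the channel dimension before each resnet. A small shape-only sketch with assumed sizes:

```py
import torch

hidden = torch.randn(1, 4, 8, 8)
skips = (torch.randn(1, 4, 8, 8), torch.randn(1, 4, 8, 8))  # collected on the way down

for _ in range(len(skips)):
    skip, skips = skips[-1], skips[:-1]          # pop the most recent skip
    hidden = torch.cat([hidden, skip], dim=1)    # channels grow: 4 -> 8 -> 12
print(hidden.shape)                              # torch.Size([1, 12, 8, 8])
```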
1074
+
1075
+ class CrossAttnUpBlock2D(nn.Module):
1076
+ def __init__(
1077
+ self,
1078
+ in_channels: int,
1079
+ out_channels: int,
1080
+ prev_output_channel: int,
1081
+ temb_channels: int,
1082
+ dropout: float = 0.0,
1083
+ num_layers: int = 1,
1084
+ resnet_eps: float = 1e-6,
1085
+ resnet_time_scale_shift: str = "default",
1086
+ resnet_act_fn: str = "swish",
1087
+ resnet_groups: int = 32,
1088
+ resnet_pre_norm: bool = True,
1089
+ attn_num_head_channels=1,
1090
+ cross_attention_dim=1280,
1091
+ attention_type="default",
1092
+ output_scale_factor=1.0,
1093
+ add_upsample=True,
1094
+ ):
1095
+ super().__init__()
1096
+ resnets = []
1097
+ attentions = []
1098
+
1099
+ self.attention_type = attention_type
1100
+ self.attn_num_head_channels = attn_num_head_channels
1101
+
1102
+ for i in range(num_layers):
1103
+ res_skip_channels = in_channels if (i == num_layers - 1) else out_channels
1104
+ resnet_in_channels = prev_output_channel if i == 0 else out_channels
1105
+
1106
+ resnets.append(
1107
+ ResnetBlock2D(
1108
+ in_channels=resnet_in_channels + res_skip_channels,
1109
+ out_channels=out_channels,
1110
+ temb_channels=temb_channels,
1111
+ eps=resnet_eps,
1112
+ groups=resnet_groups,
1113
+ dropout=dropout,
1114
+ time_embedding_norm=resnet_time_scale_shift,
1115
+ non_linearity=resnet_act_fn,
1116
+ output_scale_factor=output_scale_factor,
1117
+ pre_norm=resnet_pre_norm,
1118
+ )
1119
+ )
1120
+ attentions.append(
1121
+ Transformer2DModel(
1122
+ attn_num_head_channels,
1123
+ out_channels // attn_num_head_channels,
1124
+ in_channels=out_channels,
1125
+ num_layers=1,
1126
+ cross_attention_dim=cross_attention_dim,
1127
+ norm_num_groups=resnet_groups,
1128
+ )
1129
+ )
1130
+ self.attentions = nn.ModuleList(attentions)
1131
+ self.resnets = nn.ModuleList(resnets)
1132
+
1133
+ if add_upsample:
1134
+ self.upsamplers = nn.ModuleList([Upsample2D(out_channels, use_conv=True, out_channels=out_channels)])
1135
+ else:
1136
+ self.upsamplers = None
1137
+
1138
+ self.gradient_checkpointing = False
1139
+
1140
+ def set_attention_slice(self, slice_size):
1141
+ if slice_size is not None and self.attn_num_head_channels % slice_size != 0:
1142
+ raise ValueError(
1143
+ f"Make sure slice_size {slice_size} is a divisor of "
1144
+ f"the number of heads used in cross_attention {self.attn_num_head_channels}"
1145
+ )
1146
+ if slice_size is not None and slice_size > self.attn_num_head_channels:
1147
+ raise ValueError(
1148
+ f"Chunk_size {slice_size} has to be smaller or equal to "
1149
+ f"the number of heads used in cross_attention {self.attn_num_head_channels}"
1150
+ )
1151
+
1152
+ for attn in self.attentions:
1153
+ attn._set_attention_slice(slice_size)
1154
+
1155
+ self.gradient_checkpointing = False
1156
+
1157
+ def set_use_memory_efficient_attention_xformers(self, use_memory_efficient_attention_xformers: bool):
1158
+ for attn in self.attentions:
1159
+ attn._set_use_memory_efficient_attention_xformers(use_memory_efficient_attention_xformers)
1160
+
1161
+ def forward(
1162
+ self,
1163
+ hidden_states,
1164
+ res_hidden_states_tuple,
1165
+ temb=None,
1166
+ encoder_hidden_states=None,
1167
+ encoder_attention_mask=None,
1168
+ upsample_size=None,
1169
+ ):
1170
+ for resnet, attn in zip(self.resnets, self.attentions):
1171
+ # pop res hidden states
1172
+ res_hidden_states = res_hidden_states_tuple[-1]
1173
+ res_hidden_states_tuple = res_hidden_states_tuple[:-1]
1174
+ hidden_states = torch.cat([hidden_states, res_hidden_states], dim=1)
1175
+
1176
+ if self.training and self.gradient_checkpointing:
1177
+
1178
+ def create_custom_forward(module, return_dict=None):
1179
+ def custom_forward(*inputs):
1180
+ if return_dict is not None:
1181
+ return module(*inputs, return_dict=return_dict)
1182
+ else:
1183
+ return module(*inputs)
1184
+
1185
+ return custom_forward
1186
+
1187
+ hidden_states = torch.utils.checkpoint.checkpoint(create_custom_forward(resnet), hidden_states, temb)
1188
+ hidden_states = torch.utils.checkpoint.checkpoint(
1189
+ create_custom_forward(attn, return_dict=False), hidden_states, encoder_hidden_states,
1190
+ encoder_attention_mask
1191
+ )[0]
1192
+ else:
1193
+ hidden_states = resnet(hidden_states, temb)
1194
+ hidden_states = attn(hidden_states, encoder_hidden_states=encoder_hidden_states,
1195
+ encoder_attention_mask=encoder_attention_mask).sample
1196
+
1197
+ if self.upsamplers is not None:
1198
+ for upsampler in self.upsamplers:
1199
+ hidden_states = upsampler(hidden_states, upsample_size)
1200
+
1201
+ return hidden_states
1202
+
1203
+
1204
+ class UpBlock2D(nn.Module):
1205
+ def __init__(
1206
+ self,
1207
+ in_channels: int,
1208
+ prev_output_channel: int,
1209
+ out_channels: int,
1210
+ temb_channels: int,
1211
+ dropout: float = 0.0,
1212
+ num_layers: int = 1,
1213
+ resnet_eps: float = 1e-6,
1214
+ resnet_time_scale_shift: str = "default",
1215
+ resnet_act_fn: str = "swish",
1216
+ resnet_groups: int = 32,
1217
+ resnet_pre_norm: bool = True,
1218
+ output_scale_factor=1.0,
1219
+ add_upsample=True,
1220
+ ):
1221
+ super().__init__()
1222
+ resnets = []
1223
+
1224
+ for i in range(num_layers):
1225
+ res_skip_channels = in_channels if (i == num_layers - 1) else out_channels
1226
+ resnet_in_channels = prev_output_channel if i == 0 else out_channels
1227
+
1228
+ resnets.append(
1229
+ ResnetBlock2D(
1230
+ in_channels=resnet_in_channels + res_skip_channels,
1231
+ out_channels=out_channels,
1232
+ temb_channels=temb_channels,
1233
+ eps=resnet_eps,
1234
+ groups=resnet_groups,
1235
+ dropout=dropout,
1236
+ time_embedding_norm=resnet_time_scale_shift,
1237
+ non_linearity=resnet_act_fn,
1238
+ output_scale_factor=output_scale_factor,
1239
+ pre_norm=resnet_pre_norm,
1240
+ )
1241
+ )
1242
+
1243
+ self.resnets = nn.ModuleList(resnets)
1244
+
1245
+ if add_upsample:
1246
+ self.upsamplers = nn.ModuleList([Upsample2D(out_channels, use_conv=True, out_channels=out_channels)])
1247
+ else:
1248
+ self.upsamplers = None
1249
+
1250
+ self.gradient_checkpointing = False
1251
+
1252
+ def forward(self, hidden_states, res_hidden_states_tuple, temb=None, upsample_size=None):
1253
+ for resnet in self.resnets:
1254
+ # pop res hidden states
1255
+ res_hidden_states = res_hidden_states_tuple[-1]
1256
+ res_hidden_states_tuple = res_hidden_states_tuple[:-1]
1257
+ hidden_states = torch.cat([hidden_states, res_hidden_states], dim=1)
1258
+
1259
+ if self.training and self.gradient_checkpointing:
1260
+
1261
+ def create_custom_forward(module):
1262
+ def custom_forward(*inputs):
1263
+ return module(*inputs)
1264
+
1265
+ return custom_forward
1266
+
1267
+ hidden_states = torch.utils.checkpoint.checkpoint(create_custom_forward(resnet), hidden_states, temb)
1268
+ else:
1269
+ hidden_states = resnet(hidden_states, temb)
1270
+
1271
+ if self.upsamplers is not None:
1272
+ for upsampler in self.upsamplers:
1273
+ hidden_states = upsampler(hidden_states, upsample_size)
1274
+
1275
+ return hidden_states
1276
+
1277
+
1278
+ class UpDecoderBlock2D(nn.Module):
1279
+ def __init__(
1280
+ self,
1281
+ in_channels: int,
1282
+ out_channels: int,
1283
+ dropout: float = 0.0,
1284
+ num_layers: int = 1,
1285
+ resnet_eps: float = 1e-6,
1286
+ resnet_time_scale_shift: str = "default",
1287
+ resnet_act_fn: str = "swish",
1288
+ resnet_groups: int = 32,
1289
+ resnet_pre_norm: bool = True,
1290
+ output_scale_factor=1.0,
1291
+ add_upsample=True,
1292
+ ):
1293
+ super().__init__()
1294
+ resnets = []
1295
+
1296
+ for i in range(num_layers):
1297
+ input_channels = in_channels if i == 0 else out_channels
1298
+
1299
+ resnets.append(
1300
+ ResnetBlock2D(
1301
+ in_channels=input_channels,
1302
+ out_channels=out_channels,
1303
+ temb_channels=None,
1304
+ eps=resnet_eps,
1305
+ groups=resnet_groups,
1306
+ dropout=dropout,
1307
+ time_embedding_norm=resnet_time_scale_shift,
1308
+ non_linearity=resnet_act_fn,
1309
+ output_scale_factor=output_scale_factor,
1310
+ pre_norm=resnet_pre_norm,
1311
+ )
1312
+ )
1313
+
1314
+ self.resnets = nn.ModuleList(resnets)
1315
+
1316
+ if add_upsample:
1317
+ self.upsamplers = nn.ModuleList([Upsample2D(out_channels, use_conv=True, out_channels=out_channels)])
1318
+ else:
1319
+ self.upsamplers = None
1320
+
1321
+ def forward(self, hidden_states):
1322
+ for resnet in self.resnets:
1323
+ hidden_states = resnet(hidden_states, temb=None)
1324
+
1325
+ if self.upsamplers is not None:
1326
+ for upsampler in self.upsamplers:
1327
+ hidden_states = upsampler(hidden_states)
1328
+
1329
+ return hidden_states
1330
+
1331
+
1332
+ class AttnUpDecoderBlock2D(nn.Module):
1333
+ def __init__(
1334
+ self,
1335
+ in_channels: int,
1336
+ out_channels: int,
1337
+ dropout: float = 0.0,
1338
+ num_layers: int = 1,
1339
+ resnet_eps: float = 1e-6,
1340
+ resnet_time_scale_shift: str = "default",
1341
+ resnet_act_fn: str = "swish",
1342
+ resnet_groups: int = 32,
1343
+ resnet_pre_norm: bool = True,
1344
+ attn_num_head_channels=1,
1345
+ output_scale_factor=1.0,
1346
+ add_upsample=True,
1347
+ ):
1348
+ super().__init__()
1349
+ resnets = []
1350
+ attentions = []
1351
+
1352
+ for i in range(num_layers):
1353
+ input_channels = in_channels if i == 0 else out_channels
1354
+
1355
+ resnets.append(
1356
+ ResnetBlock2D(
1357
+ in_channels=input_channels,
1358
+ out_channels=out_channels,
1359
+ temb_channels=None,
1360
+ eps=resnet_eps,
1361
+ groups=resnet_groups,
1362
+ dropout=dropout,
1363
+ time_embedding_norm=resnet_time_scale_shift,
1364
+ non_linearity=resnet_act_fn,
1365
+ output_scale_factor=output_scale_factor,
1366
+ pre_norm=resnet_pre_norm,
1367
+ )
1368
+ )
1369
+ attentions.append(
1370
+ AttentionBlock(
1371
+ out_channels,
1372
+ num_head_channels=attn_num_head_channels,
1373
+ rescale_output_factor=output_scale_factor,
1374
+ eps=resnet_eps,
1375
+ norm_num_groups=resnet_groups,
1376
+ )
1377
+ )
1378
+
1379
+ self.attentions = nn.ModuleList(attentions)
1380
+ self.resnets = nn.ModuleList(resnets)
1381
+
1382
+ if add_upsample:
1383
+ self.upsamplers = nn.ModuleList([Upsample2D(out_channels, use_conv=True, out_channels=out_channels)])
1384
+ else:
1385
+ self.upsamplers = None
1386
+
1387
+ def forward(self, hidden_states):
1388
+ for resnet, attn in zip(self.resnets, self.attentions):
1389
+ hidden_states = resnet(hidden_states, temb=None)
1390
+ hidden_states = attn(hidden_states)
1391
+
1392
+ if self.upsamplers is not None:
1393
+ for upsampler in self.upsamplers:
1394
+ hidden_states = upsampler(hidden_states)
1395
+
1396
+ return hidden_states
1397
+
1398
+
1399
+ class AttnSkipUpBlock2D(nn.Module):
1400
+ def __init__(
1401
+ self,
1402
+ in_channels: int,
1403
+ prev_output_channel: int,
1404
+ out_channels: int,
1405
+ temb_channels: int,
1406
+ dropout: float = 0.0,
1407
+ num_layers: int = 1,
1408
+ resnet_eps: float = 1e-6,
1409
+ resnet_time_scale_shift: str = "default",
1410
+ resnet_act_fn: str = "swish",
1411
+ resnet_pre_norm: bool = True,
1412
+ attn_num_head_channels=1,
1413
+ attention_type="default",
1414
+ output_scale_factor=np.sqrt(2.0),
1415
+ upsample_padding=1,
1416
+ add_upsample=True,
1417
+ ):
1418
+ super().__init__()
1419
+ self.attentions = nn.ModuleList([])
1420
+ self.resnets = nn.ModuleList([])
1421
+
1422
+ self.attention_type = attention_type
1423
+
1424
+ for i in range(num_layers):
1425
+ res_skip_channels = in_channels if (i == num_layers - 1) else out_channels
1426
+ resnet_in_channels = prev_output_channel if i == 0 else out_channels
1427
+
1428
+ self.resnets.append(
1429
+ ResnetBlock2D(
1430
+ in_channels=resnet_in_channels + res_skip_channels,
1431
+ out_channels=out_channels,
1432
+ temb_channels=temb_channels,
1433
+ eps=resnet_eps,
1434
+ groups=min((resnet_in_channels + res_skip_channels) // 4, 32),
1435
+ groups_out=min(out_channels // 4, 32),
1436
+ dropout=dropout,
1437
+ time_embedding_norm=resnet_time_scale_shift,
1438
+ non_linearity=resnet_act_fn,
1439
+ output_scale_factor=output_scale_factor,
1440
+ pre_norm=resnet_pre_norm,
1441
+ )
1442
+ )
1443
+
1444
+ self.attentions.append(
1445
+ AttentionBlock(
1446
+ out_channels,
1447
+ num_head_channels=attn_num_head_channels,
1448
+ rescale_output_factor=output_scale_factor,
1449
+ eps=resnet_eps,
1450
+ )
1451
+ )
1452
+
1453
+ self.upsampler = FirUpsample2D(in_channels, out_channels=out_channels)
1454
+ if add_upsample:
1455
+ self.resnet_up = ResnetBlock2D(
1456
+ in_channels=out_channels,
1457
+ out_channels=out_channels,
1458
+ temb_channels=temb_channels,
1459
+ eps=resnet_eps,
1460
+ groups=min(out_channels // 4, 32),
1461
+ groups_out=min(out_channels // 4, 32),
1462
+ dropout=dropout,
1463
+ time_embedding_norm=resnet_time_scale_shift,
1464
+ non_linearity=resnet_act_fn,
1465
+ output_scale_factor=output_scale_factor,
1466
+ pre_norm=resnet_pre_norm,
1467
+ use_in_shortcut=True,
1468
+ up=True,
1469
+ kernel="fir",
1470
+ )
1471
+ self.skip_conv = nn.Conv2d(out_channels, 3, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
1472
+ self.skip_norm = torch.nn.GroupNorm(
1473
+ num_groups=min(out_channels // 4, 32), num_channels=out_channels, eps=resnet_eps, affine=True
1474
+ )
1475
+ self.act = nn.SiLU()
1476
+ else:
1477
+ self.resnet_up = None
1478
+ self.skip_conv = None
1479
+ self.skip_norm = None
1480
+ self.act = None
1481
+
1482
+ def forward(self, hidden_states, res_hidden_states_tuple, temb=None, skip_sample=None):
1483
+ for resnet in self.resnets:
1484
+ # pop res hidden states
1485
+ res_hidden_states = res_hidden_states_tuple[-1]
1486
+ res_hidden_states_tuple = res_hidden_states_tuple[:-1]
1487
+ hidden_states = torch.cat([hidden_states, res_hidden_states], dim=1)
1488
+
1489
+ hidden_states = resnet(hidden_states, temb)
1490
+
1491
+ hidden_states = self.attentions[0](hidden_states)
1492
+
1493
+ if skip_sample is not None:
1494
+ skip_sample = self.upsampler(skip_sample)
1495
+ else:
1496
+ skip_sample = 0
1497
+
1498
+ if self.resnet_up is not None:
1499
+ skip_sample_states = self.skip_norm(hidden_states)
1500
+ skip_sample_states = self.act(skip_sample_states)
1501
+ skip_sample_states = self.skip_conv(skip_sample_states)
1502
+
1503
+ skip_sample = skip_sample + skip_sample_states
1504
+
1505
+ hidden_states = self.resnet_up(hidden_states, temb)
1506
+
1507
+ return hidden_states, skip_sample
1508
+
1509
+
1510
+ class SkipUpBlock2D(nn.Module):
1511
+ def __init__(
1512
+ self,
1513
+ in_channels: int,
1514
+ prev_output_channel: int,
1515
+ out_channels: int,
1516
+ temb_channels: int,
1517
+ dropout: float = 0.0,
1518
+ num_layers: int = 1,
1519
+ resnet_eps: float = 1e-6,
1520
+ resnet_time_scale_shift: str = "default",
1521
+ resnet_act_fn: str = "swish",
1522
+ resnet_pre_norm: bool = True,
1523
+ output_scale_factor=np.sqrt(2.0),
1524
+ add_upsample=True,
1525
+ upsample_padding=1,
1526
+ ):
1527
+ super().__init__()
1528
+ self.resnets = nn.ModuleList([])
1529
+
1530
+ for i in range(num_layers):
1531
+ res_skip_channels = in_channels if (i == num_layers - 1) else out_channels
1532
+ resnet_in_channels = prev_output_channel if i == 0 else out_channels
1533
+
1534
+ self.resnets.append(
1535
+ ResnetBlock2D(
1536
+ in_channels=resnet_in_channels + res_skip_channels,
1537
+ out_channels=out_channels,
1538
+ temb_channels=temb_channels,
1539
+ eps=resnet_eps,
1540
+ groups=min((resnet_in_channels + res_skip_channels) // 4, 32),
1541
+ groups_out=min(out_channels // 4, 32),
1542
+ dropout=dropout,
1543
+ time_embedding_norm=resnet_time_scale_shift,
1544
+ non_linearity=resnet_act_fn,
1545
+ output_scale_factor=output_scale_factor,
1546
+ pre_norm=resnet_pre_norm,
1547
+ )
1548
+ )
1549
+
1550
+ self.upsampler = FirUpsample2D(in_channels, out_channels=out_channels)
1551
+ if add_upsample:
1552
+ self.resnet_up = ResnetBlock2D(
1553
+ in_channels=out_channels,
1554
+ out_channels=out_channels,
1555
+ temb_channels=temb_channels,
1556
+ eps=resnet_eps,
1557
+ groups=min(out_channels // 4, 32),
1558
+ groups_out=min(out_channels // 4, 32),
1559
+ dropout=dropout,
1560
+ time_embedding_norm=resnet_time_scale_shift,
1561
+ non_linearity=resnet_act_fn,
1562
+ output_scale_factor=output_scale_factor,
1563
+ pre_norm=resnet_pre_norm,
1564
+ use_in_shortcut=True,
1565
+ up=True,
1566
+ kernel="fir",
1567
+ )
1568
+ self.skip_conv = nn.Conv2d(out_channels, 3, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
1569
+ self.skip_norm = torch.nn.GroupNorm(
1570
+ num_groups=min(out_channels // 4, 32), num_channels=out_channels, eps=resnet_eps, affine=True
1571
+ )
1572
+ self.act = nn.SiLU()
1573
+ else:
1574
+ self.resnet_up = None
1575
+ self.skip_conv = None
1576
+ self.skip_norm = None
1577
+ self.act = None
1578
+
1579
+ def forward(self, hidden_states, res_hidden_states_tuple, temb=None, skip_sample=None):
1580
+ for resnet in self.resnets:
1581
+ # pop res hidden states
1582
+ res_hidden_states = res_hidden_states_tuple[-1]
1583
+ res_hidden_states_tuple = res_hidden_states_tuple[:-1]
1584
+ hidden_states = torch.cat([hidden_states, res_hidden_states], dim=1)
1585
+
1586
+ hidden_states = resnet(hidden_states, temb)
1587
+
1588
+ if skip_sample is not None:
1589
+ skip_sample = self.upsampler(skip_sample)
1590
+ else:
1591
+ skip_sample = 0
1592
+
1593
+ if self.resnet_up is not None:
1594
+ skip_sample_states = self.skip_norm(hidden_states)
1595
+ skip_sample_states = self.act(skip_sample_states)
1596
+ skip_sample_states = self.skip_conv(skip_sample_states)
1597
+
1598
+ skip_sample = skip_sample + skip_sample_states
1599
+
1600
+ hidden_states = self.resnet_up(hidden_states, temb)
1601
+
1602
+ return hidden_states, skip_sample
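For orientation, a minimal stand-alone sketch (dummy tensors only, no diffusers imports; the channel sizes are arbitrary and the resnet call itself is elided) of how the `forward` methods above consume `res_hidden_states_tuple`: the residuals stored by the matching down block are popped last-in-first-out and concatenated on the channel dimension before each resnet.

```python
import torch

hidden_states = torch.randn(1, 64, 16, 16)
# residuals saved by the corresponding down block, shallowest first
res_hidden_states_tuple = (
    torch.randn(1, 32, 16, 16),
    torch.randn(1, 64, 16, 16),
    torch.randn(1, 64, 16, 16),
)

for _ in range(len(res_hidden_states_tuple)):
    res_hidden_states = res_hidden_states_tuple[-1]            # pop from the end
    res_hidden_states_tuple = res_hidden_states_tuple[:-1]
    x = torch.cat([hidden_states, res_hidden_states], dim=1)   # channel concat fed into the resnet
    print(x.shape)  # (1, 128, 16, 16) twice, then (1, 96, 16, 16)
```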
models/diffusers_override/unet_2d_condition.py ADDED
@@ -0,0 +1,359 @@
1
+ # Copyright 2022 The HuggingFace Team. All rights reserved.
2
+ #
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+ #
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+ #
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
+ from dataclasses import dataclass
15
+ from typing import Optional, Tuple, Union
16
+
17
+ import torch
18
+ import torch.nn as nn
19
+ import torch.utils.checkpoint
20
+
21
+ from diffusers.configuration_utils import ConfigMixin, register_to_config
22
+ from diffusers.modeling_utils import ModelMixin
23
+ from diffusers.utils import BaseOutput, logging
24
+ from diffusers.models.embeddings import TimestepEmbedding, Timesteps
25
+ from .unet_2d_blocks import (
26
+ CrossAttnDownBlock2D,
27
+ CrossAttnUpBlock2D,
28
+ DownBlock2D,
29
+ UNetMidBlock2DCrossAttn,
30
+ UpBlock2D,
31
+ get_down_block,
32
+ get_up_block,
33
+ )
34
+
35
+ logger = logging.get_logger(__name__) # pylint: disable=invalid-name
36
+
37
+
38
+ @dataclass
39
+ class UNet2DConditionOutput(BaseOutput):
40
+ """
41
+ Args:
42
+ sample (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
43
+ Hidden states conditioned on `encoder_hidden_states` input. Output of last layer of model.
44
+ """
45
+
46
+ sample: torch.FloatTensor
47
+
48
+
49
+ class UNet2DConditionModel(ModelMixin, ConfigMixin):
50
+ r"""
51
+ UNet2DConditionModel is a conditional 2D UNet model that takes in a noisy sample, conditional state, and a timestep
52
+ and returns sample shaped output.
53
+
54
+ This model inherits from [`ModelMixin`]. Check the superclass documentation for the generic methods the library
55
+ implements for all the models (such as downloading or saving, etc.)
56
+
57
+ Parameters:
58
+ sample_size (`int`, *optional*): The size of the input sample.
59
+ in_channels (`int`, *optional*, defaults to 4): The number of channels in the input sample.
60
+ out_channels (`int`, *optional*, defaults to 4): The number of channels in the output.
61
+ center_input_sample (`bool`, *optional*, defaults to `False`): Whether to center the input sample.
62
+ flip_sin_to_cos (`bool`, *optional*, defaults to `True`):
63
+ Whether to flip the sin to cos in the time embedding.
64
+ freq_shift (`int`, *optional*, defaults to 0): The frequency shift to apply to the time embedding.
65
+ down_block_types (`Tuple[str]`, *optional*, defaults to `("CrossAttnDownBlock2D", "CrossAttnDownBlock2D", "CrossAttnDownBlock2D", "DownBlock2D")`):
66
+ The tuple of downsample blocks to use.
67
+ up_block_types (`Tuple[str]`, *optional*, defaults to `("UpBlock2D", "CrossAttnUpBlock2D", "CrossAttnUpBlock2D", "CrossAttnUpBlock2D",)`):
68
+ The tuple of upsample blocks to use.
69
+ block_out_channels (`Tuple[int]`, *optional*, defaults to `(320, 640, 1280, 1280)`):
70
+ The tuple of output channels for each block.
71
+ layers_per_block (`int`, *optional*, defaults to 2): The number of layers per block.
72
+ downsample_padding (`int`, *optional*, defaults to 1): The padding to use for the downsampling convolution.
73
+ mid_block_scale_factor (`float`, *optional*, defaults to 1.0): The scale factor to use for the mid block.
74
+ act_fn (`str`, *optional*, defaults to `"silu"`): The activation function to use.
75
+ norm_num_groups (`int`, *optional*, defaults to 32): The number of groups to use for the normalization.
76
+ norm_eps (`float`, *optional*, defaults to 1e-5): The epsilon to use for the normalization.
77
+ cross_attention_dim (`int`, *optional*, defaults to 1280): The dimension of the cross attention features.
78
+ attention_head_dim (`int`, *optional*, defaults to 8): The dimension of the attention heads.
79
+ """
80
+
81
+ _supports_gradient_checkpointing = True
82
+
83
+ @register_to_config
84
+ def __init__(
85
+ self,
86
+ sample_size: Optional[int] = None,
87
+ in_channels: int = 4,
88
+ out_channels: int = 4,
89
+ center_input_sample: bool = False,
90
+ flip_sin_to_cos: bool = True,
91
+ freq_shift: int = 0,
92
+ down_block_types: Tuple[str] = (
93
+ "CrossAttnDownBlock2D",
94
+ "CrossAttnDownBlock2D",
95
+ "CrossAttnDownBlock2D",
96
+ "DownBlock2D",
97
+ ),
98
+ up_block_types: Tuple[str] = (
99
+ "UpBlock2D", "CrossAttnUpBlock2D", "CrossAttnUpBlock2D", "CrossAttnUpBlock2D"),
100
+ block_out_channels: Tuple[int] = (320, 640, 1280, 1280),
101
+ layers_per_block: int = 2,
102
+ downsample_padding: int = 1,
103
+ mid_block_scale_factor: float = 1,
104
+ act_fn: str = "silu",
105
+ norm_num_groups: int = 32,
106
+ norm_eps: float = 1e-5,
107
+ cross_attention_dim: int = 1280,
108
+ attention_head_dim: int = 8,
109
+ ):
110
+ super().__init__()
111
+
112
+ self.sample_size = sample_size
113
+ time_embed_dim = block_out_channels[0] * 4
114
+
115
+ # input
116
+ self.conv_in = nn.Conv2d(in_channels, block_out_channels[0], kernel_size=3, padding=(1, 1))
117
+
118
+ # time
119
+ self.time_proj = Timesteps(block_out_channels[0], flip_sin_to_cos, freq_shift)
120
+ timestep_input_dim = block_out_channels[0]
121
+
122
+ self.time_embedding = TimestepEmbedding(timestep_input_dim, time_embed_dim)
123
+
124
+ self.down_blocks = nn.ModuleList([])
125
+ self.mid_block = None
126
+ self.up_blocks = nn.ModuleList([])
127
+
128
+ # down
129
+ output_channel = block_out_channels[0]
130
+ for i, down_block_type in enumerate(down_block_types):
131
+ input_channel = output_channel
132
+ output_channel = block_out_channels[i]
133
+ is_final_block = i == len(block_out_channels) - 1
134
+
135
+ down_block = get_down_block(
136
+ down_block_type,
137
+ num_layers=layers_per_block,
138
+ in_channels=input_channel,
139
+ out_channels=output_channel,
140
+ temb_channels=time_embed_dim,
141
+ add_downsample=not is_final_block,
142
+ resnet_eps=norm_eps,
143
+ resnet_act_fn=act_fn,
144
+ resnet_groups=norm_num_groups,
145
+ cross_attention_dim=cross_attention_dim,
146
+ attn_num_head_channels=attention_head_dim,
147
+ downsample_padding=downsample_padding,
148
+ )
149
+ self.down_blocks.append(down_block)
150
+
151
+ # mid
152
+ self.mid_block = UNetMidBlock2DCrossAttn(
153
+ in_channels=block_out_channels[-1],
154
+ temb_channels=time_embed_dim,
155
+ resnet_eps=norm_eps,
156
+ resnet_act_fn=act_fn,
157
+ output_scale_factor=mid_block_scale_factor,
158
+ resnet_time_scale_shift="default",
159
+ cross_attention_dim=cross_attention_dim,
160
+ attn_num_head_channels=attention_head_dim,
161
+ resnet_groups=norm_num_groups,
162
+ )
163
+
164
+ # count how many layers upsample the images
165
+ self.num_upsamplers = 0
166
+
167
+ # up
168
+ reversed_block_out_channels = list(reversed(block_out_channels))
169
+ output_channel = reversed_block_out_channels[0]
170
+ for i, up_block_type in enumerate(up_block_types):
171
+ is_final_block = i == len(block_out_channels) - 1
172
+
173
+ prev_output_channel = output_channel
174
+ output_channel = reversed_block_out_channels[i]
175
+ input_channel = reversed_block_out_channels[min(i + 1, len(block_out_channels) - 1)]
176
+
177
+ # add upsample block for all BUT final layer
178
+ if not is_final_block:
179
+ add_upsample = True
180
+ self.num_upsamplers += 1
181
+ else:
182
+ add_upsample = False
183
+
184
+ up_block = get_up_block(
185
+ up_block_type,
186
+ num_layers=layers_per_block + 1,
187
+ in_channels=input_channel,
188
+ out_channels=output_channel,
189
+ prev_output_channel=prev_output_channel,
190
+ temb_channels=time_embed_dim,
191
+ add_upsample=add_upsample,
192
+ resnet_eps=norm_eps,
193
+ resnet_act_fn=act_fn,
194
+ resnet_groups=norm_num_groups,
195
+ cross_attention_dim=cross_attention_dim,
196
+ attn_num_head_channels=attention_head_dim,
197
+ )
198
+ self.up_blocks.append(up_block)
199
+ prev_output_channel = output_channel
200
+
201
+ # out
202
+ self.conv_norm_out = nn.GroupNorm(num_channels=block_out_channels[0], num_groups=norm_num_groups, eps=norm_eps)
203
+ self.conv_act = nn.SiLU()
204
+ self.conv_out = nn.Conv2d(block_out_channels[0], out_channels, 3, padding=1)
205
+
206
+ def set_attention_slice(self, slice_size):
207
+ if slice_size is not None and self.config.attention_head_dim % slice_size != 0:
208
+ raise ValueError(
209
+ f"Make sure slice_size {slice_size} is a divisor of "
210
+ f"the number of heads used in cross_attention {self.config.attention_head_dim}"
211
+ )
212
+ if slice_size is not None and slice_size > self.config.attention_head_dim:
213
+ raise ValueError(
214
+ f"Chunk_size {slice_size} has to be smaller or equal to "
215
+ f"the number of heads used in cross_attention {self.config.attention_head_dim}"
216
+ )
217
+
218
+ for block in self.down_blocks:
219
+ if hasattr(block, "attentions") and block.attentions is not None:
220
+ block.set_attention_slice(slice_size)
221
+
222
+ self.mid_block.set_attention_slice(slice_size)
223
+
224
+ for block in self.up_blocks:
225
+ if hasattr(block, "attentions") and block.attentions is not None:
226
+ block.set_attention_slice(slice_size)
227
+
228
+ def set_use_memory_efficient_attention_xformers(self, use_memory_efficient_attention_xformers: bool):
229
+ for block in self.down_blocks:
230
+ if hasattr(block, "attentions") and block.attentions is not None:
231
+ block.set_use_memory_efficient_attention_xformers(use_memory_efficient_attention_xformers)
232
+
233
+ self.mid_block.set_use_memory_efficient_attention_xformers(use_memory_efficient_attention_xformers)
234
+
235
+ for block in self.up_blocks:
236
+ if hasattr(block, "attentions") and block.attentions is not None:
237
+ block.set_use_memory_efficient_attention_xformers(use_memory_efficient_attention_xformers)
238
+
239
+ def _set_gradient_checkpointing(self, module, value=False):
240
+ if isinstance(module, (CrossAttnDownBlock2D, DownBlock2D, CrossAttnUpBlock2D, UpBlock2D)):
241
+ module.gradient_checkpointing = value
242
+
243
+ def forward(
244
+ self,
245
+ sample: torch.FloatTensor,
246
+ timestep: Union[torch.Tensor, float, int],
247
+ encoder_hidden_states: torch.Tensor,
248
+ encoder_attention_mask: torch.Tensor,
249
+ return_dict: bool = True,
250
+ ) -> Union[UNet2DConditionOutput, Tuple]:
251
+ r"""
252
+ Args:
253
+ sample (`torch.FloatTensor`): (batch, channel, height, width) noisy inputs tensor
254
+ timestep (`torch.FloatTensor` or `float` or `int`): (batch) timesteps
255
+ encoder_hidden_states (`torch.FloatTensor`):
256
+ (batch_size, sequence_length, hidden_size) encoder hidden states
257
+ encoder_attention_mask (`torch.FloatTensor`):
258
+ (batch_size, sequence_length) encoder attention mask
259
+ return_dict (`bool`, *optional*, defaults to `True`):
260
+ Whether or not to return a [`models.unet_2d_condition.UNet2DConditionOutput`] instead of a plain tuple.
261
+
262
+ Returns:
263
+ [`~models.unet_2d_condition.UNet2DConditionOutput`] or `tuple`:
264
+ [`~models.unet_2d_condition.UNet2DConditionOutput`] if `return_dict` is True, otherwise a `tuple`. When
265
+ returning a tuple, the first element is the sample tensor.
266
+ """
267
+ # By default samples have to be AT least a multiple of the overall upsampling factor.
268
+ # The overall upsampling factor is equal to 2 ** (# num of upsampling layers).
269
+ # However, the upsampling interpolation output size can be forced to fit any upsampling size
270
+ # on the fly if necessary.
271
+ default_overall_up_factor = 2 ** self.num_upsamplers
272
+
273
+ # upsample size should be forwarded when sample is not a multiple of `default_overall_up_factor`
274
+ forward_upsample_size = False
275
+ upsample_size = None
276
+
277
+ if any(s % default_overall_up_factor != 0 for s in sample.shape[-2:]):
278
+ logger.info("Forward upsample size to force interpolation output size.")
279
+ forward_upsample_size = True
280
+
281
+ # 0. center input if necessary
282
+ if self.config.center_input_sample:
283
+ sample = 2 * sample - 1.0
284
+
285
+ # 1. time
286
+ timesteps = timestep
287
+ if not torch.is_tensor(timesteps):
288
+ # TODO: this requires sync between CPU and GPU. So try to pass timesteps as tensors if you can
289
+ timesteps = torch.tensor([timesteps], dtype=torch.long, device=sample.device)
290
+ elif torch.is_tensor(timesteps) and len(timesteps.shape) == 0:
291
+ timesteps = timesteps[None].to(sample.device)
292
+
293
+ # broadcast to batch dimension in a way that's compatible with ONNX/Core ML
294
+ timesteps = timesteps.expand(sample.shape[0])
295
+
296
+ t_emb = self.time_proj(timesteps)
297
+
298
+ # timesteps does not contain any weights and will always return f32 tensors
299
+ # but time_embedding might actually be running in fp16. so we need to cast here.
300
+ # there might be better ways to encapsulate this.
301
+ t_emb = t_emb.to(dtype=self.dtype)
302
+ emb = self.time_embedding(t_emb)
303
+
304
+ # 2. pre-process
305
+ sample = self.conv_in(sample)
306
+
307
+ # 3. down
308
+ down_block_res_samples = (sample,)
309
+ for downsample_block in self.down_blocks:
310
+ if hasattr(downsample_block, "attentions") and downsample_block.attentions is not None:
311
+ sample, res_samples = downsample_block(
312
+ hidden_states=sample,
313
+ temb=emb,
314
+ encoder_hidden_states=encoder_hidden_states,
315
+ encoder_attention_mask=encoder_attention_mask,
316
+ )
317
+ else:
318
+ sample, res_samples = downsample_block(hidden_states=sample, temb=emb)
319
+
320
+ down_block_res_samples += res_samples
321
+
322
+ # 4. mid
323
+ sample = self.mid_block(sample, emb, encoder_hidden_states=encoder_hidden_states,
324
+ encoder_attention_mask=encoder_attention_mask)
325
+
326
+ # 5. up
327
+ for i, upsample_block in enumerate(self.up_blocks):
328
+ is_final_block = i == len(self.up_blocks) - 1
329
+
330
+ res_samples = down_block_res_samples[-len(upsample_block.resnets):]
331
+ down_block_res_samples = down_block_res_samples[: -len(upsample_block.resnets)]
332
+
333
+ # if we have not reached the final block and need to forward the
334
+ # upsample size, we do it here
335
+ if not is_final_block and forward_upsample_size:
336
+ upsample_size = down_block_res_samples[-1].shape[2:]
337
+
338
+ if hasattr(upsample_block, "attentions") and upsample_block.attentions is not None:
339
+ sample = upsample_block(
340
+ hidden_states=sample,
341
+ temb=emb,
342
+ res_hidden_states_tuple=res_samples,
343
+ encoder_hidden_states=encoder_hidden_states,
344
+ encoder_attention_mask=encoder_attention_mask,
345
+ upsample_size=upsample_size,
346
+ )
347
+ else:
348
+ sample = upsample_block(
349
+ hidden_states=sample, temb=emb, res_hidden_states_tuple=res_samples, upsample_size=upsample_size
350
+ )
351
+ # 6. post-process
352
+ sample = self.conv_norm_out(sample)
353
+ sample = self.conv_act(sample)
354
+ sample = self.conv_out(sample)
355
+
356
+ if not return_dict:
357
+ return (sample,)
358
+
359
+ return UNet2DConditionOutput(sample=sample)
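A minimal smoke-test sketch for this override, showing the extra `encoder_attention_mask` argument threaded through `forward`. The shrunken config values and the mask shape below are illustrative assumptions (the documented defaults above are the real settings), and the sketch presumes the companion `unet_2d_blocks` override is importable alongside this file:

```python
import torch
from models.diffusers_override.unet_2d_condition import UNet2DConditionModel

# tiny config purely for a shape check (assumed values, not the training config)
unet = UNet2DConditionModel(
    sample_size=16,
    down_block_types=("CrossAttnDownBlock2D", "DownBlock2D"),
    up_block_types=("UpBlock2D", "CrossAttnUpBlock2D"),
    block_out_channels=(32, 64),
    layers_per_block=1,
    cross_attention_dim=64,
    attention_head_dim=4,
    norm_num_groups=16,
)

latents = torch.randn(2, 4, 16, 16)        # (batch, channels, height, width)
timesteps = torch.randint(0, 1000, (2,))
text_emb = torch.randn(2, 77, 64)          # (batch, seq_len, cross_attention_dim)
text_mask = torch.ones(2, 77)              # the argument this override adds to forward()

out = unet(latents, timesteps, encoder_hidden_states=text_emb,
           encoder_attention_mask=text_mask)
print(out.sample.shape)                    # expected: torch.Size([2, 4, 16, 16])
```

The only interface change relative to stock diffusers is that `encoder_attention_mask` is a required argument and is passed down into every cross-attention block.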
models/inception.py ADDED
@@ -0,0 +1,314 @@
1
+ import torch
2
+ import torch.nn as nn
3
+ import torch.nn.functional as F
4
+ from torchvision import models
5
+
6
+ try:
7
+ from torchvision.models.utils import load_state_dict_from_url
8
+ except ImportError:
9
+ from torch.utils.model_zoo import load_url as load_state_dict_from_url
10
+
11
+ # Inception weights ported to Pytorch from
12
+ # http://download.tensorflow.org/models/image/imagenet/inception-2015-12-05.tgz
13
+ FID_WEIGHTS_URL = 'https://github.com/mseitzer/pytorch-fid/releases/download/fid_weights/pt_inception-2015-12-05-6726825d.pth'
14
+
15
+
16
+ class InceptionV3(nn.Module):
17
+ """Pretrained InceptionV3 network returning feature maps"""
18
+
19
+ # Index of default block of inception to return,
20
+ # corresponds to output of final average pooling
21
+ DEFAULT_BLOCK_INDEX = 3
22
+
23
+ # Maps feature dimensionality to their output blocks indices
24
+ BLOCK_INDEX_BY_DIM = {
25
+ 64: 0, # First max pooling features
26
+ 192: 1, # Second max pooling features
27
+ 768: 2, # Pre-aux classifier features
28
+ 2048: 3 # Final average pooling features
29
+ }
30
+
31
+ def __init__(self,
32
+ output_blocks=[DEFAULT_BLOCK_INDEX],
33
+ resize_input=True,
34
+ normalize_input=True,
35
+ requires_grad=False,
36
+ use_fid_inception=True):
37
+ """Build pretrained InceptionV3
38
+
39
+ Parameters
40
+ ----------
41
+ output_blocks : list of int
42
+ Indices of blocks to return features of. Possible values are:
43
+ - 0: corresponds to output of first max pooling
44
+ - 1: corresponds to output of second max pooling
45
+ - 2: corresponds to output which is fed to aux classifier
46
+ - 3: corresponds to output of final average pooling
47
+ resize_input : bool
48
+ If true, bilinearly resizes input to width and height 299 before
49
+ feeding input to model. As the network without fully connected
50
+ layers is fully convolutional, it should be able to handle inputs
51
+ of arbitrary size, so resizing might not be strictly needed
52
+ normalize_input : bool
53
+ If true, scales the input from range (0, 1) to the range the
54
+ pretrained Inception network expects, namely (-1, 1)
55
+ requires_grad : bool
56
+ If true, parameters of the model require gradients. Possibly useful
57
+ for finetuning the network
58
+ use_fid_inception : bool
59
+ If true, uses the pretrained Inception model used in Tensorflow's
60
+ FID implementation. If false, uses the pretrained Inception model
61
+ available in torchvision. The FID Inception model has different
62
+ weights and a slightly different structure from torchvision's
63
+ Inception model. If you want to compute FID scores, you are
64
+ strongly advised to set this parameter to true to get comparable
65
+ results.
66
+ """
67
+ super(InceptionV3, self).__init__()
68
+
69
+ self.resize_input = resize_input
70
+ self.normalize_input = normalize_input
71
+ self.output_blocks = sorted(output_blocks)
72
+ self.last_needed_block = max(output_blocks)
73
+
74
+ assert self.last_needed_block <= 3, \
75
+ 'Last possible output block index is 3'
76
+
77
+ self.blocks = nn.ModuleList()
78
+
79
+ if use_fid_inception:
80
+ inception = fid_inception_v3()
81
+ else:
82
+ inception = models.inception_v3(pretrained=True)
83
+
84
+ # Block 0: input to maxpool1
85
+ block0 = [
86
+ inception.Conv2d_1a_3x3,
87
+ inception.Conv2d_2a_3x3,
88
+ inception.Conv2d_2b_3x3,
89
+ nn.MaxPool2d(kernel_size=3, stride=2)
90
+ ]
91
+ self.blocks.append(nn.Sequential(*block0))
92
+
93
+ # Block 1: maxpool1 to maxpool2
94
+ if self.last_needed_block >= 1:
95
+ block1 = [
96
+ inception.Conv2d_3b_1x1,
97
+ inception.Conv2d_4a_3x3,
98
+ nn.MaxPool2d(kernel_size=3, stride=2)
99
+ ]
100
+ self.blocks.append(nn.Sequential(*block1))
101
+
102
+ # Block 2: maxpool2 to aux classifier
103
+ if self.last_needed_block >= 2:
104
+ block2 = [
105
+ inception.Mixed_5b,
106
+ inception.Mixed_5c,
107
+ inception.Mixed_5d,
108
+ inception.Mixed_6a,
109
+ inception.Mixed_6b,
110
+ inception.Mixed_6c,
111
+ inception.Mixed_6d,
112
+ inception.Mixed_6e,
113
+ ]
114
+ self.blocks.append(nn.Sequential(*block2))
115
+
116
+ # Block 3: aux classifier to final avgpool
117
+ if self.last_needed_block >= 3:
118
+ block3 = [
119
+ inception.Mixed_7a,
120
+ inception.Mixed_7b,
121
+ inception.Mixed_7c,
122
+ nn.AdaptiveAvgPool2d(output_size=(1, 1))
123
+ ]
124
+ self.blocks.append(nn.Sequential(*block3))
125
+
126
+ for param in self.parameters():
127
+ param.requires_grad = requires_grad
128
+
129
+ def forward(self, inp):
130
+ """Get Inception feature maps
131
+
132
+ Parameters
133
+ ----------
134
+ inp : torch.autograd.Variable
135
+ Input tensor of shape Bx3xHxW. Values are expected to be in
136
+ range (0, 1)
137
+
138
+ Returns
139
+ -------
140
+ List of torch.autograd.Variable, corresponding to the selected output
141
+ block, sorted ascending by index
142
+ """
143
+ outp = []
144
+ x = inp
145
+
146
+ if self.resize_input:
147
+ x = F.interpolate(x,
148
+ size=(299, 299),
149
+ mode='bilinear',
150
+ align_corners=False)
151
+
152
+ if self.normalize_input:
153
+ x = 2 * x - 1 # Scale from range (0, 1) to range (-1, 1)
154
+
155
+ for idx, block in enumerate(self.blocks):
156
+ x = block(x)
157
+ if idx in self.output_blocks:
158
+ outp.append(x)
159
+
160
+ if idx == self.last_needed_block:
161
+ break
162
+
163
+ return outp
164
+
165
+
166
+ def fid_inception_v3():
167
+ """Build pretrained Inception model for FID computation
168
+
169
+ The Inception model for FID computation uses a different set of weights
170
+ and has a slightly different structure than torchvision's Inception.
171
+
172
+ This method first constructs torchvision's Inception and then patches the
173
+ necessary parts that are different in the FID Inception model.
174
+ """
175
+ inception = models.inception_v3(num_classes=1008,
176
+ aux_logits=False,
177
+ pretrained=False)
178
+ inception.Mixed_5b = FIDInceptionA(192, pool_features=32)
179
+ inception.Mixed_5c = FIDInceptionA(256, pool_features=64)
180
+ inception.Mixed_5d = FIDInceptionA(288, pool_features=64)
181
+ inception.Mixed_6b = FIDInceptionC(768, channels_7x7=128)
182
+ inception.Mixed_6c = FIDInceptionC(768, channels_7x7=160)
183
+ inception.Mixed_6d = FIDInceptionC(768, channels_7x7=160)
184
+ inception.Mixed_6e = FIDInceptionC(768, channels_7x7=192)
185
+ inception.Mixed_7b = FIDInceptionE_1(1280)
186
+ inception.Mixed_7c = FIDInceptionE_2(2048)
187
+
188
+ state_dict = load_state_dict_from_url(FID_WEIGHTS_URL, progress=True)
189
+ inception.load_state_dict(state_dict)
190
+ return inception
191
+
192
+
193
+ class FIDInceptionA(models.inception.InceptionA):
194
+ """InceptionA block patched for FID computation"""
195
+
196
+ def __init__(self, in_channels, pool_features):
197
+ super(FIDInceptionA, self).__init__(in_channels, pool_features)
198
+
199
+ def forward(self, x):
200
+ branch1x1 = self.branch1x1(x)
201
+
202
+ branch5x5 = self.branch5x5_1(x)
203
+ branch5x5 = self.branch5x5_2(branch5x5)
204
+
205
+ branch3x3dbl = self.branch3x3dbl_1(x)
206
+ branch3x3dbl = self.branch3x3dbl_2(branch3x3dbl)
207
+ branch3x3dbl = self.branch3x3dbl_3(branch3x3dbl)
208
+
209
+ # Patch: Tensorflow's average pool does not use the padded zero's in
210
+ # its average calculation
211
+ branch_pool = F.avg_pool2d(x, kernel_size=3, stride=1, padding=1,
212
+ count_include_pad=False)
213
+ branch_pool = self.branch_pool(branch_pool)
214
+
215
+ outputs = [branch1x1, branch5x5, branch3x3dbl, branch_pool]
216
+ return torch.cat(outputs, 1)
217
+
218
+
219
+ class FIDInceptionC(models.inception.InceptionC):
220
+ """InceptionC block patched for FID computation"""
221
+
222
+ def __init__(self, in_channels, channels_7x7):
223
+ super(FIDInceptionC, self).__init__(in_channels, channels_7x7)
224
+
225
+ def forward(self, x):
226
+ branch1x1 = self.branch1x1(x)
227
+
228
+ branch7x7 = self.branch7x7_1(x)
229
+ branch7x7 = self.branch7x7_2(branch7x7)
230
+ branch7x7 = self.branch7x7_3(branch7x7)
231
+
232
+ branch7x7dbl = self.branch7x7dbl_1(x)
233
+ branch7x7dbl = self.branch7x7dbl_2(branch7x7dbl)
234
+ branch7x7dbl = self.branch7x7dbl_3(branch7x7dbl)
235
+ branch7x7dbl = self.branch7x7dbl_4(branch7x7dbl)
236
+ branch7x7dbl = self.branch7x7dbl_5(branch7x7dbl)
237
+
238
+ # Patch: Tensorflow's average pool does not use the padded zero's in
239
+ # its average calculation
240
+ branch_pool = F.avg_pool2d(x, kernel_size=3, stride=1, padding=1,
241
+ count_include_pad=False)
242
+ branch_pool = self.branch_pool(branch_pool)
243
+
244
+ outputs = [branch1x1, branch7x7, branch7x7dbl, branch_pool]
245
+ return torch.cat(outputs, 1)
246
+
247
+
248
+ class FIDInceptionE_1(models.inception.InceptionE):
249
+ """First InceptionE block patched for FID computation"""
250
+
251
+ def __init__(self, in_channels):
252
+ super(FIDInceptionE_1, self).__init__(in_channels)
253
+
254
+ def forward(self, x):
255
+ branch1x1 = self.branch1x1(x)
256
+
257
+ branch3x3 = self.branch3x3_1(x)
258
+ branch3x3 = [
259
+ self.branch3x3_2a(branch3x3),
260
+ self.branch3x3_2b(branch3x3),
261
+ ]
262
+ branch3x3 = torch.cat(branch3x3, 1)
263
+
264
+ branch3x3dbl = self.branch3x3dbl_1(x)
265
+ branch3x3dbl = self.branch3x3dbl_2(branch3x3dbl)
266
+ branch3x3dbl = [
267
+ self.branch3x3dbl_3a(branch3x3dbl),
268
+ self.branch3x3dbl_3b(branch3x3dbl),
269
+ ]
270
+ branch3x3dbl = torch.cat(branch3x3dbl, 1)
271
+
272
+ # Patch: Tensorflow's average pool does not use the padded zero's in
273
+ # its average calculation
274
+ branch_pool = F.avg_pool2d(x, kernel_size=3, stride=1, padding=1,
275
+ count_include_pad=False)
276
+ branch_pool = self.branch_pool(branch_pool)
277
+
278
+ outputs = [branch1x1, branch3x3, branch3x3dbl, branch_pool]
279
+ return torch.cat(outputs, 1)
280
+
281
+
282
+ class FIDInceptionE_2(models.inception.InceptionE):
283
+ """Second InceptionE block patched for FID computation"""
284
+
285
+ def __init__(self, in_channels):
286
+ super(FIDInceptionE_2, self).__init__(in_channels)
287
+
288
+ def forward(self, x):
289
+ branch1x1 = self.branch1x1(x)
290
+
291
+ branch3x3 = self.branch3x3_1(x)
292
+ branch3x3 = [
293
+ self.branch3x3_2a(branch3x3),
294
+ self.branch3x3_2b(branch3x3),
295
+ ]
296
+ branch3x3 = torch.cat(branch3x3, 1)
297
+
298
+ branch3x3dbl = self.branch3x3dbl_1(x)
299
+ branch3x3dbl = self.branch3x3dbl_2(branch3x3dbl)
300
+ branch3x3dbl = [
301
+ self.branch3x3dbl_3a(branch3x3dbl),
302
+ self.branch3x3dbl_3b(branch3x3dbl),
303
+ ]
304
+ branch3x3dbl = torch.cat(branch3x3dbl, 1)
305
+
306
+ # Patch: The FID Inception model uses max pooling instead of average
307
+ # pooling. This is likely an error in this specific Inception
308
+ # implementation, as other Inception models use average pooling here
309
+ # (which matches the description in the paper).
310
+ branch_pool = F.max_pool2d(x, kernel_size=3, stride=1, padding=1)
311
+ branch_pool = self.branch_pool(branch_pool)
312
+
313
+ outputs = [branch1x1, branch3x3, branch3x3dbl, branch_pool]
314
+ return torch.cat(outputs, 1)
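A short sketch of how this module is typically used for FID: pull the 2048-dimensional final-average-pool features for a batch of images in `[0, 1]` (random tensors stand in here for real and generated story frames):

```python
import torch
from models.inception import InceptionV3

block_idx = InceptionV3.BLOCK_INDEX_BY_DIM[2048]     # final average pooling block
model = InceptionV3(output_blocks=[block_idx]).eval()

images = torch.rand(8, 3, 128, 128)                  # B x 3 x H x W, values in (0, 1)
with torch.no_grad():
    features = model(images)[0]                      # (8, 2048, 1, 1)
features = features.squeeze(-1).squeeze(-1)
# compute the mean/covariance of these features for the real and the generated sets;
# the Frechet distance between the two Gaussians is the FID score
```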
v1-5-pruned-emaonly.ckpt → pororo_100.h5 RENAMED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:cc6cb27103417325ff94f52b7a5d2dde45a7515b25c255d8e396c90014281516
3
- size 4265380512
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:6b5d47440de7abbbbb2265e1d5ecbc1c5d4d3188434db3988cb13e7ec5fa7549
3
+ size 69568
readme-storyvisualization.md ADDED
@@ -0,0 +1,123 @@
+ ### 1. Cross-modal sequential image generation from narrative text
+
+ ## Environment setup
+ conda create -n arldm python=3.8
+ conda activate arldm
+ conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch-lts
+ cd /root/lihui/StoryVisualization
+ pip install -r requirements.txt
+ ## Data preparation
+ Download the PororoSV dataset here.
+ To accelerate I/O, use the following script to convert the downloaded data to HDF5 (a quick way to check the result is sketched right after these commands):
+ python data_script/pororo_hdf5.py
+ --data_dir /path/to/pororo_data
+ --save_path /path/to/save_hdf5_file
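+
+ A short sanity check of the converted file (a sketch; the group/dataset layout is assumed from test.py in this repo: a "test" group holding a "text" dataset plus five JPEG-encoded "image0".."image4" datasets per story):
+ ```python
+ import cv2
+ import h5py
+
+ with h5py.File('/root/lihui/StoryVisualization/pororo.h5', 'r') as h5:
+     test = h5['test']
+     print(len(test['text']), 'test stories')
+     print(test['text'][0].decode('utf-8').split('|'))          # five captions per story
+     frame = cv2.imdecode(test['image0'][0], cv2.IMREAD_COLOR)   # stacked video frames, 128 px per slice
+     print(frame.shape)
+ ```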
+ ## Configuration file (config.yaml)
+
+ #device
+ mode: sample # train or sample
+ ckpt_dir: /root/lihui/StoryVisualization/save_ckpt_epoch5_new # checkpoint directory
+ run_name: ARLDM # name for this run
+
+ #train
+ train_model_file: /root/lihui/StoryVisualization/save_ckpt_3last50/ARLDM/last.ckpt # model file to resume from, none to train from scratch
+
+ #sample
+ test_model_file: /root/lihui/StoryVisualization/save_ckpt_3last50/ARLDM/last.ckpt # model file for testing
+ sample_output_dir: /root/lihui/StoryVisualization/save_samples_128_epoch50 # output directory
+ ## Training
+ Set your directories and device configuration in config.yaml (mode: train) and run:
+ python main.py
+ ## Sampling
+ Set your directories and device configuration in config.yaml (mode: sample) and run:
+ python main.py
+ ## Citation
+ @article{pan2022synthesizing,
+ title={Synthesizing Coherent Story with Auto-Regressive Latent Diffusion Models},
+ author={Pan, Xichen and Qin, Pengda and Li, Yuhong and Xue, Hui and Chen, Wenhu},
+ journal={arXiv preprint arXiv:2211.10950},
+ year={2022}
+ }
+
+
+ ### 2. Super-resolution with Real-ESRGAN
+ Real-ESRGAN: Training Real-World Blind Super-Resolution with Pure Synthetic Data
+ [Paper]   [Project page]   [YouTube video]   [Bilibili video]   [Poster]   [PPT]
+ Xintao Wang, Liangbin Xie, Chao Dong, Ying Shan
+ Tencent ARC Lab; Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
+ ## Environment
+ Python >= 3.7 (Anaconda or Miniconda recommended)
+ PyTorch >= 1.7
+ ## Installation
+ 1. Enter the pre-configured folder directly:
+ cd /root/lihui/StoryVisualization/Real-ESRGAN
+ 2. Or clone the project locally:
+ git clone https://github.com/xinntao/Real-ESRGAN.git
+ cd Real-ESRGAN
+ 3. Install the dependencies:
+ ```bash
+ # Install basicsr - https://github.com/xinntao/BasicSR
+ # We use BasicSR for both training and inference
+ pip install basicsr
+ # facexlib and gfpgan are used for face enhancement
+ pip install facexlib
+ pip install gfpgan
+ pip install -r requirements.txt
+ python setup.py develop
+ ```
+ ## Training
+ Pretrained model: RealESRGAN_x4plus_anime_6B
+ More information about and comparisons with waifu2x can be found in anime_model.md.
+ ## Download the model
+ wget https://github.com/xinntao/Real-ESRGAN/releases/download/v0.2.2.4/RealESRGAN_x4plus_anime_6B.pth -P weights
+ ## Inference
+ python inference_realesrgan.py -n RealESRGAN_x4plus_anime_6B -i inputs
+ The results are saved in the results folder.
+ ## BibTeX citation
+ @Article{wang2021realesrgan,
+ title={Real-ESRGAN: Training Real-World Blind Super-Resolution with Pure Synthetic Data},
+ author={Xintao Wang and Liangbin Xie and Chao Dong and Ying Shan},
+ journal={arXiv:2107.10833},
+ year={2021}
+ }
+
+
+ ### 3. Target character detection with YOLOv5
+ ## Installation
+ Clone the repo and install requirements.txt in a Python>=3.7.0 environment with PyTorch>=1.7.
+ git clone https://github.com/ultralytics/yolov5 # clone
+ cd /root/lihui/StoryVisualization
+ cd yolov5
+ pip install -r requirements.txt # install
+ ## Converting the images
+ cd /root/lihui/StoryVisualization
+ python transtoyolo.py
+ ## Inference with detect.py
+ detect.py runs inference on a variety of sources; models are downloaded automatically from the latest YOLOv5 release, and results are saved to runs/detect.
+ python detect.py --weights yolov5s.pt --source 0 # webcam
+ img.jpg # image
+ vid.mp4 # video
+ screen # screenshot
+ path/ # directory
+ list.txt # list of images
+ list.streams # list of streams
+ 'path/*.jpg' # glob
+ 'https://youtu.be/Zgi9g1ksQHc' # YouTube
+ 'rtsp://example.com/media.mp4' # RTSP, RTMP, HTTP stream
+ ## Training
+ The latest models and datasets are downloaded automatically from the YOLOv5 release. YOLOv5n/s/m/l/x take 1/2/4/6/8 days to train on a V100 GPU (multi-GPU training is faster). Use the largest --batch-size possible, or pass --batch-size -1 for YOLOv5 AutoBatch. The batch size shown below is for a V100-16GB.
+ python train.py --data xxx.yaml --epochs 500 --weights '' --cfg yolov5l --batch-size 64
+ # xxx.yaml is the data file generated by the conversion step above
+
+ ## License
+ YOLOv5 is available under two different licenses:
+ AGPL-3.0 License: see the LICENSE file for details.
+ Enterprise License: provides greater flexibility for commercial product development without the open-source requirements of AGPL-3.0. Typical use cases are embedding Ultralytics software and AI models in commercial products and applications. Apply for an Enterprise License at Ultralytics Licensing.
+
+
+ ### 4. Demo system
+
+ ## Specify the working directory and run:
+ cd /root/lihui/StoryVisualization/visualsystem
+ python main.py
+
requirements.txt ADDED
@@ -0,0 +1,10 @@
1
+ pytorch_lightning<1.7.0
2
+ lightning-bolts
3
+ transformers==4.24.0
4
+ diffusers==0.7.2
5
+ timm
6
+ ftfy
7
+ hydra-core
8
+ opencv-python
9
+ h5py
10
+ scipy
run.sh ADDED
@@ -0,0 +1 @@
1
+ python main.py
test.py ADDED
@@ -0,0 +1,94 @@
+ import cv2
+ import h5py
+ import copy
+ import os
+ import random
+
+ import numpy
+ import numpy as np
+ from PIL import Image
+
+
+ def gettext(index):
+     with h5py.File('/root/lihui/StoryVisualization/pororo.h5', 'r') as h5:
+         story = list()
+         h5 = h5['test']
+         # Read the text at the current index and decode it as UTF-8
+         texts = h5['text'][index].decode('utf-8').split('|')
+         symbol = '\n'
+         texts = symbol.join(texts)
+         texts = 'Story<' + str(index) + '> :' + '\n' + texts
+         print(texts)
+         return texts
+
+
+ # for i in range(1000):
+ #     gettext(i)
+
+ # Cut out the first 100 samples of the dataset
+ # ### working version ##############
+ # # import h5py
+ # # import numpy as np
+ # # from PIL import Image
+ # #
+ # #
+ # # # Create a subdirectory named "images" to save the images
+ # # os.makedirs("train_images", exist_ok=True)
+ # #
+ # # Create a new h5 file
+ # nf = h5py.File('/root/lihui/StoryVisualization/pororo_100.h5', "w")
+ # with h5py.File('/root/lihui/StoryVisualization/pororo.h5', 'r') as f:
+ #     test_group = f['test']
+ #     texts = np.array(test_group['text'][()])
+ #     ngroup = nf.create_group('test')
+ #     ntext = ngroup.create_dataset('text', (100,), dtype=h5py.string_dtype(encoding='utf-8'))
+ #     for i in range(100):
+ #         ntext[i] = texts[i]
+ #         print(f"sample {i}:")
+ #         # for j in range(5):
+ #         #     # Build a fixed filename to save the image
+ #         #     # filename = os.path.join("images", f"image_{i}_{j}.png")
+ #         #     # # Write the image bytes from the HDF5 file to disk
+ #         #     # with open(filename, "wb") as img_file:
+ #         #     #     img_file.write(test_group[f'image{j}'][i])
+ #         #     # Print the text and the filename
+ #         #     ntext[i] = '|'.join(texts[i].decode('utf-8').split('|')[j])
+ #         #     print(f"image {j} saved to file: {filename}")
+ #         print(ntext[i])
+ # nf.close()
+
+ # Save the test-set images: randomly crop one 128-pixel frame from each stacked video strip
+ with h5py.File(r'C:\Users\zjlab\Desktop\StoryVisualization\pororo.h5', 'r') as h5:
+     h5 = h5['test']
+
+     for index in range(len(h5['text'])):
+         # index = int(index + 1)
+         # print(index)
+         images = list()
+         for i in range(5):
+             # Read one image of the story (and its matching text) from the h5 file
+             im = h5['image{}'.format(i)][index]
+             # print(im)
+             # pil_img = Image.fromarray(im)
+             # # save the image
+             # pil_img.save(os.path.join('/root/lihui/StoryVisualization/ori_test_images', '{:04d}.png'.format(i)))
+             # Decode the JPEG-encoded bytes
+             im = cv2.imdecode(im, cv2.IMREAD_COLOR)
+             # Randomly pick one 128-pixel-high slice
+             idx = random.randint(0, im.shape[0] // 128 - 1)
+             # Append the slice to the images list
+             images.append(im[idx * 128: (idx + 1) * 128])
+         # Deep copy so the originals do not change with images later
+         # ori_images = copy.deepcopy(images)
+
+         # Save the original test images
+         # for i, im in enumerate(images):
+         #     file_path = 'C:/Users/zjlab/Desktop/StoryVisualization/test_images/group{:02d}_image{:02d}.png'.format(
+         #         index + 1,
+         #         i + 1)
+         #     cv2.imwrite(file_path, im)
+
+         # Save all five frames of the story (cv2 decodes to BGR, so convert to RGB for PIL)
+         for i, im_slice in enumerate(images):
+             ori_images_pil = Image.fromarray(cv2.cvtColor(im_slice, cv2.COLOR_BGR2RGB))
+             ori_images_pil.save(
+                 os.path.join('C:/Users/zjlab/Desktop/StoryVisualization/test_images',
+                              'group{:02d}_image{:02d}.png'.format(index + 1, i + 1)))
transtoyolo.py ADDED
@@ -0,0 +1,320 @@
+ # -*- coding: utf-8 -*-
+
+ import os
+ import numpy as np
+ import json
+ from glob import glob
+ import cv2
+ import shutil
+ import yaml
+ from sklearn.model_selection import train_test_split
+ from tqdm import tqdm
+
+
+ # current working directory
+ ROOT_DIR = os.getcwd()
+
+ '''
+ Unify the image format
+ '''
+ def change_image_format(label_path=ROOT_DIR, suffix='.png'):
+     """
+     Convert every image in the folder to a single format, e.g. '.jpg'
+     :param suffix: image file extension
+     :param label_path: current data path
+     :return:
+     """
+     externs = ['png', 'jpg', 'JPEG', 'BMP', 'bmp']
+     files = list()
+     # collect all images whose extension is listed in externs
+     for extern in externs:
+         files.extend(glob(label_path + "\\*." + extern))
+     # iterate over the images and convert their format
+     for file in files:
+         name = ''.join(file.split('.')[:-1])
+         file_suffix = file.split('.')[-1]
+         if file_suffix != suffix.split('.')[-1]:
+             # new name with the target suffix
+             new_name = name + suffix
+             # read the image
+             image = cv2.imread(file)
+             # re-save it in the target format
+             cv2.imwrite(new_name, image)
+             # delete the old image
+             os.remove(file)
+
+
+
+ '''
+ Read every JSON file and collect all class labels
+ '''
+ def get_all_class(file_list, label_path=ROOT_DIR):
+     """
+     Collect all classes present in the data from the JSON files
+     :param file_list: all file names under the current path
+     :param label_path: current data path
+     :return:
+     """
+     # initialize the class list
+     classes = list()
+     # read the label of every shape in every JSON file and add it to classes
+     for filename in tqdm(file_list):
+         json_path = os.path.join(label_path, filename + '.json')
+         json_file = json.load(open(json_path, "r", encoding="utf-8"))
+         for item in json_file["shapes"]:
+             label_class = item['label']
+             if label_class not in classes:
+                 classes.append(label_class)
+     print('read file done')
+     return classes
+
+
+ '''
+ Split into training, validation and test sets
+ '''
+ def split_dataset(label_path, test_size=0.3, isUseTest=False, useNumpyShuffle=False):
+     """
+     Split the files into training, validation and test sets
+     :param useNumpyShuffle: split the dataset with numpy shuffling
+     :param test_size: fraction used for the test or validation split
+     :param isUseTest: whether to create a test set, False by default
+     :param label_path: current data path
+     :return:
+     """
+     # collect all JSON files
+     files = glob(label_path + "\\*.json")
+     files = [i.replace("\\", "/").split("/")[-1].split(".json")[0] for i in files]
+
+     if useNumpyShuffle:
+         file_length = len(files)
+         index = np.arange(file_length)
+         np.random.seed(32)
+         np.random.shuffle(index)  # random split
+
+         test_files = None
+         # whether a test set is used
+         if isUseTest:
+             trainval_files, test_files = np.array(files)[index[:int(file_length * (1 - test_size))]], np.array(files)[
+                 index[int(file_length * (1 - test_size)):]]
+         else:
+             trainval_files = files
+         # split into training and validation sets
+         train_files, val_files = np.array(trainval_files)[index[:int(len(trainval_files) * (1 - test_size))]], \
+                                  np.array(trainval_files)[index[int(len(trainval_files) * (1 - test_size)):]]
+     else:
+         test_files = None
+         if isUseTest:
+             trainval_files, test_files = train_test_split(files, test_size=test_size, random_state=55)
+         else:
+             trainval_files = files
+         train_files, val_files = train_test_split(trainval_files, test_size=test_size, random_state=55)
+
+     return train_files, val_files, test_files, files
+
+
+ '''
+ Create the YOLOv5 train/valid/test folders
+ '''
+ def create_save_file(label_path=ROOT_DIR):
+     """
+     Create the image and label folders expected at training time
+     :param label_path: current data path
+     :return:
+     """
+     # training set
+     train_image = os.path.join(label_path, 'train', 'images')
+     if not os.path.exists(train_image):
+         os.makedirs(train_image)
+     train_label = os.path.join(label_path, 'train', 'labels')
+     if not os.path.exists(train_label):
+         os.makedirs(train_label)
+     # validation set
+     val_image = os.path.join(label_path, 'valid', 'images')
+     if not os.path.exists(val_image):
+         os.makedirs(val_image)
+     val_label = os.path.join(label_path, 'valid', 'labels')
+     if not os.path.exists(val_label):
+         os.makedirs(val_label)
+     # test set
+     test_image = os.path.join(label_path, 'test', 'images')
+     if not os.path.exists(test_image):
+         os.makedirs(test_image)
+     test_label = os.path.join(label_path, 'test', 'labels')
+     if not os.path.exists(test_label):
+         os.makedirs(test_label)
+     return train_image, train_label, val_image, val_label, test_image, test_label
+
+
+
+ '''
+ Convert a box: given the image size, return the normalized center point, width and height
+ '''
+ def convert(size, box):
+     # width scale
+     dw = 1. / (size[0])
+     # height scale
+     dh = 1. / (size[1])
+
+     x = (box[0] + box[1]) / 2.0 - 1
+     y = (box[2] + box[3]) / 2.0 - 1
+     # width
+     w = box[1] - box[0]
+     # height
+     h = box[3] - box[2]
+
+     x = x * dw
+     w = w * dw
+     y = y * dh
+     h = h * dh
+     return x, y, w, h
+
+
+ '''
+ Move images and label files into the train/valid/test folders
+ '''
+ def push_into_file(file, images, labels, label_path=ROOT_DIR, suffix='.jpg'):
+     """
+     Move the generated files into the image/label subfolders of the train/valid/test directories
+     :param file: list of file names
+     :param images: destination path for images
+     :param labels: destination path for labels
+     :param label_path: current data path
+     :param suffix: image file extension
+     :return:
+     """
+     # iterate over all files
+     for filename in file:
+         # image file
+         image_file = os.path.join(label_path, filename + suffix)
+         # label file
+         label_file = os.path.join(label_path, filename + '.txt')
+         # YOLOv5 image folder
+         if not os.path.exists(os.path.join(images, filename + suffix)):
+             try:
+                 shutil.move(image_file, images)
+             except OSError:
+                 pass
+         # YOLOv5 label folder
+         if not os.path.exists(os.path.join(labels, filename + suffix)):
+             try:
+                 shutil.move(label_file, labels)
+             except OSError:
+                 pass
+
+ '''
+ Convert labelme-style JSON annotations to YOLO txt labels
+ '''
+ def json2txt(classes, txt_Name='allfiles', label_path=ROOT_DIR, suffix='.png'):
+     """
+     Convert the JSON files to YOLO txt files and move the JSON files into a dedicated folder
+     :param classes: class names
+     :param txt_Name: name of the txt file listing all image paths
+     :param label_path: current data path
+     :param suffix: image file extension
+     :return:
+     """
+     store_json = os.path.join(label_path, 'json')
+     if not os.path.exists(store_json):
+         os.makedirs(store_json)
+
+     _, _, _, files = split_dataset(label_path)
+     if not os.path.exists(os.path.join(label_path, 'tmp')):
+         os.makedirs(os.path.join(label_path, 'tmp'))
+
+     list_file = open(os.path.join(label_path, 'tmp', '%s.txt' % txt_Name), 'w')
+     for json_file_ in tqdm(files):
+         json_filename = os.path.join(label_path, json_file_ + ".json")
+         imagePath = os.path.join(label_path, json_file_ + suffix)
+         list_file.write('%s\n' % imagePath)
+         out_file = open('%s/%s.txt' % (label_path, json_file_), 'w')
+         json_file = json.load(open(json_filename, "r", encoding="utf-8"))
+         if os.path.exists(imagePath):
+             height, width, channels = cv2.imread(imagePath).shape
+             for multi in json_file["shapes"]:
+                 if len(multi["points"][0]) == 0:
+                     out_file.write('')
+                     continue
+                 points = np.array(multi["points"])
+                 xmin = min(points[:, 0]) if min(points[:, 0]) > 0 else 0
+                 xmax = max(points[:, 0]) if max(points[:, 0]) > 0 else 0
+                 ymin = min(points[:, 1]) if min(points[:, 1]) > 0 else 0
+                 ymax = max(points[:, 1]) if max(points[:, 1]) > 0 else 0
+                 label = multi["label"]
+                 if xmax <= xmin:
+                     pass
+                 elif ymax <= ymin:
+                     pass
+                 else:
+                     cls_id = classes.index(label)
+                     b = (float(xmin), float(xmax), float(ymin), float(ymax))
+                     bb = convert((width, height), b)
+                     out_file.write(str(cls_id) + " " + " ".join([str(a) for a in bb]) + '\n')
+                     # print(json_filename, xmin, ymin, xmax, ymax, cls_id)
+         out_file.close()
+         if not os.path.exists(os.path.join(store_json, json_file_ + '.json')):
+             try:
+                 shutil.move(json_filename, store_json)
+             except OSError:
+                 pass
+     list_file.close()
+
+ '''
+ Create the data.yaml file
+ '''
+ def create_yaml(classes, label_path, isUseTest=False):
+     nc = len(classes)
+     if not isUseTest:
+         desired_caps = {
+             'path': label_path,
+             'train': 'train/images',
+             'val': 'valid/images',
+             'nc': nc,
+             'names': classes
+         }
+     else:
+         desired_caps = {
+             'path': label_path,
+             'train': 'train/images',
+             'val': 'valid/images',
+             'test': 'test/images',
+             'nc': nc,
+             'names': classes
+         }
+     yamlpath = os.path.join(label_path, "data" + ".yaml")
+
+     # write the yaml file
+     with open(yamlpath, "w+", encoding="utf-8") as f:
+         for key, val in desired_caps.items():
+             yaml.dump({key: val}, f, default_flow_style=False)
+
+
+ # First make sure every image in the folder shares one extension, e.g. .jpg; if another
+ # extension is used, change suffix accordingly, e.g. .png
+ def ChangeToYolo5(label_path=r"D:\storydata", suffix='.png', test_size=0.1, isUseTest=False):
+     """
+     Generate the final files in the standard YOLOv5 layout
+     :param test_size: fraction used for the test or validation split
+     :param label_path: current data path
+     :param suffix: file extension
+     :param isUseTest: whether to create a test set
+     :return:
+     """
+     # step 1: unify the image format
+     change_image_format(label_path)
+     # step 2: split train/valid/test according to the JSON files
+     train_files, val_files, test_file, files = split_dataset(label_path, test_size=test_size, isUseTest=isUseTest)
+     # step 3: collect all classes from the JSON files
+     classes = get_all_class(files, label_path=label_path)
+     # step 4: convert JSON to txt labels and move the JSON files into their own folder
+     json2txt(classes, label_path=label_path, suffix=suffix)
+     # step 5: create the yaml file needed for YOLOv5 training
+     create_yaml(classes, label_path, isUseTest=isUseTest)
+     # step 6: create the YOLOv5 train/valid/test folders
+     train_image, train_label, val_image, val_label, test_image, test_label = create_save_file(label_path)
+     # step 7: move all images and label files into the corresponding sets
+     push_into_file(train_files, train_image, train_label, suffix=suffix)  # move files into the training set
+     push_into_file(val_files, val_image, val_label, suffix=suffix)  # move files into the validation set
+     if test_file is not None:  # if a test set exists, move its files as well
+         push_into_file(test_file, test_image, test_label, suffix=suffix)
+     print('create dataset done')
+
+
+ if __name__ == "__main__":
+     ChangeToYolo5()
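A possible invocation with explicit arguments (a sketch; `D:\storydata` is the script's default path and is assumed to hold labelme-style `.json` annotations next to the `.png` frames):

```python
from transtoyolo import ChangeToYolo5

# builds train/valid(/test) image and label folders plus data.yaml under label_path
ChangeToYolo5(label_path=r"D:\storydata", suffix=".png", test_size=0.1, isUseTest=False)
# then train YOLOv5 on the generated config, e.g.:
#   python train.py --data D:/storydata/data.yaml --epochs 500 --weights '' --cfg yolov5l --batch-size 64
```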
v1-5-pruned-emaonly.safetensors DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:6ce0161689b3853acaa03779ec93eafe75a02f4ced659bee03f50797806fa2fa
3
- size 4265146304
v1-5-pruned.safetensors DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:1a189f0be69d6106a48548e7626207dddd7042a418dbf372cefd05e0cdba61b6
3
- size 7703324286