Spaces: Kayson/InstructDiffusion


TiankaiHang committed on Sep 24, 2023

Commit

7ae68fe

1 Parent(s): 8c7c06d

sync

This view is limited to 50 files because it contains too many changes.

Files changed (50)

LICENSE +9 -0
README.md +121 -12
app.py +336 -0
configs/instruct_diffusion.yaml +256 -0
dataset/README.md +62 -0
dataset/editing/edit_zip_dataset.py +494 -0
dataset/low_level/lowlevel_clwd.py +106 -0
dataset/low_level/lowlevel_gopro.py +106 -0
dataset/low_level/lowlevel_reds.py +111 -0
dataset/low_level/lowlevel_sidd.py +96 -0
dataset/pose/pose.py +760 -0
dataset/prompt/color_list_train_small.txt +17 -0
dataset/prompt/prompt_deblur.txt +10 -0
dataset/prompt/prompt_denoise.txt +10 -0
dataset/prompt/prompt_dewatermark.txt +10 -0
dataset/prompt/prompt_pose.txt +10 -0
dataset/prompt/prompt_seg.txt +11 -0
dataset/seg/coco_stuff.py +175 -0
dataset/seg/grefcoco.py +329 -0
dataset/seg/grefcoco_segmentation.py +149 -0
dataset/seg/refcoco.py +354 -0
dataset/seg/refcoco_segmentation.py +149 -0
dataset/utils/zip_manager.py +144 -0
edit_cli.py +136 -0
environment.yaml +40 -0
figure/animals.png +0 -0
figure/mirrorcat.jpg +0 -0
figure/people.jpg +0 -0
figure/watermark.png +0 -0
main.py +566 -0
scripts/convert_ckpt.py +51 -0
scripts/download_pretrained_sd.sh +7 -0
scripts/inference_example.sh +12 -0
scripts/run_multinode.sh +6 -0
stable_diffusion/LICENSE +82 -0
stable_diffusion/README.md +215 -0
stable_diffusion/Stable_Diffusion_v1_Model_Card.md +144 -0
stable_diffusion/assets/a-painting-of-a-fire.png +0 -0
stable_diffusion/assets/a-photograph-of-a-fire.png +0 -0
stable_diffusion/assets/a-shirt-with-a-fire-printed-on-it.png +0 -0
stable_diffusion/assets/a-shirt-with-the-inscription-'fire'.png +0 -0
stable_diffusion/assets/a-watercolor-painting-of-a-fire.png +0 -0
stable_diffusion/assets/birdhouse.png +0 -0
stable_diffusion/assets/fire.png +0 -0
stable_diffusion/assets/inpainting.png +0 -0
stable_diffusion/assets/modelfigure.png +0 -0
stable_diffusion/assets/rdm-preview.jpg +0 -0
stable_diffusion/assets/reconstruction1.png +0 -0
stable_diffusion/assets/reconstruction2.png +0 -0
stable_diffusion/assets/results.gif.REMOVED.git-id +1 -0

LICENSE ADDED

	@@ -0,0 +1,9 @@

+Copyright 2023 Authors of InstructDiffusion(https://arxiv.org/pdf/2309.03895.pdf)
+Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+Portions of code and models (such as pretrained checkpoints, which are fine-tuned starting from released Stable Diffusion checkpoints) are derived from the Stable Diffusion codebase (https://github.com/CompVis/stable-diffusion) and Instruct-pix2pix codebase (https://github.com/timothybrooks/instruct-pix2pix). Further restrictions may apply. Please consult the Stable Diffusion license `stable_diffusion/LICENSE` and the Instruct-pix2pix license `instruct-pix2pix/LICENSE`. Modified code is denoted as such in comments at the start of each file.

README.md CHANGED

@@ -1,12 +1,121 @@
----
-title: InstructDiffusion
-emoji: 📚
-colorFrom: purple
-colorTo: yellow
-sdk: gradio
-sdk_version: 3.44.4
-app_file: app.py
-pinned: false
----
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

+# InstructDiffusion: A Generalist Modeling Interface for Vision Tasks
+<p align="center">
+  <a href="https://gengzigang.github.io/instructdiffusion.github.io/">Project Page</a> |
+  <a href="https://arxiv.org/pdf/2309.03895.pdf">Arxiv</a> |
+  <a href="https://f605b16c6b183b13ac.gradio.live">Web Demo</a> |
+  <a href="#QuickStart">QuickStart</a> |
+  <a href="#Training">Training</a> |
+  <a href="#Acknowledge">Acknowledge</a> |
+  <a href='#Citation'>Citation</a>
+</p>
+<div align="center">
+  <img src="figure/teaser.png" width="1000"/>
+</div>
+This is the PyTorch implementation of InstructDiffusion, a unifying and generic framework for aligning computer vision tasks with human instructions. Our code is based on [Instruct-pix2pix](https://github.com/timothybrooks/instruct-pix2pix) and [CompVis/stable_diffusion](https://github.com/CompVis/stable-diffusion).<br>
+## QuickStart
+Follow the steps below to quickly edit your own images. The inference code in our repository needs **one GPU with more than 9GB of memory** to process images at a resolution of **512**.
+1. Clone this repo.
+2. Set up the conda environment:
+   ```
+   conda env create -f environment.yaml
+   conda activate instructdiff
+   ```
+3. We provide a well-trained [checkpoint](https://mailustceducn-my.sharepoint.com/:u:/g/personal/aa397601_mail_ustc_edu_cn/EZmXduulFidIhJD73SGcbOoBNpm18CJmU4PgPTS21RM2Ow?e=KqQYpO) and a [checkpoint](https://mailustceducn-my.sharepoint.com/:u:/g/personal/aa397601_mail_ustc_edu_cn/EWlNmyeS9P1BkRg_IlXbPbwBeNMQXQTcIA0pCokyd61UWg?e=iKfRdk) that has undergone human alignment. Download them into the `checkpoints` folder and try both.
+4. You can edit your own images:
+```bash
+python edit_cli.py --input example.jpg --edit "Transform it to van Gogh, starry night style."
+# Optionally, customize the run with parameters such as:
+# --resolution 512 --steps 50 --config configs/instruct_diffusion.yaml --ckpt YOUR_CHECKPOINT --cfg-text 3.5 --cfg-image 1.25
+# The input can also be an image URL, for example:
+python edit_cli.py --input "https://wallup.net/wp-content/uploads/2016/01/207131-animals-nature-lion.jpg" \
+   --edit "Transform it to van Gogh, starry night style." \
+   --resolution 512 --steps 50 \
+   --config configs/instruct_diffusion.yaml \
+   --ckpt checkpoints/v1-5-pruned-emaonly-adaption-task-humanalign.ckpt \
+   --outdir logs/
+```
+For other tasks, recommended parameter settings can be found in [`scripts/inference_example.sh`](./scripts/inference_example.sh).
+5. (Optional) You can launch your own interactive editing Gradio app:
+```bash
+python app.py
+# You can also specify the path to the checkpoint
+# The default checkpoint is checkpoints/v1-5-pruned-emaonly-adaption-task-humanalign.ckpt
+python app.py --ckpt checkpoints/v1-5-pruned-emaonly-adaption-task-humanalign.ckpt
+```
+## Training
+The code was developed with Python 3.8 on Ubuntu 18.04 and tested on 48 NVIDIA V100 GPUs, each with 32GB of memory. Other platforms are not fully tested.
+### Installation
+1. Clone this repo.
+2. Set up the conda environment:
+   ```
+   conda env create -f environment.yaml
+   conda activate instructdiff
+   ```
+### Pre-trained Model Preparation
+Download the official pre-trained Stable Diffusion model with the command below, or download the model produced by our pretraining adaptation process from [OneDrive](https://mailustceducn-my.sharepoint.com/:u:/g/personal/aa397601_mail_ustc_edu_cn/EXJSMIpFev5Nj0kuKI88U1IBZDSjegp3G8ukku0OxRRjFQ?e=QhnnB4) and put it in `stable_diffusion/models/ldm/stable-diffusion-v1/`.
+   ```
+   bash scripts/download_pretrained_sd.sh
+   ```
+### Data Preparation
+You can refer to the [dataset](https://github.com/cientgu/InstructDiffusion/tree/main/dataset) to prepare your data.
+### Training Command
+For multi-GPU training on a single machine, you can use the following command:
+   ```
+   python -m torch.distributed.launch --nproc_per_node=8 main.py --name v0 --base configs/instruct_diffusion.yaml --train --logdir logs/instruct_diffusion
+   ```
+For multi-GPU training on multiple machines, you can use the following command (assuming 6 machines as an example):
+   ```
+   bash scripts/run_multinode.sh instruct_diffusion v0 6
+   ```
+### Convert EMA-Model
+You can get the final EMA checkpoint for inference using the command below:
+   ```
+   python scripts/convert_ckpt.py --ema-ckpt logs/instruct_diffusion/checkpoint/ckpt_epoch_200/state.pth --out-ckpt checkpoints/v1-5-pruned-emaonly-adaption-task.ckpt
+   ```
+## Acknowledge
+Thanks to
+- [Stable-diffusion](https://github.com/CompVis/stable-diffusion)
+- [Instruct-pix2pix](https://github.com/timothybrooks/instruct-pix2pix)
+## Citation
+```
+@article{Geng23instructdiff,
+  author       = {Zigang Geng and
+                  Binxin Yang and
+                  Tiankai Hang and
+                  Chen Li and
+                  Shuyang Gu and
+                  Ting Zhang and
+                  Jianmin Bao and
+                  Zheng Zhang and
+                  Han Hu and
+                  Dong Chen and
+                  Baining Guo},
+  title        = {InstructDiffusion: {A} Generalist Modeling Interface for Vision Tasks},
+  journal      = {CoRR},
+  volume       = {abs/2309.03895},
+  year         = {2023},
+  url          = {https://doi.org/10.48550/arXiv.2309.03895},
+  doi          = {10.48550/arXiv.2309.03895},
+}
+```
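
Reviewer note on the QuickStart commands: the sketch below assembles an `edit_cli.py` invocation from the flags the README documents (`--input`, `--edit`, `--resolution`, `--steps`, `--config`, `--ckpt`, `--cfg-text`, `--cfg-image`, `--outdir`). The helper name `build_edit_command` and its defaults are assumptions made for illustration and are not part of the repository. It only builds the argument list and does not run the model.

```python
import shlex


def build_edit_command(input_path, edit, ckpt=None, resolution=512, steps=50,
                       config="configs/instruct_diffusion.yaml",
                       cfg_text=None, cfg_image=None, outdir=None):
    """Assemble an edit_cli.py argument list from the README's flags.

    Hypothetical helper: flags whose value is None are left out so that
    edit_cli.py falls back to its own defaults.
    """
    argv = [
        "python", "edit_cli.py",
        "--input", input_path,
        "--edit", edit,
        "--resolution", str(resolution),
        "--steps", str(steps),
        "--config", config,
    ]
    optional = [
        ("--ckpt", ckpt),
        ("--cfg-text", cfg_text),
        ("--cfg-image", cfg_image),
        ("--outdir", outdir),
    ]
    for flag, value in optional:
        if value is not None:
            argv += [flag, str(value)]
    return argv


argv = build_edit_command(
    "example.jpg",
    "Transform it to van Gogh, starry night style.",
    cfg_text=3.5,
    cfg_image=1.25,
)
# shlex.join quotes the instruction so it can be pasted into a shell.
print(shlex.join(argv))
```

Passing the list to `subprocess.run(argv, check=True)` would avoid shell quoting problems with instructions that contain quotes or commas.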

app.py ADDED

	@@ -0,0 +1,336 @@

+# --------------------------------------------------------
+# InstructDiffusion
+# Based on instruct-pix2pix (https://github.com/timothybrooks/instruct-pix2pix)
+# Modified by Tiankai Hang (tkhang@seu.edu.cn)
+# --------------------------------------------------------
+import os
+import sys
+import re
+import math
+import numpy as np
+import random
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from omegaconf import OmegaConf
+from torch import autocast
+import einops
+from einops import rearrange
+import gradio as gr
+import k_diffusion as K
+import requests
+from functools import partial
+from copy import deepcopy
+from PIL import Image, ImageOps
+import click
+sys.path.append("./stable_diffusion")
+from stable_diffusion.ldm.util import instantiate_from_config
+def load_model_from_config(config, ckpt, vae_ckpt=None, verbose=False):
+    model = instantiate_from_config(config.model)
+    print(f"Loading model from {ckpt}")
+    pl_sd = torch.load(ckpt, map_location="cpu")
+    if 'state_dict' in pl_sd:
+        pl_sd = pl_sd['state_dict']
+    m, u = model.load_state_dict(pl_sd, strict=False)
+    print(m, u)
+    return model
+def read_content(file_path: str) -> str:
+    """Read and return the full text of the target file.
+    """
+    with open(file_path, 'r', encoding='utf-8') as f:
+        content = f.read()
+    return content
+def get_header():
+    content = """
+    <div style="text-align: center; max-width: 650px; margin: 0 auto;">
+    <div style="
+            display: inline-flex;
+            gap: 0.8rem;
+            font-size: 1.75rem;
+            justify-content: center;
+            margin-bottom: 10px;
+        ">
+        <h1 style="font-weight: 900; align-items: center; margin-bottom: 7px; margin-top: 20px;">
+        InstructDiffusion 🎨
+        </h1>
+    </div>
+    <div>
+        <p style="align-items: center; margin-bottom: 7px;">
+        Upload a source image and write an instruction to run keypoint detection, referring segmentation, or image editing with InstructDiffusion.
+        </p>
+        <p style="align-items: center; margin-bottom: 7px;">
+        The paper is available on <a style="text-decoration: underline;" href="https://arxiv.org/pdf/2309.03895.pdf">arXiv</a>. If you like this demo, please ⭐ the <a style="text-decoration: underline;" href="https://github.com/cientgu/InstructDiffusion">GitHub repo</a> 😊.
+        </p>
+    </div>
+    </div>
+    """
+    return content
+class CFGDenoiser(nn.Module):
+    def __init__(self, model):
+        super().__init__()
+        self.inner_model = model
+    def forward(self, z, sigma, cond, uncond, text_cfg_scale, image_cfg_scale):
+        cfg_z = einops.repeat(z, "1 ... -> n ...", n=3)
+        cfg_sigma = einops.repeat(sigma, "1 ... -> n ...", n=3)
+        cfg_cond = {
+            "c_crossattn": [torch.cat([cond["c_crossattn"][0], uncond["c_crossattn"][0], cond["c_crossattn"][0]])],
+            "c_concat": [torch.cat([cond["c_concat"][0], cond["c_concat"][0], uncond["c_concat"][0]])],
+        }
+        out_cond, out_img_cond, out_txt_cond = self.inner_model(cfg_z, cfg_sigma, cond=cfg_cond).chunk(3)
+        return 0.5 * (out_img_cond + out_txt_cond) + text_cfg_scale * (out_cond - out_img_cond) + image_cfg_scale * (out_cond - out_txt_cond)
+def predict(
+        model, model_wrap,
+        model_wrap_cfg,
+        null_token, resolution,
+        input_img, edit, seed, steps, cfg_text, cfg_image,
+        stochastic_steps=0, sampler="euler", additional={}):
+    # set seed
+    random.seed(seed)
+    np.random.seed(seed)
+    torch.manual_seed(seed)
+    torch.cuda.manual_seed(seed)
+    torch.cuda.empty_cache()
+    if isinstance(input_img, str):
+        if input_img.startswith("http"):
+            input_image = Image.open(requests.get(input_img, stream=True).raw).convert("RGB")
+        else:
+            input_image = Image.open(input_img).convert("RGB")
+        width, height = input_image.size
+        factor = resolution / max(width, height)
+        width = int((width * factor) // 64) * 64
+        height = int((height * factor) // 64) * 64
+        if hasattr(Image, "Resampling"):
+            input_image = ImageOps.fit(input_image, (width, height), method=Image.Resampling.LANCZOS)
+        else:
+            input_image = ImageOps.fit(input_image, (width, height), method=Image.LANCZOS)
+        input_image = 2 * torch.tensor(np.array(input_image)).float() / 255 - 1
+        input_image = rearrange(input_image, "h w c -> 1 c h w").cuda()
+    # if PIL Image
+    elif isinstance(input_img, Image.Image):
+        input_image = input_img.convert("RGB")
+        width, height = input_image.size
+        factor = resolution / max(width, height)
+        # factor = math.ceil(min(width, height) * factor / 64) * 64 / min(width, height)
+        width = int((width * factor) // 64) * 64
+        height = int((height * factor) // 64) * 64
+        if hasattr(Image, "Resampling"):
+            input_image = ImageOps.fit(input_image, (width, height), method=Image.Resampling.LANCZOS)
+        else:
+            input_image = ImageOps.fit(input_image, (width, height), method=Image.LANCZOS)
+        input_image = 2 * torch.tensor(np.array(input_image)).float() / 255 - 1
+        input_image = rearrange(input_image, "h w c -> 1 c h w").cuda()
+    elif isinstance(input_img, dict):
+        input_image = input_img["image"].convert("RGB")
+        width, height = input_image.size
+        factor = resolution / max(width, height)
+        width = int((width * factor) // 64) * 64
+        height = int((height * factor) // 64) * 64
+        if hasattr(Image, "Resampling"):
+            input_image = ImageOps.fit(input_image, (width, height), method=Image.Resampling.LANCZOS)
+        else:
+            input_image = ImageOps.fit(input_image, (width, height), method=Image.LANCZOS)
+        input_image = 2 * torch.tensor(np.array(input_image)).float() / 255 - 1
+        input_image = rearrange(input_image, "h w c -> 1 c h w").cuda()
+    else:
+        raise TypeError(f"Unsupported input image type: {type(input_img)!r}")
+    print(input_image.shape, factor, width, height)
+    with torch.no_grad(), autocast("cuda"):
+        cond = {}
+        cond["c_crossattn"] = [model.get_learned_conditioning([edit])]
+        cond["c_concat"] = [model.encode_first_stage(input_image).mode()]
+        uncond = {}
+        if "txt_embed" in additional:
+            uncond["c_crossattn"] = [additional["txt_embed"].cuda().unsqueeze(0)]
+        else:
+            uncond["c_crossattn"] = [null_token]
+        if "img_embed" in additional:
+            # uncond["c_concat"] = [additional["img_embed"].cuda()]
+            # resize to cond["c_concat"][0]
+            uncond["c_concat"] = [additional["img_embed"].cuda()]
+            uncond["c_concat"][0] = F.interpolate(uncond["c_concat"][0], size=cond["c_concat"][0].shape[-2:], mode="bilinear", align_corners=False)
+        else:
+            uncond["c_concat"] = [torch.zeros_like(cond["c_concat"][0])]
+        sigmas = model_wrap.get_sigmas(steps)
+        extra_args = {
+            "cond": cond,
+            "uncond": uncond,
+            "text_cfg_scale": cfg_text,
+            "image_cfg_scale": cfg_image,
+        }
+        if stochastic_steps <= 0:
+            z = torch.randn_like(cond["c_concat"][0]) * sigmas[0]
+            if sampler == "euler":
+                z = K.sampling.sample_euler_ancestral(model_wrap_cfg, z, sigmas, extra_args=extra_args)
+            elif sampler == "heun":
+                z = K.sampling.sample_heun(model_wrap_cfg, z, sigmas, extra_args=extra_args)
+        else:
+            z = torch.randn_like(cond["c_concat"][0]) * sigmas[stochastic_steps] + cond["c_concat"][0]
+            z = K.sampling.sample_euler_ancestral(model_wrap_cfg, z, sigmas[stochastic_steps:], extra_args=extra_args)
+        x = model.decode_first_stage(z)
+        x = torch.clamp((x + 1.0) / 2.0, min=0.0, max=1.0)
+        x = 255.0 * rearrange(x, "1 c h w -> h w c")
+        edited_image = Image.fromarray(x.type(torch.uint8).cpu().numpy())
+        # input_image to PIL
+        input_image = torch.clamp((input_image + 1.0) / 2.0, min=0.0, max=1.0)
+        input_image = 255.0 * rearrange(input_image, "1 c h w -> h w c")
+        input_image = Image.fromarray(input_image.type(torch.uint8).cpu().numpy())
+        return edited_image
+@click.command()
+@click.option("--ckpt", type=str, default="checkpoints/v1-5-pruned-emaonly-adaption-task-humanalign.ckpt")
+def main(ckpt="checkpoints/v1-5-pruned-emaonly-adaption-task-humanalign.ckpt"):
+    css = '''
+    .container {max-width: 1150px;margin: auto;padding-top: 1.5rem}
+    #image_upload{min-height:400px}
+    #image_upload [data-testid="image"], #image_upload [data-testid="image"] > div{min-height: 400px}
+    #mask_radio .gr-form{background:transparent; border: none}
+    #word_mask{margin-top: .75em !important}
+    #word_mask textarea:disabled{opacity: 0.3}
+    .footer {margin-bottom: 45px;margin-top: 35px;text-align: center;border-bottom: 1px solid #e5e5e5}
+    .footer>p {font-size: .8rem; display: inline-block; padding: 0 10px;transform: translateY(10px);background: white}
+    .dark .footer {border-color: #303030}
+    .dark .footer>p {background: #0b0f19}
+    .acknowledgments h4{margin: 1.25em 0 .25em 0;font-weight: bold;font-size: 115%}
+    #image_upload .touch-none{display: flex}
+    @keyframes spin {
+        from {
+            transform: rotate(0deg);
+        }
+        to {
+            transform: rotate(360deg);
+        }
+    }
+    #share-btn-container {
+        display: flex; padding-left: 0.5rem !important; padding-right: 0.5rem !important; background-color: #000000; justify-content: center; align-items: center; border-radius: 9999px !important; width: 13rem;
+    }
+    #share-btn {
+        all: initial; color: #ffffff;font-weight: 600; cursor:pointer; font-family: 'IBM Plex Sans', sans-serif; margin-left: 0.5rem !important; padding-top: 0.25rem !important; padding-bottom: 0.25rem !important;
+    }
+    #share-btn * {
+        all: unset;
+    }
+    #share-btn-container div:nth-child(-n+2){
+        width: auto !important;
+        min-height: 0px !important;
+    }
+    #share-btn-container .wrap {
+        display: none !important;
+    }
+    '''
+    config = OmegaConf.load("configs/instruct_diffusion.yaml")
+    # ckpt = "checkpoints/v1-5-pruned-emaonly-adaption-task-humanalign.ckpt"
+    if not os.path.exists(ckpt):
+        raise ValueError(f"Checkpoint {ckpt} does not exist")
+    vae_ckpt = None
+    model = load_model_from_config(config, ckpt, vae_ckpt)
+    model.eval().cuda()
+    model_wrap = K.external.CompVisDenoiser(model)
+    model_wrap_cfg = CFGDenoiser(model_wrap)
+    null_token = model.get_learned_conditioning([""])
+    image_blocks = gr.Blocks(css=css)
+    with image_blocks as demo:
+        gr.HTML(get_header())
+        with gr.Group():
+            with gr.Box():
+                with gr.Row():
+                    with gr.Column():
+                        image = gr.Image(source='upload', tool=None, elem_id="image_upload", type="pil", label="Source Image")
+                        instruction = gr.Textbox(lines=3, placeholder="Enter text to edit", label="Text")
+                        cfg_text = gr.Slider(label="Guidance scale (TXT)", value=7.0, maximum=15,interactive=True)
+                        cfg_image = gr.Slider(label="Guidance scale (IMG)", value=1.25, maximum=15,interactive=True)
+                        steps = gr.Slider(label="Steps", value=50, minimum=2, maximum=75, step=1,interactive=True)
+                        resolution = gr.Slider(label="Resolution (long side)", value=512, minimum=256, maximum=768, step=64, interactive=True)
+                        seed = gr.Slider(0, 10000, label='Seed', value=0, step=1)
+                        with gr.Row(elem_id="prompt-container", mobile_collapse=False, equal_height=True):
+                            btn = gr.Button(
+                                "Edit!",
+                                margin=False,
+                                rounded=(False, True, True, False),
+                                full_width=True,
+                            )
+                    # output
+                    with gr.Column():
+                        image_out = gr.Image(label="Output", elem_id="output-img", height=400, show_download_button=True)
+                    partial_predict = partial(
+                        predict,
+                        model, model_wrap,
+                        model_wrap_cfg,
+                        null_token, # RESOLUTION
+                    )
+                    btn.click(
+                        fn=partial_predict,
+                        inputs=[
+                            resolution, image, instruction, seed, steps, cfg_text, cfg_image
+                        ],
+                        outputs=[image_out])
+                gr.HTML(
+                    """
+                        <div class="footer">
+                            <p>
+                            InstructDiffusion Demo
+                            </p>
+                        </div>
+                        <div class="acknowledgments">
+                                <h4>LICENSE</h4>
+                        <p>The model is licensed under a <a href="https://huggingface.co/spaces/CompVis/stable-diffusion-license" style="text-decoration: underline;" target="_blank">CreativeML Open RAIL-M</a> license. The authors claim no rights to the outputs you generate; you are free to use them and are accountable for their use, which must not go against the provisions set in this license. The license forbids sharing any content that violates any laws, harms a person, disseminates personal information meant for harm, spreads misinformation, or targets vulnerable groups. For the full list of restrictions, please <a href="https://huggingface.co/spaces/CompVis/stable-diffusion-license" target="_blank" style="text-decoration: underline;">read the license</a>.</p></div>
+                    """
+                )
+    image_blocks.queue().launch(share=True, max_threads=1)
+if __name__ == "__main__":
+    main()
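
Reviewer note on `CFGDenoiser.forward`: the return expression mixes three predictions from one batched call. The batch order in the diff is (text + image), (no text + image), (text + no image), so after `chunk(3)` the names `out_cond`, `out_img_cond`, and `out_txt_cond` refer to those three cases. The sketch below reproduces only that arithmetic on plain Python lists so the behaviour of the two scales can be checked by hand. The function name `combine_guidance` is an assumption for illustration, and nothing here touches the real model.

```python
def combine_guidance(out_cond, out_img_cond, out_txt_cond,
                     text_cfg_scale, image_cfg_scale):
    """Elementwise version of the return expression in CFGDenoiser.forward.

    out_cond:     prediction with text and image conditioning
    out_img_cond: prediction with image conditioning only (text dropped)
    out_txt_cond: prediction with text conditioning only (image dropped)
    """
    return [
        0.5 * (i + t)                   # average of the two partial predictions
        + text_cfg_scale * (c - i)      # push along the effect of the text
        + image_cfg_scale * (c - t)     # push along the effect of the image
        for c, i, t in zip(out_cond, out_img_cond, out_txt_cond)
    ]


# If all three predictions agree, both guidance terms vanish.
same = combine_guidance([2.0, -1.0], [2.0, -1.0], [2.0, -1.0], 7.0, 1.25)

# With the demo's default scales, a unit difference is amplified by 7.0 + 1.25.
pushed = combine_guidance([1.0], [0.0], [0.0], 7.0, 1.25)
print(same, pushed)
```

With both scales set to zero the result is just the average of the two partially conditioned predictions, which is why the sliders in the demo start well above zero.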

configs/instruct_diffusion.yaml ADDED

	@@ -0,0 +1,256 @@

+# File modified by authors of InstructDiffusion from original (https://github.com/CompVis/stable-diffusion).
+# See more details in LICENSE.
+model:
+  base_learning_rate: 1.0e-04
+  weight_decay: 0.01
+  target: ldm.models.diffusion.ddpm_edit.LatentDiffusion
+  params:
+    fp16: True
+    deepspeed: 'deepspeed_1'
+    ckpt_path: stable_diffusion/models/ldm/stable-diffusion-v1/v1-5-pruned-emaonly-adaption.ckpt
+    linear_start: 0.00085
+    linear_end: 0.0120
+    num_timesteps_cond: 1
+    log_every_t: 200
+    timesteps: 1000
+    first_stage_key: edited
+    cond_stage_key: edit
+    image_size: 32
+    channels: 4
+    cond_stage_trainable: false   # Note: different from the one we trained before
+    conditioning_key: hybrid
+    monitor: val/loss_simple_ema
+    scale_factor: 0.18215
+    scheduler_config: # constant learning rate, no warmup (warm_up_steps is 0)
+      target: ldm.lr_scheduler.LambdaLinearScheduler
+      params:
+        warm_up_steps: [ 0 ]
+        cycle_lengths: [ 10000000000000 ] # incredibly large number to prevent corner cases
+        f_start: [ 1.e-6 ]
+        f_max: [ 1. ]
+        f_min: [ 1. ]
+    unet_config:
+      target: ldm.modules.diffusionmodules.openaimodel.UNetModel
+      params:
+        image_size: 32 # unused
+        in_channels: 8
+        out_channels: 4
+        model_channels: 320
+        attention_resolutions: [ 4, 2, 1 ]
+        num_res_blocks: 2
+        channel_mult: [ 1, 2, 4, 4 ]
+        num_heads: 8
+        use_spatial_transformer: True
+        transformer_depth: 1
+        context_dim: 768
+        use_checkpoint: True
+        legacy: False
+        force_type_convert: True
+    first_stage_config:
+      target: ldm.models.autoencoder.AutoencoderKL
+      params:
+        embed_dim: 4
+        monitor: val/rec_loss
+        ddconfig:
+          double_z: true
+          z_channels: 4
+          resolution: 256
+          in_channels: 3
+          out_ch: 3
+          ch: 128
+          ch_mult:
+          - 1
+          - 2
+          - 4
+          - 4
+          num_res_blocks: 2
+          attn_resolutions: []
+          dropout: 0.0
+        lossconfig:
+          target: torch.nn.Identity
+    cond_stage_config:
+      target: ldm.modules.encoders.modules.FrozenCLIPEmbedder
+data:
+  target: main.DataModuleFromConfig
+  params:
+    batch_size: 64
+    num_workers: 4
+    train:
+      - ds1:
+        target: dataset.pose.pose.MPIIDataset
+        params:
+          root: data/mpii/
+          image_set: train
+          is_train: True
+          max_prompt_num: 5
+          min_prompt_num: 1
+          radius: 10
+      - ds2:
+        target: dataset.pose.pose.COCODataset
+        params:
+          root: data/coco/
+          image_set: train2017
+          is_train: True
+          max_prompt_num: 5
+          min_prompt_num: 1
+          radius: 10
+      - ds3:
+        target: dataset.pose.pose.CrowdPoseDataset
+        params:
+          root: data/crowdpose/
+          image_set: train
+          is_train: True
+          max_prompt_num: 5
+          min_prompt_num: 1
+          radius: 10
+      - ds4:
+        target: dataset.pose.pose.AICDataset
+        params:
+          root: data/aic/
+          image_set: train
+          is_train: True
+          max_prompt_num: 5
+          min_prompt_num: 1
+          radius: 10
+          sample_weight: 0.1
+      - ds5:
+        target: dataset.seg.coco_stuff.COCOStuffDataset
+        params:
+          path: data/coco-stuff
+          split: train2017
+          crop_res: 256
+          flip_prob: 0.5
+          transparency: 0.5
+          empty_percentage: 0.2
+      - ds6:
+        target: dataset.seg.grefcoco_segmentation.GrefCOCODataset
+        params:
+          path: data/coco_2014
+          split: train
+          min_resize_res: 256
+          max_resize_res: 256
+          crop_res: 256
+          flip_prob: 0.0
+          transparency: 0.5
+      - ds7:
+        target: dataset.seg.refcoco_segmentation.RefCOCODataset
+        params:
+          path: data/coco_2014
+          split: train
+          crop_res: 256
+          flip_prob: 0.0
+          transparency: 0.5
+      - ds8:
+        target: dataset.low_level.lowlevel_gopro.GoPro
+        params:
+          path: data/GoPro
+          split: train
+          size: 256
+          flip_prob: 0.5
+          interpolation: pil_lanczos
+          sample_weight: 2.0
+      - ds9:
+        target: dataset.low_level.lowlevel_reds.REDS
+        params:
+          path: data/REDS
+          split: train
+          size: 256
+          flip_prob: 0.5
+          interpolation: pil_lanczos
+          sample_weight: 0.2
+      - ds10:
+        target: dataset.low_level.lowlevel_sidd.SIDD
+        params:
+          path: data/SIDD
+          split: train
+          size: 256
+          flip_prob: 0.5
+          interpolation: pil_lanczos
+          sample_weight: 20
+      - ds11:
+        target: dataset.low_level.lowlevel_clwd.CLWD
+        params:
+          path: data/CLWD
+          split: train
+          size: 256
+          flip_prob: 0.5
+          interpolation: pil_lanczos
+          sample_weight: 0.2
+      - ds12:
+        target: dataset.editing.edit_zip_dataset.FilteredIP2PDataset
+        params:
+          path: data/clip-filtered-dataset
+          split: train
+          min_resize_res: 256
+          max_resize_res: 256
+          crop_res: 256
+          flip_prob: 0.5
+          sample_weight: 0.2
+      - ds13:
+        target: dataset.editing.edit_zip_dataset.GIERDataset
+        params:
+          path: data/GIER_editing_data/
+          split: train
+          min_resize_res: 256
+          max_resize_res: 256
+          crop_res: 256
+          flip_prob: 0.0
+          zip_start_index: 0
+          zip_end_index: 100
+          sample_weight: 2.0
+      - ds14:
+        target: dataset.editing.edit_zip_dataset.GQAInpaintDataset
+        params:
+          path: data/gqa-inpaint
+          min_resize_res: 256
+          max_resize_res: 256
+          crop_res: 256
+          flip_prob: 0.0
+      - ds15:
+        target: dataset.editing.edit_zip_dataset.MagicBrushDataset
+        params:
+          path: data/MagicBrush/
+          split: train
+          min_resize_res: 256
+          max_resize_res: 256
+          crop_res: 256
+          flip_prob: 0.5
+          zip_start_index: 0
+          zip_end_index: 100
+      - ds16:
+        target: dataset.editing.edit_zip_dataset.IEIWDataset
+        params:
+          path: data/ieiw/
+          split: train
+          min_resize_res: 256
+          max_resize_res: 256
+          crop_res: 256
+          flip_prob: 0.5
+    validation:
+      target: dataset.pose.pose.COCODataset
+      params:
+        root: data/coco/
+        image_set: val2017
+        is_train: False
+        max_prompt_num: 5
+        min_prompt_num: 1
+        radius: 10
+trainer:
+  initial_scale: 13
+  max_epochs: 200
+  save_freq: 5
+  accumulate_grad_batches: 1
+  clip_grad: 0.0
+  optimizer: adamw
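
Reviewer note on `linear_start`, `linear_end`, and `timesteps`: the config does not name a beta schedule. Assuming the CompVis latent-diffusion default (the schedule called "linear" there, which interpolates linearly in the square root of beta and then squares), the noise schedule implied by these three values can be sketched in plain Python as follows. The function name `sqrt_linear_betas` is hypothetical, and the assumption about the default schedule should be checked against `ldm/modules/diffusionmodules/util.py` in the vendored `stable_diffusion` directory.

```python
import math


def sqrt_linear_betas(linear_start=0.00085, linear_end=0.0120, timesteps=1000):
    """Betas spaced linearly in sqrt(beta), then squared (assumed default)."""
    lo, hi = math.sqrt(linear_start), math.sqrt(linear_end)
    step = (hi - lo) / (timesteps - 1)
    return [(lo + i * step) ** 2 for i in range(timesteps)]


betas = sqrt_linear_betas()

# Cumulative product of (1 - beta): the fraction of signal variance
# that survives up to each timestep.
alphas_cumprod = []
running = 1.0
for beta in betas:
    running *= 1.0 - beta
    alphas_cumprod.append(running)

print(betas[0], betas[-1], alphas_cumprod[-1])
```

Under this assumption the first and last betas match `linear_start` and `linear_end`, and almost no signal remains at the final timestep.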

dataset/README.md ADDED

	@@ -0,0 +1,62 @@

+You can download these datasets: [COCO](http://cocodataset.org/#download), [CrowdPose](https://github.com/Jeff-sjtu/CrowdPose#dataset), [MPII](http://human-pose.mpi-inf.mpg.de/), [AIC](https://arxiv.org/abs/1711.06475), [COCO-Stuff](https://github.com/nightrome/cocostuff), [RefCOCO](https://github.com/lichengunc/refer), [GrefCOCO](https://github.com/henghuiding/gRefCOCO), [GoPro](https://seungjunnah.github.io/Datasets/gopro), [REDS](https://seungjunnah.github.io/Datasets/reds.html), [SIDD](https://www.eecs.yorku.ca/~kamel/sidd/), [CLWD](https://arxiv.org/abs/2012.07616), [IP2PDataset](https://github.com/timothybrooks/instruct-pix2pix), [GIER](https://sites.google.com/view/gierdataset), [GQAInpaint](https://github.com/abyildirim/inst-inpaint), [MagicBrush](https://osu-nlp-group.github.io/MagicBrush/). The resulting data directory should look like this:
+    InstructDiffusion
+    |-- data
+    `-- |-- coco
+        |   |-- annotations
+        |   `-- images
+        |-- mpii
+        |   |-- annot
+        |   `-- images
+        |-- crowdpose
+        |   |-- json
+        |   `-- images
+        |-- aic
+        |   |-- annotations
+        |   `-- ai_challenger_keypoint_train_20170902
+        |
+        |-- coco-stuff
+        |   |-- annotations
+        |   |-- labels.txt
+        |   `-- images
+        |-- coco_2014
+        |   |-- grefcoco
+        |   |   |-- grefs(unc).json
+        |   |   `-- instances.json
+        |   |-- refcoco
+        |   |   |-- instances.json
+        |   |   |-- refs(google).p
+        |   |   `-- refs(unc).p
+        |   `-- images
+        |
+        |-- GoPro
+        |   |-- train
+        |   `-- test
+        |-- REDS
+        |   |-- train
+        |   `-- val
+        |-- SIDD
+        |   |-- train
+        |   `-- val
+        |-- CLWD
+        |   |-- train
+        |   |-- test
+        |   `-- watermark_logo
+        |
+        |-- clip-filtered-dataset
+        |   |-- shard-00.zip
+        |   |-- shard-01.zip
+        |   `-- ...
+        |-- GIER_editing_data
+        |   |-- images
+        |   `-- GIER.json
+        |-- gqa-inpaint
+        |   |-- images
+        |   |-- images_inpainted
+        |   |-- masks
+        |   |-- train_scenes.json
+        |   `-- meta_info.json
+        `-- MagicBrush
+            |-- data
+            |-- processed-train
+            `-- magic_train.json
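
Before launching training, it can help to confirm that this layout is in place. The following sketch is not part of the repository: `EXPECTED_DIRS` holds the first-level folder names copied from the tree above, and `missing_dataset_dirs` is a hypothetical helper. The demo runs against a throwaway directory so it works anywhere.

```python
import tempfile
from pathlib import Path
from typing import List

# First-level dataset folders, copied from the directory tree above.
EXPECTED_DIRS = (
    "coco", "mpii", "crowdpose", "aic", "coco-stuff", "coco_2014",
    "GoPro", "REDS", "SIDD", "CLWD",
    "clip-filtered-dataset", "GIER_editing_data", "gqa-inpaint", "MagicBrush",
)


def missing_dataset_dirs(data_root: Path) -> List[str]:
    """Return the expected folders that are absent under data_root."""
    return [name for name in EXPECTED_DIRS if not (data_root / name).is_dir()]


# Demo: create only two of the folders and report the rest as missing.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "data"
    for name in ("coco", "GoPro"):
        (root / name).mkdir(parents=True)
    missing = missing_dataset_dirs(root)
    print(f"{len(missing)} of {len(EXPECTED_DIRS)} folders missing:", missing)
```

In practice one would call `missing_dataset_dirs(Path("data"))` from the repository root and only download what the chosen config actually references.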

dataset/editing/edit_zip_dataset.py ADDED

	@@ -0,0 +1,494 @@

+# --------------------------------------------------------
+# InstructDiffusion
+# Based on instruct-pix2pix (https://github.com/timothybrooks/instruct-pix2pix)
+# Modified by Tiankai Hang (tkhang@seu.edu.cn)
+# --------------------------------------------------------
+from __future__ import annotations
+import os
+import json
+import math
+from pathlib import Path
+from typing import Any
+import numpy as np
+import torch
+import torchvision
+from einops import rearrange
+import PIL
+from PIL import Image
+from torch.utils.data import Dataset
+from tqdm.auto import tqdm
+import random
+from dataset.utils.zip_manager import MultipleZipManager
+if hasattr(Image, "Resampling"):
+    # Pillow >= 9.1 exposes the resampling filters under Image.Resampling
+    RESAMPLING_METHOD = Image.Resampling.LANCZOS
+else:
+    RESAMPLING_METHOD = Image.LANCZOS
+class FilteredIP2PDataset(Dataset):
+    def __init__(
+        self,
+        path: str,
+        split: str = "train",
+        splits: tuple[float, float, float] = (0.9, 0.05, 0.05),
+        min_resize_res: int = 256,
+        max_resize_res: int = 256,
+        crop_res: int = 256,
+        flip_prob: float = 0.0,
+        zip_start_index: int = 0,
+        zip_end_index: int = 30,
+        instruct: bool = False,
+        max_num_images = None,
+        sample_weight: float = 1.0,
+        reverse_version: bool = False,
+        **kwargs
+    ):
+        assert split in ("train", "val", "test")
+        assert sum(splits) == 1
+        self.path = path
+        self.min_resize_res = min_resize_res
+        self.max_resize_res = max_resize_res
+        self.crop_res = crop_res
+        self.flip_prob = flip_prob
+        self.instruct = instruct
+        zip_list = []
+        for i in range(zip_start_index, zip_end_index):
+            name = "shard-"+str(i).zfill(2)+'.zip'
+            zip_list.append(os.path.join(self.path, name))
+        self.image_dataset = MultipleZipManager(zip_list, 'image', sync=True)   # sync=True is faster
+        with open(Path(self.path, "seeds.json")) as f:
+            self.seeds = json.load(f)
+        split_0, split_1 = {
+            "train": (0.0, splits[0]),
+            "val": (splits[0], splits[0] + splits[1]),
+            "test": (splits[0] + splits[1], 1.0),
+        }[split]
+        idx_0 = math.floor(split_0 * len(self.seeds))
+        idx_1 = math.floor(split_1 * len(self.seeds))
+        self.seeds = self.seeds[idx_0:idx_1]
+        if max_num_images is not None and max_num_images > 0:
+            self.seeds = self.seeds[:min(max_num_images, len(self.seeds))]
+        # flatten seeds
+        self.seeds = [(name, seed) for name, seeds in self.seeds for seed in seeds]
+        self.sample_weight = sample_weight
+        while True:
+            try:
+                with open('filtered_ids_ip2p.json') as json_file:
+                    filtered_ids = json.load(json_file)
+                break
+            except (FileNotFoundError, json.JSONDecodeError):
+                # the id list is missing or unreadable: download it, then retry
+                if reverse_version:
+                    os.system('wget https://github.com/TiankaiHang/storage/releases/download/readout/filtered_ids_ip2p.json')
+                else:
+                    os.system("wget https://github.com/TiankaiHang/storage/releases/download/readout/filtered-ip2p-thres5.5-0.5.json -O filtered_ids_ip2p.json")
+        print("seeds:", len(self.seeds))
+        # keep only the seeds whose "name/seed" id appears in filtered_ids;
+        # np.isin is vectorized, so it is faster than filtering in a Python loop
+        _seeds = [f"{a}/{b}" for a, b in self.seeds]
+        self.seeds = np.array(self.seeds)
+        _seeds = np.array(_seeds)
+        self.seeds = self.seeds[np.isin(_seeds, filtered_ids)]
+        self.seeds = self.seeds.tolist()
+        self.return_add_kwargs = kwargs.get("return_add_kwargs", False)
+    def __len__(self) -> int:
+        return int(len(self.seeds) * self.sample_weight)
+    def __getitem__(self, i: int) -> dict[str, Any]:
+        # name, seeds = self.seeds[i]
+        if self.sample_weight >= 1:
+            i = i % len(self.seeds)
+        else:
+            remainder = math.ceil(i / self.sample_weight - int(i / self.sample_weight))
+            i = int(i / self.sample_weight) + random.randint(0, int(1 / self.sample_weight) - 1 + remainder)
+        name, seed = self.seeds[i]
+        prompt_name = name + "/prompt.json"
+        # find the zip manager that holds this prompt and initialize it on first use
+        manager = self.image_dataset.managers[self.image_dataset.mapping[prompt_name]]
+        if not manager._init:
+            manager.initialize(close=False)
+        byteflow = manager.zip_fd.read(prompt_name)
+        texts = json.loads(byteflow.decode('utf-8'))
+        prompt = texts["edit"]
+        if self.instruct:
+            prompt = "Image Editing: " + prompt
+        text_input = texts["input"]
+        text_output = texts["output"]
+        # image_0 = Image.open(propt_dir.joinpath(f"{seed}_0.jpg"))
+        # image_1 = Image.open(propt_dir.joinpath(f"{seed}_1.jpg"))
+        image_0 = self.image_dataset.get(name+f"/{seed}_0.jpg")
+        image_1 = self.image_dataset.get(name+f"/{seed}_1.jpg")
+        resize_res = torch.randint(self.min_resize_res, self.max_resize_res + 1, ()).item()
+        image_0 = image_0.resize((resize_res, resize_res), RESAMPLING_METHOD)
+        image_1 = image_1.resize((resize_res, resize_res), RESAMPLING_METHOD)
+        image_0 = rearrange(2 * torch.tensor(np.array(image_0)).float() / 255 - 1, "h w c -> c h w")
+        image_1 = rearrange(2 * torch.tensor(np.array(image_1)).float() / 255 - 1, "h w c -> c h w")
+        crop = torchvision.transforms.RandomCrop(self.crop_res)
+        flip = torchvision.transforms.RandomHorizontalFlip(float(self.flip_prob))
+        image_0, image_1 = flip(crop(torch.cat((image_0, image_1)))).chunk(2)
+        if self.return_add_kwargs:
+            add_kwargs = dict(
+                name=name,
+                seed=seed,
+                text_input=text_input,
+                text_output=text_output,
+            )
+        else:
+            add_kwargs = {}
+        return dict(edited=image_1, edit=dict(c_concat=image_0, c_crossattn=prompt), **add_kwargs)
+class GIERDataset(Dataset):
+    def __init__(
+        self,
+        path: str,
+        split: str = "train",
+        splits: tuple[float, float, float] = (0.9, 0.05, 0.05),
+        min_resize_res: int = 256,
+        max_resize_res: int = 256,
+        crop_res: int = 256,
+        flip_prob: float = 0.0,
+        zip_start_index: int = 0,
+        zip_end_index: int = 30,
+        sample_weight: float = 1.0,
+        instruct: bool = False,
+    ):
+        assert split in ("train", "val", "test")
+        assert sum(splits) == 1
+        self.path = path
+        self.min_resize_res = min_resize_res
+        self.max_resize_res = max_resize_res
+        self.crop_res = crop_res
+        self.flip_prob = flip_prob
+        self.instruct = instruct
+        # self.meta = torch.load(Path(self.path, "GIER.json"), map_location="cpu")
+        # load json file
+        with open(Path(self.path, "GIER_new.json")) as json_file:
+            self.meta = json.load(json_file)
+        print(f"||||||||||||||||||||||||||||| \n Loaded {len(self.meta)} images from json file")
+        input_does_not_exist = []
+        output_does_not_exist = []
+        # filter out images that do not exist
+        if not os.path.exists(os.path.join(self.path, "filtered_meta_new.pt")):
+            filtered_meta = []
+            for i in tqdm(range(len(self.meta))):
+                input_path = os.path.join(self.path, "warped", self.meta[i]["input"])
+                output_path = os.path.join(self.path, "warped", self.meta[i]["output"])
+                if not os.path.exists(input_path):
+                    input_path = os.path.join(self.path, "images", self.meta[i]["input"])
+                    if not os.path.exists(input_path):
+                        input_does_not_exist.append(input_path)
+                if not os.path.exists(output_path):
+                    output_path = os.path.join(self.path, "images", self.meta[i]["output"])
+                    if not os.path.exists(output_path):
+                        output_does_not_exist.append(output_path)
+                if os.path.exists(input_path) and os.path.exists(output_path):
+                    filtered_meta.append(
+                        dict(
+                            input=input_path,
+                            output=output_path,
+                            prompts=self.meta[i]["prompts"],
+                        )
+                    )
+                else:
+                    print(f"\n {input_path} or {output_path} does not exist")
+            torch.save(filtered_meta, os.path.join(self.path, "filtered_meta_new.pt"))
+        else:
+            filtered_meta = torch.load(os.path.join(self.path, "filtered_meta_new.pt"), map_location="cpu")
+        self.meta = filtered_meta
+        print(f"||||||||||||||||||||||||||||| \n Filtered {len(self.meta)} images")
+        for i in range(len(self.meta)):
+            self.meta[i]['input'] = self.meta[i]['input'].replace('/mnt/external/datasets/GIER_editing_data/', self.path)
+            self.meta[i]['output'] = self.meta[i]['output'].replace('/mnt/external/datasets/GIER_editing_data/', self.path)
+        # record the missing input and output paths for later inspection
+        with open(Path(self.path, "input_does_not_exist.txt"), "w") as f:
+            for item in input_does_not_exist:
+                f.write("%s\n" % item)
+        with open(Path(self.path, "output_does_not_exist.txt"), "w") as f:
+            for item in output_does_not_exist:
+                f.write("%s\n" % item)
+        split_0, split_1 = {
+            "train": (0.0, splits[0]),
+            "val":   (splits[0], splits[0] + splits[1]),
+            "test":  (splits[0] + splits[1], 1.0),
+        }[split]
+        idx_0 = math.floor(split_0 * len(self.meta))
+        idx_1 = math.floor(split_1 * len(self.meta))
+        self.meta = self.meta[idx_0:idx_1]
+        self.sample_weight = sample_weight
+        print('original GIER', len(self.meta))
+    def __len__(self) -> int:
+        return int(len(self.meta) * self.sample_weight)
+    def __getitem__(self, i: int) -> dict[str, Any]:
+        if self.sample_weight >= 1:
+            i = i % len(self.meta)
+        else:
+            i = int(i / self.sample_weight) + random.randint(0, int(1 / self.sample_weight) - 1)
+        # prompt = self.meta[i]["prompts"]
+        prompt = random.choice(self.meta[i]["prompts"])
+        try:
+            image_0 = Image.open(self.meta[i]["input"]).convert("RGB")
+            image_1 = Image.open(self.meta[i]["output"]).convert("RGB")
+        except PIL.UnidentifiedImageError:
+            print(f"\n {self.meta[i]['input']} or {self.meta[i]['output']} is not a valid image")
+            i = random.randint(0, len(self.meta) - 1)
+            return self.__getitem__(i)
+        resize_res = torch.randint(self.min_resize_res, self.max_resize_res + 1, ()).item()
+        image_0 = image_0.resize((resize_res, resize_res), RESAMPLING_METHOD)
+        image_1 = image_1.resize((resize_res, resize_res), RESAMPLING_METHOD)
+        image_0 = rearrange(2 * torch.tensor(np.array(image_0)).float() / 255 - 1, "h w c -> c h w")
+        image_1 = rearrange(2 * torch.tensor(np.array(image_1)).float() / 255 - 1, "h w c -> c h w")
+        crop = torchvision.transforms.RandomCrop(self.crop_res)
+        flip = torchvision.transforms.RandomHorizontalFlip(float(self.flip_prob))
+        image_0, image_1 = flip(crop(torch.cat((image_0, image_1)))).chunk(2)
+        if self.instruct:
+            prompt = "Image Editing: " + prompt
+        return dict(edited=image_1, edit=dict(c_concat=image_0, c_crossattn=prompt))
+class GQAInpaintDataset(Dataset):
+    r"""
+    The data should be downloaded and unzipped first:
+    ```
+    mkdir -p ../datasets
+    cd ../datasets
+    # if file exists, then skip
+    if [ ! -f "gqa-inpaint.zip" ]; then
+        sudo azcopy copy "https://bingdatawu2.blob.core.windows.net/genrecog/private/t-thang/gqa-inpaint.zip${TOKEN}" .
+        unzip gqa-inpaint.zip -d gqa-inpaint > /dev/null
+    fi
+    if [ ! -f "images.zip" ]; then
+        sudo azcopy copy "https://bingdatawu2.blob.core.windows.net/genrecog/private/t-thang/images.zip${TOKEN}" .
+        unzip images.zip > /dev/null
+    fi
+    ```
+    """
+    def __init__(self, **kwargs):
+        # load from json ../datasets/gqa-inpaint/meta_info.json
+        self.path = kwargs.get("path", "../datasets/gqa-inpaint")
+        self.instruct = kwargs.get("instruct", False)
+        with open(self.path + "/meta_info.json", "r") as f:
+            self.meta_info = json.load(f)
+        self.min_resize_res = kwargs.get("min_resize_res", 256)
+        self.max_resize_res = kwargs.get("max_resize_res", 256)
+        self.crop_res = kwargs.get("crop_res", 256)
+        self.flip_prob = kwargs.get("flip_prob", 0.5)
+    def __len__(self):
+        return len(self.meta_info)
+    def __getitem__(self, i):
+        item = self.meta_info[i]
+        src_img = Image.open(item["source_image_path"].replace("../datasets", self.path)).convert("RGB")
+        tgt_img = Image.open(item["target_image_path"].replace("../datasets/gqa-inpaint", self.path)).convert("RGB")
+        image_0 = src_img
+        image_1 = tgt_img
+        resize_res = torch.randint(self.min_resize_res, self.max_resize_res + 1, ()).item()
+        image_0 = image_0.resize((resize_res, resize_res), RESAMPLING_METHOD)
+        image_1 = image_1.resize((resize_res, resize_res), RESAMPLING_METHOD)
+        instruction = item["instruction"]
+        if self.instruct:
+            instruction = "Image Editing: " + instruction
+        # return image_0, image_1, instruction
+        image_0 = rearrange(2 * torch.tensor(np.array(image_0)).float() / 255 - 1, "h w c -> c h w")
+        image_1 = rearrange(2 * torch.tensor(np.array(image_1)).float() / 255 - 1, "h w c -> c h w")
+        crop = torchvision.transforms.RandomCrop(self.crop_res)
+        flip = torchvision.transforms.RandomHorizontalFlip(float(self.flip_prob))
+        image_0, image_1 = flip(crop(torch.cat((image_0, image_1)))).chunk(2)
+        return dict(edited=image_1, edit=dict(c_concat=image_0, c_crossattn=instruction))
+class MagicBrushDataset(Dataset):
+    def __init__(
+        self,
+        path: str,
+        split: str = "train",
+        splits: tuple[float, float, float] = (0.9, 0.05, 0.05),
+        min_resize_res: int = 256,
+        max_resize_res: int = 256,
+        crop_res: int = 256,
+        flip_prob: float = 0.0,
+        zip_start_index: int = 0,
+        zip_end_index: int = 30,
+        len_dataset: int = -1,
+        instruct: bool = False,
+        sample_weight: float = 1.0,
+    ):
+        assert split in ("train", "val", "test")
+        assert sum(splits) == 1
+        self.path = path
+        self.min_resize_res = min_resize_res
+        self.max_resize_res = max_resize_res
+        self.crop_res = crop_res
+        self.flip_prob = flip_prob
+        self.instruct = instruct
+        self.sample_weight = sample_weight
+        self.meta_path = os.path.join(self.path, "magic_train.json")
+        with open(self.meta_path, "r") as f:
+            self.meta = json.load(f)
+    def __len__(self) -> int:
+        return int(len(self.meta) * self.sample_weight)
+    def __getitem__(self, i: int) -> dict[str, Any]:
+        if self.sample_weight >= 1:
+            i = i % len(self.meta)
+        else:
+            i = int(i / self.sample_weight) + random.randint(0, int(1 / self.sample_weight) - 1)
+        item = self.meta[i]
+        try:
+            image_0 = Image.open(os.path.join(self.path, item["input"])).convert("RGB")
+            image_1 = Image.open(os.path.join(self.path, item["edited"])).convert("RGB")
+        except (PIL.UnidentifiedImageError, FileNotFoundError):
+            print(f"\n {self.path}/{item['input']} or {self.path}/{item['edited']} is not a valid image")
+            i = random.randint(0, len(self.meta) - 1)
+            return self.__getitem__(i)
+        prompt = item["instruction"]
+        resize_res = torch.randint(self.min_resize_res, self.max_resize_res + 1, ()).item()
+        image_0 = image_0.resize((resize_res, resize_res), RESAMPLING_METHOD)
+        image_1 = image_1.resize((resize_res, resize_res), RESAMPLING_METHOD)
+        if self.instruct:
+            prompt = "Image Editing: " + prompt
+        # return image_0, image_1, prompt
+        image_0 = rearrange(2 * torch.tensor(np.array(image_0)).float() / 255 - 1, "h w c -> c h w")
+        image_1 = rearrange(2 * torch.tensor(np.array(image_1)).float() / 255 - 1, "h w c -> c h w")
+        crop = torchvision.transforms.RandomCrop(self.crop_res)
+        flip = torchvision.transforms.RandomHorizontalFlip(float(self.flip_prob))
+        image_0, image_1 = flip(crop(torch.cat((image_0, image_1)))).chunk(2)
+        return dict(edited=image_1, edit=dict(c_concat=image_0, c_crossattn=prompt))
+class IEIWDataset(Dataset):
+    def __init__(
+        self,
+        path: str,
+        split: str = "train",
+        splits: tuple[float, float, float] = (0.9, 0.05, 0.05),
+        min_resize_res: int = 256,
+        max_resize_res: int = 256,
+        crop_res: int = 256,
+        flip_prob: float = 0.0,
+        zip_start_index: int = 0,
+        zip_end_index: int = 30,
+        sample_weight: float = 1.0,
+        instruct: bool = False,
+    ):
+        assert split in ("train", "val", "test")
+        assert sum(splits) == 1
+        self.path = path
+        self.min_resize_res = min_resize_res
+        self.max_resize_res = max_resize_res
+        self.crop_res = crop_res
+        self.flip_prob = flip_prob
+        self.instruct = instruct
+        self.meta_path = os.path.join(self.path, "meta_infov1.json")
+        with open(self.meta_path, "r") as f:
+            self.meta = json.load(f)
+        self.sample_weight = sample_weight
+        print('original synthetic', len(self.meta))
+    def __len__(self) -> int:
+        return int(len(self.meta) * self.sample_weight)
+    def __getitem__(self, i: int) -> dict[str, Any]:
+        if self.sample_weight >= 1:
+            i = i % len(self.meta)
+        else:
+            i = int(i / self.sample_weight) + random.randint(0, int(1 / self.sample_weight) - 1)
+        item = self.meta[i]
+        item['input'] = item['input'].replace('/mnt/external/tmp/2023/06/11/', self.path)
+        item['edited'] = item['edited'].replace('/mnt/external/tmp/2023/06/11/', self.path)
+        try:
+            image_0 = Image.open(item["input"]).convert("RGB")
+            image_1 = Image.open(item["edited"]).convert("RGB")
+        except (PIL.UnidentifiedImageError, FileNotFoundError):
+            print(f"\n {item['input']} or {item['edited']} is not a valid image")
+            i = random.randint(0, len(self.meta) - 1)
+            return self.__getitem__(i)
+        prompt = item["instruction"]
+        resize_res = torch.randint(self.min_resize_res, self.max_resize_res + 1, ()).item()
+        image_0 = image_0.resize((resize_res, resize_res), RESAMPLING_METHOD)
+        image_1 = image_1.resize((resize_res, resize_res), RESAMPLING_METHOD)
+        if self.instruct:
+            prompt = "Image Editing: " + prompt
+        # return image_0, image_1, prompt
+        image_0 = rearrange(2 * torch.tensor(np.array(image_0)).float() / 255 - 1, "h w c -> c h w")
+        image_1 = rearrange(2 * torch.tensor(np.array(image_1)).float() / 255 - 1, "h w c -> c h w")
+        crop = torchvision.transforms.RandomCrop(self.crop_res)
+        flip = torchvision.transforms.RandomHorizontalFlip(float(self.flip_prob))
+        image_0, image_1 = flip(crop(torch.cat((image_0, image_1)))).chunk(2)
+        return dict(edited=image_1, edit=dict(c_concat=image_0, c_crossattn=prompt))
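
Several dataset classes in this file accept `split` and `splits`, and `FilteredIP2PDataset` and `GIERDataset` use them to slice their metadata into contiguous train/val/test ranges. A minimal standalone sketch of that slicing follows. `split_bounds` is a hypothetical name, and `math.isclose` is substituted for the exact `sum(splits) == 1` check as an assumption about what is safer with floating-point fractions.

```python
import math
from typing import Sequence, Tuple


def split_bounds(
    num_items: int,
    split: str = "train",
    splits: Sequence[float] = (0.9, 0.05, 0.05),
) -> Tuple[int, int]:
    """Return the [start, stop) index range that `split` covers."""
    if split not in ("train", "val", "test"):
        raise ValueError(f"unknown split: {split!r}")
    # Tolerant comparison instead of an exact equality test (sketch assumption).
    if not math.isclose(sum(splits), 1.0):
        raise ValueError("split fractions must sum to 1")
    lo, hi = {
        "train": (0.0, splits[0]),
        "val": (splits[0], splits[0] + splits[1]),
        "test": (splits[0] + splits[1], 1.0),
    }[split]
    return math.floor(lo * num_items), math.floor(hi * num_items)


items = list(range(1000))
for name in ("train", "val", "test"):
    start, stop = split_bounds(len(items), name)
    print(name, start, stop, len(items[start:stop]))
```

Because each split's upper bound is computed with the same expression as the next split's lower bound, the three ranges tile the list without gaps or overlap.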

dataset/low_level/lowlevel_clwd.py ADDED

	@@ -0,0 +1,106 @@

+# --------------------------------------------------------
+# InstructDiffusion
+# Based on instruct-pix2pix (https://github.com/timothybrooks/instruct-pix2pix)
+# Modified by Chen Li (edward82@stu.xjtu.edu.cn)
+# --------------------------------------------------------
+import os
+import numpy as np
+from torch.utils.data import Dataset
+import torch
+from PIL import Image
+import torchvision.transforms.functional as TF
+from pdb import set_trace as stx
+import random
+import cv2
+import torchvision
+def is_image_file(filename):
+    return any(filename.endswith(extension) for extension in ['jpeg', 'JPEG', 'jpg', 'png', 'JPG', 'PNG', 'gif'])
+class CLWD(Dataset):
+    def __init__(self, path, split="train", size=256, interpolation="pil_lanczos",
+        flip_prob=0.5, sample_weight=1.0, instruct=False):
+        super(CLWD, self).__init__()
+        inp_files = sorted(os.listdir(os.path.join(path, split, 'Watermarked_image')))
+        tar_files = sorted(os.listdir(os.path.join(path, split, 'Watermark_free_image')))
+        self.inp_filenames = [os.path.join(path, split, 'Watermarked_image', x) for x in inp_files if is_image_file(x)]
+        self.tar_filenames = [os.path.join(path, split, 'Watermark_free_image', x) for x in tar_files if is_image_file(x)]
+        self.size = size
+        self.flip_prob = flip_prob
+        self.sample_weight = sample_weight
+        self.instruct = instruct
+        self.sizex = len(self.tar_filenames)  # get the size of target
+        self.interpolation = {
+            "cv_nearest": cv2.INTER_NEAREST,
+            "cv_bilinear": cv2.INTER_LINEAR,
+            "cv_bicubic": cv2.INTER_CUBIC,
+            "cv_area": cv2.INTER_AREA,
+            "cv_lanczos": cv2.INTER_LANCZOS4,
+            "pil_nearest": Image.NEAREST,
+            "pil_bilinear": Image.BILINEAR,
+            "pil_bicubic": Image.BICUBIC,
+            "pil_box": Image.BOX,
+            "pil_hamming": Image.HAMMING,
+            "pil_lanczos": Image.LANCZOS,
+        }[interpolation]
+        prompt_path='dataset/prompt/prompt_dewatermark.txt'
+        self.prompt_list=[]
+        with open(prompt_path) as f:
+            line=f.readline()
+            while line:
+                line=line.strip('\n')
+                self.prompt_list.append(line)
+                line=f.readline()
+        print(f"CLWD has {len(self)} samples!!")
+    def __len__(self):
+        return int(self.sizex * self.sample_weight)
+    def __getitem__(self, index):
+        if self.sample_weight >= 1:
+            index_ = index % self.sizex
+        else:
+            index_ = int(index / self.sample_weight) + random.randint(0, int(1 / self.sample_weight) - 1)
+        inp_path = self.inp_filenames[index_]
+        tar_path = self.tar_filenames[index_]
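+        # force 3-channel RGB so grayscale or RGBA files yield consistent tensor shapes
+        inp_img = Image.open(inp_path).convert("RGB")
+        tar_img = Image.open(tar_path).convert("RGB")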
+        width, height = inp_img.size
+        tar_width, tar_height = tar_img.size
+        assert tar_width == width and tar_height == height, "Input and target image mismatch"
+        aspect_ratio = float(width) / float(height)
+        if width < height:
+            new_width = self.size
+            new_height = int(self.size  / aspect_ratio)
+        else:
+            new_height = self.size
+            new_width = int(self.size * aspect_ratio)
+        inp_img = inp_img.resize((new_width, new_height), self.interpolation)
+        tar_img = tar_img.resize((new_width, new_height), self.interpolation)
+        inp_img = np.array(inp_img).astype(np.float32).transpose(2, 0, 1)
+        inp_img_tensor = torch.tensor((inp_img / 127.5 - 1.0).astype(np.float32))
+        tar_img = np.array(tar_img).astype(np.float32).transpose(2, 0, 1)
+        tar_img_tensor = torch.tensor((tar_img / 127.5 - 1.0).astype(np.float32))
+        crop = torchvision.transforms.RandomCrop(self.size)
+        flip = torchvision.transforms.RandomHorizontalFlip(float(self.flip_prob))
+        image_0, image_1 = flip(crop(torch.cat((inp_img_tensor, tar_img_tensor)))).chunk(2)
+        prompt = random.choice(self.prompt_list)
+        if self.instruct:
+            prompt = "Watermark Removal: " + prompt
+        return dict(edited=image_1, edit=dict(c_concat=image_0, c_crossattn=prompt))
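
`sample_weight` rescales the dataset's apparent length: `__len__` reports `int(sizex * sample_weight)`, and `__getitem__` maps each virtual index back to a real one. The sketch below isolates that mapping in plain Python so it can be tried without images on disk; `virtual_len` and `real_index` are hypothetical names that do not appear in the repository.

```python
import random


def virtual_len(num_items: int, sample_weight: float) -> int:
    """Length the dataset reports for a given weight."""
    return int(num_items * sample_weight)


def real_index(index: int, num_items: int, sample_weight: float) -> int:
    """Map a virtual index to an index into the underlying file list."""
    if sample_weight >= 1:
        # Oversampling: wrap around so items repeat within one epoch.
        return index % num_items
    # Undersampling: each virtual index owns a window of int(1 / weight)
    # consecutive items and draws one of them at random.
    window = int(1 / sample_weight)
    return int(index / sample_weight) + random.randint(0, window - 1)


num_items = 10
for weight in (2.0, 0.5):
    n = virtual_len(num_items, weight)
    picks = [real_index(i, num_items, weight) for i in range(n)]
    print(f"weight={weight}: reported length {n}, indices {picks}")
```

With a weight below 1, an epoch visits only a fraction of the items, and the random pick inside each window means a different subset is drawn from epoch to epoch.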

dataset/low_level/lowlevel_gopro.py ADDED

	@@ -0,0 +1,106 @@

+# --------------------------------------------------------
+# InstructDiffusion
+# Based on instruct-pix2pix (https://github.com/timothybrooks/instruct-pix2pix)
+# Modified by Chen Li (edward82@stu.xjtu.edu.cn)
+# --------------------------------------------------------
+import os
+import numpy as np
+from torch.utils.data import Dataset
+import torch
+from PIL import Image
+import torchvision.transforms.functional as TF
+from pdb import set_trace as stx
+import random
+import cv2
+import torchvision
+def is_image_file(filename):
+    return any(filename.endswith(extension) for extension in ['jpeg', 'JPEG', 'jpg', 'png', 'JPG', 'PNG', 'gif'])
+class GoPro(Dataset):
+    def __init__(self, path, split="train", size=256, interpolation="pil_lanczos",
+        flip_prob=0.5, sample_weight=1.0, instruct=False):
+        super(GoPro, self).__init__()
+        inp_files = sorted(os.listdir(os.path.join(path, split, 'input')))
+        tar_files = sorted(os.listdir(os.path.join(path, split, 'target')))
+        self.inp_filenames = [os.path.join(path, split, 'input', x) for x in inp_files if is_image_file(x)]
+        self.tar_filenames = [os.path.join(path, split, 'target', x) for x in tar_files if is_image_file(x)]
+        self.size = size
+        self.flip_prob = flip_prob
+        self.sample_weight = sample_weight
+        self.instruct = instruct
+        self.sizex = len(self.tar_filenames)  # get the size of target
+        self.interpolation = {
+            "cv_nearest": cv2.INTER_NEAREST,
+            "cv_bilinear": cv2.INTER_LINEAR,
+            "cv_bicubic": cv2.INTER_CUBIC,
+            "cv_area": cv2.INTER_AREA,
+            "cv_lanczos": cv2.INTER_LANCZOS4,
+            "pil_nearest": Image.NEAREST,
+            "pil_bilinear": Image.BILINEAR,
+            "pil_bicubic": Image.BICUBIC,
+            "pil_box": Image.BOX,
+            "pil_hamming": Image.HAMMING,
+            "pil_lanczos": Image.LANCZOS,
+        }[interpolation]
+        prompt_path='dataset/prompt/prompt_deblur.txt'
+        self.prompt_list=[]
+        with open(prompt_path) as f:
+            line=f.readline()
+            while line:
+                line=line.strip('\n')
+                self.prompt_list.append(line)
+                line=f.readline()
+        print(f"GoPro has {len(self)} samples!!")
+    def __len__(self):
+        return int(self.sizex * self.sample_weight)
+    def __getitem__(self, index):
+        if self.sample_weight >= 1:
+            index_ = index % self.sizex
+        else:
+            index_ = int(index / self.sample_weight) + random.randint(0, int(1 / self.sample_weight) - 1)
+        inp_path = self.inp_filenames[index_]
+        tar_path = self.tar_filenames[index_]
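+        # force 3-channel RGB so grayscale or RGBA files yield consistent tensor shapes
+        inp_img = Image.open(inp_path).convert("RGB")
+        tar_img = Image.open(tar_path).convert("RGB")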
+        width, height = inp_img.size
+        tar_width, tar_height = tar_img.size
+        assert tar_width == width and tar_height == height, "Input and target image mismatch"
+        aspect_ratio = float(width) / float(height)
+        if width < height:
+            new_width = self.size
+            new_height = int(self.size  / aspect_ratio)
+        else:
+            new_height = self.size
+            new_width = int(self.size * aspect_ratio)
+        inp_img = inp_img.resize((new_width, new_height), self.interpolation)
+        tar_img = tar_img.resize((new_width, new_height), self.interpolation)
+        inp_img = np.array(inp_img).astype(np.float32).transpose(2, 0, 1)
+        inp_img_tensor = torch.tensor((inp_img / 127.5 - 1.0).astype(np.float32))
+        tar_img = np.array(tar_img).astype(np.float32).transpose(2, 0, 1)
+        tar_img_tensor = torch.tensor((tar_img / 127.5 - 1.0).astype(np.float32))
+        crop = torchvision.transforms.RandomCrop(self.size)
+        flip = torchvision.transforms.RandomHorizontalFlip(float(self.flip_prob))
+        image_0, image_1 = flip(crop(torch.cat((inp_img_tensor, tar_img_tensor)))).chunk(2)
+        prompt = random.choice(self.prompt_list)
+        if self.instruct:
+            prompt = "Image Deblurring: " + prompt
+        return dict(edited=image_1, edit=dict(c_concat=image_0, c_crossattn=prompt))
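
The resize step in `__getitem__` scales the shorter image side to `size` and lets the longer side follow the aspect ratio, so the subsequent `RandomCrop(size)` always fits inside the resized image. A minimal sketch of that size computation, using the hypothetical helper name `resized_shape`:

```python
from typing import Tuple


def resized_shape(width: int, height: int, size: int = 256) -> Tuple[int, int]:
    """Return (new_width, new_height) with the shorter side equal to `size`."""
    aspect_ratio = float(width) / float(height)
    if width < height:
        # Portrait: pin the width, scale the height.
        new_width = size
        new_height = int(size / aspect_ratio)
    else:
        # Landscape or square: pin the height, scale the width.
        new_height = size
        new_width = int(size * aspect_ratio)
    return new_width, new_height


for w, h in [(1280, 720), (720, 1280), (512, 512)]:
    print((w, h), "->", resized_shape(w, h))
```

Only a square input ends up exactly `size` by `size`; for any other shape the random crop discards part of the longer side, and input and target lose the same part because they are cropped together.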

dataset/low_level/lowlevel_reds.py ADDED

	@@ -0,0 +1,111 @@

+# --------------------------------------------------------
+# InstructDiffusion
+# Based on instruct-pix2pix (https://github.com/timothybrooks/instruct-pix2pix)
+# Modified by Chen Li (edward82@stu.xjtu.edu.cn)
+# --------------------------------------------------------
+import os
+import numpy as np
+from torch.utils.data import Dataset
+import torch
+from PIL import Image
+import torchvision.transforms.functional as TF
+from pdb import set_trace as stx
+import random
+import cv2
+import torchvision
+def is_image_file(filename):
+    return any(filename.endswith(extension) for extension in ['jpeg', 'JPEG', 'jpg', 'png', 'JPG', 'PNG', 'gif'])
+class REDS(Dataset):
+    def __init__(self, path, split="train", size=256, interpolation="pil_lanczos",
+        flip_prob=0.5, sample_weight=1.0, instruct=False):
+        super(REDS, self).__init__()
+        inp_files = sorted(os.listdir(os.path.join(path, split, 'blur')))
+        tar_files = sorted(os.listdir(os.path.join(path, split, 'sharp')))
+        if split == "train":
+            self.inp_filenames = [os.path.join(path, split, 'blur', d, x) for d in inp_files for x in sorted(os.listdir(os.path.join(path, split, 'blur', d))) if is_image_file(x)]
+            self.tar_filenames = [os.path.join(path, split, 'sharp', d,  x) for d in tar_files for x in sorted(os.listdir(os.path.join(path, split, 'sharp', d))) if is_image_file(x)]
+        else:
+            self.inp_filenames = [os.path.join(path, split, 'blur', x) for x in inp_files if is_image_file(x)]
+            self.tar_filenames = [os.path.join(path, split, 'sharp', x) for x in tar_files if is_image_file(x)]
+        self.size = size
+        self.flip_prob = flip_prob
+        self.sample_weight = sample_weight
+        self.instruct = instruct
+        assert len(self.inp_filenames) == len(self.tar_filenames)
+        self.sizex = len(self.tar_filenames)  # get the size of target
+        self.interpolation = {
+            "cv_nearest": cv2.INTER_NEAREST,
+            "cv_bilinear": cv2.INTER_LINEAR,
+            "cv_bicubic": cv2.INTER_CUBIC,
+            "cv_area": cv2.INTER_AREA,
+            "cv_lanczos": cv2.INTER_LANCZOS4,
+            "pil_nearest": Image.NEAREST,
+            "pil_bilinear": Image.BILINEAR,
+            "pil_bicubic": Image.BICUBIC,
+            "pil_box": Image.BOX,
+            "pil_hamming": Image.HAMMING,
+            "pil_lanczos": Image.LANCZOS,
+        }[interpolation]
+        prompt_path = 'dataset/prompt/prompt_deblur.txt'
+        # Load one instruction per line, skipping blank lines so no empty prompt is sampled.
+        with open(prompt_path) as f:
+            self.prompt_list = [line.strip('\n') for line in f if line.strip()]
+        print(f"REDS has {len(self)} samples!!")
+    def __len__(self):
+        return int(self.sizex * self.sample_weight)
+    def __getitem__(self, index):
+        if self.sample_weight >= 1:
+            index_ = index % self.sizex
+        else:
+            index_ = int(index / self.sample_weight) + random.randint(0, int(1 / self.sample_weight) - 1)
+        inp_path = self.inp_filenames[index_]
+        tar_path = self.tar_filenames[index_]
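+        # Force 3-channel RGB so the (2, 0, 1) transpose below cannot fail on grayscale or RGBA files.
+        inp_img = Image.open(inp_path).convert("RGB")
+        tar_img = Image.open(tar_path).convert("RGB")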
+        width, height = inp_img.size
+        tar_width, tar_height = tar_img.size
+        assert tar_width == width and tar_height == height, "Input and target image mismatch"
+        aspect_ratio = float(width) / float(height)
+        if width < height:
+            new_width = self.size
+            new_height = int(self.size  / aspect_ratio)
+        else:
+            new_height = self.size
+            new_width = int(self.size * aspect_ratio)
+        inp_img = inp_img.resize((new_width, new_height), self.interpolation)
+        tar_img = tar_img.resize((new_width, new_height), self.interpolation)
+        inp_img = np.array(inp_img).astype(np.float32).transpose(2, 0, 1)
+        inp_img_tensor = torch.tensor((inp_img / 127.5 - 1.0).astype(np.float32))
+        tar_img = np.array(tar_img).astype(np.float32).transpose(2, 0, 1)
+        tar_img_tensor = torch.tensor((tar_img / 127.5 - 1.0).astype(np.float32))
+        crop = torchvision.transforms.RandomCrop(self.size)
+        flip = torchvision.transforms.RandomHorizontalFlip(float(self.flip_prob))
+        image_0, image_1 = flip(crop(torch.cat((inp_img_tensor, tar_img_tensor)))).chunk(2)
+        prompt = random.choice(self.prompt_list)
+        if self.instruct:
+            prompt = "Image Deblurring: " + prompt
+        return dict(edited=image_1, edit=dict(c_concat=image_0, c_crossattn=prompt))
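The `sample_weight` logic in `__len__` and `__getitem__` above resizes the dataset virtually: a weight of 1 or more repeats files by wrapping the index, while a weight below 1 shrinks the length and picks a random file from a window of roughly `1 / sample_weight` neighbors. The sketch below restates that arithmetic outside the class. `virtual_length` and `resolve_index` are hypothetical helper names, and the seeded `rng` is only there to make the example repeatable.

```python
import random


def virtual_length(num_files, sample_weight):
    """Length reported by the dataset after re-weighting."""
    return int(num_files * sample_weight)


def resolve_index(index, num_files, sample_weight, rng=random):
    """Map a virtual index back to a real file index."""
    if sample_weight >= 1:
        # Oversampling: wrap around so every file is visited several times.
        return index % num_files
    # Undersampling: jump to the start of a window, then pick one file inside it.
    window = int(1 / sample_weight)
    return int(index / sample_weight) + rng.randint(0, window - 1)


rng = random.Random(0)
down = [resolve_index(i, 8, 0.25, rng) for i in range(virtual_length(8, 0.25))]
up = [resolve_index(i, 8, 2.0) for i in range(virtual_length(8, 2.0))]
```

With 8 files and a weight of 0.25 the dataset reports 2 items; item 0 resolves to a file in positions 0 to 3 and item 1 to a file in positions 4 to 7, so each epoch sees a different random subset.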

dataset/low_level/lowlevel_sidd.py ADDED

	@@ -0,0 +1,96 @@

+# --------------------------------------------------------
+# InstructDiffusion
+# Based on instruct-pix2pix (https://github.com/timothybrooks/instruct-pix2pix)
+# Modified by Chen Li (edward82@stu.xjtu.edu.cn)
+# --------------------------------------------------------
+import os
+import random
+import cv2
+import numpy as np
+import torch
+import torchvision
+from PIL import Image
+from torch.utils.data import Dataset
+def is_image_file(filename):
+    return any(filename.endswith(extension) for extension in ['jpeg', 'JPEG', 'jpg', 'png', 'JPG', 'PNG', 'gif'])
+class SIDD(Dataset):
+    def __init__(self, path, split="train", size=256, interpolation="pil_lanczos",
+        flip_prob=0.5, sample_weight=1.0, instruct=False):
+        super(SIDD, self).__init__()
+        inp_files = sorted(os.listdir(os.path.join(path, split, 'input')))
+        tar_files = sorted(os.listdir(os.path.join(path, split, 'gt')))
+        self.inp_filenames = [os.path.join(path, split, 'input', x) for x in inp_files if is_image_file(x)]
+        self.tar_filenames = [os.path.join(path, split, 'gt', x) for x in tar_files if is_image_file(x)]
+        self.size = size
+        self.flip_prob = flip_prob
+        self.sample_weight = sample_weight
+        self.instruct = instruct
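+        # Inputs and targets are paired by sorted position, so the counts must match.
+        assert len(self.inp_filenames) == len(self.tar_filenames), "Input and target file counts differ"
+        self.sizex = len(self.tar_filenames)  # number of input/target pairs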
+        self.interpolation = {
+            "cv_nearest": cv2.INTER_NEAREST,
+            "cv_bilinear": cv2.INTER_LINEAR,
+            "cv_bicubic": cv2.INTER_CUBIC,
+            "cv_area": cv2.INTER_AREA,
+            "cv_lanczos": cv2.INTER_LANCZOS4,
+            "pil_nearest": Image.NEAREST,
+            "pil_bilinear": Image.BILINEAR,
+            "pil_bicubic": Image.BICUBIC,
+            "pil_box": Image.BOX,
+            "pil_hamming": Image.HAMMING,
+            "pil_lanczos": Image.LANCZOS,
+        }[interpolation]
+        prompt_path = 'dataset/prompt/prompt_denoise.txt'
+        # Load one instruction per line, skipping blank lines so no empty prompt is sampled.
+        with open(prompt_path) as f:
+            self.prompt_list = [line.strip('\n') for line in f if line.strip()]
+        print(f"SIDD has {len(self)} samples!!")
+    def __len__(self):
+        return int(self.sizex * self.sample_weight)
+    def __getitem__(self, index):
+        if self.sample_weight >= 1:
+            index_ = index % self.sizex
+        else:
+            index_ = int(index / self.sample_weight) + random.randint(0, int(1 / self.sample_weight) - 1)
+        inp_path = self.inp_filenames[index_]
+        tar_path = self.tar_filenames[index_]
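+        # Force 3-channel RGB so the (2, 0, 1) transpose below cannot fail on grayscale or RGBA files.
+        inp_img = Image.open(inp_path).convert("RGB")
+        tar_img = Image.open(tar_path).convert("RGB")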
+        width, height = inp_img.size
+        tar_width, tar_height = tar_img.size
+        assert tar_width == width and tar_height == height, "Input and target image mismatch"
+        inp_img = np.array(inp_img).astype(np.float32).transpose(2, 0, 1)
+        inp_img_tensor = torch.tensor((inp_img / 127.5 - 1.0).astype(np.float32))
+        tar_img = np.array(tar_img).astype(np.float32).transpose(2, 0, 1)
+        tar_img_tensor = torch.tensor((tar_img / 127.5 - 1.0).astype(np.float32))
+        crop = torchvision.transforms.RandomCrop(self.size)
+        flip = torchvision.transforms.RandomHorizontalFlip(float(self.flip_prob))
+        image_0, image_1 = flip(crop(torch.cat((inp_img_tensor, tar_img_tensor)))).chunk(2)
+        prompt = random.choice(self.prompt_list)
+        if self.instruct:
+            prompt = "Image Denoising: " + prompt
+        return dict(edited=image_1, edit=dict(c_concat=image_0, c_crossattn=prompt))
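The low-level datasets above concatenate the input and target tensors before `RandomCrop` and `RandomHorizontalFlip`, then split them again with `chunk(2)`, so both images receive the same crop window and the same flip decision. The sketch below illustrates the same idea without PyTorch, using nested lists as stand-in images. `paired_crop_flip` is a hypothetical helper, not code from the repository.

```python
import random


def paired_crop_flip(inp, tar, size, flip_prob=0.5, rng=random):
    """Apply one random crop and one flip decision to two same-shaped 2D images."""
    height, width = len(inp), len(inp[0])
    # Draw the random parameters once so both images stay pixel-aligned.
    top = rng.randint(0, height - size)
    left = rng.randint(0, width - size)
    do_flip = rng.random() < flip_prob

    def apply(img):
        rows = [row[left:left + size] for row in img[top:top + size]]
        return [row[::-1] for row in rows] if do_flip else rows

    return apply(inp), apply(tar)


# Stand-in data: the target equals the input plus 1000, so alignment is easy to check.
inp = [[r * 10 + c for c in range(8)] for r in range(6)]
tar = [[value + 1000 for value in row] for row in inp]
out_inp, out_tar = paired_crop_flip(inp, tar, size=4, rng=random.Random(1))
```

Sampling the crop offset and flip flag separately for each image would break the pixel correspondence that restoration training depends on.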

dataset/pose/pose.py ADDED

	@@ -0,0 +1,760 @@

+# ------------------------------------------------------------------------------
+# Copyright (c) Microsoft
+# Licensed under the MIT License.
+# Written by Bin Xiao (Bin.Xiao@microsoft.com)
+# Modified by Zigang Geng (zigang@mail.ustc.edu.cn)
+# ------------------------------------------------------------------------------
+from __future__ import annotations
+import logging
+import os
+import json
+import copy
+import math
+import random
+from pathlib import Path
+from typing import Any
+import cv2
+import numpy as np
+import torch
+import torchvision
+from einops import rearrange
+from PIL import Image
+from torch.utils.data import Dataset
+import torchvision.transforms as transforms
+from pycocotools.coco import COCO
+logger = logging.getLogger(__name__)
+colors = {
+    'red': (255, 0, 0),
+    'green': (0, 255, 0),
+    'blue': (0, 0, 255),
+    'yellow': (255, 255, 0),
+    'cyan': (0, 255, 255),
+    'magenta': (255, 0, 255),
+    'gray': (128, 128, 128),
+    'white': (255, 255, 255),
+    'black': (0, 0, 0)}
+def readTXT(txt_path):
+    with open(txt_path, 'r') as f:
+        listInTXT = [line.strip() for line in f]
+    return listInTXT
+class PoseDataset(Dataset):
+    def __init__(self, root, image_set, is_train, max_prompt_num=5, min_prompt_num=1,
+        radius=10, size=256, transparency=0.0, sample_weight=1.0, transform=None):
+        self.sample_weight = sample_weight
+        self.max_prompt_num = max_prompt_num
+        self.min_prompt_num = min_prompt_num
+        self.radius = radius
+        self.transparency = transparency
+        self.num_joints = 0
+        self.pixel_std = 200
+        self.flip_pairs = []
+        self.parent_ids = []
+        self.keypoints_type = {}
+        self.is_train = is_train
+        self.image_set = image_set
+        self.root = root
+        self.scale_factor = 0.35
+        self.rotation_factor = 45
+        self.flip = True
+        self.num_joints_half_body = 8
+        self.prob_half_body = 0.3
+        self.image_size = np.array((size, size))
+        self.heatmap_size = np.array((size, size))
+        self.transform = transform
+        self.db = []
+        pose_diverse_prompt_path = 'dataset/prompt/prompt_pose.txt'
+        # Load one prompt template per line, skipping blank lines.
+        with open(pose_diverse_prompt_path) as f:
+            self.pose_diverse_prompt_list = [line.strip('\n') for line in f if line.strip()]
+    def _get_db(self):
+        raise NotImplementedError
+    def evaluate(self, preds, output_dir, *args, **kwargs):
+        raise NotImplementedError
+    def half_body_transform(self, joints, joints_vis):
+        upper_joints = []
+        lower_joints = []
+        for joint_id in range(self.num_joints):
+            if joints_vis[joint_id][0] > 0:
+                if joint_id in self.upper_body_ids:
+                    upper_joints.append(joints[joint_id])
+                else:
+                    lower_joints.append(joints[joint_id])
+        # rand() gives a fair 50/50 choice; randn() < 0.5 is true about 69% of the time.
+        if np.random.rand() < 0.5 and len(upper_joints) > 2:
+            selected_joints = upper_joints
+        else:
+            selected_joints = lower_joints \
+                if len(lower_joints) > 2 else upper_joints
+        if len(selected_joints) < 2:
+            return None, None
+        selected_joints = np.array(selected_joints, dtype=np.float32)
+        center = selected_joints.mean(axis=0)[:2]
+        left_top = np.amin(selected_joints, axis=0)
+        right_bottom = np.amax(selected_joints, axis=0)
+        w = right_bottom[0] - left_top[0]
+        h = right_bottom[1] - left_top[1]
+        if w > self.aspect_ratio * h:
+            h = w * 1.0 / self.aspect_ratio
+        elif w < self.aspect_ratio * h:
+            w = h * self.aspect_ratio
+        scale = np.array(
+            [
+                w * 1.0 / self.pixel_std,
+                h * 1.0 / self.pixel_std
+            ],
+            dtype=np.float32
+        )
+        scale = scale * 1.5
+        return center, scale
+    def __len__(self):
+        return int(len(self.db) * self.sample_weight)
+    def __getitem__(self, idx):
+        if self.sample_weight >= 1:
+            idx = idx % len(self.db)
+        else:
+            idx = int(idx / self.sample_weight) + random.randint(0, int(1 / self.sample_weight) - 1)
+        db_rec = copy.deepcopy(self.db[idx])
+        image_file = db_rec['image']
+        filename = db_rec['filename'] if 'filename' in db_rec else ''
+        imgnum = db_rec['imgnum'] if 'imgnum' in db_rec else ''
+        data_numpy = cv2.imread(
+            image_file, cv2.IMREAD_COLOR | cv2.IMREAD_IGNORE_ORIENTATION
+        )
+        # cv2.imread returns None on failure, so check before converting color spaces.
+        if data_numpy is None:
+            logger.error('=> fail to read {}'.format(image_file))
+            raise ValueError('Fail to read {}'.format(image_file))
+        data_numpy = cv2.cvtColor(data_numpy, cv2.COLOR_BGR2RGB)
+        joints = db_rec['joints_3d']
+        joints_vis = db_rec['joints_3d_vis']
+        c = db_rec['center']
+        s = db_rec['scale']
+        score = db_rec['score'] if 'score' in db_rec else 1
+        r = 0
+        if self.is_train:
+            if (np.sum(joints_vis[:, 0]) > self.num_joints_half_body
+                and np.random.rand() < self.prob_half_body):
+                c_half_body, s_half_body = self.half_body_transform(
+                    joints, joints_vis
+                )
+                if c_half_body is not None and s_half_body is not None:
+                    c, s = c_half_body, s_half_body
+            sf = self.scale_factor
+            rf = self.rotation_factor
+            s = s * np.clip(np.random.randn()*sf + 1, 1 - sf, 1 + sf)
+            r = np.clip(np.random.randn()*rf, -rf*2, rf*2) \
+                if random.random() <= 0.6 else 0
+            if self.flip and random.random() <= 0.5:
+                data_numpy = data_numpy[:, ::-1, :]
+                joints, joints_vis = fliplr_joints(
+                    joints, joints_vis, data_numpy.shape[1], self.flip_pairs)
+                c[0] = data_numpy.shape[1] - c[0] - 1
+        trans = get_affine_transform(c, s, r, self.image_size)
+        input = cv2.warpAffine(
+            data_numpy,
+            trans,
+            (int(self.image_size[0]), int(self.image_size[1])),
+            flags=cv2.INTER_LINEAR)
+        if self.transform:
+            input = self.transform(input)
+        for i in range(self.num_joints):
+            if joints_vis[i, 0] > 0.0:
+                joints[i, 0:2] = affine_transform(joints[i, 0:2], trans)
+        target, prompt = self.generate_target(input, joints, joints_vis)
+        # return Image.fromarray(input), Image.fromarray(target), prompt
+        image_0 = rearrange(2 * torch.tensor(np.array(input)).float() / 255 - 1, "h w c -> c h w")
+        image_1 = rearrange(2 * torch.tensor(np.array(target)).float() / 255 - 1, "h w c -> c h w")
+        return dict(edited=image_1, edit=dict(c_concat=image_0, c_crossattn=prompt))
+    def generate_target(self, input, joints, joints_vis):
+        '''
+        Draw colored circular markers on a copy of the input and build the matching prompt.
+        :param input: [height, width, 3]
+        :param joints:  [num_joints, 3]
+        :param joints_vis: [num_joints, 3]
+        :return: (target, prompt)
+        '''
+        radius = self.radius
+        target = copy.deepcopy(input)
+        joint_num = random.randint(self.min_prompt_num, self.max_prompt_num)
+        joint_ids = np.random.choice([i for i in range(self.num_joints)], joint_num, replace=False)
+        random_color_names = random.sample(list(colors.keys()), len(joint_ids))
+        random_marker_names = ['circle' for i in range(len(joint_ids))]
+        prompt = ""
+        for color_idx, joint_id in enumerate(joint_ids):
+            feat_stride = self.image_size / self.heatmap_size
+            mu_x = int(joints[joint_id][0] / feat_stride[0] + 0.5)
+            mu_y = int(joints[joint_id][1] / feat_stride[1] + 0.5)
+            # Check that any part of the marker is in-bounds
+            ul = [int(mu_x - radius), int(mu_y - radius)]
+            br = [int(mu_x + radius + 1), int(mu_y + radius + 1)]
+            if ul[0] >= self.heatmap_size[0] or ul[1] >= self.heatmap_size[1] \
+                    or br[0] < 0 or br[1] < 0:
+                # If not, mark the joint as invisible and skip it (no marker, no prompt text)
+                joints_vis[joint_id][0] = 0
+                continue
+            marker_size = 2 * radius + 1
+            g = np.zeros((marker_size, marker_size))
+            x, y = np.indices((marker_size, marker_size))
+            # Binary disk mask centered in the marker window
+            mask = (x - radius) ** 2 + (y - radius) ** 2 <= radius ** 2 + 1
+            g[mask] = 1
+            # Usable marker range
+            g_x = max(0, -ul[0]), min(br[0], self.heatmap_size[0]) - ul[0]
+            g_y = max(0, -ul[1]), min(br[1], self.heatmap_size[1]) - ul[1]
+            # Image range
+            img_x = max(0, ul[0]), min(br[0], self.heatmap_size[0])
+            img_y = max(0, ul[1]), min(br[1], self.heatmap_size[1])
+            v = joints_vis[joint_id][0]
+            random_color_name = random_color_names[color_idx]
+            random_color = colors[random_color_name]
+            prompt += random.choice(self.pose_diverse_prompt_list).format(
+                color=random_color_name,
+                joint=self.keypoints_type[joint_id])
+            if v > 0.5:
+                target[img_y[0]:img_y[1], img_x[0]:img_x[1]][g[g_y[0]:g_y[1], g_x[0]:g_x[1]]>0] \
+                    = self.transparency*target[img_y[0]:img_y[1], img_x[0]:img_x[1]][g[g_y[0]:g_y[1], g_x[0]:g_x[1]]>0] \
+                        + (1-self.transparency)*np.array(random_color)
+        return target, prompt
+class COCODataset(PoseDataset):
+    def __init__(self, root, image_set, is_train, max_prompt_num=5, min_prompt_num=1,
+            radius=10, size=256, transparency=0.0, sample_weight=1.0, transform=None):
+        super().__init__(root, image_set, is_train, max_prompt_num, min_prompt_num,
+            radius, size, transparency, sample_weight, transform)
+        self.keypoints_type = {
+                0: "nose",
+                1: "left eye",
+                2: "right eye",
+                3: "left ear",
+                4: "right ear",
+                5: "left shoulder",
+                6: "right shoulder",
+                7: "left elbow",
+                8: "right elbow",
+                9: "left wrist",
+                10: "right wrist",
+                11: "left hip",
+                12: "right hip",
+                13: "left knee",
+                14: "right knee",
+                15: "left ankle",
+                16: "right ankle"
+            }
+        self.image_width = size
+        self.image_height = size
+        self.aspect_ratio = self.image_width * 1.0 / self.image_height
+        self.pixel_std = 200
+        self.coco = COCO(self._get_ann_file_keypoint())
+        # deal with class names
+        cats = [cat['name']
+                for cat in self.coco.loadCats(self.coco.getCatIds())]
+        self.classes = ['__background__'] + cats
+        logger.info('=> classes: {}'.format(self.classes))
+        self.num_classes = len(self.classes)
+        self._class_to_ind = dict(zip(self.classes, range(self.num_classes)))
+        self._class_to_coco_ind = dict(zip(cats, self.coco.getCatIds()))
+        self._coco_ind_to_class_ind = dict(
+            [
+                (self._class_to_coco_ind[cls], self._class_to_ind[cls])
+                for cls in self.classes[1:]
+            ]
+        )
+        # load image file names
+        self.image_set_index = self._load_image_set_index()
+        self.num_images = len(self.image_set_index)
+        logger.info('=> num_images: {}'.format(self.num_images))
+        self.num_joints = 17
+        self.flip_pairs = [[1, 2], [3, 4], [5, 6], [7, 8],
+                           [9, 10], [11, 12], [13, 14], [15, 16]]
+        self.parent_ids = None
+        self.upper_body_ids = (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
+        self.lower_body_ids = (11, 12, 13, 14, 15, 16)
+        if 'coco' in self.root:
+            self.db = self._get_db()
+        logger.info('=> load {} samples'.format(len(self.db)))
+    def _get_ann_file_keypoint(self):
+        """ self.root / annotations / person_keypoints_train2017.json """
+        if 'coco' in self.root:
+            prefix = 'person_keypoints' \
+                if 'test' not in self.image_set else 'image_info'
+            return os.path.join(
+                self.root,
+                'annotations',
+                prefix + '_' + self.image_set + '.json'
+            )
+        elif 'crowdpose' in self.root:
+            prefix = 'crowdpose'
+            return os.path.join(
+                self.root,
+                'json',
+                prefix + '_' + self.image_set + '.json'
+            )
+        elif 'aic' in self.root:
+            prefix = 'aic'
+            return os.path.join(
+                self.root,
+                'annotations',
+                prefix + '_' + self.image_set + '.json'
+            )
+        else:
+            raise ValueError('Please write the path for this new dataset.')
+    def _load_image_set_index(self):
+        """ image id: int """
+        image_ids = self.coco.getImgIds()
+        return image_ids
+    def _get_db(self):
+        gt_db = self._load_coco_keypoint_annotations()
+        return gt_db
+    def _load_coco_keypoint_annotations(self):
+        """ ground truth bbox and keypoints """
+        gt_db = []
+        for index in self.image_set_index:
+            gt_db.extend(self._load_coco_keypoint_annotation_kernal(index))
+        return gt_db
+    def _load_coco_keypoint_annotation_kernal(self, index):
+        """
+        coco ann: [u'segmentation', u'area', u'iscrowd', u'image_id', u'bbox', u'category_id', u'id']
+        iscrowd:
+            crowd instances are handled by marking their overlaps with all categories to -1
+            and later excluded in training
+        bbox:
+            [x1, y1, w, h]
+        :param index: coco image id
+        :return: db entry
+        """
+        im_ann = self.coco.loadImgs(index)[0]
+        width = im_ann['width']
+        height = im_ann['height']
+        annIds = self.coco.getAnnIds(imgIds=index, iscrowd=False)
+        objs = self.coco.loadAnns(annIds)
+        # sanitize bboxes
+        valid_objs = []
+        for obj in objs:
+            x, y, w, h = obj['bbox']
+            x1 = np.max((0, x))
+            y1 = np.max((0, y))
+            x2 = np.min((width - 1, x1 + np.max((0, w - 1))))
+            y2 = np.min((height - 1, y1 + np.max((0, h - 1))))
+            if 'crowdpose' in self.root:
+                obj['area'] = 1
+            if obj['area'] > 0 and x2 >= x1 and y2 >= y1:
+                obj['clean_bbox'] = [x1, y1, x2-x1, y2-y1]
+                valid_objs.append(obj)
+        objs = valid_objs
+        rec = []
+        for obj in objs:
+            cls = self._coco_ind_to_class_ind[obj['category_id']]
+            if cls != 1:
+                continue
+            # ignore objs without keypoints annotation
+            if max(obj['keypoints']) == 0:
+                continue
+            joints_3d = np.zeros((self.num_joints, 3), dtype=np.float32)
+            joints_3d_vis = np.zeros((self.num_joints, 3), dtype=np.float32)
+            for ipt in range(self.num_joints):
+                joints_3d[ipt, 0] = obj['keypoints'][ipt * 3 + 0]
+                joints_3d[ipt, 1] = obj['keypoints'][ipt * 3 + 1]
+                joints_3d[ipt, 2] = 0
+                t_vis = obj['keypoints'][ipt * 3 + 2]
+                if t_vis > 1:
+                    t_vis = 1
+                joints_3d_vis[ipt, 0] = t_vis
+                joints_3d_vis[ipt, 1] = t_vis
+                joints_3d_vis[ipt, 2] = 0
+            center, scale = self._box2cs(obj['clean_bbox'][:4])
+            rec.append({
+                'image': self.image_path_from_index(index, im_ann),
+                'center': center,
+                'scale': scale,
+                'joints_3d': joints_3d,
+                'joints_3d_vis': joints_3d_vis,
+                'filename': '',
+                'imgnum': 0,
+            })
+        return rec
+    def _box2cs(self, box):
+        x, y, w, h = box[:4]
+        return self._xywh2cs(x, y, w, h)
+    def _xywh2cs(self, x, y, w, h):
+        center = np.zeros((2), dtype=np.float32)
+        center[0] = x + w * 0.5
+        center[1] = y + h * 0.5
+        if w > self.aspect_ratio * h:
+            h = w * 1.0 / self.aspect_ratio
+        elif w < self.aspect_ratio * h:
+            w = h * self.aspect_ratio
+        scale = np.array(
+            [w * 1.0 / self.pixel_std, h * 1.0 / self.pixel_std],
+            dtype=np.float32)
+        if center[0] != -1:
+            scale = scale * 1.25
+        return center, scale
+    def image_path_from_index(self, index, im_ann):
+        """ example: images / train2017 / 000000119993.jpg """
+        if 'coco' in self.root:
+            file_name = '%012d.jpg' % index
+            if '2014' in self.image_set:
+                file_name = 'COCO_%s_' % self.image_set + file_name
+            prefix = 'test2017' if 'test' in self.image_set else self.image_set
+            data_name = prefix
+            image_path = os.path.join(
+                self.root, 'images', data_name, file_name)
+            return image_path
+        elif 'crowdpose' in self.root:
+            file_name = f'{index}.jpg'
+            image_path = os.path.join(
+                self.root, 'images', file_name)
+            return image_path
+        elif 'aic' in self.root:
+            file_name = im_ann["file_name"]
+            image_path = os.path.join(
+                self.root, 'ai_challenger_keypoint_train_20170902', 'keypoint_train_images_20170902', file_name)
+            return image_path
+def flip_back(output_flipped, matched_parts):
+    '''
+    output_flipped: numpy.ndarray(batch_size, num_joints, height, width)
+    '''
+    assert output_flipped.ndim == 4,\
+        'output_flipped should be [batch_size, num_joints, height, width]'
+    output_flipped = output_flipped[:, :, :, ::-1]
+    for pair in matched_parts:
+        tmp = output_flipped[:, pair[0], :, :].copy()
+        output_flipped[:, pair[0], :, :] = output_flipped[:, pair[1], :, :]
+        output_flipped[:, pair[1], :, :] = tmp
+    return output_flipped
+def fliplr_joints(joints, joints_vis, width, matched_parts):
+    """
+    flip coords
+    """
+    # Flip horizontal
+    joints[:, 0] = width - joints[:, 0] - 1
+    # Change left-right parts
+    for pair in matched_parts:
+        joints[pair[0], :], joints[pair[1], :] = \
+            joints[pair[1], :], joints[pair[0], :].copy()
+        joints_vis[pair[0], :], joints_vis[pair[1], :] = \
+            joints_vis[pair[1], :], joints_vis[pair[0], :].copy()
+    return joints*joints_vis, joints_vis
+def get_affine_transform(
+        center, scale, rot, output_size,
+        shift=np.array([0, 0], dtype=np.float32), inv=0
+):
+    if not isinstance(scale, np.ndarray) and not isinstance(scale, list):
+        scale = np.array([scale, scale])
+    # Plain lists do not support elementwise multiplication, so convert before scaling.
+    scale_tmp = np.asarray(scale) * 200.0
+    src_w = scale_tmp[0]
+    dst_w = output_size[0]
+    dst_h = output_size[1]
+    rot_rad = np.pi * rot / 180
+    src_dir = get_dir([0, src_w * -0.5], rot_rad)
+    dst_dir = np.array([0, dst_w * -0.5], np.float32)
+    src = np.zeros((3, 2), dtype=np.float32)
+    dst = np.zeros((3, 2), dtype=np.float32)
+    src[0, :] = center + scale_tmp * shift
+    src[1, :] = center + src_dir + scale_tmp * shift
+    dst[0, :] = [dst_w * 0.5, dst_h * 0.5]
+    dst[1, :] = np.array([dst_w * 0.5, dst_h * 0.5]) + dst_dir
+    src[2:, :] = get_3rd_point(src[0, :], src[1, :])
+    dst[2:, :] = get_3rd_point(dst[0, :], dst[1, :])
+    if inv:
+        trans = cv2.getAffineTransform(np.float32(dst), np.float32(src))
+    else:
+        trans = cv2.getAffineTransform(np.float32(src), np.float32(dst))
+    return trans
+def affine_transform(pt, t):
+    new_pt = np.array([pt[0], pt[1], 1.]).T
+    new_pt = np.dot(t, new_pt)
+    return new_pt[:2]
+def get_3rd_point(a, b):
+    direct = a - b
+    return b + np.array([-direct[1], direct[0]], dtype=np.float32)
+def get_dir(src_point, rot_rad):
+    sn, cs = np.sin(rot_rad), np.cos(rot_rad)
+    src_result = [0, 0]
+    src_result[0] = src_point[0] * cs - src_point[1] * sn
+    src_result[1] = src_point[0] * sn + src_point[1] * cs
+    return src_result
+class CrowdPoseDataset(COCODataset):
+    def __init__(self, root, image_set, is_train, max_prompt_num=5, min_prompt_num=1,
+            radius=10, size=256, transparency=0.0, sample_weight=1.0, transform=None):
+        super().__init__(root, image_set, is_train, max_prompt_num, min_prompt_num,
+            radius, size, transparency, sample_weight, transform)
+        self.keypoints_type = {
+                0: 'left_shoulder',
+                1: 'right_shoulder',
+                2: 'left_elbow',
+                3: 'right_elbow',
+                4: 'left_wrist',
+                5: 'right_wrist',
+                6: 'left_hip',
+                7: 'right_hip',
+                8: 'left_knee',
+                9: 'right_knee',
+                10: 'left_ankle',
+                11: 'right_ankle',
+                12: 'top_head',
+                13: 'neck'
+            }
+        self.num_joints = 14
+        self.prob_half_body = -1
+        self.flip_pairs = [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9], [10, 11]]
+        self.parent_ids = None
+        self.upper_body_ids = (0, 1, 2, 3, 4, 5, 12, 13)
+        self.lower_body_ids = (6, 7, 8, 9, 10, 11)
+        self.db = self._get_db()
+        logger.info('=> load {} samples'.format(len(self.db)))
+class AICDataset(COCODataset):
+    def __init__(self, root, image_set, is_train, max_prompt_num=5, min_prompt_num=1,
+            radius=10, size=256, transparency=0.0, sample_weight=1.0, transform=None):
+        super().__init__(root, image_set, is_train, max_prompt_num, min_prompt_num,
+            radius, size, transparency, sample_weight, transform)
+        self.keypoints_type = {
+                0: "right_shoulder",
+                1: "right_elbow",
+                2: "right_wrist",
+                3: "left_shoulder",
+                4: "left_elbow",
+                5: "left_wrist",
+                6: "right_hip",
+                7: "right_knee",
+                8: "right_ankle",
+                9: "left_hip",
+                10: "left_knee",
+                11: "left_ankle",
+                12: "head_top",
+                13: "neck"
+            }
+        self.num_joints = 14
+        self.prob_half_body = -1
+        self.flip_pairs = [[0, 3], [1, 4], [2, 5], [6, 9], [7, 10], [8, 11]]
+        self.parent_ids = None
+        self.upper_body_ids = (0, 1, 2, 3, 4, 5, 12, 13)
+        self.lower_body_ids = (6, 7, 8, 9, 10, 11)
+        self.db = self._get_db()
+        logger.info('=> load {} samples'.format(len(self.db)))
+class MPIIDataset(PoseDataset):
+    def __init__(self, root, image_set, is_train, max_prompt_num=5, min_prompt_num=1,
+            radius=10, size=256, transparency=0.0, sample_weight=1.0, transform=None):
+        super().__init__(root, image_set, is_train, max_prompt_num, min_prompt_num,
+            radius, size, transparency, sample_weight, transform)
+        self.keypoints_type = {
+                0: 'right_ankle',
+                1: 'right_knee',
+                2: 'right_hip',
+                3: 'left_hip',
+                4: 'left_knee',
+                5: 'left_ankle',
+                6: 'pelvis',
+                7: 'thorax',
+                8: 'upper_neck',
+                9: 'head_top',
+                10: 'right_wrist',
+                11: 'right_elbow',
+                12: 'right_shoulder',
+                13: 'left_shoulder',
+                14: 'left_elbow',
+                15: 'left_wrist'
+            }
+        self.data_format = 'jpg'
+        self.num_joints = 16
+        self.prob_half_body = -1
+        self.flip_pairs = [[0, 5], [1, 4], [2, 3], [10, 15], [11, 14], [12, 13]]
+        self.parent_ids = None
+        self.upper_body_ids = (7, 8, 9, 10, 11, 12, 13, 14, 15)
+        self.lower_body_ids = (0, 1, 2, 3, 4, 5, 6)
+        self.db = self._get_db()
+        logger.info('=> load {} samples'.format(len(self.db)))
+    def _get_db(self):
+        # create train/val split
+        file_name = os.path.join(
+            self.root, 'annot', self.image_set+'.json'
+        )
+        with open(file_name) as anno_file:
+            anno = json.load(anno_file)
+        gt_db = []
+        for a in anno:
+            image_name = a['image']
+            c = np.array(a['center'], dtype=np.float32)
+            s = np.array([a['scale'], a['scale']], dtype=np.float32)
+            # Adjust center/scale slightly to avoid cropping limbs
+            if c[0] != -1:
+                c[1] = c[1] + 15 * s[1]
+                s = s * 1.25
+            # MPII annotations follow MATLAB's 1-based indexing,
+            # so shift the coordinates to 0-based indices
+            c = c - 1
+            joints_3d = np.zeros((self.num_joints, 3), dtype=np.float32)
+            joints_3d_vis = np.zeros((self.num_joints,  3), dtype=np.float32)
+            if self.image_set != 'test':
+                joints = np.array(a['joints'])
+                joints[:, 0:2] = joints[:, 0:2] - 1
+                joints_vis = np.array(a['joints_vis'])
+                assert len(joints) == self.num_joints, \
+                    'joint num diff: {} vs {}'.format(len(joints),
+                                                      self.num_joints)
+                joints_3d[:, 0:2] = joints[:, 0:2]
+                joints_3d_vis[:, 0] = joints_vis[:]
+                joints_3d_vis[:, 1] = joints_vis[:]
+            image_dir = 'images.zip@' if self.data_format == 'zip' else 'images'
+            gt_db.append(
+                {
+                    'image': os.path.join(self.root, image_dir, image_name),
+                    'center': c,
+                    'scale': s,
+                    'joints_3d': joints_3d,
+                    'joints_3d_vis': joints_3d_vis,
+                    'filename': '',
+                    'imgnum': 0,
+                }
+            )
+        return gt_db
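
The center/scale arithmetic in `_get_db` can be looked at in isolation. The following is a minimal sketch rather than the repository's code: `adjust_center_scale` is a hypothetical helper that repeats the same steps on plain Python floats (shift the center down by 15 pixels per unit of scale, enlarge the scale by 1.25, then move from 1-based to 0-based coordinates).

```python
def adjust_center_scale(center, scale):
    """Hypothetical helper mirroring the arithmetic in MPIIDataset._get_db."""
    cx, cy = center
    sx = sy = float(scale)
    # The source skips the adjustment when the x coordinate equals -1.
    if cx != -1:
        cy = cy + 15 * sy              # uses the scale before enlargement
        sx, sy = sx * 1.25, sy * 1.25  # enlarge the box to avoid cropping limbs
    # MPII coordinates are 1-based; shift them to 0-based.
    return (cx - 1, cy - 1), (sx, sy)


print(adjust_center_scale((101.0, 201.0), 2.0))
```

Note that the vertical shift is applied before the scale is enlarged, which matches the order of the statements in the source.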

dataset/prompt/color_list_train_small.txt ADDED

	@@ -0,0 +1,17 @@

+Red 纯红 #FF0000 255,0,0
+Purple 紫色 #800080 128,0,128
+Blue 纯蓝 #0000FF 0,0,255
+Green 纯绿 #008000 0,128,0
+Yellow 纯黄 #FFFF00 255,255,0
+White 纯白 #FFFFFF 255,255,255
+Black 纯黑 #000000 0,0,0
+Gray 灰色 #808080 128,128,128
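
Each line of this file carries four space-separated fields: an English name, a Chinese name, a hex code, and a comma-separated RGB triple. As a rough sketch of how such a line could be parsed (the helper name `parse_color_line` is an assumption and does not appear in the repository):

```python
def parse_color_line(line):
    """Split one color-list line into (name, hex_code, (R, G, B)).

    Assumes four space-separated fields: English name, Chinese name,
    hex code, and comma-separated RGB values.
    """
    fields = line.strip().split(" ")
    if len(fields) < 4:
        return None  # blank or malformed line
    name, _native_name, hex_code, rgb = fields[:4]
    r, g, b = (int(v) for v in rgb.split(","))
    return name, hex_code, (r, g, b)


print(parse_color_line("Purple 紫色 #800080 128,0,128"))
```

Because the fields are split on single spaces, a color name containing a space would shift the remaining fields, so names in this file need to stay single tokens.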

dataset/prompt/prompt_deblur.txt ADDED

	@@ -0,0 +1,10 @@

+Sharpen this blurry image
+Increase the sharpness of this unclear photo
+Correct the lack of focus in this misty picture
+Heighten the definition of this smeared image
+Clear up this fuzzy picture
+Refine this indistinct photograph
+Improve the focus of this hazy image
+Amend the softness of this out-of-focus photograph
+Polish the murkiness of this low-definition photo
+Rectify the vagueness of this blurred image

dataset/prompt/prompt_denoise.txt ADDED

	@@ -0,0 +1,10 @@

+Remove noise from this image
+Eliminate the noise in this picture
+Purify this photo by removing noise
+Clear up the image by filtering out noise
+Eradicate the noise from this photograph
+Minimize the noise present in this picture
+Cancel out the noise within this image
+Clean this photo by discarding the noise
+Suppress the noise in this visual representation
+Rectify the noise interference in this image

dataset/prompt/prompt_dewatermark.txt ADDED

	@@ -0,0 +1,10 @@

+Remove watermark from this picture
+Erase the watermark from this photograph.
+Extract the watermark from this image.
+Take out the watermark overlay from this photo.
+Wipe off the watermark imprint on this image.
+Detach the watermark from this visual representation.
+Get rid of the watermarking on this picture.
+Withdraw the watermark applied to this photograph.
+Clean up this image by deleting the watermark.
+Unmark this photo by removing the watermark.

dataset/prompt/prompt_pose.txt ADDED

	@@ -0,0 +1,10 @@

+Circle the {joint} of the people with the color {color},
+Use the {color} color to draw circles around the {joint} of the people,
+Make {color} circles around the {joint} of the people,
+Put {color} circles on the {joint} of the people,
+Draw {color} circles over the {joint} of the people,
+Surround the {joint} of the people with {color} circles,
+Use the color {color} to make circles on the {joint} of the people,
+Mark the {joint} of the people with {color} circles,
+Create {color} circles around the {joint} of the people,
+Use the color {color} to encircle the {joint} of the people,

dataset/prompt/prompt_seg.txt ADDED

	@@ -0,0 +1,11 @@

+Mark the pixels of {object} in {color} and leave the rest unchanged.
+Color the {object}'s pixels in {color}, keeping the remaining pixels unaltered.
+Apply {color} to the pixels of {object} while maintaining the current state of other pixels.
+Assign {color} to the pixels belonging to {object}, preserving the rest as they are.
+For {object}, set its pixels to {color} and let the others remain the same.
+Modify the pixels of {object} to {color} without affecting any other pixels.
+Set the {object} pixels to {color} and keep the other pixels in their original state.
+Update the pixels of {object} to {color}, but leave the other pixels untouched.
+Fill in the pixels of {object} with {color}, retaining the existing colors of the remaining pixels.
+Change the {object} pixels to {color}, while keeping the other pixels constant.
+Paint the pixels of {object} in {color} and maintain the current appearance of the other pixels.
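
These templates are filled with `str.format`, using the `{object}` and `{color}` placeholders. A minimal sketch of that step, assuming a hypothetical helper `build_seg_prompt` and only two of the templates above:

```python
import random

# Two templates copied from prompt_seg.txt; the full file has eleven.
TEMPLATES = [
    "Mark the pixels of {object} in {color} and leave the rest unchanged.",
    "Paint the pixels of {object} in {color} and maintain the current appearance of the other pixels.",
]


def build_seg_prompt(object_name, color_name, rng=random):
    """Pick a template at random and fill both placeholders in lowercase."""
    template = rng.choice(TEMPLATES)
    return template.format(object=object_name.lower(), color=color_name.lower())


print(build_seg_prompt("Dog", "Red"))
```

Keyword arguments to `format` mean the placeholders can appear in either order within a template, which the file relies on.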

dataset/seg/coco_stuff.py ADDED

	@@ -0,0 +1,175 @@

+# --------------------------------------------------------
+# InstructDiffusion
+# Based on instruct-pix2pix (https://github.com/timothybrooks/instruct-pix2pix)
+# Modified by Binxin Yang (tennyson@mail.ustc.edu.cn)
+# --------------------------------------------------------
+from __future__ import annotations
+import json
+import math
+from pathlib import Path
+from typing import Any
+import numpy as np
+import torch
+import torchvision
+from einops import rearrange
+from PIL import Image
+from torch.utils.data import Dataset
+import cv2
+import os
+import random
+import copy
+from glob import glob
+class COCOStuffDataset(Dataset):
+    def __init__(
+        self,
+        path: str,
+        path_edit: str = "None",
+        split: str = "train2017",
+        splits: tuple[float, float, float] = (0.9, 0.05, 0.05),
+        crop_res: int = 256,
+        flip_prob: float = 0.0,
+        transparency: float = 0,
+        batch_size: int = 10,
+        empty_percentage: float = 0,
+    ):
+        assert split in ("train2017", "val2017")
+        assert sum(splits) == 1
+        self.split = split
+        self.path = path
+        self.path_edit = path_edit
+        self.batch_size = batch_size
+        self.crop_res = crop_res
+        self.flip_prob = flip_prob
+        self.empty_percentage = empty_percentage
+        self.transparency = transparency
+        if self.split in ["train2017", "val2017"]:
+            file_list = sorted(glob(os.path.join(self.path, "images", self.split, "*.jpg")))
+            assert len(file_list) > 0, "{} has no image".format(
+                os.path.join(self.path, "images", self.split)
+            )
+            file_list = [f.split("/")[-1].replace(".jpg", "") for f in file_list]
+            self.files = file_list
+        else:
+            raise ValueError("Invalid split name: {}".format(self.split))
+        seg_diverse_prompt_path = 'dataset/prompt/prompt_seg.txt'
+        self.seg_diverse_prompt_list=[]
+        with open(seg_diverse_prompt_path) as f:
+            line=f.readline()
+            while line:
+                line=line.strip('\n')
+                self.seg_diverse_prompt_list.append(line)
+                line=f.readline()
+        color_list_file_path='dataset/prompt/color_list_train_small.txt'
+        self.color_list=[]
+        with open(color_list_file_path) as f:
+            line = f.readline()
+            while line:
+                line_split = line.strip('\n').split(" ")
+                if len(line_split)>1:
+                    temp = []
+                    for i in range(4):
+                        temp.append(line_split[i])
+                    self.color_list.append(temp)
+                line = f.readline()
+        coco_label_list_path = self.path + '/labels.txt'
+        self.label_dict={}
+        with open(coco_label_list_path) as f:
+            line = f.readline()
+            while line:
+                line_split = line.strip('\n').split(": ")
+                self.label_dict[int(line_split[0])]=line_split[1]
+                line = f.readline()
+    def __len__(self) -> int:
+        length=len(self.files)
+        return length
+    def _augmentation_new(self, image, label):
+        # Cropping
+        h, w = label.shape
+        if h > w:
+            start_h = random.randint(0, h - w)
+            end_h = start_h + w
+            image = image[start_h:end_h]
+            label = label[start_h:end_h]
+        elif h < w:
+            start_w = random.randint(0, w - h)
+            end_w = start_w + h
+            image = image[:, start_w:end_w]
+            label = label[:, start_w:end_w]
+        else:
+            pass
+        image = Image.fromarray(image).resize((self.crop_res, self.crop_res), resample=Image.Resampling.LANCZOS)
+        image = np.asarray(image, dtype=np.uint8)
+        label = Image.fromarray(label).resize((self.crop_res, self.crop_res), resample=Image.Resampling.NEAREST)
+        label = np.asarray(label, dtype=np.int64)
+        return image, label
+    def __getitem__(self, i):
+        image_id = self.files[i]
+        img_path = os.path.join(self.path, "images", self.split, image_id + ".jpg")
+        mask_path = os.path.join(self.path, "annotations", self.split, image_id + ".png")
+        label = Image.open(mask_path).convert("L")
+        image = Image.open(img_path).convert("RGB")
+        label = np.asarray(label)
+        image = np.asarray(image)
+        image, label = self._augmentation_new(image,label)
+        label_list = np.unique(label)
+        label_list = list(label_list)
+        # Classes absent from this crop (built directly, so no list is mutated while iterating)
+        label_list_rest = [idx for idx in range(182) if idx not in label_list]
+        if 255 in label_list:
+            label_list.remove(255)
+        if len(label_list)!=0:
+            label_idx = random.choice(label_list)
+            if random.uniform(0, 1) < self.empty_percentage:
+                label_idx = random.choice(label_list_rest)
+            class_name = self.label_dict[label_idx+1]
+            prompt = random.choice(self.seg_diverse_prompt_list)
+            color = random.choice(self.color_list)
+            color_name = color[0]
+            prompt = prompt.format(color=color_name.lower(), object=class_name.lower())
+            R, G, B = color[3].split(",")
+            R = int(R)
+            G = int(G)
+            B = int(B)
+        else:
+            label_idx = 200
+            prompt = "leave the picture as it is."
+        mask = (label==label_idx)
+        image_0 = Image.fromarray(image)
+        image_1 = copy.deepcopy(image)
+        if len(label_list)!=0:
+            image_1[:,:,0][mask]=self.transparency*image_1[:,:,0][mask]+(1-self.transparency)*R
+            image_1[:,:,1][mask]=self.transparency*image_1[:,:,1][mask]+(1-self.transparency)*G
+            image_1[:,:,2][mask]=self.transparency*image_1[:,:,2][mask]+(1-self.transparency)*B
+        image_1 = Image.fromarray(image_1)
+        # return image_0, image_1, prompt
+        image_0 = rearrange(2 * torch.tensor(np.array(image_0)).float() / 255 - 1, "h w c -> c h w")
+        image_1 = rearrange(2 * torch.tensor(np.array(image_1)).float() / 255 - 1, "h w c -> c h w")
+        mask = torch.tensor(mask).float()
+        crop = torchvision.transforms.RandomCrop(self.crop_res)
+        flip = torchvision.transforms.RandomHorizontalFlip(float(self.flip_prob))
+        image_0, image_1 = flip(crop(torch.cat((image_0, image_1)))).chunk(2)
+        return dict(edited=image_1, edit=dict(c_concat=image_0, c_crossattn=prompt))
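
The target image in `__getitem__` is produced by blending a flat color into the masked pixels: `transparency * pixel + (1 - transparency) * color`. The sketch below shows the same formula on a whole array at once. `blend_mask` is a hypothetical helper, and unlike the source (which writes floats back into a `uint8` array, truncating them) it rounds before converting, which is an assumption of this example.

```python
import numpy as np


def blend_mask(image, mask, rgb, transparency=0.0):
    """Return a copy of `image` (H, W, 3 uint8) with `mask` pixels tinted by `rgb`."""
    out = image.astype(np.float32)  # astype returns a copy, so `image` is untouched
    color = np.asarray(rgb, dtype=np.float32)
    # A boolean (H, W) mask selects an (N, 3) block; `color` broadcasts over it.
    out[mask] = transparency * out[mask] + (1.0 - transparency) * color
    return np.clip(np.rint(out), 0, 255).astype(np.uint8)


image = np.full((2, 2, 3), 100, dtype=np.uint8)
mask = np.array([[True, False], [False, False]])
print(blend_mask(image, mask, (255, 0, 0), transparency=0.25)[0, 0])
```

With `transparency=0` the masked region becomes a solid color, which corresponds to the default in the dataset class.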

dataset/seg/grefcoco.py ADDED

	@@ -0,0 +1,329 @@

+"""
+grefer v0.1
+This interface provides access to gRefCOCO.
+The following API functions are defined:
+G_REFER      - REFER api class
+getRefIds    - get ref ids that satisfy given filter conditions.
+getAnnIds    - get ann ids that satisfy given filter conditions.
+getImgIds    - get image ids that satisfy given filter conditions.
+getCatIds    - get category ids that satisfy given filter conditions.
+loadRefs     - load refs with the specified ref ids.
+loadAnns     - load anns with the specified ann ids.
+loadImgs     - load images with the specified image ids.
+loadCats     - load category names with the specified category ids.
+getRefBox    - get ref's bounding box [x, y, w, h] given the ref_id
+showRef      - show image, segmentation or box of the referred object with the ref
+getMaskByRef - get mask and area of the referred object given ref or ref ids
+getMask      - get mask and area of the referred object given ref
+showMask     - show mask of the referred object given ref
+"""
+import os.path as osp
+import json
+import pickle
+import time
+import itertools
+import skimage.io as io
+import matplotlib.pyplot as plt
+from matplotlib.collections import PatchCollection
+from matplotlib.patches import Polygon, Rectangle
+import numpy as np
+from pycocotools import mask
+class G_REFER:
+    def __init__(self, data_root, dataset='grefcoco', splitBy='unc'):
+        # provide data_root folder which contains grefcoco
+        print('loading dataset %s into memory...' % dataset)
+        self.ROOT_DIR = osp.abspath(osp.dirname(__file__))
+        self.DATA_DIR = osp.join(data_root, dataset)
+        if dataset in ['grefcoco']:
+            self.IMAGE_DIR = osp.join(data_root, 'images/train2014')
+        else:
+            raise KeyError('No refer dataset is called [%s]' % dataset)
+        tic = time.time()
+        # load refs from data/dataset/refs(dataset).json
+        self.data = {}
+        self.data['dataset'] = dataset
+        ref_file = osp.join(self.DATA_DIR, f'grefs({splitBy}).p')
+        if osp.exists(ref_file):
+            self.data['refs'] = pickle.load(open(ref_file, 'rb'),fix_imports=True)
+        else:
+            ref_file = osp.join(self.DATA_DIR, f'grefs({splitBy}).json')
+            if osp.exists(ref_file):
+                self.data['refs'] = json.load(open(ref_file, 'rb'))
+            else:
+                raise FileNotFoundError('JSON file not found')
+        # load annotations from data/dataset/instances.json
+        instances_file = osp.join(self.DATA_DIR, 'instances.json')
+        instances = json.load(open(instances_file, 'r'))
+        self.data['images'] = instances['images']
+        self.data['annotations'] = instances['annotations']
+        self.data['categories'] = instances['categories']
+        # create index
+        self.createIndex()
+        print('DONE (t=%.2fs)' % (time.time()-tic))
+    @staticmethod
+    def _toList(x):
+        return x if isinstance(x, list) else [x]
+    @staticmethod
+    def match_any(a, b):
+        a = a if isinstance(a, list) else [a]
+        b = b if isinstance(b, list) else [b]
+        return set(a) & set(b)
+    def createIndex(self):
+        # create sets of mapping
+        # 1)  Refs: 	 	{ref_id: ref}
+        # 2)  Anns: 	 	{ann_id: ann}
+        # 3)  Imgs:		 	{image_id: image}
+        # 4)  Cats: 	 	{category_id: category_name}
+        # 5)  Sents:     	{sent_id: sent}
+        # 6)  imgToRefs: 	{image_id: refs}
+        # 7)  imgToAnns: 	{image_id: anns}
+        # 8)  refToAnn:  	{ref_id: ann}
+        # 9)  annToRef:  	{ann_id: ref}
+        # 10) catToRefs: 	{category_id: refs}
+        # 11) sentToRef: 	{sent_id: ref}
+        # 12) sentToTokens: {sent_id: tokens}
+        print('creating index...')
+        # fetch info from instances
+        Anns, Imgs, Cats, imgToAnns = {}, {}, {}, {}
+        Anns[-1] = None
+        for ann in self.data['annotations']:
+            Anns[ann['id']] = ann
+            imgToAnns[ann['image_id']] = imgToAnns.get(ann['image_id'], []) + [ann]
+        for img in self.data['images']:
+            Imgs[img['id']] = img
+        for cat in self.data['categories']:
+            Cats[cat['id']] = cat['name']
+        # fetch info from refs
+        Refs, imgToRefs, refToAnn, annToRef, catToRefs = {}, {}, {}, {}, {}
+        Sents, sentToRef, sentToTokens = {}, {}, {}
+        availableSplits = []
+        for ref in self.data['refs']:
+            # ids
+            ref_id = ref['ref_id']
+            ann_id = ref['ann_id']
+            category_id = ref['category_id']
+            image_id = ref['image_id']
+            if ref['split'] not in availableSplits:
+                availableSplits.append(ref['split'])
+            # add mapping related to ref
+            if ref_id in Refs:
+                print('Duplicate ref id')
+            Refs[ref_id] = ref
+            imgToRefs[image_id] = imgToRefs.get(image_id, []) + [ref]
+            category_id = self._toList(category_id)
+            added_cats = []
+            for cat in category_id:
+                if cat not in added_cats:
+                    added_cats.append(cat)
+                    catToRefs[cat] = catToRefs.get(cat, []) + [ref]
+            ann_id = self._toList(ann_id)
+            refToAnn[ref_id] = [Anns[ann] for ann in ann_id]
+            for ann_id_n in ann_id:
+                annToRef[ann_id_n] = annToRef.get(ann_id_n, []) + [ref]
+            # add mapping of sent
+            for sent in ref['sentences']:
+                Sents[sent['sent_id']] = sent
+                sentToRef[sent['sent_id']] = ref
+                sentToTokens[sent['sent_id']] = sent['tokens']
+        # create class members
+        self.Refs = Refs
+        self.Anns = Anns
+        self.Imgs = Imgs
+        self.Cats = Cats
+        self.Sents = Sents
+        self.imgToRefs = imgToRefs
+        self.imgToAnns = imgToAnns
+        self.refToAnn = refToAnn
+        self.annToRef = annToRef
+        self.catToRefs = catToRefs
+        self.sentToRef = sentToRef
+        self.sentToTokens = sentToTokens
+        self.availableSplits = availableSplits
+        print('index created.')
+    def getRefIds(self, image_ids=[], cat_ids=[], split=[]):
+        image_ids = self._toList(image_ids)
+        cat_ids = self._toList(cat_ids)
+        split = self._toList(split)
+        for s in split:
+            if s not in self.availableSplits:
+                raise ValueError(f'Invalid split name: {s}')
+        refs = self.data['refs']
+        if len(image_ids) > 0:
+            lists = [self.imgToRefs[image_id] for image_id in image_ids]
+            refs = list(itertools.chain.from_iterable(lists))
+        if len(cat_ids) > 0:
+            refs = [ref for ref in refs if self.match_any(ref['category_id'], cat_ids)]
+        if len(split) > 0:
+            refs = [ref for ref in refs if ref['split'] in split]
+        ref_ids = [ref['ref_id'] for ref in refs]
+        return ref_ids
+    def getAnnIds(self, image_ids=[], ref_ids=[]):
+        image_ids = self._toList(image_ids)
+        ref_ids = self._toList(ref_ids)
+        if any([len(image_ids), len(ref_ids)]):
+            if len(image_ids) > 0:
+                lists = [self.imgToAnns[image_id] for image_id in image_ids if image_id in self.imgToAnns]
+                anns = list(itertools.chain.from_iterable(lists))
+            else:
+                anns = self.data['annotations']
+            ann_ids = [ann['id'] for ann in anns]
+            if len(ref_ids) > 0:
+                lists = [self.Refs[ref_id]['ann_id'] for ref_id in ref_ids]
+                anns_by_ref_id = list(itertools.chain.from_iterable(lists))
+                ann_ids = list(set(ann_ids).intersection(set(anns_by_ref_id)))
+        else:
+            ann_ids = [ann['id'] for ann in self.data['annotations']]
+        return ann_ids
+    def getImgIds(self, ref_ids=[]):
+        ref_ids = self._toList(ref_ids)
+        if len(ref_ids) > 0:
+            image_ids = list(set([self.Refs[ref_id]['image_id'] for ref_id in ref_ids]))
+        else:
+            image_ids = self.Imgs.keys()
+        return image_ids
+    def getCatIds(self):
+        return self.Cats.keys()
+    def loadRefs(self, ref_ids=[]):
+        return [self.Refs[ref_id] for ref_id in self._toList(ref_ids)]
+    def loadAnns(self, ann_ids=[]):
+        if isinstance(ann_ids, str):
+            ann_ids = int(ann_ids)
+        return [self.Anns[ann_id] for ann_id in self._toList(ann_ids)]
+    def loadImgs(self, image_ids=[]):
+        return [self.Imgs[image_id] for image_id in self._toList(image_ids)]
+    def loadCats(self, cat_ids=[]):
+        return [self.Cats[cat_id] for cat_id in self._toList(cat_ids)]
+    def getRefBox(self, ref_id):
+        anns = self.refToAnn[ref_id]
+        return [ann['bbox'] for ann in anns]  # [x, y, w, h]
+    def showRef(self, ref, seg_box='seg'):
+        ax = plt.gca()
+        # show image
+        image = self.Imgs[ref['image_id']]
+        I = io.imread(osp.join(self.IMAGE_DIR, image['file_name']))
+        ax.imshow(I)
+        # show refer expression
+        for sid, sent in enumerate(ref['sentences']):
+            print('%s. %s' % (sid+1, sent['sent']))
+        # show segmentations
+        if seg_box == 'seg':
+            ann_id = ref['ann_id']
+            ann = self.Anns[ann_id]
+            polygons = []
+            color = []
+            c = 'none'
+            if type(ann['segmentation'][0]) == list:
+                # polygon used for refcoco*
+                for seg in ann['segmentation']:
+                    poly = np.array(seg).reshape((len(seg)//2, 2))  # integer division: reshape needs int dims
+                    polygons.append(Polygon(poly, closed=True, alpha=0.4))
+                    color.append(c)
+                p = PatchCollection(polygons, facecolors=color, edgecolors=(1,1,0,0), linewidths=3, alpha=1)
+                ax.add_collection(p)  # thick yellow polygon
+                p = PatchCollection(polygons, facecolors=color, edgecolors=(1,0,0,0), linewidths=1, alpha=1)
+                ax.add_collection(p)  # thin red polygon
+            else:
+                # mask used for refclef
+                rle = ann['segmentation']
+                m = mask.decode(rle)
+                img = np.ones( (m.shape[0], m.shape[1], 3) )
+                color_mask = np.array([2.0,166.0,101.0])/255
+                for i in range(3):
+                    img[:,:,i] = color_mask[i]
+                ax.imshow(np.dstack( (img, m*0.5) ))
+        # show bounding-box
+        elif seg_box == 'box':
+            # getRefBox returns one [x, y, w, h] box per annotation of the ref
+            for bbox in self.getRefBox(ref['ref_id']):
+                box_plot = Rectangle((bbox[0], bbox[1]), bbox[2], bbox[3], fill=False, edgecolor='green', linewidth=3)
+                ax.add_patch(box_plot)
+    def getMask(self, ann):
+        if not ann:
+            return None
+        if ann['iscrowd']:
+            raise ValueError('Crowd object')
+        image = self.Imgs[ann['image_id']]
+        if type(ann['segmentation'][0]) == list: # polygon
+            rle = mask.frPyObjects(ann['segmentation'], image['height'], image['width'])
+        else:
+            rle = ann['segmentation']
+        m = mask.decode(rle)
+        m = np.sum(m, axis=2)  # sometimes there are multiple binary map (corresponding to multiple segs)
+        m = m.astype(np.uint8) # convert to np.uint8
+        # compute area
+        area = sum(mask.area(rle))  # should be close to ann['area']
+        return {'mask': m, 'area': area}
+    def getMaskByRef(self, ref=None, ref_id=None, merge=False):
+        if not ref and not ref_id:
+            raise ValueError
+        if ref:
+            ann_ids = ref['ann_id']
+            ref_id = ref['ref_id']
+        else:
+            ann_ids = self.getAnnIds(ref_ids=ref_id)
+        if ann_ids == [-1]:
+            img = self.Imgs[self.Refs[ref_id]['image_id']]
+            return {
+                'mask': np.zeros([img['height'], img['width']], dtype=np.uint8),
+                'empty': True
+                }
+        anns = self.loadAnns(ann_ids)
+        mask_list = [self.getMask(ann) for ann in anns if not ann['iscrowd']]
+        if merge:
+            merged_masks = sum([mask['mask'] for mask in mask_list])
+            merged_masks[np.where(merged_masks>1)] = 1
+            return {
+                'mask': merged_masks,
+                'empty': False
+                }
+        else:
+            return mask_list
+    def showMask(self, ref):
+        M = self.getMaskByRef(ref=ref, merge=True)  # getMask expects an annotation, not a ref
+        msk = M['mask']
+        ax = plt.gca()
+        ax.imshow(msk)
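
The `merge=True` branch of `getMaskByRef` sums the per-annotation masks and clamps values above 1 back to 1. For binary masks this is a union. The sketch below expresses that with a logical OR; `merge_masks` is a hypothetical helper, and returning an all-zero mask for an empty list is an assumption of this example rather than behavior taken from the source.

```python
import numpy as np


def merge_masks(masks, height, width):
    """Union of binary masks; an empty list yields an all-zero mask."""
    merged = np.zeros((height, width), dtype=np.uint8)
    for m in masks:
        merged |= (np.asarray(m) > 0).astype(np.uint8)
    return merged


a = np.array([[1, 0], [0, 0]], dtype=np.uint8)
b = np.array([[1, 1], [0, 0]], dtype=np.uint8)
print(merge_masks([a, b], 2, 2))
```

Starting from a preallocated array avoids relying on `sum` over a possibly empty list, which would return the integer `0` instead of an array.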

dataset/seg/grefcoco_segmentation.py ADDED

	@@ -0,0 +1,149 @@

+# --------------------------------------------------------
+# InstructDiffusion
+# Based on instruct-pix2pix (https://github.com/timothybrooks/instruct-pix2pix)
+# Modified by Binxin Yang (tennyson@mail.ustc.edu.cn)
+# --------------------------------------------------------
+from __future__ import annotations
+import os
+import random
+import copy
+import json
+import math
+from pathlib import Path
+from typing import Any
+import numpy as np
+import torch
+import torchvision
+from einops import rearrange
+from PIL import Image
+from torch.utils.data import Dataset
+from dataset.seg.grefcoco import G_REFER
+class GrefCOCODataset(Dataset):
+    def __init__(
+        self,
+        path: str,
+        split: str = "train",
+        min_resize_res: int = 256,
+        max_resize_res: int = 256,
+        crop_res: int = 256,
+        flip_prob: float = 0.0,
+        transparency: float = 0.0,
+        test: bool = False,
+    ):
+        assert split in ("train", "val", "test")
+        self.path = path
+        self.min_resize_res = min_resize_res
+        self.max_resize_res = max_resize_res
+        self.crop_res = crop_res
+        self.flip_prob = flip_prob
+        self.G_ref_dataset=G_REFER(data_root=path)
+        self.IMAGE_DIR = os.path.join(path, 'images/train2014')
+        self.list_ref=self.G_ref_dataset.getRefIds(split=split)
+        self.transparency = transparency
+        self.test = test
+        seg_diverse_prompt_path = 'dataset/prompt/prompt_seg.txt'
+        self.seg_diverse_prompt_list=[]
+        with open(seg_diverse_prompt_path) as f:
+            line=f.readline()
+            while line:
+                line=line.strip('\n')
+                self.seg_diverse_prompt_list.append(line)
+                line=f.readline()
+        color_list_file_path='dataset/prompt/color_list_train_small.txt'
+        self.color_list=[]
+        with open(color_list_file_path) as f:
+            line = f.readline()
+            while line:
+                line_split = line.strip('\n').split(" ")
+                if len(line_split)>1:
+                    temp = []
+                    for i in range(4):
+                        temp.append(line_split[i])
+                    self.color_list.append(temp)
+                line = f.readline()
+    def __len__(self) -> int:
+        return len(self.list_ref)
+    def _augmentation_new(self, image, label):
+        # Cropping
+        h, w = label.shape
+        if h > w:
+            start_h = random.randint(0, h - w)
+            end_h = start_h + w
+            image = image[start_h:end_h]
+            label = label[start_h:end_h]
+        elif h < w:
+            start_w = random.randint(0, w - h)
+            end_w = start_w + h
+            image = image[:, start_w:end_w]
+            label = label[:, start_w:end_w]
+        else:
+            pass
+        image = Image.fromarray(image).resize((self.min_resize_res, self.min_resize_res), resample=Image.Resampling.LANCZOS)
+        image = np.asarray(image, dtype=np.uint8)
+        label = Image.fromarray(label).resize((self.min_resize_res, self.min_resize_res), resample=Image.Resampling.NEAREST)
+        label = np.asarray(label, dtype=np.int64)
+        return image, label
+    def __getitem__(self, i: int) -> dict[str, Any]:
+        ref_ids = self.list_ref[i]
+        ref = self.G_ref_dataset.loadRefs(ref_ids)[0]
+        sentences = random.choice(ref['sentences'])['sent']
+        prompt = random.choice(self.seg_diverse_prompt_list)
+        color = random.choice(self.color_list)
+        color_name = color[0]
+        prompt = prompt.format(color=color_name.lower(), object=sentences.lower())
+        R, G, B = color[3].split(",")
+        R = int(R)
+        G = int(G)
+        B = int(B)
+        image_name = self.G_ref_dataset.loadImgs(ref['image_id'])[0]['file_name']
+        image_path = os.path.join(self.IMAGE_DIR,image_name)
+        mask = self.G_ref_dataset.getMaskByRef(ref=ref,merge=True)['mask']
+        image = Image.open(image_path).convert("RGB")
+        image = np.asarray(image)
+        image, mask = self._augmentation_new(image,mask)
+        mask = (mask == 1)
+        image_0 = Image.fromarray(image)
+        image_1 = copy.deepcopy(image)
+        image_1[:,:,0][mask]=self.transparency*image_1[:,:,0][mask]+(1-self.transparency)*R
+        image_1[:,:,1][mask]=self.transparency*image_1[:,:,1][mask]+(1-self.transparency)*G
+        image_1[:,:,2][mask]=self.transparency*image_1[:,:,2][mask]+(1-self.transparency)*B
+        image_1 = Image.fromarray(image_1)
+        resize_res = torch.randint(self.min_resize_res, self.max_resize_res + 1, ()).item()
+        image_0 = image_0.resize((resize_res, resize_res), Image.Resampling.LANCZOS)
+        image_1 = image_1.resize((resize_res, resize_res), Image.Resampling.LANCZOS)
+        image_0 = rearrange(2 * torch.tensor(np.array(image_0)).float() / 255 - 1, "h w c -> c h w")
+        image_1 = rearrange(2 * torch.tensor(np.array(image_1)).float() / 255 - 1, "h w c -> c h w")
+        # Concatenate along channels so both images receive the same crop and flip
+        crop = torchvision.transforms.RandomCrop(self.crop_res)
+        flip = torchvision.transforms.RandomHorizontalFlip(float(self.flip_prob))
+        image_0, image_1 = flip(crop(torch.cat((image_0, image_1)))).chunk(2)
+        return dict(edited=image_1, edit=dict(c_concat=image_0, c_crossattn=prompt))
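
The dataset concatenates the source and target images along the channel axis before cropping and flipping, so that one random draw applies to both. The sketch below illustrates that idea for the crop step only, using NumPy in place of the torch and torchvision calls in the source; `paired_random_crop` is a hypothetical helper, not part of the repository.

```python
import numpy as np


def paired_random_crop(img_a, img_b, size, rng):
    """Crop the same random window from two (C, H, W) arrays."""
    stacked = np.concatenate((img_a, img_b), axis=0)  # shape (2C, H, W)
    _, h, w = stacked.shape
    top = int(rng.integers(0, h - size + 1))    # upper bound is exclusive
    left = int(rng.integers(0, w - size + 1))
    cropped = stacked[:, top:top + size, left:left + size]
    a, b = np.split(cropped, 2, axis=0)
    return a, b


rng = np.random.default_rng(0)
img_a = np.arange(3 * 4 * 4).reshape(3, 4, 4)
img_b = img_a + 1000
a, b = paired_random_crop(img_a, img_b, 2, rng)
print(a.shape, b.shape)
```

Cropping the two images separately would draw two independent windows and misalign the input with its edited target, which is why the stack-then-split pattern is used.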

dataset/seg/refcoco.py ADDED

	@@ -0,0 +1,354 @@

+__author__ = 'licheng'
+"""
+This interface provides access to four datasets:
+1) refclef
+2) refcoco
+3) refcoco+
+4) refcocog
+split by unc and google
+The following API functions are defined:
+REFER      - REFER api class
+getRefIds  - get ref ids that satisfy given filter conditions.
+getAnnIds  - get ann ids that satisfy given filter conditions.
+getImgIds  - get image ids that satisfy given filter conditions.
+getCatIds  - get category ids that satisfy given filter conditions.
+loadRefs   - load refs with the specified ref ids.
+loadAnns   - load anns with the specified ann ids.
+loadImgs   - load images with the specified image ids.
+loadCats   - load category names with the specified category ids.
+getRefBox  - get ref's bounding box [x, y, w, h] given the ref_id
+showRef    - show image, segmentation or box of the referred object with the ref
+getMask    - get mask and area of the referred object given ref
+showMask   - show mask of the referred object given ref
+"""
+import sys
+sys.path.append("./dataset")
+import os.path as osp
+import json
+import pickle
+import time
+import itertools
+import skimage.io as io
+import matplotlib.pyplot as plt
+from matplotlib.collections import PatchCollection
+from matplotlib.patches import Polygon, Rectangle
+from pprint import pprint
+import numpy as np
+from pycocotools import mask
+# import cv2
+# from skimage.measure import label, regionprops
+class REFER:
+	def __init__(self, data_root, dataset='refcoco', splitBy='unc'):
+		# provide data_root folder which contains refclef, refcoco, refcoco+ and refcocog
+		# also provide dataset name and splitBy information
+		# e.g., dataset = 'refcoco', splitBy = 'unc'
+		print('loading dataset %s into memory...' % dataset)
+		self.ROOT_DIR = osp.abspath(osp.dirname(__file__))
+		self.DATA_DIR = osp.join(data_root, dataset)
+		if dataset in ['refcoco', 'refcoco+', 'refcocog']:
+			self.IMAGE_DIR = osp.join(data_root, 'images/mscoco/images/train2014')
+		elif dataset == 'refclef':
+			self.IMAGE_DIR = osp.join(data_root, 'images/saiapr_tc-12')
+		else:
+			print('No refer dataset is called [%s]' % dataset)
+			sys.exit()
+		# load refs from data_root/dataset/refs(splitBy).p
+		tic = time.time()
+		ref_file = osp.join(self.DATA_DIR, 'refs('+splitBy+').p')
+		self.data = {}
+		self.data['dataset'] = dataset
+		with open(ref_file, 'rb') as f:
+			self.data['refs'] = pickle.load(f, fix_imports=True)
+		# load annotations from data_root/dataset/instances.json
+		instances_file = osp.join(self.DATA_DIR, 'instances.json')
+		with open(instances_file, 'r') as f:
+			instances = json.load(f)
+		self.data['images'] = instances['images']
+		self.data['annotations'] = instances['annotations']
+		self.data['categories'] = instances['categories']
+		# create index
+		self.createIndex()
+		print('DONE (t=%.2fs)' % (time.time()-tic))
+	def createIndex(self):
+		# create sets of mapping
+		# 1)  Refs: 	 	{ref_id: ref}
+		# 2)  Anns: 	 	{ann_id: ann}
+		# 3)  Imgs:		 	{image_id: image}
+		# 4)  Cats: 	 	{category_id: category_name}
+		# 5)  Sents:     	{sent_id: sent}
+		# 6)  imgToRefs: 	{image_id: refs}
+		# 7)  imgToAnns: 	{image_id: anns}
+		# 8)  refToAnn:  	{ref_id: ann}
+		# 9)  annToRef:  	{ann_id: ref}
+		# 10) catToRefs: 	{category_id: refs}
+		# 11) sentToRef: 	{sent_id: ref}
+		# 12) sentToTokens: {sent_id: tokens}
+		print('creating index...')
+		# fetch info from instances
+		Anns, Imgs, Cats, imgToAnns = {}, {}, {}, {}
+		for ann in self.data['annotations']:
+			Anns[ann['id']] = ann
+			imgToAnns[ann['image_id']] = imgToAnns.get(ann['image_id'], []) + [ann]
+		for img in self.data['images']:
+			Imgs[img['id']] = img
+		for cat in self.data['categories']:
+			Cats[cat['id']] = cat['name']
+		# fetch info from refs
+		Refs, imgToRefs, refToAnn, annToRef, catToRefs = {}, {}, {}, {}, {}
+		Sents, sentToRef, sentToTokens = {}, {}, {}
+		for ref in self.data['refs']:
+			# ids
+			ref_id = ref['ref_id']
+			ann_id = ref['ann_id']
+			category_id = ref['category_id']
+			image_id = ref['image_id']
+			# add mapping related to ref
+			Refs[ref_id] = ref
+			imgToRefs[image_id] = imgToRefs.get(image_id, []) + [ref]
+			catToRefs[category_id] = catToRefs.get(category_id, []) + [ref]
+			refToAnn[ref_id] = Anns[ann_id]
+			annToRef[ann_id] = ref
+			# add mapping of sent
+			for sent in ref['sentences']:
+				Sents[sent['sent_id']] = sent
+				sentToRef[sent['sent_id']] = ref
+				sentToTokens[sent['sent_id']] = sent['tokens']
+		# create class members
+		self.Refs = Refs
+		self.Anns = Anns
+		self.Imgs = Imgs
+		self.Cats = Cats
+		self.Sents = Sents
+		self.imgToRefs = imgToRefs
+		self.imgToAnns = imgToAnns
+		self.refToAnn = refToAnn
+		self.annToRef = annToRef
+		self.catToRefs = catToRefs
+		self.sentToRef = sentToRef
+		self.sentToTokens = sentToTokens
+		print('index created.')
+	def getRefIds(self, image_ids=[], cat_ids=[], ref_ids=[], split=''):
+		image_ids = image_ids if type(image_ids) == list else [image_ids]
+		cat_ids = cat_ids if type(cat_ids) == list else [cat_ids]
+		ref_ids = ref_ids if type(ref_ids) == list else [ref_ids]
+		if len(image_ids)==len(cat_ids)==len(ref_ids)==len(split)==0:
+			refs = self.data['refs']
+		else:
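+			if not len(image_ids) == 0:
+				# imgToRefs maps each image id to a list of refs, so flatten the per-image lists
+				lists = [self.imgToRefs[image_id] for image_id in image_ids if image_id in self.imgToRefs]
+				refs = list(itertools.chain.from_iterable(lists))
+			else:
+				refs = self.data['refs']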
+			if not len(cat_ids) == 0:
+				refs = [ref for ref in refs if ref['category_id'] in cat_ids]
+			if not len(ref_ids) == 0:
+				refs = [ref for ref in refs if ref['ref_id'] in ref_ids]
+			if not len(split) == 0:
+				if split in ['testA', 'testB', 'testC']:
+					refs = [ref for ref in refs if split[-1] in ref['split']] # we also consider testAB, testBC, ...
+				elif split in ['testAB', 'testBC', 'testAC']:
+					refs = [ref for ref in refs if ref['split'] == split]  # rarely used I guess...
+				elif split == 'test':
+					refs = [ref for ref in refs if 'test' in ref['split']]
+				elif split == 'train' or split == 'val':
+					refs = [ref for ref in refs if ref['split'] == split]
+				else:
+					print('No such split [%s]' % split)
+					sys.exit()
+		ref_ids = [ref['ref_id'] for ref in refs]
+		return ref_ids
+	def getAnnIds(self, image_ids=[], cat_ids=[], ref_ids=[]):
+		image_ids = image_ids if type(image_ids) == list else [image_ids]
+		cat_ids = cat_ids if type(cat_ids) == list else [cat_ids]
+		ref_ids = ref_ids if type(ref_ids) == list else [ref_ids]
+		if len(image_ids) == len(cat_ids) == len(ref_ids) == 0:
+			ann_ids = [ann['id'] for ann in self.data['annotations']]
+		else:
+			if not len(image_ids) == 0:
+				lists = [self.imgToAnns[image_id] for image_id in image_ids if image_id in self.imgToAnns]  # list of [anns]
+				anns = list(itertools.chain.from_iterable(lists))
+			else:
+				anns = self.data['annotations']
+			if not len(cat_ids) == 0:
+				anns = [ann for ann in anns if ann['category_id'] in cat_ids]
+			ann_ids = [ann['id'] for ann in anns]
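+			if not len(ref_ids) == 0:
+				# keep only the annotations that the requested refs point to
+				ref_ann_ids = set(self.Refs[ref_id]['ann_id'] for ref_id in ref_ids)
+				ann_ids = [ann_id for ann_id in ann_ids if ann_id in ref_ann_ids]
+		return ann_ids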
+	def getImgIds(self, ref_ids=[]):
+		ref_ids = ref_ids if type(ref_ids) == list else [ref_ids]
+		if not len(ref_ids) == 0:
+			image_ids = list(set([self.Refs[ref_id]['image_id'] for ref_id in ref_ids]))
+		else:
+			image_ids = list(self.Imgs.keys())
+		return image_ids
+	def getCatIds(self):
+		return list(self.Cats.keys())
+	def loadRefs(self, ref_ids=[]):
+		if type(ref_ids) == list:
+			return [self.Refs[ref_id] for ref_id in ref_ids]
+		elif type(ref_ids) == int:
+			return [self.Refs[ref_ids]]
+	def loadAnns(self, ann_ids=[]):
+		if type(ann_ids) == list:
+			return [self.Anns[ann_id] for ann_id in ann_ids]
+		elif type(ann_ids) == int or type(ann_ids) == str:  # `unicode` does not exist in Python 3
+			return [self.Anns[ann_ids]]
+	def loadImgs(self, image_ids=[]):
+		if type(image_ids) == list:
+			return [self.Imgs[image_id] for image_id in image_ids]
+		elif type(image_ids) == int:
+			return [self.Imgs[image_ids]]
+	def loadCats(self, cat_ids=[]):
+		if type(cat_ids) == list:
+			return [self.Cats[cat_id] for cat_id in cat_ids]
+		elif type(cat_ids) == int:
+			return [self.Cats[cat_ids]]
+	def getRefBox(self, ref_id):
+		ref = self.Refs[ref_id]
+		ann = self.refToAnn[ref_id]
+		return ann['bbox']  # [x, y, w, h]
+	def showRef(self, ref, seg_box='seg'):
+		ax = plt.gca()
+		# show image
+		image = self.Imgs[ref['image_id']]
+		I = io.imread(osp.join(self.IMAGE_DIR, image['file_name']))
+		ax.imshow(I)
+		# show refer expression
+		for sid, sent in enumerate(ref['sentences']):
+			print('%s. %s' % (sid+1, sent['sent']))
+		# show segmentations
+		if seg_box == 'seg':
+			ann_id = ref['ann_id']
+			ann = self.Anns[ann_id]
+			polygons = []
+			color = []
+			c = 'none'
+			if type(ann['segmentation'][0]) == list:
+				# polygon used for refcoco*
+				for seg in ann['segmentation']:
+					poly = np.array(seg).reshape((len(seg)//2, 2))  # integer division: reshape needs int dimensions
+					polygons.append(Polygon(poly, closed=True, alpha=0.4))
+					color.append(c)
+				p = PatchCollection(polygons, facecolors=color, edgecolors=(1,1,0,0), linewidths=3, alpha=1)
+				ax.add_collection(p)  # thick yellow polygon
+				p = PatchCollection(polygons, facecolors=color, edgecolors=(1,0,0,0), linewidths=1, alpha=1)
+				ax.add_collection(p)  # thin red polygon
+			else:
+				# mask used for refclef
+				rle = ann['segmentation']
+				m = mask.decode(rle)
+				img = np.ones( (m.shape[0], m.shape[1], 3) )
+				color_mask = np.array([2.0,166.0,101.0])/255
+				for i in range(3):
+					img[:,:,i] = color_mask[i]
+				ax.imshow(np.dstack( (img, m*0.5) ))
+		# show bounding-box
+		elif seg_box == 'box':
+			ann_id = ref['ann_id']
+			ann = self.Anns[ann_id]
+			bbox = 	self.getRefBox(ref['ref_id'])
+			box_plot = Rectangle((bbox[0], bbox[1]), bbox[2], bbox[3], fill=False, edgecolor='green', linewidth=3)
+			ax.add_patch(box_plot)
+	def getMask(self, ref):
+		# return mask, area and mask-center
+		ann = self.refToAnn[ref['ref_id']]
+		image = self.Imgs[ref['image_id']]
+		if type(ann['segmentation'][0]) == list: # polygon
+			rle = mask.frPyObjects(ann['segmentation'], image['height'], image['width'])
+		else:
+			rle = ann['segmentation']
+		m = mask.decode(rle)
+		m = np.sum(m, axis=2)  # sometimes there are multiple binary map (corresponding to multiple segs)
+		m = m.astype(np.uint8) # convert to np.uint8
+		# compute area
+		area = sum(mask.area(rle))  # should be close to ann['area']
+		return {'mask': m, 'area': area}
+		# # position
+		# position_x = np.mean(np.where(m==1)[1]) # [1] means columns (matlab style) -> x (c style)
+		# position_y = np.mean(np.where(m==1)[0]) # [0] means rows (matlab style)    -> y (c style)
+		# # mass position (if there were multiple regions, we use the largest one.)
+		# label_m = label(m, connectivity=m.ndim)
+		# regions = regionprops(label_m)
+		# if len(regions) > 0:
+		# 	largest_id = np.argmax(np.array([props.filled_area for props in regions]))
+		# 	largest_props = regions[largest_id]
+		# 	mass_y, mass_x = largest_props.centroid
+		# else:
+		# 	mass_x, mass_y = position_x, position_y
+		# # if centroid is not in mask, we find the closest point to it from mask
+		# if m[mass_y, mass_x] != 1:
+		# 	print 'Finding closes mask point ...'
+		# 	kernel = np.ones((10, 10),np.uint8)
+		# 	me = cv2.erode(m, kernel, iterations = 1)
+		# 	points = zip(np.where(me == 1)[0].tolist(), np.where(me == 1)[1].tolist())  # row, col style
+		# 	points = np.array(points)
+		# 	dist   = np.sum((points - (mass_y, mass_x))**2, axis=1)
+		# 	id     = np.argsort(dist)[0]
+		# 	mass_y, mass_x = points[id]
+		# 	# return
+		# return {'mask': m, 'area': area, 'position_x': position_x, 'position_y': position_y, 'mass_x': mass_x, 'mass_y': mass_y}
+		# # show image and mask
+		# I = io.imread(osp.join(self.IMAGE_DIR, image['file_name']))
+		# plt.figure()
+		# plt.imshow(I)
+		# ax = plt.gca()
+		# img = np.ones( (m.shape[0], m.shape[1], 3) )
+		# color_mask = np.array([2.0,166.0,101.0])/255
+		# for i in range(3):
+		#     img[:,:,i] = color_mask[i]
+		# ax.imshow(np.dstack( (img, m*0.5) ))
+		# plt.show()
+	def showMask(self, ref):
+		M = self.getMask(ref)
+		msk = M['mask']
+		ax = plt.gca()
+		ax.imshow(msk)
+if __name__ == '__main__':
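+	# data_root is a required argument; take it from the command line (the 'data' fallback is an assumption)
+	data_root = sys.argv[1] if len(sys.argv) > 1 else 'data'
+	refer = REFER(data_root, dataset='refcocog', splitBy='google')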
+	ref_ids = refer.getRefIds()
+	print(len(ref_ids))
+	print(len(refer.Imgs))
+	print(len(refer.imgToRefs))
+	ref_ids = refer.getRefIds(split='train')
+	print('There are %s training referred objects.' % len(ref_ids))
+	for ref_id in ref_ids:
+		ref = refer.loadRefs(ref_id)[0]
+		if len(ref['sentences']) < 2:
+			continue
+		pprint(ref)
+		print('The label is %s.' % refer.Cats[ref['category_id']])
+		plt.figure()
+		refer.showRef(ref, seg_box='box')
+		plt.show()
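
Note on the per-image lookup: an index such as `imgToRefs` maps each image id to a list of refs, so a lookup over several image ids produces a list of lists that has to be flattened before the refs can be filtered by `category_id`, `ref_id` or `split`. The sketch below illustrates this with made-up toy records; the helper names `build_img_to_refs` and `ref_ids_for_images` are hypothetical and not part of the REFER API.

```python
from collections import defaultdict
from itertools import chain

# Toy records shaped like the entries of REFER.data['refs'] (values are invented).
refs = [
    {"ref_id": 1, "image_id": 10, "category_id": 3, "split": "train"},
    {"ref_id": 2, "image_id": 10, "category_id": 5, "split": "val"},
    {"ref_id": 3, "image_id": 11, "category_id": 3, "split": "testA"},
]


def build_img_to_refs(refs):
    """Group refs by image id, similar to what createIndex does for imgToRefs."""
    img_to_refs = defaultdict(list)
    for ref in refs:
        img_to_refs[ref["image_id"]].append(ref)
    return img_to_refs


def ref_ids_for_images(img_to_refs, image_ids):
    """Collect ref ids for the given images, flattening the per-image lists."""
    per_image = [img_to_refs.get(image_id, []) for image_id in image_ids]
    return [ref["ref_id"] for ref in chain.from_iterable(per_image)]


img_to_refs = build_img_to_refs(refs)
print(ref_ids_for_images(img_to_refs, [10, 11]))
```

Unknown image ids are skipped here rather than raising `KeyError`; whether that is the desired behaviour is a design choice.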

dataset/seg/refcoco_segmentation.py ADDED

	@@ -0,0 +1,149 @@

+# --------------------------------------------------------
+# InstructDiffusion
+# Based on instruct-pix2pix (https://github.com/timothybrooks/instruct-pix2pix)
+# Modified by Binxin Yang (tennyson@mail.ustc.edu.cn)
+# --------------------------------------------------------
+from __future__ import annotations
+import os
+import random
+import copy
+import json
+import math
+from pathlib import Path
+from typing import Any
+import numpy as np
+import torch
+import torchvision
+from einops import rearrange
+from PIL import Image
+from torch.utils.data import Dataset
+from dataset.seg.refcoco import REFER
+class RefCOCODataset(Dataset):
+    def __init__(
+        self,
+        path: str,
+        split: str = "train",
+        min_resize_res: int = 256,
+        max_resize_res: int = 256,
+        crop_res: int = 256,
+        flip_prob: float = 0.0,
+        transparency: float = 0.0,
+        test: bool = False,
+    ):
+        assert split in ("train", "val", "test")
+        self.path = path
+        self.min_resize_res = min_resize_res
+        self.max_resize_res = max_resize_res
+        self.crop_res = crop_res
+        self.flip_prob = flip_prob
+        self.G_ref_dataset=REFER(data_root=path)
+        self.IMAGE_DIR = os.path.join(path, 'images/train2014')
+        self.list_ref=self.G_ref_dataset.getRefIds(split=split)
+        self.transparency = transparency
+        self.test = test
+        seg_diverse_prompt_path = 'dataset/prompt/prompt_seg.txt'
+        self.seg_diverse_prompt_list=[]
+        with open(seg_diverse_prompt_path) as f:
+            line=f.readline()
+            while line:
+                line=line.strip('\n')
+                self.seg_diverse_prompt_list.append(line)
+                line=f.readline()
+        color_list_file_path='dataset/prompt/color_list_train_small.txt'
+        self.color_list=[]
+        with open(color_list_file_path) as f:
+            line = f.readline()
+            while line:
+                line_split = line.strip('\n').split(" ")
+                if len(line_split)>1:
+                    temp = []
+                    for i in range(4):
+                        temp.append(line_split[i])
+                    self.color_list.append(temp)
+                line = f.readline()
+    def __len__(self) -> int:
+        return len(self.list_ref)
+    def _augmentation_new(self, image, label):
+        # Cropping
+        h, w = label.shape
+        if h > w:
+            start_h = random.randint(0, h - w)
+            end_h = start_h + w
+            image = image[start_h:end_h]
+            label = label[start_h:end_h]
+        elif h < w:
+            start_w = random.randint(0, w - h)
+            end_w = start_w + h
+            image = image[:, start_w:end_w]
+            label = label[:, start_w:end_w]
+        else:
+            pass
+        image = Image.fromarray(image).resize((self.min_resize_res, self.min_resize_res), resample=Image.Resampling.LANCZOS)
+        image = np.asarray(image, dtype=np.uint8)
+        label = Image.fromarray(label).resize((self.min_resize_res, self.min_resize_res), resample=Image.Resampling.NEAREST)
+        label = np.asarray(label, dtype=np.int64)
+        return image, label
+    def __getitem__(self, i: int) -> dict[str, Any]:
+        ref_ids = self.list_ref[i]
+        ref = self.G_ref_dataset.loadRefs(ref_ids)[0]
+        sentences = random.choice(ref['sentences'])['sent']
+        prompt = random.choice(self.seg_diverse_prompt_list)
+        color = random.choice(self.color_list)
+        color_name = color[0]
+        prompt = prompt.format(color=color_name.lower(), object=sentences.lower())
+        R, G, B = color[3].split(",")
+        R = int(R)
+        G = int(G)
+        B = int(B)
+        image_name = self.G_ref_dataset.loadImgs(ref['image_id'])[0]['file_name']
+        image_path = os.path.join(self.IMAGE_DIR,image_name)
+        mask = self.G_ref_dataset.getMask(ref=ref)['mask']
+        image = Image.open(image_path).convert("RGB")
+        image = np.asarray(image)
+        image, mask = self._augmentation_new(image,mask)
+        mask = (mask == 1)
+        image_0 = Image.fromarray(image)
+        image_1 = copy.deepcopy(image)
+        image_1[:,:,0][mask]=self.transparency*image_1[:,:,0][mask]+(1-self.transparency)*R
+        image_1[:,:,1][mask]=self.transparency*image_1[:,:,1][mask]+(1-self.transparency)*G
+        image_1[:,:,2][mask]=self.transparency*image_1[:,:,2][mask]+(1-self.transparency)*B
+        image_1 = Image.fromarray(image_1)
+        resize_res = torch.randint(self.min_resize_res, self.max_resize_res + 1, ()).item()
+        image_0 = image_0.resize((resize_res, resize_res), Image.Resampling.LANCZOS)
+        image_1 = image_1.resize((resize_res, resize_res), Image.Resampling.LANCZOS)
+        image_0 = rearrange(2 * torch.tensor(np.array(image_0)).float() / 255 - 1, "h w c -> c h w")
+        image_1 = rearrange(2 * torch.tensor(np.array(image_1)).float() / 255 - 1, "h w c -> c h w")
+        # Crop and flip both images together so that source and target stay aligned
+        crop = torchvision.transforms.RandomCrop(self.crop_res)
+        flip = torchvision.transforms.RandomHorizontalFlip(float(self.flip_prob))
+        image_0, image_1 = flip(crop(torch.cat((image_0, image_1)))).chunk(2)
+        return dict(edited=image_1, edit=dict(c_concat=image_0, c_crossattn=prompt))
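
Note on the target image: `__getitem__` paints the referred region with `transparency * pixel + (1 - transparency) * color`, so `transparency=0.0` gives a solid color and values closer to 1 keep more of the original pixel. The sketch below reproduces only that arithmetic on plain nested lists to stay dependency-free. The dataset itself works on NumPy `uint8` arrays, and `blend_mask` is a hypothetical helper name.

```python
def blend_mask(image, mask, color, transparency=0.0):
    """Return a copy of `image` with masked pixels blended toward `color`.

    image: rows of [R, G, B] integer pixels; mask: rows of booleans.
    transparency=0.0 paints the solid color, 1.0 keeps the original pixel.
    """
    out = [[list(px) for px in row] for row in image]
    for y, row in enumerate(mask):
        for x, inside in enumerate(row):
            if inside:
                # int() truncates, as assignment into a uint8 array would
                out[y][x] = [
                    int(transparency * p + (1 - transparency) * c)
                    for p, c in zip(out[y][x], color)
                ]
    return out


image = [[[100, 100, 100], [200, 200, 200]]]
mask = [[True, False]]
print(blend_mask(image, mask, (255, 0, 0)))
print(blend_mask(image, mask, (255, 0, 0), transparency=0.5))
```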

dataset/utils/zip_manager.py ADDED

	@@ -0,0 +1,144 @@

+import zipfile
+import os.path as osp
+# import lmdb
+import logging
+from PIL import Image
+import pickle
+import io
+import glob
+import os
+from pathlib import Path
+import time
+from threading import Thread
+from PIL import ImageFile
+ImageFile.LOAD_TRUNCATED_IMAGES = True
+home = str(Path.home())
+abs_blob_path=os.path.realpath("/mnt/blob/")
+CACHE_FOLDER=os.path.join(home,"caching")
+USE_CACHE=True
+def norm(path):
+    assert "*" not in path
+    return os.path.realpath(os.path.abspath(path))
+def in_blob(file):
+    return abs_blob_path in file
+def map_name(file):
+    path=norm(file)
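+    # str.lstrip strips a set of characters, not a prefix, so remove the prefix explicitly
+    prefix=abs_blob_path+"/"
+    if path.startswith(prefix):
+        path=path[len(prefix):]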
+    path=path.replace("/","_")
+    assert len(path)<250
+    return path
+def preload(db,sync=False):
+    if sync:
+        db.initialize()
+    else:
+        p = Thread(target=db.initialize)
+        p.start()
+def get_keys_from_lmdb(db):
+    with db.begin(write=False) as txn:
+        return list(txn.cursor().iternext(values=False))
+def decode_img(byteflow):
+    try:
+        img=Image.open(io.BytesIO(byteflow)).convert("RGB")
+        img.load()
+    except Exception:
+        # fall back to a placeholder image when the bytes cannot be decoded
+        img = Image.open("white.jpeg").convert("RGB")
+        img.load()
+    return img
+def decode_text(byteflow):
+    return pickle.loads(byteflow)
+decode_funcs={
+    "image": decode_img,
+    "text": decode_text
+}
+class ZipManager:
+    def __init__(self, zip_path,data_type,prefix=None) -> None:
+        self.decode_func=decode_funcs[data_type]
+        self.zip_path=zip_path
+        self._init=False
+        preload(self)
+    def deinitialze(self):
+        self.zip_fd.close()
+        del self.zip_fd
+        self._init = False
+    def initialize(self,close=True):
+        self.zip_fd = zipfile.ZipFile(self.zip_path, mode="r")
+        if not hasattr(self,"_keys"):
+            self._keys = self.zip_fd.namelist()
+        self._init = True
+        if close:
+            self.deinitialze()
+    @property
+    def keys(self):
+        while not hasattr(self,"_keys"):
+            time.sleep(0.1)
+        return self._keys
+    def get(self, name):
+        if not self._init:
+            self.initialize(close=False)
+        byteflow = self.zip_fd.read(name)
+        return self.decode_func(byteflow)
+class MultipleZipManager:
+    def __init__(self, files: list, data_type, sync=True):
+        self.files = files
+        self._is_init = False
+        self.data_type=data_type
+        if sync:
+            print("sync",files)
+            self.initialize()
+        else:
+            print("async",files)
+            preload(self)
+        print("initialize over")
+    def initialize(self):
+        self.mapping={}
+        self.managers={}
+        for file in self.files:
+            manager = ZipManager(file, self.data_type)
+            self.managers[file]=manager
+        for file,manager in self.managers.items():
+            print(file)
+            # print("loading")
+            logging.info(f"{file} loading")
+            keys=manager.keys
+            for key in keys:
+                self.mapping[key]=file
+            logging.info(f"{file} loaded, size = {len(keys)}")
+            print("loaded")
+        self._keys=list(self.mapping.keys())
+        self._is_init=True
+    @property
+    def keys(self):
+        while not self._is_init:
+            time.sleep(0.1)
+        return self._keys
+    def get(self, name):
+        data = self.managers[self.mapping[name]].get(name)
+        return data
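
Note on the lookup scheme: `MultipleZipManager` builds a mapping from each member name to the archive that contains it, then routes `get(name)` to that archive. The sketch below shows the same idea with in-memory archives built on the fly. The archive labels, member names and the helpers `make_zip` and `get` are all invented for illustration, and the threading and caching of the real classes are left out.

```python
import io
import zipfile


def make_zip(members):
    """Build an in-memory zip archive from a {name: bytes} dict."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, mode="w") as zf:
        for name, payload in members.items():
            zf.writestr(name, payload)
    buf.seek(0)
    return buf


# Two toy archives standing in for the zip files on disk.
archives = {
    "a.zip": make_zip({"img/0.txt": b"zero", "img/1.txt": b"one"}),
    "b.zip": make_zip({"img/2.txt": b"two"}),
}

# member name -> label of the archive that holds it
mapping = {}
for label, buf in archives.items():
    with zipfile.ZipFile(buf, mode="r") as zf:
        for name in zf.namelist():
            mapping[name] = label


def get(name):
    """Read one member from whichever archive contains it."""
    with zipfile.ZipFile(archives[mapping[name]], mode="r") as zf:
        return zf.read(name)


print(sorted(mapping))
print(get("img/2.txt"))
```

If two archives contain the same member name, the later one wins in this mapping.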

edit_cli.py ADDED

	@@ -0,0 +1,136 @@

+# --------------------------------------------------------
+# InstructDiffusion
+# Based on instruct-pix2pix (https://github.com/timothybrooks/instruct-pix2pix)
+# Modified by Zigang Geng (zigang@mail.ustc.edu.cn)
+# --------------------------------------------------------
+from __future__ import annotations
+import os
+import math
+import random
+import sys
+from argparse import ArgumentParser
+import einops
+import k_diffusion as K
+import numpy as np
+import torch
+import torch.nn as nn
+from einops import rearrange
+from omegaconf import OmegaConf
+from PIL import Image, ImageOps
+from torch import autocast
+import requests
+sys.path.append("./stable_diffusion")
+from stable_diffusion.ldm.util import instantiate_from_config
+class CFGDenoiser(nn.Module):
+    def __init__(self, model):
+        super().__init__()
+        self.inner_model = model
+    def forward(self, z, sigma, cond, uncond, text_cfg_scale, image_cfg_scale):
+        cfg_z = einops.repeat(z, "b ... -> (repeat b) ...", repeat=3)
+        cfg_sigma = einops.repeat(sigma, "b ... -> (repeat b) ...", repeat=3)
+        # Batch three conditionings: (text, image), (no text, image), (text, no image)
+        cfg_cond = {
+            "c_crossattn": [torch.cat([cond["c_crossattn"][0], uncond["c_crossattn"][0], cond["c_crossattn"][0]])],
+            "c_concat": [torch.cat([cond["c_concat"][0], cond["c_concat"][0], uncond["c_concat"][0]])],
+        }
+        out_cond, out_img_cond, out_txt_cond \
+            = self.inner_model(cfg_z, cfg_sigma, cond=cfg_cond).chunk(3)
+        # Start from the mean of the two partially conditioned outputs,
+        # then add the scaled text and image guidance terms
+        return 0.5 * (out_img_cond + out_txt_cond) + \
+            text_cfg_scale * (out_cond - out_img_cond) + \
+                image_cfg_scale * (out_cond - out_txt_cond)
+def load_model_from_config(config, ckpt, vae_ckpt=None, verbose=False):
+    model = instantiate_from_config(config.model)
+    print(f"Loading model from {ckpt}")
+    pl_sd = torch.load(ckpt, map_location="cpu")
+    if 'state_dict' in pl_sd:
+        pl_sd = pl_sd['state_dict']
+    m, u = model.load_state_dict(pl_sd, strict=False)
+    print(m, u)
+    return model
+def main():
+    parser = ArgumentParser()
+    parser.add_argument("--resolution", default=512, type=int)
+    parser.add_argument("--steps", default=100, type=int)
+    parser.add_argument("--config", default="configs/instruct_diffusion.yaml", type=str)
+    parser.add_argument("--ckpt", default="checkpoints/v1-5-pruned-emaonly-adaption-task.ckpt", type=str)
+    parser.add_argument("--vae-ckpt", default=None, type=str)
+    parser.add_argument("--input", required=True, type=str)
+    parser.add_argument("--outdir", default="logs", type=str)
+    parser.add_argument("--edit", required=True, type=str)
+    parser.add_argument("--cfg-text", default=5.0, type=float)
+    parser.add_argument("--cfg-image", default=1.25, type=float)
+    parser.add_argument("--seed", type=int)
+    args = parser.parse_args()
+    config = OmegaConf.load(args.config)
+    model = load_model_from_config(config, args.ckpt, args.vae_ckpt)
+    model.eval().cuda()
+    model_wrap = K.external.CompVisDenoiser(model)
+    model_wrap_cfg = CFGDenoiser(model_wrap)
+    null_token = model.get_learned_conditioning([""])
+    seed = random.randint(0, 100000) if args.seed is None else args.seed
+    if args.input.startswith("http"):
+        input_image = Image.open(requests.get(args.input, stream=True).raw).convert("RGB")
+    else:
+        input_image = Image.open(args.input).convert("RGB")
+    width, height = input_image.size
+    factor = args.resolution / max(width, height)
+    factor = math.ceil(min(width, height) * factor / 64) * 64 / min(width, height)
+    width_resize = int((width * factor) // 64) * 64
+    height_resize = int((height * factor) // 64) * 64
+    input_image = ImageOps.fit(input_image, (width_resize, height_resize), method=Image.Resampling.LANCZOS)
+    output_dir = args.outdir
+    os.makedirs(output_dir, exist_ok=True)
+    with torch.no_grad(), autocast("cuda"):
+        cond = {}
+        cond["c_crossattn"] = [model.get_learned_conditioning([args.edit])]
+        input_image = 2 * torch.tensor(np.array(input_image)).float() / 255 - 1
+        input_image = rearrange(input_image, "h w c -> 1 c h w").to(next(model.parameters()).device)
+        cond["c_concat"] = [model.encode_first_stage(input_image).mode()]
+        uncond = {}
+        uncond["c_crossattn"] = [null_token]
+        uncond["c_concat"] = [torch.zeros_like(cond["c_concat"][0])]
+        sigmas = model_wrap.get_sigmas(args.steps)
+        extra_args = {
+            "cond": cond,
+            "uncond": uncond,
+            "text_cfg_scale": args.cfg_text,
+            "image_cfg_scale": args.cfg_image,
+        }
+        torch.manual_seed(seed)
+        z = torch.randn_like(cond["c_concat"][0]) * sigmas[0]
+        z = K.sampling.sample_euler_ancestral(model_wrap_cfg, z, sigmas, extra_args=extra_args)
+        x = model.decode_first_stage(z)
+        x = torch.clamp((x + 1.0) / 2.0, min=0.0, max=1.0)
+        x = 255.0 * rearrange(x, "1 c h w -> h w c")
+        print(x.shape)
+        edited_image = Image.fromarray(x.type(torch.uint8).cpu().numpy())
+        edited_image = ImageOps.fit(edited_image, (width, height), method=Image.Resampling.LANCZOS)
+        edited_image.save(output_dir+'/output_'+args.input.split('/')[-1].split('.')[0]+'_seed'+str(seed)+'.jpg')
+if __name__ == "__main__":
+    main()
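
Note on two pieces of arithmetic in this script: `main()` scales the input so that both sides end up as multiples of 64, and `CFGDenoiser.forward` combines three model outputs into one guided prediction. The sketch below copies both formulas into plain Python functions so they can be checked on small numbers. The function names `fit_to_multiple_of_64` and `combine_guidance` are hypothetical, and scalars stand in for the tensors used in the script.

```python
import math


def fit_to_multiple_of_64(width, height, resolution=512):
    """Same arithmetic as main(): scale toward `resolution`, snap to multiples of 64."""
    factor = resolution / max(width, height)
    # round the shorter side up to a multiple of 64 and rescale accordingly
    factor = math.ceil(min(width, height) * factor / 64) * 64 / min(width, height)
    width_resize = int((width * factor) // 64) * 64
    height_resize = int((height * factor) // 64) * 64
    return width_resize, height_resize


def combine_guidance(out_cond, out_img_cond, out_txt_cond, text_cfg_scale, image_cfg_scale):
    """Same combination as CFGDenoiser.forward, applied to scalars."""
    return (
        0.5 * (out_img_cond + out_txt_cond)
        + text_cfg_scale * (out_cond - out_img_cond)
        + image_cfg_scale * (out_cond - out_txt_cond)
    )


print(fit_to_multiple_of_64(1024, 768))
print(fit_to_multiple_of_64(2048, 1024))
print(combine_guidance(1.0, 0.0, 0.0, text_cfg_scale=5.0, image_cfg_scale=1.25))
```

When all three outputs agree, the guidance terms vanish and the combination returns that shared value regardless of the two scales.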

environment.yaml ADDED

	@@ -0,0 +1,40 @@

+# File modified by authors of InstructDiffusion from original (https://github.com/CompVis/stable-diffusion).
+# See more details in LICENSE.
+name: instructdiff
+channels:
+  - pytorch
+  - defaults
+dependencies:
+  - python=3.8.5
+  - pip=20.3
+  - cudatoolkit=11.3
+  - pytorch=1.11.0
+  - torchvision=0.12.0
+  - numpy=1.19.2
+  - pip:
+    - albumentations==0.4.3
+    - datasets==2.8.0
+    - diffusers
+    - opencv-python==4.1.2.30
+    - pudb==2019.2
+    - invisible-watermark
+    - imageio==2.9.0
+    - imageio-ffmpeg==0.4.2
+    - pytorch-lightning==1.4.2
+    - omegaconf==2.1.1
+    - test-tube>=0.7.5
+    - streamlit>=0.73.1
+    - einops==0.3.0
+    - torch-fidelity==0.3.0
+    - transformers==4.19.2
+    - torchmetrics==0.6.0
+    - kornia==0.6
+    - -e git+https://github.com/CompVis/taming-transformers.git@master#egg=taming-transformers
+    - -e git+https://github.com/openai/CLIP.git@main#egg=clip
+    - openai
+    - gradio
+    - seaborn
+    - git+https://github.com/crowsonkb/k-diffusion.git
+    - deepspeed
+    - timm

figure/animals.png ADDED

figure/mirrorcat.jpg ADDED

figure/people.jpg ADDED

figure/watermark.png ADDED

main.py ADDED

	@@ -0,0 +1,566 @@

+# --------------------------------------------------------
+# InstructDiffusion
+# Based on instruct-pix2pix (https://github.com/timothybrooks/instruct-pix2pix)
+# Removed Pytorch-lightning and supported deepspeed by Zigang Geng (zigang@mail.ustc.edu.cn)
+# --------------------------------------------------------
+import argparse, os, sys, datetime, glob
+import numpy as np
+import time
+import json
+import pickle
+import wandb
+import deepspeed
+from packaging import version
+from omegaconf import OmegaConf
+from functools import partial
+from PIL import Image
+from timm.utils import AverageMeter
+import torch
+import torchvision
+import torch.cuda.amp as amp
+import torch.distributed as dist
+import torch.backends.cudnn as cudnn
+from torch.utils.data import DataLoader, Dataset, ConcatDataset
+sys.path.append("./stable_diffusion")
+from ldm.data.base import Txt2ImgIterableBaseDataset
+from ldm.util import instantiate_from_config
+from ldm.modules.ema import LitEma
+from utils.logger import create_logger
+from utils.utils import load_checkpoint, save_checkpoint, get_grad_norm, auto_resume_helper
+from utils.deepspeed import create_ds_config
+def wandb_log(*args, **kwargs):
+    if dist.get_rank() == 0:
+        wandb.log(*args, **kwargs)
+def get_parser(**parser_kwargs):
+    def str2bool(v):
+        if isinstance(v, bool):
+            return v
+        if v.lower() in ("yes", "true", "t", "y", "1"):
+            return True
+        elif v.lower() in ("no", "false", "f", "n", "0"):
+            return False
+        else:
+            raise argparse.ArgumentTypeError("Boolean value expected.")
+    parser = argparse.ArgumentParser(**parser_kwargs)
+    parser.add_argument(
+        "-n",
+        "--name",
+        type=str,
+        const=True,
+        default="",
+        nargs="?",
+        help="postfix for logdir",
+    )
+    parser.add_argument(
+        "-r",
+        "--resume",
+        type=str,
+        const=True,
+        default="",
+        nargs="?",
+        help="resume from logdir or checkpoint in logdir",
+    )
+    parser.add_argument(
+        "-b",
+        "--base",
+        nargs="*",
+        metavar="base_config.yaml",
+        help="paths to base configs. Loaded from left-to-right. "
+             "Parameters can be overwritten or added with command-line options of the form `--key value`.",
+        default=list(),
+    )
+    parser.add_argument(
+        "-t",
+        "--train",
+        type=str2bool,
+        const=True,
+        default=False,
+        nargs="?",
+        help="train",
+    )
+    parser.add_argument(
+        "--no-test",
+        type=str2bool,
+        const=True,
+        default=False,
+        nargs="?",
+        help="disable test",
+    )
+    parser.add_argument(
+        "-p",
+        "--project",
+        help="name of new or path to existing project"
+    )
+    parser.add_argument(
+        "-d",
+        "--debug",
+        type=str2bool,
+        nargs="?",
+        const=True,
+        default=False,
+        help="enable post-mortem debugging",
+    )
+    parser.add_argument(
+        "-s",
+        "--seed",
+        type=int,
+        default=23,
+        help="seed for seed_everything",
+    )
+    parser.add_argument(
+        "-f",
+        "--postfix",
+        type=str,
+        default="",
+        help="post-postfix for default name",
+    )
+    parser.add_argument(
+        "-l",
+        "--logdir",
+        type=str,
+        default="logs",
+        help="directory for logging output",
+    )
+    parser.add_argument(
+        "--scale_lr",
+        action="store_true",
+        default=False,
+        help="scale base-lr by ngpu * batch_size * n_accumulate",
+    )
+    parser.add_argument(
+        "--amd",
+        action="store_true",
+        default=False,
+        help="amd",
+    )
+    parser.add_argument(
+        "--local_rank",
+        type=int,
+        # required=False,
+        default=int(os.environ.get('LOCAL_RANK', 0)),
+        help="local rank for DistributedDataParallel",
+    )
+    return parser
+class WrappedDataset(Dataset):
+    """Wraps an arbitrary object with __len__ and __getitem__ into a pytorch dataset"""
+    def __init__(self, dataset):
+        self.data = dataset
+    def __len__(self):
+        return len(self.data)
+    def __getitem__(self, idx):
+        return self.data[idx]
+class DataModuleFromConfig():
+    def __init__(self, batch_size, train=None, validation=None, test=None, predict=None,
+                 wrap=False, num_workers=None, shuffle_test_loader=False, use_worker_init_fn=False,
+                 shuffle_val_dataloader=False):
+        super().__init__()
+        self.batch_size = batch_size
+        self.dataset_configs = dict()
+        self.num_workers = num_workers if num_workers is not None else batch_size * 2
+        self.use_worker_init_fn = use_worker_init_fn
+        if train is not None:
+            if "target" in train:
+                self.dataset_configs["train"] = train
+                self.train_dataloader = self._train_dataloader
+            else:
+                for ds in train:
+                    ds_name = str([key for key in ds.keys()][0])
+                    self.dataset_configs[ds_name] = ds
+                self.train_dataloader = self._train_concat_dataloader
+        if validation is not None:
+            self.dataset_configs["validation"] = validation
+            self.val_dataloader = partial(self._val_dataloader, shuffle=shuffle_val_dataloader)
+        if test is not None:
+            self.dataset_configs["test"] = test
+            self.test_dataloader = partial(self._test_dataloader, shuffle=shuffle_test_loader)
+        if predict is not None:
+            self.dataset_configs["predict"] = predict
+            self.predict_dataloader = self._predict_dataloader
+        self.wrap = wrap
+    def prepare_data(self):
+        for data_cfg in self.dataset_configs.values():
+            instantiate_from_config(data_cfg)
+    def setup(self, stage=None):
+        self.datasets = dict(
+            (k, instantiate_from_config(self.dataset_configs[k]))
+            for k in self.dataset_configs)
+        if self.wrap:
+            for k in self.datasets:
+                self.datasets[k] = WrappedDataset(self.datasets[k])
+    def _train_concat_dataloader(self):
+        # Concatenate only the training datasets; evaluation splits stay separate.
+        train_datasets = [ds for name, ds in self.datasets.items()
+                          if name not in ("validation", "test", "predict")]
+        is_iterable_dataset = isinstance(train_datasets[0], Txt2ImgIterableBaseDataset)
+        if is_iterable_dataset or self.use_worker_init_fn:
+            init_fn = worker_init_fn
+        else:
+            init_fn = None
+        concat_dataset = ConcatDataset(train_datasets)
+        sampler_train = torch.utils.data.DistributedSampler(
+            concat_dataset, num_replicas=dist.get_world_size(), rank=dist.get_rank(), shuffle=True
+        )
+        return DataLoader(concat_dataset, batch_size=self.batch_size, sampler=sampler_train,
+                          num_workers=self.num_workers, worker_init_fn=init_fn, persistent_workers=True)
+    def _train_dataloader(self):
+        is_iterable_dataset = isinstance(self.datasets['train'], Txt2ImgIterableBaseDataset)
+        if is_iterable_dataset or self.use_worker_init_fn:
+            init_fn = worker_init_fn
+        else:
+            init_fn = None
+        sampler_train = torch.utils.data.DistributedSampler(
+            self.datasets["train"], num_replicas=dist.get_world_size(), rank=dist.get_rank(), shuffle=True
+        )
+        return DataLoader(self.datasets["train"], batch_size=self.batch_size, sampler=sampler_train,
+                          num_workers=self.num_workers, worker_init_fn=init_fn, persistent_workers=True)
+    def _val_dataloader(self, shuffle=False):
+        if isinstance(self.datasets['validation'], Txt2ImgIterableBaseDataset) or self.use_worker_init_fn:
+            init_fn = worker_init_fn
+        else:
+            init_fn = None
+        return DataLoader(self.datasets["validation"],
+                          batch_size=self.batch_size,
+                          num_workers=self.num_workers,
+                          worker_init_fn=init_fn,
+                          shuffle=shuffle, persistent_workers=True)
+    def _test_dataloader(self, shuffle=False):
+        is_iterable_dataset = isinstance(self.datasets['test'], Txt2ImgIterableBaseDataset)
+        if is_iterable_dataset or self.use_worker_init_fn:
+            init_fn = worker_init_fn
+        else:
+            init_fn = None
+        # do not shuffle dataloader for iterable dataset
+        shuffle = shuffle and (not is_iterable_dataset)
+        return DataLoader(self.datasets["test"], batch_size=self.batch_size,
+                          num_workers=self.num_workers, worker_init_fn=init_fn, shuffle=shuffle, persistent_workers=True)
+    def _predict_dataloader(self, shuffle=False):
+        if isinstance(self.datasets['predict'], Txt2ImgIterableBaseDataset) or self.use_worker_init_fn:
+            init_fn = worker_init_fn
+        else:
+            init_fn = None
+        return DataLoader(self.datasets["predict"], batch_size=self.batch_size,
+                          num_workers=self.num_workers, worker_init_fn=init_fn, persistent_workers=True)
+def train_one_epoch(config, model, model_ema, data_loader, val_data_loader, optimizer, epoch, lr_scheduler, scaler):
+    model.train()
+    optimizer.zero_grad()
+    num_steps = len(data_loader)
+    accumul_steps = config.trainer.accumulate_grad_batches
+    batch_time = AverageMeter()
+    loss_meter = AverageMeter()
+    val_loss_meter = AverageMeter()
+    norm_meter = AverageMeter()
+    loss_scale_meter = AverageMeter()
+    loss_scale_meter_min = AverageMeter()
+    start = time.time()
+    end = time.time()
+    for idx, batch in enumerate(data_loader):
+        batch_size = batch['edited'].shape[0]
+        if config.model.params.deepspeed != '':
+            loss, _ = model(batch, idx, accumul_steps)
+            model.backward(loss)
+            model.step()
+            loss_scale = optimizer.cur_scale
+            grad_norm = model.get_global_grad_norm()
+            with torch.no_grad():
+                if idx % config.trainer.accumulate_grad_batches == 0:
+                    model_ema(model)
+            loss_number = loss.item()
+        else:
+            with amp.autocast(enabled=config.model.params.fp16):
+                loss, _ = model(batch, idx, accumul_steps)
+            if config.trainer.accumulate_grad_batches > 1:
+                loss = loss / config.trainer.accumulate_grad_batches
+                scaler.scale(loss).backward()
+                # Unscale and clip only on the micro-batch that steps the optimizer:
+                # GradScaler.unscale_ may be called once per optimizer step.
+                grad_norm = None
+                if (idx + 1) % config.trainer.accumulate_grad_batches == 0:
+                    if config.trainer.clip_grad > 0.0:
+                        scaler.unscale_(optimizer)
+                        grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), config.trainer.clip_grad)
+                    else:
+                        grad_norm = get_grad_norm(model.parameters())
+                    scaler.step(optimizer)
+                    optimizer.zero_grad()
+                    scaler.update()
+                    # lr_scheduler.step_update(epoch * num_steps + idx)
+            else:
+                optimizer.zero_grad()
+                scaler.scale(loss).backward()
+                if config.trainer.clip_grad > 0.0:
+                    scaler.unscale_(optimizer)
+                    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), config.trainer.clip_grad)
+                else:
+                    grad_norm = get_grad_norm(model.parameters())
+                scaler.step(optimizer)
+                scaler.update()
+                # lr_scheduler.step_update(epoch * num_steps + idx)
+            loss_scale = scaler.get_scale()
+            loss_number = loss.item() * config.trainer.accumulate_grad_batches
+        torch.cuda.synchronize()
+        loss_meter.update(loss_number, batch_size)
+        if grad_norm is not None:
+            norm_meter.update(grad_norm)
+        else:
+            norm_meter.update(0.0)
+        loss_scale_meter.update(loss_scale)
+        # loss_scale_meter.update(0.0)
+        batch_time.update(time.time() - end)
+        end = time.time()
+        if idx % 100 == 0:
+            lr = optimizer.param_groups[0]['lr']
+            memory_used = torch.cuda.max_memory_allocated() / (1024.0 * 1024.0)
+            etas = batch_time.avg * (num_steps - idx)
+            logger.info(
+                f'Train: [{epoch}][{idx}/{num_steps}]\t'
+                f'eta {datetime.timedelta(seconds=int(etas))} lr {lr:.6f}\t'
+                f'time {batch_time.val:.4f} ({batch_time.avg:.4f})\t'
+                f'loss {loss_meter.val:.4f} ({loss_meter.avg:.4f})\t'
+                f'grad_norm {norm_meter.val:.4f} ({norm_meter.avg:.4f})\t'
+                f'loss_scale {loss_scale_meter.val:.4f} ({loss_scale_meter.avg:.4f})\t'
+                f'mem {memory_used:.0f}MB')
+        if (epoch * num_steps + idx) % 100 == 0:
+            log_message = dict(
+                lr=optimizer.param_groups[0]['lr'],
+                time=batch_time.val,
+                epoch=epoch,
+                iter=idx,
+                loss=loss_meter.val,
+                grad_norm=norm_meter.val,
+                loss_scale=loss_scale_meter.val,
+                memory=torch.cuda.max_memory_allocated() / (1024.0 * 1024.0),
+                global_iter=epoch * num_steps + idx)
+            # log_message.update({'ref_img': wandb.Image(unnormalize(img[:8].cpu().float())), 'mask': wandb.Image(mask[:8].cpu().float().unsqueeze(1))})
+            # if x_rec is not None:
+                # log_message.update({'rec_img': wandb.Image(unnormalize(x_rec[:8].cpu().float()))})
+            wandb_log(
+                data=log_message,
+                step=epoch * num_steps + idx,
+            )
+        if idx == num_steps - 1:
+            with torch.no_grad():
+                model_ema.store(model.parameters())
+                model_ema.copy_to(model)
+                for val_idx, batch in enumerate(val_data_loader):
+                    batch_size = batch['edited'].shape[0]
+                    loss, _ = model(batch, -1, 1)
+                    loss_number = loss.item()
+                    val_loss_meter.update(loss_number, batch_size)
+                    if val_idx % 10 == 0:
+                        logger.info(
+                            f'Val: [{val_idx}/{len(val_data_loader)}]\t'
+                            f'loss {val_loss_meter.val:.4f} ({val_loss_meter.avg:.4f})\t')
+                    if val_idx == 50:
+                        break
+                model_ema.restore(model.parameters())
+    epoch_time = time.time() - start
+    logger.info(f"EPOCH {epoch} training takes {datetime.timedelta(seconds=int(epoch_time))}")
+if __name__ == "__main__":
+    now = datetime.datetime.now().strftime("%Y-%m-%dT%H-%M-%S")
+    # add cwd for convenience and to make classes in this file available when
+    # running as `python main.py`
+    # (in particular `main.DataModuleFromConfig`)
+    sys.path.append(os.getcwd())
+    parser = get_parser()
+    opt, unknown = parser.parse_known_args()
+    assert opt.name
+    cfg_fname = os.path.split(opt.base[0])[-1]
+    cfg_name = os.path.splitext(cfg_fname)[0]
+    nowname = f"{cfg_name}_{opt.name}"
+    logdir = os.path.join(opt.logdir, nowname)
+    if 'RANK' in os.environ and 'WORLD_SIZE' in os.environ:
+        rank = int(os.environ["RANK"])
+        world_size = int(os.environ['WORLD_SIZE'])
+        print(f"RANK and WORLD_SIZE in environ: {rank}/{world_size}")
+    else:
+        rank = -1
+        world_size = -1
+    if opt.amd:
+        os.environ["CUDA_VISIBLE_DEVICES"] = str(opt.local_rank)
+        torch.distributed.init_process_group(backend='gloo', init_method='env://', world_size=world_size, rank=rank)
+    else:
+        torch.cuda.set_device(opt.local_rank)
+        torch.distributed.init_process_group(backend='nccl', init_method='env://', world_size=world_size, rank=rank)
+    torch.distributed.barrier()
+    seed = opt.seed + dist.get_rank()
+    torch.manual_seed(seed)
+    np.random.seed(seed)
+    cudnn.benchmark = True
+    ckptdir = os.path.join(logdir, "checkpoints")
+    cfgdir = os.path.join(logdir, "configs")
+    os.makedirs(logdir, exist_ok=True)
+    os.makedirs(ckptdir, exist_ok=True)
+    os.makedirs(cfgdir, exist_ok=True)
+    # init and save configs
+    # config: the configs in the config file
+    configs = [OmegaConf.load(cfg) for cfg in opt.base]
+    cli = OmegaConf.from_dotlist(unknown)
+    config = OmegaConf.merge(*configs, cli)
+    if config.model.params.deepspeed != '':
+        create_ds_config(opt, config, cfgdir)
+    if dist.get_rank() == 0:
+        run = wandb.init(
+            id=nowname,
+            name=nowname,
+            project='readoutpose',
+            config=OmegaConf.to_container(config, resolve=True),
+        )
+    logger = create_logger(output_dir=logdir, dist_rank=dist.get_rank(), name=f"{nowname}")
+    resume_file = auto_resume_helper(config, ckptdir)
+    if resume_file:
+        resume = True
+        logger.info(f'resume checkpoint in {resume_file}')
+    else:
+        resume = False
+        logger.info(f'no checkpoint found in {ckptdir}, ignoring auto resume')
+    # model
+    model = instantiate_from_config(config.model)
+    model_ema = LitEma(model, decay_resume=config.model.params.get('ema_resume', 0.9999))
+    # data
+    data = instantiate_from_config(config.data)
+    # NOTE: DataModuleFromConfig is not driven by a Lightning trainer here,
+    # so prepare_data() and setup() must be called explicitly.
+    data.prepare_data()
+    data.setup()
+    data_loader_train = data.train_dataloader()
+    data_loader_val = data.val_dataloader()
+    print("#### Data #####")
+    for k in data.datasets:
+        print(f"{k}, {data.datasets[k].__class__.__name__}, {len(data.datasets[k])}")
+    # configure learning rate
+    bs, base_lr = config.data.params.batch_size, config.model.base_learning_rate
+    ngpu = dist.get_world_size()
+    if 'accumulate_grad_batches' in config.trainer:
+        accumulate_grad_batches = config.trainer.accumulate_grad_batches
+    else:
+        accumulate_grad_batches = 1
+    print(f"accumulate_grad_batches = {accumulate_grad_batches}")
+    if opt.scale_lr:
+        model.learning_rate = accumulate_grad_batches * ngpu * bs * base_lr
+        print(
+            "Setting learning rate to {:.2e} = {} (accumulate_grad_batches) * {} (num_gpus) * {} (batchsize) * {:.2e} (base_lr)".format(
+                model.learning_rate, accumulate_grad_batches, ngpu, bs, base_lr))
+    else:
+        model.learning_rate = base_lr
+        print("++++ NOT USING LR SCALING ++++")
+        print(f"Setting learning rate to {model.learning_rate:.2e}")
+    if not opt.amd:
+        model.cuda()
+    if config.model.params.deepspeed == '':
+        # With fp16 disabled the GradScaler is a no-op, so train_one_epoch can always call it.
+        scaler = amp.GradScaler(enabled=bool(config.model.params.fp16))
+    else:
+        scaler = None
+    param_groups = model.parameters()
+    if config.model.params.deepspeed != '':
+        model, optimizer, _, _ = deepspeed.initialize(
+            args=config,
+            model=model,
+            model_parameters=param_groups,
+            dist_init_required=False,
+        )
+        for name, param in model.named_parameters():
+            param.global_name = name
+        model_without_ddp = model
+        lr_scheduler = None
+        model_ema = model_ema.to(next(model.parameters()).device)
+    else:
+        optimizer, lr_scheduler = model.configure_optimizers()
+        model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[opt.local_rank], broadcast_buffers=False)
+        model_without_ddp = model.module
+    # print(optimizer.param_groups[1])
+    if opt.resume != '':
+        resume_file = opt.resume
+    if resume_file:
+        _, start_epoch = load_checkpoint(resume_file, config, model_without_ddp, model_ema, optimizer, lr_scheduler, scaler, logger)
+    else:
+        start_epoch = 0
+    logger.info("Start training")
+    start_time = time.time()
+    for epoch in range(start_epoch, config.trainer.max_epochs):
+        data_loader_train.sampler.set_epoch(epoch)
+        train_one_epoch(config, model, model_ema, data_loader_train, data_loader_val, optimizer, epoch, lr_scheduler, scaler)
+        if epoch % config.trainer.save_freq == 0:
+            save_checkpoint(ckptdir, config, epoch, model_without_ddp, model_ema, 0., optimizer, lr_scheduler, scaler, logger)
+    total_time = time.time() - start_time
+    total_time_str = str(datetime.timedelta(seconds=int(total_time)))
+    logger.info('Training time {}'.format(total_time_str))
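
The `--scale_lr` branch in `main.py` multiplies the base learning rate by the effective global batch size. The arithmetic can be sketched as follows; `scaled_learning_rate` is a hypothetical helper name and the numbers are illustrative only, not values taken from the repository's config.

```python
def scaled_learning_rate(base_lr, batch_size, num_gpus,
                         accumulate_grad_batches=1, scale_lr=True):
    """Mirror the rule lr = accumulate_grad_batches * num_gpus * batch_size * base_lr."""
    if not scale_lr:
        # Without --scale_lr the base learning rate is used unchanged.
        return base_lr
    return accumulate_grad_batches * num_gpus * batch_size * base_lr


# Illustrative values (assumptions): 8 GPUs, per-GPU batch of 16, 2 accumulation steps.
lr = scaled_learning_rate(base_lr=1e-6, batch_size=16, num_gpus=8,
                          accumulate_grad_batches=2)
print(f"Setting learning rate to {lr:.2e}")
```

Because the factor grows with every added GPU or accumulation step, a base learning rate tuned with `--scale_lr` on one machine size probably needs rechecking when the node count changes.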

scripts/convert_ckpt.py ADDED

	@@ -0,0 +1,51 @@

+# ------------------------------------------------------------------------------
+# Copyright (c) Microsoft
+# Licensed under the MIT License.
+# Written by Zigang Geng (zigang@mail.ustc.edu.cn)
+# ------------------------------------------------------------------------------
+from __future__ import annotations
+import sys
+import torch
+from argparse import ArgumentParser
+from omegaconf import OmegaConf
+sys.path.append("./stable_diffusion")
+from stable_diffusion.ldm.util import instantiate_from_config
+if __name__ == "__main__":
+    parser = ArgumentParser()
+    parser.add_argument("--config", default="configs/instruct_diffusion.yaml", type=str)
+    parser.add_argument("--ema-ckpt", default="logs/instruct_diffusion/checkpoints/ckpt_epoch_200/state.pth", type=str)
+    parser.add_argument("--vae-ckpt", default="stable_diffusion/models/ldm/stable-diffusion-v1/v1-5-pruned-emaonly.ckpt", type=str)
+    parser.add_argument("--out-ckpt", default="checkpoints/v1-5-pruned-emaonly-adaption-task.ckpt", type=str)
+    args = parser.parse_args()
+    config = OmegaConf.load(args.config)
+    model = instantiate_from_config(config.model)
+    ema_ckpt = torch.load(args.ema_ckpt, map_location="cpu")
+    all_keys = [key for key, value in model.named_parameters()]
+    all_keys_rmv = [key.replace('.','') for key in all_keys]
+    new_ema_ckpt = {}
+    for k, v in ema_ckpt['model_ema'].items():
+        try:
+            k_index = all_keys_rmv.index(k)
+            new_ema_ckpt[all_keys[k_index]] = v
+        except ValueError:
+            print(k+' is not in the list.')
+    vae_ckpt = torch.load(args.vae_ckpt, map_location="cpu")
+    for k, v in vae_ckpt['state_dict'].items():
+        if k not in new_ema_ckpt and k in all_keys:
+            new_ema_ckpt[k] = v
+    checkpoint = {'state_dict': new_ema_ckpt}
+    with open(args.out_ckpt, 'wb') as f:
+        torch.save(checkpoint, f)
+        f.flush()
+    print('Converted successfully, the new checkpoint has been saved to ' + str(args.out_ckpt))
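
`convert_ckpt.py` matches EMA entries to model parameters by comparing each EMA key against the parameter name with its dots removed. A minimal sketch of that remapping on plain dictionaries is shown here; `remap_ema_keys` is a hypothetical helper, the strings stand in for tensors, and the `decay` entry is only an assumed example of a non-parameter key. It swaps the script's `list.index` search for a dictionary lookup, which behaves the same as long as no two parameter names collapse to the same dot-free key.

```python
def remap_ema_keys(ema_state, param_names):
    """Map dot-free EMA keys back to the dotted parameter names of the model."""
    # Build a lookup from the dot-free form to the original dotted name.
    lookup = {name.replace(".", ""): name for name in param_names}
    remapped, missing = {}, []
    for key, value in ema_state.items():
        if key in lookup:
            remapped[lookup[key]] = value
        else:
            # Keys without a matching parameter are reported, not converted.
            missing.append(key)
    return remapped, missing


param_names = ["model.diffusion_model.out.weight", "model.diffusion_model.out.bias"]
ema_state = {
    "modeldiffusion_modeloutweight": "ema-weight",
    "modeldiffusion_modeloutbias": "ema-bias",
    "decay": "not-a-parameter",
}
remapped, missing = remap_ema_keys(ema_state, param_names)
print(sorted(remapped), missing)
```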

scripts/download_pretrained_sd.sh ADDED

	@@ -0,0 +1,7 @@

+#!/bin/bash
+SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
+mkdir -p "$SCRIPT_DIR/../stable_diffusion/models/ldm/stable-diffusion-v1"
+curl -L https://huggingface.co/runwayml/stable-diffusion-v1-5/resolve/main/v1-5-pruned-emaonly.ckpt -o "$SCRIPT_DIR/../stable_diffusion/models/ldm/stable-diffusion-v1/v1-5-pruned-emaonly.ckpt"
+curl -L https://huggingface.co/stabilityai/sd-vae-ft-mse-original/resolve/main/vae-ft-mse-840000-ema-pruned.ckpt -o "$SCRIPT_DIR/../stable_diffusion/models/ldm/stable-diffusion-v1/vae-ft-mse-840000-ema-pruned.ckpt"

scripts/inference_example.sh ADDED

	@@ -0,0 +1,12 @@

+# Example: Image Editing
+python edit_cli.py --input figure/animals.png --edit "Transform it to van Gogh, starry night style." --resolution 768 --steps 100 --config configs/instruct_diffusion.yaml --ckpt checkpoints/v1-5-pruned-emaonly-adaption-task.ckpt --cfg-text 5.0 --cfg-image 1.25 --outdir logs/ --seed 93151
+python edit_cli.py --input figure/animals.png --edit "Help the elephant wear a crown and maintain the appearance of others." --resolution 512 --steps 100 --config configs/instruct_diffusion.yaml --ckpt checkpoints/v1-5-pruned-emaonly-adaption-task.ckpt --cfg-text 5.0 --cfg-image 1.25 --outdir logs/ --seed 51557
+# Example: Segmentation. More prompts can be found in dataset/prompt/prompt_seg.txt
+python edit_cli.py --input figure/mirrorcat.jpg --edit "Mark the pixels of the cat in the mirror to blue and leave the rest unchanged." --resolution 512 --steps 100 --config configs/instruct_diffusion.yaml --ckpt checkpoints/v1-5-pruned-emaonly-adaption-task.ckpt --cfg-text 7.5 --cfg-image 1.5 --outdir logs/ --seed 94746
+# Example: Keypoint Detection. More prompts can be found in dataset/prompt/prompt_pose.txt
+python edit_cli.py --input figure/people.jpg --edit "Use yellow to encircle the left knee of the people on the far left and draw a blue circle over the nose of the tallest people." --resolution 512 --steps 100 --config configs/instruct_diffusion.yaml --ckpt checkpoints/v1-5-pruned-emaonly-adaption-task.ckpt --cfg-text 6.0 --cfg-image 0.5 --outdir logs/ --seed 27775
+# Example: Watermark Removal. More prompts can be found in dataset/prompt/prompt_dewatermark.txt
+python edit_cli.py --input figure/watermark.png --edit "Remove watermark from this picture." --resolution 512 --steps 100 --config configs/instruct_diffusion.yaml --ckpt checkpoints/v1-5-pruned-emaonly-adaption-task.ckpt --cfg-text 1.0 --cfg-image 1.0 --outdir logs/ --seed 54763
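
The example commands differ only in the input image, the instruction, the two guidance scales, the resolution and the seed. A small sketch of assembling such a command from Python is given here; `build_edit_command` is a hypothetical helper, it uses only the flags that appear in the script above, and it builds the argument list without running it.

```python
import shlex


def build_edit_command(input_path, instruction, cfg_text, cfg_image, seed,
                       resolution=512, steps=100,
                       config="configs/instruct_diffusion.yaml",
                       ckpt="checkpoints/v1-5-pruned-emaonly-adaption-task.ckpt",
                       outdir="logs/"):
    """Return the edit_cli.py invocation as an argument list."""
    return [
        "python", "edit_cli.py",
        "--input", input_path,
        "--edit", instruction,
        "--resolution", str(resolution),
        "--steps", str(steps),
        "--config", config,
        "--ckpt", ckpt,
        "--cfg-text", str(cfg_text),
        "--cfg-image", str(cfg_image),
        "--outdir", outdir,
        "--seed", str(seed),
    ]


cmd = build_edit_command("figure/watermark.png",
                         "Remove watermark from this picture.",
                         cfg_text=1.0, cfg_image=1.0, seed=54763)
# shlex.join quotes the instruction so the line can be pasted into a shell.
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would launch the edit; keeping the instruction as a single list element avoids shell-quoting problems with prompts that contain spaces or punctuation.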

scripts/run_multinode.sh ADDED

	@@ -0,0 +1,6 @@

+EXP=$1
+NAME=$2
+NNODES=$3  # number of nodes; each node launches 8 processes
+set -x
+python -m torch.distributed.launch --nnodes=${NNODES} --nproc_per_node=8 --node_rank=$NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT main.py --name ${NAME} --base configs/${EXP}.yaml --train --logdir /mnt/data/readout_torch_output/

stable_diffusion/LICENSE ADDED

	@@ -0,0 +1,82 @@

+Copyright (c) 2022 Robin Rombach and Patrick Esser and contributors
+CreativeML Open RAIL-M
+dated August 22, 2022
+Section I: PREAMBLE
+Multimodal generative models are being widely adopted and used, and have the potential to transform the way artists, among other individuals, conceive and benefit from AI or ML technologies as a tool for content creation.
+Notwithstanding the current and potential benefits that these artifacts can bring to society at large, there are also concerns about potential misuses of them, either due to their technical limitations or ethical considerations.
+In short, this license strives for both the open and responsible downstream use of the accompanying model. When it comes to the open character, we took inspiration from open source permissive licenses regarding the grant of IP rights. Referring to the downstream responsible use, we added use-based restrictions not permitting the use of the Model in very specific scenarios, in order for the licensor to be able to enforce the license in case potential misuses of the Model may occur. At the same time, we strive to promote open and responsible research on generative models for art and content generation.
+Even though downstream derivative versions of the model could be released under different licensing terms, the latter will always have to include - at minimum - the same use-based restrictions as the ones in the original license (this license). We believe in the intersection between open and responsible AI development; thus, this License aims to strike a balance between both in order to enable responsible open-science in the field of AI.
+This License governs the use of the model (and its derivatives) and is informed by the model card associated with the model.
+NOW THEREFORE, You and Licensor agree as follows:
+1. Definitions
+- "License" means the terms and conditions for use, reproduction, and Distribution as defined in this document.
+- "Data" means a collection of information and/or content extracted from the dataset used with the Model, including to train, pretrain, or otherwise evaluate the Model. The Data is not licensed under this License.
+- "Output" means the results of operating a Model as embodied in informational content resulting therefrom.
+- "Model" means any accompanying machine-learning based assemblies (including checkpoints), consisting of learnt weights, parameters (including optimizer states), corresponding to the model architecture as embodied in the Complementary Material, that have been trained or tuned, in whole or in part on the Data, using the Complementary Material.
+- "Derivatives of the Model" means all modifications to the Model, works based on the Model, or any other model which is created or initialized by transfer of patterns of the weights, parameters, activations or output of the Model, to the other model, in order to cause the other model to perform similarly to the Model, including - but not limited to - distillation methods entailing the use of intermediate data representations or methods based on the generation of synthetic data by the Model for training the other model.
+- "Complementary Material" means the accompanying source code and scripts used to define, run, load, benchmark or evaluate the Model, and used to prepare data for training or evaluation, if any. This includes any accompanying documentation, tutorials, examples, etc, if any.
+- "Distribution" means any transmission, reproduction, publication or other sharing of the Model or Derivatives of the Model to a third party, including providing the Model as a hosted service made available by electronic or other remote means - e.g. API-based or web access.
+- "Licensor" means the copyright owner or entity authorized by the copyright owner that is granting the License, including the persons or entities that may have rights in the Model and/or distributing the Model.
+- "You" (or "Your") means an individual or Legal Entity exercising permissions granted by this License and/or making use of the Model for whichever purpose and in any field of use, including usage of the Model in an end-use application - e.g. chatbot, translator, image generator.
+- "Third Parties" means individuals or legal entities that are not under common control with Licensor or You.
+- "Contribution" means any work of authorship, including the original version of the Model and any modifications or additions to that Model or Derivatives of the Model thereof, that is intentionally submitted to Licensor for inclusion in the Model by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Model, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution."
+- "Contributor" means Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Model.
+Section II: INTELLECTUAL PROPERTY RIGHTS
+Both copyright and patent grants apply to the Model, Derivatives of the Model and Complementary Material. The Model and Derivatives of the Model are subject to additional terms as described in Section III.
+2. Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare, publicly display, publicly perform, sublicense, and distribute the Complementary Material, the Model, and Derivatives of the Model.
+3. Grant of Patent License. Subject to the terms and conditions of this License and where and as applicable, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this paragraph) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Model and the Complementary Material, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Model to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Model and/or Complementary Material or a Contribution incorporated within the Model and/or Complementary Material constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for the Model and/or Work shall terminate as of the date such litigation is asserted or filed.
+Section III: CONDITIONS OF USAGE, DISTRIBUTION AND REDISTRIBUTION
+4. Distribution and Redistribution. You may host for Third Party remote access purposes (e.g. software-as-a-service), reproduce and distribute copies of the Model or Derivatives of the Model thereof in any medium, with or without modifications, provided that You meet the following conditions:
+Use-based restrictions as referenced in paragraph 5 MUST be included as an enforceable provision by You in any type of legal agreement (e.g. a license) governing the use and/or distribution of the Model or Derivatives of the Model, and You shall give notice to subsequent users You Distribute to, that the Model or Derivatives of the Model are subject to paragraph 5. This provision does not apply to the use of Complementary Material.
+You must give any Third Party recipients of the Model or Derivatives of the Model a copy of this License;
+You must cause any modified files to carry prominent notices stating that You changed the files;
+You must retain all copyright, patent, trademark, and attribution notices excluding those notices that do not pertain to any part of the Model, Derivatives of the Model.
+You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions - respecting paragraph 4.a. - for use, reproduction, or Distribution of Your modifications, or for any such Derivatives of the Model as a whole, provided Your use, reproduction, and Distribution of the Model otherwise complies with the conditions stated in this License.
+5. Use-based restrictions. The restrictions set forth in Attachment A are considered Use-based restrictions. Therefore You cannot use the Model and the Derivatives of the Model for the specified restricted uses. You may use the Model subject to this License, including only for lawful purposes and in accordance with the License. Use may include creating any content with, finetuning, updating, running, training, evaluating and/or reparametrizing the Model. You shall require all of Your users who use the Model or a Derivative of the Model to comply with the terms of this paragraph (paragraph 5).
+6. The Output You Generate. Except as set forth herein, Licensor claims no rights in the Output You generate using the Model. You are accountable for the Output you generate and its subsequent uses. No use of the output can contravene any provision as stated in the License.
+Section IV: OTHER PROVISIONS
+7. Updates and Runtime Restrictions. To the maximum extent permitted by law, Licensor reserves the right to restrict (remotely or otherwise) usage of the Model in violation of this License, update the Model through electronic means, or modify the Output of the Model based on updates. You shall undertake reasonable efforts to use the latest version of the Model.
+8. Trademarks and related. Nothing in this License permits You to make use of Licensors’ trademarks, trade names, logos or to otherwise suggest endorsement or misrepresent the relationship between the parties; and any rights not expressly granted herein are reserved by the Licensors.
+9. Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, Licensor provides the Model and the Complementary Material (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Model, Derivatives of the Model, and the Complementary Material and assume any risks associated with Your exercise of permissions under this License.
+10. Limitation of Liability. In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Model and the Complementary Material (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages.
+11. Accepting Warranty or Additional Liability. While redistributing the Model, Derivatives of the Model and the Complementary Material thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability.
+12. If any provision of this License is held to be invalid, illegal or unenforceable, the remaining provisions shall be unaffected thereby and remain valid as if such provision had not been set forth herein.
+END OF TERMS AND CONDITIONS
+Attachment A
+Use Restrictions
+You agree not to use the Model or Derivatives of the Model:
+- In any way that violates any applicable national, federal, state, local or international law or regulation;
+- For the purpose of exploiting, harming or attempting to exploit or harm minors in any way;
+- To generate or disseminate verifiably false information and/or content with the purpose of harming others;
+- To generate or disseminate personal identifiable information that can be used to harm an individual;
+- To defame, disparage or otherwise harass others;
+- For fully automated decision making that adversely impacts an individual’s legal rights or otherwise creates or modifies a binding, enforceable obligation;
+- For any use intended to or which has the effect of discriminating against or harming individuals or groups based on online or offline social behavior or known or predicted personal or personality characteristics;
+- To exploit any of the vulnerabilities of a specific group of persons based on their age, social, physical or mental characteristics, in order to materially distort the behavior of a person pertaining to that group in a manner that causes or is likely to cause that person or another person physical or psychological harm;
+- For any use intended to or which has the effect of discriminating against individuals or groups based on legally protected characteristics or categories;
+- To provide medical advice and medical results interpretation;
+- To generate or disseminate information for the purpose to be used for administration of justice, law enforcement, immigration or asylum processes, such as predicting an individual will commit fraud/crime commitment (e.g. by text profiling, drawing causal relationships between assertions made in documents, indiscriminate and arbitrarily-targeted use).

stable_diffusion/README.md ADDED Viewed

	@@ -0,0 +1,215 @@

+# Stable Diffusion
+*Stable Diffusion was made possible thanks to a collaboration with [Stability AI](https://stability.ai/) and [Runway](https://runwayml.com/) and builds upon our previous work:*
+[**High-Resolution Image Synthesis with Latent Diffusion Models**](https://ommer-lab.com/research/latent-diffusion-models/)<br/>
+[Robin Rombach](https://github.com/rromb)\*,
+[Andreas Blattmann](https://github.com/ablattmann)\*,
+[Dominik Lorenz](https://github.com/qp-qp),
+[Patrick Esser](https://github.com/pesser),
+[Björn Ommer](https://hci.iwr.uni-heidelberg.de/Staff/bommer)<br/>
+_[CVPR '22 Oral](https://openaccess.thecvf.com/content/CVPR2022/html/Rombach_High-Resolution_Image_Synthesis_With_Latent_Diffusion_Models_CVPR_2022_paper.html) |
+[GitHub](https://github.com/CompVis/latent-diffusion) | [arXiv](https://arxiv.org/abs/2112.10752) | [Project page](https://ommer-lab.com/research/latent-diffusion-models/)_
+![txt2img-stable2](assets/stable-samples/txt2img/merged-0006.png)
+[Stable Diffusion](#stable-diffusion-v1) is a latent text-to-image diffusion
+model.
+Thanks to a generous compute donation from [Stability AI](https://stability.ai/) and support from [LAION](https://laion.ai/), we were able to train a Latent Diffusion Model on 512x512 images from a subset of the [LAION-5B](https://laion.ai/blog/laion-5b/) database.
+Similar to Google's [Imagen](https://arxiv.org/abs/2205.11487),
+this model uses a frozen CLIP ViT-L/14 text encoder to condition the model on text prompts.
+With its 860M UNet and 123M text encoder, the model is relatively lightweight and runs on a GPU with at least 10GB VRAM.
+See [this section](#stable-diffusion-v1) below and the [model card](https://huggingface.co/CompVis/stable-diffusion).
+## Requirements
+A suitable [conda](https://conda.io/) environment named `ldm` can be created
+and activated with:
+```
+conda env create -f environment.yaml
+conda activate ldm
+```
+You can also update an existing [latent diffusion](https://github.com/CompVis/latent-diffusion) environment by running
+```
+conda install pytorch torchvision -c pytorch
+pip install transformers==4.19.2 diffusers invisible-watermark
+pip install -e .
+```
+## Stable Diffusion v1
+Stable Diffusion v1 refers to a specific configuration of the model
+architecture that uses a downsampling-factor 8 autoencoder with an 860M UNet
+and CLIP ViT-L/14 text encoder for the diffusion model. The model was pretrained on 256x256 images and
+then finetuned on 512x512 images.
+*Note: Stable Diffusion v1 is a general text-to-image diffusion model and therefore mirrors biases and (mis-)conceptions that are present
+in its training data.
+Details on the training procedure and data, as well as the intended use of the model can be found in the corresponding [model card](Stable_Diffusion_v1_Model_Card.md).*
+The weights are available via [the CompVis organization at Hugging Face](https://huggingface.co/CompVis) under [a license which contains specific use-based restrictions to prevent misuse and harm as informed by the model card, but otherwise remains permissive](LICENSE). While commercial use is permitted under the terms of the license, **we do not recommend using the provided weights for services or products without additional safety mechanisms and considerations**, since there are [known limitations and biases](Stable_Diffusion_v1_Model_Card.md#limitations-and-bias) of the weights, and research on safe and ethical deployment of general text-to-image models is an ongoing effort. **The weights are research artifacts and should be treated as such.**
+[The CreativeML OpenRAIL M license](LICENSE) is an [Open RAIL M license](https://www.licenses.ai/blog/2022/8/18/naming-convention-of-responsible-ai-licenses), adapted from the work that [BigScience](https://bigscience.huggingface.co/) and [the RAIL Initiative](https://www.licenses.ai/) are jointly carrying out in the area of responsible AI licensing. See also [the article about the BLOOM Open RAIL license](https://bigscience.huggingface.co/blog/the-bigscience-rail-license) on which our license is based.
+### Weights
+We currently provide the following checkpoints:
+- `sd-v1-1.ckpt`: 237k steps at resolution `256x256` on [laion2B-en](https://huggingface.co/datasets/laion/laion2B-en).
+  194k steps at resolution `512x512` on [laion-high-resolution](https://huggingface.co/datasets/laion/laion-high-resolution) (170M examples from LAION-5B with resolution `>= 1024x1024`).
+- `sd-v1-2.ckpt`: Resumed from `sd-v1-1.ckpt`.
+  515k steps at resolution `512x512` on [laion-aesthetics v2 5+](https://laion.ai/blog/laion-aesthetics/) (a subset of laion2B-en with estimated aesthetics score `> 5.0`, and additionally
+filtered to images with an original size `>= 512x512`, and an estimated watermark probability `< 0.5`. The watermark estimate is from the [LAION-5B](https://laion.ai/blog/laion-5b/) metadata, the aesthetics score is estimated using the [LAION-Aesthetics Predictor V2](https://github.com/christophschuhmann/improved-aesthetic-predictor)).
+- `sd-v1-3.ckpt`: Resumed from `sd-v1-2.ckpt`. 195k steps at resolution `512x512` on "laion-aesthetics v2 5+" and 10\% dropping of the text-conditioning to improve [classifier-free guidance sampling](https://arxiv.org/abs/2207.12598).
+- `sd-v1-4.ckpt`: Resumed from `sd-v1-2.ckpt`. 225k steps at resolution `512x512` on "laion-aesthetics v2 5+" and 10\% dropping of the text-conditioning to improve [classifier-free guidance sampling](https://arxiv.org/abs/2207.12598).
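The "10% dropping of the text-conditioning" used for `sd-v1-3.ckpt` and `sd-v1-4.ckpt` can be illustrated with a small sketch. It assumes that dropping means replacing the caption with an empty prompt. The function name `drop_text_conditioning` and the example caption are hypothetical and are not taken from the training code.

```python
import random
from typing import List, Optional


def drop_text_conditioning(
    prompts: List[str],
    p_drop: float = 0.1,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Replace each prompt with the empty string with probability p_drop.

    Assumption: "dropping" the text-conditioning is modelled here as
    swapping the caption for an empty prompt.
    """
    rng = rng or random.Random()
    return ["" if rng.random() < p_drop else prompt for prompt in prompts]


# Illustrative batch: the same hypothetical caption repeated many times.
rng = random.Random(0)
batch = drop_text_conditioning(["a photo of a cat"] * 10_000, p_drop=0.1, rng=rng)
dropped_fraction = batch.count("") / len(batch)
print(f"fraction of prompts dropped: {dropped_fraction:.3f}")
```

Training on a mix of captioned and uncaptioned examples lets one network produce both the conditional and the unconditional prediction that classifier-free guidance combines at sampling time.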
+Evaluations with different classifier-free guidance scales (1.5, 2.0, 3.0, 4.0,
+5.0, 6.0, 7.0, 8.0) and 50 PLMS sampling
+steps show the relative improvements of the checkpoints:
+![sd evaluation results](assets/v1-variants-scores.jpg)
+### Text-to-Image with Stable Diffusion
+![txt2img-stable2](assets/stable-samples/txt2img/merged-0005.png)
+![txt2img-stable2](assets/stable-samples/txt2img/merged-0007.png)
+Stable Diffusion is a latent diffusion model conditioned on the (non-pooled) text embeddings of a CLIP ViT-L/14 text encoder.
+We provide a [reference script for sampling](#reference-sampling-script), but
+there also exists a [diffusers integration](#diffusers-integration), for which we
+expect to see more active community development.
+#### Reference Sampling Script
+We provide a reference sampling script, which incorporates
+- a [Safety Checker Module](https://github.com/CompVis/stable-diffusion/pull/36),
+  to reduce the probability of explicit outputs,
+- an [invisible watermarking](https://github.com/ShieldMnt/invisible-watermark)
+  of the outputs, to help viewers [identify the images as machine-generated](scripts/tests/test_watermark.py).
+After [obtaining the `stable-diffusion-v1-*-original` weights](#weights), link them
+```
+mkdir -p models/ldm/stable-diffusion-v1/
+ln -s <path/to/model.ckpt> models/ldm/stable-diffusion-v1/model.ckpt
+```
+and sample with
+```
+python scripts/txt2img.py --prompt "a photograph of an astronaut riding a horse" --plms
+```
+By default, this uses a guidance scale of `--scale 7.5`, [Katherine Crowson's implementation](https://github.com/CompVis/latent-diffusion/pull/51) of the [PLMS](https://arxiv.org/abs/2202.09778) sampler,
+and renders images of size 512x512 (which it was trained on) in 50 steps. All supported arguments are listed below (type `python scripts/txt2img.py --help`).
+```commandline
+usage: txt2img.py [-h] [--prompt [PROMPT]] [--outdir [OUTDIR]] [--skip_grid] [--skip_save] [--ddim_steps DDIM_STEPS] [--plms] [--laion400m] [--fixed_code] [--ddim_eta DDIM_ETA]
+                  [--n_iter N_ITER] [--H H] [--W W] [--C C] [--f F] [--n_samples N_SAMPLES] [--n_rows N_ROWS] [--scale SCALE] [--from-file FROM_FILE] [--config CONFIG] [--ckpt CKPT]
+                  [--seed SEED] [--precision {full,autocast}]
+optional arguments:
+  -h, --help            show this help message and exit
+  --prompt [PROMPT]     the prompt to render
+  --outdir [OUTDIR]     dir to write results to
+  --skip_grid           do not save a grid, only individual samples. Helpful when evaluating lots of samples
+  --skip_save           do not save individual samples. For speed measurements.
+  --ddim_steps DDIM_STEPS
+                        number of ddim sampling steps
+  --plms                use plms sampling
+  --laion400m           uses the LAION400M model
+  --fixed_code          if enabled, uses the same starting code across samples
+  --ddim_eta DDIM_ETA   ddim eta (eta=0.0 corresponds to deterministic sampling
+  --n_iter N_ITER       sample this often
+  --H H                 image height, in pixel space
+  --W W                 image width, in pixel space
+  --C C                 latent channels
+  --f F                 downsampling factor
+  --n_samples N_SAMPLES
+                        how many samples to produce for each given prompt. A.k.a. batch size
+  --n_rows N_ROWS       rows in the grid (default: n_samples)
+  --scale SCALE         unconditional guidance scale: eps = eps(x, empty) + scale * (eps(x, cond) - eps(x, empty))
+  --from-file FROM_FILE
+                        if specified, load prompts from this file
+  --config CONFIG       path to config which constructs model
+  --ckpt CKPT           path to checkpoint of model
+  --seed SEED           the seed (for reproducible sampling)
+  --precision {full,autocast}
+                        evaluate at this precision
+```
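The formula in the `--scale` help text above, `eps = eps(x, empty) + scale * (eps(x, cond) - eps(x, empty))`, can be sketched in a few lines. This is a minimal illustration on plain Python lists, not the script's implementation. The function name `guided_eps` and the toy values are hypothetical.

```python
from typing import List, Sequence


def guided_eps(
    eps_empty: Sequence[float],
    eps_cond: Sequence[float],
    scale: float,
) -> List[float]:
    """Combine the two noise predictions elementwise:
    eps = eps(x, empty) + scale * (eps(x, cond) - eps(x, empty))
    """
    return [e + scale * (c - e) for e, c in zip(eps_empty, eps_cond)]


# Toy noise predictions (hypothetical values, three elements only).
eps_empty = [0.0, 1.0, -1.0]  # prediction for the empty prompt
eps_text = [1.0, 1.0, 0.0]    # prediction for the text prompt

unconditional = guided_eps(eps_empty, eps_text, scale=0.0)  # equals eps_empty
conditional = guided_eps(eps_empty, eps_text, scale=1.0)    # equals eps_text
amplified = guided_eps(eps_empty, eps_text, scale=7.5)      # the default --scale
print(amplified)
```

A scale of 1.0 reproduces the conditional prediction. Larger values push the result further in the direction in which the text prompt changes the prediction.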
+Note: The inference config for all v1 versions is designed to be used with EMA-only checkpoints.
+For this reason, `use_ema=False` is set in the configuration; otherwise the code will try to switch from
+non-EMA to EMA weights. If you want to examine the effect of EMA vs no EMA, we provide "full" checkpoints
+which contain both types of weights. For these, `use_ema=False` will load and use the non-EMA weights.
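For readers unfamiliar with the term, EMA weights are an exponential moving average of the weights seen during training. The sketch below shows the generic update rule only. The decay value and the function name `ema_update` are assumptions for illustration, since the text does not say how the EMA in these checkpoints was computed.

```python
from typing import List, Sequence


def ema_update(
    ema_weights: Sequence[float],
    weights: Sequence[float],
    decay: float = 0.999,  # hypothetical decay, not taken from the source
) -> List[float]:
    """One exponential-moving-average step: ema = decay * ema + (1 - decay) * w."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]


# Toy example: the raw weight jumps to 1.0 and stays there,
# while the averaged weight approaches it gradually.
ema = [0.0]
for _ in range(10):
    ema = ema_update(ema, [1.0], decay=0.9)
print(ema)
```

The averaged weights change more smoothly than the raw ones, which is why a checkpoint can store both sets and why the two can produce different samples.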
+#### Diffusers Integration
+A simple way to download and sample Stable Diffusion is by using the [diffusers library](https://github.com/huggingface/diffusers/tree/main#new--stable-diffusion-is-now-fully-compatible-with-diffusers):
+```py
+# make sure you're logged in with `huggingface-cli login`
+from torch import autocast
+from diffusers import StableDiffusionPipeline
+pipe = StableDiffusionPipeline.from_pretrained(
+    "CompVis/stable-diffusion-v1-4",
+    use_auth_token=True
+).to("cuda")
+prompt = "a photo of an astronaut riding a horse on mars"
+with autocast("cuda"):
+    # current diffusers releases expose the result as `.images`; early releases used `["sample"]`
+    image = pipe(prompt).images[0]
+image.save("astronaut_rides_horse.png")
+```
+### Image Modification with Stable Diffusion
+By using a diffusion-denoising mechanism as first proposed by [SDEdit](https://arxiv.org/abs/2108.01073), the model can be used for different
+tasks such as text-guided image-to-image translation and upscaling. Similar to the txt2img sampling script,
+we provide a script to perform image modification with Stable Diffusion.
+The following describes an example where a rough sketch made in [Pinta](https://www.pinta-project.com/) is converted into a detailed artwork.
+```
+python scripts/img2img.py --prompt "A fantasy landscape, trending on artstation" --init-img <path-to-img.jpg> --strength 0.8
+```
+Here, strength is a value between 0.0 and 1.0 that controls the amount of noise that is added to the input image.
+Values that approach 1.0 allow for lots of variations but will also produce images that are not semantically consistent with the input. See the following example.
+**Input**
+![sketch-in](assets/stable-samples/img2img/sketch-mountains-input.jpg)
+**Outputs**
+![out3](assets/stable-samples/img2img/mountains-3.png)
+![out2](assets/stable-samples/img2img/mountains-2.png)
+This procedure can, for example, also be used to upscale samples from the base model.
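One way to picture the `--strength` parameter is as the fraction of the sampling schedule that is applied to the input image as noise before denoising begins. The sketch below assumes that mapping and a default of 50 sampling steps. Both are illustrative assumptions, and `encode_steps` is a hypothetical helper rather than a function from the script.

```python
def encode_steps(strength: float, ddim_steps: int = 50) -> int:
    """Number of noising steps applied to the input image.

    Assumption: strength scales the step count linearly, so 0.0 leaves the
    input untouched and 1.0 noises it over the full schedule.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must lie in [0.0, 1.0]")
    return int(strength * ddim_steps)


for strength in (0.0, 0.5, 0.8, 1.0):
    print(f"strength={strength}: {encode_steps(strength)} of 50 steps")
```

Under this reading, a low strength keeps the output close to the input, while a strength near 1.0 discards most of its structure, which matches the behavior described above.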
+## Comments
+- Our codebase for the diffusion models builds heavily on [OpenAI's ADM codebase](https://github.com/openai/guided-diffusion)
+and [https://github.com/lucidrains/denoising-diffusion-pytorch](https://github.com/lucidrains/denoising-diffusion-pytorch).
+Thanks for open-sourcing!
+- The implementation of the transformer encoder is from [x-transformers](https://github.com/lucidrains/x-transformers) by [lucidrains](https://github.com/lucidrains?tab=repositories).
+## BibTeX
+```
+@misc{rombach2021highresolution,
+      title={High-Resolution Image Synthesis with Latent Diffusion Models},
+      author={Robin Rombach and Andreas Blattmann and Dominik Lorenz and Patrick Esser and Björn Ommer},
+      year={2021},
+      eprint={2112.10752},
+      archivePrefix={arXiv},
+      primaryClass={cs.CV}
+}
+```

stable_diffusion/Stable_Diffusion_v1_Model_Card.md ADDED Viewed

	@@ -0,0 +1,144 @@

+# Stable Diffusion v1 Model Card
+This model card describes the Stable Diffusion v1 model, available [here](https://github.com/CompVis/stable-diffusion).
+## Model Details
+- **Developed by:** Robin Rombach, Patrick Esser
+- **Model type:** Diffusion-based text-to-image generation model
+- **Language(s):** English
+- **License:** [CreativeML OpenRAIL M](LICENSE)
+- **Model Description:** This is a model that can be used to generate and modify images based on text prompts. It is a [Latent Diffusion Model](https://arxiv.org/abs/2112.10752) that uses a fixed, pretrained text encoder ([CLIP ViT-L/14](https://arxiv.org/abs/2103.00020)) as suggested in the [Imagen paper](https://arxiv.org/abs/2205.11487).
+- **Resources for more information:** [GitHub Repository](https://github.com/CompVis/stable-diffusion), [Paper](https://arxiv.org/abs/2112.10752).
+- **Cite as:**
+      @InProceedings{Rombach_2022_CVPR,
+          author    = {Rombach, Robin and Blattmann, Andreas and Lorenz, Dominik and Esser, Patrick and Ommer, Bj\"orn},
+          title     = {High-Resolution Image Synthesis With Latent Diffusion Models},
+          booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
+          month     = {June},
+          year      = {2022},
+          pages     = {10684-10695}
+      }
+# Uses
+## Direct Use
+The model is intended for research purposes only. Possible research areas and
+tasks include
+- Safe deployment of models which have the potential to generate harmful content.
+- Probing and understanding the limitations and biases of generative models.
+- Generation of artworks and use in design and other artistic processes.
+- Applications in educational or creative tools.
+- Research on generative models.
+Excluded uses are described below.
+### Misuse, Malicious Use, and Out-of-Scope Use
+_Note: This section is taken from the [DALL-E Mini model card](https://huggingface.co/dalle-mini/dalle-mini), but applies in the same way to Stable Diffusion v1_.
+The model should not be used to intentionally create or disseminate images that create hostile or alienating environments for people. This includes generating images that people would foreseeably find disturbing, distressing, or offensive; or content that propagates historical or current stereotypes.
+#### Out-of-Scope Use
+The model was not trained to produce factual or true representations of people or events, and therefore using the model to generate such content is out-of-scope for the abilities of this model.
+#### Misuse and Malicious Use
+Using the model to generate content that is cruel to individuals is a misuse of this model. This includes, but is not limited to:
+- Generating demeaning, dehumanizing, or otherwise harmful representations of people or their environments, cultures, religions, etc.
+- Intentionally promoting or propagating discriminatory content or harmful stereotypes.
+- Impersonating individuals without their consent.
+- Sexual content without consent of the people who might see it.
+- Mis- and disinformation
+- Representations of egregious violence and gore
+- Sharing of copyrighted or licensed material in violation of its terms of use.
+- Sharing content that is an alteration of copyrighted or licensed material in violation of its terms of use.
+## Limitations and Bias
+### Limitations
+- The model does not achieve perfect photorealism
+- The model cannot render legible text
+- The model does not perform well on more difficult tasks which involve compositionality, such as rendering an image corresponding to “A red cube on top of a blue sphere”
+- Faces and people in general may not be generated properly.
+- The model was trained mainly with English captions and will not work as well in other languages.
+- The autoencoding part of the model is lossy
+- The model was trained on a large-scale dataset
+  [LAION-5B](https://laion.ai/blog/laion-5b/) which contains adult material
+  and is not fit for product use without additional safety mechanisms and
+  considerations.
+- No additional measures were used to deduplicate the dataset. As a result, we observe some degree of memorization for images that are duplicated in the training data.
+  The training data can be searched at [https://rom1504.github.io/clip-retrieval/](https://rom1504.github.io/clip-retrieval/) to possibly assist in the detection of memorized images.
+### Bias
+While the capabilities of image generation models are impressive, they can also reinforce or exacerbate social biases.
+Stable Diffusion v1 was primarily trained on subsets of [LAION-2B(en)](https://laion.ai/blog/laion-5b/),
+which consists of images that are limited to English descriptions.
+Texts and images from communities and cultures that use other languages are likely to be insufficiently accounted for.
+This affects the overall output of the model, as white and western cultures are often set as the default. Further, the
+ability of the model to generate content with non-English prompts is significantly worse than with English-language prompts.
+Stable Diffusion v1 mirrors and exacerbates biases to such a degree that viewer discretion must be advised irrespective of the input or its intent.
+## Training
+**Training Data**
+The model developers used the following dataset for training the model:
+- LAION-5B and subsets thereof (see next section)
+**Training Procedure**
+Stable Diffusion v1 is a latent diffusion model which combines an autoencoder with a diffusion model that is trained in the latent space of the autoencoder. During training,
+- Images are encoded through an encoder, which turns images into latent representations. The autoencoder uses a relative downsampling factor of f = 8 and maps images of shape H x W x 3 to latents of shape H/f x W/f x 4.
+- Text prompts are encoded through a CLIP ViT-L/14 text encoder.
+- The non-pooled output of the text encoder is fed into the UNet backbone of the latent diffusion model via cross-attention.
+- The loss is a reconstruction objective between the noise that was added to the latent and the prediction made by the UNet.
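The shapes and the objective in the list above can be made concrete with a short sketch. It assumes that the "reconstruction objective" is a mean squared error between the added noise and the UNet's prediction, which the text does not state explicitly. The helper names are hypothetical.

```python
from typing import Sequence, Tuple


def latent_shape(height: int, width: int, f: int = 8, channels: int = 4) -> Tuple[int, int, int]:
    """Map an H x W x 3 image to the H/f x W/f x 4 latent shape."""
    if height % f or width % f:
        raise ValueError("height and width must be divisible by f")
    return (height // f, width // f, channels)


def noise_prediction_loss(noise: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean squared error between added noise and predicted noise.

    Assumption: the reconstruction objective is taken to be an L2 loss.
    """
    if len(noise) != len(predicted):
        raise ValueError("noise and prediction must have the same length")
    return sum((n - p) ** 2 for n, p in zip(noise, predicted)) / len(noise)


print(latent_shape(512, 512))                           # latent shape for a 512x512 image
print(noise_prediction_loss([1.0, -1.0], [0.0, 0.0]))   # toy values
```

For a 512x512 input, the diffusion model therefore operates on a 64 x 64 x 4 latent rather than on the full-resolution pixels.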
+We currently provide the following checkpoints:
+- `sd-v1-1.ckpt`: 237k steps at resolution `256x256` on [laion2B-en](https://huggingface.co/datasets/laion/laion2B-en).
+  194k steps at resolution `512x512` on [laion-high-resolution](https://huggingface.co/datasets/laion/laion-high-resolution) (170M examples from LAION-5B with resolution `>= 1024x1024`).
+- `sd-v1-2.ckpt`: Resumed from `sd-v1-1.ckpt`.
+  515k steps at resolution `512x512` on [laion-aesthetics v2 5+](https://laion.ai/blog/laion-aesthetics/) (a subset of laion2B-en with estimated aesthetics score `> 5.0`, and additionally
+filtered to images with an original size `>= 512x512`, and an estimated watermark probability `< 0.5`. The watermark estimate is from the [LAION-5B](https://laion.ai/blog/laion-5b/) metadata, the aesthetics score is estimated using the [LAION-Aesthetics Predictor V2](https://github.com/christophschuhmann/improved-aesthetic-predictor)).
+- `sd-v1-3.ckpt`: Resumed from `sd-v1-2.ckpt`. 195k steps at resolution `512x512` on "laion-aesthetics v2 5+" and 10\% dropping of the text-conditioning to improve [classifier-free guidance sampling](https://arxiv.org/abs/2207.12598).
+- `sd-v1-4.ckpt`: Resumed from `sd-v1-2.ckpt`. 225k steps at resolution `512x512` on "laion-aesthetics v2 5+" and 10\% dropping of the text-conditioning to improve [classifier-free guidance sampling](https://arxiv.org/abs/2207.12598).
+- **Hardware:** 32 x 8 x A100 GPUs
+- **Optimizer:** AdamW
+- **Gradient Accumulations:** 2
+- **Batch:** 32 x 8 x 2 x 4 = 2048
+- **Learning rate:** warmup to 0.0001 for 10,000 steps and then kept constant
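The batch size and learning-rate schedule listed above can be checked with a few lines of arithmetic. The sketch reads the factors `32 x 8 x 2 x 4` as nodes, GPUs per node, gradient accumulation steps, and per-GPU batch size. That reading, and the linear shape of the warmup, are assumptions, as the text gives only the numbers.

```python
def effective_batch_size(
    nodes: int = 32,
    gpus_per_node: int = 8,
    grad_accum: int = 2,
    per_gpu_batch: int = 4,
) -> int:
    """Number of samples contributing to one optimizer step."""
    return nodes * gpus_per_node * grad_accum * per_gpu_batch


def learning_rate(step: int, peak: float = 1e-4, warmup_steps: int = 10_000) -> float:
    """Warm up to `peak` over `warmup_steps`, then stay constant.

    Assumption: the warmup is linear; the source only says "warmup to 0.0001".
    """
    if step < warmup_steps:
        return peak * step / warmup_steps
    return peak


print(effective_batch_size())
for step in (0, 5_000, 10_000, 100_000):
    print(step, learning_rate(step))
```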
+## Evaluation Results
+Evaluations with different classifier-free guidance scales (1.5, 2.0, 3.0, 4.0,
+5.0, 6.0, 7.0, 8.0) and 50 PLMS sampling
+steps show the relative improvements of the checkpoints:
+![pareto](assets/v1-variants-scores.jpg)
+Evaluated using 50 PLMS steps and 10000 random prompts from the COCO2017 validation set at 512x512 resolution. Not optimized for FID scores.
+## Environmental Impact
+**Stable Diffusion v1** **Estimated Emissions**
+We estimate the following CO2 emissions using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700). The hardware, runtime, cloud provider, and compute region listed below were used to estimate the carbon impact.
+- **Hardware Type:** A100 PCIe 40GB
+- **Hours used:** 150000
+- **Cloud Provider:** AWS
+- **Compute Region:** US-east
+- **Carbon Emitted (Power consumption x Time x Carbon produced based on location of power grid):** 11250 kg CO2 eq.
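The formula named in the last item (power consumption x time x carbon intensity of the grid) can be worked through as follows. Only the 150000 hours and the 11250 kg result come from the list above. The 0.25 kW draw per GPU and the 0.3 kg CO2 eq. per kWh grid intensity are assumptions, chosen here because together they reproduce the reported figure. They are not values stated in the model card.

```python
def co2_kg(power_kw: float, hours: float, kg_per_kwh: float) -> float:
    """Carbon emitted = power consumption x time x grid carbon intensity."""
    return power_kw * hours * kg_per_kwh


GPU_POWER_KW = 0.25        # assumed power draw per A100 PCIe 40GB
GPU_HOURS = 150_000        # "Hours used" from the list above
GRID_KG_PER_KWH = 0.3      # assumed carbon intensity for the compute region

energy_kwh = GPU_POWER_KW * GPU_HOURS
estimate = co2_kg(GPU_POWER_KW, GPU_HOURS, GRID_KG_PER_KWH)
print(f"energy: {energy_kwh:.0f} kWh, emissions: {estimate:.0f} kg CO2 eq.")
```

Dividing the reported total by the hours gives 0.075 kg CO2 eq. per GPU-hour, which holds regardless of how that figure is split between power draw and grid intensity.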
+## Citation
+    @InProceedings{Rombach_2022_CVPR,
+        author    = {Rombach, Robin and Blattmann, Andreas and Lorenz, Dominik and Esser, Patrick and Ommer, Bj\"orn},
+        title     = {High-Resolution Image Synthesis With Latent Diffusion Models},
+        booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
+        month     = {June},
+        year      = {2022},
+        pages     = {10684-10695}
+    }
+*This model card was written by: Robin Rombach and Patrick Esser and is based on the [DALL-E Mini model card](https://huggingface.co/dalle-mini/dalle-mini).*