brushedit demo
This view is limited to 50 files because it contains too many changes.
- .gitattributes +30 -20
- README.md +3 -3
- app/down_load_brushedit.py +13 -0
- app/down_load_brushedit.sh +3 -0
- app/gpt4_o/brushedit_app.py +0 -914
- app/gpt4_o/instructions.py +11 -10
- app/gpt4_o/requirements.txt +0 -18
- app/llava/instructions.py +108 -0
- app/qwen2/instructions.py +103 -0
- app/{gpt4_o/run_app.sh → run_app.sh} +1 -1
- app/src/aspect_ratio_template.py +88 -0
- app/src/base_model_template.py +61 -0
- app/{gpt4_o → src}/brushedit_all_in_one_pipeline.py +6 -13
- app/src/brushedit_app.py +1690 -0
- app/{gpt4_o → src}/vlm_pipeline.py +118 -34
- app/src/vlm_template.py +120 -0
- app/utils/GroundingDINO_SwinT_OGC.py +43 -0
- assets/angel_christmas/angel_christmas.png +3 -0
- assets/angel_christmas/image_edit_f15d9b45-c978-4e3d-9f5f-251e308560c3_0.png +3 -0
- assets/angel_christmas/mask_f15d9b45-c978-4e3d-9f5f-251e308560c3.png +3 -0
- assets/angel_christmas/masked_image_f15d9b45-c978-4e3d-9f5f-251e308560c3.png +3 -0
- assets/angel_christmas/prompt.txt +3 -0
- assets/anime_flower/anime_flower.png +3 -0
- assets/anime_flower/image_edit_37553172-9b38-4727-bf2e-37d7e2b93461_2.png +3 -0
- assets/anime_flower/mask_37553172-9b38-4727-bf2e-37d7e2b93461.png +3 -0
- assets/anime_flower/masked_image_37553172-9b38-4727-bf2e-37d7e2b93461.png +3 -0
- assets/anime_flower/prompt.txt +1 -0
- assets/brushedit_teaser.png +3 -0
- assets/chenduling/chengduling.jpg +3 -0
- assets/chenduling/image_edit_68e3ff6f-da07-4b37-91df-13d6eed7b997_0.png +3 -0
- assets/chenduling/mask_68e3ff6f-da07-4b37-91df-13d6eed7b997.png +3 -0
- assets/chenduling/masked_image_68e3ff6f-da07-4b37-91df-13d6eed7b997.png +3 -0
- assets/chenduling/prompt.txt +1 -0
- assets/chinese_girl/chinese_girl.png +3 -0
- assets/chinese_girl/image_edit_54759648-0989-48e0-bc82-f20e28b5ec29_1.png +3 -0
- assets/chinese_girl/mask_54759648-0989-48e0-bc82-f20e28b5ec29.png +3 -0
- assets/chinese_girl/masked_image_54759648-0989-48e0-bc82-f20e28b5ec29.png +3 -0
- assets/chinese_girl/prompt.txt +1 -0
- assets/demo_vis.png +3 -0
- assets/example.png +3 -0
- assets/frog/frog.jpeg +3 -0
- assets/frog/image_edit_f7b350de-6f2c-49e3-b535-995c486d78e7_1.png +3 -0
- assets/frog/mask_f7b350de-6f2c-49e3-b535-995c486d78e7.png +3 -0
- assets/frog/masked_image_f7b350de-6f2c-49e3-b535-995c486d78e7.png +3 -0
- assets/frog/prompt.txt +1 -0
- assets/girl_on_sun/girl_on_sun.png +3 -0
- assets/girl_on_sun/image_edit_264eac8b-8b65-479c-9755-020a60880c37_0.png +3 -0
- assets/girl_on_sun/mask_264eac8b-8b65-479c-9755-020a60880c37.png +3 -0
- assets/girl_on_sun/masked_image_264eac8b-8b65-479c-9755-020a60880c37.png +3 -0
- assets/girl_on_sun/prompt.txt +1 -0
.gitattributes CHANGED
@@ -40,23 +40,33 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.gif filter=lfs diff=lfs merge=lfs -text
 *.bmp filter=lfs diff=lfs merge=lfs -text
 *.tiff filter=lfs diff=lfs merge=lfs -text
-assets/
-assets/
-assets/
-assets/
-assets/
-assets/
-assets/
-assets/
-assets/
-assets/
-assets/
-assets/
-assets/
-assets/
-assets/
-assets/
-assets/
-assets/
-assets/
-assets/
+assets/angel_christmas/angel_christmas.png filter=lfs diff=lfs merge=lfs -text
+assets/angel_christmas/image_edit_f15d9b45-c978-4e3d-9f5f-251e308560c3_0.png filter=lfs diff=lfs merge=lfs -text
+assets/angel_christmas/masked_image_f15d9b45-c978-4e3d-9f5f-251e308560c3.png filter=lfs diff=lfs merge=lfs -text
+assets/angel_christmas/mask_f15d9b45-c978-4e3d-9f5f-251e308560c3.png filter=lfs diff=lfs merge=lfs -text
+assets/angel_christmas/prompt.txt filter=lfs diff=lfs merge=lfs -text
+assets/pigeon_rm filter=lfs diff=lfs merge=lfs -text
+assets/brushedit_teaser.png filter=lfs diff=lfs merge=lfs -text
+assets/chenduling filter=lfs diff=lfs merge=lfs -text
+assets/chinese_girl filter=lfs diff=lfs merge=lfs -text
+assets/example.png filter=lfs diff=lfs merge=lfs -text
+assets/frog filter=lfs diff=lfs merge=lfs -text
+assets/hedgehog_rm_fg filter=lfs diff=lfs merge=lfs -text
+assets/hedgehog_rp_fg filter=lfs diff=lfs merge=lfs -text
+assets/spider_man_curl filter=lfs diff=lfs merge=lfs -text
+assets/spider_man_cowboy_hat filter=lfs diff=lfs merge=lfs -text
+assets/spider_man_crown filter=lfs diff=lfs merge=lfs -text
+assets/spider_man_rm filter=lfs diff=lfs merge=lfs -text
+assets/angel_christmas filter=lfs diff=lfs merge=lfs -text
+assets/anime_flower filter=lfs diff=lfs merge=lfs -text
+assets/logo_brushedit.png filter=lfs diff=lfs merge=lfs -text
+assets/spider_man_devil_horn filter=lfs diff=lfs merge=lfs -text
+assets/sunflower_girl filter=lfs diff=lfs merge=lfs -text
+assets/upload.png filter=lfs diff=lfs merge=lfs -text
+assets/demo_vis.png filter=lfs diff=lfs merge=lfs -text
+assets/girl_on_sun filter=lfs diff=lfs merge=lfs -text
+assets/hedgehog_rp_bg filter=lfs diff=lfs merge=lfs -text
+assets/mona_lisa filter=lfs diff=lfs merge=lfs -text
+assets/olsen filter=lfs diff=lfs merge=lfs -text
+assets/spider_man_cat_ears filter=lfs diff=lfs merge=lfs -text
+assets/spider_man_witch_hat filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -5,7 +5,7 @@ colorFrom: indigo
 colorTo: gray
 sdk: gradio
 sdk_version: 4.38.1
-app_file: app/
+app_file: app/src/brushedit_app.py
 pinned: false
-python_version: 3.
+python_version: 3.10
 ---
app/down_load_brushedit.py ADDED
@@ -0,0 +1,13 @@
+import os
+from huggingface_hub import snapshot_download
+
+# download hf models
+BrushEdit_path = "models/"
+if not os.path.exists(BrushEdit_path):
+    BrushEdit_path = snapshot_download(
+        repo_id="TencentARC/BrushEdit",
+        local_dir=BrushEdit_path,
+        token=os.getenv("HF_TOKEN"),
+    )
+
+print("Downloaded BrushEdit to ", BrushEdit_path)
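For reference, a minimal sketch (not part of this commit) of sanity-checking the snapshot downloaded to models/. The subfolder names below mirror the hf_hub_download calls in the removed app/gpt4_o/brushedit_app.py and are an assumption about the TencentARC/BrushEdit layout.

# Sketch: verify the expected checkpoint layout under models/ after running
# app/down_load_brushedit.py. Subfolder names are assumed, taken from the old app.
import os

BrushEdit_path = "models/"
expected = [
    "base_model/realisticVisionV60B1_v51VAE",
    "brushnetX",
    "sam/sam_vit_h_4b8939.pth",
    "grounding_dino/groundingdino_swint_ogc.pth",
]
for rel in expected:
    path = os.path.join(BrushEdit_path, rel)
    print(f"{path}: {'found' if os.path.exists(path) else 'missing'}")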
app/down_load_brushedit.sh ADDED
@@ -0,0 +1,3 @@
+export PYTHONPATH=.:$PYTHONPATH
+
+python app/down_load_brushedit.py
app/gpt4_o/brushedit_app.py DELETED
@@ -1,914 +0,0 @@
-##!/usr/bin/python3
-# -*- coding: utf-8 -*-
-import os, random
-import numpy as np
-import torch
-
-import gradio as gr
-import spaces
-
-from PIL import Image
-
-
-from huggingface_hub import hf_hub_download
-
-from segment_anything import SamPredictor, build_sam, SamAutomaticMaskGenerator
-from diffusers import StableDiffusionBrushNetPipeline, BrushNetModel, UniPCMultistepScheduler
-from scipy.ndimage import binary_dilation, binary_erosion
-
-from app.gpt4_o.vlm_pipeline import (
-    vlm_response_editing_type,
-    vlm_response_object_wait_for_edit,
-    vlm_response_mask,
-    vlm_response_prompt_after_apply_instruction
-)
-from app.gpt4_o.brushedit_all_in_one_pipeline import BrushEdit_Pipeline
-from app.utils.utils import load_grounding_dino_model
-
-
-#### Description ####
-head = r"""
-<div style="text-align: center;">
-<h1> BrushEdit: All-In-One Image Inpainting and Editing</h1>
-<div style="display: flex; justify-content: center; align-items: center; text-align: center;">
-<a href='https://tencentarc.github.io/BrushNet/'><img src='https://img.shields.io/badge/Project_Page-BrushNet-green' alt='Project Page'></a>
-<a href='https://github.com/TencentARC/BrushNet/blob/main/InstructionGuidedEditing/CVPR2024workshop_technique_report.pdf'><img src='https://img.shields.io/badge/Paper-Arxiv-blue'></a>
-<a href='https://github.com/TencentARC/BrushNet'><img src='https://img.shields.io/badge/Code-Github-orange'></a>
-
-</div>
-</br>
-</div>
-"""
-descriptions = r"""
-Official Gradio Demo for <a href='https://tencentarc.github.io/BrushNet/'><b>BrushEdit: All-In-One Image Inpainting and Editing</b></a><br>
-🧙 BrushEdit enables precise, user-friendly instruction-based image editing via a inpainting model.<br>
-"""
-
-instructions = r"""
-Currently, we support two modes: <b>fully automated command editing</b> and <b>interactive command editing</b>.
-
-🛠️ <b>Fully automated instruction-based editing</b>:
-<ul>
-<li> ⭐️ <b>step1:</b> Upload or select one image from Example. </li>
-<li> ⭐️ <b>step2:</b> Input the instructions (supports addition, deletion, and modification), e.g. remove xxx .</li>
-<li> ⭐️ <b>step3:</b> Click <b>Run</b> button to automatic edit image.</li>
-</ul>
-
-🛠️ <b>Interactive instruction-based editing</b>:
-<ul>
-<li> ⭐️ <b>step1:</b> Upload or select one image from Example. </li>
-<li> ⭐️ <b>step2:</b> Use a brush to outline the area you want to edit. </li>
-<li> ⭐️ <b>step3:</b> Input the instructions. </li>
-<li> ⭐️ <b>step4:</b> Click <b>Run</b> button to automatic edit image. </li>
-</ul>
-
-💡 <b>Some tips</b>:
-<ul>
-<li> 🤠 After input the instructions, you can click the <b>Generate Mask</b> button. The mask generated by VLM will be displayed in the preview panel on the right side. </li>
-<li> 🤠 After generating the mask or when you use the brush to draw the mask, you can perform operations such as <b>randomization</b>, <b>dilation</b>, <b>erosion</b>, and <b>movement</b>. </li>
-<li> 🤠 After input the instructions, you can click the <b>Generate Target Prompt</b> button. The target prompt will be displayed in the text box, and you can modify it according to your ideas. </li>
-</ul>
-
-☕️ Have fun!
-"""
-
-
-# - - - - - examples - - - - - #
-EXAMPLES = [
-    # [
-    # {"background": Image.open("assets/mona_lisa/image_edit_aae09152-4495-4332-b691-a0c7bff524be_2.png").convert("RGBA"),
-    # "layers": [Image.new("RGBA", (Image.open("assets/mona_lisa/image_edit_aae09152-4495-4332-b691-a0c7bff524be_2.png").width, Image.open("assets/mona_lisa/image_edit_aae09152-4495-4332-b691-a0c7bff524be_2.png").height), (0, 0, 0, 0))],
-    # "composite": Image.open("assets/mona_lisa/image_edit_aae09152-4495-4332-b691-a0c7bff524be_2.png").convert("RGBA")},
-    # # Image.open("assets/mona_lisa/image_edit_aae09152-4495-4332-b691-a0c7bff524be_2.png").convert("RGBA"),
-    # "add a shining necklace",
-    # # [Image.open("assets/mona_lisa/image_edit_aae09152-4495-4332-b691-a0c7bff524be_2.jpg")],
-    # # [Image.open("assets/mona_lisa/mask_aae09152-4495-4332-b691-a0c7bff524be.png")],
-    # # [Image.open("assets/mona_lisa/masked_image_aae09152-4495-4332-b691-a0c7bff524be.png")]
-    # ],
-
-    [
-        # load_image_from_url("https://github.com/liyaowei-stu/BrushEdit/blob/main/assets/mona_lisa/mona_lisa.png"),
-        Image.open("assets/mona_lisa/mona_lisa.png").convert("RGBA"),
-        "add a shining necklace",
-        # [Image.open("assets/mona_lisa/image_edit_aae09152-4495-4332-b691-a0c7bff524be_2.jpg")],
-        # [Image.open("assets/mona_lisa/mask_aae09152-4495-4332-b691-a0c7bff524be.png")],
-        # [Image.open("assets/mona_lisa/masked_image_aae09152-4495-4332-b691-a0c7bff524be.png")]
-    ],
-
-
-
-
-]
-
-
-## init VLM
-from openai import OpenAI
-
-OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
-os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY
-vlm = OpenAI(base_url="http://v2.open.venus.oa.com/llmproxy")
-device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
-
-
-
-# download hf models
-base_model_path = hf_hub_download(
-    repo_id="Yw22/BrushEdit",
-    subfolder="base_model/realisticVisionV60B1_v51VAE",
-    token=os.getenv("HF_TOKEN"),
-)
-
-
-brushnet_path = hf_hub_download(
-    repo_id="Yw22/BrushEdit",
-    subfolder="brushnetX",
-    token=os.getenv("HF_TOKEN"),
-)
-
-sam_path = hf_hub_download(
-    repo_id="Yw22/BrushEdit",
-    subfolder="sam",
-    filename="sam_vit_h_4b8939.pth",
-    token=os.getenv("HF_TOKEN"),
-)
-
-groundingdino_path = hf_hub_download(
-    repo_id="Yw22/BrushEdit",
-    subfolder="grounding_dino",
-    filename="groundingdino_swint_ogc.pth",
-    token=os.getenv("HF_TOKEN"),
-)
-
-
-# input brushnetX ckpt path
-brushnet = BrushNetModel.from_pretrained(brushnet_path, torch_dtype=torch.float16)
-pipe = StableDiffusionBrushNetPipeline.from_pretrained(
-    base_model_path, brushnet=brushnet, torch_dtype=torch.float16, low_cpu_mem_usage=False
-)
-# speed up diffusion process with faster scheduler and memory optimization
-pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
-# remove following line if xformers is not installed or when using Torch 2.0.
-# pipe.enable_xformers_memory_efficient_attention()
-pipe.enable_model_cpu_offload()
-
-
-## init SAM
-sam = build_sam(checkpoint=sam_path)
-sam.to(device=device)
-sam_predictor = SamPredictor(sam)
-sam_automask_generator = SamAutomaticMaskGenerator(sam)
-
-## init groundingdino_model
-config_file = 'third_party/Grounded-Segment-Anything/GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py'
-groundingdino_model = load_grounding_dino_model(config_file, groundingdino_path, device=device)
-
-## Ordinary function
-def crop_and_resize(image: Image.Image,
-                    target_width: int,
-                    target_height: int) -> Image.Image:
-    """
-    Crops and resizes an image while preserving the aspect ratio.
-
-    Args:
-        image (Image.Image): Input PIL image to be cropped and resized.
-        target_width (int): Target width of the output image.
-        target_height (int): Target height of the output image.
-
-    Returns:
-        Image.Image: Cropped and resized image.
-    """
-    # Original dimensions
-    original_width, original_height = image.size
-    original_aspect = original_width / original_height
-    target_aspect = target_width / target_height
-
-    # Calculate crop box to maintain aspect ratio
-    if original_aspect > target_aspect:
-        # Crop horizontally
-        new_width = int(original_height * target_aspect)
-        new_height = original_height
-        left = (original_width - new_width) / 2
-        top = 0
-        right = left + new_width
-        bottom = original_height
-    else:
-        # Crop vertically
-        new_width = original_width
-        new_height = int(original_width / target_aspect)
-        left = 0
-        top = (original_height - new_height) / 2
-        right = original_width
-        bottom = top + new_height
-
-    # Crop and resize
-    cropped_image = image.crop((left, top, right, bottom))
-    resized_image = cropped_image.resize((target_width, target_height), Image.NEAREST)
-
-    return resized_image
-
-
-def move_mask_func(mask, direction, units):
-    binary_mask = mask.squeeze()>0
-    rows, cols = binary_mask.shape
-
-    moved_mask = np.zeros_like(binary_mask, dtype=bool)
-
-    if direction == 'down':
-        # move down
-        moved_mask[max(0, units):, :] = binary_mask[:rows - units, :]
-
-    elif direction == 'up':
-        # move up
-        moved_mask[:rows - units, :] = binary_mask[units:, :]
-
-    elif direction == 'right':
-        # move left
-        moved_mask[:, max(0, units):] = binary_mask[:, :cols - units]
-
-    elif direction == 'left':
-        # move right
-        moved_mask[:, :cols - units] = binary_mask[:, units:]
-
-    return moved_mask
-
-
-def random_mask_func(mask, dilation_type='square'):
-    # Randomly select the size of dilation
-    dilation_size = np.random.randint(20, 40)  # Randomly select the size of dilation
-    binary_mask = mask.squeeze()>0
-
-    if dilation_type == 'square_dilation':
-        structure = np.ones((dilation_size, dilation_size), dtype=bool)
-        dilated_mask = binary_dilation(binary_mask, structure=structure)
-    elif dilation_type == 'square_erosion':
-        structure = np.ones((dilation_size, dilation_size), dtype=bool)
-        dilated_mask = binary_erosion(binary_mask, structure=structure)
-    elif dilation_type == 'bounding_box':
-        # find the most left top and left bottom point
-        rows, cols = np.where(binary_mask)
-        if len(rows) == 0 or len(cols) == 0:
-            return mask  # return original mask if no valid points
-
-        min_row = np.min(rows)
-        max_row = np.max(rows)
-        min_col = np.min(cols)
-        max_col = np.max(cols)
-
-        # create a bounding box
-        dilated_mask = np.zeros_like(binary_mask, dtype=bool)
-        dilated_mask[min_row:max_row + 1, min_col:max_col + 1] = True
-
-    elif dilation_type == 'bounding_ellipse':
-        # find the most left top and left bottom point
-        rows, cols = np.where(binary_mask)
-        if len(rows) == 0 or len(cols) == 0:
-            return mask  # return original mask if no valid points
-
-        min_row = np.min(rows)
-        max_row = np.max(rows)
-        min_col = np.min(cols)
-        max_col = np.max(cols)
-
-        # calculate the center and axis length of the ellipse
-        center = ((min_col + max_col) // 2, (min_row + max_row) // 2)
-        a = (max_col - min_col) // 2  # half long axis
-        b = (max_row - min_row) // 2  # half short axis
-
-        # create a bounding ellipse
-        y, x = np.ogrid[:mask.shape[0], :mask.shape[1]]
-        ellipse_mask = ((x - center[0])**2 / a**2 + (y - center[1])**2 / b**2) <= 1
-        dilated_mask = np.zeros_like(binary_mask, dtype=bool)
-        dilated_mask[ellipse_mask] = True
-    else:
-        raise ValueError("dilation_type must be 'square' or 'ellipse'")
-
-    # use binary dilation
-    dilated_mask = np.uint8(dilated_mask[:,:,np.newaxis]) * 255
-    return dilated_mask
-
-
-## Gradio component function
-@spaces.GPU(duration=180)
-def process(input_image,
-            original_image,
-            original_mask,
-            prompt,
-            negative_prompt,
-            control_strength,
-            seed,
-            randomize_seed,
-            guidance_scale,
-            num_inference_steps,
-            num_samples,
-            blending,
-            category,
-            target_prompt,
-            resize_and_crop):
-
-    import ipdb; ipdb.set_trace()
-    if original_image is None:
-        raise gr.Error('Please upload the input image')
-    if prompt is None:
-        raise gr.Error("Please input your instructions, e.g., remove the xxx")
-
-
-    alpha_mask = input_image["layers"][0].split()[3]
-    input_mask = np.asarray(alpha_mask)
-    if resize_and_crop:
-        original_image = crop_and_resize(Image.fromarray(original_image), target_width=640, target_height=640)
-        input_mask = crop_and_resize(Image.fromarray(input_mask), target_width=640, target_height=640)
-        original_image = np.array(original_image)
-        input_mask = np.array(input_mask)
-
-    if input_mask.max() == 0:
-        original_mask = original_mask
-    else:
-        original_mask = input_mask[:,:,None]
-
-    # load example image
-    # if isinstance(original_image, str):
-    #     # image_name = image_examples[original_image][0]
-    #     # original_image = cv2.imread(image_name)
-    #     # original_image = cv2.cvtColor(original_image, cv2.COLOR_BGR2RGB)
-    #     original_image = input_image
-    #     num_samples = 1
-    #     blending = True
-
-    if category is not None:
-        pass
-    else:
-        category = vlm_response_editing_type(vlm, original_image, prompt)
-
-
-    if original_mask is not None:
-        original_mask = np.clip(original_mask, 0, 255).astype(np.uint8)
-    else:
-        object_wait_for_edit = vlm_response_object_wait_for_edit(vlm,
-                                                                 category,
-                                                                 prompt)
-        original_mask = vlm_response_mask(vlm,
-                                          category,
-                                          original_image,
-                                          prompt,
-                                          object_wait_for_edit,
-                                          sam,
-                                          sam_predictor,
-                                          sam_automask_generator,
-                                          groundingdino_model,
-                                          )[:,:,None]
-
-
-    if len(target_prompt) <= 1:
-        prompt_after_apply_instruction = vlm_response_prompt_after_apply_instruction(vlm,
-                                                                                     original_image,
-                                                                                     prompt)
-    else:
-        prompt_after_apply_instruction = target_prompt
-
-    generator = torch.Generator("cuda").manual_seed(random.randint(0, 2147483647) if randomize_seed else seed)
-
-
-
-    image, mask_image = BrushEdit_Pipeline(pipe,
-                                           prompt_after_apply_instruction,
-                                           original_mask,
-                                           original_image,
-                                           generator,
-                                           num_inference_steps,
-                                           guidance_scale,
-                                           control_strength,
-                                           negative_prompt,
-                                           num_samples,
-                                           blending)
-
-    masked_image = original_image * (1 - (original_mask>0))
-    masked_image = masked_image.astype(np.uint8)
-    masked_image = Image.fromarray(masked_image)
-
-    import uuid
-    uuid = str(uuid.uuid4())
-    image[0].save(f"outputs/image_edit_{uuid}_0.png")
-    image[1].save(f"outputs/image_edit_{uuid}_1.png")
-    image[2].save(f"outputs/image_edit_{uuid}_2.png")
-    image[3].save(f"outputs/image_edit_{uuid}_3.png")
-    mask_image.save(f"outputs/mask_{uuid}.png")
-    masked_image.save(f"outputs/masked_image_{uuid}.png")
-    return image, [mask_image], [masked_image], ''
-
-
-def generate_target_prompt(input_image,
-                           original_image,
-                           prompt):
-    # load example image
-    if isinstance(original_image, str):
-        original_image = input_image
-
-    prompt_after_apply_instruction = vlm_response_prompt_after_apply_instruction(vlm,
-                                                                                 original_image,
-                                                                                 prompt)
-    return prompt_after_apply_instruction
-
-
-def process_mask(input_image,
-                 original_image,
-                 prompt,
-                 resize_and_crop):
-    if original_image is None:
-        raise gr.Error('Please upload the input image')
-    if prompt is None:
-        raise gr.Error("Please input your instructions, e.g., remove the xxx")
-
-    ## load mask
-    alpha_mask = input_image["layers"][0].split()[3]
-    input_mask = np.array(alpha_mask)
-
-    # load example image
-    if isinstance(original_image, str):
-        original_image = input_image["background"]
-
-    if resize_and_crop:
-        original_image = crop_and_resize(Image.fromarray(original_image), target_width=640, target_height=640)
-        input_mask = crop_and_resize(Image.fromarray(input_mask), target_width=640, target_height=640)
-        original_image = np.array(original_image)
-        input_mask = np.array(input_mask)
-
-
-    if input_mask.max() == 0:
-        category = vlm_response_editing_type(vlm, original_image, prompt)
-
-        object_wait_for_edit = vlm_response_object_wait_for_edit(vlm,
-                                                                 category,
-                                                                 prompt)
-        # original mask: h,w,1 [0, 255]
-        original_mask = vlm_response_mask(
-            vlm,
-            category,
-            original_image,
-            prompt,
-            object_wait_for_edit,
-            sam,
-            sam_predictor,
-            sam_automask_generator,
-            groundingdino_model,
-        )[:,:,None]
-    else:
-        original_mask = input_mask[:,:,None]
-        category = None
-
-
-    mask_image = Image.fromarray(original_mask.squeeze().astype(np.uint8)).convert("RGB")
-
-    masked_image = original_image * (1 - (original_mask>0))
-    masked_image = masked_image.astype(np.uint8)
-    masked_image = Image.fromarray(masked_image)
-
-    ## not work for image editor
-    # background = input_image["background"]
-    # mask_array = original_mask.squeeze()
-    # layer_rgba = np.array(input_image['layers'][0])
-    # layer_rgba[mask_array > 0] = [0, 0, 0, 255]
-    # layer_rgba = Image.fromarray(layer_rgba, 'RGBA')
-    # black_image = Image.new("RGBA", layer_rgba.size, (0, 0, 0, 255))
-    # composite = Image.composite(black_image, background, layer_rgba)
-    # output_base = {"layers": [layer_rgba], "background": background, "composite": composite}
-
-
-    return [masked_image], [mask_image], original_mask.astype(np.uint8), category
-
-
-def process_random_mask(input_image, original_image, original_mask, resize_and_crop):
-
-    alpha_mask = input_image["layers"][0].split()[3]
-    input_mask = np.asarray(alpha_mask)
-    if resize_and_crop:
-        original_image = crop_and_resize(Image.fromarray(original_image), target_width=640, target_height=640)
-        input_mask = crop_and_resize(Image.fromarray(input_mask), target_width=640, target_height=640)
-        original_image = np.array(original_image)
-        input_mask = np.array(input_mask)
-
-
-    if input_mask.max() == 0:
-        if original_mask is None:
-            raise gr.Error('Please generate mask first')
-        original_mask = original_mask
-    else:
-        original_mask = input_mask[:,:,None]
-
-
-    dilation_type = np.random.choice(['bounding_box', 'bounding_ellipse'])
-    random_mask = random_mask_func(original_mask, dilation_type).squeeze()
-
-    mask_image = Image.fromarray(random_mask.astype(np.uint8)).convert("RGB")
-
-    masked_image = original_image * (1 - (random_mask[:,:,None]>0))
-    masked_image = masked_image.astype(original_image.dtype)
-    masked_image = Image.fromarray(masked_image)
-
-
-    return [masked_image], [mask_image], random_mask[:,:,None].astype(np.uint8)
-
-
-def process_dilation_mask(input_image, original_image, original_mask, resize_and_crop):
-
-    alpha_mask = input_image["layers"][0].split()[3]
-    input_mask = np.asarray(alpha_mask)
-    if resize_and_crop:
-        original_image = crop_and_resize(Image.fromarray(original_image), target_width=640, target_height=640)
-        input_mask = crop_and_resize(Image.fromarray(input_mask), target_width=640, target_height=640)
-        original_image = np.array(original_image)
-        input_mask = np.array(input_mask)
-
-    if input_mask.max() == 0:
-        if original_mask is None:
-            raise gr.Error('Please generate mask first')
-        original_mask = original_mask
-    else:
-        original_mask = input_mask[:,:,None]
-
-    dilation_type = np.random.choice(['square_dilation'])
-    random_mask = random_mask_func(original_mask, dilation_type).squeeze()
-
-    mask_image = Image.fromarray(random_mask.astype(np.uint8)).convert("RGB")
-
-    masked_image = original_image * (1 - (random_mask[:,:,None]>0))
-    masked_image = masked_image.astype(original_image.dtype)
-    masked_image = Image.fromarray(masked_image)
-
-    return [masked_image], [mask_image], random_mask[:,:,None].astype(np.uint8)
-
-
-def process_erosion_mask(input_image, original_image, original_mask, resize_and_crop):
-    alpha_mask = input_image["layers"][0].split()[3]
-    input_mask = np.asarray(alpha_mask)
-    if resize_and_crop:
-        original_image = crop_and_resize(Image.fromarray(original_image), target_width=640, target_height=640)
-        input_mask = crop_and_resize(Image.fromarray(input_mask), target_width=640, target_height=640)
-        original_image = np.array(original_image)
-        input_mask = np.array(input_mask)
-
-    if input_mask.max() == 0:
-        if original_mask is None:
-            raise gr.Error('Please generate mask first')
-        original_mask = original_mask
-    else:
-        original_mask = input_mask[:,:,None]
-
-    dilation_type = np.random.choice(['square_erosion'])
-    random_mask = random_mask_func(original_mask, dilation_type).squeeze()
-
-    mask_image = Image.fromarray(random_mask.astype(np.uint8)).convert("RGB")
-
-    masked_image = original_image * (1 - (random_mask[:,:,None]>0))
-    masked_image = masked_image.astype(original_image.dtype)
-    masked_image = Image.fromarray(masked_image)
-
-
-    return [masked_image], [mask_image], random_mask[:,:,None].astype(np.uint8)
-
-
-def move_mask_left(input_image, original_image, original_mask, moving_pixels, resize_and_crop):
-
-    alpha_mask = input_image["layers"][0].split()[3]
-    input_mask = np.asarray(alpha_mask)
-    if resize_and_crop:
-        original_image = crop_and_resize(Image.fromarray(original_image), target_width=640, target_height=640)
-        input_mask = crop_and_resize(Image.fromarray(input_mask), target_width=640, target_height=640)
-        original_image = np.array(original_image)
-        input_mask = np.array(input_mask)
-
-    if input_mask.max() == 0:
-        if original_mask is None:
-            raise gr.Error('Please generate mask first')
-        original_mask = original_mask
-    else:
-        original_mask = input_mask[:,:,None]
-
-    moved_mask = move_mask_func(original_mask, 'left', int(moving_pixels)).squeeze()
-    mask_image = Image.fromarray(((moved_mask>0).astype(np.uint8)*255)).convert("RGB")
-
-    masked_image = original_image * (1 - (moved_mask[:,:,None]>0))
-    masked_image = masked_image.astype(original_image.dtype)
-    masked_image = Image.fromarray(masked_image)
-
-    if moved_mask.max() <= 1:
-        moved_mask = ((moved_mask * 255)[:,:,None]).astype(np.uint8)
-        original_mask = moved_mask
-    return [masked_image], [mask_image], original_mask.astype(np.uint8)
-
-
-def move_mask_right(input_image, original_image, original_mask, moving_pixels, resize_and_crop):
-    alpha_mask = input_image["layers"][0].split()[3]
-    input_mask = np.asarray(alpha_mask)
-    if resize_and_crop:
-        original_image = crop_and_resize(Image.fromarray(original_image), target_width=640, target_height=640)
-        input_mask = crop_and_resize(Image.fromarray(input_mask), target_width=640, target_height=640)
-        original_image = np.array(original_image)
-        input_mask = np.array(input_mask)
-
-    if input_mask.max() == 0:
-        if original_mask is None:
-            raise gr.Error('Please generate mask first')
-        original_mask = original_mask
-    else:
-        original_mask = input_mask[:,:,None]
-
-    moved_mask = move_mask_func(original_mask, 'right', int(moving_pixels)).squeeze()
-
-    mask_image = Image.fromarray(((moved_mask>0).astype(np.uint8)*255)).convert("RGB")
-
-    masked_image = original_image * (1 - (moved_mask[:,:,None]>0))
-    masked_image = masked_image.astype(original_image.dtype)
-    masked_image = Image.fromarray(masked_image)
-
-
-    if moved_mask.max() <= 1:
-        moved_mask = ((moved_mask * 255)[:,:,None]).astype(np.uint8)
-        original_mask = moved_mask
-
-    return [masked_image], [mask_image], original_mask.astype(np.uint8)
-
-
-def move_mask_up(input_image, original_image, original_mask, moving_pixels, resize_and_crop):
-    alpha_mask = input_image["layers"][0].split()[3]
-    input_mask = np.asarray(alpha_mask)
-    if resize_and_crop:
-        original_image = crop_and_resize(Image.fromarray(original_image), target_width=640, target_height=640)
-        input_mask = crop_and_resize(Image.fromarray(input_mask), target_width=640, target_height=640)
-        original_image = np.array(original_image)
-        input_mask = np.array(input_mask)
-
-    if input_mask.max() == 0:
-        if original_mask is None:
-            raise gr.Error('Please generate mask first')
-        original_mask = original_mask
-    else:
-        original_mask = input_mask[:,:,None]
-
-    moved_mask = move_mask_func(original_mask, 'up', int(moving_pixels)).squeeze()
-    mask_image = Image.fromarray(((moved_mask>0).astype(np.uint8)*255)).convert("RGB")
-
-    masked_image = original_image * (1 - (moved_mask[:,:,None]>0))
-    masked_image = masked_image.astype(original_image.dtype)
-    masked_image = Image.fromarray(masked_image)
-
-    if moved_mask.max() <= 1:
-        moved_mask = ((moved_mask * 255)[:,:,None]).astype(np.uint8)
-        original_mask = moved_mask
-
-    return [masked_image], [mask_image], original_mask.astype(np.uint8)
-
-
-def move_mask_down(input_image, original_image, original_mask, moving_pixels, resize_and_crop):
-    alpha_mask = input_image["layers"][0].split()[3]
-    input_mask = np.asarray(alpha_mask)
-    if resize_and_crop:
-        original_image = crop_and_resize(Image.fromarray(original_image), target_width=640, target_height=640)
-        input_mask = crop_and_resize(Image.fromarray(input_mask), target_width=640, target_height=640)
-        original_image = np.array(original_image)
-        input_mask = np.array(input_mask)
-
-    if input_mask.max() == 0:
-        if original_mask is None:
-            raise gr.Error('Please generate mask first')
-        original_mask = original_mask
-    else:
-        original_mask = input_mask[:,:,None]
-
-    moved_mask = move_mask_func(original_mask, 'down', int(moving_pixels)).squeeze()
-    mask_image = Image.fromarray(((moved_mask>0).astype(np.uint8)*255)).convert("RGB")
-
-    masked_image = original_image * (1 - (moved_mask[:,:,None]>0))
-    masked_image = masked_image.astype(original_image.dtype)
-    masked_image = Image.fromarray(masked_image)
-
-    if moved_mask.max() <= 1:
-        moved_mask = ((moved_mask * 255)[:,:,None]).astype(np.uint8)
-        original_mask = moved_mask
-
-    return [masked_image], [mask_image], original_mask.astype(np.uint8)
-
-
-def store_img(base):
-    import ipdb; ipdb.set_trace()
-    image_pil = base["background"].convert("RGB")
-    original_image = np.array(image_pil)
-    # import ipdb; ipdb.set_trace()
-    if max(original_image.shape[0], original_image.shape[1]) * 1.0 / min(original_image.shape[0], original_image.shape[1])>2.0:
-        raise gr.Error('image aspect ratio cannot be larger than 2.0')
-    return base, original_image, None, "", None, None, None, None, None
-
-
-def reset_func(input_image, original_image, original_mask, prompt, target_prompt):
-    input_image = None
-    original_image = None
-    original_mask = None
-    prompt = ''
-    mask_gallery = []
-    masked_gallery = []
-    result_gallery = []
-    target_prompt = ''
-    return input_image, original_image, original_mask, prompt, mask_gallery, masked_gallery, result_gallery, target_prompt
-
-
-block = gr.Blocks(
-    theme=gr.themes.Soft(
-        radius_size=gr.themes.sizes.radius_none,
-        text_size=gr.themes.sizes.text_md
-    )
-).queue()
-with block as demo:
-    with gr.Row():
-        with gr.Column():
-            gr.HTML(head)
-
-            gr.Markdown(descriptions)
-
-            with gr.Accordion(label="🧭 Instructions:", open=True, elem_id="accordion"):
-                with gr.Row(equal_height=True):
-                    gr.Markdown(instructions)
-
-    original_image = gr.State(value=None)
-    original_mask = gr.State(value=None)
-    category = gr.State(value=None)
-
-    with gr.Row():
-        with gr.Column():
-            with gr.Row():
-                input_image = gr.ImageEditor(
-                    label="Input Image",
-                    type="pil",
-                    brush=gr.Brush(colors=["#000000"], default_size = 30, color_mode="fixed"),
-                    layers = False,
-                    interactive=True,
-                    height=800,
-                    # transforms=("crop"),
-                    # crop_size=(640, 640),
-                )
-
-            prompt = gr.Textbox(label="Prompt", placeholder="Please input your instruction.",value='',lines=1)
-
-            with gr.Row():
-                mask_button = gr.Button("Generate Mask")
-                random_mask_button = gr.Button("Random Generated Mask")
-            with gr.Row():
-                dilation_mask_button = gr.Button("Dilation Generated Mask")
-                erosion_mask_button = gr.Button("Erosion Generated Mask")
-
-            with gr.Row():
-                generate_target_prompt_button = gr.Button("Generate Target Prompt")
-                run_button = gr.Button("Run")
-
-
-            target_prompt = gr.Text(
-                label="Target prompt",
-                max_lines=5,
-                placeholder="VLM-generated target prompt, you can first generate if and then modify it (optional)",
-                value='',
-                lines=2
-            )
-
-            resize_and_crop = gr.Checkbox(label="Resize and Crop (640 x 640)", value=False)
-
-            with gr.Accordion("More input params (highly-recommended)", open=False, elem_id="accordion1"):
-                negative_prompt = gr.Text(
-                    label="Negative Prompt",
-                    max_lines=5,
-                    placeholder="Please input your negative prompt",
-                    value='ugly, low quality',lines=1
-                )
-
-                control_strength = gr.Slider(
-                    label="Control Strength: ", show_label=True, minimum=0, maximum=1.1, value=1, step=0.01
-                )
-                with gr.Group():
-                    seed = gr.Slider(
-                        label="Seed: ", minimum=0, maximum=2147483647, step=1, value=648464818
-                    )
-                    randomize_seed = gr.Checkbox(label="Randomize seed", value=False)
-
-                blending = gr.Checkbox(label="Blending mode", value=True)
-
-
-                num_samples = gr.Slider(
-                    label="Num samples", minimum=0, maximum=4, step=1, value=4
-                )
-
-                with gr.Group():
-                    with gr.Row():
-                        guidance_scale = gr.Slider(
-                            label="Guidance scale",
-                            minimum=1,
-                            maximum=12,
-                            step=0.1,
-                            value=7.5,
-                        )
-                        num_inference_steps = gr.Slider(
-                            label="Number of inference steps",
-                            minimum=1,
-                            maximum=50,
-                            step=1,
-                            value=50,
-                        )
-
-
-        with gr.Column():
-            with gr.Row():
-                with gr.Tabs(elem_classes=["feedback"]):
-                    with gr.TabItem("Mask"):
-                        mask_gallery = gr.Gallery(label='Mask', show_label=False, elem_id="gallery", preview=True, height=360)
-                with gr.Tabs(elem_classes=["feedback"]):
-                    with gr.TabItem("Masked Image"):
-                        masked_gallery = gr.Gallery(label='Masked Image', show_label=False, elem_id="gallery", preview=True, height=360)
-
-            moving_pixels = gr.Slider(
-                label="Moving pixels:", show_label=True, minimum=0, maximum=50, value=4, step=1
-            )
-            with gr.Row():
-                move_left_button = gr.Button("Move Left")
-                move_right_button = gr.Button("Move Right")
-            with gr.Row():
-                move_up_button = gr.Button("Move Up")
-                move_down_button = gr.Button("Move Down")
-
-            with gr.Tabs(elem_classes=["feedback"]):
-                with gr.TabItem("Outputs"):
-                    result_gallery = gr.Gallery(label='Output', show_label=False, elem_id="gallery", preview=True, height=360)
-
-            reset_button = gr.Button("Reset")
-
-
-    with gr.Row():
-        # # example = gr.Examples(
-        # #     label="Quick Example",
-        # #     examples=EXAMPLES,
-        # #     inputs=[prompt, seed, result_gallery, mask_gallery, masked_gallery],
-        # #     examples_per_page=10,
-        # #     cache_examples=False,
-        # # )
-        example = gr.Examples(
-            label="Quick Example",
-            examples=EXAMPLES,
-            inputs=[input_image, prompt],
-            examples_per_page=10,
-            cache_examples=False,
-        )
-        # def process_example(prompt, seed, eg_output):
-        #     import ipdb; ipdb.set_trace()
-        #     eg_output_path = os.path.join("assets/", eg_output)
-        #     return prompt, seed, [Image.open(eg_output_path)]
-        # example = gr.Examples(
-        #     label="Quick Example",
-        #     examples=EXAMPLES,
-        #     inputs=[prompt, seed, eg_output],
-        #     outputs=[prompt, seed, result_gallery],
-        #     fn=process_example,
-        #     examples_per_page=10,
-        #     run_on_click=True,
-        #     cache_examples=False,
-        # )
-
-    input_image.upload(
-        store_img,
-        [input_image],
-        [input_image, original_image, original_mask, prompt, mask_gallery, masked_gallery, result_gallery, target_prompt]
-    )
-
-
-    ips=[input_image,
-         original_image,
-         original_mask,
-         prompt,
-         negative_prompt,
-         control_strength,
-         seed,
-         randomize_seed,
-         guidance_scale,
-         num_inference_steps,
-         num_samples,
-         blending,
-         category,
-         target_prompt,
-         resize_and_crop]
-
-    ## run brushedit
-    run_button.click(fn=process, inputs=ips, outputs=[result_gallery, mask_gallery, masked_gallery, target_prompt])
-
-    ## mask func
-    mask_button.click(fn=process_mask, inputs=[input_image, original_image, prompt, resize_and_crop], outputs=[masked_gallery, mask_gallery, original_mask, category])
-    random_mask_button.click(fn=process_random_mask, inputs=[input_image, original_image, original_mask, resize_and_crop], outputs=[masked_gallery, mask_gallery, original_mask])
-    dilation_mask_button.click(fn=process_dilation_mask, inputs=[input_image, original_image, original_mask, resize_and_crop], outputs=[ masked_gallery, mask_gallery, original_mask])
-    erosion_mask_button.click(fn=process_erosion_mask, inputs=[input_image, original_image, original_mask, resize_and_crop], outputs=[ masked_gallery, mask_gallery, original_mask])
-
-    ## move mask func
-    move_left_button.click(fn=move_mask_left, inputs=[input_image, original_image, original_mask, moving_pixels, resize_and_crop], outputs=[masked_gallery, mask_gallery, original_mask])
-    move_right_button.click(fn=move_mask_right, inputs=[input_image, original_image, original_mask, moving_pixels, resize_and_crop], outputs=[masked_gallery, mask_gallery, original_mask])
-    move_up_button.click(fn=move_mask_up, inputs=[input_image, original_image, original_mask, moving_pixels, resize_and_crop], outputs=[masked_gallery, mask_gallery, original_mask])
-    move_down_button.click(fn=move_mask_down, inputs=[input_image, original_image, original_mask, moving_pixels, resize_and_crop], outputs=[masked_gallery, mask_gallery, original_mask])
-
-    ## prompt func
-    generate_target_prompt_button.click(fn=generate_target_prompt, inputs=[input_image, original_image, prompt], outputs=[target_prompt])
-
-    ## reset func
-    reset_button.click(fn=reset_func, inputs=[input_image, original_image, original_mask, prompt, target_prompt], outputs=[input_image, original_image, original_mask, prompt, mask_gallery, masked_gallery, result_gallery, target_prompt])
-
-demo.launch(server_name="0.0.0.0")
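To make the removed app's mask workflow easier to follow, here is a standalone sketch (not part of this commit) of its core mask post-processing, reduced to the NumPy/SciPy calls used by random_mask_func and the masked-preview step; the placeholder image and mask sizes are assumptions.

# Sketch: dilate a binary mask and build the masked preview, as the old app did.
import numpy as np
from scipy.ndimage import binary_dilation

original_image = np.zeros((640, 640, 3), dtype=np.uint8)  # placeholder image
mask = np.zeros((640, 640, 1), dtype=np.uint8)            # placeholder mask
mask[200:400, 200:400] = 255

# Square dilation, as in random_mask_func(..., 'square_dilation').
structure = np.ones((30, 30), dtype=bool)
dilated = binary_dilation(mask.squeeze() > 0, structure=structure)
dilated_mask = np.uint8(dilated[:, :, np.newaxis]) * 255

# Masked preview: zero out the edited region before passing it to the pipeline.
masked_image = (original_image * (1 - (dilated_mask > 0))).astype(np.uint8)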
app/gpt4_o/instructions.py CHANGED
@@ -1,15 +1,16 @@
-def
+def create_editing_category_messages_gpt4o(editing_prompt):
     messages = [{
         "role": "system",
         "content": [
             {
                 "type": "text",
-                "text": "I will give you an
-
-
-
-
-
+                "text": "I will give you an editing instruction of the image. Please output which type of editing category it is in. You can choose from the following categories: \n\
+                1. Addition: Adding new objects within the images, e.g., add a bird \n\
+                2. Remove: Removing objects, e.g., remove the mask \n\
+                3. Local: Replace local parts of an object and later the object's attributes (e.g., make it smile) or alter an object's visual appearance without affecting its structure (e.g., change the cat to a dog) \n\
+                4. Global: Edit the entire image, e.g., let's see it in winter \n\
+                5. Background: Change the scene's background, e.g., have her walk on water, change the background to a beach, make the hedgehog in France, etc. \n\
+                Only output a single word, e.g., 'Addition'.",
             },]
     },
     {
@@ -24,7 +25,7 @@ def create_editing_category_messages(editing_prompt):
     return messages
 
 
-def
+def create_ori_object_messages_gpt4o(editing_prompt):
 
     messages = [
         {
@@ -49,7 +50,7 @@ def create_ori_object_messages(editing_prompt):
     return messages
 
 
-def
+def create_add_object_messages_gpt4o(editing_prompt, base64_image, height=640, width=640):
 
     size_str = f"The image size is height {height}px and width {width}px. The top - left corner is coordinate [0 , 0]. The bottom - right corner is coordinnate [{height} , {width}]. "
 
@@ -77,7 +78,7 @@ def create_add_object_messages(editing_prompt, base64_image, height=640, width=6
     return messages
 
 
-def
+def create_apply_editing_messages_gpt4o(editing_prompt, base64_image):
     messages = [
         {
             "role": "system",
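A minimal sketch (not from the diff) of how the renamed *_gpt4o message builders are typically consumed with the OpenAI client, mirroring the vlm = OpenAI(...) setup in the removed app; the model name below is an assumption.

# Sketch: classify an editing instruction with the GPT-4o message builder.
from openai import OpenAI
from app.gpt4_o.instructions import create_editing_category_messages_gpt4o

client = OpenAI()  # reads OPENAI_API_KEY from the environment
messages = create_editing_category_messages_gpt4o("remove the mask")
response = client.chat.completions.create(model="gpt-4o", messages=messages)
print(response.choices[0].message.content)  # e.g. 'Remove'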
app/gpt4_o/requirements.txt DELETED
@@ -1,18 +0,0 @@
-torchvision
-transformers>=4.25.1
-ftfy
-tensorboard
-datasets
-Pillow==9.5.0
-opencv-python
-imgaug
-accelerate==0.20.3
-image-reward
-hpsv2
-torchmetrics
-open-clip-torch
-clip
-# gradio==4.44.1
-gradio==4.38.1
-segment_anything
-openai
app/llava/instructions.py
ADDED
@@ -0,0 +1,108 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
def create_editing_category_messages_llava(editing_prompt):
|
2 |
+
    messages = [{
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": "I will give you an image and an editing instruction of the image. Please output which type of editing category it is in. You can choose from the following categories: \n\
                    1. Addition: Adding new objects within the images, e.g., add a bird \n\
                    2. Remove: Removing objects, e.g., remove the mask \n\
                    3. Local: Replace local parts of an object and alter the object's attributes (e.g., make it smile) or alter an object's visual appearance without affecting its structure (e.g., change the cat to a dog) \n\
                    4. Global: Edit the entire image, e.g., let's see it in winter \n\
                    5. Background: Change the scene's background, e.g., have her walk on water, change the background to a beach, make the hedgehog in France, etc. \n\
                    Only output a single word, e.g., 'Addition'.",
            },]
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "image"
                },
                {
                    "type": "text",
                    "text": editing_prompt
                },
            ]
        }]
    return messages


def create_ori_object_messages_llava(editing_prompt):

    messages = [
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "I will give you an editing instruction of the image. Please output the object needed to be edited. You only need to output the basic description of the object in no more than 5 words. The output should only contain one noun. \n \
                        For example, the editing instruction is 'Change the white cat to a black dog'. Then you need to output: 'white cat'. Only output the new content. Do not output anything else."
                },]
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "image"
                },
                {
                    "type": "text",
                    "text": editing_prompt
                }
            ]
        }
    ]
    return messages


def create_add_object_messages_llava(editing_prompt, height=640, width=640):

    size_str = f"The image size is height {height}px and width {width}px. The top - left corner is coordinate [0 , 0]. The bottom - right corner is coordinate [{height} , {width}]. "

    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "image"
                },
                {
                    "type": "text",
                    "text": "I need to add an object to the image following the instruction: " + editing_prompt + ". " + size_str + " \n \
                        Can you give me a possible bounding box of the location for the added object? Please output with the format of [top - left x coordinate , top - left y coordinate , box width , box height]. You should only output the bounding box position and nothing else. Please refer to the example below for the desired format.\n\
                        [Examples]\n \
                        [19, 101, 32, 153]\n \
                        [54, 12, 242, 96]"
                },
            ]
        }
    ]
    return messages


def create_apply_editing_messages_llava(editing_prompt):
    messages = [
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "I will provide an image along with an editing instruction. Please describe the new content that should be present in the image after applying the instruction. \n \
                        For example, if the original image content shows a grandmother wearing a mask and the instruction is 'remove the mask', your output should be: 'a grandmother'. The output should only include elements that remain in the image after the edit and should not mention elements that have been changed or removed, such as 'mask' in this example. Do not output 'sorry, xxx', even if it's a guess, directly output the answer you think is correct."
                },]
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "image"
                },
                {
                    "type": "text",
                    "text": editing_prompt
                },
            ]
        },
    ]
    return messages
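A minimal usage sketch (editor's addition, not part of this diff): how a caller might turn one of these LLaVA message builders into a generation request with transformers, which app/src/vlm_pipeline.py handles in the actual app. The checkpoint name, image path, and generation settings below are illustrative assumptions.

from PIL import Image
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

# Assumed checkpoint; the demo preloads its VLMs via app/src/vlm_template.py.
processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
model = LlavaNextForConditionalGeneration.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf", torch_dtype="auto", device_map="auto"
)

image = Image.open("./assets/frog/frog.jpeg").convert("RGB")
messages = create_add_object_messages_llava(
    "add a magic hat on frog head.", height=image.height, width=image.width
)
# Render the chat messages into the model's prompt format, then ask for a bounding box.
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32)
print(processor.decode(output[0], skip_special_tokens=True))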
app/qwen2/instructions.py
ADDED
@@ -0,0 +1,103 @@
def create_editing_category_messages_qwen2(editing_prompt):
    messages = [{
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": "I will give you an image and an editing instruction of the image. Please output which type of editing category it is in. You can choose from the following categories: \n\
                    1. Addition: Adding new objects within the images, e.g., add a bird to the image \n\
                    2. Remove: Removing objects, e.g., remove the mask \n\
                    3. Local: Replace local parts of an object and alter the object's attributes (e.g., make it smile) or alter an object's visual appearance without affecting its structure (e.g., change the cat to a dog) \n\
                    4. Global: Edit the entire image, e.g., let's see it in winter \n\
                    5. Background: Change the scene's background, e.g., have her walk on water, change the background to a beach, make the hedgehog in France, etc.",
            },]
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": editing_prompt
                },
            ]
        }]
    return messages


def create_ori_object_messages_qwen2(editing_prompt):

    messages = [
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "I will give you an editing instruction of the image. Please output the object needed to be edited. You only need to output the basic description of the object in no more than 5 words. The output should only contain one noun. \n \
                        For example, the editing instruction is 'Change the white cat to a black dog'. Then you need to output: 'white cat'. Only output the new content. Do not output anything else."
                },]
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": editing_prompt
                }
            ]
        }
    ]
    return messages


def create_add_object_messages_qwen2(editing_prompt, base64_image, height=640, width=640):

    size_str = f"The image size is height {height}px and width {width}px. The top - left corner is coordinate [0 , 0]. The bottom - right corner is coordinate [{height} , {width}]. "

    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "I need to add an object to the image following the instruction: " + editing_prompt + ". " + size_str + " \n \
                        Can you give me a possible bounding box of the location for the added object? Please output with the format of [top - left x coordinate , top - left y coordinate , box width , box height]. You should only output the bounding box position and nothing else. Please refer to the example below for the desired format.\n\
                        [Examples]\n \
                        [19, 101, 32, 153]\n \
                        [54, 12, 242, 96]"
                },
                {
                    "type": "image",
                    "image": f"data:image;base64,{base64_image}",
                }
            ]
        }
    ]
    return messages


def create_apply_editing_messages_qwen2(editing_prompt, base64_image):
    messages = [
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "I will provide an image along with an editing instruction. Please describe the new content that should be present in the image after applying the instruction. \n \
                        For example, if the original image content shows a grandmother wearing a mask and the instruction is 'remove the mask', your output should be: 'a grandmother'. The output should only include elements that remain in the image after the edit and should not mention elements that have been changed or removed, such as 'mask' in this example. Do not output 'sorry, xxx', even if it's a guess, directly output the answer you think is correct."
                },]
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": editing_prompt
                },
                {
                    "type": "image",
                    "image": f"data:image;base64,{base64_image}",
                },
            ]
        }
    ]
    return messages
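A minimal usage sketch (editor's addition, not part of this diff): unlike the LLaVA builders, these Qwen2-VL builders take the image as a base64 data URI, and the bounding-box reply still has to be parsed by the caller. The helper name, image path, and the hard-coded reply below are illustrative assumptions.

import ast
import base64
import io

from PIL import Image

def pil_to_base64(image: Image.Image) -> str:
    # Assumed helper: encode a PIL image as base64 PNG for the "data:image;base64,..." field.
    buffer = io.BytesIO()
    image.convert("RGB").save(buffer, format="PNG")
    return base64.b64encode(buffer.getvalue()).decode("utf-8")

image = Image.open("./assets/frog/frog.jpeg")
messages = create_add_object_messages_qwen2(
    "add a magic hat on frog head.",
    pil_to_base64(image),
    height=image.height,
    width=image.width,
)

# `reply` would normally come from Qwen2VLForConditionalGeneration.generate(...) run on `messages`.
reply = "[19, 101, 32, 153]"
x, y, box_w, box_h = ast.literal_eval(reply)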
app/{gpt4_o/run_app.sh → run_app.sh}
RENAMED
@@ -2,4 +2,4 @@ export PYTHONPATH=.:$PYTHONPATH
 
 export CUDA_VISIBLE_DEVICES=0
 
-python app/
+python app/src/brushedit_app.py
app/src/aspect_ratio_template.py
ADDED
@@ -0,0 +1,88 @@
# From https://github.com/TencentARC/PhotoMaker/pull/120 written by https://github.com/DiscoNova
# Note: Since output width & height need to be divisible by 8, the w & h -values do
# not exactly match the stated aspect ratios... but they are "close enough":)

aspect_ratio_list = [
    {
        "name": "Small Square (1:1)",
        "w": 640,
        "h": 640,
    },
    {
        "name": "Custom resolution",
        "w": "",
        "h": "",
    },
    {
        "name": "Instagram (1:1)",
        "w": 1024,
        "h": 1024,
    },
    {
        "name": "35mm film / Landscape (3:2)",
        "w": 1024,
        "h": 680,
    },
    {
        "name": "35mm film / Portrait (2:3)",
        "w": 680,
        "h": 1024,
    },
    {
        "name": "CRT Monitor / Landscape (4:3)",
        "w": 1024,
        "h": 768,
    },
    {
        "name": "CRT Monitor / Portrait (3:4)",
        "w": 768,
        "h": 1024,
    },
    {
        "name": "Widescreen TV / Landscape (16:9)",
        "w": 1024,
        "h": 576,
    },
    {
        "name": "Widescreen TV / Portrait (9:16)",
        "w": 576,
        "h": 1024,
    },
    {
        "name": "Widescreen Monitor / Landscape (16:10)",
        "w": 1024,
        "h": 640,
    },
    {
        "name": "Widescreen Monitor / Portrait (10:16)",
        "w": 640,
        "h": 1024,
    },
    {
        "name": "Cinemascope (2.39:1)",
        "w": 1024,
        "h": 424,
    },
    {
        "name": "Widescreen Movie (1.85:1)",
        "w": 1024,
        "h": 552,
    },
    {
        "name": "Academy Movie (1.37:1)",
        "w": 1024,
        "h": 744,
    },
    {
        "name": "Sheet-print (A-series) / Landscape (297:210)",
        "w": 1024,
        "h": 720,
    },
    {
        "name": "Sheet-print (A-series) / Portrait (210:297)",
        "w": 720,
        "h": 1024,
    },
]

aspect_ratios = {k["name"]: (k["w"], k["h"]) for k in aspect_ratio_list}
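A minimal usage sketch (editor's addition, not part of this diff): resolving a preset name from `aspect_ratios`, falling back to the source size for the "Custom resolution" entry, and snapping to a multiple of 8 as the note above requires. The helper name is an assumption; the Gradio app does its own resizing in app/src/brushedit_app.py.

def resolve_output_size(name: str, fallback_w: int, fallback_h: int) -> tuple[int, int]:
    w, h = aspect_ratios[name]
    if w == "" or h == "":          # "Custom resolution" stores empty strings
        w, h = fallback_w, fallback_h
    return (w // 8) * 8, (h // 8) * 8

print(resolve_output_size("Widescreen TV / Landscape (16:9)", 1280, 720))  # (1024, 576)
print(resolve_output_size("Custom resolution", 1280, 720))                 # (1280, 720)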
app/src/base_model_template.py
ADDED
@@ -0,0 +1,61 @@
import os
import torch
from huggingface_hub import snapshot_download

from diffusers import StableDiffusionBrushNetPipeline, BrushNetModel, UniPCMultistepScheduler


torch_dtype = torch.float16
device = "cpu"

BrushEdit_path = "models/"
if not os.path.exists(BrushEdit_path):
    BrushEdit_path = snapshot_download(
        repo_id="TencentARC/BrushEdit",
        local_dir=BrushEdit_path,
        token=os.getenv("HF_TOKEN"),
    )
brushnet_path = os.path.join(BrushEdit_path, "brushnetX")
brushnet = BrushNetModel.from_pretrained(brushnet_path, torch_dtype=torch_dtype)


base_models_list = [
    {
        "name": "dreamshaper_8 (Preload)",
        "local_path": "models/base_model/dreamshaper_8",
        "pipe": StableDiffusionBrushNetPipeline.from_pretrained(
            "models/base_model/dreamshaper_8", brushnet=brushnet, torch_dtype=torch_dtype, low_cpu_mem_usage=False
        ).to(device)
    },
    {
        "name": "epicrealism (Preload)",
        "local_path": "models/base_model/epicrealism_naturalSinRC1VAE",
        "pipe": StableDiffusionBrushNetPipeline.from_pretrained(
            "models/base_model/epicrealism_naturalSinRC1VAE", brushnet=brushnet, torch_dtype=torch_dtype, low_cpu_mem_usage=False
        ).to(device)
    },
    {
        "name": "henmixReal (Preload)",
        "local_path": "models/base_model/henmixReal_v5c",
        "pipe": StableDiffusionBrushNetPipeline.from_pretrained(
            "models/base_model/henmixReal_v5c", brushnet=brushnet, torch_dtype=torch_dtype, low_cpu_mem_usage=False
        ).to(device)
    },
    {
        "name": "meinamix (Preload)",
        "local_path": "models/base_model/meinamix_meinaV11",
        "pipe": StableDiffusionBrushNetPipeline.from_pretrained(
            "models/base_model/meinamix_meinaV11", brushnet=brushnet, torch_dtype=torch_dtype, low_cpu_mem_usage=False
        ).to(device)
    },
    {
        "name": "realisticVision (Default)",
        "local_path": "models/base_model/realisticVisionV60B1_v51VAE",
        "pipe": StableDiffusionBrushNetPipeline.from_pretrained(
            "models/base_model/realisticVisionV60B1_v51VAE", brushnet=brushnet, torch_dtype=torch_dtype, low_cpu_mem_usage=False
        ).to(device)
    },
]

base_models_template = {k["name"]: (k["local_path"], k["pipe"]) for k in base_models_list}
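A minimal usage sketch (editor's addition, not part of this diff): selecting one of the preloaded pipelines by its display name, roughly what `update_base_model` in app/src/brushedit_app.py does with this template. Moving the pipe to CUDA and re-creating the scheduler here are assumptions, not the app's exact flow.

import torch

name = "realisticVision (Default)"
local_path, pipe = base_models_template[name]
if torch.cuda.is_available():
    pipe = pipe.to("cuda")
# Swap in the faster UniPC scheduler, as the main app does for its default pipeline.
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)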
app/{gpt4_o → src}/brushedit_all_in_one_pipeline.py
RENAMED
@@ -22,10 +22,6 @@ def BrushEdit_Pipeline(pipe,
 
     mask_np = mask_np / 255
     height, width = mask_np.shape[0], mask_np.shape[1]
-    # back/foreground
-    # if mask_np[94:547,94:546].sum() < mask_np.sum() - mask_np[94:547,94:546].sum() and mask_np[0,:].sum()>0 and mask_np[-1,:].sum()>0 and mask_np[:,0].sum()>0 and mask_np[:,-1].sum()>0 and mask_np[1,:].sum()>0 and mask_np[-2,:].sum()>0 and mask_np[:,1].sum()>0 and mask_np[:,-2].sum()>0 :
-    #     mask_np = 1 - mask_np
-
     ## resize the mask and original image to the same size which is divisible by vae_scale_factor
     image_processor = VaeImageProcessor(vae_scale_factor=pipe.vae_scale_factor, do_convert_rgb=True)
     height_new, width_new = image_processor.get_default_height_width(original_image, height, width)
@@ -53,16 +49,13 @@ def BrushEdit_Pipeline(pipe,
         height=height_new,
         width=width_new,
     ).images
-
+    ## convert to vae shape format, must be divisible by 8
+    original_image_pil = Image.fromarray(original_image).convert("RGB")
+    init_image_np = np.array(image_processor.preprocess(original_image_pil, height=height_new, width=width_new).squeeze())
+    init_image_np = ((init_image_np.transpose(1,2,0) + 1.) / 2.) * 255
+    init_image_np = init_image_np.astype(np.uint8)
     if blending:
-
         mask_blurred = mask_blurred * 0.5 + 0.5
-
-        ## convert to vae shape format, must be divisible by 8
-        original_image_pil = Image.fromarray(original_image).convert("RGB")
-        init_image_np = np.array(image_processor.preprocess(original_image_pil, height=height_new, width=width_new).squeeze())
-        init_image_np = ((init_image_np.transpose(1,2,0) + 1.) / 2.) * 255
-        init_image_np = init_image_np.astype(np.uint8)
         image_all = []
         for image_i in images:
             image_np = np.array(image_i)
@@ -75,6 +68,6 @@ def BrushEdit_Pipeline(pipe,
         image_all = images
 
 
-    return image_all, mask_image
+    return image_all, mask_image, mask_np, init_image_np
 
 
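A minimal sketch (editor's addition, not part of this diff) of why the return value was widened: the caller in app/src/brushedit_app.py now derives a masked preview of the resized input from `mask_np` and `init_image_np`. Toy arrays stand in for the real pipeline outputs here.

import numpy as np
from PIL import Image

init_image_np = np.zeros((640, 640, 3), dtype=np.uint8)  # stands in for the resized input image
mask_np = np.zeros((640, 640, 1), dtype=np.uint8)
mask_np[200:400, 200:400] = 255                           # stands in for the edit mask

# Zero out the masked region to show what the inpainting model is asked to fill.
masked_image = (init_image_np * (1 - (mask_np > 0))).astype(np.uint8)
preview = Image.fromarray(masked_image)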
app/src/brushedit_app.py
ADDED
@@ -0,0 +1,1690 @@
1 |
+
#!/usr/bin/python3
|
2 |
+
# -*- coding: utf-8 -*-
|
3 |
+
import os, random, sys
|
4 |
+
import numpy as np
|
5 |
+
import requests
|
6 |
+
import torch
|
7 |
+
import spaces
|
8 |
+
|
9 |
+
|
10 |
+
import gradio as gr
|
11 |
+
|
12 |
+
from PIL import Image
|
13 |
+
|
14 |
+
|
15 |
+
from huggingface_hub import hf_hub_download, snapshot_download
|
16 |
+
from scipy.ndimage import binary_dilation, binary_erosion
|
17 |
+
from transformers import (LlavaNextProcessor, LlavaNextForConditionalGeneration,
|
18 |
+
Qwen2VLForConditionalGeneration, Qwen2VLProcessor)
|
19 |
+
|
20 |
+
from segment_anything import SamPredictor, build_sam, SamAutomaticMaskGenerator
|
21 |
+
from diffusers import StableDiffusionBrushNetPipeline, BrushNetModel, UniPCMultistepScheduler
|
22 |
+
from diffusers.image_processor import VaeImageProcessor
|
23 |
+
|
24 |
+
|
25 |
+
from app.src.vlm_pipeline import (
|
26 |
+
vlm_response_editing_type,
|
27 |
+
vlm_response_object_wait_for_edit,
|
28 |
+
vlm_response_mask,
|
29 |
+
vlm_response_prompt_after_apply_instruction
|
30 |
+
)
|
31 |
+
from app.src.brushedit_all_in_one_pipeline import BrushEdit_Pipeline
|
32 |
+
from app.utils.utils import load_grounding_dino_model
|
33 |
+
|
34 |
+
from app.src.vlm_template import vlms_template
|
35 |
+
from app.src.base_model_template import base_models_template
|
36 |
+
from app.src.aspect_ratio_template import aspect_ratios
|
37 |
+
|
38 |
+
from openai import OpenAI
|
39 |
+
# base_openai_url = ""
|
40 |
+
|
41 |
+
#### Description ####
|
42 |
+
logo = r"""
|
43 |
+
<center><img src='./assets/logo_brushedit.png' alt='BrushEdit logo' style="width:80px; margin-bottom:10px"></center>
|
44 |
+
"""
|
45 |
+
head = r"""
|
46 |
+
<div style="text-align: center;">
|
47 |
+
<h1> BrushEdit: All-In-One Image Inpainting and Editing</h1>
|
48 |
+
<div style="display: flex; justify-content: center; align-items: center; text-align: center;">
|
49 |
+
<a href='https://liyaowei-stu.github.io/project/BrushEdit/'><img src='https://img.shields.io/badge/Project_Page-BrushEdit-green' alt='Project Page'></a>
|
50 |
+
<a href='https://arxiv.org/abs/2412.10316'><img src='https://img.shields.io/badge/Paper-Arxiv-blue'></a>
|
51 |
+
<a href='https://github.com/TencentARC/BrushEdit'><img src='https://img.shields.io/badge/Code-Github-orange'></a>
|
52 |
+
|
53 |
+
</div>
|
54 |
+
</br>
|
55 |
+
</div>
|
56 |
+
"""
|
57 |
+
descriptions = r"""
|
58 |
+
Official Gradio Demo for <a href='https://tencentarc.github.io/BrushNet/'><b>BrushEdit: All-In-One Image Inpainting and Editing</b></a><br>
|
59 |
+
🧙 BrushEdit enables precise, user-friendly instruction-based image editing via an inpainting model.<br>
|
60 |
+
"""
|
61 |
+
|
62 |
+
instructions = r"""
|
63 |
+
Currently, we support two modes: <b>fully automated command editing</b> and <b>interactive command editing</b>.
|
64 |
+
|
65 |
+
🛠️ <b>Fully automated instruction-based editing</b>:
|
66 |
+
<ul>
|
67 |
+
<li> ⭐️ <b>1.Choose Image: </b> Upload <img src="https://github.com/user-attachments/assets/f2dca1e6-31f9-4716-ae84-907f24415bac" alt="upload" style="display:inline; height:1em; vertical-align:middle;"> or select <img src="https://github.com/user-attachments/assets/de808f7d-c74a-44c7-9cbf-f0dbfc2c1abf" alt="example" style="display:inline; height:1em; vertical-align:middle;"> one image from Example. </li>
|
68 |
+
<li> ⭐️ <b>2.Input ⌨️ Instructions: </b> Input the instructions (supports addition, deletion, and modification), e.g., remove xxx.</li>
|
69 |
+
<li> ⭐️ <b>3.Run: </b> Click the <b>💫 Run</b> button to automatically edit the image.</li>
|
70 |
+
</ul>
|
71 |
+
|
72 |
+
🛠️ <b>Interactive instruction-based editing</b>:
|
73 |
+
<ul>
|
74 |
+
<li> ⭐️ <b>1.Choose Image: </b> Upload <img src="https://github.com/user-attachments/assets/f2dca1e6-31f9-4716-ae84-907f24415bac" alt="upload" style="display:inline; height:1em; vertical-align:middle;"> or select <img src="https://github.com/user-attachments/assets/de808f7d-c74a-44c7-9cbf-f0dbfc2c1abf" alt="example" style="display:inline; height:1em; vertical-align:middle;"> one image from Example. </li>
|
75 |
+
<li> ⭐️ <b>2.Finely Brushing: </b> Use a brush <img src="https://github.com/user-attachments/assets/c466c5cc-ac8f-4b4a-9bc5-04c4737fe1ef" alt="brush" style="display:inline; height:1em; vertical-align:middle;"> to outline the area you want to edit. And You can also use the eraser <img src="https://github.com/user-attachments/assets/b6370369-b080-4550-b0d0-830ff22d9068" alt="eraser" style="display:inline; height:1em; vertical-align:middle;"> to restore. </li>
|
76 |
+
<li> ⭐️ <b>3.Input ⌨️ Instructions: </b> Input the instructions. </li>
|
77 |
+
<li> ⭐️ <b>4.Run: </b> Click the <b>💫 Run</b> button to automatically edit the image. </li>
|
78 |
+
</ul>
|
79 |
+
|
80 |
+
<b> We strongly recommend using GPT-4o for reasoning. </b> After selecting GPT4-o as the VLM model, enter the API key and click the Submit and Verify button. If the output reports success, you can use GPT4-o normally. As a second choice, we recommend the Qwen2-VL model.
|
81 |
+
|
82 |
+
<b> We recommend zooming out in your browser for a better viewing range and experience. </b>
|
83 |
+
|
84 |
+
<b> For more detailed feature descriptions, see the bottom. </b>
|
85 |
+
|
86 |
+
☕️ Have fun! 🎄 Wishing you a merry Christmas!
|
87 |
+
"""
|
88 |
+
|
89 |
+
tips = r"""
|
90 |
+
💡 <b>Some Tips</b>:
|
91 |
+
<ul>
|
92 |
+
<li> 🤠 After inputting the instructions, you can click the <b>Generate Mask</b> button. The mask generated by the VLM will be displayed in the preview panel on the right side. </li>
|
93 |
+
<li> 🤠 After generating the mask or when you use the brush to draw the mask, you can perform operations such as <b>randomization</b>, <b>dilation</b>, <b>erosion</b>, and <b>movement</b>. </li>
|
94 |
+
<li> 🤠 After inputting the instructions, you can click the <b>Generate Target Prompt</b> button. The target prompt will be displayed in the text box, and you can modify it according to your ideas. </li>
|
95 |
+
</ul>
|
96 |
+
|
97 |
+
💡 <b>Detailed Features</b>:
|
98 |
+
<ul>
|
99 |
+
<li> 🎨 <b>Aspect Ratio</b>: Select the aspect ratio of the image. To prevent OOM, 1024px is the maximum resolution.</li>
|
100 |
+
<li> 🎨 <b>VLM Model</b>: Select the VLM model. We use preloaded models to save time. To use other VLM models, download them and uncomment the relevant lines in vlm_template.py from our GitHub repo. </li>
|
101 |
+
<li> 🎨 <b>Generate Mask</b>: According to the input instructions, generate a mask for the area that may need to be edited. </li>
|
102 |
+
<li> 🎨 <b>Square/Circle Mask</b>: Based on the existing mask, generate masks for squares and circles. (The coarse-grained mask provides more editing imagination.) </li>
|
103 |
+
<li> 🎨 <b>Invert Mask</b>: Invert the mask to generate a new mask. </li>
|
104 |
+
<li> 🎨 <b>Dilation/Erosion Mask</b>: Expand or shrink the mask to include or exclude more areas. </li>
|
105 |
+
<li> 🎨 <b>Move Mask</b>: Move the mask to a new position. </li>
|
106 |
+
<li> 🎨 <b>Generate Target Prompt</b>: Generate a target prompt based on the input instructions. </li>
|
107 |
+
<li> 🎨 <b>Target Prompt</b>: Description for masking area, manual input or modification can be made when the content generated by VLM does not meet expectations. </li>
|
108 |
+
<li> 🎨 <b>Blending</b>: Blend BrushNet's output with the original input, preserving the original image details in the unedited areas. (Turning it off is better when removing.) </li>
|
109 |
+
<li> 🎨 <b>Control length</b>: The intensity of editing and inpainting. </li>
|
110 |
+
</ul>
|
111 |
+
|
112 |
+
💡 <b>Advanced Features</b>:
|
113 |
+
<ul>
|
114 |
+
<li> 🎨 <b>Base Model</b>: We use preloaded models to save time. To use other base models, download them and uncomment the relevant lines in base_model_template.py from our GitHub repo. </li>
|
115 |
+
<li> 🎨 <b>Blending</b>: Blend BrushNet's output with the original input, preserving the original image details in the unedited areas. (Turning it off is better when removing.) </li>
|
116 |
+
<li> 🎨 <b>Control length</b>: The intensity of editing and inpainting. </li>
|
117 |
+
<li> 🎨 <b>Num samples</b>: The number of samples to generate. </li>
|
118 |
+
<li> 🎨 <b>Negative prompt</b>: The negative prompt for the classifier-free guidance. </li>
|
119 |
+
<li> 🎨 <b>Guidance scale</b>: The guidance scale for the classifier-free guidance. </li>
|
120 |
+
</ul>
|
121 |
+
|
122 |
+
|
123 |
+
"""
|
124 |
+
|
125 |
+
|
126 |
+
|
127 |
+
citation = r"""
|
128 |
+
If BrushEdit is helpful, please help to ⭐ the <a href='https://github.com/TencentARC/BrushEdit' target='_blank'>Github Repo</a>. Thanks!
|
129 |
+
[![GitHub Stars](https://img.shields.io/github/stars/TencentARC/BrushEdit?style=social)](https://github.com/TencentARC/BrushEdit)
|
130 |
+
---
|
131 |
+
📝 **Citation**
|
132 |
+
<br>
|
133 |
+
If our work is useful for your research, please consider citing:
|
134 |
+
```bibtex
|
135 |
+
@misc{li2024brushedit,
|
136 |
+
title={BrushEdit: All-In-One Image Inpainting and Editing},
|
137 |
+
author={Yaowei Li and Yuxuan Bian and Xuan Ju and Zhaoyang Zhang and Junhao Zhuang and Ying Shan and Yuexian Zou and Qiang Xu},
|
138 |
+
year={2024},
|
139 |
+
eprint={2412.10316},
|
140 |
+
archivePrefix={arXiv},
|
141 |
+
primaryClass={cs.CV}
|
142 |
+
}
|
143 |
+
```
|
144 |
+
📧 **Contact**
|
145 |
+
<br>
|
146 |
+
If you have any questions, please feel free to reach me out at <b>liyaowei@gmail.com</b>.
|
147 |
+
"""
|
148 |
+
|
149 |
+
# - - - - - examples - - - - - #
|
150 |
+
EXAMPLES = [
|
151 |
+
|
152 |
+
[
|
153 |
+
Image.open("./assets/frog/frog.jpeg").convert("RGBA"),
|
154 |
+
"add a magic hat on frog head.",
|
155 |
+
642087011,
|
156 |
+
"frog",
|
157 |
+
"frog",
|
158 |
+
True,
|
159 |
+
False,
|
160 |
+
"GPT4-o (Highly Recommended)"
|
161 |
+
],
|
162 |
+
[
|
163 |
+
Image.open("./assets/chinese_girl/chinese_girl.png").convert("RGBA"),
|
164 |
+
"replace the background to ancient China.",
|
165 |
+
648464818,
|
166 |
+
"chinese_girl",
|
167 |
+
"chinese_girl",
|
168 |
+
True,
|
169 |
+
False,
|
170 |
+
"GPT4-o (Highly Recommended)"
|
171 |
+
],
|
172 |
+
[
|
173 |
+
Image.open("./assets/angel_christmas/angel_christmas.png").convert("RGBA"),
|
174 |
+
"remove the deer.",
|
175 |
+
648464818,
|
176 |
+
"angel_christmas",
|
177 |
+
"angel_christmas",
|
178 |
+
False,
|
179 |
+
False,
|
180 |
+
"GPT4-o (Highly Recommended)"
|
181 |
+
],
|
182 |
+
[
|
183 |
+
Image.open("./assets/sunflower_girl/sunflower_girl.png").convert("RGBA"),
|
184 |
+
"add a wreath on head.",
|
185 |
+
648464818,
|
186 |
+
"sunflower_girl",
|
187 |
+
"sunflower_girl",
|
188 |
+
True,
|
189 |
+
False,
|
190 |
+
"GPT4-o (Highly Recommended)"
|
191 |
+
],
|
192 |
+
[
|
193 |
+
Image.open("./assets/girl_on_sun/girl_on_sun.png").convert("RGBA"),
|
194 |
+
"add a butterfly fairy.",
|
195 |
+
648464818,
|
196 |
+
"girl_on_sun",
|
197 |
+
"girl_on_sun",
|
198 |
+
True,
|
199 |
+
False,
|
200 |
+
"GPT4-o (Highly Recommended)"
|
201 |
+
],
|
202 |
+
[
|
203 |
+
Image.open("./assets/spider_man_rm/spider_man.png").convert("RGBA"),
|
204 |
+
"remove the christmas hat.",
|
205 |
+
642087011,
|
206 |
+
"spider_man_rm",
|
207 |
+
"spider_man_rm",
|
208 |
+
False,
|
209 |
+
False,
|
210 |
+
"GPT4-o (Highly Recommended)"
|
211 |
+
],
|
212 |
+
[
|
213 |
+
Image.open("./assets/anime_flower/anime_flower.png").convert("RGBA"),
|
214 |
+
"remove the flower.",
|
215 |
+
642087011,
|
216 |
+
"anime_flower",
|
217 |
+
"anime_flower",
|
218 |
+
False,
|
219 |
+
False,
|
220 |
+
"GPT4-o (Highly Recommended)"
|
221 |
+
],
|
222 |
+
[
|
223 |
+
Image.open("./assets/chenduling/chengduling.jpg").convert("RGBA"),
|
224 |
+
"replace the clothes to a delicated floral skirt.",
|
225 |
+
648464818,
|
226 |
+
"chenduling",
|
227 |
+
"chenduling",
|
228 |
+
True,
|
229 |
+
False,
|
230 |
+
"GPT4-o (Highly Recommended)"
|
231 |
+
],
|
232 |
+
[
|
233 |
+
Image.open("./assets/hedgehog_rp_bg/hedgehog.png").convert("RGBA"),
|
234 |
+
"make the hedgehog in Italy.",
|
235 |
+
648464818,
|
236 |
+
"hedgehog_rp_bg",
|
237 |
+
"hedgehog_rp_bg",
|
238 |
+
True,
|
239 |
+
False,
|
240 |
+
"GPT4-o (Highly Recommended)"
|
241 |
+
],
|
242 |
+
|
243 |
+
]
|
244 |
+
|
245 |
+
INPUT_IMAGE_PATH = {
|
246 |
+
"frog": "./assets/frog/frog.jpeg",
|
247 |
+
"chinese_girl": "./assets/chinese_girl/chinese_girl.png",
|
248 |
+
"angel_christmas": "./assets/angel_christmas/angel_christmas.png",
|
249 |
+
"sunflower_girl": "./assets/sunflower_girl/sunflower_girl.png",
|
250 |
+
"girl_on_sun": "./assets/girl_on_sun/girl_on_sun.png",
|
251 |
+
"spider_man_rm": "./assets/spider_man_rm/spider_man.png",
|
252 |
+
"anime_flower": "./assets/anime_flower/anime_flower.png",
|
253 |
+
"chenduling": "./assets/chenduling/chengduling.jpg",
|
254 |
+
"hedgehog_rp_bg": "./assets/hedgehog_rp_bg/hedgehog.png",
|
255 |
+
}
|
256 |
+
MASK_IMAGE_PATH = {
|
257 |
+
"frog": "./assets/frog/mask_f7b350de-6f2c-49e3-b535-995c486d78e7.png",
|
258 |
+
"chinese_girl": "./assets/chinese_girl/mask_54759648-0989-48e0-bc82-f20e28b5ec29.png",
|
259 |
+
"angel_christmas": "./assets/angel_christmas/mask_f15d9b45-c978-4e3d-9f5f-251e308560c3.png",
|
260 |
+
"sunflower_girl": "./assets/sunflower_girl/mask_99cc50b4-7dc4-4de5-8748-ec10772f0317.png",
|
261 |
+
"girl_on_sun": "./assets/girl_on_sun/mask_264eac8b-8b65-479c-9755-020a60880c37.png",
|
262 |
+
"spider_man_rm": "./assets/spider_man_rm/mask_a5d410e6-8e8d-432f-8144-defbc3e1eae9.png",
|
263 |
+
"anime_flower": "./assets/anime_flower/mask_37553172-9b38-4727-bf2e-37d7e2b93461.png",
|
264 |
+
"chenduling": "./assets/chenduling/mask_68e3ff6f-da07-4b37-91df-13d6eed7b997.png",
|
265 |
+
"hedgehog_rp_bg": "./assets/hedgehog_rp_bg/mask_db7f8bf8-8349-46d3-b14e-43d67fbe25d3.png",
|
266 |
+
}
|
267 |
+
MASKED_IMAGE_PATH = {
|
268 |
+
"frog": "./assets/frog/masked_image_f7b350de-6f2c-49e3-b535-995c486d78e7.png",
|
269 |
+
"chinese_girl": "./assets/chinese_girl/masked_image_54759648-0989-48e0-bc82-f20e28b5ec29.png",
|
270 |
+
"angel_christmas": "./assets/angel_christmas/masked_image_f15d9b45-c978-4e3d-9f5f-251e308560c3.png",
|
271 |
+
"sunflower_girl": "./assets/sunflower_girl/masked_image_99cc50b4-7dc4-4de5-8748-ec10772f0317.png",
|
272 |
+
"girl_on_sun": "./assets/girl_on_sun/masked_image_264eac8b-8b65-479c-9755-020a60880c37.png",
|
273 |
+
"spider_man_rm": "./assets/spider_man_rm/masked_image_a5d410e6-8e8d-432f-8144-defbc3e1eae9.png",
|
274 |
+
"anime_flower": "./assets/anime_flower/masked_image_37553172-9b38-4727-bf2e-37d7e2b93461.png",
|
275 |
+
"chenduling": "./assets/chenduling/masked_image_68e3ff6f-da07-4b37-91df-13d6eed7b997.png",
|
276 |
+
"hedgehog_rp_bg": "./assets/hedgehog_rp_bg/masked_image_db7f8bf8-8349-46d3-b14e-43d67fbe25d3.png",
|
277 |
+
}
|
278 |
+
OUTPUT_IMAGE_PATH = {
|
279 |
+
"frog": "./assets/frog/image_edit_f7b350de-6f2c-49e3-b535-995c486d78e7_1.png",
|
280 |
+
"chinese_girl": "./assets/chinese_girl/image_edit_54759648-0989-48e0-bc82-f20e28b5ec29_1.png",
|
281 |
+
"angel_christmas": "./assets/angel_christmas/image_edit_f15d9b45-c978-4e3d-9f5f-251e308560c3_0.png",
|
282 |
+
"sunflower_girl": "./assets/sunflower_girl/image_edit_99cc50b4-7dc4-4de5-8748-ec10772f0317_3.png",
|
283 |
+
"girl_on_sun": "./assets/girl_on_sun/image_edit_264eac8b-8b65-479c-9755-020a60880c37_0.png",
|
284 |
+
"spider_man_rm": "./assets/spider_man_rm/image_edit_a5d410e6-8e8d-432f-8144-defbc3e1eae9_0.png",
|
285 |
+
"anime_flower": "./assets/anime_flower/image_edit_37553172-9b38-4727-bf2e-37d7e2b93461_2.png",
|
286 |
+
"chenduling": "./assets/chenduling/image_edit_68e3ff6f-da07-4b37-91df-13d6eed7b997_0.png",
|
287 |
+
"hedgehog_rp_bg": "./assets/hedgehog_rp_bg/image_edit_db7f8bf8-8349-46d3-b14e-43d67fbe25d3_3.png",
|
288 |
+
}
|
289 |
+
|
290 |
+
# os.environ['GRADIO_TEMP_DIR'] = 'gradio_temp_dir'
|
291 |
+
# os.makedirs('gradio_temp_dir', exist_ok=True)
|
292 |
+
|
293 |
+
VLM_MODEL_NAMES = list(vlms_template.keys())
|
294 |
+
DEFAULT_VLM_MODEL_NAME = "Qwen2-VL-7B-Instruct (Default)"
|
295 |
+
BASE_MODELS = list(base_models_template.keys())
|
296 |
+
DEFAULT_BASE_MODEL = "realisticVision (Default)"
|
297 |
+
|
298 |
+
ASPECT_RATIO_LABELS = list(aspect_ratios)
|
299 |
+
DEFAULT_ASPECT_RATIO = ASPECT_RATIO_LABELS[0]
|
300 |
+
|
301 |
+
|
302 |
+
## init device
|
303 |
+
try:
|
304 |
+
if torch.cuda.is_available():
|
305 |
+
device = "cuda"
|
306 |
+
elif sys.platform == "darwin" and torch.backends.mps.is_available():
|
307 |
+
device = "mps"
|
308 |
+
else:
|
309 |
+
device = "cpu"
|
310 |
+
except:
|
311 |
+
device = "cpu"
|
312 |
+
|
313 |
+
# ## init torch dtype
|
314 |
+
# if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
|
315 |
+
# torch_dtype = torch.bfloat16
|
316 |
+
# else:
|
317 |
+
# torch_dtype = torch.float16
|
318 |
+
|
319 |
+
# if device == "mps":
|
320 |
+
# torch_dtype = torch.float16
|
321 |
+
|
322 |
+
torch_dtype = torch.float16
|
323 |
+
|
324 |
+
|
325 |
+
|
326 |
+
# download hf models
|
327 |
+
BrushEdit_path = "models/"
|
328 |
+
if not os.path.exists(BrushEdit_path):
|
329 |
+
BrushEdit_path = snapshot_download(
|
330 |
+
repo_id="TencentARC/BrushEdit",
|
331 |
+
local_dir=BrushEdit_path,
|
332 |
+
token=os.getenv("HF_TOKEN"),
|
333 |
+
)
|
334 |
+
|
335 |
+
## init default VLM
|
336 |
+
vlm_type, vlm_local_path, vlm_processor, vlm_model = vlms_template[DEFAULT_VLM_MODEL_NAME]
|
337 |
+
if vlm_processor != "" and vlm_model != "":
|
338 |
+
vlm_model.to(device)
|
339 |
+
else:
|
340 |
+
gr.Error("Please Download default VLM model "+ DEFAULT_VLM_MODEL_NAME +" first.")
|
341 |
+
|
342 |
+
|
343 |
+
## init base model
|
344 |
+
base_model_path = os.path.join(BrushEdit_path, "base_model/realisticVisionV60B1_v51VAE")
|
345 |
+
brushnet_path = os.path.join(BrushEdit_path, "brushnetX")
|
346 |
+
sam_path = os.path.join(BrushEdit_path, "sam/sam_vit_h_4b8939.pth")
|
347 |
+
groundingdino_path = os.path.join(BrushEdit_path, "grounding_dino/groundingdino_swint_ogc.pth")
|
348 |
+
|
349 |
+
|
350 |
+
# input brushnetX ckpt path
|
351 |
+
brushnet = BrushNetModel.from_pretrained(brushnet_path, torch_dtype=torch_dtype)
|
352 |
+
pipe = StableDiffusionBrushNetPipeline.from_pretrained(
|
353 |
+
base_model_path, brushnet=brushnet, torch_dtype=torch_dtype, low_cpu_mem_usage=False
|
354 |
+
)
|
355 |
+
# speed up diffusion process with faster scheduler and memory optimization
|
356 |
+
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
|
357 |
+
# remove following line if xformers is not installed or when using Torch 2.0.
|
358 |
+
# pipe.enable_xformers_memory_efficient_attention()
|
359 |
+
pipe.enable_model_cpu_offload()
|
360 |
+
|
361 |
+
|
362 |
+
## init SAM
|
363 |
+
sam = build_sam(checkpoint=sam_path)
|
364 |
+
sam.to(device=device)
|
365 |
+
sam_predictor = SamPredictor(sam)
|
366 |
+
sam_automask_generator = SamAutomaticMaskGenerator(sam)
|
367 |
+
|
368 |
+
## init groundingdino_model
|
369 |
+
config_file = 'app/utils/GroundingDINO_SwinT_OGC.py'
|
370 |
+
groundingdino_model = load_grounding_dino_model(config_file, groundingdino_path, device=device)
|
371 |
+
|
372 |
+
## Ordinary function
|
373 |
+
def crop_and_resize(image: Image.Image,
|
374 |
+
target_width: int,
|
375 |
+
target_height: int) -> Image.Image:
|
376 |
+
"""
|
377 |
+
Crops and resizes an image while preserving the aspect ratio.
|
378 |
+
|
379 |
+
Args:
|
380 |
+
image (Image.Image): Input PIL image to be cropped and resized.
|
381 |
+
target_width (int): Target width of the output image.
|
382 |
+
target_height (int): Target height of the output image.
|
383 |
+
|
384 |
+
Returns:
|
385 |
+
Image.Image: Cropped and resized image.
|
386 |
+
"""
|
387 |
+
# Original dimensions
|
388 |
+
original_width, original_height = image.size
|
389 |
+
original_aspect = original_width / original_height
|
390 |
+
target_aspect = target_width / target_height
|
391 |
+
|
392 |
+
# Calculate crop box to maintain aspect ratio
|
393 |
+
if original_aspect > target_aspect:
|
394 |
+
# Crop horizontally
|
395 |
+
new_width = int(original_height * target_aspect)
|
396 |
+
new_height = original_height
|
397 |
+
left = (original_width - new_width) / 2
|
398 |
+
top = 0
|
399 |
+
right = left + new_width
|
400 |
+
bottom = original_height
|
401 |
+
else:
|
402 |
+
# Crop vertically
|
403 |
+
new_width = original_width
|
404 |
+
new_height = int(original_width / target_aspect)
|
405 |
+
left = 0
|
406 |
+
top = (original_height - new_height) / 2
|
407 |
+
right = original_width
|
408 |
+
bottom = top + new_height
|
409 |
+
|
410 |
+
# Crop and resize
|
411 |
+
cropped_image = image.crop((left, top, right, bottom))
|
412 |
+
resized_image = cropped_image.resize((target_width, target_height), Image.NEAREST)
|
413 |
+
return resized_image
|
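A small usage sketch (editor's addition, not part of the file): center-cropping an upload to one of the aspect-ratio presets with the helper above. The preset name and image path are assumptions.

example_image = Image.open("./assets/frog/frog.jpeg").convert("RGB")
target_w, target_h = aspect_ratios["Widescreen TV / Landscape (16:9)"]
preview = crop_and_resize(example_image, target_width=target_w, target_height=target_h)
assert preview.size == (target_w, target_h)  # (1024, 576)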
414 |
+
|
415 |
+
|
416 |
+
## Ordinary function
|
417 |
+
def resize(image: Image.Image,
|
418 |
+
target_width: int,
|
419 |
+
target_height: int) -> Image.Image:
|
420 |
+
"""
|
421 |
+
Crops and resizes an image while preserving the aspect ratio.
|
422 |
+
|
423 |
+
Args:
|
424 |
+
image (Image.Image): Input PIL image to be cropped and resized.
|
425 |
+
target_width (int): Target width of the output image.
|
426 |
+
target_height (int): Target height of the output image.
|
427 |
+
|
428 |
+
Returns:
|
429 |
+
Image.Image: Cropped and resized image.
|
430 |
+
"""
|
431 |
+
# Original dimensions
|
432 |
+
resized_image = image.resize((target_width, target_height), Image.NEAREST)
|
433 |
+
return resized_image
|
434 |
+
|
435 |
+
|
436 |
+
def move_mask_func(mask, direction, units):
|
437 |
+
binary_mask = mask.squeeze()>0
|
438 |
+
rows, cols = binary_mask.shape
|
439 |
+
moved_mask = np.zeros_like(binary_mask, dtype=bool)
|
440 |
+
|
441 |
+
if direction == 'down':
|
442 |
+
# move down
|
443 |
+
moved_mask[max(0, units):, :] = binary_mask[:rows - units, :]
|
444 |
+
|
445 |
+
elif direction == 'up':
|
446 |
+
# move up
|
447 |
+
moved_mask[:rows - units, :] = binary_mask[units:, :]
|
448 |
+
|
449 |
+
elif direction == 'right':
|
450 |
+
# move left
|
451 |
+
moved_mask[:, max(0, units):] = binary_mask[:, :cols - units]
|
452 |
+
|
453 |
+
elif direction == 'left':
|
454 |
+
# move right
|
455 |
+
moved_mask[:, :cols - units] = binary_mask[:, units:]
|
456 |
+
|
457 |
+
return moved_mask
|
458 |
+
|
459 |
+
|
460 |
+
def random_mask_func(mask, dilation_type='square', dilation_size=20):
|
461 |
+
# Randomly select the size of dilation
|
462 |
+
binary_mask = mask.squeeze()>0
|
463 |
+
|
464 |
+
if dilation_type == 'square_dilation':
|
465 |
+
structure = np.ones((dilation_size, dilation_size), dtype=bool)
|
466 |
+
dilated_mask = binary_dilation(binary_mask, structure=structure)
|
467 |
+
elif dilation_type == 'square_erosion':
|
468 |
+
structure = np.ones((dilation_size, dilation_size), dtype=bool)
|
469 |
+
dilated_mask = binary_erosion(binary_mask, structure=structure)
|
470 |
+
elif dilation_type == 'bounding_box':
|
471 |
+
# find the most left top and left bottom point
|
472 |
+
rows, cols = np.where(binary_mask)
|
473 |
+
if len(rows) == 0 or len(cols) == 0:
|
474 |
+
return mask # return original mask if no valid points
|
475 |
+
|
476 |
+
min_row = np.min(rows)
|
477 |
+
max_row = np.max(rows)
|
478 |
+
min_col = np.min(cols)
|
479 |
+
max_col = np.max(cols)
|
480 |
+
|
481 |
+
# create a bounding box
|
482 |
+
dilated_mask = np.zeros_like(binary_mask, dtype=bool)
|
483 |
+
dilated_mask[min_row:max_row + 1, min_col:max_col + 1] = True
|
484 |
+
|
485 |
+
elif dilation_type == 'bounding_ellipse':
|
486 |
+
# find the most left top and left bottom point
|
487 |
+
rows, cols = np.where(binary_mask)
|
488 |
+
if len(rows) == 0 or len(cols) == 0:
|
489 |
+
return mask # return original mask if no valid points
|
490 |
+
|
491 |
+
min_row = np.min(rows)
|
492 |
+
max_row = np.max(rows)
|
493 |
+
min_col = np.min(cols)
|
494 |
+
max_col = np.max(cols)
|
495 |
+
|
496 |
+
# calculate the center and axis length of the ellipse
|
497 |
+
center = ((min_col + max_col) // 2, (min_row + max_row) // 2)
|
498 |
+
a = (max_col - min_col) // 2 # half long axis
|
499 |
+
b = (max_row - min_row) // 2 # half short axis
|
500 |
+
|
501 |
+
# create a bounding ellipse
|
502 |
+
y, x = np.ogrid[:mask.shape[0], :mask.shape[1]]
|
503 |
+
ellipse_mask = ((x - center[0])**2 / a**2 + (y - center[1])**2 / b**2) <= 1
|
504 |
+
dilated_mask = np.zeros_like(binary_mask, dtype=bool)
|
505 |
+
dilated_mask[ellipse_mask] = True
|
506 |
+
else:
|
507 |
+
raise ValueError("dilation_type must be 'square' or 'ellipse'")
|
508 |
+
|
509 |
+
# use binary dilation
|
510 |
+
dilated_mask = np.uint8(dilated_mask[:,:,np.newaxis]) * 255
|
511 |
+
return dilated_mask
|
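A small usage sketch (editor's addition, not part of the file): the mask helpers above expect uint8 HxWx1 masks; `random_mask_func` returns a uint8 HxWx1 mask, while `move_mask_func` returns an HxW boolean array. The sizes below are toy assumptions.

toy_mask = np.zeros((64, 64, 1), dtype=np.uint8)
toy_mask[20:40, 20:40] = 255

dilated = random_mask_func(toy_mask, dilation_type='square_dilation', dilation_size=5)
moved = move_mask_func(toy_mask, direction='down', units=10)
print(dilated.shape, dilated.dtype)  # (64, 64, 1) uint8
print(moved.shape, moved.dtype)      # (64, 64) bool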
512 |
+
|
513 |
+
|
514 |
+
## Gradio component function
|
515 |
+
def update_vlm_model(vlm_name):
|
516 |
+
global vlm_model, vlm_processor
|
517 |
+
if vlm_model is not None:
|
518 |
+
del vlm_model
|
519 |
+
torch.cuda.empty_cache()
|
520 |
+
|
521 |
+
vlm_type, vlm_local_path, vlm_processor, vlm_model = vlms_template[vlm_name]
|
522 |
+
|
523 |
+
## we recommend using preload models, otherwise it will take a long time to download the model. you can edit the code via vlm_template.py
|
524 |
+
if vlm_type == "llava-next":
|
525 |
+
if vlm_processor != "" and vlm_model != "":
|
526 |
+
vlm_model.to(device)
|
527 |
+
return vlm_model_dropdown
|
528 |
+
else:
|
529 |
+
if os.path.exists(vlm_local_path):
|
530 |
+
vlm_processor = LlavaNextProcessor.from_pretrained(vlm_local_path)
|
531 |
+
vlm_model = LlavaNextForConditionalGeneration.from_pretrained(vlm_local_path, torch_dtype="auto", device_map="auto")
|
532 |
+
else:
|
533 |
+
if vlm_name == "llava-v1.6-mistral-7b-hf (Preload)":
|
534 |
+
vlm_processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
|
535 |
+
vlm_model = LlavaNextForConditionalGeneration.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf", torch_dtype="auto", device_map="auto")
|
536 |
+
elif vlm_name == "llama3-llava-next-8b-hf (Preload)":
|
537 |
+
vlm_processor = LlavaNextProcessor.from_pretrained("llava-hf/llama3-llava-next-8b-hf")
|
538 |
+
vlm_model = LlavaNextForConditionalGeneration.from_pretrained("llava-hf/llama3-llava-next-8b-hf", torch_dtype="auto", device_map="auto")
|
539 |
+
elif vlm_name == "llava-v1.6-vicuna-13b-hf (Preload)":
|
540 |
+
vlm_processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-vicuna-13b-hf")
|
541 |
+
vlm_model = LlavaNextForConditionalGeneration.from_pretrained("llava-hf/llava-v1.6-vicuna-13b-hf", torch_dtype="auto", device_map="auto")
|
542 |
+
elif vlm_name == "llava-v1.6-34b-hf (Preload)":
|
543 |
+
vlm_processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-34b-hf")
|
544 |
+
vlm_model = LlavaNextForConditionalGeneration.from_pretrained("llava-hf/llava-v1.6-34b-hf", torch_dtype="auto", device_map="auto")
|
545 |
+
elif vlm_name == "llava-next-72b-hf (Preload)":
|
546 |
+
vlm_processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-next-72b-hf")
|
547 |
+
vlm_model = LlavaNextForConditionalGeneration.from_pretrained("llava-hf/llava-next-72b-hf", torch_dtype="auto", device_map="auto")
|
548 |
+
elif vlm_type == "qwen2-vl":
|
549 |
+
if vlm_processor != "" and vlm_model != "":
|
550 |
+
vlm_model.to(device)
|
551 |
+
return vlm_model_dropdown
|
552 |
+
else:
|
553 |
+
if os.path.exists(vlm_local_path):
|
554 |
+
vlm_processor = Qwen2VLProcessor.from_pretrained(vlm_local_path)
|
555 |
+
vlm_model = Qwen2VLForConditionalGeneration.from_pretrained(vlm_local_path, torch_dtype="auto", device_map="auto")
|
556 |
+
else:
|
557 |
+
if vlm_name == "qwen2-vl-2b-instruct (Preload)":
|
558 |
+
vlm_processor = Qwen2VLProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")
|
559 |
+
vlm_model = Qwen2VLForConditionalGeneration.from_pretrained("Qwen/Qwen2-VL-2B-Instruct", torch_dtype="auto", device_map="auto")
|
560 |
+
elif vlm_name == "qwen2-vl-7b-instruct (Preload)":
|
561 |
+
vlm_processor = Qwen2VLProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
|
562 |
+
vlm_model = Qwen2VLForConditionalGeneration.from_pretrained("Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto")
|
563 |
+
elif vlm_name == "qwen2-vl-72b-instruct (Preload)":
|
564 |
+
vlm_processor = Qwen2VLProcessor.from_pretrained("Qwen/Qwen2-VL-72B-Instruct")
|
565 |
+
vlm_model = Qwen2VLForConditionalGeneration.from_pretrained("Qwen/Qwen2-VL-72B-Instruct", torch_dtype="auto", device_map="auto")
|
566 |
+
elif vlm_type == "openai":
|
567 |
+
pass
|
568 |
+
return "success"
|
569 |
+
|
570 |
+
|
571 |
+
def update_base_model(base_model_name):
|
572 |
+
global pipe
|
573 |
+
## we recommend using preload models, otherwise it will take a long time to download the model. you can edit the code via base_model_template.py
|
574 |
+
if pipe is not None:
|
575 |
+
del pipe
|
576 |
+
torch.cuda.empty_cache()
|
577 |
+
base_model_path, pipe = base_models_template[base_model_name]
|
578 |
+
if pipe != "":
|
579 |
+
pipe.to(device)
|
580 |
+
else:
|
581 |
+
if os.path.exists(base_model_path):
|
582 |
+
pipe = StableDiffusionBrushNetPipeline.from_pretrained(
|
583 |
+
base_model_path, brushnet=brushnet, torch_dtype=torch_dtype, low_cpu_mem_usage=False
|
584 |
+
)
|
585 |
+
# pipe.enable_xformers_memory_efficient_attention()
|
586 |
+
pipe.enable_model_cpu_offload()
|
587 |
+
else:
|
588 |
+
raise gr.Error(f"The base model {base_model_name} does not exist")
|
589 |
+
return "success"
|
590 |
+
|
591 |
+
|
592 |
+
def submit_GPT4o_KEY(GPT4o_KEY):
|
593 |
+
global vlm_model, vlm_processor
|
594 |
+
if vlm_model is not None:
|
595 |
+
del vlm_model
|
596 |
+
torch.cuda.empty_cache()
|
597 |
+
try:
|
598 |
+
vlm_model = OpenAI(api_key=GPT4o_KEY)
|
599 |
+
vlm_processor = ""
|
600 |
+
response = vlm_model.chat.completions.create(
|
601 |
+
model="gpt-4o-2024-08-06",
|
602 |
+
messages=[
|
603 |
+
{"role": "system", "content": "You are a helpful assistant."},
|
604 |
+
{"role": "user", "content": "Say this is a test"}
|
605 |
+
]
|
606 |
+
)
|
607 |
+
response_str = response.choices[0].message.content
|
608 |
+
|
609 |
+
return "Success, " + response_str, "GPT4-o (Highly Recommended)"
|
610 |
+
except Exception as e:
|
611 |
+
return "Invalid GPT4o API Key", "GPT4-o (Highly Recommended)"
|
612 |
+
|
613 |
+
|
614 |
+
|
615 |
+
@spaces.GPU(duration=180)
|
616 |
+
def process(input_image,
|
617 |
+
original_image,
|
618 |
+
original_mask,
|
619 |
+
prompt,
|
620 |
+
negative_prompt,
|
621 |
+
control_strength,
|
622 |
+
seed,
|
623 |
+
randomize_seed,
|
624 |
+
guidance_scale,
|
625 |
+
num_inference_steps,
|
626 |
+
num_samples,
|
627 |
+
blending,
|
628 |
+
category,
|
629 |
+
target_prompt,
|
630 |
+
resize_default,
|
631 |
+
aspect_ratio_name,
|
632 |
+
invert_mask_state):
|
633 |
+
if original_image is None:
|
634 |
+
if input_image is None:
|
635 |
+
raise gr.Error('Please upload the input image')
|
636 |
+
else:
|
637 |
+
image_pil = input_image["background"].convert("RGB")
|
638 |
+
original_image = np.array(image_pil)
|
639 |
+
if prompt is None or prompt == "":
|
640 |
+
raise gr.Error("Please input your instructions, e.g., remove the xxx")
|
641 |
+
|
642 |
+
alpha_mask = input_image["layers"][0].split()[3]
|
643 |
+
input_mask = np.asarray(alpha_mask)
|
644 |
+
output_w, output_h = aspect_ratios[aspect_ratio_name]
|
645 |
+
if output_w == "" or output_h == "":
|
646 |
+
output_h, output_w = original_image.shape[:2]
|
647 |
+
|
648 |
+
if resize_default:
|
649 |
+
short_side = min(output_w, output_h)
|
650 |
+
scale_ratio = 640 / short_side
|
651 |
+
output_w = int(output_w * scale_ratio)
|
652 |
+
output_h = int(output_h * scale_ratio)
|
653 |
+
original_image = resize(Image.fromarray(original_image), target_width=int(output_w), target_height=int(output_h))
|
654 |
+
original_image = np.array(original_image)
|
655 |
+
if input_mask is not None:
|
656 |
+
input_mask = resize(Image.fromarray(np.squeeze(input_mask)), target_width=int(output_w), target_height=int(output_h))
|
657 |
+
input_mask = np.array(input_mask)
|
658 |
+
if original_mask is not None:
|
659 |
+
original_mask = resize(Image.fromarray(np.squeeze(original_mask)), target_width=int(output_w), target_height=int(output_h))
|
660 |
+
original_mask = np.array(original_mask)
|
661 |
+
gr.Info(f"Output aspect ratio: {output_w}:{output_h}")
|
662 |
+
else:
|
663 |
+
gr.Info(f"Output aspect ratio: {output_w}:{output_h}")
|
664 |
+
pass
|
665 |
+
else:
|
666 |
+
if resize_default:
|
667 |
+
short_side = min(output_w, output_h)
|
668 |
+
scale_ratio = 640 / short_side
|
669 |
+
output_w = int(output_w * scale_ratio)
|
670 |
+
output_h = int(output_h * scale_ratio)
|
671 |
+
gr.Info(f"Output aspect ratio: {output_w}:{output_h}")
|
672 |
+
original_image = resize(Image.fromarray(original_image), target_width=int(output_w), target_height=int(output_h))
|
673 |
+
original_image = np.array(original_image)
|
674 |
+
if input_mask is not None:
|
675 |
+
input_mask = resize(Image.fromarray(np.squeeze(input_mask)), target_width=int(output_w), target_height=int(output_h))
|
676 |
+
input_mask = np.array(input_mask)
|
677 |
+
if original_mask is not None:
|
678 |
+
original_mask = resize(Image.fromarray(np.squeeze(original_mask)), target_width=int(output_w), target_height=int(output_h))
|
679 |
+
original_mask = np.array(original_mask)
|
680 |
+
|
681 |
+
if invert_mask_state:
|
682 |
+
original_mask = original_mask
|
683 |
+
else:
|
684 |
+
if input_mask.max() == 0:
|
685 |
+
original_mask = original_mask
|
686 |
+
else:
|
687 |
+
original_mask = input_mask
|
688 |
+
|
689 |
+
|
690 |
+
|
691 |
+
if category is not None:
|
692 |
+
pass
|
693 |
+
else:
|
694 |
+
category = vlm_response_editing_type(vlm_processor, vlm_model, original_image, prompt, device)
|
695 |
+
|
696 |
+
|
697 |
+
if original_mask is not None:
|
698 |
+
original_mask = np.clip(original_mask, 0, 255).astype(np.uint8)
|
699 |
+
else:
|
700 |
+
object_wait_for_edit = vlm_response_object_wait_for_edit(
|
701 |
+
vlm_processor,
|
702 |
+
vlm_model,
|
703 |
+
original_image,
|
704 |
+
category,
|
705 |
+
prompt,
|
706 |
+
device)
|
707 |
+
|
708 |
+
original_mask = vlm_response_mask(vlm_processor,
|
709 |
+
vlm_model,
|
710 |
+
category,
|
711 |
+
original_image,
|
712 |
+
prompt,
|
713 |
+
object_wait_for_edit,
|
714 |
+
sam,
|
715 |
+
sam_predictor,
|
716 |
+
sam_automask_generator,
|
717 |
+
groundingdino_model,
|
718 |
+
device)
|
719 |
+
if original_mask.ndim == 2:
|
720 |
+
original_mask = original_mask[:,:,None]
|
721 |
+
|
722 |
+
|
723 |
+
if len(target_prompt) <= 1:
|
724 |
+
prompt_after_apply_instruction = vlm_response_prompt_after_apply_instruction(
|
725 |
+
vlm_processor,
|
726 |
+
vlm_model,
|
727 |
+
original_image,
|
728 |
+
prompt,
|
729 |
+
device)
|
730 |
+
else:
|
731 |
+
prompt_after_apply_instruction = target_prompt
|
732 |
+
|
733 |
+
generator = torch.Generator(device).manual_seed(random.randint(0, 2147483647) if randomize_seed else seed)
|
734 |
+
|
735 |
+
|
736 |
+
with torch.autocast(device):
|
737 |
+
image, mask_image, mask_np, init_image_np = BrushEdit_Pipeline(pipe,
|
738 |
+
prompt_after_apply_instruction,
|
739 |
+
original_mask,
|
740 |
+
original_image,
|
741 |
+
generator,
|
742 |
+
num_inference_steps,
|
743 |
+
guidance_scale,
|
744 |
+
control_strength,
|
745 |
+
negative_prompt,
|
746 |
+
num_samples,
|
747 |
+
blending)
|
748 |
+
original_image = np.array(init_image_np)
|
749 |
+
masked_image = original_image * (1 - (mask_np>0))
|
750 |
+
masked_image = masked_image.astype(np.uint8)
|
751 |
+
masked_image = Image.fromarray(masked_image)
|
752 |
+
# Save the images (optional)
|
753 |
+
# import uuid
|
754 |
+
# uuid = str(uuid.uuid4())
|
755 |
+
# image[0].save(f"outputs/image_edit_{uuid}_0.png")
|
756 |
+
# image[1].save(f"outputs/image_edit_{uuid}_1.png")
|
757 |
+
# image[2].save(f"outputs/image_edit_{uuid}_2.png")
|
758 |
+
# image[3].save(f"outputs/image_edit_{uuid}_3.png")
|
759 |
+
# mask_image.save(f"outputs/mask_{uuid}.png")
|
760 |
+
# masked_image.save(f"outputs/masked_image_{uuid}.png")
|
761 |
+
return image, [mask_image], [masked_image], prompt, '', prompt_after_apply_instruction, False
|
762 |
+
|
763 |
+
|
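The aspect-ratio branch above (and in every mask handler below) repeats the same logic: fall back to the uploaded image's own size when the preset is empty, then scale so the short side becomes 640 px. A minimal standalone sketch of that logic; the helper name and the plain PIL resize are assumptions standing in for the app's own `resize` utility:

import numpy as np
from PIL import Image

def resize_short_side_to_640(original_image: np.ndarray, output_w, output_h, resize_default: bool):
    # Fall back to the image's own size when the aspect-ratio preset is empty ("Custom resolution").
    if output_w == "" or output_h == "":
        output_h, output_w = original_image.shape[:2]
    # Optionally scale so the short side becomes 640 px, preserving the aspect ratio.
    if resize_default:
        short_side = min(output_w, output_h)
        scale_ratio = 640 / short_side
        output_w = int(output_w * scale_ratio)
        output_h = int(output_h * scale_ratio)
    resized = Image.fromarray(original_image).resize((int(output_w), int(output_h)))
    return np.array(resized), output_w, output_h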
def generate_target_prompt(input_image,
                           original_image,
                           prompt):
    # load example image
    if isinstance(original_image, str):
        original_image = input_image

    prompt_after_apply_instruction = vlm_response_prompt_after_apply_instruction(
                                        vlm_processor,
                                        vlm_model,
                                        original_image,
                                        prompt,
                                        device)
    return prompt_after_apply_instruction, prompt_after_apply_instruction


780 |
+
def process_mask(input_image,
|
781 |
+
original_image,
|
782 |
+
prompt,
|
783 |
+
resize_default,
|
784 |
+
aspect_ratio_name):
|
785 |
+
if original_image is None:
|
786 |
+
raise gr.Error('Please upload the input image')
|
787 |
+
if prompt is None:
|
788 |
+
raise gr.Error("Please input your instructions, e.g., remove the xxx")
|
789 |
+
|
790 |
+
## load mask
|
791 |
+
alpha_mask = input_image["layers"][0].split()[3]
|
792 |
+
input_mask = np.array(alpha_mask)
|
793 |
+
|
794 |
+
# load example image
|
795 |
+
if isinstance(original_image, str):
|
796 |
+
original_image = input_image["background"]
|
797 |
+
|
798 |
+
if input_mask.max() == 0:
|
799 |
+
category = vlm_response_editing_type(vlm_processor, vlm_model, original_image, prompt, device)
|
800 |
+
|
801 |
+
object_wait_for_edit = vlm_response_object_wait_for_edit(vlm_processor,
|
802 |
+
vlm_model,
|
803 |
+
original_image,
|
804 |
+
category,
|
805 |
+
prompt,
|
806 |
+
device)
|
807 |
+
# original mask: h,w,1 [0, 255]
|
808 |
+
original_mask = vlm_response_mask(
|
809 |
+
vlm_processor,
|
810 |
+
vlm_model,
|
811 |
+
category,
|
812 |
+
original_image,
|
813 |
+
prompt,
|
814 |
+
object_wait_for_edit,
|
815 |
+
sam,
|
816 |
+
sam_predictor,
|
817 |
+
sam_automask_generator,
|
818 |
+
groundingdino_model,
|
819 |
+
device)
|
820 |
+
else:
|
821 |
+
original_mask = input_mask
|
822 |
+
category = None
|
823 |
+
|
824 |
+
## resize mask if needed
|
825 |
+
output_w, output_h = aspect_ratios[aspect_ratio_name]
|
826 |
+
if output_w == "" or output_h == "":
|
827 |
+
output_h, output_w = original_image.shape[:2]
|
828 |
+
if resize_default:
|
829 |
+
short_side = min(output_w, output_h)
|
830 |
+
scale_ratio = 640 / short_side
|
831 |
+
output_w = int(output_w * scale_ratio)
|
832 |
+
output_h = int(output_h * scale_ratio)
|
833 |
+
original_image = resize(Image.fromarray(original_image), target_width=int(output_w), target_height=int(output_h))
|
834 |
+
original_image = np.array(original_image)
|
835 |
+
if input_mask is not None:
|
836 |
+
input_mask = resize(Image.fromarray(np.squeeze(input_mask)), target_width=int(output_w), target_height=int(output_h))
|
837 |
+
input_mask = np.array(input_mask)
|
838 |
+
if original_mask is not None:
|
839 |
+
original_mask = resize(Image.fromarray(np.squeeze(original_mask)), target_width=int(output_w), target_height=int(output_h))
|
840 |
+
original_mask = np.array(original_mask)
|
841 |
+
gr.Info(f"Output aspect ratio: {output_w}:{output_h}")
|
842 |
+
else:
|
843 |
+
gr.Info(f"Output aspect ratio: {output_w}:{output_h}")
|
844 |
+
pass
|
845 |
+
else:
|
846 |
+
if resize_default:
|
847 |
+
short_side = min(output_w, output_h)
|
848 |
+
scale_ratio = 640 / short_side
|
849 |
+
output_w = int(output_w * scale_ratio)
|
850 |
+
output_h = int(output_h * scale_ratio)
|
851 |
+
gr.Info(f"Output aspect ratio: {output_w}:{output_h}")
|
852 |
+
original_image = resize(Image.fromarray(original_image), target_width=int(output_w), target_height=int(output_h))
|
853 |
+
original_image = np.array(original_image)
|
854 |
+
if input_mask is not None:
|
855 |
+
input_mask = resize(Image.fromarray(np.squeeze(input_mask)), target_width=int(output_w), target_height=int(output_h))
|
856 |
+
input_mask = np.array(input_mask)
|
857 |
+
if original_mask is not None:
|
858 |
+
original_mask = resize(Image.fromarray(np.squeeze(original_mask)), target_width=int(output_w), target_height=int(output_h))
|
859 |
+
original_mask = np.array(original_mask)
|
860 |
+
|
861 |
+
|
862 |
+
|
863 |
+
if original_mask.ndim == 2:
|
864 |
+
original_mask = original_mask[:,:,None]
|
865 |
+
|
866 |
+
mask_image = Image.fromarray(original_mask.squeeze().astype(np.uint8)).convert("RGB")
|
867 |
+
|
868 |
+
masked_image = original_image * (1 - (original_mask>0))
|
869 |
+
masked_image = masked_image.astype(np.uint8)
|
870 |
+
masked_image = Image.fromarray(masked_image)
|
871 |
+
|
872 |
+
return [masked_image], [mask_image], original_mask.astype(np.uint8), category
|
873 |
+
|
874 |
+
|
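All of the mask handlers read the user-drawn mask out of the gr.ImageEditor value the same way: the first RGBA layer's alpha channel holds the brush strokes. A small self-contained sketch of that extraction and of the masked-image preview; the synthetic editor_value dict is an assumption that mimics the component's output format:

import numpy as np
from PIL import Image

# Stand-in for the gr.ImageEditor value used by the handlers above:
# a dict with a "background" image and a list of RGBA "layers" holding the brush strokes.
background = Image.new("RGB", (64, 64), "gray")
layer = Image.new("RGBA", (64, 64), (0, 0, 0, 0))
layer.paste((255, 255, 255, 255), (16, 16, 48, 48))          # pretend brush stroke
editor_value = {"background": background, "layers": [layer]}

alpha_mask = editor_value["layers"][0].split()[3]             # alpha channel = drawn mask
input_mask = np.array(alpha_mask)                             # h, w in [0, 255]

original_image = np.array(editor_value["background"])
masked_image = (original_image * (1 - (input_mask[:, :, None] > 0))).astype(np.uint8)
preview = Image.fromarray(masked_image)                       # masked preview, as shown in the gallery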
875 |
+
def process_random_mask(input_image,
|
876 |
+
original_image,
|
877 |
+
original_mask,
|
878 |
+
resize_default,
|
879 |
+
aspect_ratio_name,
|
880 |
+
):
|
881 |
+
|
882 |
+
alpha_mask = input_image["layers"][0].split()[3]
|
883 |
+
input_mask = np.asarray(alpha_mask)
|
884 |
+
|
885 |
+
output_w, output_h = aspect_ratios[aspect_ratio_name]
|
886 |
+
if output_w == "" or output_h == "":
|
887 |
+
output_h, output_w = original_image.shape[:2]
|
888 |
+
if resize_default:
|
889 |
+
short_side = min(output_w, output_h)
|
890 |
+
scale_ratio = 640 / short_side
|
891 |
+
output_w = int(output_w * scale_ratio)
|
892 |
+
output_h = int(output_h * scale_ratio)
|
893 |
+
original_image = resize(Image.fromarray(original_image), target_width=int(output_w), target_height=int(output_h))
|
894 |
+
original_image = np.array(original_image)
|
895 |
+
if input_mask is not None:
|
896 |
+
input_mask = resize(Image.fromarray(np.squeeze(input_mask)), target_width=int(output_w), target_height=int(output_h))
|
897 |
+
input_mask = np.array(input_mask)
|
898 |
+
if original_mask is not None:
|
899 |
+
original_mask = resize(Image.fromarray(np.squeeze(original_mask)), target_width=int(output_w), target_height=int(output_h))
|
900 |
+
original_mask = np.array(original_mask)
|
901 |
+
gr.Info(f"Output aspect ratio: {output_w}:{output_h}")
|
902 |
+
else:
|
903 |
+
gr.Info(f"Output aspect ratio: {output_w}:{output_h}")
|
904 |
+
pass
|
905 |
+
else:
|
906 |
+
if resize_default:
|
907 |
+
short_side = min(output_w, output_h)
|
908 |
+
scale_ratio = 640 / short_side
|
909 |
+
output_w = int(output_w * scale_ratio)
|
910 |
+
output_h = int(output_h * scale_ratio)
|
911 |
+
gr.Info(f"Output aspect ratio: {output_w}:{output_h}")
|
912 |
+
original_image = resize(Image.fromarray(original_image), target_width=int(output_w), target_height=int(output_h))
|
913 |
+
original_image = np.array(original_image)
|
914 |
+
if input_mask is not None:
|
915 |
+
input_mask = resize(Image.fromarray(np.squeeze(input_mask)), target_width=int(output_w), target_height=int(output_h))
|
916 |
+
input_mask = np.array(input_mask)
|
917 |
+
if original_mask is not None:
|
918 |
+
original_mask = resize(Image.fromarray(np.squeeze(original_mask)), target_width=int(output_w), target_height=int(output_h))
|
919 |
+
original_mask = np.array(original_mask)
|
920 |
+
|
921 |
+
|
922 |
+
if input_mask.max() == 0:
|
923 |
+
original_mask = original_mask
|
924 |
+
else:
|
925 |
+
original_mask = input_mask
|
926 |
+
|
927 |
+
if original_mask is None:
|
928 |
+
raise gr.Error('Please generate mask first')
|
929 |
+
|
930 |
+
if original_mask.ndim == 2:
|
931 |
+
original_mask = original_mask[:,:,None]
|
932 |
+
|
933 |
+
dilation_type = np.random.choice(['bounding_box', 'bounding_ellipse'])
|
934 |
+
random_mask = random_mask_func(original_mask, dilation_type).squeeze()
|
935 |
+
|
936 |
+
mask_image = Image.fromarray(random_mask.astype(np.uint8)).convert("RGB")
|
937 |
+
|
938 |
+
masked_image = original_image * (1 - (random_mask[:,:,None]>0))
|
939 |
+
masked_image = masked_image.astype(original_image.dtype)
|
940 |
+
masked_image = Image.fromarray(masked_image)
|
941 |
+
|
942 |
+
|
943 |
+
return [masked_image], [mask_image], random_mask[:,:,None].astype(np.uint8)
|
944 |
+
|
945 |
+
|
946 |
+
def process_dilation_mask(input_image,
|
947 |
+
original_image,
|
948 |
+
original_mask,
|
949 |
+
resize_default,
|
950 |
+
aspect_ratio_name,
|
951 |
+
dilation_size=20):
|
952 |
+
|
953 |
+
alpha_mask = input_image["layers"][0].split()[3]
|
954 |
+
input_mask = np.asarray(alpha_mask)
|
955 |
+
|
956 |
+
output_w, output_h = aspect_ratios[aspect_ratio_name]
|
957 |
+
if output_w == "" or output_h == "":
|
958 |
+
output_h, output_w = original_image.shape[:2]
|
959 |
+
if resize_default:
|
960 |
+
short_side = min(output_w, output_h)
|
961 |
+
scale_ratio = 640 / short_side
|
962 |
+
output_w = int(output_w * scale_ratio)
|
963 |
+
output_h = int(output_h * scale_ratio)
|
964 |
+
original_image = resize(Image.fromarray(original_image), target_width=int(output_w), target_height=int(output_h))
|
965 |
+
original_image = np.array(original_image)
|
966 |
+
if input_mask is not None:
|
967 |
+
input_mask = resize(Image.fromarray(np.squeeze(input_mask)), target_width=int(output_w), target_height=int(output_h))
|
968 |
+
input_mask = np.array(input_mask)
|
969 |
+
if original_mask is not None:
|
970 |
+
original_mask = resize(Image.fromarray(np.squeeze(original_mask)), target_width=int(output_w), target_height=int(output_h))
|
971 |
+
original_mask = np.array(original_mask)
|
972 |
+
gr.Info(f"Output aspect ratio: {output_w}:{output_h}")
|
973 |
+
else:
|
974 |
+
gr.Info(f"Output aspect ratio: {output_w}:{output_h}")
|
975 |
+
pass
|
976 |
+
else:
|
977 |
+
if resize_default:
|
978 |
+
short_side = min(output_w, output_h)
|
979 |
+
scale_ratio = 640 / short_side
|
980 |
+
output_w = int(output_w * scale_ratio)
|
981 |
+
output_h = int(output_h * scale_ratio)
|
982 |
+
gr.Info(f"Output aspect ratio: {output_w}:{output_h}")
|
983 |
+
original_image = resize(Image.fromarray(original_image), target_width=int(output_w), target_height=int(output_h))
|
984 |
+
original_image = np.array(original_image)
|
985 |
+
if input_mask is not None:
|
986 |
+
input_mask = resize(Image.fromarray(np.squeeze(input_mask)), target_width=int(output_w), target_height=int(output_h))
|
987 |
+
input_mask = np.array(input_mask)
|
988 |
+
if original_mask is not None:
|
989 |
+
original_mask = resize(Image.fromarray(np.squeeze(original_mask)), target_width=int(output_w), target_height=int(output_h))
|
990 |
+
original_mask = np.array(original_mask)
|
991 |
+
|
992 |
+
if input_mask.max() == 0:
|
993 |
+
original_mask = original_mask
|
994 |
+
else:
|
995 |
+
original_mask = input_mask
|
996 |
+
|
997 |
+
if original_mask is None:
|
998 |
+
raise gr.Error('Please generate mask first')
|
999 |
+
|
1000 |
+
if original_mask.ndim == 2:
|
1001 |
+
original_mask = original_mask[:,:,None]
|
1002 |
+
|
1003 |
+
dilation_type = np.random.choice(['square_dilation'])
|
1004 |
+
random_mask = random_mask_func(original_mask, dilation_type, dilation_size).squeeze()
|
1005 |
+
|
1006 |
+
mask_image = Image.fromarray(random_mask.astype(np.uint8)).convert("RGB")
|
1007 |
+
|
1008 |
+
masked_image = original_image * (1 - (random_mask[:,:,None]>0))
|
1009 |
+
masked_image = masked_image.astype(original_image.dtype)
|
1010 |
+
masked_image = Image.fromarray(masked_image)
|
1011 |
+
|
1012 |
+
return [masked_image], [mask_image], random_mask[:,:,None].astype(np.uint8)
|
1013 |
+
|
1014 |
+
|
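process_dilation_mask above and process_erosion_mask below delegate to random_mask_func with 'square_dilation' / 'square_erosion', which is defined elsewhere in the repo. One plausible OpenCV implementation of those two operations, shown only as a sketch (the function name and kernel choice are assumptions):

import cv2
import numpy as np

def square_morph(mask: np.ndarray, op: str, size: int = 20) -> np.ndarray:
    # Grow or shrink the binary mask with a square structuring element of the given size.
    binary = (mask.squeeze() > 0).astype(np.uint8)
    kernel = np.ones((size, size), np.uint8)
    if op == "square_dilation":
        out = cv2.dilate(binary, kernel, iterations=1)
    elif op == "square_erosion":
        out = cv2.erode(binary, kernel, iterations=1)
    else:
        raise ValueError(f"unknown op: {op}")
    return (out * 255).astype(np.uint8)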
1015 |
+
def process_erosion_mask(input_image,
|
1016 |
+
original_image,
|
1017 |
+
original_mask,
|
1018 |
+
resize_default,
|
1019 |
+
aspect_ratio_name,
|
1020 |
+
dilation_size=20):
|
1021 |
+
alpha_mask = input_image["layers"][0].split()[3]
|
1022 |
+
input_mask = np.asarray(alpha_mask)
|
1023 |
+
output_w, output_h = aspect_ratios[aspect_ratio_name]
|
1024 |
+
if output_w == "" or output_h == "":
|
1025 |
+
output_h, output_w = original_image.shape[:2]
|
1026 |
+
if resize_default:
|
1027 |
+
short_side = min(output_w, output_h)
|
1028 |
+
scale_ratio = 640 / short_side
|
1029 |
+
output_w = int(output_w * scale_ratio)
|
1030 |
+
output_h = int(output_h * scale_ratio)
|
1031 |
+
original_image = resize(Image.fromarray(original_image), target_width=int(output_w), target_height=int(output_h))
|
1032 |
+
original_image = np.array(original_image)
|
1033 |
+
if input_mask is not None:
|
1034 |
+
input_mask = resize(Image.fromarray(np.squeeze(input_mask)), target_width=int(output_w), target_height=int(output_h))
|
1035 |
+
input_mask = np.array(input_mask)
|
1036 |
+
if original_mask is not None:
|
1037 |
+
original_mask = resize(Image.fromarray(np.squeeze(original_mask)), target_width=int(output_w), target_height=int(output_h))
|
1038 |
+
original_mask = np.array(original_mask)
|
1039 |
+
gr.Info(f"Output aspect ratio: {output_w}:{output_h}")
|
1040 |
+
else:
|
1041 |
+
gr.Info(f"Output aspect ratio: {output_w}:{output_h}")
|
1042 |
+
pass
|
1043 |
+
else:
|
1044 |
+
if resize_default:
|
1045 |
+
short_side = min(output_w, output_h)
|
1046 |
+
scale_ratio = 640 / short_side
|
1047 |
+
output_w = int(output_w * scale_ratio)
|
1048 |
+
output_h = int(output_h * scale_ratio)
|
1049 |
+
gr.Info(f"Output aspect ratio: {output_w}:{output_h}")
|
1050 |
+
original_image = resize(Image.fromarray(original_image), target_width=int(output_w), target_height=int(output_h))
|
1051 |
+
original_image = np.array(original_image)
|
1052 |
+
if input_mask is not None:
|
1053 |
+
input_mask = resize(Image.fromarray(np.squeeze(input_mask)), target_width=int(output_w), target_height=int(output_h))
|
1054 |
+
input_mask = np.array(input_mask)
|
1055 |
+
if original_mask is not None:
|
1056 |
+
original_mask = resize(Image.fromarray(np.squeeze(original_mask)), target_width=int(output_w), target_height=int(output_h))
|
1057 |
+
original_mask = np.array(original_mask)
|
1058 |
+
|
1059 |
+
if input_mask.max() == 0:
|
1060 |
+
original_mask = original_mask
|
1061 |
+
else:
|
1062 |
+
original_mask = input_mask
|
1063 |
+
|
1064 |
+
if original_mask is None:
|
1065 |
+
raise gr.Error('Please generate mask first')
|
1066 |
+
|
1067 |
+
if original_mask.ndim == 2:
|
1068 |
+
original_mask = original_mask[:,:,None]
|
1069 |
+
|
1070 |
+
dilation_type = np.random.choice(['square_erosion'])
|
1071 |
+
random_mask = random_mask_func(original_mask, dilation_type, dilation_size).squeeze()
|
1072 |
+
|
1073 |
+
mask_image = Image.fromarray(random_mask.astype(np.uint8)).convert("RGB")
|
1074 |
+
|
1075 |
+
masked_image = original_image * (1 - (random_mask[:,:,None]>0))
|
1076 |
+
masked_image = masked_image.astype(original_image.dtype)
|
1077 |
+
masked_image = Image.fromarray(masked_image)
|
1078 |
+
|
1079 |
+
|
1080 |
+
return [masked_image], [mask_image], random_mask[:,:,None].astype(np.uint8)
|
1081 |
+
|
1082 |
+
|
1083 |
+
def move_mask_left(input_image,
|
1084 |
+
original_image,
|
1085 |
+
original_mask,
|
1086 |
+
moving_pixels,
|
1087 |
+
resize_default,
|
1088 |
+
aspect_ratio_name):
|
1089 |
+
|
1090 |
+
alpha_mask = input_image["layers"][0].split()[3]
|
1091 |
+
input_mask = np.asarray(alpha_mask)
|
1092 |
+
|
1093 |
+
output_w, output_h = aspect_ratios[aspect_ratio_name]
|
1094 |
+
if output_w == "" or output_h == "":
|
1095 |
+
output_h, output_w = original_image.shape[:2]
|
1096 |
+
if resize_default:
|
1097 |
+
short_side = min(output_w, output_h)
|
1098 |
+
scale_ratio = 640 / short_side
|
1099 |
+
output_w = int(output_w * scale_ratio)
|
1100 |
+
output_h = int(output_h * scale_ratio)
|
1101 |
+
original_image = resize(Image.fromarray(original_image), target_width=int(output_w), target_height=int(output_h))
|
1102 |
+
original_image = np.array(original_image)
|
1103 |
+
if input_mask is not None:
|
1104 |
+
input_mask = resize(Image.fromarray(np.squeeze(input_mask)), target_width=int(output_w), target_height=int(output_h))
|
1105 |
+
input_mask = np.array(input_mask)
|
1106 |
+
if original_mask is not None:
|
1107 |
+
original_mask = resize(Image.fromarray(np.squeeze(original_mask)), target_width=int(output_w), target_height=int(output_h))
|
1108 |
+
original_mask = np.array(original_mask)
|
1109 |
+
gr.Info(f"Output aspect ratio: {output_w}:{output_h}")
|
1110 |
+
else:
|
1111 |
+
gr.Info(f"Output aspect ratio: {output_w}:{output_h}")
|
1112 |
+
pass
|
1113 |
+
else:
|
1114 |
+
if resize_default:
|
1115 |
+
short_side = min(output_w, output_h)
|
1116 |
+
scale_ratio = 640 / short_side
|
1117 |
+
output_w = int(output_w * scale_ratio)
|
1118 |
+
output_h = int(output_h * scale_ratio)
|
1119 |
+
gr.Info(f"Output aspect ratio: {output_w}:{output_h}")
|
1120 |
+
original_image = resize(Image.fromarray(original_image), target_width=int(output_w), target_height=int(output_h))
|
1121 |
+
original_image = np.array(original_image)
|
1122 |
+
if input_mask is not None:
|
1123 |
+
input_mask = resize(Image.fromarray(np.squeeze(input_mask)), target_width=int(output_w), target_height=int(output_h))
|
1124 |
+
input_mask = np.array(input_mask)
|
1125 |
+
if original_mask is not None:
|
1126 |
+
original_mask = resize(Image.fromarray(np.squeeze(original_mask)), target_width=int(output_w), target_height=int(output_h))
|
1127 |
+
original_mask = np.array(original_mask)
|
1128 |
+
|
1129 |
+
if input_mask.max() == 0:
|
1130 |
+
original_mask = original_mask
|
1131 |
+
else:
|
1132 |
+
original_mask = input_mask
|
1133 |
+
|
1134 |
+
if original_mask is None:
|
1135 |
+
raise gr.Error('Please generate mask first')
|
1136 |
+
|
1137 |
+
if original_mask.ndim == 2:
|
1138 |
+
original_mask = original_mask[:,:,None]
|
1139 |
+
|
1140 |
+
moved_mask = move_mask_func(original_mask, 'left', int(moving_pixels)).squeeze()
|
1141 |
+
mask_image = Image.fromarray(((moved_mask>0).astype(np.uint8)*255)).convert("RGB")
|
1142 |
+
|
1143 |
+
masked_image = original_image * (1 - (moved_mask[:,:,None]>0))
|
1144 |
+
masked_image = masked_image.astype(original_image.dtype)
|
1145 |
+
masked_image = Image.fromarray(masked_image)
|
1146 |
+
|
1147 |
+
if moved_mask.max() <= 1:
|
1148 |
+
moved_mask = ((moved_mask * 255)[:,:,None]).astype(np.uint8)
|
1149 |
+
original_mask = moved_mask
|
1150 |
+
return [masked_image], [mask_image], original_mask.astype(np.uint8)
|
1151 |
+
|
1152 |
+
|
1153 |
+
def move_mask_right(input_image,
|
1154 |
+
original_image,
|
1155 |
+
original_mask,
|
1156 |
+
moving_pixels,
|
1157 |
+
resize_default,
|
1158 |
+
aspect_ratio_name):
|
1159 |
+
alpha_mask = input_image["layers"][0].split()[3]
|
1160 |
+
input_mask = np.asarray(alpha_mask)
|
1161 |
+
|
1162 |
+
output_w, output_h = aspect_ratios[aspect_ratio_name]
|
1163 |
+
if output_w == "" or output_h == "":
|
1164 |
+
output_h, output_w = original_image.shape[:2]
|
1165 |
+
if resize_default:
|
1166 |
+
short_side = min(output_w, output_h)
|
1167 |
+
scale_ratio = 640 / short_side
|
1168 |
+
output_w = int(output_w * scale_ratio)
|
1169 |
+
output_h = int(output_h * scale_ratio)
|
1170 |
+
original_image = resize(Image.fromarray(original_image), target_width=int(output_w), target_height=int(output_h))
|
1171 |
+
original_image = np.array(original_image)
|
1172 |
+
if input_mask is not None:
|
1173 |
+
input_mask = resize(Image.fromarray(np.squeeze(input_mask)), target_width=int(output_w), target_height=int(output_h))
|
1174 |
+
input_mask = np.array(input_mask)
|
1175 |
+
if original_mask is not None:
|
1176 |
+
original_mask = resize(Image.fromarray(np.squeeze(original_mask)), target_width=int(output_w), target_height=int(output_h))
|
1177 |
+
original_mask = np.array(original_mask)
|
1178 |
+
gr.Info(f"Output aspect ratio: {output_w}:{output_h}")
|
1179 |
+
else:
|
1180 |
+
gr.Info(f"Output aspect ratio: {output_w}:{output_h}")
|
1181 |
+
pass
|
1182 |
+
else:
|
1183 |
+
if resize_default:
|
1184 |
+
short_side = min(output_w, output_h)
|
1185 |
+
scale_ratio = 640 / short_side
|
1186 |
+
output_w = int(output_w * scale_ratio)
|
1187 |
+
output_h = int(output_h * scale_ratio)
|
1188 |
+
gr.Info(f"Output aspect ratio: {output_w}:{output_h}")
|
1189 |
+
original_image = resize(Image.fromarray(original_image), target_width=int(output_w), target_height=int(output_h))
|
1190 |
+
original_image = np.array(original_image)
|
1191 |
+
if input_mask is not None:
|
1192 |
+
input_mask = resize(Image.fromarray(np.squeeze(input_mask)), target_width=int(output_w), target_height=int(output_h))
|
1193 |
+
input_mask = np.array(input_mask)
|
1194 |
+
if original_mask is not None:
|
1195 |
+
original_mask = resize(Image.fromarray(np.squeeze(original_mask)), target_width=int(output_w), target_height=int(output_h))
|
1196 |
+
original_mask = np.array(original_mask)
|
1197 |
+
|
1198 |
+
if input_mask.max() == 0:
|
1199 |
+
original_mask = original_mask
|
1200 |
+
else:
|
1201 |
+
original_mask = input_mask
|
1202 |
+
|
1203 |
+
if original_mask is None:
|
1204 |
+
raise gr.Error('Please generate mask first')
|
1205 |
+
|
1206 |
+
if original_mask.ndim == 2:
|
1207 |
+
original_mask = original_mask[:,:,None]
|
1208 |
+
|
1209 |
+
moved_mask = move_mask_func(original_mask, 'right', int(moving_pixels)).squeeze()
|
1210 |
+
|
1211 |
+
mask_image = Image.fromarray(((moved_mask>0).astype(np.uint8)*255)).convert("RGB")
|
1212 |
+
|
1213 |
+
masked_image = original_image * (1 - (moved_mask[:,:,None]>0))
|
1214 |
+
masked_image = masked_image.astype(original_image.dtype)
|
1215 |
+
masked_image = Image.fromarray(masked_image)
|
1216 |
+
|
1217 |
+
|
1218 |
+
if moved_mask.max() <= 1:
|
1219 |
+
moved_mask = ((moved_mask * 255)[:,:,None]).astype(np.uint8)
|
1220 |
+
original_mask = moved_mask
|
1221 |
+
|
1222 |
+
return [masked_image], [mask_image], original_mask.astype(np.uint8)
|
1223 |
+
|
1224 |
+
|
1225 |
+
def move_mask_up(input_image,
|
1226 |
+
original_image,
|
1227 |
+
original_mask,
|
1228 |
+
moving_pixels,
|
1229 |
+
resize_default,
|
1230 |
+
aspect_ratio_name):
|
1231 |
+
alpha_mask = input_image["layers"][0].split()[3]
|
1232 |
+
input_mask = np.asarray(alpha_mask)
|
1233 |
+
|
1234 |
+
output_w, output_h = aspect_ratios[aspect_ratio_name]
|
1235 |
+
if output_w == "" or output_h == "":
|
1236 |
+
output_h, output_w = original_image.shape[:2]
|
1237 |
+
if resize_default:
|
1238 |
+
short_side = min(output_w, output_h)
|
1239 |
+
scale_ratio = 640 / short_side
|
1240 |
+
output_w = int(output_w * scale_ratio)
|
1241 |
+
output_h = int(output_h * scale_ratio)
|
1242 |
+
original_image = resize(Image.fromarray(original_image), target_width=int(output_w), target_height=int(output_h))
|
1243 |
+
original_image = np.array(original_image)
|
1244 |
+
if input_mask is not None:
|
1245 |
+
input_mask = resize(Image.fromarray(np.squeeze(input_mask)), target_width=int(output_w), target_height=int(output_h))
|
1246 |
+
input_mask = np.array(input_mask)
|
1247 |
+
if original_mask is not None:
|
1248 |
+
original_mask = resize(Image.fromarray(np.squeeze(original_mask)), target_width=int(output_w), target_height=int(output_h))
|
1249 |
+
original_mask = np.array(original_mask)
|
1250 |
+
gr.Info(f"Output aspect ratio: {output_w}:{output_h}")
|
1251 |
+
else:
|
1252 |
+
gr.Info(f"Output aspect ratio: {output_w}:{output_h}")
|
1253 |
+
pass
|
1254 |
+
else:
|
1255 |
+
if resize_default:
|
1256 |
+
short_side = min(output_w, output_h)
|
1257 |
+
scale_ratio = 640 / short_side
|
1258 |
+
output_w = int(output_w * scale_ratio)
|
1259 |
+
output_h = int(output_h * scale_ratio)
|
1260 |
+
gr.Info(f"Output aspect ratio: {output_w}:{output_h}")
|
1261 |
+
original_image = resize(Image.fromarray(original_image), target_width=int(output_w), target_height=int(output_h))
|
1262 |
+
original_image = np.array(original_image)
|
1263 |
+
if input_mask is not None:
|
1264 |
+
input_mask = resize(Image.fromarray(np.squeeze(input_mask)), target_width=int(output_w), target_height=int(output_h))
|
1265 |
+
input_mask = np.array(input_mask)
|
1266 |
+
if original_mask is not None:
|
1267 |
+
original_mask = resize(Image.fromarray(np.squeeze(original_mask)), target_width=int(output_w), target_height=int(output_h))
|
1268 |
+
original_mask = np.array(original_mask)
|
1269 |
+
|
1270 |
+
if input_mask.max() == 0:
|
1271 |
+
original_mask = original_mask
|
1272 |
+
else:
|
1273 |
+
original_mask = input_mask
|
1274 |
+
|
1275 |
+
if original_mask is None:
|
1276 |
+
raise gr.Error('Please generate mask first')
|
1277 |
+
|
1278 |
+
if original_mask.ndim == 2:
|
1279 |
+
original_mask = original_mask[:,:,None]
|
1280 |
+
|
1281 |
+
moved_mask = move_mask_func(original_mask, 'up', int(moving_pixels)).squeeze()
|
1282 |
+
mask_image = Image.fromarray(((moved_mask>0).astype(np.uint8)*255)).convert("RGB")
|
1283 |
+
|
1284 |
+
masked_image = original_image * (1 - (moved_mask[:,:,None]>0))
|
1285 |
+
masked_image = masked_image.astype(original_image.dtype)
|
1286 |
+
masked_image = Image.fromarray(masked_image)
|
1287 |
+
|
1288 |
+
if moved_mask.max() <= 1:
|
1289 |
+
moved_mask = ((moved_mask * 255)[:,:,None]).astype(np.uint8)
|
1290 |
+
original_mask = moved_mask
|
1291 |
+
|
1292 |
+
return [masked_image], [mask_image], original_mask.astype(np.uint8)
|
1293 |
+
|
1294 |
+
|
1295 |
+
def move_mask_down(input_image,
|
1296 |
+
original_image,
|
1297 |
+
original_mask,
|
1298 |
+
moving_pixels,
|
1299 |
+
resize_default,
|
1300 |
+
aspect_ratio_name):
|
1301 |
+
alpha_mask = input_image["layers"][0].split()[3]
|
1302 |
+
input_mask = np.asarray(alpha_mask)
|
1303 |
+
output_w, output_h = aspect_ratios[aspect_ratio_name]
|
1304 |
+
if output_w == "" or output_h == "":
|
1305 |
+
output_h, output_w = original_image.shape[:2]
|
1306 |
+
if resize_default:
|
1307 |
+
short_side = min(output_w, output_h)
|
1308 |
+
scale_ratio = 640 / short_side
|
1309 |
+
output_w = int(output_w * scale_ratio)
|
1310 |
+
output_h = int(output_h * scale_ratio)
|
1311 |
+
original_image = resize(Image.fromarray(original_image), target_width=int(output_w), target_height=int(output_h))
|
1312 |
+
original_image = np.array(original_image)
|
1313 |
+
if input_mask is not None:
|
1314 |
+
input_mask = resize(Image.fromarray(np.squeeze(input_mask)), target_width=int(output_w), target_height=int(output_h))
|
1315 |
+
input_mask = np.array(input_mask)
|
1316 |
+
if original_mask is not None:
|
1317 |
+
original_mask = resize(Image.fromarray(np.squeeze(original_mask)), target_width=int(output_w), target_height=int(output_h))
|
1318 |
+
original_mask = np.array(original_mask)
|
1319 |
+
gr.Info(f"Output aspect ratio: {output_w}:{output_h}")
|
1320 |
+
else:
|
1321 |
+
gr.Info(f"Output aspect ratio: {output_w}:{output_h}")
|
1322 |
+
pass
|
1323 |
+
else:
|
1324 |
+
if resize_default:
|
1325 |
+
short_side = min(output_w, output_h)
|
1326 |
+
scale_ratio = 640 / short_side
|
1327 |
+
output_w = int(output_w * scale_ratio)
|
1328 |
+
output_h = int(output_h * scale_ratio)
|
1329 |
+
gr.Info(f"Output aspect ratio: {output_w}:{output_h}")
|
1330 |
+
original_image = resize(Image.fromarray(original_image), target_width=int(output_w), target_height=int(output_h))
|
1331 |
+
original_image = np.array(original_image)
|
1332 |
+
if input_mask is not None:
|
1333 |
+
input_mask = resize(Image.fromarray(np.squeeze(input_mask)), target_width=int(output_w), target_height=int(output_h))
|
1334 |
+
input_mask = np.array(input_mask)
|
1335 |
+
if original_mask is not None:
|
1336 |
+
original_mask = resize(Image.fromarray(np.squeeze(original_mask)), target_width=int(output_w), target_height=int(output_h))
|
1337 |
+
original_mask = np.array(original_mask)
|
1338 |
+
|
1339 |
+
if input_mask.max() == 0:
|
1340 |
+
original_mask = original_mask
|
1341 |
+
else:
|
1342 |
+
original_mask = input_mask
|
1343 |
+
|
1344 |
+
if original_mask is None:
|
1345 |
+
raise gr.Error('Please generate mask first')
|
1346 |
+
|
1347 |
+
if original_mask.ndim == 2:
|
1348 |
+
original_mask = original_mask[:,:,None]
|
1349 |
+
|
1350 |
+
moved_mask = move_mask_func(original_mask, 'down', int(moving_pixels)).squeeze()
|
1351 |
+
mask_image = Image.fromarray(((moved_mask>0).astype(np.uint8)*255)).convert("RGB")
|
1352 |
+
|
1353 |
+
masked_image = original_image * (1 - (moved_mask[:,:,None]>0))
|
1354 |
+
masked_image = masked_image.astype(original_image.dtype)
|
1355 |
+
masked_image = Image.fromarray(masked_image)
|
1356 |
+
|
1357 |
+
if moved_mask.max() <= 1:
|
1358 |
+
moved_mask = ((moved_mask * 255)[:,:,None]).astype(np.uint8)
|
1359 |
+
original_mask = moved_mask
|
1360 |
+
|
1361 |
+
return [masked_image], [mask_image], original_mask.astype(np.uint8)
|
1362 |
+
|
1363 |
+
|
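The four move handlers above rely on move_mask_func(mask, direction, pixels), also defined elsewhere in the repo. A hedged numpy sketch of how such a shift could behave, translating the h,w,1 mask and zero-filling the vacated border rather than wrapping around:

import numpy as np

def shift_mask(mask: np.ndarray, direction: str, pixels: int) -> np.ndarray:
    # Translate the h,w,1 mask by `pixels` and zero-fill the exposed border.
    shifted = np.zeros_like(mask)
    p = int(pixels)
    if direction == "left":
        shifted[:, :-p or None] = mask[:, p:]
    elif direction == "right":
        shifted[:, p:] = mask[:, :-p or None]
    elif direction == "up":
        shifted[:-p or None, :] = mask[p:, :]
    elif direction == "down":
        shifted[p:, :] = mask[:-p or None, :]
    return shifted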
def invert_mask(input_image,
                original_image,
                original_mask,
                ):
    alpha_mask = input_image["layers"][0].split()[3]
    input_mask = np.asarray(alpha_mask)
    if input_mask.max() == 0:
        original_mask = 1 - (original_mask > 0).astype(np.uint8)
    else:
        original_mask = 1 - (input_mask > 0).astype(np.uint8)

    if original_mask is None:
        raise gr.Error('Please generate mask first')

    original_mask = original_mask.squeeze()
    mask_image = Image.fromarray(original_mask * 255).convert("RGB")

    if original_mask.ndim == 2:
        original_mask = original_mask[:, :, None]

    if original_mask.max() <= 1:
        original_mask = (original_mask * 255).astype(np.uint8)

    masked_image = original_image * (1 - (original_mask > 0))
    masked_image = masked_image.astype(original_image.dtype)
    masked_image = Image.fromarray(masked_image)

    return [masked_image], [mask_image], original_mask, True


def init_img(base,
             init_type,
             prompt,
             aspect_ratio,
             example_change_times
             ):
    image_pil = base["background"].convert("RGB")
    original_image = np.array(image_pil)
    if max(original_image.shape[0], original_image.shape[1]) * 1.0 / min(original_image.shape[0], original_image.shape[1]) > 2.0:
        raise gr.Error('image aspect ratio cannot be larger than 2.0')
    if init_type in MASK_IMAGE_PATH.keys() and example_change_times < 2:
        mask_gallery = [Image.open(MASK_IMAGE_PATH[init_type]).convert("L")]
        masked_gallery = [Image.open(MASKED_IMAGE_PATH[init_type]).convert("RGB")]
        result_gallery = [Image.open(OUTPUT_IMAGE_PATH[init_type]).convert("RGB")]
        width, height = image_pil.size
        image_processor = VaeImageProcessor(vae_scale_factor=pipe.vae_scale_factor, do_convert_rgb=True)
        height_new, width_new = image_processor.get_default_height_width(image_pil, height, width)
        image_pil = image_pil.resize((width_new, height_new))
        mask_gallery[0] = mask_gallery[0].resize((width_new, height_new))
        masked_gallery[0] = masked_gallery[0].resize((width_new, height_new))
        result_gallery[0] = result_gallery[0].resize((width_new, height_new))
        original_mask = np.array(mask_gallery[0]).astype(np.uint8)[:, :, None]  # h, w, 1
        return base, original_image, original_mask, prompt, mask_gallery, masked_gallery, result_gallery, "", "", "", "Custom resolution", False, False, example_change_times
    else:
        return base, original_image, None, "", None, None, None, "", "", "", aspect_ratio, True, False, 0


def reset_func(input_image,
               original_image,
               original_mask,
               prompt,
               target_prompt,
               target_prompt_output):
    input_image = None
    original_image = None
    original_mask = None
    prompt = ''
    mask_gallery = []
    masked_gallery = []
    result_gallery = []
    target_prompt = ''
    target_prompt_output = ''
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    return input_image, original_image, original_mask, prompt, mask_gallery, masked_gallery, result_gallery, target_prompt, target_prompt_output, True, False


def update_example(example_type,
                   prompt,
                   example_change_times):
    input_image = INPUT_IMAGE_PATH[example_type]
    image_pil = Image.open(input_image).convert("RGB")
    mask_gallery = [Image.open(MASK_IMAGE_PATH[example_type]).convert("L")]
    masked_gallery = [Image.open(MASKED_IMAGE_PATH[example_type]).convert("RGB")]
    result_gallery = [Image.open(OUTPUT_IMAGE_PATH[example_type]).convert("RGB")]
    width, height = image_pil.size
    image_processor = VaeImageProcessor(vae_scale_factor=pipe.vae_scale_factor, do_convert_rgb=True)
    height_new, width_new = image_processor.get_default_height_width(image_pil, height, width)
    image_pil = image_pil.resize((width_new, height_new))
    mask_gallery[0] = mask_gallery[0].resize((width_new, height_new))
    masked_gallery[0] = masked_gallery[0].resize((width_new, height_new))
    result_gallery[0] = result_gallery[0].resize((width_new, height_new))

    original_image = np.array(image_pil)
    original_mask = np.array(mask_gallery[0]).astype(np.uint8)[:, :, None]  # h, w, 1
    aspect_ratio = "Custom resolution"
    example_change_times += 1
    return input_image, prompt, original_image, original_mask, mask_gallery, masked_gallery, result_gallery, aspect_ratio, "", "", False, example_change_times

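init_img and update_example snap the example images to the pipeline's working resolution via VaeImageProcessor.get_default_height_width before resizing the galleries. A rough equivalent of that rounding step, assuming it simply floors each side to a multiple of the VAE scale factor:

def snap_to_vae_grid(width: int, height: int, vae_scale_factor: int = 8):
    # Round each side down so it is divisible by the VAE scale factor.
    width_new = width - width % vae_scale_factor
    height_new = height - height % vae_scale_factor
    return height_new, width_new

# e.g. a 1023x767 example image would be resized to 1016x760 before editing
print(snap_to_vae_grid(1023, 767))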
1463 |
+
block = gr.Blocks(
|
1464 |
+
theme=gr.themes.Soft(
|
1465 |
+
radius_size=gr.themes.sizes.radius_none,
|
1466 |
+
text_size=gr.themes.sizes.text_md
|
1467 |
+
)
|
1468 |
+
).queue()
|
1469 |
+
with block as demo:
|
1470 |
+
with gr.Row():
|
1471 |
+
with gr.Column():
|
1472 |
+
gr.HTML(head)
|
1473 |
+
|
1474 |
+
gr.Markdown(descriptions)
|
1475 |
+
|
1476 |
+
with gr.Accordion(label="🧭 Instructions:", open=True, elem_id="accordion"):
|
1477 |
+
with gr.Row(equal_height=True):
|
1478 |
+
gr.Markdown(instructions)
|
1479 |
+
|
1480 |
+
original_image = gr.State(value=None)
|
1481 |
+
original_mask = gr.State(value=None)
|
1482 |
+
category = gr.State(value=None)
|
1483 |
+
status = gr.State(value=None)
|
1484 |
+
invert_mask_state = gr.State(value=False)
|
1485 |
+
example_change_times = gr.State(value=0)
|
1486 |
+
|
1487 |
+
|
1488 |
+
with gr.Row():
|
1489 |
+
with gr.Column():
|
1490 |
+
with gr.Row():
|
1491 |
+
input_image = gr.ImageEditor(
|
1492 |
+
label="Input Image",
|
1493 |
+
type="pil",
|
1494 |
+
brush=gr.Brush(colors=["#FFFFFF"], default_size = 30, color_mode="fixed"),
|
1495 |
+
layers = False,
|
1496 |
+
interactive=True,
|
1497 |
+
height=1024,
|
1498 |
+
sources=["upload"],
|
1499 |
+
)
|
1500 |
+
|
1501 |
+
|
1502 |
+
vlm_model_dropdown = gr.Dropdown(label="VLM model", choices=VLM_MODEL_NAMES, value=DEFAULT_VLM_MODEL_NAME, interactive=True)
|
1503 |
+
with gr.Group():
|
1504 |
+
with gr.Row():
|
1505 |
+
GPT4o_KEY = gr.Textbox(label="GPT4o API Key", placeholder="Please input your GPT4o API Key when using the GPT4o VLM (highly recommended).", value="", lines=1)
|
1506 |
+
|
1507 |
+
GPT4o_KEY_submit = gr.Button("Submit and Verify")
|
1508 |
+
|
1509 |
+
|
1510 |
+
aspect_ratio = gr.Dropdown(label="Output aspect ratio", choices=ASPECT_RATIO_LABELS, value=DEFAULT_ASPECT_RATIO)
|
1511 |
+
resize_default = gr.Checkbox(label="Short edge resize to 640px", value=True)
|
1512 |
+
|
1513 |
+
|
1514 |
+
prompt = gr.Textbox(label="⌨️ Instruction", placeholder="Please input your instruction.", value="",lines=1)
|
1515 |
+
|
1516 |
+
run_button = gr.Button("💫 Run")
|
1517 |
+
|
1518 |
+
|
1519 |
+
with gr.Row():
|
1520 |
+
mask_button = gr.Button("Generate Mask")
|
1521 |
+
random_mask_button = gr.Button("Square/Circle Mask")
|
1522 |
+
|
1523 |
+
|
1524 |
+
with gr.Row():
|
1525 |
+
generate_target_prompt_button = gr.Button("Generate Target Prompt")
|
1526 |
+
|
1527 |
+
target_prompt = gr.Text(
|
1528 |
+
label="Input Target Prompt",
|
1529 |
+
max_lines=5,
|
1530 |
+
placeholder="VLM-generated target prompt, you can first generate if and then modify it (optional)",
|
1531 |
+
value='',
|
1532 |
+
lines=2
|
1533 |
+
)
|
1534 |
+
|
1535 |
+
with gr.Accordion("Advanced Options", open=False, elem_id="accordion1"):
|
1536 |
+
base_model_dropdown = gr.Dropdown(label="Base model", choices=BASE_MODELS, value=DEFAULT_BASE_MODEL, interactive=True)
|
1537 |
+
negative_prompt = gr.Text(
|
1538 |
+
label="Negative Prompt",
|
1539 |
+
max_lines=5,
|
1540 |
+
placeholder="Please input your negative prompt",
|
1541 |
+
value='ugly, low quality',lines=1
|
1542 |
+
)
|
1543 |
+
|
1544 |
+
control_strength = gr.Slider(
|
1545 |
+
label="Control Strength: ", show_label=True, minimum=0, maximum=1.1, value=1, step=0.01
|
1546 |
+
)
|
1547 |
+
with gr.Group():
|
1548 |
+
seed = gr.Slider(
|
1549 |
+
label="Seed: ", minimum=0, maximum=2147483647, step=1, value=648464818
|
1550 |
+
)
|
1551 |
+
randomize_seed = gr.Checkbox(label="Randomize seed", value=False)
|
1552 |
+
|
1553 |
+
blending = gr.Checkbox(label="Blending mode", value=True)
|
1554 |
+
|
1555 |
+
|
1556 |
+
num_samples = gr.Slider(
|
1557 |
+
label="Num samples", minimum=0, maximum=4, step=1, value=4
|
1558 |
+
)
|
1559 |
+
|
1560 |
+
with gr.Group():
|
1561 |
+
with gr.Row():
|
1562 |
+
guidance_scale = gr.Slider(
|
1563 |
+
label="Guidance scale",
|
1564 |
+
minimum=1,
|
1565 |
+
maximum=12,
|
1566 |
+
step=0.1,
|
1567 |
+
value=7.5,
|
1568 |
+
)
|
1569 |
+
num_inference_steps = gr.Slider(
|
1570 |
+
label="Number of inference steps",
|
1571 |
+
minimum=1,
|
1572 |
+
maximum=50,
|
1573 |
+
step=1,
|
1574 |
+
value=50,
|
1575 |
+
)
|
1576 |
+
|
1577 |
+
|
1578 |
+
with gr.Column():
|
1579 |
+
with gr.Row():
|
1580 |
+
with gr.Tab(elem_classes="feedback", label="Masked Image"):
|
1581 |
+
masked_gallery = gr.Gallery(label='Masked Image', show_label=True, elem_id="gallery", preview=True, height=360)
|
1582 |
+
with gr.Tab(elem_classes="feedback", label="Mask"):
|
1583 |
+
mask_gallery = gr.Gallery(label='Mask', show_label=True, elem_id="gallery", preview=True, height=360)
|
1584 |
+
|
1585 |
+
invert_mask_button = gr.Button("Invert Mask")
|
1586 |
+
dilation_size = gr.Slider(
|
1587 |
+
label="Dilation size: ", minimum=0, maximum=50, step=1, value=20
|
1588 |
+
)
|
1589 |
+
with gr.Row():
|
1590 |
+
dilation_mask_button = gr.Button("Dilate Generated Mask")
|
1591 |
+
erosion_mask_button = gr.Button("Erode Generated Mask")
|
1592 |
+
|
1593 |
+
moving_pixels = gr.Slider(
|
1594 |
+
label="Moving pixels:", show_label=True, minimum=0, maximum=50, value=4, step=1
|
1595 |
+
)
|
1596 |
+
with gr.Row():
|
1597 |
+
move_left_button = gr.Button("Move Left")
|
1598 |
+
move_right_button = gr.Button("Move Right")
|
1599 |
+
with gr.Row():
|
1600 |
+
move_up_button = gr.Button("Move Up")
|
1601 |
+
move_down_button = gr.Button("Move Down")
|
1602 |
+
|
1603 |
+
with gr.Tab(elem_classes="feedback", label="Output"):
|
1604 |
+
result_gallery = gr.Gallery(label='Output', show_label=True, elem_id="gallery", preview=True, height=400)
|
1605 |
+
|
1606 |
+
target_prompt_output = gr.Text(label="Output Target Prompt", value="", lines=1, interactive=False)
|
1607 |
+
|
1608 |
+
reset_button = gr.Button("Reset")
|
1609 |
+
|
1610 |
+
init_type = gr.Textbox(label="Init Name", value="", visible=False)
|
1611 |
+
example_type = gr.Textbox(label="Example Name", value="", visible=False)
|
1612 |
+
|
1613 |
+
|
1614 |
+
|
1615 |
+
with gr.Row():
|
1616 |
+
example = gr.Examples(
|
1617 |
+
label="Quick Example",
|
1618 |
+
examples=EXAMPLES,
|
1619 |
+
inputs=[input_image, prompt, seed, init_type, example_type, blending, resize_default, vlm_model_dropdown],
|
1620 |
+
examples_per_page=10,
|
1621 |
+
cache_examples=False,
|
1622 |
+
)
|
1623 |
+
|
1624 |
+
|
1625 |
+
with gr.Accordion(label="🎬 Feature Details:", open=True, elem_id="accordion"):
|
1626 |
+
with gr.Row(equal_height=True):
|
1627 |
+
gr.Markdown(tips)
|
1628 |
+
|
1629 |
+
with gr.Row():
|
1630 |
+
gr.Markdown(citation)
|
1631 |
+
|
1632 |
+
## gr.Examples cannot update a gr.Gallery directly, so the two handlers below refresh the galleries instead,
|
1633 |
+
## and they also resolve the conflict between uploading a new image and switching examples.
|
1634 |
+
input_image.upload(
|
1635 |
+
init_img,
|
1636 |
+
[input_image, init_type, prompt, aspect_ratio, example_change_times],
|
1637 |
+
[input_image, original_image, original_mask, prompt, mask_gallery, masked_gallery, result_gallery, target_prompt, target_prompt_output, init_type, aspect_ratio, resize_default, invert_mask_state, example_change_times]
|
1638 |
+
)
|
1639 |
+
example_type.change(fn=update_example, inputs=[example_type, prompt, example_change_times], outputs=[input_image, prompt, original_image, original_mask, mask_gallery, masked_gallery, result_gallery, aspect_ratio, target_prompt, target_prompt_output, invert_mask_state, example_change_times])
|
1640 |
+
|
1641 |
+
## vlm and base model dropdown
|
1642 |
+
vlm_model_dropdown.change(fn=update_vlm_model, inputs=[vlm_model_dropdown], outputs=[status])
|
1643 |
+
base_model_dropdown.change(fn=update_base_model, inputs=[base_model_dropdown], outputs=[status])
|
1644 |
+
|
1645 |
+
|
1646 |
+
GPT4o_KEY_submit.click(fn=submit_GPT4o_KEY, inputs=[GPT4o_KEY], outputs=[GPT4o_KEY, vlm_model_dropdown])
|
1647 |
+
invert_mask_button.click(fn=invert_mask, inputs=[input_image, original_image, original_mask], outputs=[masked_gallery, mask_gallery, original_mask, invert_mask_state])
|
1648 |
+
|
1649 |
+
|
1650 |
+
ips=[input_image,
|
1651 |
+
original_image,
|
1652 |
+
original_mask,
|
1653 |
+
prompt,
|
1654 |
+
negative_prompt,
|
1655 |
+
control_strength,
|
1656 |
+
seed,
|
1657 |
+
randomize_seed,
|
1658 |
+
guidance_scale,
|
1659 |
+
num_inference_steps,
|
1660 |
+
num_samples,
|
1661 |
+
blending,
|
1662 |
+
category,
|
1663 |
+
target_prompt,
|
1664 |
+
resize_default,
|
1665 |
+
aspect_ratio,
|
1666 |
+
invert_mask_state]
|
1667 |
+
|
1668 |
+
## run brushedit
|
1669 |
+
run_button.click(fn=process, inputs=ips, outputs=[result_gallery, mask_gallery, masked_gallery, prompt, target_prompt, target_prompt_output, invert_mask_state])
|
1670 |
+
|
1671 |
+
## mask func
|
1672 |
+
mask_button.click(fn=process_mask, inputs=[input_image, original_image, prompt, resize_default, aspect_ratio], outputs=[masked_gallery, mask_gallery, original_mask, category])
|
1673 |
+
random_mask_button.click(fn=process_random_mask, inputs=[input_image, original_image, original_mask, resize_default, aspect_ratio], outputs=[masked_gallery, mask_gallery, original_mask])
|
1674 |
+
dilation_mask_button.click(fn=process_dilation_mask, inputs=[input_image, original_image, original_mask, resize_default, aspect_ratio, dilation_size], outputs=[ masked_gallery, mask_gallery, original_mask])
|
1675 |
+
erosion_mask_button.click(fn=process_erosion_mask, inputs=[input_image, original_image, original_mask, resize_default, aspect_ratio, dilation_size], outputs=[ masked_gallery, mask_gallery, original_mask])
|
1676 |
+
|
1677 |
+
## move mask func
|
1678 |
+
move_left_button.click(fn=move_mask_left, inputs=[input_image, original_image, original_mask, moving_pixels, resize_default, aspect_ratio], outputs=[masked_gallery, mask_gallery, original_mask])
|
1679 |
+
move_right_button.click(fn=move_mask_right, inputs=[input_image, original_image, original_mask, moving_pixels, resize_default, aspect_ratio], outputs=[masked_gallery, mask_gallery, original_mask])
|
1680 |
+
move_up_button.click(fn=move_mask_up, inputs=[input_image, original_image, original_mask, moving_pixels, resize_default, aspect_ratio], outputs=[masked_gallery, mask_gallery, original_mask])
|
1681 |
+
move_down_button.click(fn=move_mask_down, inputs=[input_image, original_image, original_mask, moving_pixels, resize_default, aspect_ratio], outputs=[masked_gallery, mask_gallery, original_mask])
|
1682 |
+
|
1683 |
+
## prompt func
|
1684 |
+
generate_target_prompt_button.click(fn=generate_target_prompt, inputs=[input_image, original_image, prompt], outputs=[target_prompt, target_prompt_output])
|
1685 |
+
|
1686 |
+
## reset func
|
1687 |
+
reset_button.click(fn=reset_func, inputs=[input_image, original_image, original_mask, prompt, target_prompt, target_prompt_output], outputs=[input_image, original_image, original_mask, prompt, mask_gallery, masked_gallery, result_gallery, target_prompt, target_prompt_output, resize_default, invert_mask_state])
|
1688 |
+
|
1689 |
+
|
1690 |
+
demo.launch(server_name="0.0.0.0", server_port=12345, share=False)
|
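The event wiring above follows Gradio's standard pattern: each handler's return tuple is mapped positionally onto the components listed in outputs. A minimal self-contained sketch of that pattern (toy components and handler, not part of the app):

import gradio as gr

def echo_upper(text):
    # Return values map positionally onto the components in `outputs`.
    return text.upper(), len(text)

with gr.Blocks() as demo:
    box = gr.Textbox(label="Input")
    out_text = gr.Textbox(label="Upper")
    out_len = gr.Number(label="Length")
    gr.Button("Run").click(fn=echo_upper, inputs=[box], outputs=[out_text, out_len])

if __name__ == "__main__":
    demo.launch()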
app/{gpt4_o → src}/vlm_pipeline.py
RENAMED
@@ -7,13 +7,28 @@ from io import BytesIO
|
|
7 |
import numpy as np
|
8 |
import gradio as gr
|
9 |
|
|
|
|
|
|
|
10 |
|
11 |
from app.gpt4_o.instructions import (
|
12 |
-
|
13 |
-
|
14 |
-
|
15 |
-
|
16 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
17 |
from app.utils.utils import run_grounded_sam
|
18 |
|
19 |
|
@@ -25,46 +40,96 @@ def encode_image(img):
|
|
25 |
return base64.b64encode(img_bytes).decode('utf-8')
|
26 |
|
27 |
|
28 |
-
def run_gpt4o_vl_inference(
|
29 |
messages):
|
30 |
-
response =
|
31 |
model="gpt-4o-2024-08-06",
|
32 |
messages=messages
|
33 |
)
|
34 |
response_str = response.choices[0].message.content
|
35 |
return response_str
|
36 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
37 |
|
38 |
-
def
|
39 |
-
|
40 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
41 |
|
42 |
-
base64_image = encode_image(image)
|
43 |
-
|
44 |
-
messages = create_editing_category_messages(editing_prompt)
|
45 |
-
|
46 |
-
response_str = run_gpt4o_vl_inference(vlm, messages)
|
47 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
48 |
for category_name in ["Addition","Remove","Local","Global","Background"]:
|
49 |
if category_name.lower() in response_str.lower():
|
50 |
return category_name
|
51 |
-
raise
|
52 |
|
53 |
|
54 |
-
|
|
|
|
|
|
|
55 |
category,
|
56 |
-
editing_prompt
|
|
|
57 |
if category in ["Background", "Global", "Addition"]:
|
58 |
edit_object = "nan"
|
59 |
return edit_object
|
60 |
|
61 |
-
|
62 |
-
|
63 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
64 |
return response_str
|
65 |
|
66 |
|
67 |
-
|
|
|
|
|
68 |
category,
|
69 |
image,
|
70 |
editing_prompt,
|
@@ -73,16 +138,25 @@ def vlm_response_mask(vlm,
|
|
73 |
sam_predictor=None,
|
74 |
sam_automask_generator=None,
|
75 |
groundingdino_model=None,
|
|
|
76 |
):
|
77 |
mask = None
|
78 |
if editing_prompt is None or len(editing_prompt)==0:
|
79 |
raise gr.Error("Please input the editing instruction!")
|
80 |
height, width = image.shape[:2]
|
81 |
if category=="Addition":
|
82 |
-
base64_image = encode_image(image)
|
83 |
-
messages = create_add_object_messages(editing_prompt, base64_image, height=height, width=width)
|
84 |
try:
|
85 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
86 |
pattern = r'\[\d{1,3}(?:,\s*\d{1,3}){3}\]'
|
87 |
box = re.findall(pattern, response_str)
|
88 |
box = box[0][1:-1].split(",")
|
@@ -92,7 +166,7 @@ def vlm_response_mask(vlm,
|
|
92 |
cus_mask[box[1]: box[1]+box[3], box[0]: box[0]+box[2]]=255
|
93 |
mask = cus_mask
|
94 |
except:
|
95 |
-
raise gr.Error("Please set the mask manually,
|
96 |
|
97 |
elif category=="Background":
|
98 |
labels = "background"
|
@@ -104,7 +178,6 @@ def vlm_response_mask(vlm,
|
|
104 |
if mask is None:
|
105 |
for thresh in [0.3,0.25,0.2,0.15,0.1,0.05,0]:
|
106 |
try:
|
107 |
-
device = "cuda" if torch.cuda.is_available() else "cpu"
|
108 |
detections = run_grounded_sam(
|
109 |
input_image={"image":Image.fromarray(image.astype('uint8')),
|
110 |
"mask":None},
|
@@ -128,11 +201,22 @@ def vlm_response_mask(vlm,
|
|
128 |
return mask
|
129 |
|
130 |
|
131 |
-
def vlm_response_prompt_after_apply_instruction(
|
|
|
132 |
image,
|
133 |
-
editing_prompt
|
134 |
-
|
135 |
-
|
136 |
-
|
137 |
-
|
138 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
7 |
import numpy as np
|
8 |
import gradio as gr
|
9 |
|
10 |
+
from openai import OpenAI
|
11 |
+
from transformers import (LlavaNextForConditionalGeneration, Qwen2VLForConditionalGeneration)
|
12 |
+
from qwen_vl_utils import process_vision_info
|
13 |
|
14 |
from app.gpt4_o.instructions import (
|
15 |
+
create_editing_category_messages_gpt4o,
|
16 |
+
create_ori_object_messages_gpt4o,
|
17 |
+
create_add_object_messages_gpt4o,
|
18 |
+
create_apply_editing_messages_gpt4o)
|
19 |
+
|
20 |
+
from app.llava.instructions import (
|
21 |
+
create_editing_category_messages_llava,
|
22 |
+
create_ori_object_messages_llava,
|
23 |
+
create_add_object_messages_llava,
|
24 |
+
create_apply_editing_messages_llava)
|
25 |
+
|
26 |
+
from app.qwen2.instructions import (
|
27 |
+
create_editing_category_messages_qwen2,
|
28 |
+
create_ori_object_messages_qwen2,
|
29 |
+
create_add_object_messages_qwen2,
|
30 |
+
create_apply_editing_messages_qwen2)
|
31 |
+
|
32 |
from app.utils.utils import run_grounded_sam
|
33 |
|
34 |
|
|
|
40 |
return base64.b64encode(img_bytes).decode('utf-8')
|
41 |
|
42 |
|
43 |
+
def run_gpt4o_vl_inference(vlm_model,
|
44 |
messages):
|
45 |
+
response = vlm_model.chat.completions.create(
|
46 |
model="gpt-4o-2024-08-06",
|
47 |
messages=messages
|
48 |
)
|
49 |
response_str = response.choices[0].message.content
|
50 |
return response_str
|
51 |
|
def run_llava_next_inference(vlm_processor, vlm_model, messages, image, device="cuda"):
    prompt = vlm_processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = vlm_processor(images=image, text=prompt, return_tensors="pt").to(device)
    output = vlm_model.generate(**inputs, max_new_tokens=200)
    generated_ids_trimmed = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, output)
    ]
    response_str = vlm_processor.decode(generated_ids_trimmed[0], skip_special_tokens=True)

    return response_str

def run_qwen2_vl_inference(vlm_processor, vlm_model, messages, image, device="cuda"):
    text = vlm_processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = vlm_processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
    )
    inputs = inputs.to(device)
    generated_ids = vlm_model.generate(**inputs, max_new_tokens=128)
    generated_ids_trimmed = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    response_str = vlm_processor.decode(generated_ids_trimmed[0], skip_special_tokens=True)
    return response_str
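run_qwen2_vl_inference depends on qwen_vl_utils.process_vision_info, which expects chat messages whose content blocks carry the image inline. A sketch of the assumed message layout (the real builders live in app/qwen2/instructions.py, which is not shown here):

from PIL import Image

# Hypothetical message layout consumed by apply_chat_template / process_vision_info:
# each user turn carries a list of typed content blocks, with the PIL image inline.
image = Image.new("RGB", (512, 512), "white")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Which object should be edited: 'remove the apple'?"},
        ],
    }
]
# response_str = run_qwen2_vl_inference(vlm_processor, vlm_model, messages, image, device="cuda")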
82 |
|
|
|
|
|
|
|
|
|
|
|
83 |
|
### response editing type
def vlm_response_editing_type(vlm_processor,
                              vlm_model,
                              image,
                              editing_prompt,
                              device):

    if isinstance(vlm_model, OpenAI):
        messages = create_editing_category_messages_gpt4o(editing_prompt)
        response_str = run_gpt4o_vl_inference(vlm_model, messages)
    elif isinstance(vlm_model, LlavaNextForConditionalGeneration):
        messages = create_editing_category_messages_llava(editing_prompt)
        response_str = run_llava_next_inference(vlm_processor, vlm_model, messages, image, device=device)
    elif isinstance(vlm_model, Qwen2VLForConditionalGeneration):
        messages = create_editing_category_messages_qwen2(editing_prompt)
        response_str = run_qwen2_vl_inference(vlm_processor, vlm_model, messages, image, device=device)

    for category_name in ["Addition", "Remove", "Local", "Global", "Background"]:
        if category_name.lower() in response_str.lower():
            return category_name
    raise gr.Error("Please input correct commands, including add, delete, and modify commands. If it still does not work, please switch to a more powerful VLM.")
105 |
|
106 |
|
### response object to be edited
def vlm_response_object_wait_for_edit(vlm_processor,
                                      vlm_model,
                                      image,
                                      category,
                                      editing_prompt,
                                      device):
    if category in ["Background", "Global", "Addition"]:
        edit_object = "nan"
        return edit_object

    if isinstance(vlm_model, OpenAI):
        messages = create_ori_object_messages_gpt4o(editing_prompt)
        response_str = run_gpt4o_vl_inference(vlm_model, messages)
    elif isinstance(vlm_model, LlavaNextForConditionalGeneration):
        messages = create_ori_object_messages_llava(editing_prompt)
        response_str = run_llava_next_inference(vlm_processor, vlm_model, messages, image, device)
    elif isinstance(vlm_model, Qwen2VLForConditionalGeneration):
        messages = create_ori_object_messages_qwen2(editing_prompt)
        response_str = run_qwen2_vl_inference(vlm_processor, vlm_model, messages, image, device)
    return response_str
128 |
|
129 |
|
130 |
+
### response mask
|
131 |
+
def vlm_response_mask(vlm_processor,
|
132 |
+
vlm_model,
|
133 |
category,
|
134 |
image,
|
135 |
editing_prompt,
|
|
|
138 |
sam_predictor=None,
|
139 |
sam_automask_generator=None,
|
140 |
groundingdino_model=None,
|
141 |
+
device=None,
|
142 |
):
|
143 |
mask = None
|
144 |
if editing_prompt is None or len(editing_prompt)==0:
|
145 |
raise gr.Error("Please input the editing instruction!")
|
146 |
height, width = image.shape[:2]
|
147 |
if category=="Addition":
|
|
|
|
|
148 |
try:
|
149 |
+
if isinstance(vlm_model, OpenAI):
|
150 |
+
base64_image = encode_image(image)
|
151 |
+
messages = create_add_object_messages_gpt4o(editing_prompt, base64_image, height=height, width=width)
|
152 |
+
response_str = run_gpt4o_vl_inference(vlm_model, messages)
|
153 |
+
elif isinstance(vlm_model, LlavaNextForConditionalGeneration):
|
154 |
+
messages = create_add_object_messages_llava(editing_prompt, height=height, width=width)
|
155 |
+
response_str = run_llava_next_inference(vlm_processor, vlm_model, messages, image, device)
|
156 |
+
elif isinstance(vlm_model, Qwen2VLForConditionalGeneration):
|
157 |
+
base64_image = encode_image(image)
|
158 |
+
messages = create_add_object_messages_qwen2(editing_prompt, base64_image, height=height, width=width)
|
159 |
+
response_str = run_qwen2_vl_inference(vlm_processor, vlm_model, messages, image, device)
|
160 |
pattern = r'\[\d{1,3}(?:,\s*\d{1,3}){3}\]'
|
161 |
box = re.findall(pattern, response_str)
|
162 |
box = box[0][1:-1].split(",")
|
|
|
166 |
cus_mask[box[1]: box[1]+box[3], box[0]: box[0]+box[2]]=255
|
167 |
mask = cus_mask
|
168 |
except:
|
169 |
+
raise gr.Error("Please set the mask manually, currently the VLM cannot output the mask!")
|
170 |
|
171 |
elif category=="Background":
|
172 |
labels = "background"
|
|
|
178 |
if mask is None:
|
179 |
for thresh in [0.3,0.25,0.2,0.15,0.1,0.05,0]:
|
180 |
try:
|
|
|
181 |
detections = run_grounded_sam(
|
182 |
input_image={"image":Image.fromarray(image.astype('uint8')),
|
183 |
"mask":None},
|
|
|
201 |
return mask
|
202 |
|
203 |
|
def vlm_response_prompt_after_apply_instruction(vlm_processor,
                                                vlm_model,
                                                image,
                                                editing_prompt,
                                                device):
    if isinstance(vlm_model, OpenAI):
        base64_image = encode_image(image)
        messages = create_apply_editing_messages_gpt4o(editing_prompt, base64_image)
        response_str = run_gpt4o_vl_inference(vlm_model, messages)
    elif isinstance(vlm_model, LlavaNextForConditionalGeneration):
        messages = create_apply_editing_messages_llava(editing_prompt)
        response_str = run_llava_next_inference(vlm_processor, vlm_model, messages, image, device)
    elif isinstance(vlm_model, Qwen2VLForConditionalGeneration):
        base64_image = encode_image(image)
        messages = create_apply_editing_messages_qwen2(editing_prompt, base64_image)
        response_str = run_qwen2_vl_inference(vlm_processor, vlm_model, messages, image, device)
    else:
        raise gr.Error("Please select the correct VLM model!")
    return response_str
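All four vlm_response_* helpers dispatch on the type of vlm_model (OpenAI client, LLaVA-Next, or Qwen2-VL). A hedged end-to-end usage sketch with the GPT-4o backend; the API key and the blank image are placeholders, not values from the repo:

import numpy as np
from openai import OpenAI

vlm_model = OpenAI(api_key="my_api_key")  # placeholder key
vlm_processor = None                      # only the HF backends need a processor
image = np.zeros((512, 512, 3), dtype=np.uint8)

category = vlm_response_editing_type(vlm_processor, vlm_model, image, "remove the apple", device="cuda")
target = vlm_response_prompt_after_apply_instruction(vlm_processor, vlm_model, image, "remove the apple", device="cuda")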
app/src/vlm_template.py
ADDED
@@ -0,0 +1,120 @@
1 |
+
import os
|
2 |
+
import sys
|
3 |
+
import torch
|
4 |
+
from openai import OpenAI
|
5 |
+
from transformers import (
|
6 |
+
LlavaNextProcessor, LlavaNextForConditionalGeneration,
|
7 |
+
Qwen2VLForConditionalGeneration, Qwen2VLProcessor
|
8 |
+
)
|
9 |
+
## init device
|
10 |
+
device = "cpu"
|
11 |
+
torch_dtype = torch.float16
|
12 |
+
|
13 |
+
|
14 |
+
vlms_list = [
|
15 |
+
# {
|
16 |
+
# "type": "llava-next",
|
17 |
+
# "name": "llava-v1.6-mistral-7b-hf",
|
18 |
+
# "local_path": "models/vlms/llava-v1.6-mistral-7b-hf",
|
19 |
+
# "processor": LlavaNextProcessor.from_pretrained(
|
20 |
+
# "models/vlms/llava-v1.6-mistral-7b-hf"
|
21 |
+
# ) if os.path.exists("models/vlms/llava-v1.6-mistral-7b-hf") else LlavaNextProcessor.from_pretrained(
|
22 |
+
# "llava-hf/llava-v1.6-mistral-7b-hf"
|
23 |
+
# ),
|
24 |
+
# "model": LlavaNextForConditionalGeneration.from_pretrained(
|
25 |
+
# "models/vlms/llava-v1.6-mistral-7b-hf", torch_dtype=torch_dtype, device_map=device
|
26 |
+
# ).to("cpu") if os.path.exists("models/vlms/llava-v1.6-mistral-7b-hf") else
|
27 |
+
# LlavaNextForConditionalGeneration.from_pretrained(
|
28 |
+
# "llava-hf/llava-v1.6-mistral-7b-hf", torch_dtype=torch_dtype, device_map=device
|
29 |
+
# ).to("cpu"),
|
30 |
+
# },
|
31 |
+
{
|
32 |
+
"type": "llava-next",
|
33 |
+
"name": "llama3-llava-next-8b-hf (Preload)",
|
34 |
+
"local_path": "models/vlms/llama3-llava-next-8b-hf",
|
35 |
+
"processor": LlavaNextProcessor.from_pretrained(
|
36 |
+
"models/vlms/llama3-llava-next-8b-hf"
|
37 |
+
) if os.path.exists("models/vlms/llama3-llava-next-8b-hf") else LlavaNextProcessor.from_pretrained(
|
38 |
+
"llava-hf/llama3-llava-next-8b-hf"
|
39 |
+
),
|
40 |
+
"model": LlavaNextForConditionalGeneration.from_pretrained(
|
41 |
+
"models/vlms/llama3-llava-next-8b-hf", torch_dtype=torch_dtype, device_map=device
|
42 |
+
).to("cpu") if os.path.exists("models/vlms/llama3-llava-next-8b-hf") else
|
43 |
+
LlavaNextForConditionalGeneration.from_pretrained(
|
44 |
+
"llava-hf/llama3-llava-next-8b-hf", torch_dtype=torch_dtype, device_map=device
|
45 |
+
).to("cpu"),
|
46 |
+
},
|
47 |
+
# {
|
48 |
+
# "type": "llava-next",
|
49 |
+
# "name": "llava-v1.6-vicuna-13b-hf",
|
50 |
+
# "local_path": "models/vlms/llava-v1.6-vicuna-13b-hf",
|
51 |
+
# "processor": LlavaNextProcessor.from_pretrained(
|
52 |
+
# "models/vlms/llava-v1.6-vicuna-13b-hf"
|
53 |
+
# ) if os.path.exists("models/vlms/llava-v1.6-vicuna-13b-hf") else LlavaNextProcessor.from_pretrained(
|
54 |
+
# "llava-hf/llava-v1.6-vicuna-13b-hf"
|
55 |
+
# ),
|
56 |
+
# "model": LlavaNextForConditionalGeneration.from_pretrained(
|
57 |
+
# "models/vlms/llava-v1.6-vicuna-13b-hf", torch_dtype=torch_dtype, device_map=device
|
58 |
+
# ).to("cpu") if os.path.exists("models/vlms/llava-v1.6-vicuna-13b-hf") else
|
59 |
+
# LlavaNextForConditionalGeneration.from_pretrained(
|
60 |
+
# "llava-hf/llava-v1.6-vicuna-13b-hf", torch_dtype=torch_dtype, device_map=device
|
61 |
+
# ).to("cpu"),
|
62 |
+
# },
|
63 |
+
# {
|
64 |
+
# "type": "llava-next",
|
65 |
+
# "name": "llava-v1.6-34b-hf",
|
66 |
+
# "local_path": "models/vlms/llava-v1.6-34b-hf",
|
67 |
+
# "processor": LlavaNextProcessor.from_pretrained(
|
68 |
+
# "models/vlms/llava-v1.6-34b-hf"
|
69 |
+
# ) if os.path.exists("models/vlms/llava-v1.6-34b-hf") else LlavaNextProcessor.from_pretrained(
|
70 |
+
# "llava-hf/llava-v1.6-34b-hf"
|
71 |
+
# ),
|
72 |
+
# "model": LlavaNextForConditionalGeneration.from_pretrained(
|
73 |
+
# "models/vlms/llava-v1.6-34b-hf", torch_dtype=torch_dtype, device_map=device
|
74 |
+
# ).to("cpu") if os.path.exists("models/vlms/llava-v1.6-34b-hf") else
|
75 |
+
# LlavaNextForConditionalGeneration.from_pretrained(
|
76 |
+
# "llava-hf/llava-v1.6-34b-hf", torch_dtype=torch_dtype, device_map=device
|
77 |
+
# ).to("cpu"),
|
78 |
+
# },
|
79 |
+
# {
|
80 |
+
# "type": "qwen2-vl",
|
81 |
+
# "name": "Qwen2-VL-2B-Instruct",
|
82 |
+
# "local_path": "models/vlms/Qwen2-VL-2B-Instruct",
|
83 |
+
# "processor": Qwen2VLProcessor.from_pretrained(
|
84 |
+
# "models/vlms/Qwen2-VL-2B-Instruct"
|
85 |
+
# ) if os.path.exists("models/vlms/Qwen2-VL-2B-Instruct") else Qwen2VLProcessor.from_pretrained(
|
86 |
+
# "Qwen/Qwen2-VL-2B-Instruct"
|
87 |
+
# ),
|
88 |
+
# "model": Qwen2VLForConditionalGeneration.from_pretrained(
|
89 |
+
# "models/vlms/Qwen2-VL-2B-Instruct", torch_dtype=torch_dtype, device_map=device
|
90 |
+
# ).to("cpu") if os.path.exists("models/vlms/Qwen2-VL-2B-Instruct") else
|
91 |
+
# Qwen2VLForConditionalGeneration.from_pretrained(
|
92 |
+
# "Qwen/Qwen2-VL-2B-Instruct", torch_dtype=torch_dtype, device_map=device
|
93 |
+
# ).to("cpu"),
|
94 |
+
# },
|
95 |
+
{
|
96 |
+
"type": "qwen2-vl",
|
97 |
+
"name": "Qwen2-VL-7B-Instruct (Default)",
|
98 |
+
"local_path": "models/vlms/Qwen2-VL-7B-Instruct",
|
99 |
+
"processor": Qwen2VLProcessor.from_pretrained(
|
100 |
+
"models/vlms/Qwen2-VL-7B-Instruct"
|
101 |
+
) if os.path.exists("models/vlms/Qwen2-VL-7B-Instruct") else Qwen2VLProcessor.from_pretrained(
|
102 |
+
"Qwen/Qwen2-VL-7B-Instruct"
|
103 |
+
),
|
104 |
+
"model": Qwen2VLForConditionalGeneration.from_pretrained(
|
105 |
+
"models/vlms/Qwen2-VL-7B-Instruct", torch_dtype=torch_dtype, device_map=device
|
106 |
+
).to("cpu") if os.path.exists("models/vlms/Qwen2-VL-7B-Instruct") else
|
107 |
+
Qwen2VLForConditionalGeneration.from_pretrained(
|
108 |
+
"Qwen/Qwen2-VL-7B-Instruct", torch_dtype=torch_dtype, device_map=device
|
109 |
+
).to("cpu"),
|
110 |
+
},
|
111 |
+
{
|
112 |
+
"type": "openai",
|
113 |
+
"name": "GPT4-o (Highly Recommended)",
|
114 |
+
"local_path": "",
|
115 |
+
"processor": "",
|
116 |
+
"model": ""
|
117 |
+
},
|
118 |
+
]
|
119 |
+
|
120 |
+
vlms_template = {k["name"]: (k["type"], k["local_path"], k["processor"], k["model"]) for k in vlms_list}
|
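The vlms_template dictionary keys each backend by its display name so the app can switch VLMs from a dropdown. Below is a minimal lookup sketch, assuming this module is importable as app.src.vlm_template; pick_vlm is a hypothetical helper, not part of the repository.

from app.src.vlm_template import vlms_template

def pick_vlm(display_name: str):
    # Each value is a (type, local_path, processor, model) tuple built above.
    vlm_type, local_path, processor, model = vlms_template[display_name]
    if vlm_type == "openai":
        # The GPT4-o entry carries no local weights; presumably an OpenAI client is
        # constructed at runtime from the user's API key instead.
        return vlm_type, None, None
    return vlm_type, processor, model

# vlm_type, processor, model = pick_vlm("Qwen2-VL-7B-Instruct (Default)")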
app/utils/GroundingDINO_SwinT_OGC.py
ADDED
@@ -0,0 +1,43 @@
batch_size = 1
modelname = "groundingdino"
backbone = "swin_T_224_1k"
position_embedding = "sine"
pe_temperatureH = 20
pe_temperatureW = 20
return_interm_indices = [1, 2, 3]
backbone_freeze_keywords = None
enc_layers = 6
dec_layers = 6
pre_norm = False
dim_feedforward = 2048
hidden_dim = 256
dropout = 0.0
nheads = 8
num_queries = 900
query_dim = 4
num_patterns = 0
num_feature_levels = 4
enc_n_points = 4
dec_n_points = 4
two_stage_type = "standard"
two_stage_bbox_embed_share = False
two_stage_class_embed_share = False
transformer_activation = "relu"
dec_pred_bbox_embed_share = True
dn_box_noise_scale = 1.0
dn_label_noise_ratio = 0.5
dn_label_coef = 1.0
dn_bbox_coef = 1.0
embed_init_tgt = True
dn_labelbook_size = 2000
max_text_len = 256
text_encoder_type = "bert-base-uncased"
use_text_enhancer = True
use_fusion_layer = True
use_checkpoint = True
use_transformer_ckpt = True
use_text_cross_attention = True
text_dropout = 0.0
fusion_dropout = 0.0
fusion_droppath = 0.1
sub_sentence_present = True
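The file above is a flat Python config of GroundingDINO (Swin-T, OGC) hyperparameters. Below is a hedged sketch of how such a config is typically consumed, following the public GroundingDINO demo API rather than this repository's own loader; the checkpoint path is an example only.

import torch
from groundingdino.models import build_model
from groundingdino.util.slconfig import SLConfig
from groundingdino.util.utils import clean_state_dict

def load_grounding_dino(config_path: str, ckpt_path: str, device: str = "cpu"):
    args = SLConfig.fromfile(config_path)    # parses the flat variables defined above
    args.device = device
    model = build_model(args)                # Swin-T backbone + BERT text encoder detector
    state = torch.load(ckpt_path, map_location="cpu")
    model.load_state_dict(clean_state_dict(state["model"]), strict=False)
    return model.eval().to(device)

# model = load_grounding_dino("app/utils/GroundingDINO_SwinT_OGC.py",
#                             "models/grounding_dino/groundingdino_swint_ogc.pth")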
assets/angel_christmas/angel_christmas.png
ADDED
Git LFS Details
assets/angel_christmas/image_edit_f15d9b45-c978-4e3d-9f5f-251e308560c3_0.png
ADDED
Git LFS Details
assets/angel_christmas/mask_f15d9b45-c978-4e3d-9f5f-251e308560c3.png
ADDED
Git LFS Details
assets/angel_christmas/masked_image_f15d9b45-c978-4e3d-9f5f-251e308560c3.png
ADDED
Git LFS Details
assets/angel_christmas/prompt.txt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:89ed635310c87d5f2d8f32d813018acd5040edd745a29c5bf84a435916525789
size 27
assets/anime_flower/anime_flower.png
ADDED
Git LFS Details
assets/anime_flower/image_edit_37553172-9b38-4727-bf2e-37d7e2b93461_2.png
ADDED
Git LFS Details
assets/anime_flower/mask_37553172-9b38-4727-bf2e-37d7e2b93461.png
ADDED
Git LFS Details
assets/anime_flower/masked_image_37553172-9b38-4727-bf2e-37d7e2b93461.png
ADDED
Git LFS Details
assets/anime_flower/prompt.txt
ADDED
@@ -0,0 +1 @@
648464818: remove the flower.
assets/brushedit_teaser.png
ADDED
Git LFS Details
assets/chenduling/chengduling.jpg
ADDED
Git LFS Details
assets/chenduling/image_edit_68e3ff6f-da07-4b37-91df-13d6eed7b997_0.png
ADDED
Git LFS Details
assets/chenduling/mask_68e3ff6f-da07-4b37-91df-13d6eed7b997.png
ADDED
Git LFS Details
assets/chenduling/masked_image_68e3ff6f-da07-4b37-91df-13d6eed7b997.png
ADDED
Git LFS Details
assets/chenduling/prompt.txt
ADDED
@@ -0,0 +1 @@
648464818: replace the clothes to a delicated floral skirt
assets/chinese_girl/chinese_girl.png
ADDED
Git LFS Details
assets/chinese_girl/image_edit_54759648-0989-48e0-bc82-f20e28b5ec29_1.png
ADDED
Git LFS Details
assets/chinese_girl/mask_54759648-0989-48e0-bc82-f20e28b5ec29.png
ADDED
Git LFS Details
assets/chinese_girl/masked_image_54759648-0989-48e0-bc82-f20e28b5ec29.png
ADDED
Git LFS Details
assets/chinese_girl/prompt.txt
ADDED
@@ -0,0 +1 @@
648464818: replace the background to ancient China.
assets/demo_vis.png
ADDED
Git LFS Details
assets/example.png
ADDED
Git LFS Details
assets/frog/frog.jpeg
ADDED
Git LFS Details
assets/frog/image_edit_f7b350de-6f2c-49e3-b535-995c486d78e7_1.png
ADDED
Git LFS Details
assets/frog/mask_f7b350de-6f2c-49e3-b535-995c486d78e7.png
ADDED
Git LFS Details
assets/frog/masked_image_f7b350de-6f2c-49e3-b535-995c486d78e7.png
ADDED
Git LFS Details
assets/frog/prompt.txt
ADDED
@@ -0,0 +1 @@
648464818: add a magic hat on frog head.
assets/girl_on_sun/girl_on_sun.png
ADDED
Git LFS Details
assets/girl_on_sun/image_edit_264eac8b-8b65-479c-9755-020a60880c37_0.png
ADDED
Git LFS Details
assets/girl_on_sun/mask_264eac8b-8b65-479c-9755-020a60880c37.png
ADDED
Git LFS Details
assets/girl_on_sun/masked_image_264eac8b-8b65-479c-9755-020a60880c37.png
ADDED
Git LFS Details
assets/girl_on_sun/prompt.txt
ADDED
@@ -0,0 +1 @@
648464818: add a butterfly fairy.