watchtowerss committed
Commit
4d1ebf3
1 Parent(s): 663e9a6

track-anything --version 1

This view is limited to 50 files because the commit contains too many changes. See the raw diff for the full change set.

Files changed (50)
  1. .gitattributes +3 -0
  2. LICENSE +21 -0
  3. README.md +47 -13
  4. XMem-s012.pth +3 -0
  5. app.py +362 -0
  6. app_save.py +381 -0
  7. app_test.py +23 -0
  8. assets/demo_version_1.MP4 +3 -0
  9. assets/inpainting.gif +3 -0
  10. assets/poster_demo_version_1.png +0 -0
  11. assets/qingming.mp4 +3 -0
  12. demo.py +87 -0
  13. images/groceries.jpg +0 -0
  14. images/mask_painter.png +0 -0
  15. images/painter_input_image.jpg +0 -0
  16. images/painter_input_mask.jpg +0 -0
  17. images/painter_output_image.png +0 -0
  18. images/painter_output_image__.png +0 -0
  19. images/point_painter.png +0 -0
  20. images/point_painter_1.png +0 -0
  21. images/point_painter_2.png +0 -0
  22. images/truck.jpg +0 -0
  23. images/truck_both.jpg +0 -0
  24. images/truck_mask.jpg +0 -0
  25. images/truck_point.jpg +0 -0
  26. inpainter/.DS_Store +0 -0
  27. inpainter/base_inpainter.py +160 -0
  28. inpainter/config/config.yaml +4 -0
  29. inpainter/model/e2fgvi.py +350 -0
  30. inpainter/model/e2fgvi_hq.py +350 -0
  31. inpainter/model/modules/feat_prop.py +149 -0
  32. inpainter/model/modules/flow_comp.py +450 -0
  33. inpainter/model/modules/spectral_norm.py +288 -0
  34. inpainter/model/modules/tfocal_transformer.py +536 -0
  35. inpainter/model/modules/tfocal_transformer_hq.py +565 -0
  36. inpainter/util/__init__.py +0 -0
  37. inpainter/util/tensor_util.py +24 -0
  38. requirements.txt +17 -0
  39. sam_vit_h_4b8939.pth +3 -0
  40. template.html +27 -0
  41. templates/index.html +50 -0
  42. text_server.py +72 -0
  43. tools/__init__.py +0 -0
  44. tools/base_segmenter.py +129 -0
  45. tools/interact_tools.py +265 -0
  46. tools/mask_painter.py +288 -0
  47. tools/painter.py +215 -0
  48. track_anything.py +93 -0
  49. tracker/.DS_Store +0 -0
  50. tracker/base_tracker.py +233 -0
.gitattributes CHANGED
@@ -32,3 +32,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ assets/demo_version_1.MP4 filter=lfs diff=lfs merge=lfs -text
+ assets/inpainting.gif filter=lfs diff=lfs merge=lfs -text
+ assets/qingming.mp4 filter=lfs diff=lfs merge=lfs -text
LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2023 Mingqi Gao
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
README.md CHANGED
@@ -1,13 +1,47 @@
- ---
- title: Track Anything
- emoji: 🐠
- colorFrom: purple
- colorTo: indigo
- sdk: gradio
- sdk_version: 3.27.0
- app_file: app.py
- pinned: false
- license: mit
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # Track-Anything
+
+ ***Track-Anything*** is a flexible and interactive tool for video object tracking and segmentation. It is built upon [Segment Anything](https://github.com/facebookresearch/segment-anything) and can specify anything to track and segment with user clicks only. During tracking, users can flexibly change the objects they want to track or correct the region of interest if any ambiguities arise. These characteristics make ***Track-Anything*** suitable for:
+ - Video object tracking and segmentation with shot changes.
+ - Data annotation for video object tracking and segmentation.
+ - Object-centric downstream video tasks, such as video inpainting and editing.
+
+ ## Demo
+
+ https://user-images.githubusercontent.com/28050374/232842703-8395af24-b13e-4b8e-aafb-e94b61e6c449.MP4
+
+ ### Multiple Object Tracking and Segmentation (with [XMem](https://github.com/hkchengrex/XMem))
+
+ https://user-images.githubusercontent.com/39208339/233035206-0a151004-6461-4deb-b782-d1dbfe691493.mp4
+
+ ### Video Object Tracking and Segmentation with Shot Changes (with [XMem](https://github.com/hkchengrex/XMem))
+
+ https://user-images.githubusercontent.com/30309970/232848349-f5e29e71-2ea4-4529-ac9a-94b9ca1e7055.mp4
+
+ ### Video Inpainting (with [E2FGVI](https://github.com/MCG-NKU/E2FGVI))
+
+ https://user-images.githubusercontent.com/28050374/232959816-07f2826f-d267-4dda-8ae5-a5132173b8f4.mp4
+
+ ## Get Started
+ #### Linux
+ ```bash
+ # Clone the repository:
+ git clone https://github.com/gaomingqi/Track-Anything.git
+ cd Track-Anything
+
+ # Install dependencies:
+ pip install -r requirements.txt
+
+ # Install dependencies for inpainting:
+ pip install -U openmim
+ mim install mmcv
+
+ # Install dependencies for editing:
+ pip install madgrad
+
+ # Run the Track-Anything gradio demo.
+ python app.py --device cuda:0 --sam_model_type vit_h --port 12212
+ ```
+
+ ## Acknowledgements
+
+ The project is based on [Segment Anything](https://github.com/facebookresearch/segment-anything), [XMem](https://github.com/hkchengrex/XMem), and [E2FGVI](https://github.com/MCG-NKU/E2FGVI). Thanks to the authors for their efforts.
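For context while reading the rest of this diff: the Gradio app added below simply wires UI events to a handful of calls on `TrackingAnything`. A minimal sketch of driving the same pipeline from a script, assuming the checkpoints have already been fetched into `./checkpoints` (as `download_checkpoint` in `app.py` does) and using the bundled `assets/qingming.mp4` with a made-up click coordinate:

```python
import cv2
import numpy as np
from track_anything import TrackingAnything, parse_augment

# Checkpoints placed where app.py's download_checkpoint() stores them (assumed local paths).
sam_ckpt = "./checkpoints/sam_vit_h_4b8939.pth"
xmem_ckpt = "./checkpoints/XMem-s012.pth"
model = TrackingAnything(sam_ckpt, xmem_ckpt, parse_augment())

# Read RGB frames, mirroring get_frames_from_video() in app.py.
cap = cv2.VideoCapture("assets/qingming.mp4")
frames = []
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break
    frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

# One positive click (hypothetical coordinate) on the template frame.
model.samcontroler.sam_controler.set_image(frames[0])
mask, logit, painted = model.first_frame_click(
    image=frames[0],
    points=np.array([[320, 240]]),
    labels=np.array([1]),
    multimask="True",  # string value, exactly as passed in app.py
)

# Propagate the mask through the remaining frames with XMem.
masks, logits, painted_frames = model.generator(images=frames, template_mask=mask)
```

The UI handlers in `app.py` (`select_template`, `sam_refine`, `vos_tracking_video`) are thin wrappers around these same calls.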
XMem-s012.pth ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:16205ad04bfc55b442bd4d7af894382e09868b35e10721c5afc09a24ea8d72d9
+ size 249026057
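The three `+` lines above are a Git LFS pointer rather than the weights themselves: the ~249 MB XMem checkpoint is stored in LFS and identified by the SHA-256 digest in `oid`. If the checkpoint is fetched separately (as `app.py` does, into `./checkpoints`), that digest can be used to sanity-check the download. A small sketch, assuming the local path and that the fetched release file is identical to the one committed here:

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Stream the file in 1 MB chunks so the ~249 MB checkpoint is never loaded at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# oid recorded in the LFS pointer above
EXPECTED = "16205ad04bfc55b442bd4d7af894382e09868b35e10721c5afc09a24ea8d72d9"
assert sha256_of("./checkpoints/XMem-s012.pth") == EXPECTED, "XMem checkpoint is corrupt or incomplete"
```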
app.py ADDED
@@ -0,0 +1,362 @@
1
+ import gradio as gr
2
+ from demo import automask_image_app, automask_video_app, sahi_autoseg_app
3
+ import argparse
4
+ import cv2
5
+ import time
6
+ from PIL import Image
7
+ import numpy as np
8
+ import os
9
+ import sys
10
+ sys.path.append(sys.path[0]+"/tracker")
11
+ sys.path.append(sys.path[0]+"/tracker/model")
12
+ from track_anything import TrackingAnything
13
+ from track_anything import parse_augment
14
+ import requests
15
+ import json
16
+ import torchvision
17
+ import torch
18
+ import concurrent.futures
19
+ import queue
20
+
21
+ # download checkpoints
22
+ def download_checkpoint(url, folder, filename):
23
+ os.makedirs(folder, exist_ok=True)
24
+ filepath = os.path.join(folder, filename)
25
+
26
+ if not os.path.exists(filepath):
27
+ print("downloading checkpoints ...")
28
+ response = requests.get(url, stream=True)
29
+ with open(filepath, "wb") as f:
30
+ for chunk in response.iter_content(chunk_size=8192):
31
+ if chunk:
32
+ f.write(chunk)
33
+
34
+ print("downloaded successfully!")
35
+
36
+ return filepath
37
+
38
+ # convert points input to prompt state
39
+ def get_prompt(click_state, click_input):
40
+ inputs = json.loads(click_input)
41
+ points = click_state[0]
42
+ labels = click_state[1]
43
+ for input in inputs:
44
+ points.append(input[:2])
45
+ labels.append(input[2])
46
+ click_state[0] = points
47
+ click_state[1] = labels
48
+ prompt = {
49
+ "prompt_type":["click"],
50
+ "input_point":click_state[0],
51
+ "input_label":click_state[1],
52
+ "multimask_output":"True",
53
+ }
54
+ return prompt
55
+
56
+ # extract frames from upload video
57
+ def get_frames_from_video(video_input, video_state):
58
+ """
59
+ Args:
60
+ video_path:str
61
+ timestamp:float64
62
+ Return
63
+ [[0:nearest_frame], [nearest_frame:], nearest_frame]
64
+ """
65
+ video_path = video_input
66
+ frames = []
67
+ try:
68
+ cap = cv2.VideoCapture(video_path)
69
+ fps = cap.get(cv2.CAP_PROP_FPS)
70
+ while cap.isOpened():
71
+ ret, frame = cap.read()
72
+ if ret == True:
73
+ frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
74
+ else:
75
+ break
76
+ except (OSError, TypeError, ValueError, KeyError, SyntaxError) as e:
77
+ print("read_frame_source:{} error. {}\n".format(video_path, str(e)))
78
+
79
+ # initialize video_state
80
+ video_state = {
81
+ "video_name": os.path.split(video_path)[-1],
82
+ "origin_images": frames,
83
+ "painted_images": frames.copy(),
84
+ "masks": [None]*len(frames),
85
+ "logits": [None]*len(frames),
86
+ "select_frame_number": 0,
87
+ "fps": 30
88
+ }
89
+ return video_state, gr.update(visible=True, maximum=len(frames), value=1)
90
+
91
+ # get the select frame from gradio slider
92
+ def select_template(image_selection_slider, video_state):
93
+
94
+ # images = video_state[1]
95
+ image_selection_slider -= 1
96
+ video_state["select_frame_number"] = image_selection_slider
97
+
98
+ # once select a new template frame, set the image in sam
99
+
100
+ model.samcontroler.sam_controler.reset_image()
101
+ model.samcontroler.sam_controler.set_image(video_state["origin_images"][image_selection_slider])
102
+
103
+
104
+ return video_state["painted_images"][image_selection_slider], video_state
105
+
106
+ # use sam to get the mask
107
+ def sam_refine(video_state, point_prompt, click_state, interactive_state, evt:gr.SelectData):
108
+ """
109
+ Args:
110
+ template_frame: PIL.Image
111
+ point_prompt: flag for positive or negative button click
112
+ click_state: [[points], [labels]]
113
+ """
114
+ if point_prompt == "Positive":
115
+ coordinate = "[[{},{},1]]".format(evt.index[0], evt.index[1])
116
+ interactive_state["positive_click_times"] += 1
117
+ else:
118
+ coordinate = "[[{},{},0]]".format(evt.index[0], evt.index[1])
119
+ interactive_state["negative_click_times"] += 1
120
+
121
+ # prompt for sam model
122
+ prompt = get_prompt(click_state=click_state, click_input=coordinate)
123
+
124
+ mask, logit, painted_image = model.first_frame_click(
125
+ image=video_state["origin_images"][video_state["select_frame_number"]],
126
+ points=np.array(prompt["input_point"]),
127
+ labels=np.array(prompt["input_label"]),
128
+ multimask=prompt["multimask_output"],
129
+ )
130
+ video_state["masks"][video_state["select_frame_number"]] = mask
131
+ video_state["logits"][video_state["select_frame_number"]] = logit
132
+ video_state["painted_images"][video_state["select_frame_number"]] = painted_image
133
+
134
+ return painted_image, video_state, interactive_state
135
+
136
+ # tracking vos
137
+ def vos_tracking_video(video_state, interactive_state):
138
+ model.xmem.clear_memory()
139
+ following_frames = video_state["origin_images"][video_state["select_frame_number"]:]
140
+ template_mask = video_state["masks"][video_state["select_frame_number"]]
141
+ fps = video_state["fps"]
142
+ masks, logits, painted_images = model.generator(images=following_frames, template_mask=template_mask)
143
+
144
+ video_state["masks"][video_state["select_frame_number"]:] = masks
145
+ video_state["logits"][video_state["select_frame_number"]:] = logits
146
+ video_state["painted_images"][video_state["select_frame_number"]:] = painted_images
147
+
148
+ video_output = generate_video_from_frames(video_state["painted_images"], output_path="./result/{}".format(video_state["video_name"]), fps=fps) # import video_input to name the output video
149
+ interactive_state["inference_times"] += 1
150
+
151
+ print("For generating this tracking result, inference times: {}, click times: {}, positive: {}, negative: {}".format(interactive_state["inference_times"],
152
+ interactive_state["positive_click_times"]+interactive_state["negative_click_times"],
153
+ interactive_state["positive_click_times"],
154
+ interactive_state["negative_click_times"]))
155
+
156
+ #### shanggao code for mask save
157
+ if interactive_state["mask_save"]:
158
+ if not os.path.exists('./result/mask/{}'.format(video_state["video_name"].split('.')[0])):
159
+ os.makedirs('./result/mask/{}'.format(video_state["video_name"].split('.')[0]))
160
+ i = 0
161
+ print("save mask")
162
+ for mask in video_state["masks"]:
163
+ np.save(os.path.join('./result/mask/{}'.format(video_state["video_name"].split('.')[0]), '{:05d}.npy'.format(i)), mask)
164
+ i+=1
165
+ # save_mask(video_state["masks"], video_state["video_name"])
166
+ #### shanggao code for mask save
167
+ return video_output, video_state, interactive_state
168
+
169
+ # generate video after vos inference
170
+ def generate_video_from_frames(frames, output_path, fps=30):
171
+ """
172
+ Generates a video from a list of frames.
173
+
174
+ Args:
175
+ frames (list of numpy arrays): The frames to include in the video.
176
+ output_path (str): The path to save the generated video.
177
+ fps (int, optional): The frame rate of the output video. Defaults to 30.
178
+ """
179
+ frames = torch.from_numpy(np.asarray(frames))
180
+ if not os.path.exists(os.path.dirname(output_path)):
181
+ os.makedirs(os.path.dirname(output_path))
182
+ torchvision.io.write_video(output_path, frames, fps=fps, video_codec="libx264")
183
+ return output_path
184
+
185
+ # check and download checkpoints if needed
186
+ SAM_checkpoint = "sam_vit_h_4b8939.pth"
187
+ sam_checkpoint_url = "https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth"
188
+ xmem_checkpoint = "XMem-s012.pth"
189
+ xmem_checkpoint_url = "https://github.com/hkchengrex/XMem/releases/download/v1.0/XMem-s012.pth"
190
+ folder ="./checkpoints"
191
+ SAM_checkpoint = download_checkpoint(sam_checkpoint_url, folder, SAM_checkpoint)
192
+ xmem_checkpoint = download_checkpoint(xmem_checkpoint_url, folder, xmem_checkpoint)
193
+
194
+ # args, defined in track_anything.py
195
+ args = parse_augment()
196
+ # args.port = 12212
197
+ # args.device = "cuda:4"
198
+ # args.mask_save = True
199
+
200
+ model = TrackingAnything(SAM_checkpoint, xmem_checkpoint, args)
201
+
202
+ with gr.Blocks() as iface:
203
+ """
204
+ state for
205
+ """
206
+ click_state = gr.State([[],[]])
207
+ interactive_state = gr.State({
208
+ "inference_times": 0,
209
+ "negative_click_times" : 0,
210
+ "positive_click_times": 0,
211
+ "mask_save": args.mask_save
212
+ })
213
+ video_state = gr.State(
214
+ {
215
+ "video_name": "",
216
+ "origin_images": None,
217
+ "painted_images": None,
218
+ "masks": None,
219
+ "logits": None,
220
+ "select_frame_number": 0,
221
+ "fps": 30
222
+ }
223
+ )
224
+
225
+ with gr.Row():
226
+
227
+ # for user video input
228
+ with gr.Column(scale=1.0):
229
+ video_input = gr.Video().style(height=360)
230
+
231
+
232
+
233
+ with gr.Row(scale=1):
234
+ # put the template frame under the radio button
235
+ with gr.Column(scale=0.5):
236
+ # extract frames
237
+ with gr.Column():
238
+ extract_frames_button = gr.Button(value="Get video info", interactive=True, variant="primary")
239
+
240
+ # click point settings: negative or positive, mode continuous or single
241
+ with gr.Row():
242
+ with gr.Row(scale=0.5):
243
+ point_prompt = gr.Radio(
244
+ choices=["Positive", "Negative"],
245
+ value="Positive",
246
+ label="Point Prompt",
247
+ interactive=True)
248
+ click_mode = gr.Radio(
249
+ choices=["Continuous", "Single"],
250
+ value="Continuous",
251
+ label="Clicking Mode",
252
+ interactive=True)
253
+ with gr.Row(scale=0.5):
254
+ clear_button_clike = gr.Button(value="Clear Clicks", interactive=True).style(height=160)
255
+ clear_button_image = gr.Button(value="Clear Image", interactive=True)
256
+ template_frame = gr.Image(type="pil",interactive=True, elem_id="template_frame").style(height=360)
257
+ image_selection_slider = gr.Slider(minimum=1, maximum=100, step=1, value=1, label="Image Selection", invisible=False)
258
+
259
+
260
+
261
+
262
+ with gr.Column(scale=0.5):
263
+ video_output = gr.Video().style(height=360)
264
+ tracking_video_predict_button = gr.Button(value="Tracking")
265
+
266
+ # first step: get the video information
267
+ extract_frames_button.click(
268
+ fn=get_frames_from_video,
269
+ inputs=[
270
+ video_input, video_state
271
+ ],
272
+ outputs=[video_state, image_selection_slider],
273
+ )
274
+
275
+ # second step: select images from slider
276
+ image_selection_slider.release(fn=select_template,
277
+ inputs=[image_selection_slider, video_state],
278
+ outputs=[template_frame, video_state], api_name="select_image")
279
+
280
+
281
+ template_frame.select(
282
+ fn=sam_refine,
283
+ inputs=[video_state, point_prompt, click_state, interactive_state],
284
+ outputs=[template_frame, video_state, interactive_state]
285
+ )
286
+
287
+ tracking_video_predict_button.click(
288
+ fn=vos_tracking_video,
289
+ inputs=[video_state, interactive_state],
290
+ outputs=[video_output, video_state, interactive_state]
291
+ )
292
+
293
+
294
+ # clear input
295
+ video_input.clear(
296
+ lambda: (
297
+ {
298
+ "origin_images": None,
299
+ "painted_images": None,
300
+ "masks": None,
301
+ "logits": None,
302
+ "select_frame_number": 0,
303
+ "fps": 30
304
+ },
305
+ {
306
+ "inference_times": 0,
307
+ "negative_click_times" : 0,
308
+ "positive_click_times": 0,
309
+ "mask_save": args.mask_save
310
+ },
311
+ [[],[]]
312
+ ),
313
+ [],
314
+ [
315
+ video_state,
316
+ interactive_state,
317
+ click_state,
318
+ ],
319
+ queue=False,
320
+ show_progress=False
321
+ )
322
+ clear_button_image.click(
323
+ lambda: (
324
+ {
325
+ "origin_images": None,
326
+ "painted_images": None,
327
+ "masks": None,
328
+ "logits": None,
329
+ "select_frame_number": 0,
330
+ "fps": 30
331
+ },
332
+ {
333
+ "inference_times": 0,
334
+ "negative_click_times" : 0,
335
+ "positive_click_times": 0,
336
+ "mask_save": args.mask_save
337
+ },
338
+ [[],[]]
339
+ ),
340
+ [],
341
+ [
342
+ video_state,
343
+ interactive_state,
344
+ click_state,
345
+ ],
346
+
347
+ queue=False,
348
+ show_progress=False
349
+
350
+ )
351
+ clear_button_clike.click(
352
+ lambda: ([[],[]]),
353
+ [],
354
+ [click_state],
355
+ queue=False,
356
+ show_progress=False
357
+ )
358
+ iface.queue(concurrency_count=1)
359
+ iface.launch(debug=True, enable_queue=True, server_port=args.port, server_name="0.0.0.0")
360
+
361
+
362
+
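A quick worked example of the click bookkeeping in `app.py` above: `sam_refine` encodes each click as a JSON string `[[x, y, label]]`, and `get_prompt` accumulates those clicks in `click_state` before building the SAM prompt. The snippet below reproduces that helper (lightly condensed) with made-up coordinates to show what ends up in the prompt:

```python
import json

click_state = [[], []]                     # [[points], [labels]], as initialised in app.py

def get_prompt(click_state, click_input):  # condensed copy of the helper in app.py
    points, labels = click_state
    for p in json.loads(click_input):
        points.append(p[:2])
        labels.append(p[2])
    return {
        "prompt_type": ["click"],
        "input_point": points,
        "input_label": labels,
        "multimask_output": "True",
    }

get_prompt(click_state, "[[100,50,1]]")             # positive click at (100, 50)
prompt = get_prompt(click_state, "[[200,120,0]]")   # negative click at (200, 120)
print(prompt["input_point"])   # [[100, 50], [200, 120]]
print(prompt["input_label"])   # [1, 0]
```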
app_save.py ADDED
@@ -0,0 +1,381 @@
1
+ import gradio as gr
2
+ from demo import automask_image_app, automask_video_app, sahi_autoseg_app
3
+ import argparse
4
+ import cv2
5
+ import time
6
+ from PIL import Image
7
+ import numpy as np
8
+ import os
9
+ import sys
10
+ sys.path.append(sys.path[0]+"/tracker")
11
+ sys.path.append(sys.path[0]+"/tracker/model")
12
+ from track_anything import TrackingAnything
13
+ from track_anything import parse_augment
14
+ import requests
15
+ import json
16
+ import torchvision
17
+ import torch
18
+ import concurrent.futures
19
+ import queue
20
+
21
+ def download_checkpoint(url, folder, filename):
22
+ os.makedirs(folder, exist_ok=True)
23
+ filepath = os.path.join(folder, filename)
24
+
25
+ if not os.path.exists(filepath):
26
+ print("downloading checkpoints ...")
27
+ response = requests.get(url, stream=True)
28
+ with open(filepath, "wb") as f:
29
+ for chunk in response.iter_content(chunk_size=8192):
30
+ if chunk:
31
+ f.write(chunk)
32
+
33
+ print("downloaded successfully!")
34
+
35
+ return filepath
36
+
37
+ def pause_video(play_state):
38
+ print("user pause_video")
39
+ play_state.append(time.time())
40
+ return play_state
41
+
42
+ def play_video(play_state):
43
+ print("user play_video")
44
+ play_state.append(time.time())
45
+ return play_state
46
+
47
+ # convert points input to prompt state
48
+ def get_prompt(click_state, click_input):
49
+ inputs = json.loads(click_input)
50
+ points = click_state[0]
51
+ labels = click_state[1]
52
+ for input in inputs:
53
+ points.append(input[:2])
54
+ labels.append(input[2])
55
+ click_state[0] = points
56
+ click_state[1] = labels
57
+ prompt = {
58
+ "prompt_type":["click"],
59
+ "input_point":click_state[0],
60
+ "input_label":click_state[1],
61
+ "multimask_output":"True",
62
+ }
63
+ return prompt
64
+
65
+ def get_frames_from_video(video_input, play_state):
66
+ """
67
+ Args:
68
+ video_path:str
69
+ timestamp:float64
70
+ Return
71
+ [[0:nearest_frame], [nearest_frame:], nearest_frame]
72
+ """
73
+ video_path = video_input
74
+ # video_name = video_path.split('/')[-1]
75
+
76
+ try:
77
+ timestamp = play_state[1] - play_state[0]
78
+ except:
79
+ timestamp = 0
80
+ frames = []
81
+ try:
82
+ cap = cv2.VideoCapture(video_path)
83
+ fps = cap.get(cv2.CAP_PROP_FPS)
84
+ while cap.isOpened():
85
+ ret, frame = cap.read()
86
+ if ret == True:
87
+ frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
88
+ else:
89
+ break
90
+ except (OSError, TypeError, ValueError, KeyError, SyntaxError) as e:
91
+ print("read_frame_source:{} error. {}\n".format(video_path, str(e)))
92
+
93
+ # for index, frame in enumerate(frames):
94
+ # frames[index] = np.asarray(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
95
+
96
+ key_frame_index = int(timestamp * fps)
97
+ nearest_frame = frames[key_frame_index]
98
+ frames_split = [frames[:key_frame_index], frames[key_frame_index:], nearest_frame]
99
+ # output_path='./seperate.mp4'
100
+ # torchvision.io.write_video(output_path, frames[1], fps=fps, video_codec="libx264")
101
+
102
+ # set image in sam when select the template frame
103
+ model.samcontroler.sam_controler.set_image(nearest_frame)
104
+ return frames_split, nearest_frame, nearest_frame, fps
105
+
106
+ def generate_video_from_frames(frames, output_path, fps=30):
107
+ """
108
+ Generates a video from a list of frames.
109
+
110
+ Args:
111
+ frames (list of numpy arrays): The frames to include in the video.
112
+ output_path (str): The path to save the generated video.
113
+ fps (int, optional): The frame rate of the output video. Defaults to 30.
114
+ """
115
+ # height, width, layers = frames[0].shape
116
+ # fourcc = cv2.VideoWriter_fourcc(*"mp4v")
117
+ # video = cv2.VideoWriter(output_path, fourcc, fps, (width, height))
118
+
119
+ # for frame in frames:
120
+ # video.write(frame)
121
+
122
+ # video.release()
123
+ frames = torch.from_numpy(np.asarray(frames))
124
+ output_path='./output.mp4'
125
+ torchvision.io.write_video(output_path, frames, fps=fps, video_codec="libx264")
126
+ return output_path
127
+
128
+ def model_reset():
129
+ model.xmem.clear_memory()
130
+ return None
131
+
132
+ def sam_refine(origin_frame, point_prompt, click_state, logit, evt:gr.SelectData):
133
+ """
134
+ Args:
135
+ template_frame: PIL.Image
136
+ point_prompt: flag for positive or negative button click
137
+ click_state: [[points], [labels]]
138
+ """
139
+ if point_prompt == "Positive":
140
+ coordinate = "[[{},{},1]]".format(evt.index[0], evt.index[1])
141
+ else:
142
+ coordinate = "[[{},{},0]]".format(evt.index[0], evt.index[1])
143
+
144
+ # prompt for sam model
145
+ prompt = get_prompt(click_state=click_state, click_input=coordinate)
146
+
147
+ # default value
148
+ # points = np.array([[evt.index[0],evt.index[1]]])
149
+ # labels= np.array([1])
150
+ if len(logit)==0:
151
+ logit = None
152
+
153
+ mask, logit, painted_image = model.first_frame_click(
154
+ image=origin_frame,
155
+ points=np.array(prompt["input_point"]),
156
+ labels=np.array(prompt["input_label"]),
157
+ multimask=prompt["multimask_output"],
158
+ )
159
+ return painted_image, click_state, logit, mask
160
+
161
+
162
+
163
+ def vos_tracking_video(video_state, template_mask,fps,video_input):
164
+
165
+ masks, logits, painted_images = model.generator(images=video_state[1], template_mask=template_mask)
166
+ video_output = generate_video_from_frames(painted_images, output_path="./output.mp4", fps=fps)
167
+ # image_selection_slider = gr.Slider(minimum=1, maximum=len(video_state[1]), value=1, label="Image Selection", interactive=True)
168
+ video_name = video_input.split('/')[-1].split('.')[0]
169
+ result_path = os.path.join('/hhd3/gaoshang/Track-Anything/results/'+video_name)
170
+ if not os.path.exists(result_path):
171
+ os.makedirs(result_path)
172
+ i=0
173
+ for mask in masks:
174
+ np.save(os.path.join(result_path,'{:05}.npy'.format(i)), mask)
175
+ i+=1
176
+ return video_output, painted_images, masks, logits
177
+
178
+ def vos_tracking_image(image_selection_slider, painted_images):
179
+
180
+ # images = video_state[1]
181
+ percentage = image_selection_slider / 100
182
+ select_frame_num = int(percentage * len(painted_images))
183
+ return painted_images[select_frame_num], select_frame_num
184
+
185
+ def interactive_correction(video_state, point_prompt, click_state, select_correction_frame, evt: gr.SelectData):
186
+ """
187
+ Args:
188
+ template_frame: PIL.Image
189
+ point_prompt: flag for positive or negative button click
190
+ click_state: [[points], [labels]]
191
+ """
192
+ refine_image = video_state[1][select_correction_frame]
193
+ if point_prompt == "Positive":
194
+ coordinate = "[[{},{},1]]".format(evt.index[0], evt.index[1])
195
+ else:
196
+ coordinate = "[[{},{},0]]".format(evt.index[0], evt.index[1])
197
+
198
+ # prompt for sam model
199
+ prompt = get_prompt(click_state=click_state, click_input=coordinate)
200
+ model.samcontroler.seg_again(refine_image)
201
+ corrected_mask, corrected_logit, corrected_painted_image = model.first_frame_click(
202
+ image=refine_image,
203
+ points=np.array(prompt["input_point"]),
204
+ labels=np.array(prompt["input_label"]),
205
+ multimask=prompt["multimask_output"],
206
+ )
207
+ return corrected_painted_image, [corrected_mask, corrected_logit, corrected_painted_image]
208
+
209
+ def correct_track(video_state, select_correction_frame, corrected_state, masks, logits, painted_images, fps, video_input):
210
+ model.xmem.clear_memory()
211
+ # inference the following images
212
+ following_images = video_state[1][select_correction_frame:]
213
+ corrected_masks, corrected_logits, corrected_painted_images = model.generator(images=following_images, template_mask=corrected_state[0])
214
+ masks = masks[:select_correction_frame] + corrected_masks
215
+ logits = logits[:select_correction_frame] + corrected_logits
216
+ painted_images = painted_images[:select_correction_frame] + corrected_painted_images
217
+ video_output = generate_video_from_frames(painted_images, output_path="./output.mp4", fps=fps)
218
+
219
+ video_name = video_input.split('/')[-1].split('.')[0]
220
+ result_path = os.path.join('/hhd3/gaoshang/Track-Anything/results/'+video_name)
221
+ if not os.path.exists(result_path):
222
+ os.makedirs(result_path)
223
+ i=0
224
+ for mask in masks:
225
+ np.save(os.path.join(result_path,'{:05}.npy'.format(i)), mask)
226
+ i+=1
227
+ return video_output, painted_images, logits, masks
228
+
229
+ # check and download checkpoints if needed
230
+ SAM_checkpoint = "sam_vit_h_4b8939.pth"
231
+ sam_checkpoint_url = "https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth"
232
+ xmem_checkpoint = "XMem-s012.pth"
233
+ xmem_checkpoint_url = "https://github.com/hkchengrex/XMem/releases/download/v1.0/XMem-s012.pth"
234
+ folder ="./checkpoints"
235
+ SAM_checkpoint = download_checkpoint(sam_checkpoint_url, folder, SAM_checkpoint)
236
+ xmem_checkpoint = download_checkpoint(xmem_checkpoint_url, folder, xmem_checkpoint)
237
+
238
+ # args, defined in track_anything.py
239
+ args = parse_augment()
240
+ args.port = 12207
241
+ args.device = "cuda:5"
242
+
243
+ model = TrackingAnything(SAM_checkpoint, xmem_checkpoint, args)
244
+
245
+ with gr.Blocks() as iface:
246
+ """
247
+ state for
248
+ """
249
+ state = gr.State([])
250
+ play_state = gr.State([])
251
+ video_state = gr.State([[],[],[]])
252
+ click_state = gr.State([[],[]])
253
+ logits = gr.State([])
254
+ masks = gr.State([])
255
+ painted_images = gr.State([])
256
+ origin_image = gr.State(None)
257
+ template_mask = gr.State(None)
258
+ select_correction_frame = gr.State(None)
259
+ corrected_state = gr.State([[],[],[]])
260
+ fps = gr.State([])
261
+ # video_name = gr.State([])
262
+ # queue value for image refresh, origin image, mask, logits, painted image
263
+
264
+
265
+
266
+ with gr.Row():
267
+
268
+ # for user video input
269
+ with gr.Column(scale=1.0):
270
+ video_input = gr.Video().style(height=720)
271
+
272
+ # listen to the user action for play and pause input video
273
+ video_input.play(fn=play_video, inputs=play_state, outputs=play_state, scroll_to_output=True, show_progress=True)
274
+ video_input.pause(fn=pause_video, inputs=play_state, outputs=play_state)
275
+
276
+
277
+ with gr.Row(scale=1):
278
+ # put the template frame under the radio button
279
+ with gr.Column(scale=0.5):
280
+ # click point settings: negative or positive, mode continuous or single
281
+ with gr.Row():
282
+ with gr.Row(scale=0.5):
283
+ point_prompt = gr.Radio(
284
+ choices=["Positive", "Negative"],
285
+ value="Positive",
286
+ label="Point Prompt",
287
+ interactive=True)
288
+ click_mode = gr.Radio(
289
+ choices=["Continuous", "Single"],
290
+ value="Continuous",
291
+ label="Clicking Mode",
292
+ interactive=True)
293
+ with gr.Row(scale=0.5):
294
+ clear_button_clike = gr.Button(value="Clear Clicks", interactive=True).style(height=160)
295
+ clear_button_image = gr.Button(value="Clear Image", interactive=True)
296
+ template_frame = gr.Image(type="pil",interactive=True, elem_id="template_frame").style(height=360)
297
+ with gr.Column():
298
+ template_select_button = gr.Button(value="Template select", interactive=True, variant="primary")
299
+
300
+
301
+
302
+ with gr.Column(scale=0.5):
303
+
304
+
305
+ # for intermedia result check and correction
306
+ # intermedia_image = gr.Image(type="pil", interactive=True, elem_id="intermedia_frame").style(height=360)
307
+ video_output = gr.Video().style(height=360)
308
+ tracking_video_predict_button = gr.Button(value="Tracking")
309
+
310
+ image_output = gr.Image(type="pil", interactive=True, elem_id="image_output").style(height=360)
311
+ image_selection_slider = gr.Slider(minimum=0, maximum=100, step=0.1, value=0, label="Image Selection", interactive=True)
312
+ correct_track_button = gr.Button(value="Interactive Correction")
313
+
314
+ template_frame.select(
315
+ fn=sam_refine,
316
+ inputs=[
317
+ origin_image, point_prompt, click_state, logits
318
+ ],
319
+ outputs=[
320
+ template_frame, click_state, logits, template_mask
321
+ ]
322
+ )
323
+
324
+ template_select_button.click(
325
+ fn=get_frames_from_video,
326
+ inputs=[
327
+ video_input,
328
+ play_state
329
+ ],
330
+ # outputs=[video_state, template_frame, origin_image, fps, video_name],
331
+ outputs=[video_state, template_frame, origin_image, fps],
332
+ )
333
+
334
+ tracking_video_predict_button.click(
335
+ fn=vos_tracking_video,
336
+ inputs=[video_state, template_mask, fps, video_input],
337
+ outputs=[video_output, painted_images, masks, logits]
338
+ )
339
+ image_selection_slider.release(fn=vos_tracking_image,
340
+ inputs=[image_selection_slider, painted_images], outputs=[image_output, select_correction_frame], api_name="select_image")
341
+ # correction
342
+ image_output.select(
343
+ fn=interactive_correction,
344
+ inputs=[video_state, point_prompt, click_state, select_correction_frame],
345
+ outputs=[image_output, corrected_state]
346
+ )
347
+ correct_track_button.click(
348
+ fn=correct_track,
349
+ inputs=[video_state, select_correction_frame, corrected_state, masks, logits, painted_images, fps,video_input],
350
+ outputs=[video_output, painted_images, logits, masks ]
351
+ )
352
+
353
+
354
+
355
+ # clear input
356
+ video_input.clear(
357
+ lambda: ([], [], [[], [], []],
358
+ None, "", "", "", "", "", "", "", [[],[]],
359
+ None),
360
+ [],
361
+ [ state, play_state, video_state,
362
+ template_frame, video_output, image_output, origin_image, template_mask, painted_images, masks, logits, click_state,
363
+ select_correction_frame],
364
+ queue=False,
365
+ show_progress=False
366
+ )
367
+ clear_button_image.click(
368
+ fn=model_reset
369
+ )
370
+ clear_button_clike.click(
371
+ lambda: ([[],[]]),
372
+ [],
373
+ [click_state],
374
+ queue=False,
375
+ show_progress=False
376
+ )
377
+ iface.queue(concurrency_count=1)
378
+ iface.launch(debug=True, enable_queue=True, server_port=args.port, server_name="0.0.0.0")
379
+
380
+
381
+
app_test.py ADDED
@@ -0,0 +1,23 @@
+ import gradio as gr
+
+ def update_iframe(slider_value):
+     return f'''
+     <script>
+     window.addEventListener('message', function(event) {{
+         if (event.data.sliderValue !== undefined) {{
+             var iframe = document.getElementById("text_iframe");
+             iframe.src = "http://localhost:5001/get_text?slider_value=" + event.data.sliderValue;
+         }}
+     }}, false);
+     </script>
+     <iframe id="text_iframe" src="http://localhost:5001/get_text?slider_value={slider_value}" style="width: 100%; height: 100%; border: none;"></iframe>
+     '''
+
+ iface = gr.Interface(
+     fn=update_iframe,
+     inputs=gr.inputs.Slider(minimum=0, maximum=100, step=1, default=50),
+     outputs=gr.outputs.HTML(),
+     allow_flagging=False,
+ )
+
+ iface.launch(server_name='0.0.0.0', server_port=12212)
assets/demo_version_1.MP4 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:2b61b54bc6eb0d0f7416f95aa3cd6a48d850ca7473022ec1aff48310911b0233
+ size 27053146
assets/inpainting.gif ADDED

Git LFS Details

  • SHA256: 5e99bd697bccaed7a0dded7f00855f222031b7dcefd8f64f22f374fcdab390d2
  • Pointer size: 133 Bytes
  • Size of remote file: 22.2 MB
assets/poster_demo_version_1.png ADDED
assets/qingming.mp4 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:58b34bbce0bd0a18ab5fc5450d4046e1cfc6bd55c508046695545819d8fc46dc
+ size 4483842
demo.py ADDED
@@ -0,0 +1,87 @@
+ from metaseg import SegAutoMaskPredictor, SegManualMaskPredictor, SahiAutoSegmentation, sahi_sliced_predict
+
+ # For image
+
+ def automask_image_app(image_path, model_type, points_per_side, points_per_batch, min_area):
+     SegAutoMaskPredictor().image_predict(
+         source=image_path,
+         model_type=model_type,  # vit_l, vit_h, vit_b
+         points_per_side=points_per_side,
+         points_per_batch=points_per_batch,
+         min_area=min_area,
+         output_path="output.png",
+         show=False,
+         save=True,
+     )
+     return "output.png"
+
+
+ # For video
+
+ def automask_video_app(video_path, model_type, points_per_side, points_per_batch, min_area):
+     SegAutoMaskPredictor().video_predict(
+         source=video_path,
+         model_type=model_type,  # vit_l, vit_h, vit_b
+         points_per_side=points_per_side,
+         points_per_batch=points_per_batch,
+         min_area=min_area,
+         output_path="output.mp4",
+     )
+     return "output.mp4"
+
+
+ # For manual box and point selection
+
+ def manual_app(image_path, model_type, input_point, input_label, input_box, multimask_output, random_color):
+     SegManualMaskPredictor().image_predict(
+         source=image_path,
+         model_type=model_type,  # vit_l, vit_h, vit_b
+         input_point=input_point,
+         input_label=input_label,
+         input_box=input_box,
+         multimask_output=multimask_output,
+         random_color=random_color,
+         output_path="output.png",
+         show=False,
+         save=True,
+     )
+     return "output.png"
+
+
+ # For sahi sliced prediction
+
+ def sahi_autoseg_app(
+     image_path,
+     sam_model_type,
+     detection_model_type,
+     detection_model_path,
+     conf_th,
+     image_size,
+     slice_height,
+     slice_width,
+     overlap_height_ratio,
+     overlap_width_ratio,
+ ):
+     boxes = sahi_sliced_predict(
+         image_path=image_path,
+         detection_model_type=detection_model_type,  # yolov8, detectron2, mmdetection, torchvision
+         detection_model_path=detection_model_path,
+         conf_th=conf_th,
+         image_size=image_size,
+         slice_height=slice_height,
+         slice_width=slice_width,
+         overlap_height_ratio=overlap_height_ratio,
+         overlap_width_ratio=overlap_width_ratio,
+     )
+
+     SahiAutoSegmentation().predict(
+         source=image_path,
+         model_type=sam_model_type,
+         input_box=boxes,
+         multimask_output=False,
+         random_color=False,
+         show=False,
+         save=True,
+     )
+
+     return "output.png"
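The wrappers above are thin shims over the `metaseg` package, and each writes its result to a fixed output file. As a hedged usage sketch, the automatic-mask wrapper can be pointed at one of the sample images shipped in this commit; the numeric parameter values below are illustrative, not taken from the repository:

```python
from demo import automask_image_app

# Run SAM's automatic mask generation on a bundled sample image.
result = automask_image_app(
    image_path="images/truck.jpg",  # shipped in this commit
    model_type="vit_h",             # vit_l / vit_h / vit_b, per the comment in demo.py
    points_per_side=16,             # illustrative values
    points_per_batch=64,
    min_area=0,
)
print(result)  # "output.png"
```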
images/groceries.jpg ADDED
images/mask_painter.png ADDED
images/painter_input_image.jpg ADDED
images/painter_input_mask.jpg ADDED
images/painter_output_image.png ADDED
images/painter_output_image__.png ADDED
images/point_painter.png ADDED
images/point_painter_1.png ADDED
images/point_painter_2.png ADDED
images/truck.jpg ADDED
images/truck_both.jpg ADDED
images/truck_mask.jpg ADDED
images/truck_point.jpg ADDED
inpainter/.DS_Store ADDED
Binary file (6.15 kB).
inpainter/base_inpainter.py ADDED
@@ -0,0 +1,160 @@
1
+ import os
2
+ import glob
3
+ from PIL import Image
4
+
5
+ import torch
6
+ import yaml
7
+ import cv2
8
+ import importlib
9
+ import numpy as np
10
+ from util.tensor_util import resize_frames, resize_masks
11
+
12
+
13
+ class BaseInpainter:
14
+ def __init__(self, E2FGVI_checkpoint, device) -> None:
15
+ """
16
+ E2FGVI_checkpoint: checkpoint of inpainter (version hq, with multi-resolution support)
17
+ """
18
+ net = importlib.import_module('model.e2fgvi_hq')
19
+ self.model = net.InpaintGenerator().to(device)
20
+ self.model.load_state_dict(torch.load(E2FGVI_checkpoint, map_location=device))
21
+ self.model.eval()
22
+ self.device = device
23
+ # load configurations
24
+ with open("inpainter/config/config.yaml", 'r') as stream:
25
+ config = yaml.safe_load(stream)
26
+ self.neighbor_stride = config['neighbor_stride']
27
+ self.num_ref = config['num_ref']
28
+ self.step = config['step']
29
+
30
+ # sample reference frames from the whole video
31
+ def get_ref_index(self, f, neighbor_ids, length):
32
+ ref_index = []
33
+ if self.num_ref == -1:
34
+ for i in range(0, length, self.step):
35
+ if i not in neighbor_ids:
36
+ ref_index.append(i)
37
+ else:
38
+ start_idx = max(0, f - self.step * (self.num_ref // 2))
39
+ end_idx = min(length, f + self.step * (self.num_ref // 2))
40
+ for i in range(start_idx, end_idx + 1, self.step):
41
+ if i not in neighbor_ids:
42
+ if len(ref_index) > self.num_ref:
43
+ break
44
+ ref_index.append(i)
45
+ return ref_index
46
+
47
+ def inpaint(self, frames, masks, dilate_radius=15, ratio=1):
48
+ """
49
+ frames: numpy array, T, H, W, 3
50
+ masks: numpy array, T, H, W
51
+ dilate_radius: radius when applying dilation on masks
52
+ ratio: down-sample ratio
53
+
54
+ Output:
55
+ inpainted_frames: numpy array, T, H, W, 3
56
+ """
57
+ assert frames.shape[:3] == masks.shape, 'different size between frames and masks'
58
+ assert 0 < ratio <= 1, 'ratio must be in (0, 1]'
59
+ masks = masks.copy()
60
+ masks = np.clip(masks, 0, 1)
61
+ kernel = cv2.getStructuringElement(2, (dilate_radius, dilate_radius))
62
+ masks = np.stack([cv2.dilate(mask, kernel) for mask in masks], 0)
63
+
64
+ T, H, W = masks.shape
65
+ # size: (w, h)
66
+ if ratio == 1:
67
+ size = None
68
+ else:
69
+ size = (int(W*ratio), int(H*ratio))
70
+
71
+ masks = np.expand_dims(masks, axis=3) # expand to T, H, W, 1
72
+ binary_masks = resize_masks(masks, size)
73
+ frames = resize_frames(frames, size) # T, H, W, 3
74
+ # frames and binary_masks are numpy arrays
75
+
76
+ h, w = frames.shape[1:3]
77
+ video_length = T
78
+
79
+ # convert to tensor
80
+ imgs = (torch.from_numpy(frames).permute(0, 3, 1, 2).contiguous().unsqueeze(0).float().div(255)) * 2 - 1
81
+ masks = torch.from_numpy(binary_masks).permute(0, 3, 1, 2).contiguous().unsqueeze(0)
82
+
83
+ imgs, masks = imgs.to(self.device), masks.to(self.device)
84
+ comp_frames = [None] * video_length
85
+
86
+ for f in range(0, video_length, self.neighbor_stride):
87
+ neighbor_ids = [
88
+ i for i in range(max(0, f - self.neighbor_stride),
89
+ min(video_length, f + self.neighbor_stride + 1))
90
+ ]
91
+ ref_ids = self.get_ref_index(f, neighbor_ids, video_length)
92
+ selected_imgs = imgs[:1, neighbor_ids + ref_ids, :, :, :]
93
+ selected_masks = masks[:1, neighbor_ids + ref_ids, :, :, :]
94
+ with torch.no_grad():
95
+ masked_imgs = selected_imgs * (1 - selected_masks)
96
+ mod_size_h = 60
97
+ mod_size_w = 108
98
+ h_pad = (mod_size_h - h % mod_size_h) % mod_size_h
99
+ w_pad = (mod_size_w - w % mod_size_w) % mod_size_w
100
+ masked_imgs = torch.cat(
101
+ [masked_imgs, torch.flip(masked_imgs, [3])],
102
+ 3)[:, :, :, :h + h_pad, :]
103
+ masked_imgs = torch.cat(
104
+ [masked_imgs, torch.flip(masked_imgs, [4])],
105
+ 4)[:, :, :, :, :w + w_pad]
106
+ pred_imgs, _ = self.model(masked_imgs, len(neighbor_ids))
107
+ pred_imgs = pred_imgs[:, :, :h, :w]
108
+ pred_imgs = (pred_imgs + 1) / 2
109
+ pred_imgs = pred_imgs.cpu().permute(0, 2, 3, 1).numpy() * 255
110
+ for i in range(len(neighbor_ids)):
111
+ idx = neighbor_ids[i]
112
+ img = pred_imgs[i].astype(np.uint8) * binary_masks[idx] + frames[idx] * (
113
+ 1 - binary_masks[idx])
114
+ if comp_frames[idx] is None:
115
+ comp_frames[idx] = img
116
+ else:
117
+ comp_frames[idx] = comp_frames[idx].astype(
118
+ np.float32) * 0.5 + img.astype(np.float32) * 0.5
119
+
120
+ inpainted_frames = np.stack(comp_frames, 0)
121
+ return inpainted_frames.astype(np.uint8)
122
+
123
+ if __name__ == '__main__':
124
+
125
+ frame_path = glob.glob(os.path.join('/ssd1/gaomingqi/datasets/davis/JPEGImages/480p/parkour', '*.jpg'))
126
+ frame_path.sort()
127
+ mask_path = glob.glob(os.path.join('/ssd1/gaomingqi/datasets/davis/Annotations/480p/parkour', "*.png"))
128
+ mask_path.sort()
129
+ save_path = '/ssd1/gaomingqi/results/inpainting/parkour'
130
+
131
+ if not os.path.exists(save_path):
132
+ os.mkdir(save_path)
133
+
134
+ frames = []
135
+ masks = []
136
+ for fid, mid in zip(frame_path, mask_path):
137
+ frames.append(Image.open(fid).convert('RGB'))
138
+ masks.append(Image.open(mid).convert('P'))
139
+
140
+ frames = np.stack(frames, 0)
141
+ masks = np.stack(masks, 0)
142
+
143
+ # ----------------------------------------------
144
+ # how to use
145
+ # ----------------------------------------------
146
+ # 1/3: set checkpoint and device
147
+ checkpoint = '/ssd1/gaomingqi/checkpoints/E2FGVI-HQ-CVPR22.pth'
148
+ device = 'cuda:6'
149
+ # 2/3: initialise inpainter
150
+ base_inpainter = BaseInpainter(checkpoint, device)
151
+ # 3/3: inpainting (frames: numpy array, T, H, W, 3; masks: numpy array, T, H, W)
152
+ # ratio: (0, 1], ratio for down sample, default value is 1
153
+ inpainted_frames = base_inpainter.inpaint(frames, masks, ratio=1) # numpy array, T, H, W, 3
154
+ # ----------------------------------------------
155
+ # end
156
+ # ----------------------------------------------
157
+ # save
158
+ for ti, inpainted_frame in enumerate(inpainted_frames):
159
+ frame = Image.fromarray(inpainted_frame).convert('RGB')
160
+ frame.save(os.path.join(save_path, f'{ti:05d}.jpg'))
inpainter/config/config.yaml ADDED
@@ -0,0 +1,4 @@
+ # config info for E2FGVI
+ neighbor_stride: 5
+ num_ref: -1
+ step: 10
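These three values drive `BaseInpainter` above: `neighbor_stride` sets the local window that is inpainted jointly, while `num_ref: -1` combined with `step: 10` makes `get_ref_index` take every 10th frame outside that window as a global reference. A small illustration of that selection for a hypothetical 80-frame clip centred on frame 40:

```python
# Reference-frame selection as in BaseInpainter.get_ref_index for the config above
# (num_ref: -1, step: 10) and a local window built with neighbor_stride: 5.
step = 10
length = 80                                                     # hypothetical clip length
f = 40                                                          # centre of the local window
neighbor_ids = list(range(max(0, f - 5), min(length, f + 6)))   # frames 35..45

ref_index = [i for i in range(0, length, step) if i not in neighbor_ids]
print(ref_index)   # [0, 10, 20, 30, 50, 60, 70]
```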
inpainter/model/e2fgvi.py ADDED
@@ -0,0 +1,350 @@
1
+ ''' Towards An End-to-End Framework for Video Inpainting
2
+ '''
3
+
4
+ import torch
5
+ import torch.nn as nn
6
+ import torch.nn.functional as F
7
+
8
+ from model.modules.flow_comp import SPyNet
9
+ from model.modules.feat_prop import BidirectionalPropagation, SecondOrderDeformableAlignment
10
+ from model.modules.tfocal_transformer import TemporalFocalTransformerBlock, SoftSplit, SoftComp
11
+ from model.modules.spectral_norm import spectral_norm as _spectral_norm
12
+
13
+
14
+ class BaseNetwork(nn.Module):
15
+ def __init__(self):
16
+ super(BaseNetwork, self).__init__()
17
+
18
+ def print_network(self):
19
+ if isinstance(self, list):
20
+ self = self[0]
21
+ num_params = 0
22
+ for param in self.parameters():
23
+ num_params += param.numel()
24
+ print(
25
+ 'Network [%s] was created. Total number of parameters: %.1f million. '
26
+ 'To see the architecture, do print(network).' %
27
+ (type(self).__name__, num_params / 1000000))
28
+
29
+ def init_weights(self, init_type='normal', gain=0.02):
30
+ '''
31
+ initialize network's weights
32
+ init_type: normal | xavier | kaiming | orthogonal
33
+ https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix/blob/9451e70673400885567d08a9e97ade2524c700d0/models/networks.py#L39
34
+ '''
35
+ def init_func(m):
36
+ classname = m.__class__.__name__
37
+ if classname.find('InstanceNorm2d') != -1:
38
+ if hasattr(m, 'weight') and m.weight is not None:
39
+ nn.init.constant_(m.weight.data, 1.0)
40
+ if hasattr(m, 'bias') and m.bias is not None:
41
+ nn.init.constant_(m.bias.data, 0.0)
42
+ elif hasattr(m, 'weight') and (classname.find('Conv') != -1
43
+ or classname.find('Linear') != -1):
44
+ if init_type == 'normal':
45
+ nn.init.normal_(m.weight.data, 0.0, gain)
46
+ elif init_type == 'xavier':
47
+ nn.init.xavier_normal_(m.weight.data, gain=gain)
48
+ elif init_type == 'xavier_uniform':
49
+ nn.init.xavier_uniform_(m.weight.data, gain=1.0)
50
+ elif init_type == 'kaiming':
51
+ nn.init.kaiming_normal_(m.weight.data, a=0, mode='fan_in')
52
+ elif init_type == 'orthogonal':
53
+ nn.init.orthogonal_(m.weight.data, gain=gain)
54
+ elif init_type == 'none': # uses pytorch's default init method
55
+ m.reset_parameters()
56
+ else:
57
+ raise NotImplementedError(
58
+ 'initialization method [%s] is not implemented' %
59
+ init_type)
60
+ if hasattr(m, 'bias') and m.bias is not None:
61
+ nn.init.constant_(m.bias.data, 0.0)
62
+
63
+ self.apply(init_func)
64
+
65
+ # propagate to children
66
+ for m in self.children():
67
+ if hasattr(m, 'init_weights'):
68
+ m.init_weights(init_type, gain)
69
+
70
+
71
+ class Encoder(nn.Module):
72
+ def __init__(self):
73
+ super(Encoder, self).__init__()
74
+ self.group = [1, 2, 4, 8, 1]
75
+ self.layers = nn.ModuleList([
76
+ nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1),
77
+ nn.LeakyReLU(0.2, inplace=True),
78
+ nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1),
79
+ nn.LeakyReLU(0.2, inplace=True),
80
+ nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),
81
+ nn.LeakyReLU(0.2, inplace=True),
82
+ nn.Conv2d(128, 256, kernel_size=3, stride=1, padding=1),
83
+ nn.LeakyReLU(0.2, inplace=True),
84
+ nn.Conv2d(256, 384, kernel_size=3, stride=1, padding=1, groups=1),
85
+ nn.LeakyReLU(0.2, inplace=True),
86
+ nn.Conv2d(640, 512, kernel_size=3, stride=1, padding=1, groups=2),
87
+ nn.LeakyReLU(0.2, inplace=True),
88
+ nn.Conv2d(768, 384, kernel_size=3, stride=1, padding=1, groups=4),
89
+ nn.LeakyReLU(0.2, inplace=True),
90
+ nn.Conv2d(640, 256, kernel_size=3, stride=1, padding=1, groups=8),
91
+ nn.LeakyReLU(0.2, inplace=True),
92
+ nn.Conv2d(512, 128, kernel_size=3, stride=1, padding=1, groups=1),
93
+ nn.LeakyReLU(0.2, inplace=True)
94
+ ])
95
+
96
+ def forward(self, x):
97
+ bt, c, h, w = x.size()
98
+ h, w = h // 4, w // 4
99
+ out = x
100
+ for i, layer in enumerate(self.layers):
101
+ if i == 8:
102
+ x0 = out
103
+ if i > 8 and i % 2 == 0:
104
+ g = self.group[(i - 8) // 2]
105
+ x = x0.view(bt, g, -1, h, w)
106
+ o = out.view(bt, g, -1, h, w)
107
+ out = torch.cat([x, o], 2).view(bt, -1, h, w)
108
+ out = layer(out)
109
+ return out
110
+
111
+
112
+ class deconv(nn.Module):
113
+ def __init__(self,
114
+ input_channel,
115
+ output_channel,
116
+ kernel_size=3,
117
+ padding=0):
118
+ super().__init__()
119
+ self.conv = nn.Conv2d(input_channel,
120
+ output_channel,
121
+ kernel_size=kernel_size,
122
+ stride=1,
123
+ padding=padding)
124
+
125
+ def forward(self, x):
126
+ x = F.interpolate(x,
127
+ scale_factor=2,
128
+ mode='bilinear',
129
+ align_corners=True)
130
+ return self.conv(x)
131
+
132
+
133
+ class InpaintGenerator(BaseNetwork):
134
+ def __init__(self, init_weights=True):
135
+ super(InpaintGenerator, self).__init__()
136
+ channel = 256
137
+ hidden = 512
138
+
139
+ # encoder
140
+ self.encoder = Encoder()
141
+
142
+ # decoder
143
+ self.decoder = nn.Sequential(
144
+ deconv(channel // 2, 128, kernel_size=3, padding=1),
145
+ nn.LeakyReLU(0.2, inplace=True),
146
+ nn.Conv2d(128, 64, kernel_size=3, stride=1, padding=1),
147
+ nn.LeakyReLU(0.2, inplace=True),
148
+ deconv(64, 64, kernel_size=3, padding=1),
149
+ nn.LeakyReLU(0.2, inplace=True),
150
+ nn.Conv2d(64, 3, kernel_size=3, stride=1, padding=1))
151
+
152
+ # feature propagation module
153
+ self.feat_prop_module = BidirectionalPropagation(channel // 2)
154
+
155
+ # soft split and soft composition
156
+ kernel_size = (7, 7)
157
+ padding = (3, 3)
158
+ stride = (3, 3)
159
+ output_size = (60, 108)
160
+ t2t_params = {
161
+ 'kernel_size': kernel_size,
162
+ 'stride': stride,
163
+ 'padding': padding,
164
+ 'output_size': output_size
165
+ }
166
+ self.ss = SoftSplit(channel // 2,
167
+ hidden,
168
+ kernel_size,
169
+ stride,
170
+ padding,
171
+ t2t_param=t2t_params)
172
+ self.sc = SoftComp(channel // 2, hidden, output_size, kernel_size,
173
+ stride, padding)
174
+
175
+ n_vecs = 1
176
+ for i, d in enumerate(kernel_size):
177
+ n_vecs *= int((output_size[i] + 2 * padding[i] -
178
+ (d - 1) - 1) / stride[i] + 1)
179
+
180
+ blocks = []
181
+ depths = 8
182
+ num_heads = [4] * depths
183
+ window_size = [(5, 9)] * depths
184
+ focal_windows = [(5, 9)] * depths
185
+ focal_levels = [2] * depths
186
+ pool_method = "fc"
187
+
188
+ for i in range(depths):
189
+ blocks.append(
190
+ TemporalFocalTransformerBlock(dim=hidden,
191
+ num_heads=num_heads[i],
192
+ window_size=window_size[i],
193
+ focal_level=focal_levels[i],
194
+ focal_window=focal_windows[i],
195
+ n_vecs=n_vecs,
196
+ t2t_params=t2t_params,
197
+ pool_method=pool_method))
198
+ self.transformer = nn.Sequential(*blocks)
199
+
200
+ if init_weights:
201
+ self.init_weights()
202
+ # Need to initialize the weights of MSDeformAttn specifically
203
+ for m in self.modules():
204
+ if isinstance(m, SecondOrderDeformableAlignment):
205
+ m.init_offset()
206
+
207
+ # flow completion network
208
+ self.update_spynet = SPyNet()
209
+
210
+ def forward_bidirect_flow(self, masked_local_frames):
211
+ b, l_t, c, h, w = masked_local_frames.size()
212
+
213
+ # compute forward and backward flows of masked frames
214
+ masked_local_frames = F.interpolate(masked_local_frames.view(
215
+ -1, c, h, w),
216
+ scale_factor=1 / 4,
217
+ mode='bilinear',
218
+ align_corners=True,
219
+ recompute_scale_factor=True)
220
+ masked_local_frames = masked_local_frames.view(b, l_t, c, h // 4,
221
+ w // 4)
222
+ mlf_1 = masked_local_frames[:, :-1, :, :, :].reshape(
223
+ -1, c, h // 4, w // 4)
224
+ mlf_2 = masked_local_frames[:, 1:, :, :, :].reshape(
225
+ -1, c, h // 4, w // 4)
226
+ pred_flows_forward = self.update_spynet(mlf_1, mlf_2)
227
+ pred_flows_backward = self.update_spynet(mlf_2, mlf_1)
228
+
229
+ pred_flows_forward = pred_flows_forward.view(b, l_t - 1, 2, h // 4,
230
+ w // 4)
231
+ pred_flows_backward = pred_flows_backward.view(b, l_t - 1, 2, h // 4,
232
+ w // 4)
233
+
234
+ return pred_flows_forward, pred_flows_backward
235
+
236
+ def forward(self, masked_frames, num_local_frames):
237
+ l_t = num_local_frames
238
+ b, t, ori_c, ori_h, ori_w = masked_frames.size()
239
+
240
+ # normalization before feeding into the flow completion module
241
+ masked_local_frames = (masked_frames[:, :l_t, ...] + 1) / 2
242
+ pred_flows = self.forward_bidirect_flow(masked_local_frames)
243
+
244
+ # extracting features and performing the feature propagation on local features
245
+ enc_feat = self.encoder(masked_frames.view(b * t, ori_c, ori_h, ori_w))
246
+ _, c, h, w = enc_feat.size()
247
+ local_feat = enc_feat.view(b, t, c, h, w)[:, :l_t, ...]
248
+ ref_feat = enc_feat.view(b, t, c, h, w)[:, l_t:, ...]
249
+ local_feat = self.feat_prop_module(local_feat, pred_flows[0],
250
+ pred_flows[1])
251
+ enc_feat = torch.cat((local_feat, ref_feat), dim=1)
252
+
253
+ # content hallucination through stacking multiple temporal focal transformer blocks
254
+ trans_feat = self.ss(enc_feat.view(-1, c, h, w), b)
255
+ trans_feat = self.transformer(trans_feat)
256
+ trans_feat = self.sc(trans_feat, t)
257
+ trans_feat = trans_feat.view(b, t, -1, h, w)
258
+ enc_feat = enc_feat + trans_feat
259
+
260
+ # decode frames from features
261
+ output = self.decoder(enc_feat.view(b * t, c, h, w))
262
+ output = torch.tanh(output)
263
+ return output, pred_flows
264
+
265
+
266
+ # ######################################################################
267
+ # Discriminator for Temporal Patch GAN
268
+ # ######################################################################
269
+
270
+
271
+ class Discriminator(BaseNetwork):
272
+ def __init__(self,
273
+ in_channels=3,
274
+ use_sigmoid=False,
275
+ use_spectral_norm=True,
276
+ init_weights=True):
277
+ super(Discriminator, self).__init__()
278
+ self.use_sigmoid = use_sigmoid
279
+ nf = 32
280
+
281
+ self.conv = nn.Sequential(
282
+ spectral_norm(
283
+ nn.Conv3d(in_channels=in_channels,
284
+ out_channels=nf * 1,
285
+ kernel_size=(3, 5, 5),
286
+ stride=(1, 2, 2),
287
+ padding=1,
288
+ bias=not use_spectral_norm), use_spectral_norm),
289
+ # nn.InstanceNorm2d(64, track_running_stats=False),
290
+ nn.LeakyReLU(0.2, inplace=True),
291
+ spectral_norm(
292
+ nn.Conv3d(nf * 1,
293
+ nf * 2,
294
+ kernel_size=(3, 5, 5),
295
+ stride=(1, 2, 2),
296
+ padding=(1, 2, 2),
297
+ bias=not use_spectral_norm), use_spectral_norm),
298
+ # nn.InstanceNorm2d(128, track_running_stats=False),
299
+ nn.LeakyReLU(0.2, inplace=True),
300
+ spectral_norm(
301
+ nn.Conv3d(nf * 2,
302
+ nf * 4,
303
+ kernel_size=(3, 5, 5),
304
+ stride=(1, 2, 2),
305
+ padding=(1, 2, 2),
306
+ bias=not use_spectral_norm), use_spectral_norm),
307
+ # nn.InstanceNorm2d(256, track_running_stats=False),
308
+ nn.LeakyReLU(0.2, inplace=True),
309
+ spectral_norm(
310
+ nn.Conv3d(nf * 4,
311
+ nf * 4,
312
+ kernel_size=(3, 5, 5),
313
+ stride=(1, 2, 2),
314
+ padding=(1, 2, 2),
315
+ bias=not use_spectral_norm), use_spectral_norm),
316
+ # nn.InstanceNorm2d(256, track_running_stats=False),
317
+ nn.LeakyReLU(0.2, inplace=True),
318
+ spectral_norm(
319
+ nn.Conv3d(nf * 4,
320
+ nf * 4,
321
+ kernel_size=(3, 5, 5),
322
+ stride=(1, 2, 2),
323
+ padding=(1, 2, 2),
324
+ bias=not use_spectral_norm), use_spectral_norm),
325
+ # nn.InstanceNorm2d(256, track_running_stats=False),
326
+ nn.LeakyReLU(0.2, inplace=True),
327
+ nn.Conv3d(nf * 4,
328
+ nf * 4,
329
+ kernel_size=(3, 5, 5),
330
+ stride=(1, 2, 2),
331
+ padding=(1, 2, 2)))
332
+
333
+ if init_weights:
334
+ self.init_weights()
335
+
336
+ def forward(self, xs):
337
+ # T, C, H, W = xs.shape (old)
338
+ # B, T, C, H, W (new)
339
+ xs_t = torch.transpose(xs, 1, 2)
340
+ feat = self.conv(xs_t)
341
+ if self.use_sigmoid:
342
+ feat = torch.sigmoid(feat)
343
+ out = torch.transpose(feat, 1, 2) # B, T, C, H, W
344
+ return out
345
+
346
+
347
+ def spectral_norm(module, mode=True):
348
+ if mode:
349
+ return _spectral_norm(module)
350
+ return module
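For orientation, here is a minimal usage sketch of the InpaintGenerator from e2fgvi.py above. It is not part of the commit: the import path, the 240x432 frame size (implied by the fixed fold output_size of (60, 108) and the encoder's 4x downsampling), and the 5-frame local window are illustrative assumptions, and constructing the model triggers a download of the pretrained SPyNet weights.

import torch
from model.e2fgvi import InpaintGenerator  # assumed import path inside this repo

model = InpaintGenerator().eval()               # also loads pretrained SPyNet weights
masked_frames = torch.randn(1, 8, 3, 240, 432)  # (b, t, c, h, w), values in [-1, 1]
num_local_frames = 5                            # the first 5 frames form the local window
with torch.no_grad():
    pred_imgs, pred_flows = model(masked_frames, num_local_frames)
print(pred_imgs.shape)      # (b * t, 3, 240, 432), tanh output in [-1, 1]
print(pred_flows[0].shape)  # (b, 4, 2, 60, 108): forward flows at 1/4 resolution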
inpainter/model/e2fgvi_hq.py ADDED
@@ -0,0 +1,350 @@
1
+ ''' Towards An End-to-End Framework for Video Inpainting
2
+ '''
3
+
4
+ import torch
5
+ import torch.nn as nn
6
+ import torch.nn.functional as F
7
+
8
+ from model.modules.flow_comp import SPyNet
9
+ from model.modules.feat_prop import BidirectionalPropagation, SecondOrderDeformableAlignment
10
+ from model.modules.tfocal_transformer_hq import TemporalFocalTransformerBlock, SoftSplit, SoftComp
11
+ from model.modules.spectral_norm import spectral_norm as _spectral_norm
12
+
13
+
14
+ class BaseNetwork(nn.Module):
15
+ def __init__(self):
16
+ super(BaseNetwork, self).__init__()
17
+
18
+ def print_network(self):
19
+ if isinstance(self, list):
20
+ self = self[0]
21
+ num_params = 0
22
+ for param in self.parameters():
23
+ num_params += param.numel()
24
+ print(
25
+ 'Network [%s] was created. Total number of parameters: %.1f million. '
26
+ 'To see the architecture, do print(network).' %
27
+ (type(self).__name__, num_params / 1000000))
28
+
29
+ def init_weights(self, init_type='normal', gain=0.02):
30
+ '''
31
+ initialize network's weights
32
+ init_type: normal | xavier | kaiming | orthogonal
33
+ https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix/blob/9451e70673400885567d08a9e97ade2524c700d0/models/networks.py#L39
34
+ '''
35
+ def init_func(m):
36
+ classname = m.__class__.__name__
37
+ if classname.find('InstanceNorm2d') != -1:
38
+ if hasattr(m, 'weight') and m.weight is not None:
39
+ nn.init.constant_(m.weight.data, 1.0)
40
+ if hasattr(m, 'bias') and m.bias is not None:
41
+ nn.init.constant_(m.bias.data, 0.0)
42
+ elif hasattr(m, 'weight') and (classname.find('Conv') != -1
43
+ or classname.find('Linear') != -1):
44
+ if init_type == 'normal':
45
+ nn.init.normal_(m.weight.data, 0.0, gain)
46
+ elif init_type == 'xavier':
47
+ nn.init.xavier_normal_(m.weight.data, gain=gain)
48
+ elif init_type == 'xavier_uniform':
49
+ nn.init.xavier_uniform_(m.weight.data, gain=1.0)
50
+ elif init_type == 'kaiming':
51
+ nn.init.kaiming_normal_(m.weight.data, a=0, mode='fan_in')
52
+ elif init_type == 'orthogonal':
53
+ nn.init.orthogonal_(m.weight.data, gain=gain)
54
+ elif init_type == 'none': # uses pytorch's default init method
55
+ m.reset_parameters()
56
+ else:
57
+ raise NotImplementedError(
58
+ 'initialization method [%s] is not implemented' %
59
+ init_type)
60
+ if hasattr(m, 'bias') and m.bias is not None:
61
+ nn.init.constant_(m.bias.data, 0.0)
62
+
63
+ self.apply(init_func)
64
+
65
+ # propagate to children
66
+ for m in self.children():
67
+ if hasattr(m, 'init_weights'):
68
+ m.init_weights(init_type, gain)
69
+
70
+
71
+ class Encoder(nn.Module):
72
+ def __init__(self):
73
+ super(Encoder, self).__init__()
74
+ self.group = [1, 2, 4, 8, 1]
75
+ self.layers = nn.ModuleList([
76
+ nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1),
77
+ nn.LeakyReLU(0.2, inplace=True),
78
+ nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1),
79
+ nn.LeakyReLU(0.2, inplace=True),
80
+ nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),
81
+ nn.LeakyReLU(0.2, inplace=True),
82
+ nn.Conv2d(128, 256, kernel_size=3, stride=1, padding=1),
83
+ nn.LeakyReLU(0.2, inplace=True),
84
+ nn.Conv2d(256, 384, kernel_size=3, stride=1, padding=1, groups=1),
85
+ nn.LeakyReLU(0.2, inplace=True),
86
+ nn.Conv2d(640, 512, kernel_size=3, stride=1, padding=1, groups=2),
87
+ nn.LeakyReLU(0.2, inplace=True),
88
+ nn.Conv2d(768, 384, kernel_size=3, stride=1, padding=1, groups=4),
89
+ nn.LeakyReLU(0.2, inplace=True),
90
+ nn.Conv2d(640, 256, kernel_size=3, stride=1, padding=1, groups=8),
91
+ nn.LeakyReLU(0.2, inplace=True),
92
+ nn.Conv2d(512, 128, kernel_size=3, stride=1, padding=1, groups=1),
93
+ nn.LeakyReLU(0.2, inplace=True)
94
+ ])
95
+
96
+ def forward(self, x):
97
+ bt, c, _, _ = x.size()
98
+ # h, w = h//4, w//4
99
+ out = x
100
+ for i, layer in enumerate(self.layers):
101
+ if i == 8:
102
+ x0 = out
103
+ _, _, h, w = x0.size()
104
+ if i > 8 and i % 2 == 0:
105
+ g = self.group[(i - 8) // 2]
106
+ x = x0.view(bt, g, -1, h, w)
107
+ o = out.view(bt, g, -1, h, w)
108
+ out = torch.cat([x, o], 2).view(bt, -1, h, w)
109
+ out = layer(out)
110
+ return out
111
+
112
+
113
+ class deconv(nn.Module):
114
+ def __init__(self,
115
+ input_channel,
116
+ output_channel,
117
+ kernel_size=3,
118
+ padding=0):
119
+ super().__init__()
120
+ self.conv = nn.Conv2d(input_channel,
121
+ output_channel,
122
+ kernel_size=kernel_size,
123
+ stride=1,
124
+ padding=padding)
125
+
126
+ def forward(self, x):
127
+ x = F.interpolate(x,
128
+ scale_factor=2,
129
+ mode='bilinear',
130
+ align_corners=True)
131
+ return self.conv(x)
132
+
133
+
134
+ class InpaintGenerator(BaseNetwork):
135
+ def __init__(self, init_weights=True):
136
+ super(InpaintGenerator, self).__init__()
137
+ channel = 256
138
+ hidden = 512
139
+
140
+ # encoder
141
+ self.encoder = Encoder()
142
+
143
+ # decoder
144
+ self.decoder = nn.Sequential(
145
+ deconv(channel // 2, 128, kernel_size=3, padding=1),
146
+ nn.LeakyReLU(0.2, inplace=True),
147
+ nn.Conv2d(128, 64, kernel_size=3, stride=1, padding=1),
148
+ nn.LeakyReLU(0.2, inplace=True),
149
+ deconv(64, 64, kernel_size=3, padding=1),
150
+ nn.LeakyReLU(0.2, inplace=True),
151
+ nn.Conv2d(64, 3, kernel_size=3, stride=1, padding=1))
152
+
153
+ # feature propagation module
154
+ self.feat_prop_module = BidirectionalPropagation(channel // 2)
155
+
156
+ # soft split and soft composition
157
+ kernel_size = (7, 7)
158
+ padding = (3, 3)
159
+ stride = (3, 3)
160
+ output_size = (60, 108)
161
+ t2t_params = {
162
+ 'kernel_size': kernel_size,
163
+ 'stride': stride,
164
+ 'padding': padding
165
+ }
166
+ self.ss = SoftSplit(channel // 2,
167
+ hidden,
168
+ kernel_size,
169
+ stride,
170
+ padding,
171
+ t2t_param=t2t_params)
172
+ self.sc = SoftComp(channel // 2, hidden, kernel_size, stride, padding)
173
+
174
+ n_vecs = 1
175
+ for i, d in enumerate(kernel_size):
176
+ n_vecs *= int((output_size[i] + 2 * padding[i] -
177
+ (d - 1) - 1) / stride[i] + 1)
178
+
179
+ blocks = []
180
+ depths = 8
181
+ num_heads = [4] * depths
182
+ window_size = [(5, 9)] * depths
183
+ focal_windows = [(5, 9)] * depths
184
+ focal_levels = [2] * depths
185
+ pool_method = "fc"
186
+
187
+ for i in range(depths):
188
+ blocks.append(
189
+ TemporalFocalTransformerBlock(dim=hidden,
190
+ num_heads=num_heads[i],
191
+ window_size=window_size[i],
192
+ focal_level=focal_levels[i],
193
+ focal_window=focal_windows[i],
194
+ n_vecs=n_vecs,
195
+ t2t_params=t2t_params,
196
+ pool_method=pool_method))
197
+ self.transformer = nn.Sequential(*blocks)
198
+
199
+ if init_weights:
200
+ self.init_weights()
201
+ # Need to initialize the weights of MSDeformAttn specifically
202
+ for m in self.modules():
203
+ if isinstance(m, SecondOrderDeformableAlignment):
204
+ m.init_offset()
205
+
206
+ # flow completion network
207
+ self.update_spynet = SPyNet()
208
+
209
+ def forward_bidirect_flow(self, masked_local_frames):
210
+ b, l_t, c, h, w = masked_local_frames.size()
211
+
212
+ # compute forward and backward flows of masked frames
213
+ masked_local_frames = F.interpolate(masked_local_frames.view(
214
+ -1, c, h, w),
215
+ scale_factor=1 / 4,
216
+ mode='bilinear',
217
+ align_corners=True,
218
+ recompute_scale_factor=True)
219
+ masked_local_frames = masked_local_frames.view(b, l_t, c, h // 4,
220
+ w // 4)
221
+ mlf_1 = masked_local_frames[:, :-1, :, :, :].reshape(
222
+ -1, c, h // 4, w // 4)
223
+ mlf_2 = masked_local_frames[:, 1:, :, :, :].reshape(
224
+ -1, c, h // 4, w // 4)
225
+ pred_flows_forward = self.update_spynet(mlf_1, mlf_2)
226
+ pred_flows_backward = self.update_spynet(mlf_2, mlf_1)
227
+
228
+ pred_flows_forward = pred_flows_forward.view(b, l_t - 1, 2, h // 4,
229
+ w // 4)
230
+ pred_flows_backward = pred_flows_backward.view(b, l_t - 1, 2, h // 4,
231
+ w // 4)
232
+
233
+ return pred_flows_forward, pred_flows_backward
234
+
235
+ def forward(self, masked_frames, num_local_frames):
236
+ l_t = num_local_frames
237
+ b, t, ori_c, ori_h, ori_w = masked_frames.size()
238
+
239
+ # normalization before feeding into the flow completion module
240
+ masked_local_frames = (masked_frames[:, :l_t, ...] + 1) / 2
241
+ pred_flows = self.forward_bidirect_flow(masked_local_frames)
242
+
243
+ # extracting features and performing the feature propagation on local features
244
+ enc_feat = self.encoder(masked_frames.view(b * t, ori_c, ori_h, ori_w))
245
+ _, c, h, w = enc_feat.size()
246
+ fold_output_size = (h, w)
247
+ local_feat = enc_feat.view(b, t, c, h, w)[:, :l_t, ...]
248
+ ref_feat = enc_feat.view(b, t, c, h, w)[:, l_t:, ...]
249
+ local_feat = self.feat_prop_module(local_feat, pred_flows[0],
250
+ pred_flows[1])
251
+ enc_feat = torch.cat((local_feat, ref_feat), dim=1)
252
+
253
+ # content hallucination through stacking multiple temporal focal transformer blocks
254
+ trans_feat = self.ss(enc_feat.view(-1, c, h, w), b, fold_output_size)
255
+ trans_feat = self.transformer([trans_feat, fold_output_size])
256
+ trans_feat = self.sc(trans_feat[0], t, fold_output_size)
257
+ trans_feat = trans_feat.view(b, t, -1, h, w)
258
+ enc_feat = enc_feat + trans_feat
259
+
260
+ # decode frames from features
261
+ output = self.decoder(enc_feat.view(b * t, c, h, w))
262
+ output = torch.tanh(output)
263
+ return output, pred_flows
264
+
265
+
266
+ # ######################################################################
267
+ # Discriminator for Temporal Patch GAN
268
+ # ######################################################################
269
+
270
+
271
+ class Discriminator(BaseNetwork):
272
+ def __init__(self,
273
+ in_channels=3,
274
+ use_sigmoid=False,
275
+ use_spectral_norm=True,
276
+ init_weights=True):
277
+ super(Discriminator, self).__init__()
278
+ self.use_sigmoid = use_sigmoid
279
+ nf = 32
280
+
281
+ self.conv = nn.Sequential(
282
+ spectral_norm(
283
+ nn.Conv3d(in_channels=in_channels,
284
+ out_channels=nf * 1,
285
+ kernel_size=(3, 5, 5),
286
+ stride=(1, 2, 2),
287
+ padding=1,
288
+ bias=not use_spectral_norm), use_spectral_norm),
289
+ # nn.InstanceNorm2d(64, track_running_stats=False),
290
+ nn.LeakyReLU(0.2, inplace=True),
291
+ spectral_norm(
292
+ nn.Conv3d(nf * 1,
293
+ nf * 2,
294
+ kernel_size=(3, 5, 5),
295
+ stride=(1, 2, 2),
296
+ padding=(1, 2, 2),
297
+ bias=not use_spectral_norm), use_spectral_norm),
298
+ # nn.InstanceNorm2d(128, track_running_stats=False),
299
+ nn.LeakyReLU(0.2, inplace=True),
300
+ spectral_norm(
301
+ nn.Conv3d(nf * 2,
302
+ nf * 4,
303
+ kernel_size=(3, 5, 5),
304
+ stride=(1, 2, 2),
305
+ padding=(1, 2, 2),
306
+ bias=not use_spectral_norm), use_spectral_norm),
307
+ # nn.InstanceNorm2d(256, track_running_stats=False),
308
+ nn.LeakyReLU(0.2, inplace=True),
309
+ spectral_norm(
310
+ nn.Conv3d(nf * 4,
311
+ nf * 4,
312
+ kernel_size=(3, 5, 5),
313
+ stride=(1, 2, 2),
314
+ padding=(1, 2, 2),
315
+ bias=not use_spectral_norm), use_spectral_norm),
316
+ # nn.InstanceNorm2d(256, track_running_stats=False),
317
+ nn.LeakyReLU(0.2, inplace=True),
318
+ spectral_norm(
319
+ nn.Conv3d(nf * 4,
320
+ nf * 4,
321
+ kernel_size=(3, 5, 5),
322
+ stride=(1, 2, 2),
323
+ padding=(1, 2, 2),
324
+ bias=not use_spectral_norm), use_spectral_norm),
325
+ # nn.InstanceNorm2d(256, track_running_stats=False),
326
+ nn.LeakyReLU(0.2, inplace=True),
327
+ nn.Conv3d(nf * 4,
328
+ nf * 4,
329
+ kernel_size=(3, 5, 5),
330
+ stride=(1, 2, 2),
331
+ padding=(1, 2, 2)))
332
+
333
+ if init_weights:
334
+ self.init_weights()
335
+
336
+ def forward(self, xs):
337
+ # T, C, H, W = xs.shape (old)
338
+ # B, T, C, H, W (new)
339
+ xs_t = torch.transpose(xs, 1, 2)
340
+ feat = self.conv(xs_t)
341
+ if self.use_sigmoid:
342
+ feat = torch.sigmoid(feat)
343
+ out = torch.transpose(feat, 1, 2) # B, T, C, H, W
344
+ return out
345
+
346
+
347
+ def spectral_norm(module, mode=True):
348
+ if mode:
349
+ return _spectral_norm(module)
350
+ return module
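The e2fgvi_hq.py variant above derives the fold output size from the encoder feature map at run time (fold_output_size), while the Discriminator defined at the end of the same file scores videos patch-wise with 3D convolutions. A hedged shape sketch of that discriminator follows; the import path and tensor sizes are illustrative assumptions only.

import torch
from model.e2fgvi_hq import Discriminator  # assumed import path

netD = Discriminator(in_channels=3, use_sigmoid=True)
video = torch.randn(2, 6, 3, 240, 432)  # (B, T, C, H, W): real or completed frames
scores = netD(video)                    # transposed internally to (B, C, T, H, W) for Conv3d
print(scores.shape)                     # (2, 6, 128, H', W'): per-patch real/fake scores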
inpainter/model/modules/feat_prop.py ADDED
@@ -0,0 +1,149 @@
1
+ """
2
+ BasicVSR++: Improving Video Super-Resolution with Enhanced Propagation and Alignment, CVPR 2022
3
+ """
4
+ import torch
5
+ import torch.nn as nn
6
+
7
+ from mmcv.ops import ModulatedDeformConv2d, modulated_deform_conv2d
8
+ from mmengine.model import constant_init
9
+
10
+ from model.modules.flow_comp import flow_warp
11
+
12
+
13
+ class SecondOrderDeformableAlignment(ModulatedDeformConv2d):
14
+ """Second-order deformable alignment module."""
15
+ def __init__(self, *args, **kwargs):
16
+ self.max_residue_magnitude = kwargs.pop('max_residue_magnitude', 10)
17
+
18
+ super(SecondOrderDeformableAlignment, self).__init__(*args, **kwargs)
19
+
20
+ self.conv_offset = nn.Sequential(
21
+ nn.Conv2d(3 * self.out_channels + 4, self.out_channels, 3, 1, 1),
22
+ nn.LeakyReLU(negative_slope=0.1, inplace=True),
23
+ nn.Conv2d(self.out_channels, self.out_channels, 3, 1, 1),
24
+ nn.LeakyReLU(negative_slope=0.1, inplace=True),
25
+ nn.Conv2d(self.out_channels, self.out_channels, 3, 1, 1),
26
+ nn.LeakyReLU(negative_slope=0.1, inplace=True),
27
+ nn.Conv2d(self.out_channels, 27 * self.deform_groups, 3, 1, 1),
28
+ )
29
+
30
+ self.init_offset()
31
+
32
+ def init_offset(self):
33
+ constant_init(self.conv_offset[-1], val=0, bias=0)
34
+
35
+ def forward(self, x, extra_feat, flow_1, flow_2):
36
+ extra_feat = torch.cat([extra_feat, flow_1, flow_2], dim=1)
37
+ out = self.conv_offset(extra_feat)
38
+ o1, o2, mask = torch.chunk(out, 3, dim=1)
39
+
40
+ # offset
41
+ offset = self.max_residue_magnitude * torch.tanh(
42
+ torch.cat((o1, o2), dim=1))
43
+ offset_1, offset_2 = torch.chunk(offset, 2, dim=1)
44
+ offset_1 = offset_1 + flow_1.flip(1).repeat(1,
45
+ offset_1.size(1) // 2, 1,
46
+ 1)
47
+ offset_2 = offset_2 + flow_2.flip(1).repeat(1,
48
+ offset_2.size(1) // 2, 1,
49
+ 1)
50
+ offset = torch.cat([offset_1, offset_2], dim=1)
51
+
52
+ # mask
53
+ mask = torch.sigmoid(mask)
54
+
55
+ return modulated_deform_conv2d(x, offset, mask, self.weight, self.bias,
56
+ self.stride, self.padding,
57
+ self.dilation, self.groups,
58
+ self.deform_groups)
59
+
60
+
61
+ class BidirectionalPropagation(nn.Module):
62
+ def __init__(self, channel):
63
+ super(BidirectionalPropagation, self).__init__()
64
+ modules = ['backward_', 'forward_']
65
+ self.deform_align = nn.ModuleDict()
66
+ self.backbone = nn.ModuleDict()
67
+ self.channel = channel
68
+
69
+ for i, module in enumerate(modules):
70
+ self.deform_align[module] = SecondOrderDeformableAlignment(
71
+ 2 * channel, channel, 3, padding=1, deform_groups=16)
72
+
73
+ self.backbone[module] = nn.Sequential(
74
+ nn.Conv2d((2 + i) * channel, channel, 3, 1, 1),
75
+ nn.LeakyReLU(negative_slope=0.1, inplace=True),
76
+ nn.Conv2d(channel, channel, 3, 1, 1),
77
+ )
78
+
79
+ self.fusion = nn.Conv2d(2 * channel, channel, 1, 1, 0)
80
+
81
+ def forward(self, x, flows_backward, flows_forward):
82
+ """
83
+ x shape : [b, t, c, h, w]
84
+ return [b, t, c, h, w]
85
+ """
86
+ b, t, c, h, w = x.shape
87
+ feats = {}
88
+ feats['spatial'] = [x[:, i, :, :, :] for i in range(0, t)]
89
+
90
+ for module_name in ['backward_', 'forward_']:
91
+
92
+ feats[module_name] = []
93
+
94
+ frame_idx = range(0, t)
95
+ flow_idx = range(-1, t - 1)
96
+ mapping_idx = list(range(0, len(feats['spatial'])))
97
+ mapping_idx += mapping_idx[::-1]
98
+
99
+ if 'backward' in module_name:
100
+ frame_idx = frame_idx[::-1]
101
+ flows = flows_backward
102
+ else:
103
+ flows = flows_forward
104
+
105
+ feat_prop = x.new_zeros(b, self.channel, h, w)
106
+ for i, idx in enumerate(frame_idx):
107
+ feat_current = feats['spatial'][mapping_idx[idx]]
108
+
109
+ if i > 0:
110
+ flow_n1 = flows[:, flow_idx[i], :, :, :]
111
+ cond_n1 = flow_warp(feat_prop, flow_n1.permute(0, 2, 3, 1))
112
+
113
+ # initialize second-order features
114
+ feat_n2 = torch.zeros_like(feat_prop)
115
+ flow_n2 = torch.zeros_like(flow_n1)
116
+ cond_n2 = torch.zeros_like(cond_n1)
117
+ if i > 1:
118
+ feat_n2 = feats[module_name][-2]
119
+ flow_n2 = flows[:, flow_idx[i - 1], :, :, :]
120
+ flow_n2 = flow_n1 + flow_warp(
121
+ flow_n2, flow_n1.permute(0, 2, 3, 1))
122
+ cond_n2 = flow_warp(feat_n2,
123
+ flow_n2.permute(0, 2, 3, 1))
124
+
125
+ cond = torch.cat([cond_n1, feat_current, cond_n2], dim=1)
126
+ feat_prop = torch.cat([feat_prop, feat_n2], dim=1)
127
+ feat_prop = self.deform_align[module_name](feat_prop, cond,
128
+ flow_n1,
129
+ flow_n2)
130
+
131
+ feat = [feat_current] + [
132
+ feats[k][idx]
133
+ for k in feats if k not in ['spatial', module_name]
134
+ ] + [feat_prop]
135
+
136
+ feat = torch.cat(feat, dim=1)
137
+ feat_prop = feat_prop + self.backbone[module_name](feat)
138
+ feats[module_name].append(feat_prop)
139
+
140
+ if 'backward' in module_name:
141
+ feats[module_name] = feats[module_name][::-1]
142
+
143
+ outputs = []
144
+ for i in range(0, t):
145
+ align_feats = [feats[k].pop(0) for k in feats if k != 'spatial']
146
+ align_feats = torch.cat(align_feats, dim=1)
147
+ outputs.append(self.fusion(align_feats))
148
+
149
+ return torch.stack(outputs, dim=1) + x
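A hedged shape sketch of BidirectionalPropagation from feat_prop.py above. It assumes an mmcv installation with the ModulatedDeformConv2d op available (a CUDA build, with tensors moved to the GPU, may be required) and mirrors how InpaintGenerator calls the module, passing the two flow directions positionally.

import torch
from model.modules.feat_prop import BidirectionalPropagation  # assumed import path

prop = BidirectionalPropagation(channel=128)
feats = torch.randn(1, 5, 128, 60, 108)  # (b, t, c, h, w) local encoder features
flows_a = torch.randn(1, 4, 2, 60, 108)  # t - 1 flows in one direction
flows_b = torch.randn(1, 4, 2, 60, 108)  # t - 1 flows in the other direction
out = prop(feats, flows_a, flows_b)      # residual output, same shape as feats
print(out.shape)                         # torch.Size([1, 5, 128, 60, 108])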
inpainter/model/modules/flow_comp.py ADDED
@@ -0,0 +1,450 @@
1
+ import numpy as np
2
+
3
+ import torch.nn as nn
4
+ import torch.nn.functional as F
5
+ import torch
6
+
7
+ from mmcv.cnn import ConvModule
8
+ from mmengine.runner import load_checkpoint
9
+
10
+
11
+ class FlowCompletionLoss(nn.Module):
12
+ """Flow completion loss"""
13
+ def __init__(self):
14
+ super().__init__()
15
+ self.fix_spynet = SPyNet()
16
+ for p in self.fix_spynet.parameters():
17
+ p.requires_grad = False
18
+
19
+ self.l1_criterion = nn.L1Loss()
20
+
21
+ def forward(self, pred_flows, gt_local_frames):
22
+ b, l_t, c, h, w = gt_local_frames.size()
23
+
24
+ with torch.no_grad():
25
+ # compute gt forward and backward flows
26
+ gt_local_frames = F.interpolate(gt_local_frames.view(-1, c, h, w),
27
+ scale_factor=1 / 4,
28
+ mode='bilinear',
29
+ align_corners=True,
30
+ recompute_scale_factor=True)
31
+ gt_local_frames = gt_local_frames.view(b, l_t, c, h // 4, w // 4)
32
+ gtlf_1 = gt_local_frames[:, :-1, :, :, :].reshape(
33
+ -1, c, h // 4, w // 4)
34
+ gtlf_2 = gt_local_frames[:, 1:, :, :, :].reshape(
35
+ -1, c, h // 4, w // 4)
36
+ gt_flows_forward = self.fix_spynet(gtlf_1, gtlf_2)
37
+ gt_flows_backward = self.fix_spynet(gtlf_2, gtlf_1)
38
+
39
+ # calculate loss for flow completion
40
+ forward_flow_loss = self.l1_criterion(
41
+ pred_flows[0].view(-1, 2, h // 4, w // 4), gt_flows_forward)
42
+ backward_flow_loss = self.l1_criterion(
43
+ pred_flows[1].view(-1, 2, h // 4, w // 4), gt_flows_backward)
44
+ flow_loss = forward_flow_loss + backward_flow_loss
45
+
46
+ return flow_loss
47
+
48
+
49
+ class SPyNet(nn.Module):
50
+ """SPyNet network structure.
51
+ The difference to the SPyNet in [tof.py] is that
52
+ 1. more SPyNetBasicModule is used in this version, and
53
+ 2. no batch normalization is used in this version.
54
+ Paper:
55
+ Optical Flow Estimation using a Spatial Pyramid Network, CVPR, 2017
56
+ Args:
57
+ pretrained (str): path for pre-trained SPyNet. Default: None.
58
+ """
59
+ def __init__(
60
+ self,
61
+ use_pretrain=True,
62
+ pretrained='https://download.openmmlab.com/mmediting/restorers/basicvsr/spynet_20210409-c6c1bd09.pth'
63
+ ):
64
+ super().__init__()
65
+
66
+ self.basic_module = nn.ModuleList(
67
+ [SPyNetBasicModule() for _ in range(6)])
68
+
69
+ if use_pretrain:
70
+ if isinstance(pretrained, str):
71
+ print("load pretrained SPyNet...")
72
+ load_checkpoint(self, pretrained, strict=True)
73
+ elif pretrained is not None:
74
+ raise TypeError('[pretrained] should be str or None, '
75
+ f'but got {type(pretrained)}.')
76
+
77
+ self.register_buffer(
78
+ 'mean',
79
+ torch.Tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1))
80
+ self.register_buffer(
81
+ 'std',
82
+ torch.Tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1))
83
+
84
+ def compute_flow(self, ref, supp):
85
+ """Compute flow from ref to supp.
86
+ Note that in this function, the images are already resized to a
87
+ multiple of 32.
88
+ Args:
89
+ ref (Tensor): Reference image with shape of (n, 3, h, w).
90
+ supp (Tensor): Supporting image with shape of (n, 3, h, w).
91
+ Returns:
92
+ Tensor: Estimated optical flow: (n, 2, h, w).
93
+ """
94
+ n, _, h, w = ref.size()
95
+
96
+ # normalize the input images
97
+ ref = [(ref - self.mean) / self.std]
98
+ supp = [(supp - self.mean) / self.std]
99
+
100
+ # generate downsampled frames
101
+ for level in range(5):
102
+ ref.append(
103
+ F.avg_pool2d(input=ref[-1],
104
+ kernel_size=2,
105
+ stride=2,
106
+ count_include_pad=False))
107
+ supp.append(
108
+ F.avg_pool2d(input=supp[-1],
109
+ kernel_size=2,
110
+ stride=2,
111
+ count_include_pad=False))
112
+ ref = ref[::-1]
113
+ supp = supp[::-1]
114
+
115
+ # flow computation
116
+ flow = ref[0].new_zeros(n, 2, h // 32, w // 32)
117
+ for level in range(len(ref)):
118
+ if level == 0:
119
+ flow_up = flow
120
+ else:
121
+ flow_up = F.interpolate(input=flow,
122
+ scale_factor=2,
123
+ mode='bilinear',
124
+ align_corners=True) * 2.0
125
+
126
+ # add the residue to the upsampled flow
127
+ flow = flow_up + self.basic_module[level](torch.cat([
128
+ ref[level],
129
+ flow_warp(supp[level],
130
+ flow_up.permute(0, 2, 3, 1).contiguous(),
131
+ padding_mode='border'), flow_up
132
+ ], 1))
133
+
134
+ return flow
135
+
136
+ def forward(self, ref, supp):
137
+ """Forward function of SPyNet.
138
+ This function computes the optical flow from ref to supp.
139
+ Args:
140
+ ref (Tensor): Reference image with shape of (n, 3, h, w).
141
+ supp (Tensor): Supporting image with shape of (n, 3, h, w).
142
+ Returns:
143
+ Tensor: Estimated optical flow: (n, 2, h, w).
144
+ """
145
+
146
+ # upsize to a multiple of 32
147
+ h, w = ref.shape[2:4]
148
+ w_up = w if (w % 32) == 0 else 32 * (w // 32 + 1)
149
+ h_up = h if (h % 32) == 0 else 32 * (h // 32 + 1)
150
+ ref = F.interpolate(input=ref,
151
+ size=(h_up, w_up),
152
+ mode='bilinear',
153
+ align_corners=False)
154
+ supp = F.interpolate(input=supp,
155
+ size=(h_up, w_up),
156
+ mode='bilinear',
157
+ align_corners=False)
158
+
159
+ # compute flow, and resize back to the original resolution
160
+ flow = F.interpolate(input=self.compute_flow(ref, supp),
161
+ size=(h, w),
162
+ mode='bilinear',
163
+ align_corners=False)
164
+
165
+ # adjust the flow values
166
+ flow[:, 0, :, :] *= float(w) / float(w_up)
167
+ flow[:, 1, :, :] *= float(h) / float(h_up)
168
+
169
+ return flow
170
+
171
+
172
+ class SPyNetBasicModule(nn.Module):
173
+ """Basic Module for SPyNet.
174
+ Paper:
175
+ Optical Flow Estimation using a Spatial Pyramid Network, CVPR, 2017
176
+ """
177
+ def __init__(self):
178
+ super().__init__()
179
+
180
+ self.basic_module = nn.Sequential(
181
+ ConvModule(in_channels=8,
182
+ out_channels=32,
183
+ kernel_size=7,
184
+ stride=1,
185
+ padding=3,
186
+ norm_cfg=None,
187
+ act_cfg=dict(type='ReLU')),
188
+ ConvModule(in_channels=32,
189
+ out_channels=64,
190
+ kernel_size=7,
191
+ stride=1,
192
+ padding=3,
193
+ norm_cfg=None,
194
+ act_cfg=dict(type='ReLU')),
195
+ ConvModule(in_channels=64,
196
+ out_channels=32,
197
+ kernel_size=7,
198
+ stride=1,
199
+ padding=3,
200
+ norm_cfg=None,
201
+ act_cfg=dict(type='ReLU')),
202
+ ConvModule(in_channels=32,
203
+ out_channels=16,
204
+ kernel_size=7,
205
+ stride=1,
206
+ padding=3,
207
+ norm_cfg=None,
208
+ act_cfg=dict(type='ReLU')),
209
+ ConvModule(in_channels=16,
210
+ out_channels=2,
211
+ kernel_size=7,
212
+ stride=1,
213
+ padding=3,
214
+ norm_cfg=None,
215
+ act_cfg=None))
216
+
217
+ def forward(self, tensor_input):
218
+ """
219
+ Args:
220
+ tensor_input (Tensor): Input tensor with shape (b, 8, h, w).
221
+ 8 channels contain:
222
+ [reference image (3), neighbor image (3), initial flow (2)].
223
+ Returns:
224
+ Tensor: Refined flow with shape (b, 2, h, w)
225
+ """
226
+ return self.basic_module(tensor_input)
227
+
228
+
229
+ # Flow visualization code used from https://github.com/tomrunia/OpticalFlow_Visualization
230
+ def make_colorwheel():
231
+ """
232
+ Generates a color wheel for optical flow visualization as presented in:
233
+ Baker et al. "A Database and Evaluation Methodology for Optical Flow" (ICCV, 2007)
234
+ URL: http://vision.middlebury.edu/flow/flowEval-iccv07.pdf
235
+
236
+ Code follows the original C++ source code of Daniel Scharstein.
237
+ Code follows the the Matlab source code of Deqing Sun.
238
+
239
+ Returns:
240
+ np.ndarray: Color wheel
241
+ """
242
+
243
+ RY = 15
244
+ YG = 6
245
+ GC = 4
246
+ CB = 11
247
+ BM = 13
248
+ MR = 6
249
+
250
+ ncols = RY + YG + GC + CB + BM + MR
251
+ colorwheel = np.zeros((ncols, 3))
252
+ col = 0
253
+
254
+ # RY
255
+ colorwheel[0:RY, 0] = 255
256
+ colorwheel[0:RY, 1] = np.floor(255 * np.arange(0, RY) / RY)
257
+ col = col + RY
258
+ # YG
259
+ colorwheel[col:col + YG, 0] = 255 - np.floor(255 * np.arange(0, YG) / YG)
260
+ colorwheel[col:col + YG, 1] = 255
261
+ col = col + YG
262
+ # GC
263
+ colorwheel[col:col + GC, 1] = 255
264
+ colorwheel[col:col + GC, 2] = np.floor(255 * np.arange(0, GC) / GC)
265
+ col = col + GC
266
+ # CB
267
+ colorwheel[col:col + CB, 1] = 255 - np.floor(255 * np.arange(CB) / CB)
268
+ colorwheel[col:col + CB, 2] = 255
269
+ col = col + CB
270
+ # BM
271
+ colorwheel[col:col + BM, 2] = 255
272
+ colorwheel[col:col + BM, 0] = np.floor(255 * np.arange(0, BM) / BM)
273
+ col = col + BM
274
+ # MR
275
+ colorwheel[col:col + MR, 2] = 255 - np.floor(255 * np.arange(MR) / MR)
276
+ colorwheel[col:col + MR, 0] = 255
277
+ return colorwheel
278
+
279
+
280
+ def flow_uv_to_colors(u, v, convert_to_bgr=False):
281
+ """
282
+ Applies the flow color wheel to (possibly clipped) flow components u and v.
283
+
284
+ According to the C++ source code of Daniel Scharstein
285
+ According to the Matlab source code of Deqing Sun
286
+
287
+ Args:
288
+ u (np.ndarray): Input horizontal flow of shape [H,W]
289
+ v (np.ndarray): Input vertical flow of shape [H,W]
290
+ convert_to_bgr (bool, optional): Convert output image to BGR. Defaults to False.
291
+
292
+ Returns:
293
+ np.ndarray: Flow visualization image of shape [H,W,3]
294
+ """
295
+ flow_image = np.zeros((u.shape[0], u.shape[1], 3), np.uint8)
296
+ colorwheel = make_colorwheel() # shape [55x3]
297
+ ncols = colorwheel.shape[0]
298
+ rad = np.sqrt(np.square(u) + np.square(v))
299
+ a = np.arctan2(-v, -u) / np.pi
300
+ fk = (a + 1) / 2 * (ncols - 1)
301
+ k0 = np.floor(fk).astype(np.int32)
302
+ k1 = k0 + 1
303
+ k1[k1 == ncols] = 0
304
+ f = fk - k0
305
+ for i in range(colorwheel.shape[1]):
306
+ tmp = colorwheel[:, i]
307
+ col0 = tmp[k0] / 255.0
308
+ col1 = tmp[k1] / 255.0
309
+ col = (1 - f) * col0 + f * col1
310
+ idx = (rad <= 1)
311
+ col[idx] = 1 - rad[idx] * (1 - col[idx])
312
+ col[~idx] = col[~idx] * 0.75 # out of range
313
+ # Note the 2-i => BGR instead of RGB
314
+ ch_idx = 2 - i if convert_to_bgr else i
315
+ flow_image[:, :, ch_idx] = np.floor(255 * col)
316
+ return flow_image
317
+
318
+
319
+ def flow_to_image(flow_uv, clip_flow=None, convert_to_bgr=False):
320
+ """
321
+ Expects a two dimensional flow image of shape.
322
+
323
+ Args:
324
+ flow_uv (np.ndarray): Flow UV image of shape [H,W,2]
325
+ clip_flow (float, optional): Clip maximum of flow values. Defaults to None.
326
+ convert_to_bgr (bool, optional): Convert output image to BGR. Defaults to False.
327
+
328
+ Returns:
329
+ np.ndarray: Flow visualization image of shape [H,W,3]
330
+ """
331
+ assert flow_uv.ndim == 3, 'input flow must have three dimensions'
332
+ assert flow_uv.shape[2] == 2, 'input flow must have shape [H,W,2]'
333
+ if clip_flow is not None:
334
+ flow_uv = np.clip(flow_uv, 0, clip_flow)
335
+ u = flow_uv[:, :, 0]
336
+ v = flow_uv[:, :, 1]
337
+ rad = np.sqrt(np.square(u) + np.square(v))
338
+ rad_max = np.max(rad)
339
+ epsilon = 1e-5
340
+ u = u / (rad_max + epsilon)
341
+ v = v / (rad_max + epsilon)
342
+ return flow_uv_to_colors(u, v, convert_to_bgr)
343
+
344
+
345
+ def flow_warp(x,
346
+ flow,
347
+ interpolation='bilinear',
348
+ padding_mode='zeros',
349
+ align_corners=True):
350
+ """Warp an image or a feature map with optical flow.
351
+ Args:
352
+ x (Tensor): Tensor with size (n, c, h, w).
353
+ flow (Tensor): Tensor with size (n, h, w, 2). The last dimension is
354
+ a two-channel, denoting the width and height relative offsets.
355
+ Note that the values are not normalized to [-1, 1].
356
+ interpolation (str): Interpolation mode: 'nearest' or 'bilinear'.
357
+ Default: 'bilinear'.
358
+ padding_mode (str): Padding mode: 'zeros' or 'border' or 'reflection'.
359
+ Default: 'zeros'.
360
+ align_corners (bool): Whether align corners. Default: True.
361
+ Returns:
362
+ Tensor: Warped image or feature map.
363
+ """
364
+ if x.size()[-2:] != flow.size()[1:3]:
365
+ raise ValueError(f'The spatial sizes of input ({x.size()[-2:]}) and '
366
+ f'flow ({flow.size()[1:3]}) are not the same.')
367
+ _, _, h, w = x.size()
368
+ # create mesh grid
369
+ grid_y, grid_x = torch.meshgrid(torch.arange(0, h), torch.arange(0, w))
370
+ grid = torch.stack((grid_x, grid_y), 2).type_as(x) # (w, h, 2)
371
+ grid.requires_grad = False
372
+
373
+ grid_flow = grid + flow
374
+ # scale grid_flow to [-1,1]
375
+ grid_flow_x = 2.0 * grid_flow[:, :, :, 0] / max(w - 1, 1) - 1.0
376
+ grid_flow_y = 2.0 * grid_flow[:, :, :, 1] / max(h - 1, 1) - 1.0
377
+ grid_flow = torch.stack((grid_flow_x, grid_flow_y), dim=3)
378
+ output = F.grid_sample(x,
379
+ grid_flow,
380
+ mode=interpolation,
381
+ padding_mode=padding_mode,
382
+ align_corners=align_corners)
383
+ return output
384
+
385
+
386
+ def initial_mask_flow(mask):
387
+ """
388
+ mask 1 indicates valid pixel 0 indicates unknown pixel
389
+ """
390
+ B, T, C, H, W = mask.shape
391
+
392
+ # calculate relative position
393
+ grid_y, grid_x = torch.meshgrid(torch.arange(0, H), torch.arange(0, W))
394
+
395
+ grid_y, grid_x = grid_y.type_as(mask), grid_x.type_as(mask)
396
+ abs_relative_pos_y = H - torch.abs(grid_y[None, :, :] - grid_y[:, None, :])
397
+ relative_pos_y = H - (grid_y[None, :, :] - grid_y[:, None, :])
398
+
399
+ abs_relative_pos_x = W - torch.abs(grid_x[:, None, :] - grid_x[:, :, None])
400
+ relative_pos_x = W - (grid_x[:, None, :] - grid_x[:, :, None])
401
+
402
+ # calculate the nearest indices
403
+ pos_up = mask.unsqueeze(3).repeat(
404
+ 1, 1, 1, H, 1, 1).flip(4) * abs_relative_pos_y[None, None, None] * (
405
+ relative_pos_y <= H)[None, None, None]
406
+ nearest_indice_up = pos_up.max(dim=4)[1]
407
+
408
+ pos_down = mask.unsqueeze(3).repeat(1, 1, 1, H, 1, 1) * abs_relative_pos_y[
409
+ None, None, None] * (relative_pos_y <= H)[None, None, None]
410
+ nearest_indice_down = (pos_down).max(dim=4)[1]
411
+
412
+ pos_left = mask.unsqueeze(4).repeat(
413
+ 1, 1, 1, 1, W, 1).flip(5) * abs_relative_pos_x[None, None, None] * (
414
+ relative_pos_x <= W)[None, None, None]
415
+ nearest_indice_left = (pos_left).max(dim=5)[1]
416
+
417
+ pos_right = mask.unsqueeze(4).repeat(
418
+ 1, 1, 1, 1, W, 1) * abs_relative_pos_x[None, None, None] * (
419
+ relative_pos_x <= W)[None, None, None]
420
+ nearest_indice_right = (pos_right).max(dim=5)[1]
421
+
422
+ # NOTE: IMPORTANT !!! depending on how to use this offset
423
+ initial_offset_up = -(nearest_indice_up - grid_y[None, None, None]).flip(3)
424
+ initial_offset_down = nearest_indice_down - grid_y[None, None, None]
425
+
426
+ initial_offset_left = -(nearest_indice_left -
427
+ grid_x[None, None, None]).flip(4)
428
+ initial_offset_right = nearest_indice_right - grid_x[None, None, None]
429
+
430
+ # nearest_indice_x = (mask.unsqueeze(1).repeat(1, img_width, 1) * relative_pos_x).max(dim=2)[1]
431
+ # initial_offset_x = nearest_indice_x - grid_x
432
+
433
+ # handle the boundary cases
434
+ final_offset_down = (initial_offset_down < 0) * initial_offset_up + (
435
+ initial_offset_down > 0) * initial_offset_down
436
+ final_offset_up = (initial_offset_up > 0) * initial_offset_down + (
437
+ initial_offset_up < 0) * initial_offset_up
438
+ final_offset_right = (initial_offset_right < 0) * initial_offset_left + (
439
+ initial_offset_right > 0) * initial_offset_right
440
+ final_offset_left = (initial_offset_left > 0) * initial_offset_right + (
441
+ initial_offset_left < 0) * initial_offset_left
442
+ zero_offset = torch.zeros_like(final_offset_down)
443
+ # out = torch.cat([final_offset_left, zero_offset, final_offset_right, zero_offset, zero_offset, final_offset_up, zero_offset, final_offset_down], dim=2)
444
+ out = torch.cat([
445
+ zero_offset, final_offset_left, zero_offset, final_offset_right,
446
+ final_offset_up, zero_offset, final_offset_down, zero_offset
447
+ ],
448
+ dim=2)
449
+
450
+ return out
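A hedged sketch tying the flow_comp.py utilities together: estimate flow with SPyNet, warp with flow_warp, and visualize with flow_to_image. use_pretrain=False skips the checkpoint download purely for illustration; meaningful flow requires the pretrained weights, and the import path is an assumption.

import torch
from model.modules.flow_comp import SPyNet, flow_warp, flow_to_image  # assumed import path

spynet = SPyNet(use_pretrain=False)  # random weights here; True loads the OpenMMLab checkpoint
ref = torch.rand(1, 3, 60, 108)
supp = torch.rand(1, 3, 60, 108)
with torch.no_grad():
    flow = spynet(ref, supp)                            # (1, 2, 60, 108): flow from ref to supp
    warped = flow_warp(supp, flow.permute(0, 2, 3, 1))  # flow_warp expects flow as (n, h, w, 2)
vis = flow_to_image(flow[0].permute(1, 2, 0).numpy())   # (60, 108, 3) uint8 visualization
print(warped.shape, vis.shape)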
inpainter/model/modules/spectral_norm.py ADDED
@@ -0,0 +1,288 @@
1
+ """
2
+ Spectral Normalization from https://arxiv.org/abs/1802.05957
3
+ """
4
+ import torch
5
+ from torch.nn.functional import normalize
6
+
7
+
8
+ class SpectralNorm(object):
9
+ # Invariant before and after each forward call:
10
+ # u = normalize(W @ v)
11
+ # NB: At initialization, this invariant is not enforced
12
+
13
+ _version = 1
14
+
15
+ # At version 1:
16
+ # made `W` not a buffer,
17
+ # added `v` as a buffer, and
18
+ # made eval mode use `W = u @ W_orig @ v` rather than the stored `W`.
19
+
20
+ def __init__(self, name='weight', n_power_iterations=1, dim=0, eps=1e-12):
21
+ self.name = name
22
+ self.dim = dim
23
+ if n_power_iterations <= 0:
24
+ raise ValueError(
25
+ 'Expected n_power_iterations to be positive, but '
26
+ 'got n_power_iterations={}'.format(n_power_iterations))
27
+ self.n_power_iterations = n_power_iterations
28
+ self.eps = eps
29
+
30
+ def reshape_weight_to_matrix(self, weight):
31
+ weight_mat = weight
32
+ if self.dim != 0:
33
+ # permute dim to front
34
+ weight_mat = weight_mat.permute(
35
+ self.dim,
36
+ *[d for d in range(weight_mat.dim()) if d != self.dim])
37
+ height = weight_mat.size(0)
38
+ return weight_mat.reshape(height, -1)
39
+
40
+ def compute_weight(self, module, do_power_iteration):
41
+ # NB: If `do_power_iteration` is set, the `u` and `v` vectors are
42
+ # updated in power iteration **in-place**. This is very important
43
+ # because in `DataParallel` forward, the vectors (being buffers) are
44
+ # broadcast from the parallelized module to each module replica,
45
+ # which is a new module object created on the fly. And each replica
46
+ # runs its own spectral norm power iteration. So simply assigning
47
+ # the updated vectors to the module this function runs on will cause
48
+ # the update to be lost forever. And the next time the parallelized
49
+ # module is replicated, the same randomly initialized vectors are
50
+ # broadcast and used!
51
+ #
52
+ # Therefore, to make the change propagate back, we rely on two
53
+ # important behaviors (also enforced via tests):
54
+ # 1. `DataParallel` doesn't clone storage if the broadcast tensor
55
+ # is already on correct device; and it makes sure that the
56
+ # parallelized module is already on `device[0]`.
57
+ # 2. If the out tensor in `out=` kwarg has correct shape, it will
58
+ # just fill in the values.
59
+ # Therefore, since the same power iteration is performed on all
60
+ # devices, simply updating the tensors in-place will make sure that
61
+ # the module replica on `device[0]` will update the _u vector on the
62
+ # parallized module (by shared storage).
63
+ #
64
+ # However, after we update `u` and `v` in-place, we need to **clone**
65
+ # them before using them to normalize the weight. This is to support
66
+ # backproping through two forward passes, e.g., the common pattern in
67
+ # GAN training: loss = D(real) - D(fake). Otherwise, engine will
68
+ # complain that variables needed to do backward for the first forward
69
+ # (i.e., the `u` and `v` vectors) are changed in the second forward.
70
+ weight = getattr(module, self.name + '_orig')
71
+ u = getattr(module, self.name + '_u')
72
+ v = getattr(module, self.name + '_v')
73
+ weight_mat = self.reshape_weight_to_matrix(weight)
74
+
75
+ if do_power_iteration:
76
+ with torch.no_grad():
77
+ for _ in range(self.n_power_iterations):
78
+ # Spectral norm of weight equals to `u^T W v`, where `u` and `v`
79
+ # are the first left and right singular vectors.
80
+ # This power iteration produces approximations of `u` and `v`.
81
+ v = normalize(torch.mv(weight_mat.t(), u),
82
+ dim=0,
83
+ eps=self.eps,
84
+ out=v)
85
+ u = normalize(torch.mv(weight_mat, v),
86
+ dim=0,
87
+ eps=self.eps,
88
+ out=u)
89
+ if self.n_power_iterations > 0:
90
+ # See above on why we need to clone
91
+ u = u.clone()
92
+ v = v.clone()
93
+
94
+ sigma = torch.dot(u, torch.mv(weight_mat, v))
95
+ weight = weight / sigma
96
+ return weight
97
+
98
+ def remove(self, module):
99
+ with torch.no_grad():
100
+ weight = self.compute_weight(module, do_power_iteration=False)
101
+ delattr(module, self.name)
102
+ delattr(module, self.name + '_u')
103
+ delattr(module, self.name + '_v')
104
+ delattr(module, self.name + '_orig')
105
+ module.register_parameter(self.name,
106
+ torch.nn.Parameter(weight.detach()))
107
+
108
+ def __call__(self, module, inputs):
109
+ setattr(
110
+ module, self.name,
111
+ self.compute_weight(module, do_power_iteration=module.training))
112
+
113
+ def _solve_v_and_rescale(self, weight_mat, u, target_sigma):
114
+ # Tries to returns a vector `v` s.t. `u = normalize(W @ v)`
115
+ # (the invariant at top of this class) and `u @ W @ v = sigma`.
116
+ # This uses pinverse in case W^T W is not invertible.
117
+ v = torch.chain_matmul(weight_mat.t().mm(weight_mat).pinverse(),
118
+ weight_mat.t(), u.unsqueeze(1)).squeeze(1)
119
+ return v.mul_(target_sigma / torch.dot(u, torch.mv(weight_mat, v)))
120
+
121
+ @staticmethod
122
+ def apply(module, name, n_power_iterations, dim, eps):
123
+ for k, hook in module._forward_pre_hooks.items():
124
+ if isinstance(hook, SpectralNorm) and hook.name == name:
125
+ raise RuntimeError(
126
+ "Cannot register two spectral_norm hooks on "
127
+ "the same parameter {}".format(name))
128
+
129
+ fn = SpectralNorm(name, n_power_iterations, dim, eps)
130
+ weight = module._parameters[name]
131
+
132
+ with torch.no_grad():
133
+ weight_mat = fn.reshape_weight_to_matrix(weight)
134
+
135
+ h, w = weight_mat.size()
136
+ # randomly initialize `u` and `v`
137
+ u = normalize(weight.new_empty(h).normal_(0, 1), dim=0, eps=fn.eps)
138
+ v = normalize(weight.new_empty(w).normal_(0, 1), dim=0, eps=fn.eps)
139
+
140
+ delattr(module, fn.name)
141
+ module.register_parameter(fn.name + "_orig", weight)
142
+ # We still need to assign weight back as fn.name because all sorts of
143
+ # things may assume that it exists, e.g., when initializing weights.
144
+ # However, we can't directly assign as it could be an nn.Parameter and
145
+ # gets added as a parameter. Instead, we register weight.data as a plain
146
+ # attribute.
147
+ setattr(module, fn.name, weight.data)
148
+ module.register_buffer(fn.name + "_u", u)
149
+ module.register_buffer(fn.name + "_v", v)
150
+
151
+ module.register_forward_pre_hook(fn)
152
+
153
+ module._register_state_dict_hook(SpectralNormStateDictHook(fn))
154
+ module._register_load_state_dict_pre_hook(
155
+ SpectralNormLoadStateDictPreHook(fn))
156
+ return fn
157
+
158
+
159
+ # This is a top level class because Py2 pickle doesn't like inner class nor an
160
+ # instancemethod.
161
+ class SpectralNormLoadStateDictPreHook(object):
162
+ # See docstring of SpectralNorm._version on the changes to spectral_norm.
163
+ def __init__(self, fn):
164
+ self.fn = fn
165
+
166
+ # For state_dict with version None, (assuming that it has gone through at
167
+ # least one training forward), we have
168
+ #
169
+ # u = normalize(W_orig @ v)
170
+ # W = W_orig / sigma, where sigma = u @ W_orig @ v
171
+ #
172
+ # To compute `v`, we solve `W_orig @ x = u`, and let
173
+ # v = x / (u @ W_orig @ x) * (W / W_orig).
174
+ def __call__(self, state_dict, prefix, local_metadata, strict,
175
+ missing_keys, unexpected_keys, error_msgs):
176
+ fn = self.fn
177
+ version = local_metadata.get('spectral_norm',
178
+ {}).get(fn.name + '.version', None)
179
+ if version is None or version < 1:
180
+ with torch.no_grad():
181
+ weight_orig = state_dict[prefix + fn.name + '_orig']
182
+ # weight = state_dict.pop(prefix + fn.name)
183
+ # sigma = (weight_orig / weight).mean()
184
+ weight_mat = fn.reshape_weight_to_matrix(weight_orig)
185
+ u = state_dict[prefix + fn.name + '_u']
186
+ # v = fn._solve_v_and_rescale(weight_mat, u, sigma)
187
+ # state_dict[prefix + fn.name + '_v'] = v
188
+
189
+
190
+ # This is a top level class because Py2 pickle doesn't like inner class nor an
191
+ # instancemethod.
192
+ class SpectralNormStateDictHook(object):
193
+ # See docstring of SpectralNorm._version on the changes to spectral_norm.
194
+ def __init__(self, fn):
195
+ self.fn = fn
196
+
197
+ def __call__(self, module, state_dict, prefix, local_metadata):
198
+ if 'spectral_norm' not in local_metadata:
199
+ local_metadata['spectral_norm'] = {}
200
+ key = self.fn.name + '.version'
201
+ if key in local_metadata['spectral_norm']:
202
+ raise RuntimeError(
203
+ "Unexpected key in metadata['spectral_norm']: {}".format(key))
204
+ local_metadata['spectral_norm'][key] = self.fn._version
205
+
206
+
207
+ def spectral_norm(module,
208
+ name='weight',
209
+ n_power_iterations=1,
210
+ eps=1e-12,
211
+ dim=None):
212
+ r"""Applies spectral normalization to a parameter in the given module.
213
+
214
+ .. math::
215
+ \mathbf{W}_{SN} = \dfrac{\mathbf{W}}{\sigma(\mathbf{W})},
216
+ \sigma(\mathbf{W}) = \max_{\mathbf{h}: \mathbf{h} \ne 0} \dfrac{\|\mathbf{W} \mathbf{h}\|_2}{\|\mathbf{h}\|_2}
217
+
218
+ Spectral normalization stabilizes the training of discriminators (critics)
219
+ in Generative Adversarial Networks (GANs) by rescaling the weight tensor
220
+ with spectral norm :math:`\sigma` of the weight matrix calculated using
221
+ power iteration method. If the dimension of the weight tensor is greater
222
+ than 2, it is reshaped to 2D in power iteration method to get spectral
223
+ norm. This is implemented via a hook that calculates spectral norm and
224
+ rescales weight before every :meth:`~Module.forward` call.
225
+
226
+ See `Spectral Normalization for Generative Adversarial Networks`_ .
227
+
228
+ .. _`Spectral Normalization for Generative Adversarial Networks`: https://arxiv.org/abs/1802.05957
229
+
230
+ Args:
231
+ module (nn.Module): containing module
232
+ name (str, optional): name of weight parameter
233
+ n_power_iterations (int, optional): number of power iterations to
234
+ calculate spectral norm
235
+ eps (float, optional): epsilon for numerical stability in
236
+ calculating norms
237
+ dim (int, optional): dimension corresponding to number of outputs,
238
+ the default is ``0``, except for modules that are instances of
239
+ ConvTranspose{1,2,3}d, when it is ``1``
240
+
241
+ Returns:
242
+ The original module with the spectral norm hook
243
+
244
+ Example::
245
+
246
+ >>> m = spectral_norm(nn.Linear(20, 40))
247
+ >>> m
248
+ Linear(in_features=20, out_features=40, bias=True)
249
+ >>> m.weight_u.size()
250
+ torch.Size([40])
251
+
252
+ """
253
+ if dim is None:
254
+ if isinstance(module,
255
+ (torch.nn.ConvTranspose1d, torch.nn.ConvTranspose2d,
256
+ torch.nn.ConvTranspose3d)):
257
+ dim = 1
258
+ else:
259
+ dim = 0
260
+ SpectralNorm.apply(module, name, n_power_iterations, dim, eps)
261
+ return module
262
+
263
+
264
+ def remove_spectral_norm(module, name='weight'):
265
+ r"""Removes the spectral normalization reparameterization from a module.
266
+
267
+ Args:
268
+ module (Module): containing module
269
+ name (str, optional): name of weight parameter
270
+
271
+ Example:
272
+ >>> m = spectral_norm(nn.Linear(40, 10))
273
+ >>> remove_spectral_norm(m)
274
+ """
275
+ for k, hook in module._forward_pre_hooks.items():
276
+ if isinstance(hook, SpectralNorm) and hook.name == name:
277
+ hook.remove(module)
278
+ del module._forward_pre_hooks[k]
279
+ return module
280
+
281
+ raise ValueError("spectral_norm of '{}' not found in {}".format(
282
+ name, module))
283
+
284
+
285
+ def use_spectral_norm(module, use_sn=False):
286
+ if use_sn:
287
+ return spectral_norm(module)
288
+ return module
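A brief, hedged sketch of the spectral_norm helpers defined above, mirroring how the Discriminator wraps its Conv3d layers; the module choice and tensor sizes are illustrative only.

import torch
import torch.nn as nn
from model.modules.spectral_norm import spectral_norm, remove_spectral_norm  # assumed import path

conv = spectral_norm(nn.Conv3d(3, 32, kernel_size=(3, 5, 5), stride=(1, 2, 2),
                               padding=(1, 2, 2), bias=False))
x = torch.randn(1, 3, 6, 64, 64)   # (B, C, T, H, W)
y = conv(x)                        # weight is rescaled by its spectral norm on every forward
print(hasattr(conv, 'weight_u'))   # True: power-iteration buffer registered by the hook
conv = remove_spectral_norm(conv)  # strips the reparameterization again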
inpainter/model/modules/tfocal_transformer.py ADDED
@@ -0,0 +1,536 @@
1
+ """
2
+ This code is based on:
3
+ [1] FuseFormer: Fusing Fine-Grained Information in Transformers for Video Inpainting, ICCV 2021
4
+ https://github.com/ruiliu-ai/FuseFormer
5
+ [2] Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet, ICCV 2021
6
+ https://github.com/yitu-opensource/T2T-ViT
7
+ [3] Focal Self-attention for Local-Global Interactions in Vision Transformers, NeurIPS 2021
8
+ https://github.com/microsoft/Focal-Transformer
9
+ """
10
+
11
+ import math
12
+ from functools import reduce
13
+
14
+ import torch
15
+ import torch.nn as nn
16
+ import torch.nn.functional as F
17
+
18
+
19
+ class SoftSplit(nn.Module):
20
+ def __init__(self, channel, hidden, kernel_size, stride, padding,
21
+ t2t_param):
22
+ super(SoftSplit, self).__init__()
23
+ self.kernel_size = kernel_size
24
+ self.t2t = nn.Unfold(kernel_size=kernel_size,
25
+ stride=stride,
26
+ padding=padding)
27
+ c_in = reduce((lambda x, y: x * y), kernel_size) * channel
28
+ self.embedding = nn.Linear(c_in, hidden)
29
+
30
+ self.f_h = int(
31
+ (t2t_param['output_size'][0] + 2 * t2t_param['padding'][0] -
32
+ (t2t_param['kernel_size'][0] - 1) - 1) / t2t_param['stride'][0] +
33
+ 1)
34
+ self.f_w = int(
35
+ (t2t_param['output_size'][1] + 2 * t2t_param['padding'][1] -
36
+ (t2t_param['kernel_size'][1] - 1) - 1) / t2t_param['stride'][1] +
37
+ 1)
38
+
39
+ def forward(self, x, b):
40
+ feat = self.t2t(x)
41
+ feat = feat.permute(0, 2, 1)
42
+ # feat shape [b*t, num_vec, ks*ks*c]
43
+ feat = self.embedding(feat)
44
+ # feat shape after embedding [b, t*num_vec, hidden]
45
+ feat = feat.view(b, -1, self.f_h, self.f_w, feat.size(2))
46
+ return feat
47
+
48
+
49
+ class SoftComp(nn.Module):
50
+ def __init__(self, channel, hidden, output_size, kernel_size, stride,
51
+ padding):
52
+ super(SoftComp, self).__init__()
53
+ self.relu = nn.LeakyReLU(0.2, inplace=True)
54
+ c_out = reduce((lambda x, y: x * y), kernel_size) * channel
55
+ self.embedding = nn.Linear(hidden, c_out)
56
+ self.t2t = torch.nn.Fold(output_size=output_size,
57
+ kernel_size=kernel_size,
58
+ stride=stride,
59
+ padding=padding)
60
+ h, w = output_size
61
+ self.bias = nn.Parameter(torch.zeros((channel, h, w),
62
+ dtype=torch.float32),
63
+ requires_grad=True)
64
+
65
+ def forward(self, x, t):
66
+ b_, _, _, _, c_ = x.shape
67
+ x = x.view(b_, -1, c_)
68
+ feat = self.embedding(x)
69
+ b, _, c = feat.size()
70
+ feat = feat.view(b * t, -1, c).permute(0, 2, 1)
71
+ feat = self.t2t(feat) + self.bias[None]
72
+ return feat
73
+
74
+
75
+ class FusionFeedForward(nn.Module):
76
+ def __init__(self, d_model, n_vecs=None, t2t_params=None):
77
+ super(FusionFeedForward, self).__init__()
78
+ # We set d_ff as a default to 1960
79
+ hd = 1960
80
+ self.conv1 = nn.Sequential(nn.Linear(d_model, hd))
81
+ self.conv2 = nn.Sequential(nn.GELU(), nn.Linear(hd, d_model))
82
+ assert t2t_params is not None and n_vecs is not None
83
+ tp = t2t_params.copy()
84
+ self.fold = nn.Fold(**tp)
85
+ del tp['output_size']
86
+ self.unfold = nn.Unfold(**tp)
87
+ self.n_vecs = n_vecs
88
+
89
+ def forward(self, x):
90
+ x = self.conv1(x)
91
+ b, n, c = x.size()
92
+ normalizer = x.new_ones(b, n, 49).view(-1, self.n_vecs,
93
+ 49).permute(0, 2, 1)
94
+ x = self.unfold(
95
+ self.fold(x.view(-1, self.n_vecs, c).permute(0, 2, 1)) /
96
+ self.fold(normalizer)).permute(0, 2, 1).contiguous().view(b, n, c)
97
+ x = self.conv2(x)
98
+ return x
99
+
100
+
101
+ def window_partition(x, window_size):
102
+ """
103
+ Args:
104
+ x: shape is (B, T, H, W, C)
105
+ window_size (tuple[int]): window size
106
+ Returns:
107
+ windows: (B*num_windows, T*window_size*window_size, C)
108
+ """
109
+ B, T, H, W, C = x.shape
110
+ x = x.view(B, T, H // window_size[0], window_size[0], W // window_size[1],
111
+ window_size[1], C)
112
+ windows = x.permute(0, 2, 4, 1, 3, 5, 6).contiguous().view(
113
+ -1, T * window_size[0] * window_size[1], C)
114
+ return windows
115
+
116
+
117
+ def window_partition_noreshape(x, window_size):
118
+ """
119
+ Args:
120
+ x: shape is (B, T, H, W, C)
121
+ window_size (tuple[int]): window size
122
+ Returns:
123
+ windows: (B, num_windows_h, num_windows_w, T, window_size, window_size, C)
124
+ """
125
+ B, T, H, W, C = x.shape
126
+ x = x.view(B, T, H // window_size[0], window_size[0], W // window_size[1],
127
+ window_size[1], C)
128
+ windows = x.permute(0, 2, 4, 1, 3, 5, 6).contiguous()
129
+ return windows
130
+
131
+
132
+ def window_reverse(windows, window_size, T, H, W):
133
+ """
134
+ Args:
135
+ windows: shape is (num_windows*B, T, window_size, window_size, C)
136
+ window_size (tuple[int]): Window size
137
+ T (int): Temporal length of video
138
+ H (int): Height of image
139
+ W (int): Width of image
140
+ Returns:
141
+ x: (B, T, H, W, C)
142
+ """
143
+ B = int(windows.shape[0] / (H * W / window_size[0] / window_size[1]))
144
+ x = windows.view(B, H // window_size[0], W // window_size[1], T,
145
+ window_size[0], window_size[1], -1)
146
+ x = x.permute(0, 3, 1, 4, 2, 5, 6).contiguous().view(B, T, H, W, -1)
147
+ return x
148
+
149
+
150
+ class WindowAttention(nn.Module):
151
+ """Temporal focal window attention
152
+ """
153
+ def __init__(self, dim, expand_size, window_size, focal_window,
154
+ focal_level, num_heads, qkv_bias, pool_method):
155
+
156
+ super().__init__()
157
+ self.dim = dim
158
+ self.expand_size = expand_size
159
+ self.window_size = window_size # Wh, Ww
160
+ self.pool_method = pool_method
161
+ self.num_heads = num_heads
162
+ head_dim = dim // num_heads
163
+ self.scale = head_dim**-0.5
164
+ self.focal_level = focal_level
165
+ self.focal_window = focal_window
166
+
167
+ if any(i > 0 for i in self.expand_size) and focal_level > 0:
168
+ # get mask for rolled k and rolled v
169
+ mask_tl = torch.ones(self.window_size[0], self.window_size[1])
170
+ mask_tl[:-self.expand_size[0], :-self.expand_size[1]] = 0
171
+ mask_tr = torch.ones(self.window_size[0], self.window_size[1])
172
+ mask_tr[:-self.expand_size[0], self.expand_size[1]:] = 0
173
+ mask_bl = torch.ones(self.window_size[0], self.window_size[1])
174
+ mask_bl[self.expand_size[0]:, :-self.expand_size[1]] = 0
175
+ mask_br = torch.ones(self.window_size[0], self.window_size[1])
176
+ mask_br[self.expand_size[0]:, self.expand_size[1]:] = 0
177
+ mask_rolled = torch.stack((mask_tl, mask_tr, mask_bl, mask_br),
178
+ 0).flatten(0)
179
+ self.register_buffer("valid_ind_rolled",
180
+ mask_rolled.nonzero(as_tuple=False).view(-1))
181
+
182
+ if pool_method != "none" and focal_level > 1:
183
+ self.unfolds = nn.ModuleList()
184
+
185
+ # build relative position bias between local patch and pooled windows
186
+ for k in range(focal_level - 1):
187
+ stride = 2**k
188
+ kernel_size = tuple(2 * (i // 2) + 2**k + (2**k - 1)
189
+ for i in self.focal_window)
190
+ # define unfolding operations
191
+ self.unfolds += [
192
+ nn.Unfold(kernel_size=kernel_size,
193
+ stride=stride,
194
+ padding=tuple(i // 2 for i in kernel_size))
195
+ ]
196
+
197
+ # define unfolding index for focal_level > 0
198
+ if k > 0:
199
+ mask = torch.zeros(kernel_size)
200
+ mask[(2**k) - 1:, (2**k) - 1:] = 1
201
+ self.register_buffer(
202
+ "valid_ind_unfold_{}".format(k),
203
+ mask.flatten(0).nonzero(as_tuple=False).view(-1))
204
+
205
+ self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
206
+ self.proj = nn.Linear(dim, dim)
207
+
208
+ self.softmax = nn.Softmax(dim=-1)
209
+
210
+ def forward(self, x_all, mask_all=None):
211
+ """
212
+ Args:
213
+ x: input features with shape of (B, T, Wh, Ww, C)
214
+ mask: (0/-inf) mask with shape of (num_windows, T*Wh*Ww, T*Wh*Ww) or None
215
+
216
+ output: (nW*B, Wh*Ww, C)
217
+ """
218
+ x = x_all[0]
219
+
220
+ B, T, nH, nW, C = x.shape
221
+ qkv = self.qkv(x).reshape(B, T, nH, nW, 3,
222
+ C).permute(4, 0, 1, 2, 3, 5).contiguous()
223
+ q, k, v = qkv[0], qkv[1], qkv[2] # B, T, nH, nW, C
224
+
225
+ # partition q map
226
+ (q_windows, k_windows, v_windows) = map(
227
+ lambda t: window_partition(t, self.window_size).view(
228
+ -1, T, self.window_size[0] * self.window_size[1], self.
229
+ num_heads, C // self.num_heads).permute(0, 3, 1, 2, 4).
230
+ contiguous().view(-1, self.num_heads, T * self.window_size[
231
+ 0] * self.window_size[1], C // self.num_heads), (q, k, v))
232
+ # q(k/v)_windows shape : [16, 4, 225, 128]
233
+
234
+ if any(i > 0 for i in self.expand_size) and self.focal_level > 0:
235
+ (k_tl, v_tl) = map(
236
+ lambda t: torch.roll(t,
237
+ shifts=(-self.expand_size[0], -self.
238
+ expand_size[1]),
239
+ dims=(2, 3)), (k, v))
240
+ (k_tr, v_tr) = map(
241
+ lambda t: torch.roll(t,
242
+ shifts=(-self.expand_size[0], self.
243
+ expand_size[1]),
244
+ dims=(2, 3)), (k, v))
245
+ (k_bl, v_bl) = map(
246
+ lambda t: torch.roll(t,
247
+ shifts=(self.expand_size[0], -self.
248
+ expand_size[1]),
249
+ dims=(2, 3)), (k, v))
250
+ (k_br, v_br) = map(
251
+ lambda t: torch.roll(t,
252
+ shifts=(self.expand_size[0], self.
253
+ expand_size[1]),
254
+ dims=(2, 3)), (k, v))
255
+
256
+ (k_tl_windows, k_tr_windows, k_bl_windows, k_br_windows) = map(
257
+ lambda t: window_partition(t, self.window_size).view(
258
+ -1, T, self.window_size[0] * self.window_size[1], self.
259
+ num_heads, C // self.num_heads), (k_tl, k_tr, k_bl, k_br))
260
+ (v_tl_windows, v_tr_windows, v_bl_windows, v_br_windows) = map(
261
+ lambda t: window_partition(t, self.window_size).view(
262
+ -1, T, self.window_size[0] * self.window_size[1], self.
263
+ num_heads, C // self.num_heads), (v_tl, v_tr, v_bl, v_br))
264
+ k_rolled = torch.cat(
265
+ (k_tl_windows, k_tr_windows, k_bl_windows, k_br_windows),
266
+ 2).permute(0, 3, 1, 2, 4).contiguous()
267
+ v_rolled = torch.cat(
268
+ (v_tl_windows, v_tr_windows, v_bl_windows, v_br_windows),
269
+ 2).permute(0, 3, 1, 2, 4).contiguous()
270
+
271
+ # mask out tokens in current window
272
+ k_rolled = k_rolled[:, :, :, self.valid_ind_rolled]
273
+ v_rolled = v_rolled[:, :, :, self.valid_ind_rolled]
274
+ temp_N = k_rolled.shape[3]
275
+ k_rolled = k_rolled.view(-1, self.num_heads, T * temp_N,
276
+ C // self.num_heads)
277
+ v_rolled = v_rolled.view(-1, self.num_heads, T * temp_N,
278
+ C // self.num_heads)
279
+ k_rolled = torch.cat((k_windows, k_rolled), 2)
280
+ v_rolled = torch.cat((v_windows, v_rolled), 2)
281
+ else:
282
+ k_rolled = k_windows
283
+ v_rolled = v_windows
284
+
285
+ # q(k/v)_windows shape : [16, 4, 225, 128]
286
+ # k_rolled.shape : [16, 4, 5, 165, 128]
287
+ # ideal expanded window size 153 ((5+2*2)*(9+2*4))
288
+ # k_windows=45 expand_window=108 overlap_window=12 (since expand_size < window_size / 2)
289
+
290
+ if self.pool_method != "none" and self.focal_level > 1:
291
+ k_pooled = []
292
+ v_pooled = []
293
+ for k in range(self.focal_level - 1):
294
+ stride = 2**k
295
+ x_window_pooled = x_all[k + 1].permute(
296
+ 0, 3, 1, 2, 4).contiguous() # B, T, nWh, nWw, C
297
+
298
+ nWh, nWw = x_window_pooled.shape[2:4]
299
+
300
+ # generate mask for pooled windows
301
+ mask = x_window_pooled.new(T, nWh, nWw).fill_(1)
302
+ # unfold mask: [nWh*nWw//s//s, k*k, 1]
303
+ unfolded_mask = self.unfolds[k](mask.unsqueeze(1)).view(
304
+ 1, T, self.unfolds[k].kernel_size[0], self.unfolds[k].kernel_size[1], -1).permute(4, 1, 2, 3, 0).contiguous().\
305
+ view(nWh*nWw // stride // stride, -1, 1)
306
+
307
+ if k > 0:
308
+ valid_ind_unfold_k = getattr(
309
+ self, "valid_ind_unfold_{}".format(k))
310
+ unfolded_mask = unfolded_mask[:, valid_ind_unfold_k]
311
+
312
+ x_window_masks = unfolded_mask.flatten(1).unsqueeze(0)
313
+ x_window_masks = x_window_masks.masked_fill(
314
+ x_window_masks == 0,
315
+ float(-100.0)).masked_fill(x_window_masks > 0, float(0.0))
316
+ mask_all[k + 1] = x_window_masks
317
+
318
+ # generate k and v for pooled windows
319
+ qkv_pooled = self.qkv(x_window_pooled).reshape(
320
+ B, T, nWh, nWw, 3, C).permute(4, 0, 1, 5, 2,
321
+ 3).view(3, -1, C, nWh,
322
+ nWw).contiguous()
323
+ k_pooled_k, v_pooled_k = qkv_pooled[1], qkv_pooled[
324
+ 2] # B*T, C, nWh, nWw
325
+ # k_pooled_k shape: [5, 512, 4, 4]
326
+ # self.unfolds[k](k_pooled_k) shape: [5, 23040 (512 * 5 * 9 ), 16]
327
+
328
+ (k_pooled_k, v_pooled_k) = map(
329
+ lambda t: self.unfolds[k](t).view(
330
+ B, T, C, self.unfolds[k].kernel_size[0], self.unfolds[k].kernel_size[1], -1).permute(0, 5, 1, 3, 4, 2).contiguous().\
331
+ view(-1, T, self.unfolds[k].kernel_size[0]*self.unfolds[k].kernel_size[1], self.num_heads, C // self.num_heads).permute(0, 3, 1, 2, 4).contiguous(),
332
+ (k_pooled_k, v_pooled_k) # (B x (nH*nW)) x nHeads x T x (unfold_wsize x unfold_wsize) x head_dim
333
+ )
334
+ # k_pooled_k shape : [16, 4, 5, 45, 128]
335
+
336
+ # select valid unfolding index
337
+ if k > 0:
338
+ (k_pooled_k, v_pooled_k) = map(
339
+ lambda t: t[:, :, :, valid_ind_unfold_k],
340
+ (k_pooled_k, v_pooled_k))
341
+
342
+ k_pooled_k = k_pooled_k.view(
343
+ -1, self.num_heads, T * self.unfolds[k].kernel_size[0] *
344
+ self.unfolds[k].kernel_size[1], C // self.num_heads)
345
+ v_pooled_k = v_pooled_k.view(
346
+ -1, self.num_heads, T * self.unfolds[k].kernel_size[0] *
347
+ self.unfolds[k].kernel_size[1], C // self.num_heads)
348
+
349
+ k_pooled += [k_pooled_k]
350
+ v_pooled += [v_pooled_k]
351
+
352
+ # k_all (v_all) shape : [16, 4, 5 * 210, 128]
353
+ k_all = torch.cat([k_rolled] + k_pooled, 2)
354
+ v_all = torch.cat([v_rolled] + v_pooled, 2)
355
+ else:
356
+ k_all = k_rolled
357
+ v_all = v_rolled
358
+
359
+ N = k_all.shape[-2]
360
+ q_windows = q_windows * self.scale
361
+ attn = (
362
+ q_windows @ k_all.transpose(-2, -1)
363
+ ) # B*nW, nHead, T*window_size*window_size, T*focal_window_size*focal_window_size
364
+ # T * 45
365
+ window_area = T * self.window_size[0] * self.window_size[1]
366
+ # T * 165
367
+ window_area_rolled = k_rolled.shape[2]
368
+
369
+ if self.pool_method != "none" and self.focal_level > 1:
370
+ offset = window_area_rolled
371
+ for k in range(self.focal_level - 1):
372
+ # add attentional mask
373
+ # mask_all[1] shape [1, 16, T * 45]
374
+
375
+ bias = tuple((i + 2**k - 1) for i in self.focal_window)
376
+
377
+ if mask_all[k + 1] is not None:
378
+ attn[:, :, :window_area, offset:(offset + (T*bias[0]*bias[1]))] = \
379
+ attn[:, :, :window_area, offset:(offset + (T*bias[0]*bias[1]))] + \
380
+ mask_all[k+1][:, :, None, None, :].repeat(attn.shape[0] // mask_all[k+1].shape[1], 1, 1, 1, 1).view(-1, 1, 1, mask_all[k+1].shape[-1])
381
+
382
+ offset += T * bias[0] * bias[1]
383
+
384
+ if mask_all[0] is not None:
385
+ nW = mask_all[0].shape[0]
386
+ attn = attn.view(attn.shape[0] // nW, nW, self.num_heads,
387
+ window_area, N)
388
+ attn[:, :, :, :, :
389
+ window_area] = attn[:, :, :, :, :window_area] + mask_all[0][
390
+ None, :, None, :, :]
391
+ attn = attn.view(-1, self.num_heads, window_area, N)
392
+ attn = self.softmax(attn)
393
+ else:
394
+ attn = self.softmax(attn)
395
+
396
+ x = (attn @ v_all).transpose(1, 2).reshape(attn.shape[0], window_area,
397
+ C)
398
+ x = self.proj(x)
399
+ return x
400
+
401
+
402
+ class TemporalFocalTransformerBlock(nn.Module):
403
+ r""" Temporal Focal Transformer Block.
404
+ Args:
405
+ dim (int): Number of input channels.
406
+ num_heads (int): Number of attention heads.
407
+ window_size (tuple[int]): Window size.
408
+ shift_size (int): Shift size for SW-MSA.
409
+ mlp_ratio (float): Ratio of mlp hidden dim to embedding dim.
410
+ qkv_bias (bool, optional): If True, add a learnable bias to query, key, value. Default: True
411
+ norm_layer (nn.Module, optional): Normalization layer. Default: nn.LayerNorm
412
+ focal_level (int): Number of focal levels.
413
+ focal_window (tuple[int]): Size of the focal window.
414
+ n_vecs (int): Required for F3N.
415
+ t2t_params (int): T2T parameters for F3N.
416
+ """
417
+ def __init__(self,
418
+ dim,
419
+ num_heads,
420
+ window_size=(5, 9),
421
+ mlp_ratio=4.,
422
+ qkv_bias=True,
423
+ pool_method="fc",
424
+ focal_level=2,
425
+ focal_window=(5, 9),
426
+ norm_layer=nn.LayerNorm,
427
+ n_vecs=None,
428
+ t2t_params=None):
429
+ super().__init__()
430
+ self.dim = dim
431
+ self.num_heads = num_heads
432
+ self.window_size = window_size
433
+ self.expand_size = tuple(i // 2 for i in window_size) # TODO
434
+ self.mlp_ratio = mlp_ratio
435
+ self.pool_method = pool_method
436
+ self.focal_level = focal_level
437
+ self.focal_window = focal_window
438
+
439
+ self.window_size_glo = self.window_size
440
+
441
+ self.pool_layers = nn.ModuleList()
442
+ if self.pool_method != "none":
443
+ for k in range(self.focal_level - 1):
444
+ window_size_glo = tuple(
445
+ math.floor(i / (2**k)) for i in self.window_size_glo)
446
+ self.pool_layers.append(
447
+ nn.Linear(window_size_glo[0] * window_size_glo[1], 1))
448
+ self.pool_layers[-1].weight.data.fill_(
449
+ 1. / (window_size_glo[0] * window_size_glo[1]))
450
+ self.pool_layers[-1].bias.data.fill_(0)
451
+
452
+ self.norm1 = norm_layer(dim)
453
+
454
+ self.attn = WindowAttention(dim,
455
+ expand_size=self.expand_size,
456
+ window_size=self.window_size,
457
+ focal_window=focal_window,
458
+ focal_level=focal_level,
459
+ num_heads=num_heads,
460
+ qkv_bias=qkv_bias,
461
+ pool_method=pool_method)
462
+
463
+ self.norm2 = norm_layer(dim)
464
+ self.mlp = FusionFeedForward(dim, n_vecs=n_vecs, t2t_params=t2t_params)
465
+
466
+ def forward(self, x):
467
+ B, T, H, W, C = x.shape
468
+
469
+ shortcut = x
470
+ x = self.norm1(x)
471
+
472
+ shifted_x = x
473
+
474
+ x_windows_all = [shifted_x]
475
+ x_window_masks_all = [None]
476
+
477
+ # partition windows tuple(i // 2 for i in window_size)
478
+ if self.focal_level > 1 and self.pool_method != "none":
479
+ # if we add coarser granularity and the pool method is not none
480
+ for k in range(self.focal_level - 1):
481
+ window_size_glo = tuple(
482
+ math.floor(i / (2**k)) for i in self.window_size_glo)
483
+ pooled_h = math.ceil(H / window_size_glo[0]) * (2**k)
484
+ pooled_w = math.ceil(W / window_size_glo[1]) * (2**k)
485
+ H_pool = pooled_h * window_size_glo[0]
486
+ W_pool = pooled_w * window_size_glo[1]
487
+
488
+ x_level_k = shifted_x
489
+ # trim or pad shifted_x depending on the required size
490
+ if H > H_pool:
491
+ trim_t = (H - H_pool) // 2
492
+ trim_b = H - H_pool - trim_t
493
+ x_level_k = x_level_k[:, :, trim_t:-trim_b]
494
+ elif H < H_pool:
495
+ pad_t = (H_pool - H) // 2
496
+ pad_b = H_pool - H - pad_t
497
+ x_level_k = F.pad(x_level_k, (0, 0, 0, 0, pad_t, pad_b))
498
+
499
+ if W > W_pool:
500
+ trim_l = (W - W_pool) // 2
501
+ trim_r = W - W_pool - trim_l
502
+ x_level_k = x_level_k[:, :, :, trim_l:-trim_r]
503
+ elif W < W_pool:
504
+ pad_l = (W_pool - W) // 2
505
+ pad_r = W_pool - W - pad_l
506
+ x_level_k = F.pad(x_level_k, (0, 0, pad_l, pad_r))
507
+
508
+ x_windows_noreshape = window_partition_noreshape(
509
+ x_level_k.contiguous(), window_size_glo
510
+ ) # B, nw, nw, T, window_size, window_size, C
511
+ nWh, nWw = x_windows_noreshape.shape[1:3]
512
+ x_windows_noreshape = x_windows_noreshape.view(
513
+ B, nWh, nWw, T, window_size_glo[0] * window_size_glo[1],
514
+ C).transpose(4, 5) # B, nWh, nWw, T, C, wsize**2
515
+ x_windows_pooled = self.pool_layers[k](
516
+ x_windows_noreshape).flatten(-2) # B, nWh, nWw, T, C
517
+
518
+ x_windows_all += [x_windows_pooled]
519
+ x_window_masks_all += [None]
520
+
521
+ attn_windows = self.attn(
522
+ x_windows_all,
523
+ mask_all=x_window_masks_all) # nW*B, T*window_size*window_size, C
524
+
525
+ # merge windows
526
+ attn_windows = attn_windows.view(-1, T, self.window_size[0],
527
+ self.window_size[1], C)
528
+ shifted_x = window_reverse(attn_windows, self.window_size, T, H,
529
+ W) # B T H' W' C
530
+
531
+ # FFN
532
+ x = shortcut + shifted_x
533
+ y = self.norm2(x)
534
+ x = x + self.mlp(y.view(B, T * H * W, C)).view(B, T, H, W, C)
535
+
536
+ return x
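The window_partition / window_reverse pair used above is lossless whenever H and W are multiples of the window size. A minimal shape check, assuming a 20x36 token grid and the (5, 9) window used throughout this module (all tensor sizes here are illustrative only):

import torch

B, T, H, W, C = 1, 5, 20, 36, 512
window_size = (5, 9)
x = torch.randn(B, T, H, W, C)

windows = window_partition(x, window_size)   # (B*nW, T*5*9, C) = (16, 225, 512)
restored = window_reverse(
    windows.view(-1, T, window_size[0], window_size[1], C),  # (nW*B, T, 5, 9, C), as in the block
    window_size, T, H, W)
assert restored.shape == (B, T, H, W, C)
assert torch.equal(restored, x)              # partition followed by reverse is the identity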
inpainter/model/modules/tfocal_transformer_hq.py ADDED
@@ -0,0 +1,565 @@
1
+ """
2
+ This code is based on:
3
+ [1] FuseFormer: Fusing Fine-Grained Information in Transformers for Video Inpainting, ICCV 2021
4
+ https://github.com/ruiliu-ai/FuseFormer
5
+ [2] Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet, ICCV 2021
6
+ https://github.com/yitu-opensource/T2T-ViT
7
+ [3] Focal Self-attention for Local-Global Interactions in Vision Transformers, NeurIPS 2021
8
+ https://github.com/microsoft/Focal-Transformer
9
+ """
10
+
11
+ import math
12
+ from functools import reduce
13
+
14
+ import torch
15
+ import torch.nn as nn
16
+ import torch.nn.functional as F
17
+
18
+
19
+ class SoftSplit(nn.Module):
20
+ def __init__(self, channel, hidden, kernel_size, stride, padding,
21
+ t2t_param):
22
+ super(SoftSplit, self).__init__()
23
+ self.kernel_size = kernel_size
24
+ self.t2t = nn.Unfold(kernel_size=kernel_size,
25
+ stride=stride,
26
+ padding=padding)
27
+ c_in = reduce((lambda x, y: x * y), kernel_size) * channel
28
+ self.embedding = nn.Linear(c_in, hidden)
29
+
30
+ self.t2t_param = t2t_param
31
+
32
+ def forward(self, x, b, output_size):
33
+ f_h = int((output_size[0] + 2 * self.t2t_param['padding'][0] -
34
+ (self.t2t_param['kernel_size'][0] - 1) - 1) /
35
+ self.t2t_param['stride'][0] + 1)
36
+ f_w = int((output_size[1] + 2 * self.t2t_param['padding'][1] -
37
+ (self.t2t_param['kernel_size'][1] - 1) - 1) /
38
+ self.t2t_param['stride'][1] + 1)
39
+
40
+ feat = self.t2t(x)
41
+ feat = feat.permute(0, 2, 1)
42
+ # feat shape [b*t, num_vec, ks*ks*c]
43
+ feat = self.embedding(feat)
44
+ # feat shape after embedding [b, t*num_vec, hidden]
45
+ feat = feat.view(b, -1, f_h, f_w, feat.size(2))
46
+ return feat
47
+
48
+
49
+ class SoftComp(nn.Module):
50
+ def __init__(self, channel, hidden, kernel_size, stride, padding):
51
+ super(SoftComp, self).__init__()
52
+ self.relu = nn.LeakyReLU(0.2, inplace=True)
53
+ c_out = reduce((lambda x, y: x * y), kernel_size) * channel
54
+ self.embedding = nn.Linear(hidden, c_out)
55
+ self.kernel_size = kernel_size
56
+ self.stride = stride
57
+ self.padding = padding
58
+ self.bias_conv = nn.Conv2d(channel,
59
+ channel,
60
+ kernel_size=3,
61
+ stride=1,
62
+ padding=1)
63
+ # TODO upsample conv
64
+ # self.bias_conv = nn.Conv2d()
65
+ # self.bias = nn.Parameter(torch.zeros((channel, h, w), dtype=torch.float32), requires_grad=True)
66
+
67
+ def forward(self, x, t, output_size):
68
+ b_, _, _, _, c_ = x.shape
69
+ x = x.view(b_, -1, c_)
70
+ feat = self.embedding(x)
71
+ b, _, c = feat.size()
72
+ feat = feat.view(b * t, -1, c).permute(0, 2, 1)
73
+ feat = F.fold(feat,
74
+ output_size=output_size,
75
+ kernel_size=self.kernel_size,
76
+ stride=self.stride,
77
+ padding=self.padding)
78
+ feat = self.bias_conv(feat)
79
+ return feat
80
+
81
+
82
+ class FusionFeedForward(nn.Module):
83
+ def __init__(self, d_model, n_vecs=None, t2t_params=None):
84
+ super(FusionFeedForward, self).__init__()
85
+ # We set d_ff to 1960 by default
86
+ hd = 1960
87
+ self.conv1 = nn.Sequential(nn.Linear(d_model, hd))
88
+ self.conv2 = nn.Sequential(nn.GELU(), nn.Linear(hd, d_model))
89
+ assert t2t_params is not None and n_vecs is not None
90
+ self.t2t_params = t2t_params
91
+
92
+ def forward(self, x, output_size):
93
+ n_vecs = 1
94
+ for i, d in enumerate(self.t2t_params['kernel_size']):
95
+ n_vecs *= int((output_size[i] + 2 * self.t2t_params['padding'][i] -
96
+ (d - 1) - 1) / self.t2t_params['stride'][i] + 1)
97
+
98
+ x = self.conv1(x)
99
+ b, n, c = x.size()
100
+ normalizer = x.new_ones(b, n, 49).view(-1, n_vecs, 49).permute(0, 2, 1)
101
+ normalizer = F.fold(normalizer,
102
+ output_size=output_size,
103
+ kernel_size=self.t2t_params['kernel_size'],
104
+ padding=self.t2t_params['padding'],
105
+ stride=self.t2t_params['stride'])
106
+
107
+ x = F.fold(x.view(-1, n_vecs, c).permute(0, 2, 1),
108
+ output_size=output_size,
109
+ kernel_size=self.t2t_params['kernel_size'],
110
+ padding=self.t2t_params['padding'],
111
+ stride=self.t2t_params['stride'])
112
+
113
+ x = F.unfold(x / normalizer,
114
+ kernel_size=self.t2t_params['kernel_size'],
115
+ padding=self.t2t_params['padding'],
116
+ stride=self.t2t_params['stride']).permute(
117
+ 0, 2, 1).contiguous().view(b, n, c)
118
+ x = self.conv2(x)
119
+ return x
120
+
121
+
122
+ def window_partition(x, window_size):
123
+ """
124
+ Args:
125
+ x: shape is (B, T, H, W, C)
126
+ window_size (tuple[int]): window size
127
+ Returns:
128
+ windows: (B*num_windows, T*window_size*window_size, C)
129
+ """
130
+ B, T, H, W, C = x.shape
131
+ x = x.view(B, T, H // window_size[0], window_size[0], W // window_size[1],
132
+ window_size[1], C)
133
+ windows = x.permute(0, 2, 4, 1, 3, 5, 6).contiguous().view(
134
+ -1, T * window_size[0] * window_size[1], C)
135
+ return windows
136
+
137
+
138
+ def window_partition_noreshape(x, window_size):
139
+ """
140
+ Args:
141
+ x: shape is (B, T, H, W, C)
142
+ window_size (tuple[int]): window size
143
+ Returns:
144
+ windows: (B, num_windows_h, num_windows_w, T, window_size, window_size, C)
145
+ """
146
+ B, T, H, W, C = x.shape
147
+ x = x.view(B, T, H // window_size[0], window_size[0], W // window_size[1],
148
+ window_size[1], C)
149
+ windows = x.permute(0, 2, 4, 1, 3, 5, 6).contiguous()
150
+ return windows
151
+
152
+
153
+ def window_reverse(windows, window_size, T, H, W):
154
+ """
155
+ Args:
156
+ windows: shape is (num_windows*B, T, window_size, window_size, C)
157
+ window_size (tuple[int]): Window size
158
+ T (int): Temporal length of video
159
+ H (int): Height of image
160
+ W (int): Width of image
161
+ Returns:
162
+ x: (B, T, H, W, C)
163
+ """
164
+ B = int(windows.shape[0] / (H * W / window_size[0] / window_size[1]))
165
+ x = windows.view(B, H // window_size[0], W // window_size[1], T,
166
+ window_size[0], window_size[1], -1)
167
+ x = x.permute(0, 3, 1, 4, 2, 5, 6).contiguous().view(B, T, H, W, -1)
168
+ return x
169
+
170
+
171
+ class WindowAttention(nn.Module):
172
+ """Temporal focal window attention
173
+ """
174
+ def __init__(self, dim, expand_size, window_size, focal_window,
175
+ focal_level, num_heads, qkv_bias, pool_method):
176
+
177
+ super().__init__()
178
+ self.dim = dim
179
+ self.expand_size = expand_size
180
+ self.window_size = window_size # Wh, Ww
181
+ self.pool_method = pool_method
182
+ self.num_heads = num_heads
183
+ head_dim = dim // num_heads
184
+ self.scale = head_dim**-0.5
185
+ self.focal_level = focal_level
186
+ self.focal_window = focal_window
187
+
188
+ if any(i > 0 for i in self.expand_size) and focal_level > 0:
189
+ # get mask for rolled k and rolled v
190
+ mask_tl = torch.ones(self.window_size[0], self.window_size[1])
191
+ mask_tl[:-self.expand_size[0], :-self.expand_size[1]] = 0
192
+ mask_tr = torch.ones(self.window_size[0], self.window_size[1])
193
+ mask_tr[:-self.expand_size[0], self.expand_size[1]:] = 0
194
+ mask_bl = torch.ones(self.window_size[0], self.window_size[1])
195
+ mask_bl[self.expand_size[0]:, :-self.expand_size[1]] = 0
196
+ mask_br = torch.ones(self.window_size[0], self.window_size[1])
197
+ mask_br[self.expand_size[0]:, self.expand_size[1]:] = 0
198
+ mask_rolled = torch.stack((mask_tl, mask_tr, mask_bl, mask_br),
199
+ 0).flatten(0)
200
+ self.register_buffer("valid_ind_rolled",
201
+ mask_rolled.nonzero(as_tuple=False).view(-1))
202
+
203
+ if pool_method != "none" and focal_level > 1:
204
+ self.unfolds = nn.ModuleList()
205
+
206
+ # build relative position bias between local patch and pooled windows
207
+ for k in range(focal_level - 1):
208
+ stride = 2**k
209
+ kernel_size = tuple(2 * (i // 2) + 2**k + (2**k - 1)
210
+ for i in self.focal_window)
211
+ # define unfolding operations
212
+ self.unfolds += [
213
+ nn.Unfold(kernel_size=kernel_size,
214
+ stride=stride,
215
+ padding=tuple(i // 2 for i in kernel_size))
216
+ ]
217
+
218
+ # define unfolding index for focal_level > 0
219
+ if k > 0:
220
+ mask = torch.zeros(kernel_size)
221
+ mask[(2**k) - 1:, (2**k) - 1:] = 1
222
+ self.register_buffer(
223
+ "valid_ind_unfold_{}".format(k),
224
+ mask.flatten(0).nonzero(as_tuple=False).view(-1))
225
+
226
+ self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
227
+ self.proj = nn.Linear(dim, dim)
228
+
229
+ self.softmax = nn.Softmax(dim=-1)
230
+
231
+ def forward(self, x_all, mask_all=None):
232
+ """
233
+ Args:
234
+ x: input features with shape of (B, T, Wh, Ww, C)
235
+ mask: (0/-inf) mask with shape of (num_windows, T*Wh*Ww, T*Wh*Ww) or None
236
+
237
+ output: (nW*B, T*Wh*Ww, C)
238
+ """
239
+ x = x_all[0]
240
+
241
+ B, T, nH, nW, C = x.shape
242
+ qkv = self.qkv(x).reshape(B, T, nH, nW, 3,
243
+ C).permute(4, 0, 1, 2, 3, 5).contiguous()
244
+ q, k, v = qkv[0], qkv[1], qkv[2] # B, T, nH, nW, C
245
+
246
+ # partition q map
247
+ (q_windows, k_windows, v_windows) = map(
248
+ lambda t: window_partition(t, self.window_size).view(
249
+ -1, T, self.window_size[0] * self.window_size[1], self.
250
+ num_heads, C // self.num_heads).permute(0, 3, 1, 2, 4).
251
+ contiguous().view(-1, self.num_heads, T * self.window_size[
252
+ 0] * self.window_size[1], C // self.num_heads), (q, k, v))
253
+ # q(k/v)_windows shape : [16, 4, 225, 128]
254
+
255
+ if any(i > 0 for i in self.expand_size) and self.focal_level > 0:
256
+ (k_tl, v_tl) = map(
257
+ lambda t: torch.roll(t,
258
+ shifts=(-self.expand_size[0], -self.
259
+ expand_size[1]),
260
+ dims=(2, 3)), (k, v))
261
+ (k_tr, v_tr) = map(
262
+ lambda t: torch.roll(t,
263
+ shifts=(-self.expand_size[0], self.
264
+ expand_size[1]),
265
+ dims=(2, 3)), (k, v))
266
+ (k_bl, v_bl) = map(
267
+ lambda t: torch.roll(t,
268
+ shifts=(self.expand_size[0], -self.
269
+ expand_size[1]),
270
+ dims=(2, 3)), (k, v))
271
+ (k_br, v_br) = map(
272
+ lambda t: torch.roll(t,
273
+ shifts=(self.expand_size[0], self.
274
+ expand_size[1]),
275
+ dims=(2, 3)), (k, v))
276
+
277
+ (k_tl_windows, k_tr_windows, k_bl_windows, k_br_windows) = map(
278
+ lambda t: window_partition(t, self.window_size).view(
279
+ -1, T, self.window_size[0] * self.window_size[1], self.
280
+ num_heads, C // self.num_heads), (k_tl, k_tr, k_bl, k_br))
281
+ (v_tl_windows, v_tr_windows, v_bl_windows, v_br_windows) = map(
282
+ lambda t: window_partition(t, self.window_size).view(
283
+ -1, T, self.window_size[0] * self.window_size[1], self.
284
+ num_heads, C // self.num_heads), (v_tl, v_tr, v_bl, v_br))
285
+ k_rolled = torch.cat(
286
+ (k_tl_windows, k_tr_windows, k_bl_windows, k_br_windows),
287
+ 2).permute(0, 3, 1, 2, 4).contiguous()
288
+ v_rolled = torch.cat(
289
+ (v_tl_windows, v_tr_windows, v_bl_windows, v_br_windows),
290
+ 2).permute(0, 3, 1, 2, 4).contiguous()
291
+
292
+ # mask out tokens in current window
293
+ k_rolled = k_rolled[:, :, :, self.valid_ind_rolled]
294
+ v_rolled = v_rolled[:, :, :, self.valid_ind_rolled]
295
+ temp_N = k_rolled.shape[3]
296
+ k_rolled = k_rolled.view(-1, self.num_heads, T * temp_N,
297
+ C // self.num_heads)
298
+ v_rolled = v_rolled.view(-1, self.num_heads, T * temp_N,
299
+ C // self.num_heads)
300
+ k_rolled = torch.cat((k_windows, k_rolled), 2)
301
+ v_rolled = torch.cat((v_windows, v_rolled), 2)
302
+ else:
303
+ k_rolled = k_windows
304
+ v_rolled = v_windows
305
+
306
+ # q(k/v)_windows shape : [16, 4, 225, 128]
307
+ # k_rolled.shape : [16, 4, 5, 165, 128]
308
+ # ideal expanded window size 153 ((5+2*2)*(9+2*4))
309
+ # k_windows=45 expand_window=108 overlap_window=12 (since expand_size < window_size / 2)
310
+
311
+ if self.pool_method != "none" and self.focal_level > 1:
312
+ k_pooled = []
313
+ v_pooled = []
314
+ for k in range(self.focal_level - 1):
315
+ stride = 2**k
316
+ # B, T, nWh, nWw, C
317
+ x_window_pooled = x_all[k + 1].permute(0, 3, 1, 2,
318
+ 4).contiguous()
319
+
320
+ nWh, nWw = x_window_pooled.shape[2:4]
321
+
322
+ # generate mask for pooled windows
323
+ mask = x_window_pooled.new(T, nWh, nWw).fill_(1)
324
+ # unfold mask: [nWh*nWw//s//s, k*k, 1]
325
+ unfolded_mask = self.unfolds[k](mask.unsqueeze(1)).view(
326
+ 1, T, self.unfolds[k].kernel_size[0], self.unfolds[k].kernel_size[1], -1).permute(4, 1, 2, 3, 0).contiguous().\
327
+ view(nWh*nWw // stride // stride, -1, 1)
328
+
329
+ if k > 0:
330
+ valid_ind_unfold_k = getattr(
331
+ self, "valid_ind_unfold_{}".format(k))
332
+ unfolded_mask = unfolded_mask[:, valid_ind_unfold_k]
333
+
334
+ x_window_masks = unfolded_mask.flatten(1).unsqueeze(0)
335
+ x_window_masks = x_window_masks.masked_fill(
336
+ x_window_masks == 0,
337
+ float(-100.0)).masked_fill(x_window_masks > 0, float(0.0))
338
+ mask_all[k + 1] = x_window_masks
339
+
340
+ # generate k and v for pooled windows
341
+ qkv_pooled = self.qkv(x_window_pooled).reshape(
342
+ B, T, nWh, nWw, 3, C).permute(4, 0, 1, 5, 2,
343
+ 3).view(3, -1, C, nWh,
344
+ nWw).contiguous()
345
+ # B*T, C, nWh, nWw
346
+ k_pooled_k, v_pooled_k = qkv_pooled[1], qkv_pooled[2]
347
+ # k_pooled_k shape: [5, 512, 4, 4]
348
+ # self.unfolds[k](k_pooled_k) shape: [5, 23040 (512 * 5 * 9 ), 16]
349
+
350
+ (k_pooled_k, v_pooled_k) = map(
351
+ lambda t: self.unfolds[k]
352
+ (t).view(B, T, C, self.unfolds[k].kernel_size[0], self.
353
+ unfolds[k].kernel_size[1], -1)
354
+ .permute(0, 5, 1, 3, 4, 2).contiguous().view(
355
+ -1, T, self.unfolds[k].kernel_size[0] * self.unfolds[
356
+ k].kernel_size[1], self.num_heads, C // self.
357
+ num_heads).permute(0, 3, 1, 2, 4).contiguous(),
358
+ # (B x (nH*nW)) x nHeads x T x (unfold_wsize x unfold_wsize) x head_dim
359
+ (k_pooled_k, v_pooled_k))
360
+ # k_pooled_k shape : [16, 4, 5, 45, 128]
361
+
362
+ # select valid unfolding index
363
+ if k > 0:
364
+ (k_pooled_k, v_pooled_k) = map(
365
+ lambda t: t[:, :, :, valid_ind_unfold_k],
366
+ (k_pooled_k, v_pooled_k))
367
+
368
+ k_pooled_k = k_pooled_k.view(
369
+ -1, self.num_heads, T * self.unfolds[k].kernel_size[0] *
370
+ self.unfolds[k].kernel_size[1], C // self.num_heads)
371
+ v_pooled_k = v_pooled_k.view(
372
+ -1, self.num_heads, T * self.unfolds[k].kernel_size[0] *
373
+ self.unfolds[k].kernel_size[1], C // self.num_heads)
374
+
375
+ k_pooled += [k_pooled_k]
376
+ v_pooled += [v_pooled_k]
377
+
378
+ # k_all (v_all) shape : [16, 4, 5 * 210, 128]
379
+ k_all = torch.cat([k_rolled] + k_pooled, 2)
380
+ v_all = torch.cat([v_rolled] + v_pooled, 2)
381
+ else:
382
+ k_all = k_rolled
383
+ v_all = v_rolled
384
+
385
+ N = k_all.shape[-2]
386
+ q_windows = q_windows * self.scale
387
+ # B*nW, nHead, T*window_size*window_size, T*focal_window_size*focal_window_size
388
+ attn = (q_windows @ k_all.transpose(-2, -1))
389
+ # T * 45
390
+ window_area = T * self.window_size[0] * self.window_size[1]
391
+ # T * 165
392
+ window_area_rolled = k_rolled.shape[2]
393
+
394
+ if self.pool_method != "none" and self.focal_level > 1:
395
+ offset = window_area_rolled
396
+ for k in range(self.focal_level - 1):
397
+ # add attentional mask
398
+ # mask_all[1] shape [1, 16, T * 45]
399
+
400
+ bias = tuple((i + 2**k - 1) for i in self.focal_window)
401
+
402
+ if mask_all[k + 1] is not None:
403
+ attn[:, :, :window_area, offset:(offset + (T*bias[0]*bias[1]))] = \
404
+ attn[:, :, :window_area, offset:(offset + (T*bias[0]*bias[1]))] + \
405
+ mask_all[k+1][:, :, None, None, :].repeat(
406
+ attn.shape[0] // mask_all[k+1].shape[1], 1, 1, 1, 1).view(-1, 1, 1, mask_all[k+1].shape[-1])
407
+
408
+ offset += T * bias[0] * bias[1]
409
+
410
+ if mask_all[0] is not None:
411
+ nW = mask_all[0].shape[0]
412
+ attn = attn.view(attn.shape[0] // nW, nW, self.num_heads,
413
+ window_area, N)
414
+ attn[:, :, :, :, :
415
+ window_area] = attn[:, :, :, :, :window_area] + mask_all[0][
416
+ None, :, None, :, :]
417
+ attn = attn.view(-1, self.num_heads, window_area, N)
418
+ attn = self.softmax(attn)
419
+ else:
420
+ attn = self.softmax(attn)
421
+
422
+ x = (attn @ v_all).transpose(1, 2).reshape(attn.shape[0], window_area,
423
+ C)
424
+ x = self.proj(x)
425
+ return x
426
+
427
+
428
+ class TemporalFocalTransformerBlock(nn.Module):
429
+ r""" Temporal Focal Transformer Block.
430
+ Args:
431
+ dim (int): Number of input channels.
432
+ num_heads (int): Number of attention heads.
433
+ window_size (tuple[int]): Window size.
434
+ shift_size (int): Shift size for SW-MSA.
435
+ mlp_ratio (float): Ratio of mlp hidden dim to embedding dim.
436
+ qkv_bias (bool, optional): If True, add a learnable bias to query, key, value. Default: True
437
+ norm_layer (nn.Module, optional): Normalization layer. Default: nn.LayerNorm
438
+ focal_level (int): Number of focal levels.
439
+ focal_window (tuple[int]): Size of the focal window.
440
+ n_vecs (int): Required for F3N.
441
+ t2t_params (int): T2T parameters for F3N.
442
+ """
443
+ def __init__(self,
444
+ dim,
445
+ num_heads,
446
+ window_size=(5, 9),
447
+ mlp_ratio=4.,
448
+ qkv_bias=True,
449
+ pool_method="fc",
450
+ focal_level=2,
451
+ focal_window=(5, 9),
452
+ norm_layer=nn.LayerNorm,
453
+ n_vecs=None,
454
+ t2t_params=None):
455
+ super().__init__()
456
+ self.dim = dim
457
+ self.num_heads = num_heads
458
+ self.window_size = window_size
459
+ self.expand_size = tuple(i // 2 for i in window_size) # TODO
460
+ self.mlp_ratio = mlp_ratio
461
+ self.pool_method = pool_method
462
+ self.focal_level = focal_level
463
+ self.focal_window = focal_window
464
+
465
+ self.window_size_glo = self.window_size
466
+
467
+ self.pool_layers = nn.ModuleList()
468
+ if self.pool_method != "none":
469
+ for k in range(self.focal_level - 1):
470
+ window_size_glo = tuple(
471
+ math.floor(i / (2**k)) for i in self.window_size_glo)
472
+ self.pool_layers.append(
473
+ nn.Linear(window_size_glo[0] * window_size_glo[1], 1))
474
+ self.pool_layers[-1].weight.data.fill_(
475
+ 1. / (window_size_glo[0] * window_size_glo[1]))
476
+ self.pool_layers[-1].bias.data.fill_(0)
477
+
478
+ self.norm1 = norm_layer(dim)
479
+
480
+ self.attn = WindowAttention(dim,
481
+ expand_size=self.expand_size,
482
+ window_size=self.window_size,
483
+ focal_window=focal_window,
484
+ focal_level=focal_level,
485
+ num_heads=num_heads,
486
+ qkv_bias=qkv_bias,
487
+ pool_method=pool_method)
488
+
489
+ self.norm2 = norm_layer(dim)
490
+ self.mlp = FusionFeedForward(dim, n_vecs=n_vecs, t2t_params=t2t_params)
491
+
492
+ def forward(self, x):
493
+ output_size = x[1]
494
+ x = x[0]
495
+
496
+ B, T, H, W, C = x.shape
497
+
498
+ shortcut = x
499
+ x = self.norm1(x)
500
+
501
+ shifted_x = x
502
+
503
+ x_windows_all = [shifted_x]
504
+ x_window_masks_all = [None]
505
+
506
+ # partition windows tuple(i // 2 for i in window_size)
507
+ if self.focal_level > 1 and self.pool_method != "none":
508
+ # if we add coarser granularity and the pool method is not none
509
+ for k in range(self.focal_level - 1):
510
+ window_size_glo = tuple(
511
+ math.floor(i / (2**k)) for i in self.window_size_glo)
512
+ pooled_h = math.ceil(H / window_size_glo[0]) * (2**k)
513
+ pooled_w = math.ceil(W / window_size_glo[1]) * (2**k)
514
+ H_pool = pooled_h * window_size_glo[0]
515
+ W_pool = pooled_w * window_size_glo[1]
516
+
517
+ x_level_k = shifted_x
518
+ # trim or pad shifted_x depending on the required size
519
+ if H > H_pool:
520
+ trim_t = (H - H_pool) // 2
521
+ trim_b = H - H_pool - trim_t
522
+ x_level_k = x_level_k[:, :, trim_t:-trim_b]
523
+ elif H < H_pool:
524
+ pad_t = (H_pool - H) // 2
525
+ pad_b = H_pool - H - pad_t
526
+ x_level_k = F.pad(x_level_k, (0, 0, 0, 0, pad_t, pad_b))
527
+
528
+ if W > W_pool:
529
+ trim_l = (W - W_pool) // 2
530
+ trim_r = W - W_pool - trim_l
531
+ x_level_k = x_level_k[:, :, :, trim_l:-trim_r]
532
+ elif W < W_pool:
533
+ pad_l = (W_pool - W) // 2
534
+ pad_r = W_pool - W - pad_l
535
+ x_level_k = F.pad(x_level_k, (0, 0, pad_l, pad_r))
536
+
537
+ x_windows_noreshape = window_partition_noreshape(
538
+ x_level_k.contiguous(), window_size_glo
539
+ ) # B, nw, nw, T, window_size, window_size, C
540
+ nWh, nWw = x_windows_noreshape.shape[1:3]
541
+ x_windows_noreshape = x_windows_noreshape.view(
542
+ B, nWh, nWw, T, window_size_glo[0] * window_size_glo[1],
543
+ C).transpose(4, 5) # B, nWh, nWw, T, C, wsize**2
544
+ x_windows_pooled = self.pool_layers[k](
545
+ x_windows_noreshape).flatten(-2) # B, nWh, nWw, T, C
546
+
547
+ x_windows_all += [x_windows_pooled]
548
+ x_window_masks_all += [None]
549
+
550
+ # nW*B, T*window_size*window_size, C
551
+ attn_windows = self.attn(x_windows_all, mask_all=x_window_masks_all)
552
+
553
+ # merge windows
554
+ attn_windows = attn_windows.view(-1, T, self.window_size[0],
555
+ self.window_size[1], C)
556
+ shifted_x = window_reverse(attn_windows, self.window_size, T, H,
557
+ W) # B T H' W' C
558
+
559
+ # FFN
560
+ x = shortcut + shifted_x
561
+ y = self.norm2(x)
562
+ x = x + self.mlp(y.view(B, T * H * W, C), output_size).view(
563
+ B, T, H, W, C)
564
+
565
+ return x, output_size
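Compared with the block above in tfocal_transformer.py, the HQ block threads a (features, output_size) tuple through forward, so a stack of blocks can be chained while the feed-forward keeps folding back to the frame-level feature size. A rough smoke-test sketch; the sizes are E2FGVI-style assumptions (a 60x108 feature map giving a 20x36 token grid), and the t2t parameters of kernel 7, stride 3, padding 3 are likewise assumed, matching the hard-coded 49 in FusionFeedForward:

import torch

t2t_params = {'kernel_size': (7, 7), 'stride': (3, 3), 'padding': (3, 3)}
block = TemporalFocalTransformerBlock(dim=512, num_heads=4, window_size=(5, 9),
                                      focal_level=2, focal_window=(5, 9),
                                      n_vecs=20 * 36, t2t_params=t2t_params)

x = torch.randn(1, 5, 20, 36, 512)        # B, T, token_h, token_w, C
out, output_size = block((x, (60, 108)))  # forward consumes and returns the (x, output_size) pair
assert out.shape == x.shape and output_size == (60, 108)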
inpainter/util/__init__.py ADDED
File without changes
inpainter/util/tensor_util.py ADDED
@@ -0,0 +1,24 @@
1
+ import cv2
2
+ import numpy as np
3
+
4
+ # resize frames
5
+ def resize_frames(frames, size=None):
6
+ """
7
+ size: (w, h)
8
+ """
9
+ if size is not None:
10
+ frames = [cv2.resize(f, size) for f in frames]
11
+ frames = np.stack(frames, 0)
12
+
13
+ return frames
14
+
15
+ # resize masks
16
+ def resize_masks(masks, size=None):
17
+ """
18
+ size: (w, h)
19
+ """
20
+ if size is not None:
21
+ masks = [np.expand_dims(cv2.resize(m, size), 2) for m in masks]
22
+ masks = np.stack(masks, 0)
23
+
24
+ return masks
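A small usage sketch for these two helpers (dummy arrays; the import assumes the repository root is on sys.path). Note that size follows the cv2 convention of (w, h), while the stacked outputs come back as (n, h, w, c):

import numpy as np
from inpainter.util.tensor_util import resize_frames, resize_masks

frames = [np.zeros((480, 640, 3), dtype=np.uint8) for _ in range(4)]
masks = [np.zeros((480, 640), dtype=np.uint8) for _ in range(4)]

print(resize_frames(frames, size=(320, 240)).shape)  # (4, 240, 320, 3)
print(resize_masks(masks, size=(320, 240)).shape)    # (4, 240, 320, 1)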
requirements.txt ADDED
@@ -0,0 +1,17 @@
1
+ progressbar2
2
+ gdown
3
+ gitpython
4
+ git+https://github.com/cheind/py-thin-plate-spline
5
+ hickle
6
+ tensorboard
7
+ numpy
8
+ git+https://github.com/facebookresearch/segment-anything.git
9
+ gradio==3.25.0
10
+ opencv-python
11
+ pycocotools
12
+ matplotlib
13
+ onnxruntime
14
+ onnx
15
+ metaseg
16
+ pyyaml
17
+ av
sam_vit_h_4b8939.pth ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a7bf3b02f3ebf1267aba913ff637d9a2d5c33d3173bb679e46d9f338c26f262e
3
+ size 2564550879
template.html ADDED
@@ -0,0 +1,27 @@
1
+ <!-- template.html -->
2
+ <!DOCTYPE html>
3
+ <html lang="en">
4
+ <head>
5
+ <meta charset="UTF-8">
6
+ <meta name="viewport" content="width=device-width, initial-scale=1.0">
7
+ <title>Gradio Video Pause Time</title>
8
+ </head>
9
+ <body>
10
+ <video id="video" controls>
11
+ <source src="{{VIDEO_URL}}" type="video/mp4">
12
+ Your browser does not support the video tag.
13
+ </video>
14
+ <script>
15
+ const video = document.getElementById("video");
16
+ let pauseTime = null;
17
+
18
+ video.addEventListener("pause", () => {
19
+ pauseTime = video.currentTime;
20
+ });
21
+
22
+ function getPauseTime() {
23
+ return pauseTime;
24
+ }
25
+ </script>
26
+ </body>
27
+ </html>
templates/index.html ADDED
@@ -0,0 +1,50 @@
1
+ <!DOCTYPE html>
2
+ <html lang="en">
3
+ <head>
4
+ <meta charset="UTF-8">
5
+ <meta http-equiv="X-UA-Compatible" content="IE=edge">
6
+ <meta name="viewport" content="width=device-width, initial-scale=1.0">
7
+ <title>Video Object Segmentation</title>
8
+ <script src="https://code.jquery.com/jquery-3.6.0.min.js"></script>
9
+ </head>
10
+ <body>
11
+ <h1>Video Object Segmentation</h1>
12
+
13
+ <input type="file" id="video-input" accept="video/*">
14
+ <button id="upload-video">Upload Video</button>
15
+ <br>
16
+ <button id="template-select">Template Select</button>
17
+ <button id="sam-refine">SAM Refine</button>
18
+ <br>
19
+ <button id="track-video">Track Video</button>
20
+ <button id="track-image">Track Image</button>
21
+ <br>
22
+ <a href="/download_video" id="download-video" download>Download Video</a>
23
+
24
+ <script>
25
+ // JavaScript code for handling interactions with the server
26
+ $("#upload-video").click(function() {
27
+ var videoInput = document.getElementById("video-input");
28
+ var formData = new FormData();
29
+ formData.append("video", videoInput.files[0]);
30
+
31
+ $.ajax({
32
+ url: "/upload_video",
33
+ type: "POST",
34
+ data: formData,
35
+ processData: false,
36
+ contentType: false,
37
+ success: function(response) {
38
+ console.log(response);
39
+ // Process the response and update the UI accordingly
40
+ },
41
+ error: function(jqXHR, textStatus, errorThrown) {
42
+ console.log(textStatus, errorThrown);
43
+ }
44
+ });
45
+ });
46
+
47
+ </script>
48
+ </body>
49
+ </html>
50
+
text_server.py ADDED
@@ -0,0 +1,72 @@
1
+ import os
2
+ import sys
3
+ import cv2
4
+ import time
5
+ import json
6
+ import queue
7
+ import numpy as np
8
+ import requests
9
+ import concurrent.futures
10
+ from PIL import Image
11
+ from flask import Flask, render_template, request, jsonify, send_file
12
+ import torchvision
13
+ import torch
14
+
15
+ from demo import automask_image_app, automask_video_app, sahi_autoseg_app
16
+ sys.path.append(sys.path[0] + "/tracker")
17
+ sys.path.append(sys.path[0] + "/tracker/model")
18
+ from track_anything import TrackingAnything
19
+ from track_anything import parse_augment
20
+
21
+ # ... (all the functions defined in the original code except the Gradio part)
22
+
23
+ app = Flask(__name__)
24
+ app.config['UPLOAD_FOLDER'] = './uploaded_videos'
25
+ app.config['ALLOWED_EXTENSIONS'] = {'mp4', 'avi', 'mov', 'mkv'}
26
+
27
+
28
+ def allowed_file(filename):
29
+ return '.' in filename and filename.rsplit('.', 1)[1].lower() in app.config['ALLOWED_EXTENSIONS']
30
+
31
+ @app.route("/")
32
+ def index():
33
+ return render_template("index.html")
34
+
35
+ @app.route("/upload_video", methods=["POST"])
36
+ def upload_video():
37
+ # ... (handle video upload and processing)
38
+ return jsonify(status="success", data=video_data)
39
+
40
+ @app.route("/template_select", methods=["POST"])
41
+ def template_select():
42
+ # ... (handle template selection and processing)
43
+ return jsonify(status="success", data=template_data)
44
+
45
+ @app.route("/sam_refine", methods=["POST"])
46
+ def sam_refine_request():
47
+ # ... (handle sam refine and processing)
48
+ return jsonify(status="success", data=sam_data)
49
+
50
+ @app.route("/track_video", methods=["POST"])
51
+ def track_video():
52
+ # ... (handle video tracking and processing)
53
+ return jsonify(status="success", data=tracking_data)
54
+
55
+ @app.route("/track_image", methods=["POST"])
56
+ def track_image():
57
+ # ... (handle image tracking and processing)
58
+ return jsonify(status="success", data=tracking_data)
59
+
60
+ @app.route("/download_video", methods=["GET"])
61
+ def download_video():
62
+ try:
63
+ return send_file("output.mp4", attachment_filename="output.mp4")
64
+ except Exception as e:
65
+ return str(e)
66
+
67
+ if __name__ == "__main__":
+ app.run(host="0.0.0.0", port=12212, debug=True)
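Once the elided handler bodies are filled in, the routes above can be exercised from a small client. A hypothetical sketch using requests, with a placeholder file name, the default port from this file, and the "video" form field expected by templates/index.html:

import requests

with open("demo.mp4", "rb") as f:
    resp = requests.post("http://127.0.0.1:12212/upload_video", files={"video": f})
print(resp.status_code, resp.json())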
tools/__init__.py ADDED
File without changes
tools/base_segmenter.py ADDED
@@ -0,0 +1,129 @@
1
+ import time
2
+ import torch
3
+ import cv2
4
+ from PIL import Image, ImageDraw, ImageOps
5
+ import numpy as np
6
+ from typing import Union
7
+ from segment_anything import sam_model_registry, SamPredictor, SamAutomaticMaskGenerator
8
+ import matplotlib.pyplot as plt
9
+ import PIL
10
+ from .mask_painter import mask_painter
11
+
12
+
13
+ class BaseSegmenter:
14
+ def __init__(self, SAM_checkpoint, model_type, device='cuda:0'):
15
+ """
16
+ device: model device
17
+ SAM_checkpoint: path of SAM checkpoint
18
+ model_type: vit_b, vit_l, vit_h
19
+ """
20
+ print(f"Initializing BaseSegmenter to {device}")
21
+ assert model_type in ['vit_b', 'vit_l', 'vit_h'], 'model_type must be vit_b, vit_l, or vit_h'
22
+
23
+ self.device = device
24
+ self.torch_dtype = torch.float16 if 'cuda' in device else torch.float32
25
+ self.model = sam_model_registry[model_type](checkpoint=SAM_checkpoint)
26
+ self.model.to(device=self.device)
27
+ self.predictor = SamPredictor(self.model)
28
+ self.embedded = False
29
+
30
+ @torch.no_grad()
31
+ def set_image(self, image: np.ndarray):
32
+ # expects a 3-channel RGB numpy array (e.g. loaded via PIL Image.open)
33
+ # image embedding: avoid encoding the same image multiple times
34
+ self.orignal_image = image
35
+ if self.embedded:
36
+ print('repeat embedding, please reset_image.')
37
+ return
38
+ self.predictor.set_image(image)
39
+ self.embedded = True
40
+ return
41
+
42
+ @torch.no_grad()
43
+ def reset_image(self):
44
+ # reset image embeding
45
+ self.predictor.reset_image()
46
+ self.embedded = False
47
+
48
+ def predict(self, prompts, mode, multimask=True):
49
+ """
50
+ image: numpy array, h, w, 3
51
+ prompts: dictionary, 3 keys: 'point_coords', 'point_labels', 'mask_input'
52
+ prompts['point_coords']: numpy array [N,2]
53
+ prompts['point_labels']: numpy array [1,N]
54
+ prompts['mask_input']: numpy array [1,256,256]
55
+ mode: 'point' (points only), 'mask' (mask only), 'both' (consider both)
56
+ multimask: True (return 3 masks), False (return 1 mask only)
57
+ when multimask=True, mask_input=logits[np.argmax(scores), :, :][None, :, :]
58
+ """
59
+ assert self.embedded, 'prediction is called before set_image (feature embedding).'
60
+ assert mode in ['point', 'mask', 'both'], 'mode must be point, mask, or both'
61
+
62
+ if mode == 'point':
63
+ masks, scores, logits = self.predictor.predict(point_coords=prompts['point_coords'],
64
+ point_labels=prompts['point_labels'],
65
+ multimask_output=multimask)
66
+ elif mode == 'mask':
67
+ masks, scores, logits = self.predictor.predict(mask_input=prompts['mask_input'],
68
+ multimask_output=multimask)
69
+ elif mode == 'both': # both
70
+ masks, scores, logits = self.predictor.predict(point_coords=prompts['point_coords'],
71
+ point_labels=prompts['point_labels'],
72
+ mask_input=prompts['mask_input'],
73
+ multimask_output=multimask)
74
+ else:
75
+ raise NotImplementedError("mode not implemented yet!")
76
+ # masks (n, h, w), scores (n,), logits (n, 256, 256)
77
+ return masks, scores, logits
78
+
79
+
80
+ if __name__ == "__main__":
81
+ # load and show an image
82
+ image = cv2.imread('/hhd3/gaoshang/truck.jpg')
83
+ image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB) # numpy array (h, w, 3)
84
+
85
+ # initialise BaseSegmenter
86
+ SAM_checkpoint= '/ssd1/gaomingqi/checkpoints/sam_vit_h_4b8939.pth'
87
+ model_type = 'vit_h'
88
+ device = "cuda:4"
89
+ base_segmenter = BaseSegmenter(SAM_checkpoint=SAM_checkpoint, model_type=model_type, device=device)
90
+
91
+ # image embedding (once embedded, multiple prompts can be applied)
92
+ base_segmenter.set_image(image)
93
+
94
+ # examples
95
+ # point only ------------------------
96
+ mode = 'point'
97
+ prompts = {
98
+ 'point_coords': np.array([[500, 375], [1125, 625]]),
99
+ 'point_labels': np.array([1, 1]),
100
+ }
101
+ masks, scores, logits = base_segmenter.predict(prompts, mode, multimask=False) # masks (n, h, w), scores (n,), logits (n, 256, 256)
102
+ painted_image = mask_painter(image, masks[np.argmax(scores)].astype('uint8'), background_alpha=0.8)
103
+ painted_image = cv2.cvtColor(painted_image, cv2.COLOR_RGB2BGR) # numpy array (h, w, 3)
104
+ cv2.imwrite('/hhd3/gaoshang/truck_point.jpg', painted_image)
105
+
106
+ # both ------------------------
107
+ mode = 'both'
108
+ mask_input = logits[np.argmax(scores), :, :]
109
+ prompts = {'mask_input': mask_input [None, :, :]}
110
+ prompts = {
111
+ 'point_coords': np.array([[500, 375], [1125, 625]]),
112
+ 'point_labels': np.array([1, 0]),
113
+ 'mask_input': mask_input[None, :, :]
114
+ }
115
+ masks, scores, logits = base_segmenter.predict(prompts, mode, multimask=True) # masks (n, h, w), scores (n,), logits (n, 256, 256)
116
+ painted_image = mask_painter(image, masks[np.argmax(scores)].astype('uint8'), background_alpha=0.8)
117
+ painted_image = cv2.cvtColor(painted_image, cv2.COLOR_RGB2BGR) # numpy array (h, w, 3)
118
+ cv2.imwrite('/hhd3/gaoshang/truck_both.jpg', painted_image)
119
+
120
+ # mask only ------------------------
121
+ mode = 'mask'
122
+ mask_input = logits[np.argmax(scores), :, :]
123
+
124
+ prompts = {'mask_input': mask_input[None, :, :]}
125
+
126
+ masks, scores, logits = base_segmenter.predict(prompts, mode, multimask=True) # masks (n, h, w), scores (n,), logits (n, 256, 256)
127
+ painted_image = mask_painter(image, masks[np.argmax(scores)].astype('uint8'), background_alpha=0.8)
128
+ painted_image = cv2.cvtColor(painted_image, cv2.COLOR_RGB2BGR) # numpy array (h, w, 3)
129
+ cv2.imwrite('/hhd3/gaoshang/truck_mask.jpg', painted_image)
tools/interact_tools.py ADDED
@@ -0,0 +1,265 @@
1
+ import time
2
+ import torch
3
+ import cv2
4
+ from PIL import Image, ImageDraw, ImageOps
5
+ import numpy as np
6
+ from typing import Union
7
+ from segment_anything import sam_model_registry, SamPredictor, SamAutomaticMaskGenerator
8
+ import matplotlib.pyplot as plt
9
+ import PIL
10
+ from .mask_painter import mask_painter as mask_painter2
11
+ from .base_segmenter import BaseSegmenter
12
+ from .painter import mask_painter, point_painter
13
+ import os
14
+ import requests
15
+ import sys
16
+
17
+
18
+ mask_color = 3
19
+ mask_alpha = 0.7
20
+ contour_color = 1
21
+ contour_width = 5
22
+ point_color_ne = 8
23
+ point_color_ps = 50
24
+ point_alpha = 0.9
25
+ point_radius = 15
26
+ contour_color = 2
27
+ contour_width = 5
28
+
29
+
30
+ class SamControler():
31
+ def __init__(self, SAM_checkpoint, model_type, device):
32
+ '''
33
+ initialize sam controler
34
+ '''
35
+
36
+
37
+ self.sam_controler = BaseSegmenter(SAM_checkpoint, model_type, device)
38
+
39
+
40
+ def seg_again(self, image: np.ndarray):
41
+ '''
42
+ it is used when interact in video
43
+ '''
44
+ self.sam_controler.reset_image()
45
+ self.sam_controler.set_image(image)
46
+ return
47
+
48
+
49
+ def first_frame_click(self, image: np.ndarray, points:np.ndarray, labels: np.ndarray, multimask=True):
50
+ '''
51
+ it is used in first frame in video
52
+ return: mask, logit, painted image(mask+point)
53
+ '''
54
+ # self.sam_controler.set_image(image)
55
+ origal_image = self.sam_controler.orignal_image
56
+ neg_flag = labels[-1]
57
+ if neg_flag==1:
58
+ #find neg
59
+ prompts = {
60
+ 'point_coords': points,
61
+ 'point_labels': labels,
62
+ }
63
+ masks, scores, logits = self.sam_controler.predict(prompts, 'point', multimask)
64
+ mask, logit = masks[np.argmax(scores)], logits[np.argmax(scores), :, :]
65
+ prompts = {
66
+ 'point_coords': points,
67
+ 'point_labels': labels,
68
+ 'mask_input': logit[None, :, :]
69
+ }
70
+ masks, scores, logits = self.sam_controler.predict(prompts, 'both', multimask)
71
+ mask, logit = masks[np.argmax(scores)], logits[np.argmax(scores), :, :]
72
+ else:
73
+ #find positive
74
+ prompts = {
75
+ 'point_coords': points,
76
+ 'point_labels': labels,
77
+ }
78
+ masks, scores, logits = self.sam_controler.predict(prompts, 'point', multimask)
79
+ mask, logit = masks[np.argmax(scores)], logits[np.argmax(scores), :, :]
80
+
81
+
82
+ assert len(points)==len(labels)
83
+
84
+ painted_image = mask_painter(image, mask.astype('uint8'), mask_color, mask_alpha, contour_color, contour_width)
85
+ painted_image = point_painter(painted_image, np.squeeze(points[np.argwhere(labels>0)],axis = 1), point_color_ne, point_alpha, point_radius, contour_color, contour_width)
86
+ painted_image = point_painter(painted_image, np.squeeze(points[np.argwhere(labels<1)],axis = 1), point_color_ps, point_alpha, point_radius, contour_color, contour_width)
87
+ painted_image = Image.fromarray(painted_image)
88
+
89
+ return mask, logit, painted_image
90
+
91
+ def interact_loop(self, image:np.ndarray, same: bool, points:np.ndarray, labels: np.ndarray, logits: np.ndarray=None, multimask=True):
92
+ origal_image = self.sam_controler.orignal_image
93
+ if same:
94
+ '''
95
+ true; loop in the same image
96
+ '''
97
+ prompts = {
98
+ 'point_coords': points,
99
+ 'point_labels': labels,
100
+ 'mask_input': logits[None, :, :]
101
+ }
102
+ masks, scores, logits = self.sam_controler.predict(prompts, 'both', multimask)
103
+ mask, logit = masks[np.argmax(scores)], logits[np.argmax(scores), :, :]
104
+
105
+ painted_image = mask_painter(image, mask.astype('uint8'), mask_color, mask_alpha, contour_color, contour_width)
106
+ painted_image = point_painter(painted_image, np.squeeze(points[np.argwhere(labels>0)],axis = 1), point_color_ne, point_alpha, point_radius, contour_color, contour_width)
107
+ painted_image = point_painter(painted_image, np.squeeze(points[np.argwhere(labels<1)],axis = 1), point_color_ps, point_alpha, point_radius, contour_color, contour_width)
108
+ painted_image = Image.fromarray(painted_image)
109
+
110
+ return mask, logit, painted_image
111
+ else:
112
+ '''
113
+ loop in the different image, interact in the video
114
+ '''
115
+ if image is None:
116
+ raise('Image error')
117
+ else:
118
+ self.seg_again(image)
119
+ prompts = {
120
+ 'point_coords': points,
121
+ 'point_labels': labels,
122
+ }
123
+ masks, scores, logits = self.sam_controler.predict(prompts, 'point', multimask)
124
+ mask, logit = masks[np.argmax(scores)], logits[np.argmax(scores), :, :]
125
+
126
+ painted_image = mask_painter(image, mask.astype('uint8'), mask_color, mask_alpha, contour_color, contour_width)
127
+ painted_image = point_painter(painted_image, np.squeeze(points[np.argwhere(labels>0)],axis = 1), point_color_ne, point_alpha, point_radius, contour_color, contour_width)
128
+ painted_image = point_painter(painted_image, np.squeeze(points[np.argwhere(labels<1)],axis = 1), point_color_ps, point_alpha, point_radius, contour_color, contour_width)
129
+ painted_image = Image.fromarray(painted_image)
130
+
131
+ return mask, logit, painted_image
132
+
133
+
134
+
135
+
136
+
137
+
138
+ # def initialize():
139
+ # '''
140
+ # initialize sam controler
141
+ # '''
142
+ # checkpoint_url = "https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth"
143
+ # folder = "segmenter"
144
+ # SAM_checkpoint= './checkpoints/sam_vit_h_4b8939.pth'
145
+ # download_checkpoint(checkpoint_url, folder, SAM_checkpoint)
146
+
147
+
148
+ # model_type = 'vit_h'
149
+ # device = "cuda:0"
150
+ # sam_controler = BaseSegmenter(SAM_checkpoint, model_type, device)
151
+ # return sam_controler
152
+
153
+
154
+ # def seg_again(sam_controler, image: np.ndarray):
155
+ # '''
156
+ # it is used when interact in video
157
+ # '''
158
+ # sam_controler.reset_image()
159
+ # sam_controler.set_image(image)
160
+ # return
161
+
162
+
163
+ # def first_frame_click(sam_controler, image: np.ndarray, points:np.ndarray, labels: np.ndarray, multimask=True):
164
+ # '''
165
+ # it is used in first frame in video
166
+ # return: mask, logit, painted image(mask+point)
167
+ # '''
168
+ # sam_controler.set_image(image)
169
+ # prompts = {
170
+ # 'point_coords': points,
171
+ # 'point_labels': labels,
172
+ # }
173
+ # masks, scores, logits = sam_controler.predict(prompts, 'point', multimask)
174
+ # mask, logit = masks[np.argmax(scores)], logits[np.argmax(scores), :, :]
175
+
176
+ # assert len(points)==len(labels)
177
+
178
+ # painted_image = mask_painter(image, mask.astype('uint8'), mask_color, mask_alpha, contour_color, contour_width)
179
+ # painted_image = point_painter(painted_image, np.squeeze(points[np.argwhere(labels>0)],axis = 1), point_color_ne, point_alpha, point_radius, contour_color, contour_width)
180
+ # painted_image = point_painter(painted_image, np.squeeze(points[np.argwhere(labels<1)],axis = 1), point_color_ps, point_alpha, point_radius, contour_color, contour_width)
181
+ # painted_image = Image.fromarray(painted_image)
182
+
183
+ # return mask, logit, painted_image
184
+
185
+ # def interact_loop(sam_controler, image:np.ndarray, same: bool, points:np.ndarray, labels: np.ndarray, logits: np.ndarray=None, multimask=True):
186
+ # if same:
187
+ # '''
188
+ # true; loop in the same image
189
+ # '''
190
+ # prompts = {
191
+ # 'point_coords': points,
192
+ # 'point_labels': labels,
193
+ # 'mask_input': logits[None, :, :]
194
+ # }
195
+ # masks, scores, logits = sam_controler.predict(prompts, 'both', multimask)
196
+ # mask, logit = masks[np.argmax(scores)], logits[np.argmax(scores), :, :]
197
+
198
+ # painted_image = mask_painter(image, mask.astype('uint8'), mask_color, mask_alpha, contour_color, contour_width)
199
+ # painted_image = point_painter(painted_image, np.squeeze(points[np.argwhere(labels>0)],axis = 1), point_color_ne, point_alpha, point_radius, contour_color, contour_width)
200
+ # painted_image = point_painter(painted_image, np.squeeze(points[np.argwhere(labels<1)],axis = 1), point_color_ps, point_alpha, point_radius, contour_color, contour_width)
201
+ # painted_image = Image.fromarray(painted_image)
202
+
203
+ # return mask, logit, painted_image
204
+ # else:
205
+ # '''
206
+ # loop in the different image, interact in the video
207
+ # '''
208
+ # if image is None:
209
+ # raise('Image error')
210
+ # else:
211
+ # seg_again(sam_controler, image)
212
+ # prompts = {
213
+ # 'point_coords': points,
214
+ # 'point_labels': labels,
215
+ # }
216
+ # masks, scores, logits = sam_controler.predict(prompts, 'point', multimask)
217
+ # mask, logit = masks[np.argmax(scores)], logits[np.argmax(scores), :, :]
218
+
219
+ # painted_image = mask_painter(image, mask.astype('uint8'), mask_color, mask_alpha, contour_color, contour_width)
220
+ # painted_image = point_painter(painted_image, np.squeeze(points[np.argwhere(labels>0)],axis = 1), point_color_ne, point_alpha, point_radius, contour_color, contour_width)
221
+ # painted_image = point_painter(painted_image, np.squeeze(points[np.argwhere(labels<1)],axis = 1), point_color_ps, point_alpha, point_radius, contour_color, contour_width)
222
+ # painted_image = Image.fromarray(painted_image)
223
+
224
+ # return mask, logit, painted_image
225
+
226
+
227
+
228
+
229
+ if __name__ == "__main__":
230
+ points = np.array([[500, 375], [1125, 625]])
231
+ labels = np.array([1, 1])
232
+ image = cv2.imread('/hhd3/gaoshang/truck.jpg')
233
+ image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
234
+
235
+ sam_controler = SamControler('./checkpoints/sam_vit_h_4b8939.pth', 'vit_h', 'cuda:0')
+ sam_controler.sam_controler.set_image(image)
236
+ mask, logit, painted_image_full = sam_controler.first_frame_click(image, points, labels, multimask=True)
237
+ painted_image = mask_painter2(image, mask.astype('uint8'), background_alpha=0.8)
238
+ painted_image = cv2.cvtColor(painted_image, cv2.COLOR_RGB2BGR) # numpy array (h, w, 3)
239
+ cv2.imwrite('/hhd3/gaoshang/truck_point.jpg', painted_image)
240
+ cv2.imwrite('/hhd3/gaoshang/truck_change.jpg', image)
241
+ painted_image_full.save('/hhd3/gaoshang/truck_point_full.jpg')
242
+
243
+ mask, logit, painted_image_full = sam_controler.interact_loop(image, True, points, np.array([1, 0]), logit, multimask=True)
244
+ painted_image = mask_painter2(image, mask.astype('uint8'), background_alpha=0.8)
245
+ painted_image = cv2.cvtColor(painted_image, cv2.COLOR_RGB2BGR) # numpy array (h, w, 3)
246
+ cv2.imwrite('/hhd3/gaoshang/truck_same.jpg', painted_image)
247
+ painted_image_full.save('/hhd3/gaoshang/truck_same_full.jpg')
248
+
249
+ mask, logit, painted_image_full = interact_loop(sam_controler,image, False, points, labels, multimask=True)
250
+ painted_image = mask_painter2(image, mask.astype('uint8'), background_alpha=0.8)
251
+ painted_image = cv2.cvtColor(painted_image, cv2.COLOR_RGB2BGR) # numpy array (h, w, 3)
252
+ cv2.imwrite('/hhd3/gaoshang/truck_diff.jpg', painted_image)
253
+ painted_image_full.save('/hhd3/gaoshang/truck_diff_full.jpg')
254
+
255
+
256
+
257
+
258
+
259
+
260
+
261
+
262
+
263
+
264
+
265
+
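The commented-out interact_loop above captures the two interaction cases: further clicks on the same frame reuse the previous low-resolution logit as a mask prompt, while clicks on a new frame require re-embedding the image first. A minimal sketch of the same idea against the plain segment_anything SamPredictor API (the checkpoint and image paths are placeholders):

import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

predictor = SamPredictor(sam_model_registry['vit_h'](checkpoint='./sam_vit_h_4b8939.pth'))

image = cv2.cvtColor(cv2.imread('./images/truck.jpg'), cv2.COLOR_BGR2RGB)  # placeholder image
predictor.set_image(image)

points = np.array([[500, 375]])
labels = np.array([1])

# first click on a frame: point prompt only
masks, scores, logits = predictor.predict(point_coords=points, point_labels=labels,
                                          multimask_output=True)
best = int(np.argmax(scores))

# further clicks on the same frame: feed the best logit back as a mask prompt
masks, scores, logits = predictor.predict(point_coords=points, point_labels=labels,
                                          mask_input=logits[best][None, :, :],
                                          multimask_output=True)

# moving to a different frame: call predictor.set_image(new_frame) again,
# then prompt with points only, as in the first call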
tools/mask_painter.py ADDED
@@ -0,0 +1,288 @@
1
+ import cv2
2
+ import torch
3
+ import numpy as np
4
+ from PIL import Image
5
+ import copy
6
+ import time
7
+
8
+
9
+ def colormap(rgb=True):
10
+ color_list = np.array(
11
+ [
12
+ 0.000, 0.000, 0.000,
13
+ 1.000, 1.000, 1.000,
14
+ 1.000, 0.498, 0.313,
15
+ 0.392, 0.581, 0.929,
16
+ 0.000, 0.447, 0.741,
17
+ 0.850, 0.325, 0.098,
18
+ 0.929, 0.694, 0.125,
19
+ 0.494, 0.184, 0.556,
20
+ 0.466, 0.674, 0.188,
21
+ 0.301, 0.745, 0.933,
22
+ 0.635, 0.078, 0.184,
23
+ 0.300, 0.300, 0.300,
24
+ 0.600, 0.600, 0.600,
25
+ 1.000, 0.000, 0.000,
26
+ 1.000, 0.500, 0.000,
27
+ 0.749, 0.749, 0.000,
28
+ 0.000, 1.000, 0.000,
29
+ 0.000, 0.000, 1.000,
30
+ 0.667, 0.000, 1.000,
31
+ 0.333, 0.333, 0.000,
32
+ 0.333, 0.667, 0.000,
33
+ 0.333, 1.000, 0.000,
34
+ 0.667, 0.333, 0.000,
35
+ 0.667, 0.667, 0.000,
36
+ 0.667, 1.000, 0.000,
37
+ 1.000, 0.333, 0.000,
38
+ 1.000, 0.667, 0.000,
39
+ 1.000, 1.000, 0.000,
40
+ 0.000, 0.333, 0.500,
41
+ 0.000, 0.667, 0.500,
42
+ 0.000, 1.000, 0.500,
43
+ 0.333, 0.000, 0.500,
44
+ 0.333, 0.333, 0.500,
45
+ 0.333, 0.667, 0.500,
46
+ 0.333, 1.000, 0.500,
47
+ 0.667, 0.000, 0.500,
48
+ 0.667, 0.333, 0.500,
49
+ 0.667, 0.667, 0.500,
50
+ 0.667, 1.000, 0.500,
51
+ 1.000, 0.000, 0.500,
52
+ 1.000, 0.333, 0.500,
53
+ 1.000, 0.667, 0.500,
54
+ 1.000, 1.000, 0.500,
55
+ 0.000, 0.333, 1.000,
56
+ 0.000, 0.667, 1.000,
57
+ 0.000, 1.000, 1.000,
58
+ 0.333, 0.000, 1.000,
59
+ 0.333, 0.333, 1.000,
60
+ 0.333, 0.667, 1.000,
61
+ 0.333, 1.000, 1.000,
62
+ 0.667, 0.000, 1.000,
63
+ 0.667, 0.333, 1.000,
64
+ 0.667, 0.667, 1.000,
65
+ 0.667, 1.000, 1.000,
66
+ 1.000, 0.000, 1.000,
67
+ 1.000, 0.333, 1.000,
68
+ 1.000, 0.667, 1.000,
69
+ 0.167, 0.000, 0.000,
70
+ 0.333, 0.000, 0.000,
71
+ 0.500, 0.000, 0.000,
72
+ 0.667, 0.000, 0.000,
73
+ 0.833, 0.000, 0.000,
74
+ 1.000, 0.000, 0.000,
75
+ 0.000, 0.167, 0.000,
76
+ 0.000, 0.333, 0.000,
77
+ 0.000, 0.500, 0.000,
78
+ 0.000, 0.667, 0.000,
79
+ 0.000, 0.833, 0.000,
80
+ 0.000, 1.000, 0.000,
81
+ 0.000, 0.000, 0.167,
82
+ 0.000, 0.000, 0.333,
83
+ 0.000, 0.000, 0.500,
84
+ 0.000, 0.000, 0.667,
85
+ 0.000, 0.000, 0.833,
86
+ 0.000, 0.000, 1.000,
87
+ 0.143, 0.143, 0.143,
88
+ 0.286, 0.286, 0.286,
89
+ 0.429, 0.429, 0.429,
90
+ 0.571, 0.571, 0.571,
91
+ 0.714, 0.714, 0.714,
92
+ 0.857, 0.857, 0.857
93
+ ]
94
+ ).astype(np.float32)
95
+ color_list = color_list.reshape((-1, 3)) * 255
96
+ if not rgb:
97
+ color_list = color_list[:, ::-1]
98
+ return color_list
99
+
100
+
101
+ color_list = colormap()
102
+ color_list = color_list.astype('uint8').tolist()
103
+
104
+
105
+ def vis_add_mask(image, background_mask, contour_mask, background_color, contour_color, background_alpha, contour_alpha):
106
+ background_color = np.array(background_color)
107
+ contour_color = np.array(contour_color)
108
+
109
+ # background_mask = 1 - background_mask
110
+ # contour_mask = 1 - contour_mask
111
+
112
+ for i in range(3):
113
+ image[:, :, i] = image[:, :, i] * (1-background_alpha+background_mask*background_alpha) \
114
+ + background_color[i] * (background_alpha-background_mask*background_alpha)
115
+
116
+ image[:, :, i] = image[:, :, i] * (1-contour_alpha+contour_mask*contour_alpha) \
117
+ + contour_color[i] * (contour_alpha-contour_mask*contour_alpha)
118
+
119
+ return image.astype('uint8')
120
+
121
+
122
+ def mask_generator_00(mask, background_radius, contour_radius):
123
+ # no background width when '00'
124
+ # distance map
125
+ dist_transform_fore = cv2.distanceTransform(mask, cv2.DIST_L2, 3)
126
+ dist_transform_back = cv2.distanceTransform(1-mask, cv2.DIST_L2, 3)
127
+ dist_map = dist_transform_fore - dist_transform_back
128
+ # ...:::!!!:::...
129
+ contour_radius += 2
130
+ contour_mask = np.abs(np.clip(dist_map, -contour_radius, contour_radius))
131
+ contour_mask = contour_mask / np.max(contour_mask)
132
+ contour_mask[contour_mask>0.5] = 1.
133
+
134
+ return mask, contour_mask
135
+
136
+
137
+ def mask_generator_01(mask, background_radius, contour_radius):
138
+ # no background width when '01'
139
+ # distance map
140
+ dist_transform_fore = cv2.distanceTransform(mask, cv2.DIST_L2, 3)
141
+ dist_transform_back = cv2.distanceTransform(1-mask, cv2.DIST_L2, 3)
142
+ dist_map = dist_transform_fore - dist_transform_back
143
+ # ...:::!!!:::...
144
+ contour_radius += 2
145
+ contour_mask = np.abs(np.clip(dist_map, -contour_radius, contour_radius))
146
+ contour_mask = contour_mask / np.max(contour_mask)
147
+ return mask, contour_mask
148
+
149
+
150
+ def mask_generator_10(mask, background_radius, contour_radius):
151
+ # distance map
152
+ dist_transform_fore = cv2.distanceTransform(mask, cv2.DIST_L2, 3)
153
+ dist_transform_back = cv2.distanceTransform(1-mask, cv2.DIST_L2, 3)
154
+ dist_map = dist_transform_fore - dist_transform_back
155
+ # .....:::::!!!!!
156
+ background_mask = np.clip(dist_map, -background_radius, background_radius)
157
+ background_mask = (background_mask - np.min(background_mask))
158
+ background_mask = background_mask / np.max(background_mask)
159
+ # ...:::!!!:::...
160
+ contour_radius += 2
161
+ contour_mask = np.abs(np.clip(dist_map, -contour_radius, contour_radius))
162
+ contour_mask = contour_mask / np.max(contour_mask)
163
+ contour_mask[contour_mask>0.5] = 1.
164
+ return background_mask, contour_mask
165
+
166
+
167
+ def mask_generator_11(mask, background_radius, contour_radius):
168
+ # distance map
169
+ dist_transform_fore = cv2.distanceTransform(mask, cv2.DIST_L2, 3)
170
+ dist_transform_back = cv2.distanceTransform(1-mask, cv2.DIST_L2, 3)
171
+ dist_map = dist_transform_fore - dist_transform_back
172
+ # .....:::::!!!!!
173
+ background_mask = np.clip(dist_map, -background_radius, background_radius)
174
+ background_mask = (background_mask - np.min(background_mask))
175
+ background_mask = background_mask / np.max(background_mask)
176
+ # ...:::!!!:::...
177
+ contour_radius += 2
178
+ contour_mask = np.abs(np.clip(dist_map, -contour_radius, contour_radius))
179
+ contour_mask = contour_mask / np.max(contour_mask)
180
+ return background_mask, contour_mask
181
+
182
+
183
+ def mask_painter(input_image, input_mask, background_alpha=0.5, background_blur_radius=7, contour_width=3, contour_color=3, contour_alpha=1, mode='11'):
184
+ """
185
+ Input:
186
+ input_image: numpy array (H, W, 3)
187
+ input_mask: numpy array (H, W)
188
+ background_alpha: transparency of background, [0, 1], 1: all black, 0: do nothing
189
+ background_blur_radius: radius of background blur, must be odd number
190
+ contour_width: width of mask contour, must be odd number
191
+ contour_color: color index (in color map) of mask contour, 0: black, 1: white, >1: others
192
+ contour_alpha: transparency of mask contour, [0, 1], if 0: no contour highlighted
193
+ mode: painting mode; '00': no blur, '01': only blur contour, '10': only blur background, '11': blur both
194
+
195
+ Output:
196
+ painted_image: numpy array
197
+ """
198
+ assert input_image.shape[:2] == input_mask.shape, 'different shape'
199
+ assert background_blur_radius % 2 == 1 and contour_width % 2 == 1, 'background_blur_radius and contour_width must be ODD'
200
+ assert mode in ['00', '01', '10', '11'], 'mode should be 00, 01, 10, or 11'
201
+
202
+ # downsample input image and mask
203
+ height, width = input_image.shape[0], input_image.shape[1]
204
+ res = 1024
205
+ ratio = min(1.0 * res / max(height, width), 1.0)
206
+ input_image = cv2.resize(input_image, (int(width*ratio), int(height*ratio)))
207
+ input_mask = cv2.resize(input_mask, (int(width*ratio), int(height*ratio)))
208
+
209
+ # 0: background, 1: foreground
210
+ msk = np.clip(input_mask, 0, 1)
211
+
212
+ # generate masks for background and contour pixels
213
+ background_radius = (background_blur_radius - 1) // 2
214
+ contour_radius = (contour_width - 1) // 2
215
+ generator_dict = {'00':mask_generator_00, '01':mask_generator_01, '10':mask_generator_10, '11':mask_generator_11}
216
+ background_mask, contour_mask = generator_dict[mode](msk, background_radius, contour_radius)
217
+
218
+ # paint
219
+ painted_image = vis_add_mask\
220
+ (input_image, background_mask, contour_mask, color_list[0], color_list[contour_color], background_alpha, contour_alpha) # black for background
221
+
222
+ return painted_image
223
+
224
+
225
+ if __name__ == '__main__':
226
+
227
+ background_alpha = 0.7 # transparency of background 1: all black, 0: do nothing
228
+ background_blur_radius = 31 # radius of background blur, must be odd number
229
+ contour_width = 11 # contour width, must be odd number
230
+ contour_color = 3 # id in color map, 0: black, 1: white, >1: others
231
+ contour_alpha = 1 # transparency of contour, 0: no contour highlighted
232
+
233
+ # load input image and mask
234
+ input_image = np.array(Image.open('./test_img/painter_input_image.jpg').convert('RGB'))
235
+ input_mask = np.array(Image.open('./test_img/painter_input_mask.jpg').convert('P'))
236
+
237
+ # paint
238
+ overall_time_1 = 0
239
+ overall_time_2 = 0
240
+ overall_time_3 = 0
241
+ overall_time_4 = 0
242
+ overall_time_5 = 0
243
+
244
+ for i in range(50):
245
+ t2 = time.time()
246
+ painted_image_00 = mask_painter(input_image, input_mask, background_alpha, background_blur_radius, contour_width, contour_color, contour_alpha, mode='00')
247
+ e2 = time.time()
248
+
249
+ t3 = time.time()
250
+ painted_image_10 = mask_painter(input_image, input_mask, background_alpha, background_blur_radius, contour_width, contour_color, contour_alpha, mode='10')
251
+ e3 = time.time()
252
+
253
+ t1 = time.time()
254
+ painted_image = mask_painter(input_image, input_mask, background_alpha, background_blur_radius, contour_width, contour_color, contour_alpha)
255
+ e1 = time.time()
256
+
257
+ t4 = time.time()
258
+ painted_image_01 = mask_painter(input_image, input_mask, background_alpha, background_blur_radius, contour_width, contour_color, contour_alpha, mode='01')
259
+ e4 = time.time()
260
+
261
+ t5 = time.time()
262
+ painted_image_11 = mask_painter(input_image, input_mask, background_alpha, background_blur_radius, contour_width, contour_color, contour_alpha, mode='11')
263
+ e5 = time.time()
264
+
265
+ overall_time_1 += (e1 - t1)
266
+ overall_time_2 += (e2 - t2)
267
+ overall_time_3 += (e3 - t3)
268
+ overall_time_4 += (e4 - t4)
269
+ overall_time_5 += (e5 - t5)
270
+
271
+ print(f'average time w gaussian: {overall_time_1/50}')
272
+ print(f'average time w/o gaussian00: {overall_time_2/50}')
273
+ print(f'average time w/o gaussian10: {overall_time_3/50}')
274
+ print(f'average time w/o gaussian01: {overall_time_4/50}')
275
+ print(f'average time w/o gaussian11: {overall_time_5/50}')
276
+
277
+ # save
278
+ painted_image_00 = Image.fromarray(painted_image_00)
279
+ painted_image_00.save('./test_img/painter_output_image_00.png')
280
+
281
+ painted_image_10 = Image.fromarray(painted_image_10)
282
+ painted_image_10.save('./test_img/painter_output_image_10.png')
283
+
284
+ painted_image_01 = Image.fromarray(painted_image_01)
285
+ painted_image_01.save('./test_img/painter_output_image_01.png')
286
+
287
+ painted_image_11 = Image.fromarray(painted_image_11)
288
+ painted_image_11.save('./test_img/painter_output_image_11.png')
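For quick experimentation, mask_painter can also be driven with synthetic data; the circle mask and 256x256 size below are made-up values, and the import assumes the repository root is on sys.path:

import numpy as np
from PIL import Image
from tools.mask_painter import mask_painter

# synthetic grey image with a filled-circle mask
h, w = 256, 256
image = np.full((h, w, 3), 200, dtype=np.uint8)
yy, xx = np.mgrid[:h, :w]
mask = (((yy - h // 2) ** 2 + (xx - w // 2) ** 2) < 60 ** 2).astype(np.uint8)

# try all four painting modes; blur radius and contour width must stay odd
for mode in ['00', '01', '10', '11']:
    painted = mask_painter(image.copy(), mask,
                           background_alpha=0.7, background_blur_radius=31,
                           contour_width=11, contour_color=3, contour_alpha=1,
                           mode=mode)
    Image.fromarray(painted).save(f'painter_mode_{mode}.png')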
tools/painter.py ADDED
@@ -0,0 +1,215 @@
1
+ # paint masks, contours, or points on images, with specified colors
2
+ import cv2
3
+ import torch
4
+ import numpy as np
5
+ from PIL import Image
6
+ import copy
7
+ import time
8
+
9
+
10
+ def colormap(rgb=True):
11
+ color_list = np.array(
12
+ [
13
+ 0.000, 0.000, 0.000,
14
+ 1.000, 1.000, 1.000,
15
+ 1.000, 0.498, 0.313,
16
+ 0.392, 0.581, 0.929,
17
+ 0.000, 0.447, 0.741,
18
+ 0.850, 0.325, 0.098,
19
+ 0.929, 0.694, 0.125,
20
+ 0.494, 0.184, 0.556,
21
+ 0.466, 0.674, 0.188,
22
+ 0.301, 0.745, 0.933,
23
+ 0.635, 0.078, 0.184,
24
+ 0.300, 0.300, 0.300,
25
+ 0.600, 0.600, 0.600,
26
+ 1.000, 0.000, 0.000,
27
+ 1.000, 0.500, 0.000,
28
+ 0.749, 0.749, 0.000,
29
+ 0.000, 1.000, 0.000,
30
+ 0.000, 0.000, 1.000,
31
+ 0.667, 0.000, 1.000,
32
+ 0.333, 0.333, 0.000,
33
+ 0.333, 0.667, 0.000,
34
+ 0.333, 1.000, 0.000,
35
+ 0.667, 0.333, 0.000,
36
+ 0.667, 0.667, 0.000,
37
+ 0.667, 1.000, 0.000,
38
+ 1.000, 0.333, 0.000,
39
+ 1.000, 0.667, 0.000,
40
+ 1.000, 1.000, 0.000,
41
+ 0.000, 0.333, 0.500,
42
+ 0.000, 0.667, 0.500,
43
+ 0.000, 1.000, 0.500,
44
+ 0.333, 0.000, 0.500,
45
+ 0.333, 0.333, 0.500,
46
+ 0.333, 0.667, 0.500,
47
+ 0.333, 1.000, 0.500,
48
+ 0.667, 0.000, 0.500,
49
+ 0.667, 0.333, 0.500,
50
+ 0.667, 0.667, 0.500,
51
+ 0.667, 1.000, 0.500,
52
+ 1.000, 0.000, 0.500,
53
+ 1.000, 0.333, 0.500,
54
+ 1.000, 0.667, 0.500,
55
+ 1.000, 1.000, 0.500,
56
+ 0.000, 0.333, 1.000,
57
+ 0.000, 0.667, 1.000,
58
+ 0.000, 1.000, 1.000,
59
+ 0.333, 0.000, 1.000,
60
+ 0.333, 0.333, 1.000,
61
+ 0.333, 0.667, 1.000,
62
+ 0.333, 1.000, 1.000,
63
+ 0.667, 0.000, 1.000,
64
+ 0.667, 0.333, 1.000,
65
+ 0.667, 0.667, 1.000,
66
+ 0.667, 1.000, 1.000,
67
+ 1.000, 0.000, 1.000,
68
+ 1.000, 0.333, 1.000,
69
+ 1.000, 0.667, 1.000,
70
+ 0.167, 0.000, 0.000,
71
+ 0.333, 0.000, 0.000,
72
+ 0.500, 0.000, 0.000,
73
+ 0.667, 0.000, 0.000,
74
+ 0.833, 0.000, 0.000,
75
+ 1.000, 0.000, 0.000,
76
+ 0.000, 0.167, 0.000,
77
+ 0.000, 0.333, 0.000,
78
+ 0.000, 0.500, 0.000,
79
+ 0.000, 0.667, 0.000,
80
+ 0.000, 0.833, 0.000,
81
+ 0.000, 1.000, 0.000,
82
+ 0.000, 0.000, 0.167,
83
+ 0.000, 0.000, 0.333,
84
+ 0.000, 0.000, 0.500,
85
+ 0.000, 0.000, 0.667,
86
+ 0.000, 0.000, 0.833,
87
+ 0.000, 0.000, 1.000,
88
+ 0.143, 0.143, 0.143,
89
+ 0.286, 0.286, 0.286,
90
+ 0.429, 0.429, 0.429,
91
+ 0.571, 0.571, 0.571,
92
+ 0.714, 0.714, 0.714,
93
+ 0.857, 0.857, 0.857
94
+ ]
95
+ ).astype(np.float32)
96
+ color_list = color_list.reshape((-1, 3)) * 255
97
+ if not rgb:
98
+ color_list = color_list[:, ::-1]
99
+ return color_list
100
+
101
+
102
+ color_list = colormap()
103
+ color_list = color_list.astype('uint8').tolist()
104
+
105
+
106
+ def vis_add_mask(image, mask, color, alpha):
107
+ color = np.array(color_list[color])
108
+ mask = mask > 0.5
109
+ image[mask] = image[mask] * (1-alpha) + color * alpha
110
+ return image.astype('uint8')
111
+
112
+ def point_painter(input_image, input_points, point_color=5, point_alpha=0.9, point_radius=15, contour_color=2, contour_width=5):
113
+ h, w = input_image.shape[:2]
114
+ point_mask = np.zeros((h, w)).astype('uint8')
115
+ for point in input_points:
116
+ point_mask[point[1], point[0]] = 1
117
+
118
+ kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (point_radius, point_radius))
119
+ point_mask = cv2.dilate(point_mask, kernel)
120
+
121
+ contour_radius = (contour_width - 1) // 2
122
+ dist_transform_fore = cv2.distanceTransform(point_mask, cv2.DIST_L2, 3)
123
+ dist_transform_back = cv2.distanceTransform(1-point_mask, cv2.DIST_L2, 3)
124
+ dist_map = dist_transform_fore - dist_transform_back
125
+ # ...:::!!!:::...
126
+ contour_radius += 2
127
+ contour_mask = np.abs(np.clip(dist_map, -contour_radius, contour_radius))
128
+ contour_mask = contour_mask / np.max(contour_mask)
129
+ contour_mask[contour_mask>0.5] = 1.
130
+
131
+ # paint mask
132
+ painted_image = vis_add_mask(input_image.copy(), point_mask, point_color, point_alpha)
133
+ # paint contour
134
+ painted_image = vis_add_mask(painted_image.copy(), 1-contour_mask, contour_color, 1)
135
+ return painted_image
136
+
137
+ def mask_painter(input_image, input_mask, mask_color=5, mask_alpha=0.7, contour_color=1, contour_width=3):
138
+ assert input_image.shape[:2] == input_mask.shape, 'different shape between image and mask'
139
+ # 0: background, 1: foreground
140
+ mask = np.clip(input_mask, 0, 1)
141
+ contour_radius = (contour_width - 1) // 2
142
+
143
+ dist_transform_fore = cv2.distanceTransform(mask, cv2.DIST_L2, 3)
144
+ dist_transform_back = cv2.distanceTransform(1-mask, cv2.DIST_L2, 3)
145
+ dist_map = dist_transform_fore - dist_transform_back
146
+ # ...:::!!!:::...
147
+ contour_radius += 2
148
+ contour_mask = np.abs(np.clip(dist_map, -contour_radius, contour_radius))
149
+ contour_mask = contour_mask / np.max(contour_mask)
150
+ contour_mask[contour_mask>0.5] = 1.
151
+
152
+ # paint mask
153
+ painted_image = vis_add_mask(input_image.copy(), mask.copy(), mask_color, mask_alpha)
154
+ # paint contour
155
+ painted_image = vis_add_mask(painted_image.copy(), 1-contour_mask, contour_color, 1)
156
+
157
+ return painted_image
158
+
159
+ def background_remover(input_image, input_mask):
160
+ """
161
+ input_image: H, W, 3, np.array
162
+ input_mask: H, W, np.array
163
+
164
+ image_wo_background: PIL.Image
165
+ """
166
+ assert input_image.shape[:2] == input_mask.shape, 'different shape between image and mask'
167
+ # 0: background, 1: foreground
168
+ mask = np.expand_dims(np.clip(input_mask, 0, 1), axis=2)*255
169
+ image_wo_background = np.concatenate([input_image, mask], axis=2) # H, W, 4
170
+ image_wo_background = Image.fromarray(image_wo_background).convert('RGBA')
171
+
172
+ return image_wo_background
173
+
174
+ if __name__ == '__main__':
175
+ input_image = np.array(Image.open('images/painter_input_image.jpg').convert('RGB'))
176
+ input_mask = np.array(Image.open('images/painter_input_mask.jpg').convert('P'))
177
+
178
+ # example of mask painter
179
+ mask_color = 3
180
+ mask_alpha = 0.7
181
+ contour_color = 1
182
+ contour_width = 5
183
+
184
+ # save
185
+ painted_image = Image.fromarray(input_image)
186
+ painted_image.save('images/original.png')
187
+
188
+ painted_image = mask_painter(input_image, input_mask, mask_color, mask_alpha, contour_color, contour_width)
189
+ # save
190
+ painted_image = Image.fromarray(input_image)
191
+ painted_image.save('images/original1.png')
192
+
193
+ # example of point painter
194
+ input_image = np.array(Image.open('images/painter_input_image.jpg').convert('RGB'))
195
+ input_points = np.array([[500, 375], [70, 600]]) # x, y
196
+ point_color = 5
197
+ point_alpha = 0.9
198
+ point_radius = 15
199
+ contour_color = 2
200
+ contour_width = 5
201
+ painted_image_1 = point_painter(input_image, input_points, point_color, point_alpha, point_radius, contour_color, contour_width)
202
+ # save
203
+ painted_image = Image.fromarray(painted_image_1)
204
+ painted_image.save('images/point_painter_1.png')
205
+
206
+ input_image = np.array(Image.open('images/painter_input_image.jpg').convert('RGB'))
207
+ painted_image_2 = point_painter(input_image, input_points, point_color=9, point_radius=20, contour_color=29)
208
+ # save
209
+ painted_image = Image.fromarray(painted_image_2)
210
+ painted_image.save('images/point_painter_2.png')
211
+
212
+ # example of background remover
213
+ input_image = np.array(Image.open('images/original.png').convert('RGB'))
214
+ image_wo_background = background_remover(input_image, input_mask) # return PIL.Image
215
+ image_wo_background.save('images/image_wo_background.png')
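The two painters compose naturally: paint the mask first, then overlay the prompt points on the result, which is how the interactive tools display clicks. A short sketch on synthetic data (sizes and coordinates are illustrative only, and the imports assume the repository root is on sys.path):

import numpy as np
from PIL import Image
from tools.painter import mask_painter, point_painter, background_remover

h, w = 240, 320
image = np.full((h, w, 3), 180, dtype=np.uint8)
mask = np.zeros((h, w), dtype=np.uint8)
mask[60:180, 100:220] = 1                       # rectangular "object"
points = np.array([[160, 120], [40, 40]])       # (x, y) order, as in the example above

painted = mask_painter(image, mask, mask_color=3, mask_alpha=0.7,
                       contour_color=1, contour_width=3)
painted = point_painter(painted, points, point_color=5, point_alpha=0.9,
                        point_radius=15, contour_color=2, contour_width=5)
Image.fromarray(painted).save('painted_points.png')

# RGBA cut-out of the masked object
background_remover(image, mask).save('object_rgba.png')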
track_anything.py ADDED
@@ -0,0 +1,93 @@
1
+ import sys
2
+ sys.path.append("/hhd3/gaoshang/Track-Anything/tracker")
3
+ import PIL
4
+ from tools.interact_tools import SamControler
5
+ from tracker.base_tracker import BaseTracker
6
+ import numpy as np
7
+ import argparse
8
+
9
+
10
+
11
+ class TrackingAnything():
12
+ def __init__(self, sam_checkpoint, xmem_checkpoint, args):
13
+ self.args = args
14
+ self.samcontroler = SamControler(sam_checkpoint, args.sam_model_type, args.device)
15
+ self.xmem = BaseTracker(xmem_checkpoint, device=args.device)
16
+
17
+
18
+ def inference_step(self, first_flag: bool, interact_flag: bool, image: np.ndarray,
19
+ same_image_flag: bool, points:np.ndarray, labels: np.ndarray, logits: np.ndarray=None, multimask=True):
20
+ if first_flag:
21
+ mask, logit, painted_image = self.samcontroler.first_frame_click(image, points, labels, multimask)
22
+ return mask, logit, painted_image
23
+
24
+ if interact_flag:
25
+ mask, logit, painted_image = self.samcontroler.interact_loop(image, same_image_flag, points, labels, logits, multimask)
26
+ return mask, logit, painted_image
27
+
28
+ mask, logit, painted_image = self.xmem.track(image, logits)
29
+ return mask, logit, painted_image
30
+
31
+ def first_frame_click(self, image: np.ndarray, points:np.ndarray, labels: np.ndarray, multimask=True):
32
+ mask, logit, painted_image = self.samcontroler.first_frame_click(image, points, labels, multimask)
33
+ return mask, logit, painted_image
34
+
35
+ def interact(self, image: np.ndarray, same_image_flag: bool, points:np.ndarray, labels: np.ndarray, logits: np.ndarray=None, multimask=True):
36
+ mask, logit, painted_image = self.samcontroler.interact_loop(image, same_image_flag, points, labels, logits, multimask)
37
+ return mask, logit, painted_image
38
+
39
+ def generator(self, images: list, template_mask:np.ndarray):
40
+
41
+ masks = []
42
+ logits = []
43
+ painted_images = []
44
+ for i in range(len(images)):
45
+ if i ==0:
46
+ mask, logit, painted_image = self.xmem.track(images[i], template_mask)
47
+ masks.append(mask)
48
+ logits.append(logit)
49
+ painted_images.append(painted_image)
50
+
51
+ else:
52
+ mask, logit, painted_image = self.xmem.track(images[i])
53
+ masks.append(mask)
54
+ logits.append(logit)
55
+ painted_images.append(painted_image)
56
+ return masks, logits, painted_images
57
+
58
+
59
+ def parse_augment():
60
+ parser = argparse.ArgumentParser()
61
+ parser.add_argument('--device', type=str, default="cuda:0")
62
+ parser.add_argument('--sam_model_type', type=str, default="vit_h")
63
+ parser.add_argument('--port', type=int, default=6080, help="only useful when running gradio applications")
64
+ parser.add_argument('--debug', action="store_true")
65
+ parser.add_argument('--mask_save', default=True)
66
+ args = parser.parse_args()
67
+
68
+ if args.debug:
69
+ print(args)
70
+ return args
71
+
72
+
73
+ if __name__ == "__main__":
74
+ masks = None
75
+ logits = None
76
+ painted_images = None
77
+ images = []
78
+ image = np.array(PIL.Image.open('/hhd3/gaoshang/truck.jpg'))
79
+ args = parse_augment()
80
+ # images.append(np.ones((20,20,3)).astype('uint8'))
81
+ # images.append(np.ones((20,20,3)).astype('uint8'))
82
+ images.append(image)
83
+ images.append(image)
84
+
85
+ mask = np.zeros_like(image)[:,:,0]
86
+ mask[0,0]= 1
87
+ trackany = TrackingAnything('/ssd1/gaomingqi/checkpoints/sam_vit_h_4b8939.pth','/ssd1/gaomingqi/checkpoints/XMem-s012.pth', args)
88
+ masks, logits, painted_images = trackany.generator(images, mask)
89
+
90
+
91
+
92
+
93
+
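A sketch of the intended end-to-end flow of TrackingAnything: click once on the first frame to obtain a mask, then let generator propagate it through the rest of the clip. The checkpoint and video paths are placeholders, and the snippet assumes the repository root is on sys.path and a CUDA device is available:

import cv2
import numpy as np
from track_anything import TrackingAnything, parse_augment

args = parse_augment()
model = TrackingAnything('./sam_vit_h_4b8939.pth', './XMem-s012.pth', args)  # placeholder checkpoints

# read a short clip into a list of RGB frames
cap = cv2.VideoCapture('./assets/demo_version_1.MP4')                        # placeholder video
frames = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
cap.release()

# one positive click on the first frame
points = np.array([[500, 375]])
labels = np.array([1])
mask, logit, painted = model.first_frame_click(frames[0], points, labels, multimask=True)

# propagate the first-frame mask through the whole clip
masks, logits, painted_images = model.generator(frames, mask.astype(np.uint8))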
tracker/.DS_Store ADDED
Binary file (6.15 kB).
tracker/base_tracker.py ADDED
@@ -0,0 +1,233 @@
1
+ # import for debugging
2
+ import os
3
+ import glob
4
+ import numpy as np
5
+ from PIL import Image
6
+ # import for base_tracker
7
+ import torch
8
+ import yaml
9
+ import torch.nn.functional as F
10
+ from model.network import XMem
11
+ from inference.inference_core import InferenceCore
12
+ from util.mask_mapper import MaskMapper
13
+ from torchvision import transforms
14
+ from util.range_transform import im_normalization
15
+ import sys
16
+ sys.path.insert(0, sys.path[0]+"/../")
17
+ from tools.painter import mask_painter
18
+ from tools.base_segmenter import BaseSegmenter
19
+ from torchvision.transforms import Resize
20
+
21
+
22
+ class BaseTracker:
23
+ def __init__(self, xmem_checkpoint, device, sam_model=None, model_type=None) -> None:
24
+ """
25
+ device: model device
26
+ xmem_checkpoint: checkpoint of XMem model
27
+ """
28
+ # load configurations
29
+ with open("tracker/config/config.yaml", 'r') as stream:
30
+ config = yaml.safe_load(stream)
31
+ # initialise XMem
32
+ network = XMem(config, xmem_checkpoint).to(device).eval()
33
+ # initialise InferenceCore
34
+ self.tracker = InferenceCore(network, config)
35
+ # data transformation
36
+ self.im_transform = transforms.Compose([
37
+ transforms.ToTensor(),
38
+ im_normalization,
39
+ ])
40
+ self.device = device
41
+
42
+ # changable properties
43
+ self.mapper = MaskMapper()
44
+ self.initialised = False
45
+
46
+ # # SAM-based refinement
47
+ # self.sam_model = sam_model
48
+ # self.resizer = Resize([256, 256])
49
+
50
+ @torch.no_grad()
51
+ def resize_mask(self, mask):
52
+ # mask transform is applied AFTER mapper, so we need to post-process it in eval.py
53
+ h, w = mask.shape[-2:]
54
+ min_hw = min(h, w)
55
+ return F.interpolate(mask, (int(h/min_hw*self.size), int(w/min_hw*self.size)),
56
+ mode='nearest')
57
+
58
+ @torch.no_grad()
59
+ def track(self, frame, first_frame_annotation=None):
60
+ """
61
+ Input:
62
+ frame: numpy array (H, W, 3)
64
+ first_frame_annotation: numpy array (H, W), label mask for the first frame (optional)
64
+
65
+ Output:
66
+ mask: numpy arrays (H, W)
67
+ logit: numpy arrays, probability map (H, W)
68
+ painted_image: numpy array (H, W, 3)
69
+ """
70
+ if first_frame_annotation is not None: # first frame mask
71
+ # initialisation
72
+ mask, labels = self.mapper.convert_mask(first_frame_annotation)
73
+ mask = torch.Tensor(mask).to(self.device)
74
+ self.tracker.set_all_labels(list(self.mapper.remappings.values()))
75
+ else:
76
+ mask = None
77
+ labels = None
78
+ # prepare inputs
79
+ frame_tensor = self.im_transform(frame).to(self.device)
80
+ # track one frame
81
+ probs, _ = self.tracker.step(frame_tensor, mask, labels) # logits 2 (bg fg) H W
82
+ # # refine
83
+ # if first_frame_annotation is None:
84
+ # out_mask = self.sam_refinement(frame, logits[1], ti)
85
+
86
+ # convert to mask
87
+ out_mask = torch.argmax(probs, dim=0)
88
+ out_mask = (out_mask.detach().cpu().numpy()).astype(np.uint8)
89
+
90
+ num_objs = out_mask.max()
91
+ painted_image = frame
92
+ for obj in range(1, num_objs+1):
93
+ painted_image = mask_painter(painted_image, (out_mask==obj).astype('uint8'), mask_color=obj+1)
94
+
95
+ return out_mask, out_mask, painted_image
96
+
97
+ @torch.no_grad()
98
+ def sam_refinement(self, frame, logits, ti):
99
+ """
100
+ refine segmentation results with mask prompt
101
+ """
102
+ # convert to 1, 256, 256
103
+ self.sam_model.set_image(frame)
104
+ mode = 'mask'
105
+ logits = logits.unsqueeze(0)
106
+ logits = self.resizer(logits).cpu().numpy()
107
+ prompts = {'mask_input': logits} # 1 256 256
108
+ masks, scores, logits = self.sam_model.predict(prompts, mode, multimask=True) # masks (n, h, w), scores (n,), logits (n, 256, 256)
109
+ painted_image = mask_painter(frame, masks[np.argmax(scores)].astype('uint8'), mask_alpha=0.8)
110
+ painted_image = Image.fromarray(painted_image)
111
+ painted_image.save(f'/ssd1/gaomingqi/refine/{ti:05d}.png')
112
+ self.sam_model.reset_image()
113
+
114
+ @torch.no_grad()
115
+ def clear_memory(self):
116
+ self.tracker.clear_memory()
117
+ self.mapper.clear_labels()
118
+
119
+
120
+ if __name__ == '__main__':
121
+ # video frames (multiple objects)
122
+ video_path_list = glob.glob(os.path.join('/ssd1/gaomingqi/datasets/davis/JPEGImages/480p/horsejump-high', '*.jpg'))
123
+ video_path_list.sort()
124
+ # first frame
125
+ first_frame_path = '/ssd1/gaomingqi/datasets/davis/Annotations/480p/horsejump-high/00000.png'
126
+ # load frames
127
+ frames = []
128
+ for video_path in video_path_list:
129
+ frames.append(np.array(Image.open(video_path).convert('RGB')))
130
+ frames = np.stack(frames, 0) # N, H, W, C
131
+ # load first frame annotation
132
+ first_frame_annotation = np.array(Image.open(first_frame_path).convert('P')) # H, W, C
133
+
134
+ # ----------------------------------------------------------
135
+ # initalise tracker
136
+ # ----------------------------------------------------------
137
+ device = 'cuda:4'
138
+ XMEM_checkpoint = '/ssd1/gaomingqi/checkpoints/XMem-s012.pth'
139
+ SAM_checkpoint= '/ssd1/gaomingqi/checkpoints/sam_vit_h_4b8939.pth'
140
+ model_type = 'vit_h'
141
+
142
+ # sam_model = BaseSegmenter(SAM_checkpoint, model_type, device=device)
143
+ tracker = BaseTracker(XMEM_checkpoint, device, None, model_type)
144
+
145
+ # test for storage efficiency
146
+ frames = np.load('/ssd1/gaomingqi/efficiency/efficiency.npy')
147
+ first_frame_annotation = np.array(Image.open('/ssd1/gaomingqi/efficiency/template_mask.png'))
148
+
149
+ for ti, frame in enumerate(frames):
150
+ print(ti)
151
+ if ti > 200:
152
+ break
153
+ if ti == 0:
154
+ mask, prob, painted_image = tracker.track(frame, first_frame_annotation)
155
+ else:
156
+ mask, prob, painted_image = tracker.track(frame)
157
+ # save
158
+ painted_image = Image.fromarray(painted_image)
159
+ painted_image.save(f'/ssd1/gaomingqi/results/TrackA/gsw/{ti:05d}.png')
160
+
161
+ tracker.clear_memory()
162
+ for ti, frame in enumerate(frames):
163
+ print(ti)
164
+ # if ti > 200:
165
+ # break
166
+ if ti == 0:
167
+ mask, prob, painted_image = tracker.track(frame, first_frame_annotation)
168
+ else:
169
+ mask, prob, painted_image = tracker.track(frame)
170
+ # save
171
+ painted_image = Image.fromarray(painted_image)
172
+ painted_image.save(f'/ssd1/gaomingqi/results/TrackA/gsw/{ti:05d}.png')
173
+
174
+ # # track anything given in the first frame annotation
175
+ # for ti, frame in enumerate(frames):
176
+ # if ti == 0:
177
+ # mask, prob, painted_image = tracker.track(frame, first_frame_annotation)
178
+ # else:
179
+ # mask, prob, painted_image = tracker.track(frame)
180
+ # # save
181
+ # painted_image = Image.fromarray(painted_image)
182
+ # painted_image.save(f'/ssd1/gaomingqi/results/TrackA/horsejump-high/{ti:05d}.png')
183
+
184
+ # # ----------------------------------------------------------
185
+ # # another video
186
+ # # ----------------------------------------------------------
187
+ # # video frames
188
+ # video_path_list = glob.glob(os.path.join('/ssd1/gaomingqi/datasets/davis/JPEGImages/480p/camel', '*.jpg'))
189
+ # video_path_list.sort()
190
+ # # first frame
191
+ # first_frame_path = '/ssd1/gaomingqi/datasets/davis/Annotations/480p/camel/00000.png'
192
+ # # load frames
193
+ # frames = []
194
+ # for video_path in video_path_list:
195
+ # frames.append(np.array(Image.open(video_path).convert('RGB')))
196
+ # frames = np.stack(frames, 0) # N, H, W, C
197
+ # # load first frame annotation
198
+ # first_frame_annotation = np.array(Image.open(first_frame_path).convert('P')) # H, W, C
199
+
200
+ # print('first video done. clear.')
201
+
202
+ # tracker.clear_memory()
203
+ # # track anything given in the first frame annotation
204
+ # for ti, frame in enumerate(frames):
205
+ # if ti == 0:
206
+ # mask, prob, painted_image = tracker.track(frame, first_frame_annotation)
207
+ # else:
208
+ # mask, prob, painted_image = tracker.track(frame)
209
+ # # save
210
+ # painted_image = Image.fromarray(painted_image)
211
+ # painted_image.save(f'/ssd1/gaomingqi/results/TrackA/camel/{ti:05d}.png')
212
+
213
+ # # failure case test
214
+ # failure_path = '/ssd1/gaomingqi/failure'
215
+ # frames = np.load(os.path.join(failure_path, 'video_frames.npy'))
216
+ # # first_frame = np.array(Image.open(os.path.join(failure_path, 'template_frame.png')).convert('RGB'))
217
+ # first_mask = np.array(Image.open(os.path.join(failure_path, 'template_mask.png')).convert('P'))
218
+ # first_mask = np.clip(first_mask, 0, 1)
219
+
220
+ # for ti, frame in enumerate(frames):
221
+ # if ti == 0:
222
+ # mask, probs, painted_image = tracker.track(frame, first_mask)
223
+ # else:
224
+ # mask, probs, painted_image = tracker.track(frame)
225
+ # # save
226
+ # painted_image = Image.fromarray(painted_image)
227
+ # painted_image.save(f'/ssd1/gaomingqi/failure/LJ/{ti:05d}.png')
228
+ # prob = Image.fromarray((probs[1].cpu().numpy()*255).astype('uint8'))
229
+
230
+ # # prob.save(f'/ssd1/gaomingqi/failure/probs/{ti:05d}.png')
231
+
232
+
233
+
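The tracking protocol in base_tracker.py is simple: pass the label mask only with the first frame, call track(frame) alone for every later frame, and call clear_memory() before starting on a new video. A compact sketch with placeholder paths (it assumes the repository root is the working directory; the tracker directory is added to sys.path explicitly, as track_anything.py does):

import sys
sys.path.append('./tracker')        # XMem modules are imported relative to this directory
import glob
import numpy as np
from PIL import Image
from tracker.base_tracker import BaseTracker

tracker = BaseTracker('./XMem-s012.pth', device='cuda:0')                   # placeholder checkpoint

frame_paths = sorted(glob.glob('./frames/*.jpg'))                           # placeholder frame folder
first_mask = np.array(Image.open('./frames/00000_mask.png').convert('P'))   # placeholder annotation

for ti, path in enumerate(frame_paths):
    frame = np.array(Image.open(path).convert('RGB'))
    if ti == 0:
        mask, prob, painted = tracker.track(frame, first_mask)              # initialise with the mask
    else:
        mask, prob, painted = tracker.track(frame)                          # propagate
    Image.fromarray(painted).save(f'./out/{ti:05d}.png')

tracker.clear_memory()                                                      # reset before the next clip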