# MVInpainter
[NeurIPS 2024] MVInpainter: Learning Multi-View Consistent Inpainting to Bridge 2D and 3D Editing

[[arXiv]](https://arxiv.org/pdf/2408.08000) [[Project Page]](https://ewrfcas.github.io/MVInpainter/)

## Preparation

### Setup repository and environment
```
git clone https://github.com/ewrfcas/MVInpainter.git
cd MVInpainter

conda create -n mvinpainter python=3.8
conda activate mvinpainter

pip install -r requirements.txt
mim install mmcv-full
pip install mmflow

# Replace mmflow's RAFT decoder with our patched version for faster flow estimation.
# Adjust the path below to match your own conda installation.
cp ./check_points/mmflow/raft_decoder.py /usr/local/conda/envs/mvinpainter/lib/python3.8/site-packages/mmflow/models/decoders/
```
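
The conda prefix above may not match your installation. As an alternative, here is a small Python sketch (assuming `mmflow` is already importable) that locates the installed package and copies the patched decoder into place:

```python
# Locate the installed mmflow decoders directory and copy the patched
# raft_decoder.py into it; avoids hard-coding the conda prefix.
import os
import shutil

import mmflow

decoder_dir = os.path.join(os.path.dirname(mmflow.__file__), "models", "decoders")
shutil.copy("./check_points/mmflow/raft_decoder.py", decoder_dir)
print(f"patched decoder copied to {decoder_dir}")
```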

### Dataset preparation (training)
1. Download [Co3dv2](https://github.com/facebookresearch/co3d) and [MVImgNet](https://github.com/GAP-LAB-CUHK-SZ/MVImgNet) for MVInpainter-O.
Download [Real10k](https://google.github.io/realestate10k/download.html), [DL3DV](https://github.com/DL3DV-10K/Dataset), and [Scannet++](https://kaldir.vc.in.tum.de/scannetpp) for MVInpainter-F.
2. Download the index files, masking formats, and captions from [Link]() and put them in `./data`. Note that we removed some dirty samples from the aforementioned datasets. Since Co3dv2 provides object masks but MVImgNet does not, we additionally provide complete [foreground masks]() for MVImgNet, generated with `CarveKit`. Please put the MVImgNet masks in `./data/mvimagenet/masks`.
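
After downloading, a quick sanity check may save a failed training run later. The sketch below only assumes the `./data/mvimagenet/masks` path mentioned above (the `.png` extension is an assumption):

```python
# Sanity-check that the MVImgNet foreground masks are where training expects them.
from pathlib import Path

mask_root = Path("./data/mvimagenet/masks")
assert mask_root.is_dir(), f"missing {mask_root}; see the download links above"
n_masks = sum(1 for _ in mask_root.rglob("*.png"))
print(f"found {n_masks} mask files under {mask_root}")
```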

### Pretrained weights
1. [RAFT weights]() (put them in `./check_points/mmflow/`).
2. [SD1.5-inpainting]() (put it in `./check_points/`).
3. [AnimateDiff weights](). We rename the weight keys for easier `peft` usage (put them in `./check_points/`).

## Training

Training with a fixed `nframe=12`:
```
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch --mixed_precision="fp16" --num_processes=8 --num_machines 1 --main_process_port 29502 \
  --config_file configs/deepspeed/acc_zero2.yaml train.py \
  --config_file="configs/mvinpainter_{o,f}.yaml" \
  --output_dir="check_points/mvinpainter_{o,f}_256" \
  --train_log_interval=250 \
  --val_interval=2000 \
  --val_cfg=7.5 \
  --img_size=256
```

Fine-tuning with a dynamic number of frames (8~24):
```
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch --mixed_precision="fp16" --num_processes=8 --num_machines 1 --main_process_port 29502 \
  --config_file configs/deepspeed/acc_zero2.yaml train.py \
  --config_file="configs/mvinpainter_{o,f}.yaml" \
  --output_dir="check_points/mvinpainter_{o,f}_256" \
  --train_log_interval=250 \
  --val_interval=2000 \
  --val_cfg=7.5 \
  --img_size=256 \
  --resume_from_checkpoint="latest" \
  --dynamic_nframe \
  --low_nframe 8 \
  --high_nframe 24
```
Please use `mvinpainter_{o,f}_512.yaml` to train 512x512 models.

## Inference

### Model weights
1. [MVInpainter-O]() (novel view synthesis; put it in `./check_points/`).
2. [MVInpainter-F]() (object removal; put it in `./check_points/`).

### Pipeline

1. Remove or synthesize the foreground of the first view with 2D inpainting. We recommend [Fooocus-inpainting](https://github.com/lllyasviel/Fooocus) for this step. Obtain tracking masks with [Track-Anything](https://github.com/gaomingqi/Track-Anything).
Some examples are provided in `./demo`; each scene folder is organized as follows (a sanity-check sketch follows the tree):
```
- <folder>
    - images      # input images with foregrounds
    - inpainted   # inpainted result of the first view
    - masks       # masks for images
```
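
A minimal check of this layout before running the pipeline (the scene name is a placeholder):

```python
# Check that a demo scene folder contains the three expected subfolders.
from pathlib import Path

def check_scene(folder: str) -> None:
    scene = Path(folder)
    for sub in ("images", "inpainted", "masks"):
        subdir = scene / sub
        assert subdir.is_dir(), f"missing subfolder: {subdir}"
        print(f"{subdir}: {len(list(subdir.iterdir()))} files")

check_scene("./demo/removal/kitchen")  # placeholder scene name
```
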
2. (Optional) Remove foregrounds from all other views with `MVInpainter-F`:
```
CUDA_VISIBLE_DEVICES=0 python test_removal.py \
  --load_path="check_points/mvinpainter_f_256" \
  --dataset_root="./demo/removal" \
  --output_path="demo_removal" \
  --resume_from_checkpoint="best" \
  --val_cfg=5.0 \
  --img_size=256 \
  --sampling_interval=1.0 \
  --dataset_names realworld \
  --reference_path="inpainted" \
  --nframe=24 \
  --save_images # save each generated sample as a separate image
```
![removal](assets/kitchen_DSCF0676_removal_seq_0.jpg)

3. Obtain the 3D bbox of the object generated by 2D inpainting via `python draw_bbox.py`. Copy the resulting `000x.png` and `000x.json` from `./bbox` into the `obj_bbox` subfolder of the target scene, e.g. as in the sketch below.

![draw_bbox_demo](assets/draw_bbox.gif)
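
For example, copying the outputs for frame `0000` into a scene (the frame index and scene path are placeholders):

```python
# Copy the draw_bbox.py outputs (image + bbox json) into the scene's obj_bbox folder.
import shutil
from pathlib import Path

scene = Path("./demo/nvs/kitchen")  # placeholder scene path
(scene / "obj_bbox").mkdir(exist_ok=True)
for ext in (".png", ".json"):
    shutil.copy(Path("./bbox") / f"0000{ext}", scene / "obj_bbox")
```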

4. Run mask adaptation to obtain `warp_masks`. If the supporting plane on which the foreground is placed covers only a small fraction of the image, please use methods like [Grounded-Segment-Anything](https://github.com/IDEA-Research/Grounded-Segment-Anything) to obtain `plane_masks`.
```
CUDA_VISIBLE_DEVICES=0 python mask_adaption.py --input_path="demo/nvs/kitchen" --edited_index=0
``` 
You can also pass `--no_irregular_mask` to disable the irregular mask and obtain more precise warped masks; the overlay sketch below helps to inspect the result.

![warp_bbox](assets/0000_bbox.jpg)
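
To eyeball the adapted masks, a small sketch that overlays a warped mask in red on the corresponding removal frame (file names are placeholders; `numpy` and `Pillow` are assumed):

```python
# Overlay a warped mask in red on the matching removal frame for inspection.
import numpy as np
from PIL import Image

img = np.array(Image.open("demo/nvs/kitchen/removal/0001.png").convert("RGB"))
mask = np.array(Image.open("demo/nvs/kitchen/warp_masks/0001.png").convert("L")) > 127
overlay = img.copy()
overlay[mask] = (0.5 * overlay[mask] + np.array([127, 0, 0])).astype(np.uint8)
Image.fromarray(overlay).save("mask_overlay.png")
```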

Make sure the final folder looks like:
```
- <folder>
    - obj_bbox    # inpainted 2d images with new foreground and bbox json
    - removal     # images without foregrounds
    - warp_masks  # masks from adaption for the removal folder
    - plane_masks # (optional, only for mask_adaption) masks of the supporting plane on which the foreground is placed
```

5. Run `MVInpainter-O` for novel view synthesis:
```
CUDA_VISIBLE_DEVICES=0 python test_nvs.py \
  --load_path="check_points/mvinpainter_o_256" \
  --dataset_root="./demo/nvs" \
  --output_path="demo_nvs" \
  --edited_index=0 \
  --resume_from_checkpoint="best" \
  --val_cfg=7.5 \
  --img_height=256 \
  --img_width=256 \
  --sampling_interval=1.0 \
  --nframe=24 \
  --prompt="a red apple with circle and round shape on the table." \
  --limit_frame=24
```
![nvs_result](assets/kitchen_0000_seq_0.jpg)

6. 3D reconstruction: See [Dust3R](https://github.com/naver/dust3r), [MVSFormer++](https://github.com/maybeLx/MVSFormerPlusPlus), and [3DGS](https://github.com/graphdeco-inria/gaussian-splatting) for more details.
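
As a starting point, a minimal sketch following Dust3R's published usage (model name, functions, and parameters come from the Dust3R README, not from this repo; treat it as an untested reference):

```python
# Rough Dust3R reconstruction over the generated views (per Dust3R's README).
from dust3r.inference import inference
from dust3r.model import AsymmetricCroCo3DStereo
from dust3r.utils.image import load_images
from dust3r.image_pairs import make_pairs
from dust3r.cloud_opt import global_aligner, GlobalAlignerMode

device = "cuda"
model = AsymmetricCroCo3DStereo.from_pretrained(
    "naver/DUSt3R_ViTLarge_BaseDecoder_512_dpt").to(device)
images = load_images(["out/0000.png", "out/0001.png"], size=512)  # placeholder paths
pairs = make_pairs(images, scene_graph="complete", prefilter=None, symmetrize=True)
output = inference(pairs, model, device, batch_size=1)
scene = global_aligner(output, device=device, mode=GlobalAlignerMode.PointCloudOptimizer)
scene.compute_global_alignment(init="mst", niter=300, schedule="cosine", lr=0.01)
pts3d = scene.get_pts3d()  # per-view point maps, e.g. to initialize 3DGS
```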


## Cite
If you find our project helpful, please consider citing:

```
@article{cao2024mvinpainter,
  title={MVInpainter: Learning Multi-View Consistent Inpainting to Bridge 2D and 3D Editing},
  author={Cao, Chenjie and Yu, Chaohui and Fu, Yanwei and Wang, Fan and Xue, Xiangyang},
  journal={arXiv preprint arXiv:2408.08000},
  year={2024}
}
```