---
title: Unsupervised Generative Video Dubbing
emoji: 🎥
colorFrom: blue
colorTo: blue
sdk: gradio
sdk_version: 5.9.1
app_file: app.py
pinned: false
license: gpl-3.0
short_description: enjoy
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# Unsupervised Generative Video Dubbing

Authors: Jimin Tan, Chenqin Yang, Yakun Wang, Yash Deshpande

Project Website: [https://tanjimin.github.io/unsupervised-video-dubbing/](https://tanjimin.github.io/unsupervised-video-dubbing/)


Training code for the dubbing model is under the root directory. We used a pre-processed version of the LRW dataset for training; see `data.py` for details.


We created a simple deployment pipeline, which can be found under the `post_processing` subdirectory. The pipeline takes the model weights we pre-trained on LRW, a base video, and an audio segment of equal duration, and outputs a dubbed video driven by the audio. See the instructions below for more details.

## Requirements

- LibROSA 0.7.2
- dlib 19.19
- OpenCV 4.2.0
- Pillow 6.2.2
- PyTorch 1.2.0
- TorchVision 0.4.0
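
If you are setting up a fresh environment, an install along these lines should work; the PyPI package names below (e.g. `opencv-python`) are assumptions, and exact version pins may need adjusting for your platform:

```shell
# Assumed PyPI package names; adjust versions/wheels to your platform.
pip install librosa==0.7.2 dlib==19.19.0 opencv-python==4.2.0.34 \
    Pillow==6.2.2 torch==1.2.0 torchvision==0.4.0
```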

## Post-Processing Folder

```
.
├── source
│   ├── audio_driver_mp4    # contains audio drivers (saved in mp4 format)
│   ├── audio_driver_wav    # contains audio drivers (saved in wav format)
│   ├── base_video          # contains base videos (videos you'd like to modify)
│   ├── dlib                # trained dlib models
│   └── model               # trained landmark generation models
├── main.py                 # main function for post-processing
├── main_support.py         # support functions used in main.py
├── models.py               # defines the landmark generation model
├── step_3_vid2vid.sh       # Bash script for running vid2vid
├── step_4_denoise.sh       # Bash script for denoising vid2vid results
├── compare_openness.ipynb  # mouth openness comparison across generated videos
└── README.md
```

> - shape_predictor_68_face_landmarks.dat
>
> This is trained on the ibug 300-W dataset (https://ibug.doc.ic.ac.uk/resources/facial-point-annotations/)
>
> The license for this dataset excludes commercial use and Stefanos Zafeiriou, one of the creators of the dataset, asked me to include a note here saying that the trained model therefore can't be used in a commercial product. So you should contact a lawyer or talk to Imperial College London to find out if it's OK for you to use this model in a commercial product.
>
> {C. Sagonas, E. Antonakos, G, Tzimiropoulos, S. Zafeiriou, M. Pantic. 300 faces In-the-wild challenge: Database and results. Image and Vision Computing (IMAVIS), Special Issue on Facial Landmark Localisation "In-The-Wild". 2016.}

## Detailed steps for model deployment


- **Go to** the `post_processing` directory
- Run `python3 main.py -r <step>`, where `<step>` corresponds to one of the steps below
  - e.g., `python3 main.py -r 1` runs the first step, and so on; a full run is sketched below
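
A minimal sketch of a full run, assuming each `-r` value maps to the step of the same number described below:

```shell
cd post_processing
python3 main.py -r 1   # Step 1: generate landmarks
python3 main.py -r 2   # Step 2: render and inspect generated frames
python3 main.py -r 3   # Step 3: run vid2vid
python3 main.py -r 4   # Step 4: denoise and smooth the vid2vid results
python3 main.py -r 5   # Step 5: generate the final video with sound
```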

#### Step 1: Generate landmarks

- Input
  - Base video file path (`./source/base_video/base_video.mp4`)
  - Audio driver file path (`./source/audio_driver_wav/audio_driver.wav`)
  - Epoch (`int`)
- Output (`./result`)
  - keypoints.npy: generated landmarks in `npy` format
  - source.txt: information about the base video, audio driver, and model epoch
- Process
  - Extract facial landmarks from the base video
  - Extract MFCC features from the driver audio
  - Pass the MFCC features and facial landmarks into the model to generate mouth landmarks
  - Combine the facial and mouth landmarks and save them in `npy` format (a rough sketch of the extraction step follows this list)
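
A rough sketch of the extraction in this step; the function names and paths below are illustrative, not the actual `main.py` API, and the MFCC parameters must match whatever the model was trained with:

```python
import cv2
import dlib
import librosa
import numpy as np

# Illustrative paths; the real pipeline reads these from ./source/.
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("source/dlib/shape_predictor_68_face_landmarks.dat")

def video_landmarks(video_path):
    """Extract 68-point facial landmarks for every frame of the base video."""
    cap = cv2.VideoCapture(video_path)
    all_points = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = detector(gray, 1)
        if faces:
            shape = predictor(gray, faces[0])
            all_points.append([(p.x, p.y) for p in shape.parts()])
    cap.release()
    return np.array(all_points)                          # (n_frames, 68, 2)

def audio_mfcc(wav_path, sr=16000, n_mfcc=13):
    """MFCC features from the driver audio (sr/window/hop values are placeholders)."""
    y, _ = librosa.load(wav_path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, n_steps)
```

The generated mouth landmarks are then merged with the detected facial landmarks and written to `keypoints.npy`.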

#### Step 2: Test generated frames

- Input
  - None
- Output (`./result`)
  - `save_keypoints` folder: visualized generated frames
  - `save_keypoints_csv` folder: landmark coordinates for each frame, saved in `txt` format
  - openness.png: mouth openness measured and plotted across all frames
- Process
  - Generate images from the `npy` file
  - Generate the openness plot (a sketch of the openness computation follows this list)
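
The openness plot is essentially a per-frame distance between the inner lips. A minimal sketch, assuming `keypoints.npy` has shape `(n_frames, 68, 2)` and using the standard 68-point layout (62 = inner upper lip, 66 = inner lower lip); the exact metric used in the pipeline may differ:

```python
import numpy as np
import matplotlib.pyplot as plt

keypoints = np.load("result/keypoints.npy")  # assumed shape: (n_frames, 68, 2)

# Gap between inner upper lip (62) and inner lower lip (66) for each frame.
openness = np.linalg.norm(keypoints[:, 62, :] - keypoints[:, 66, :], axis=1)

plt.plot(openness)
plt.xlabel("frame")
plt.ylabel("mouth openness (pixels)")
plt.savefig("result/openness.png")
```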

#### Step 3: Execute vid2vid

- Input
  - None
- Output
  - The path to the fake images generated by vid2vid is shown at the end; copy them back to `./result/vid2vid_frames/`
    - Folder: vid2vid generated images
- Process
  - Run vid2vid
  - Copy the vid2vid results back to the main folder (see the sketch below)
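
For example, if vid2vid prints its output directory at the end of the run, the copy-back is just the following (the source path is a placeholder):

```shell
# Placeholder path: use the directory printed at the end of the vid2vid run.
cp -r /path/to/vid2vid/results/fake_frames/* ./result/vid2vid_frames/
```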

#### Step 4: Denoise and smooth vid2vid results

- Input
  - vid2vid generated images folder path
  - Original base images folder path
- Output
  - Folder: Modified images (base image + vid2vid mouth regions)
  - Folder: Denoised and smoothed frames
- Process
  - Crop the mouth area from each vid2vid-generated image and paste it back onto the corresponding base image (the "modified image")
  - Generate circularly smoothed images using a gradient mask (a sketch of this blending follows the list)
  - Take each `(modified image, circularly smoothed image)` pair and run denoising on it
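
A rough illustration of the gradient-mask blending idea; the actual crop box and mask parameters live in the pipeline (`frame_crop` and related functions), so the mouth center and radius below are made up:

```python
import cv2
import numpy as np

def blend_mouth(base_img, vid2vid_img, center, radius):
    """Paste the vid2vid mouth region onto the base frame with a radial alpha mask."""
    h, w = base_img.shape[:2]
    yy, xx = np.mgrid[0:h, 0:w]
    dist = np.sqrt((xx - center[0]) ** 2 + (yy - center[1]) ** 2)
    # Alpha fades from 1 at the mouth center to 0 at the circle edge.
    alpha = np.clip(1.0 - dist / radius, 0.0, 1.0)[..., None]
    return (alpha * vid2vid_img + (1.0 - alpha) * base_img).astype(np.uint8)

base = cv2.imread("base_frame.png")
fake = cv2.imread("vid2vid_frame.png")                       # assumed aligned with the base frame
out = blend_mouth(base, fake, center=(128, 180), radius=40)  # made-up mouth center/radius
cv2.imwrite("blended_frame.png", out)
```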

#### Step 5: Generate modified videos with sound

- Input
  - Saved frames folder path
    - By default, it is saved in `./result/save_keypoints`; you can enter `d` to use the default path
    - Otherwise, input the frames folder path
  - Audio driver file path (`./source/audio_driver_wav/audio_driver.wav`)
- Output (`./result/save_keypoints/result/`)
  - video_without_sound.mp4: the modified video without sound
  - audio_only.mp4: the audio driver
  - final_output.mp4: the modified video with sound
- Process
  - Generate the modified video without sound at the defined fps
  - Extract the `wav` track from the audio driver
  - Combine the audio and video to generate the final output (an ffmpeg sketch of this step follows the list)
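
Conceptually, the assembly in this step corresponds to something like the following ffmpeg commands; the fps, frame filename pattern, and output locations are placeholders, and `main.py` performs the equivalent internally:

```shell
# Frames -> silent video at the base video's fps (25 and %05d.png are placeholders).
ffmpeg -framerate 25 -i ./result/save_keypoints/%05d.png -c:v libx264 -pix_fmt yuv420p \
    ./result/save_keypoints/result/video_without_sound.mp4

# Mux the driver audio onto the silent video.
ffmpeg -i ./result/save_keypoints/result/video_without_sound.mp4 \
    -i ./source/audio_driver_wav/audio_driver.wav \
    -c:v copy -c:a aac -shortest ./result/save_keypoints/result/final_output.mp4
```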

## Important Notice

- You may need to modify how MFCC features are extracted in the `extract_mfcc` function
  - Be careful about the sample rate, `window_length`, and `hop_length`
  - Good resource: https://www.mathworks.com/help/audio/ref/mfcc.html
- You may need to modify the region of interest (mouth area) in the `frame_crop` function
- You may need to modify the frame rate defined in step 3 of `main.py`, which should match your base video's fps

```python
# How to check your base video fps
# source: https://www.learnopencv.com/how-to-find-frame-rate-or-frames-per-second-fps-in-opencv-python-cpp/

import cv2
video = cv2.VideoCapture("video.mp4")

# Find OpenCV version
(major_ver, minor_ver, subminor_ver) = cv2.__version__.split('.')
if int(major_ver) < 3:
    fps = video.get(cv2.cv.CV_CAP_PROP_FPS)
    print("Frames per second using video.get(cv2.cv.CV_CAP_PROP_FPS): {0}".format(fps))
else:
    fps = video.get(cv2.CAP_PROP_FPS)
    print("Frames per second using video.get(cv2.CAP_PROP_FPS): {0}".format(fps))
video.release()
```

- You may need to modify the shell path; you can check yours with:

```shell
echo $SHELL
```

- You may need to modify the audio sampling rate in the `extract_audio` function
- You may need to customize your parameters in the `combine_audio_video` function
  - Good resource: https://ffmpeg.org/ffmpeg.html
  - https://gist.github.com/tayvano/6e2d456a9897f55025e25035478a3a50



## Update History

- March 22, 2020: Drafted documentation