---
title: Unsupervised Generative Video Dubbing
emoji: 🎥
colorFrom: blue
colorTo: blue
sdk: gradio
sdk_version: 5.9.1
app_file: app.py
pinned: false
license: gpl-3.0
short_description: enjoy
---
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
# Unsupervised Generative Video Dubbing
Authors: Jimin Tan, Chenqin Yang, Yakun Wang, Yash Deshpande
Project Website: [https://tanjimin.github.io/unsupervised-video-dubbing/](https://tanjimin.github.io/unsupervised-video-dubbing/)
Training code for the dubbing model is under the root directory. We used a pre-processed version of the LRW dataset for training; see `data.py` for details.
We also created a simple deployment pipeline, which can be found under the `post_processing` subdirectory. The pipeline takes the model weights we pre-trained on LRW, along with a base video and an audio segment of equal duration, and outputs a dubbed video driven by the audio. See the instructions below for more details.
## Requirements
- LibROSA 0.7.2
- dlib 19.19
- OpenCV 4.2.0
- Pillow 6.2.2
- PyTorch 1.2.0
- TorchVision 0.4.0
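A quick, optional way to check that the versions listed above are installed (this snippet is only for convenience and is not part of the pipeline):

```python
# Print the installed versions to compare against the requirements list above.
import librosa, dlib, cv2, PIL, torch, torchvision

for name, mod in [("LibROSA", librosa), ("dlib", dlib), ("OpenCV", cv2),
                  ("Pillow", PIL), ("PyTorch", torch), ("TorchVision", torchvision)]:
    print(name, mod.__version__)
```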
## Post-Processing Folder
```
.
├── source
│   ├── audio_driver_mp4        # contains audio drivers (saved in mp4 format)
│   ├── audio_driver_wav        # contains audio drivers (saved in wav format)
│   ├── base_video              # contains base videos (videos you'd like to modify)
│   ├── dlib                    # trained dlib models
│   └── model                   # trained landmark generation models
├── main.py                     # main function for post-processing
├── main_support.py             # support functions used in main.py
├── models.py                   # defines the landmark generation model
├── step_3_vid2vid.sh           # Bash script for running vid2vid
├── step_4_denoise.sh           # Bash script for denoising vid2vid results
├── compare_openness.ipynb      # mouth openness comparison across generated videos
└── README.md
```
> - shape_predictor_68_face_landmarks.dat
>
> This is trained on the ibug 300-W dataset (https://ibug.doc.ic.ac.uk/resources/facial-point-annotations/)
>
> The license for this dataset excludes commercial use and Stefanos Zafeiriou, one of the creators of the dataset, asked me to include a note here saying that the trained model therefore can't be used in a commercial product. So you should contact a lawyer or talk to Imperial College London to find out if it's OK for you to use this model in a commercial product.
>
> {C. Sagonas, E. Antonakos, G, Tzimiropoulos, S. Zafeiriou, M. Pantic. 300 faces In-the-wild challenge: Database and results. Image and Vision Computing (IMAVIS), Special Issue on Facial Landmark Localisation "In-The-Wild". 2016.}
## Detailed steps for model deployment
- **Go to** the `post_processing` directory
- Run `python3 main.py -r <step>`, where `<step>` is the number of the step described below
  - e.g. `python3 main.py -r 1` runs Step 1
#### Step 1: Generate landmarks
- Input
  - Base video file path (`./source/base_video/base_video.mp4`)
  - Audio driver file path (`./source/audio_driver_wav/audio_driver.wav`)
  - Epoch (`int`)
- Output (`./result`)
  - keypoints.npy (generated landmarks in `npy` format)
  - source.txt (contains information about the base video, audio driver, and model epoch)
- Process (a rough sketch follows this list)
  - Extract facial landmarks from the base video
  - Extract MFCC features from the driver audio
  - Pass the MFCC features and facial landmarks into the model to retrieve mouth landmarks
  - Combine the facial & mouth landmarks and save them in `npy` format
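For reference, a minimal sketch of the two extraction steps above, using librosa for the MFCCs and dlib's 68-point predictor for the landmarks. The sample rate, window/hop sizes, and file paths are placeholders following the folder layout above, not the exact values hard-coded in `main.py` / `extract_mfcc`:

```python
# Sketch of step 1's feature extraction (illustrative parameters, not the exact ones in main.py).
import cv2
import dlib
import librosa

# Audio side: MFCC features from the driver audio (sr / window / hop are placeholders).
wav, sr = librosa.load("source/audio_driver_wav/audio_driver.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=13,
                            n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))

# Video side: 68 facial landmarks per frame of the base video.
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("source/dlib/shape_predictor_68_face_landmarks.dat")

cap = cv2.VideoCapture("source/base_video/base_video.mp4")
landmarks = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 1)
    if faces:
        shape = predictor(gray, faces[0])
        landmarks.append([(shape.part(i).x, shape.part(i).y) for i in range(68)])
cap.release()
```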
#### Step 2: Test generated frames
- Input
  - None
- Output (`./result`)
  - Folder `save_keypoints`: visualized generated frames
  - Folder `save_keypoints_csv`: landmark coordinates for each frame, saved in `txt` format
  - openness.png: mouth openness measured and plotted across all frames
- Process
  - Generate images from the `npy` file
  - Generate the openness plot (one possible measure is sketched after this list)
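One simple openness measure is the distance between the inner-lip landmarks. The landmark indices (62/66), the `npy` layout, and the use of matplotlib below are assumptions for illustration, not necessarily what `compare_openness.ipynb` does:

```python
# Illustrative openness measure: gap between the inner-lip landmarks 62 and 66 per frame.
import numpy as np
import matplotlib.pyplot as plt

landmarks = np.load("result/keypoints.npy")            # assumed shape: (n_frames, 68, 2)
openness = np.linalg.norm(landmarks[:, 66] - landmarks[:, 62], axis=1)

plt.plot(openness)
plt.xlabel("frame")
plt.ylabel("inner-lip distance (px)")
plt.title("Mouth openness across frames")
plt.savefig("result/openness.png")
```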
#### Step 3: Execute vid2vid
- Input
  - None
- Output
  - The path to the fake images generated by vid2vid is shown at the end; please copy them back to `/result/vid2vid_frames/`
  - Folder: vid2vid generated images
- Process
  - Run vid2vid
  - Copy the vid2vid results back to the main folder (see the sketch after this list)
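The copy-back can be done by hand or with a few lines like the following; the vid2vid output directory is hypothetical here, use whatever path the script prints at the end:

```python
# Copy the vid2vid fake images back into result/vid2vid_frames/ (source path is a placeholder).
import shutil
from pathlib import Path

src = Path("path/printed/by/vid2vid")      # replace with the path shown at the end of step 3
dst = Path("result/vid2vid_frames")
dst.mkdir(parents=True, exist_ok=True)
for img in sorted(src.glob("*.jpg")):
    shutil.copy(img, dst / img.name)
```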
#### Step 4: Denoise and smooth vid2vid results
- Input
  - vid2vid generated images folder path
  - Original base images folder path
- Output
  - Folder: modified images (base image + vid2vid mouth regions)
  - Folder: denoised and smoothed frames
- Process
  - Crop the mouth areas from the vid2vid generated images and paste them back onto the base images to obtain the modified images
  - Generate circularly smoothed images using gradient masking (see the sketch after this list)
  - Take `(modified image, circular smoothed image)` pairs and denoise them
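A minimal sketch of the gradient-masking idea, as an illustration rather than the exact code in `main_support.py`; the mouth box and blur sigma are assumed values:

```python
# Feathered (gradient-masked) paste of the vid2vid mouth region onto the base frame.
import cv2
import numpy as np

def blend_mouth(base_img, vid2vid_img, mouth_box, sigma=15):
    x, y, w, h = mouth_box                               # assumed mouth region (x, y, w, h)
    mask = np.zeros(base_img.shape[:2], np.float32)
    cv2.ellipse(mask, (x + w // 2, y + h // 2), (w // 2, h // 2), 0, 0, 360, 1.0, -1)
    mask = cv2.GaussianBlur(mask, (0, 0), sigma)         # soft circular falloff
    mask = mask[..., None]                               # broadcast over the color channels
    out = mask * vid2vid_img.astype(np.float32) + (1 - mask) * base_img.astype(np.float32)
    return out.astype(np.uint8)
```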
#### Step 5: Generate modified videos with sound
- Input
  - Saved frames folder path
    - By default this is `./result/save_keypoints`; enter `d` to use the default path
    - Otherwise, enter the frames folder path
  - Audio driver file path (`./source/audio_driver_wav/audio_driver.wav`)
- Output (`./result/save_keypoints/result/`)
  - video_without_sound.mp4: the modified video without sound
  - audio_only.mp4: the audio driver
  - final_output.mp4: the modified video with sound
- Process
  - Generate the modified video without sound at the defined fps (see the sketch after this list)
  - Extract the `wav` track from the audio driver
  - Combine the audio and video to generate the final output
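A rough sketch of the silent-video step using OpenCV's `VideoWriter`; the frame folder, image extension, and fps value below are assumptions:

```python
# Write the saved frames into an mp4 at the base video's fps (paths and fps are assumptions).
import os
import cv2
from pathlib import Path

frames = sorted(Path("result/save_keypoints").glob("*.png"))
h, w = cv2.imread(str(frames[0])).shape[:2]

fps = 25.0                                   # use the fps measured from the base video
os.makedirs("result/save_keypoints/result", exist_ok=True)
writer = cv2.VideoWriter("result/save_keypoints/result/video_without_sound.mp4",
                         cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
for f in frames:
    writer.write(cv2.imread(str(f)))
writer.release()
```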
## Important Notice
- You may need to modify how MFCC features are extracted in the `extract_mfcc` function
  - Be careful with the sample rate, window length, and hop length
  - Good resource: https://www.mathworks.com/help/audio/ref/mfcc.html
- You may need to modify the region of interest (mouth area) in the `frame_crop` function
- You may need to modify the frame rate defined in step 3 of `main.py`; it should match your base video's fps
```python
# How to check your base video's fps
# source: https://www.learnopencv.com/how-to-find-frame-rate-or-frames-per-second-fps-in-opencv-python-cpp/
import cv2

video = cv2.VideoCapture("video.mp4")

# Find the OpenCV version
(major_ver, minor_ver, subminor_ver) = (cv2.__version__).split('.')

if int(major_ver) < 3:
    # Old OpenCV 2.x constant
    fps = video.get(cv2.cv.CV_CAP_PROP_FPS)
    print("Frames per second using video.get(cv2.cv.CV_CAP_PROP_FPS): {0}".format(fps))
else:
    fps = video.get(cv2.CAP_PROP_FPS)
    print("Frames per second using video.get(cv2.CAP_PROP_FPS): {0}".format(fps))

video.release()
```
- You may need to modify the shell path
```shell
echo $SHELL
```
- You may need to modify the audio sampling rate in the `extract_audio` function
- You may need to customize your parameters in the `combine_audio_video` function (an example ffmpeg call is sketched below)
  - Good resource: https://ffmpeg.org/ffmpeg.html
  - https://gist.github.com/tayvano/6e2d456a9897f55025e25035478a3a50
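For reference, one way to do the final audio/video mux from Python with ffmpeg; the exact flags used by `combine_audio_video` may differ, so treat this as an illustrative command only:

```python
# Mux the silent video with the driver audio via ffmpeg (flags are illustrative).
import subprocess

subprocess.run([
    "ffmpeg", "-y",
    "-i", "result/save_keypoints/result/video_without_sound.mp4",
    "-i", "source/audio_driver_wav/audio_driver.wav",
    "-c:v", "copy", "-c:a", "aac", "-shortest",
    "result/save_keypoints/result/final_output.mp4",
], check=True)
```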
## Update History
- March 22, 2020: Drafted documentation