# 🔉👄 Wav2Lip STUDIO Standalone

<img src="demo/demo.gif" width="100%">

demo/demo1.mp4

## 💡 Description
This repository contains the Wav2Lip Studio standalone version.

It's an all-in-one solution: just choose a video and a speech file (wav or mp3), and the tool will generate a lip-sync video. It can also swap faces, clone voices, and translate a video with voice cloning (HeyGen-like).
It improves the quality of the lip-sync videos generated by the [Wav2Lip tool](https://github.com/Rudrabha/Wav2Lip) by applying specific post-processing techniques.

![Illustration](demo/demo.png)
![Illustration](demo/demo1.png)

## 📖 Quick Index
* [🚀 Updates](#-updates)
* [🔗 Requirements](#-requirements)
* [💻 Installation](#-installation)
* [🐍 Tutorial](#-tutorial)
* [🐍 Usage](#-usage)
* [👄 Keyframes Manager](#-keyframes-manager)
* [👄 Input Video](#-input-video)
* [📺 Examples](#-examples)
* [📖 Behind the scenes](#-behind-the-scenes)
* [💪 Quality tips](#-quality-tips)
* [⚠️ Noted Constraints](#-noted-constraints)
* [📝 To do](#-to-do)
* [😎 Contributing](#-contributing)
* [🙏 Appreciation](#-appreciation)
* [📝 Citation](#-citation)
* [📜 License](#-license)
* [☕ Support Wav2lip Studio](#-support-wav2lip-studio)

## 🚀 Updates
**2024.01.20 Major Update (Standalone version only)**
- ♻ Project management: added a feature to manage multiple projects
- 👪 Multiple face swap: can now swap multiple faces in one shot (see Usage section)
- ⛔ Visible face restriction: the whole process now runs even if no face is detected in a frame!
- 📺 Video size: works with high-resolution input video (tested with 1920x1080; should work with 4K, but slowly)
- 🔑 Keyframe manager: added a keyframe manager for better control of video generation
- 🐪 Coqui TTS integration: removed the bark integration; use Coqui TTS instead (see Usage section)
- 💬 Conversation: added a conversation feature with multiple speakers (see Usage section)
- 🔈 Record your own voice: added a feature to record your own voice (see Usage section)
- 👬 Clone voice: added a feature to clone voices from a video (see Usage section)
- 🎏 Translate video: added a feature to translate a video with voice cloning (see Usage section)
- 🔉 Volume amplifier for Wav2Lip: added a feature to amplify the volume of the audio sent to Wav2Lip (see Usage section)
- 🕑 Added a delay option before speech starts
- 🚀 Sped up the process

**2023.09.13**
- 👪 Introduced face swap: facefusion integration (see Usage section). **This feature is experimental.**

**2023.08.22**
- 👄 Introduced [bark](https://github.com/suno-ai/bark/) (see Usage section). **This feature is experimental.**

**2023.08.20**
- 🚒 Introduced the GFPGAN model as an option.
- ▶ Added the ability to resume generation.
- 📏 Optimized memory release after generation.

**2023.08.17**
- 🐛 Fixed the purple lips bug

**2023.08.16**
- ⚡ Added Wav2Lip and enhanced video outputs, with the option to download the one that's best for you, likely the "generated video".
- 🚒 Updated user interface: introduced control over CodeFormer fidelity.
- 👄 Removed image input; [SadTalker](https://github.com/OpenTalker/SadTalker) is better suited for that.
- 🐛 Fixed a bug where a discrepancy between the input and output video incorrectly positioned the mask.
- 💪 Refined the quality process for greater efficiency.
- 🚫 Interruption now still produces a video if the process has already created frames

**2023.08.13**
- ⚡ Sped up computation
- 🚒 Changed user interface: added controls for hidden parameters
- 👄 Only track the mouth if needed
- 📰 Debug controls
- 🐛 Fixed a resize factor bug

## 🔗 Requirements

- FFmpeg: download it from the [official FFmpeg site](https://ffmpeg.org/download.html). Follow the instructions for your operating system; note that ffmpeg must be accessible from the command line.

## 💻 Installation

### Windows users

1. Install [Visual Studio](https://visualstudio.microsoft.com/fr/downloads/). During the install, make sure to include the Python and C++ packages in the Visual Studio installer.
   ![Illustration](demo/visual_studio_1.png)
   ![Illustration](demo/visual_studio_2.png)

2. Install [Python 3.10.11](https://www.python.org/downloads/release/python-31011/)
3. Install [git](https://git-scm.com/downloads)
4. Install [CUDA 11.8](https://developer.nvidia.com/cuda-11-8-0-download-archive) if not already installed.
   ![Illustration](demo/cuda.png)

5. Check the Python, git and CUDA installations
    ```bash
    python --version
    git --version
    nvcc --version
    ```
   These should return something like
    ```bash
    Python 3.10.11
    git version 2.35.1.windows.2
    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2022 NVIDIA Corporation
    Built on Wed_Sep_21_10:41:10_Pacific_Daylight_Time_2022
    Cuda compilation tools, release 11.8, V11.8.89
    Build cuda_11.8.r11.8/compiler.31833905_0
    ```
6. If you have multiple Python versions on your computer, edit wav2lip-studio.bat and change the following line:
    ```bash
    REM set PYTHON="your python.exe path"
    ```
    to
    ```bash
    set PYTHON="your python.exe path"
    ```
7. Double-click wav2lip-studio.bat; it will install the requirements and download the models

### macOS users

1. Install Python 3.9

   ```
   brew update
   brew install python@3.9
   brew install git-lfs
   git-lfs install
   ```
2. Install the environment and requirements

   ```
   cd /YourWav2lipStudioFolder
   /opt/homebrew/bin/python3.9 -m venv venv
   ./venv/bin/python3.9 -m pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2
   ./venv/bin/python3.9 -m pip install -r requirements.txt
   ./venv/bin/python3.9 -m pip install transformers==4.33.2
   ./venv/bin/python3.9 -m pip install numpy==1.24.4
   ```

   If `pip install -r requirements.txt` fails or takes too long, install the packages explicitly:

   ```
   ./venv/bin/python3.9 -m pip install inaSpeechSegmenter
   ./venv/bin/python3.9 -m pip install gradio==4.14.0 imutils==0.5.4 numpy opencv-python==4.8.0.76 scipy==1.11.2 requests==2.28.1 pillow==9.3.0 librosa==0.10.0 opencv-contrib-python==4.8.0.76 huggingface_hub==0.20.2 tqdm==4.66.1 cutlet==0.3.0 numba==0.57.1 imageio_ffmpeg==0.4.9 insightface==0.7.3 unidic==1.1.0 onnx==1.14.1 onnxruntime==1.16.0 psutil==5.9.5 lpips==0.1.4 GitPython==3.1.36 facexlib==0.3.0 gfpgan==1.3.8 gdown==4.7.1 pyannote.audio==3.1.1 TTS==0.21.2 openai-whisper==20231117 resampy==0.4.0 scenedetect==0.6.2 uvicorn==0.23.2 starlette==0.35.1 fastapi==0.109.0 fugashi
   ./venv/bin/python3.9 -m pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2
   ./venv/bin/python3.9 -m pip install transformers==4.33.2
   ./venv/bin/python3.9 -m pip install numpy==1.24.4
   ```

3. Install the models
   ```
   git clone https://huggingface.co/numz/wav2lip_studio models
   ```
4. Launch the UI
   ```
   ./venv/bin/python3.9 wav2lip_studio.py
   ```

### All users

1. pyannote.audio: you need to agree to share your contact information to access the pyannote models. To do so, visit both links:
    - [pyannote diarization-3.1 huggingface repository](https://huggingface.co/pyannote/speaker-diarization-3.1)
    - [pyannote segmentation-3.0 huggingface repository](https://huggingface.co/pyannote/segmentation-3.0)

   Fill in each field and click "Agree and access repository".
    ![Illustration](demo/hf_aggrement.png)

2. Create a Huggingface access token:
    1. Log in with your account
    2. Go to [access tokens](https://huggingface.co/settings/token) in settings
    3. Create a new token in read mode
    4. Copy the token
    5. Paste it into the file api_keys.json
    ```json
     {
       "huggingface_token": "your token"
     }
     ```
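If you want to sanity-check that the token can be read before launching the tool, here is a minimal sketch. It assumes api_keys.json sits in the folder you run it from; `load_hf_token` and `check_token` are illustrative helpers, not part of Wav2Lip Studio.

```python
import json

def load_hf_token(path="api_keys.json"):
    """Read the Huggingface token from api_keys.json."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)["huggingface_token"]

def check_token(path="api_keys.json"):
    """Validate the token against Huggingface (requires network access)."""
    # huggingface_hub is installed with the requirements; login() verifies the token
    from huggingface_hub import login
    login(token=load_hf_token(path))
```

Call `check_token()` once; if it raises, the token in api_keys.json is wrong or lacks read access.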

## 🐍 Tutorial
- [FR version](https://youtu.be/43Q8YASkcUA)
- [EN Version](https://youtu.be/B84A5alpPDc)

## 🐍 Usage
### Parameters
1. Enter a project name and press Enter.
2. Choose a video (avi or mp4 format). Note: avi files will not appear in the video input, but the process will still work.
3. Face swap (takes time, so be patient):
   - **Face Swap**: choose the image of the face(s) you want to swap with the face(s) in the video (multiple faces are now supported); the leftmost face is id 0.
4. **Resolution Divide Factor**: the resolution of the video will be divided by this factor. The higher the factor, the faster the process, but the lower the resolution of the output video.
5. **Min Face Width Detection**: the minimum width of a face to detect. Allows small faces in the video to be ignored.
6. **Align Faces**: straightens the head before sending it to Wav2Lip processing.
7. **Keyframes On Speaker Change**: generates a keyframe when the speaker changes, giving you better control over video generation.
8. **Keyframes On Scene Change**: generates a keyframe when the scene changes, giving you better control over video generation.
9. When the parameters above are set, click **Generate Keyframes**. See the [Keyframes Manager](#-keyframes-manager) section for more details.
10. Audio, 3 options:
      1. Put an audio file in the "Speech" input, or record one with the "Record" button.
      2. Generate audio with the [Coqui TTS](https://github.com/coqui-ai/TTS) text-to-speech integration:
         1. Choose the language
         2. Choose the voice
         3. Write your speech in the "Prompt" text area, in plain text or JSON format:
            1. Text format:
               ```
               Hello, my name is John. I am 25 years old.
               ```
            2. JSON format (you can ask ChatGPT to generate a discussion for you):
                ```json
                [
                  {
                    "start": 0.0,
                    "end": 3.0,
                    "text": "Hello, my name is John. I am 25 years old.",
                    "speaker": "arnold"
                  },
                  {
                    "start": 3.0,
                    "end": 4.0,
                    "text": "Ho really ?",
                    "speaker": "female_01"
                  },
                  ...
                ]
                ```
      3. Input video: use the audio from the input video, with voice cloning and translation. See the [Input Video](#-input-video) section for more details.
11. **Video Quality**:
    - **Low**: original Wav2Lip quality; fast but not very good.
    - **Medium**: better quality by applying post-processing to the mouth; slower.
    - **High**: better quality by applying post-processing and upscaling the mouth; slower.
12. **Wav2lip Checkpoint**: choose between 2 Wav2Lip models:
    - **Wav2lip**: original Wav2Lip model; fast but not very good.
    - **Wav2lip GAN**: better quality; slower.
13. **Face Restoration Model**: choose between 2 face restoration models:
    - **Code Former**:
      - A value of 0 offers higher quality but may significantly alter the person's facial appearance and cause noticeable flickering between frames.
      - A value of 1 provides lower quality but keeps the person's face more consistent and reduces flickering between frames.
      - Using a value below 0.5 is not advised. Adjust this setting to achieve optimal results; starting with a value of 0.75 is recommended.
    - **GFPGAN**: usually better quality.
14. **Volume Amplifier**: does not amplify the volume of the output audio; instead it amplifies the volume of the audio sent to Wav2Lip, giving you better control over lip movement.
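The JSON prompt format from step 10 is easy to validate before feeding it to the tool. A minimal sketch (illustrative only; `parse_conversation` is not part of Wav2Lip Studio):

```python
import json

def parse_conversation(prompt: str):
    """Parse the JSON prompt format into (speaker, start, end, text) tuples."""
    out = []
    for seg in json.loads(prompt):
        if seg["end"] <= seg["start"]:
            raise ValueError(f"segment ends before it starts: {seg}")
        out.append((seg["speaker"], seg["start"], seg["end"], seg["text"]))
    # keep segments in chronological order for the TTS pass
    return sorted(out, key=lambda s: s[1])

prompt = '''[
  {"start": 0.0, "end": 3.0, "text": "Hello, my name is John.", "speaker": "arnold"},
  {"start": 3.0, "end": 4.0, "text": "Ho really ?", "speaker": "female_01"}
]'''
print(parse_conversation(prompt)[0][0])  # arnold
```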

## 👄 Keyframes Manager
![Illustration](demo/keyframes-manager.png)

Global parameters:
1. **Only Track The Mouth**: tracks only the mouth, removing other facial motions like those of the cheeks and chin.
2. **Only Show Speaker Face**: focuses only on the speaker's face; the other faces will be hidden.
3. **Frame Number**: a slider to move between the frames of the video.
4. **Add Keyframe**: adds a keyframe at the current frame number.
5. **Remove Keyframe**: removes the keyframe at the current frame number.
6. **Keyframes**: a list of all the keyframes.

For each face on a keyframe:
1. **Face Id**: list of all the faces in the current keyframe.
2. **Speaker**: checkbox to mark the current face id as the speaker for the current keyframe.
3. **Face Swap Id**: sets the face swap id for the current face id of the current keyframe.
4. **Mouth Mask Dilate**: dilates the mouth mask to cover more area around the mouth; depends on the mouth size.
5. **Face Mask Erode**: erodes the face mask to remove some area around the face; depends on the face size.
6. **Mask Blur**: blurs the mask to make it smoother; try to keep it less than or equal to **Mouth Mask Dilate**.
7. **Padding sliders**: add padding around the head to avoid cutting it off in the video.
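To build intuition for what dilate, erode and blur do to the mask, here is a naive sketch in plain NumPy. The tool presumably uses optimized image operations; `dilate`, `erode` and `blur` below are illustrative stand-ins.

```python
import numpy as np

def dilate(mask, px):
    """Naive square dilation: a pixel becomes 1 if any neighbor within px is 1.

    Note: np.roll wraps around the edges, which is fine for this illustration.
    """
    out = mask.copy()
    for dy in range(-px, px + 1):
        for dx in range(-px, px + 1):
            out |= np.roll(np.roll(mask, dy, axis=0), dx, axis=1)
    return out

def erode(mask, px):
    """Erosion is dilation of the background."""
    return 1 - dilate(1 - mask, px)

def blur(mask, px):
    """Separable box blur to feather the mask edge (the tool likely uses a Gaussian)."""
    m = mask.astype(float)
    kernel = np.ones(2 * px + 1) / (2 * px + 1)
    for axis in (0, 1):
        m = np.apply_along_axis(np.convolve, axis, m, kernel, mode="same")
    return m
```

Dilating grows the mouth region outward, eroding shrinks the face region inward, and blurring turns the hard 0/1 edge into a soft ramp so the overlay fades smoothly into the original frame.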

## 👄 Input Video
![Illustration](demo/input-video.png)

If there is no sound in the translated audio, the tool will take the audio from the input video. This can be useful if the input video has bad lip-sync.

Clone voices:
1. **Number Of Speakers**: the number of speakers in the video; helps the cloner know how many voices to clone.
2. **Remove Background Sound Before Clone**: removes noise/music from the background sound before cloning.
3. **Clone Voices**: clones the voices from the input video.
4. **Voices**: list of the cloned voices.

Translation:
1. **Language**: target language for translating the input video.
2. **Whisper Model**: list of the Whisper models to use for the translation; choose between 5 models. The larger the model, the better the quality, but the slower the process.
3. **Translate**: translates the input video into the selected language.
4. **Translation**: the translated text.
5. **Translated Audio**: the translated audio.
6. **Convert To Audio**: converts the translated text to translated audio.
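For reference, the five standard Whisper sizes are tiny, base, small, medium and large. A minimal sketch of transcribing an input video's audio with the `openai-whisper` package (note that Whisper's built-in `translate` task only targets English, so translation into other languages presumably happens in a separate step; `pick_model` and `transcribe_video` are illustrative helpers, not the tool's API):

```python
# The five standard Whisper model sizes, smallest/fastest to largest/slowest.
WHISPER_MODELS = ["tiny", "base", "small", "medium", "large"]

def pick_model(quality: int) -> str:
    """Map a 0-4 quality setting to a Whisper model name."""
    return WHISPER_MODELS[max(0, min(quality, len(WHISPER_MODELS) - 1))]

def transcribe_video(path: str, quality: int = 2) -> str:
    """Transcribe a media file's audio track (downloads the model on first use)."""
    import whisper  # pip install openai-whisper
    model = whisper.load_model(pick_model(quality))
    # task="translate" would translate the speech into English instead
    return model.transcribe(path)["text"]
```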

## 📺 Examples

demo/demo2.mp4

demo/demo3.mp4

demo/demo4.mp4

demo/demo5.mp4

## 📖 Behind the scenes

This tool operates in several stages to improve the quality of Wav2Lip-generated videos:

1. **Face swap video generation**: the script first generates the face swap video if an image is set in the "Face Swap" field; this operation takes time, so be patient.
2. **Wav2Lip video generation**: the script then generates a low-quality Wav2Lip video using the input video and audio.
3. **Video quality enhancement**: a high-quality video is created from the low-quality one using the enhancer defined by the user.
4. **Mask creation**: the script creates a mask around the mouth and tries to keep other facial motions, like those of the cheeks and chin.
5. **Video generation**: the script takes the high-quality mouth image and overlays it onto the original image, guided by the mouth mask.
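Step 5 amounts to an alpha blend of the enhanced frame over the original, guided by the feathered mouth mask. A minimal NumPy sketch (illustrative; the actual implementation may differ):

```python
import numpy as np

def composite(original, enhanced, mouth_mask):
    """Overlay the enhanced mouth onto the original frame.

    original, enhanced: H x W x 3 float arrays in [0, 1]
    mouth_mask: H x W float array in [0, 1], feathered by the mask blur
    """
    alpha = mouth_mask[..., None]  # broadcast the mask over the color channels
    return alpha * enhanced + (1.0 - alpha) * original
```

Where the mask is 1, the enhanced pixel wins; where it is 0, the original frame is untouched; the blurred edge in between hides the seam.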

## 💪 Quality tips
- Use a high-quality video as input
- Use a video with a consistent frame rate. Occasionally, videos may exhibit unusual playback frame rates (not the standard 24, 25, 30, 60), which can lead to issues with the face mask.
- Use a high-quality audio file as input, without background noise or music. Clean the audio with a tool like [https://podcast.adobe.com/enhance](https://podcast.adobe.com/enhance).
- Dilate the mouth mask. This helps the model retain some facial motion and hide the original mouth.
- Keep Mask Blur at most twice the value of Mouth Mask Dilate. If you want to increase the blur, increase Mouth Mask Dilate as well; otherwise the mouth will be over-blurred and the underlying mouth could become visible.
- Upscaling can improve the result, particularly around the mouth area, but it extends the processing time. Use this tutorial from Olivio Sarikas to upscale your video: [https://www.youtube.com/watch?v=3z4MKUqFEUk](https://www.youtube.com/watch?v=3z4MKUqFEUk). Ensure the denoising strength is set between 0.0 and 0.05, select the 'revAnimated' model, and use batch mode. I'll create a tutorial for this soon.

## ⚠ Noted Constraints
- To speed up the process, try to keep the resolution under 1000x1000 px and upscale after processing.
- If the initial phase is excessively long, consider using the resolution divide factor to decrease the video's dimensions.
- While there's no strict size limit for videos, larger videos require more processing time. It's advisable to use the resolution divide factor to minimize the video size, then upscale the video once processing is complete.
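A quick way to pick a resolution divide factor that keeps both sides under roughly 1000 px (`divide_factor` is a hypothetical helper for illustration, not part of the tool):

```python
import math

def divide_factor(width: int, height: int, max_side: int = 1000) -> int:
    """Smallest integer factor that brings both dimensions down to at most max_side."""
    return max(1, math.ceil(max(width, height) / max_side))

# e.g. a 3840x2160 (4K) input gives a factor of 4, i.e. 960x540
```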

## πŸ“ To do
- βœ”οΈ Standalone version
- βœ”οΈ Add a way to use a face swap image
- βœ”οΈ Add Possibility to use a video for audio input
- βœ”οΈ Convert avi to mp4. Avi is not show in video input but process work fine
- [ ] ComfyUI intergration

## 😎 Contributing

We welcome contributions to this project. When submitting pull requests, please provide a detailed description of the changes. See [CONTRIBUTING](CONTRIBUTING.md) for more information.

## πŸ™ Appreciation 
- [Wav2Lip](https://github.com/Rudrabha/Wav2Lip)
- [CodeFormer](https://github.com/sczhou/CodeFormer)
- [Coqui TTS](https://github.com/coqui-ai/TTS)
- [facefusion](https://github.com/facefusion/facefusion)
- [Vocal Remover](https://github.com/tsurumeso/vocal-remover)

## ☕ Support Wav2lip Studio

This project is an open-source effort that is free to use and modify. I rely on the support of users to keep it going and help improve it. If you'd like to support me, you can make a donation on my Patreon page. Any contribution, large or small, is greatly appreciated!

Your support helps me cover the costs of development and maintenance, and allows me to allocate more time and resources to enhancing this project. Thank you for your support!

[patreon page](https://www.patreon.com/Wav2LipStudio)

## πŸ“ Citation
If you use this project in your own work, in articles, tutorials, or presentations, we encourage you to cite this project to acknowledge the efforts put into it.

To cite this project, please use the following BibTeX format:

```
@misc{wav2lip_uhq,
  author = {numz},
  title = {Wav2Lip UHQ},
  year = {2023},
  howpublished = {GitHub repository},
  publisher = {numz},
  url = {https://github.com/numz/sd-wav2lip-uhq}
}
``` 

## 📜 License
* The code in this repository is released under the MIT license as found in the [LICENSE file](LICENSE).