🔉👄 Wav2Lip STUDIO

English | 简体中文

https://user-images.githubusercontent.com/800903/262435301-af205a91-30d7-43f2-afcc-05980d581fe0.mp4

💡 Description

This repository contains a Wav2Lip Studio Standalone Version.

It's an all-in-one solution: choose a video and a speech file (WAV or MP3), and the tool generates a lip-synced video. It can also swap faces, clone voices, and translate a video with voice cloning (HeyGen-like). It improves the quality of the lip-sync videos generated by the Wav2Lip tool by applying specific post-processing techniques.

📖 Quick Index

🚀 Updates
🔗 Requirements
💻 Installation
🐍 Tutorial
🐍 Usage
👄 Keyframes Manager
👄 Input Video
📺 Examples
📖 Behind the scenes
💪 Quality tips
⚠️Noted Constraints
📝 To do
😎 Contributing
🙏 Appreciation
📝 Citation
📜 License
☕ Support Wav2lip Studio

🚀 Updates

2024.02.09 Speed Up Update (Standalone version only)

👬 Clone voice: Add controls to manage the voice clone (See Usage section)
🎏 Translate video: Add features to the translation panel to manage the translation (See Usage section)
📺 Trim feature: Add a feature to trim the video.
🔑 Automatic mask: Add a feature to automatically calculate the mask parameters (padding, dilate...). You can change the parameters if needed.
🚀 Speed up processes: All processes are now faster: analysis, face swap, and high-quality generation.

2024.01.20 Major Update (Standalone version only)

♻ Manage project: Add a feature to manage multiple projects
👪 Introduced multiple face swap: Can now swap multiple faces in one shot (See Usage section)
⛔ Visible face restriction: Can now run the whole process even if no face is detected in a frame!
📺 Video Size: Works with high-resolution video input (tested with 1920x1080; should work with 4K, but slowly)
🔑 Keyframe manager: Add a keyframe manager for better control of the video generation
🍪 coqui TTS integration: Remove bark integration, use coqui TTS instead (See Usage section)
💬 Conversation: Add a conversation feature with multiple people (See Usage section)
🔈 Record your own voice: Add a feature to record your own voice (See Usage section)
👬 Clone voice: Add a feature to clone voice from video (See Usage section)
🎏 translate video: Add a feature to translate video with voice clone (See Usage section)
🔉 Volume amplifier for wav2lip: Add a feature to amplify the volume of the wav2lip output (See Usage section)
🕡 Add delay before sound speech start
🚀 Speed up process: Speed up the process

2023.09.13

👪 Introduced face swap: facefusion integration (See Usage section). This feature is experimental.

2023.08.22

👄 Introduced bark (See Usage section). This feature is experimental.

2023.08.20

🚢 Introduced the GFPGAN model as an option.
▶ Added the feature to resume generation.
📏 Optimized to release memory post-generation.

2023.08.17

🐛 Fixed purple lips bug

2023.08.16

⚡ Added Wav2lip and enhanced video output, with the option to download the one that's best for you, likely the "generated video".
🚢 Updated User Interface: Introduced control over CodeFormer Fidelity.
👄 Removed image as input, SadTalker is better suited for this.
🐛 Fixed a bug regarding the discrepancy between input and output video that incorrectly positioned the mask.
💪 Refined the quality process for greater efficiency.
🚫 Interrupting the process now still generates a video if frames have already been created

2023.08.13

⚡ Speed-up computation
🚢 Change User Interface : Add controls on hidden parameters
👄 Only Track mouth if needed
📰 Control debug
🐛 Fix resize factor bug

🔗 Requirements

FFmpeg: download it from the official FFmpeg site and follow the instructions for your operating system. Note that ffmpeg has to be accessible from the command line.
Make sure ffmpeg is in your PATH environment variable. If not, add it.

pyannote.audio: You need to agree to share your contact information to access the pyannote models. To do so, go to both links:
- pyannote diarization-3.1 huggingface repository
- pyannote segmentation-3.0 huggingface repository

Fill in each field and click "Agree and access repository".

Create an access token on Hugging Face:
1. Log in to your account
2. Go to access tokens in settings
3. Create a new token in read mode
4. Copy the token
5. Paste it in the file api_keys.json
```
 {
   "huggingface_token": "your token"
 }
```
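
To check that the file is filled in before launching, a small script along these lines can help. This is only a sketch: the function name `load_huggingface_token` is hypothetical, and it assumes api_keys.json sits in the current folder. It does not describe how Wav2Lip Studio reads the file internally.

```python
import json
from pathlib import Path


def load_huggingface_token(path="api_keys.json"):
    """Return the token stored under "huggingface_token" in the given file.

    Raises ValueError if the key is missing, empty, or still the placeholder.
    (Hypothetical helper, not part of Wav2Lip Studio.)
    """
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    token = str(data.get("huggingface_token", "")).strip()
    if not token or token == "your token":
        raise ValueError(f"Set a real 'huggingface_token' in {path}")
    return token
```

Calling `load_huggingface_token()` from the folder that contains api_keys.json either returns the token or explains what is missing.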

💻 Installation

Install python 3.10.11
Install git

Check ffmpeg, python, cuda and git installation

python --version
git --version
ffmpeg -version
nvcc --version (only if you have an NVIDIA GPU and are not on macOS)

These commands should return something like:

Python 3.10.11
git version 2.35.1.windows.2
ffmpeg version N-110509-g722ff74055-20230506 Copyright (c) 2000-2023 the FFmpeg developers built with gcc 12.2.0 (crosstool-NG 1.25.0.152_89671bf) bla bla bla...
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:41:10_Pacific_Daylight_Time_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
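
The same check can be scripted. The sketch below only tests whether the tools are reachable on PATH (it does not compare versions); `REQUIRED_TOOLS` and `missing_tools` are names made up for this example.

```python
import shutil
import sys

# nvcc is left out: it is only relevant with an NVIDIA GPU.
REQUIRED_TOOLS = ["git", "ffmpeg"]


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    print("Python", sys.version.split()[0])
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```

If a tool is reported as missing, add its folder to your PATH environment variable and open a new terminal.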

Windows Users

Install CUDA 11.8 if you have not already done so.
Install Visual Studio. During the installation, make sure to include the Python and C++ packages in the Visual Studio installer.
If you have multiple Python versions on your computer, edit launch.py and change the following line:
```
REM set PYTHON="your python.exe path"
```
to:
```
set PYTHON="your python.exe path"
```
Double-click wav2lip-studio.bat; this installs the requirements and downloads the models.

MACOS Users

Install python 3.9

brew update
brew install python@3.9
brew install git-lfs
git-lfs install

Install the environment and requirements

cd /YourWav2lipStudioFolder
/opt/homebrew/bin/python3.9 -m venv venv
./venv/bin/python3.9 -m pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2
./venv/bin/python3.9 -m pip install -r requirements.txt
./venv/bin/python3.9 -m pip install transformers==4.33.2
./venv/bin/python3.9 -m pip install numpy==1.24.4

If this fails, or if pip install -r requirements.txt takes too long:

./venv/bin/python3.9 -m pip install inaSpeechSegmenter
./venv/bin/python3.9 -m pip install gradio==4.14.0 imutils==0.5.4 numpy opencv-python==4.8.0.76 scipy==1.11.2 requests==2.28.1 pillow==9.3.0 librosa==0.10.0 opencv-contrib-python==4.8.0.76 huggingface_hub==0.20.2 tqdm==4.66.1 cutlet==0.3.0 numba==0.57.1 imageio_ffmpeg==0.4.9 insightface==0.7.3 unidic==1.1.0 onnx==1.14.1 onnxruntime==1.16.0 psutil==5.9.5 lpips==0.1.4 GitPython==3.1.36 facexlib==0.3.0 gfpgan==1.3.8 gdown==4.7.1 pyannote.audio==3.1.1 TTS==0.21.2 openai-whisper==20231117 resampy==0.4.0 scenedetect==0.6.2 uvicorn==0.23.2 starlette==0.35.1 fastapi==0.109.0 fugashi
./venv/bin/python3.9 -m pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2
./venv/bin/python3.9 -m pip install transformers==4.33.2
./venv/bin/python3.9 -m pip install numpy==1.24.4

Install models

git clone https://huggingface.co/numz/wav2lip_studio-0.2 models

Launch UI
```
./venv/bin/python3.9 wav2lip_studio.py
```

Tutorial

🐍 Usage

## PARAMETERS

Enter a project name and press Enter.
Choose a video (avi or mp4 format). Note that an avi file will not appear in the Video input, but processing will still work.
Face Swap (takes time, so be patient):
- Face Swap: Choose the image of the face(s) you want to swap onto the face(s) in the video (multiple faces are now supported); the leftmost face is id 0.
Resolution Divide Factor: The resolution of the video will be divided by this factor. The higher the factor, the faster the process, but the lower the resolution of the output video.
Min Face Width Detection: The minimum width of a face to detect. Allows small faces in the video to be ignored.
Align Faces: Straightens the head before sending it to Wav2Lip for processing.
Keyframes On Speaker Change: Generates a keyframe when the speaker changes. This gives you better control over the video generation.
Keyframes On Scene Change: Generates a keyframe when the scene changes. This gives you better control over the video generation.
When the parameters above are set, click Generate Keyframes. See the Keyframes Manager section for more details.
Audio, 3 options:
1. Put an audio file in the "Speech" input, or record one with the "Record" button.
2. Generate audio with the coqui TTS text-to-speech integration.
  1. Choose the language
  2. Choose the Voice
  3. Write your speech in the "Prompt" text area, in text format or JSON format:
    1. Text format:
```
Hello, my name is John. I am 25 years old.
```
    2. JSON format (you can ask ChatGPT to generate a discussion for you):
```
[
  {
    "start": 0.0,
    "end": 3.0,
    "text": "Hello, my name is John. I am 25 years old.",
    "speaker": "arnold"
  },
  {
    "start": 3.0,
    "end": 4.0,
    "text": "Ho really ?",
    "speaker": "female_01"
  },
  ...
]
```
3. Input Video: Use the audio from the input video, with voice cloning and translation. See the Input Video section for more details.
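
A JSON prompt written by hand (or by a chatbot) is easy to get wrong, so it can be worth checking before pasting it in. The sketch below verifies the four fields shown in the JSON example for option 2. The function name `validate_prompt` is hypothetical, and the rule that segments must not overlap is an assumption of this sketch, not a documented requirement of the tool.

```python
import json

REQUIRED_KEYS = {"start", "end", "text", "speaker"}


def validate_prompt(prompt_json):
    """Parse a JSON prompt and return its segments, or raise ValueError."""
    segments = json.loads(prompt_json)  # invalid JSON also raises ValueError
    if not isinstance(segments, list):
        raise ValueError("the prompt must be a JSON list of segments")
    previous_end = 0.0
    for index, segment in enumerate(segments):
        if not isinstance(segment, dict):
            raise ValueError(f"segment {index} is not a JSON object")
        missing = REQUIRED_KEYS - segment.keys()
        if missing:
            raise ValueError(f"segment {index} is missing {sorted(missing)}")
        if segment["start"] >= segment["end"]:
            raise ValueError(f"segment {index} must end after it starts")
        # Assumption: segments are listed in order and do not overlap.
        if segment["start"] < previous_end:
            raise ValueError(f"segment {index} overlaps the previous one")
        previous_end = segment["end"]
    return segments
```
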
Video Quality:
- Low: Original Wav2Lip quality, fast but not very good.
- Medium: Better quality by applying post-processing on the mouth, slower.
- High: Better quality by applying post-processing and upscaling the mouth, slower.
Wav2lip Checkpoint: Choose between 2 Wav2Lip models:
- Wav2lip: Original Wav2Lip model, fast but not very good.
- Wav2lip GAN: Better quality by applying post-processing on the mouth, slower.
Face Restoration Model: Choose between 2 face restoration models:
- CodeFormer (fidelity value between 0 and 1):
  - A value of 0 offers higher quality but may significantly alter the person's facial appearance and cause noticeable flickering between frames.
  - A value of 1 provides lower quality but maintains the person's face more consistently and reduces frame flickering.
  - Using a value below 0.5 is not advised. Adjust this setting to achieve optimal results. Starting with a value of 0.75 is recommended.
- GFPGAN: Usually better quality.
Volume Amplifier: Does not amplify the volume of the output audio; it amplifies the audio only when sending it to Wav2Lip. This gives you better control over lip movement.
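
To illustrate the Volume Amplifier idea, here is a minimal sketch assuming audio samples stored as floats between -1.0 and 1.0. The function name `amplify` and the clipping behaviour are assumptions for illustration only; the actual processing in Wav2Lip Studio may differ.

```python
def amplify(samples, factor):
    """Scale float samples by `factor`, clipping the result to [-1.0, 1.0]."""
    return [max(-1.0, min(1.0, sample * factor)) for sample in samples]


original = [0.0, 0.1, -0.2, 0.6]

# Louder copy, used only to drive the lip movement.
for_lipsync = amplify(original, 2.0)
print(for_lipsync)  # [0.0, 0.2, -0.4, 1.0]

# `original` is untouched and would remain the audio of the final video.
print(original)  # [0.0, 0.1, -0.2, 0.6]
```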

KEYFRAMES MANAGER

### Global parameters:

Only show Speaker Face: Shows only the face of the speaker; the other faces are hidden.
Frame Number: A slider that allows you to move between the frames of the video.
Add Keyframe: Allows you to add a keyframe at the current Frame Number.
Remove Keyframe: Allows you to remove a keyframe at the current Frame Number.
Keyframes: A list of all the keyframes.

### For each face on keyframe:

Face Id: List of all the faces in current keyframe.
Translation info: If a translation is associated with the project, it is shown here. You can see the speaker, which helps you select the right speaker on this keyframe.
Speaker: Checkbox to set the speaker on the current Face Id of the current keyframe.
Face Swap Id: Checkbox to set the face swap id of the current keyframe on the current Face Id.
Automatic Mask: Default True, if False, you can draw the mask manually.
Mouth Mask Dilate: Dilates the mouth mask to cover more area around the mouth. The right value depends on the mouth size.
Face Mask Erode: Erodes the face mask to remove some area around the face. The right value depends on the face size.
Mask Blur: Blurs the mask to make it smoother. Try to keep it less than or equal to Mouth Mask Dilate.
Padding sliders: Add padding around the head to avoid cutting the head in the video.

When you configure a keyframe, its influence lasts until the next keyframe, so intermediate frames are generated with the same configuration. Note that this configuration can't be seen in the UI for intermediate frames.
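
The "configuration lasts until the next keyframe" rule can be pictured as a lookup of the latest keyframe at or before a given frame. The sketch below is illustrative only; `active_keyframe` is a hypothetical name, and this is not necessarily how the tool stores keyframes internally.

```python
from bisect import bisect_right


def active_keyframe(keyframes, frame_number):
    """Return the latest keyframe at or before `frame_number`, or None.

    `keyframes` must be a sorted list of frame numbers.
    """
    position = bisect_right(keyframes, frame_number)
    return keyframes[position - 1] if position else None


keyframes = [0, 120, 300]
print(active_keyframe(keyframes, 119))  # 0
print(active_keyframe(keyframes, 120))  # 120
print(active_keyframe(keyframes, 450))  # 300
```

Frame 119 still uses the configuration of keyframe 0, and everything from frame 300 onward uses the last keyframe.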

Input Video

If there is no sound in the translated audio, the audio from the input video is used. This can be useful if the input video has bad lip sync.

### Clone Voices:

Number Of Speakers: The number of speakers in the video. Helps the cloning step know how many voices to clone.
Remove Background Sound Before Clone: Removes noise/music from the background before cloning.
Clone Voices: Clone voices from the input video.
Voices: List of the cloned voices. You can rename a voice to identify it in the translation. For each voice you can:
- Play: Listen to the voice.
- regen sentence: Regenerate the sentence sample.
- save voice: Save the voice to your voices library.
Voices Files: List of the voice files used by the models to create the cloned voices. You can modify the voice files to change the cloned voices. Make sure each file contains only one voice, with no background sound and no music. You can listen to a voice file by clicking the play button, and change the speaker name to identify the voice.

### Translation:

The translation panel is now linked to the cloned voices panel, because translation tries to identify the speaker in order to translate the voice.

Language: Target language to translate the input video.
Whisper Model: List of the Whisper models available for the translation. Choose between 5 models: the larger the model, the better the quality but the slower the process.
Translate: Translate the input video to the selected language.
Translation: The translated text.
Translated Audio: The translated audio.
Convert To Audio: Convert the translated text to translated audio.

For each segment of the translated text, you can:

Modify the translated text.
Modify the start and end time of the segment.
Change the speaker of the segment.
Listen to the original audio by clicking the play button.
Listen to the translated audio by clicking the red ideogram button.
Generate the translation for this segment by clicking the recycle button.
Delete the segment by clicking the trash button.
Add a new segment under this one by clicking the arrow down button.

📺 Examples

https://user-images.githubusercontent.com/800903/262439441-bb9d888a-d33e-4246-9f0a-1ddeac062d35.mp4

https://user-images.githubusercontent.com/800903/262442794-61b1e32f-3f87-4b36-98d6-f711822bdb1e.mp4

https://user-images.githubusercontent.com/800903/262449305-901086a3-22cb-42d2-b5be-a5f38db4549a.mp4

https://user-images.githubusercontent.com/800903/267808494-300f8cc3-9136-4810-86e2-92f2114a5f9a.mp4

📖 Behind the scenes

This extension operates in several stages to improve the quality of Wav2Lip-generated videos:

Generate face swap video: The script first generates the face swap video if an image is set in the "Face Swap" field. This operation takes time, so be patient.
Generate a Wav2lip video: The script then generates a low-quality Wav2Lip video using the input video and audio.
Video Quality Enhancement: Creates a high-quality video from the low-quality video, using the enhancer defined by the user.
Mask Creation: The script creates a mask around the mouth and tries to keep other facial motions like those of the cheeks and chin.
Video Generation: The script then takes the high-quality mouth image and overlays it onto the original image guided by the mouth mask.
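
The last stage can be pictured as a per-pixel blend between the original frame and the enhanced mouth, weighted by the mask. The sketch below uses tiny grayscale images stored as nested lists to stay dependency-free. It is an illustration of mask-guided blending in general, not the project's actual code, and the name `blend` is made up for this example.

```python
def blend(original, enhanced, mask):
    """Blend two equally sized grayscale images using a mask in [0.0, 1.0].

    A mask value of 1.0 takes the enhanced pixel, 0.0 keeps the original,
    and values in between (a blurred mask edge) mix the two.
    """
    return [
        [m * e + (1.0 - m) * o for o, e, m in zip(row_o, row_e, row_m)]
        for row_o, row_e, row_m in zip(original, enhanced, mask)
    ]


original = [[10.0, 10.0], [10.0, 10.0]]
enhanced = [[50.0, 50.0], [50.0, 50.0]]
mask = [[0.0, 0.5], [1.0, 1.0]]
print(blend(original, enhanced, mask))  # [[10.0, 30.0], [50.0, 50.0]]
```

This is also why Mask Blur matters: a blurred mask has intermediate values at its border, so the pasted mouth fades into the original face instead of showing a hard edge.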

💪 Quality tips

Use a high quality video as input
Use a video with a consistent frame rate. Occasionally, videos may exhibit unusual playback frame rates (not the standard 24, 25, 30, 60), which can lead to issues with the face mask.
Use a high quality audio file as input, without background noise or music. Clean audio with a tool like https://podcast.adobe.com/enhance.
Dilate the mouth mask. This will help the model retain some facial motion and hide the original mouth.
Keep Mask Blur at no more than twice the value of Mouth Mask Dilate. If you want to increase the blur, increase Mouth Mask Dilate as well; otherwise the mouth will be blurred and the underlying mouth could be visible.
Upscaling can improve the result, particularly around the mouth area, but it extends the processing time. Use this tutorial from Olivio Sarikas to upscale your video: https://www.youtube.com/watch?v=3z4MKUqFEUk. Ensure the denoising strength is set between 0.0 and 0.05, select the 'revAnimated' model, and use the batch mode. I'll create a tutorial for this soon.
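
For the frame rate tip, ffprobe (installed with FFmpeg) reports both a nominal rate (`r_frame_rate`) and an average rate (`avg_frame_rate`) for a video stream. Comparing the two is one rough way to spot an unusual or variable frame rate. The sketch below is a heuristic under that assumption; the function names and the 1% tolerance are choices made for this example, not part of Wav2Lip Studio.

```python
import json
import subprocess
from fractions import Fraction


def parse_rate(rate):
    """Convert an ffprobe rate such as '30000/1001' to a Fraction."""
    numerator, _, denominator = rate.partition("/")
    divisor = int(denominator or 1)
    # ffprobe may report '0/0' when a rate is unknown.
    return Fraction(int(numerator), divisor) if divisor else Fraction(0)


def has_constant_frame_rate(r_frame_rate, avg_frame_rate, tolerance=0.01):
    """Heuristic: the nominal and average rates should nearly agree."""
    nominal = parse_rate(r_frame_rate)
    average = parse_rate(avg_frame_rate)
    return abs(nominal - average) <= tolerance * nominal


def probe_rates(video_path):
    """Read r_frame_rate and avg_frame_rate of the first video stream."""
    output = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "v:0",
         "-show_entries", "stream=r_frame_rate,avg_frame_rate",
         "-of", "json", video_path],
        capture_output=True, text=True, check=True,
    ).stdout
    stream = json.loads(output)["streams"][0]
    return stream["r_frame_rate"], stream["avg_frame_rate"]
```

For example, `has_constant_frame_rate(*probe_rates("input.mp4"))` returning False suggests re-encoding the video to a standard constant frame rate before processing (`input.mp4` is a placeholder file name).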

⚠ Noted Constraints

To speed up the process, try to keep the resolution under 1000x1000 px and upscale after processing.
If the initial phase is excessively lengthy, consider using the "resize factor" to decrease the video's dimensions.
While there's no strict size limit for videos, larger videos will require more processing time. It's advisable to employ the "resize factor" to minimize the video size and then upscale the video once processing is complete.
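
As a rough guide for picking a factor, the sketch below computes the smallest whole-number divide factor that brings both sides of a video to 1000 px or less. It assumes the factor is an integer; `divide_factor_for` is a hypothetical helper, not a function of the tool.

```python
import math


def divide_factor_for(width, height, max_side=1000):
    """Smallest integer factor that brings both sides to max_side or less."""
    return max(1, math.ceil(max(width, height) / max_side))


factor = divide_factor_for(1920, 1080)
print(factor, 1920 // factor, 1080 // factor)  # 2 960 540
```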

Known issues:

If you have trouble installing insightface, follow these steps:

Download the precompiled insightface wheel and place it in the root folder of Wav2lip-studio.
In a terminal, go to the wav2lip-studio folder and run the following commands:

.\venv\Scripts\activate
python -m pip install -U pip
python -m pip install insightface-0.7.3-cp310-cp310-win_amd64.whl

Enjoy

📝 To do

✔️ Standalone version
✔️ Add a way to use a face swap image
✔️ Add Possibility to use a video for audio input
✔️ Convert avi to mp4. Avi is not show in video input but process work fine
ComfyUI integration

😎 Contributing

We welcome contributions to this project. When submitting pull requests, please provide a detailed description of the changes. See CONTRIBUTING for more information.

🙏 Appreciation

☕ Support Wav2lip Studio

This project is an open-source effort that is free to use and modify. I rely on the support of users to keep this project going and help improve it. If you'd like to support me, you can make a donation on my Patreon page. Any contribution, large or small, is greatly appreciated!

Your support helps me cover the costs of development and maintenance, and allows me to allocate more time and resources to enhancing this project. Thank you for your support!

patreon page

📝 Citation

If you use this project in your own work, in articles, tutorials, or presentations, we encourage you to cite this project to acknowledge the efforts put into it.

To cite this project, please use the following BibTeX format:

@misc{wav2lip_uhq,
  author = {numz},
  title = {Wav2Lip UHQ},
  year = {2023},
  howpublished = {GitHub repository},
  publisher = {numz},
  url = {https://github.com/numz/sd-wav2lip-uhq}
}

📜 License

The code in this repository is released under the MIT license as found in the LICENSE file.