Few-Shot Voice Cloning

This repository implements a pipeline for few-shot voice cloning based on the SpeechT5 architecture introduced in SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing. It can clone a voice from a 15-30 second audio recording in English (other languages are planned).

Getting Started

Clone the repository

git clone https://github.com/konverner/deep-voice-cloning.git

Install the modules

pip install .

Run training, specifying arguments either in the config file training_config.json or on the command line, for example:

python scripts/train.py --audio_path scripts/input/hank.mp3 --output_dir /content/deep-voice-cloning/models

The resulting model will be saved in the output_dir directory. It will be used in the next step.
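If you prefer to launch training from Python rather than the shell, the command above can be wrapped as in the sketch below. It only assumes the two flags shown in this README (--audio_path and --output_dir) and that it is run from the repository root. The helper names build_train_command and run_training are hypothetical and are not part of this repository.

```python
import shlex
import subprocess
import sys


def build_train_command(audio_path, output_dir, script="scripts/train.py"):
    """Return the argument list for the training command shown above.

    Hypothetical helper: only the flags documented in this README are used.
    """
    return [
        sys.executable, script,
        "--audio_path", str(audio_path),
        "--output_dir", str(output_dir),
    ]


def run_training(audio_path, output_dir):
    """Run training and raise CalledProcessError if the script exits with an error."""
    cmd = build_train_command(audio_path, output_dir)
    print("Running:", shlex.join(cmd))  # show the equivalent shell command
    subprocess.run(cmd, check=True)
```

For example, run_training("scripts/input/hank.mp3", "models") runs the same command as above with a relative output directory. Passing the arguments as a list avoids shell quoting problems in paths.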

Run inference, specifying arguments either in the config file inference_config.json or on the command line, for example:

python scripts/cloning_inference.py --model_path "/content/deep-voice-cloning/models/microsoft_speecht5_tts_hank" \
--input_text 'do the things, not because they are easy, but because they are hard' \
--output_path "scripts/output/do_the_things.wav"

The resulting audio file will be saved at output_path.
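As a quick sanity check on the generated file, its sample rate and duration can be read with Python's standard wave module. This is a minimal sketch that assumes the output is a standard PCM WAV file; the wave module cannot read other encodings such as floating-point WAV. The helper name describe_wav is hypothetical and not part of this repository.

```python
import wave


def describe_wav(path):
    """Return (sample_rate_hz, channels, duration_seconds) for a PCM WAV file."""
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        # Duration is the number of frames divided by frames per second.
        duration = wav.getnframes() / rate
    return rate, channels, duration
```

For example, describe_wav("scripts/output/do_the_things.wav") reports the properties of the file produced by the inference command above. A duration of only a fraction of a second for a full sentence would suggest that synthesis did not work as expected.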

Docker

To build the Docker image:

docker build -t deep-voice-cloning .

To pull the Docker image from Docker Hub:

docker pull konverner/deep-voice-cloning:latest

To run the image in a container:

docker run -it --entrypoint=/bin/bash konverner/deep-voice-cloning

To run training in the container, for example:

python scripts/train.py --audio_path scripts/input/hank.mp3 --output_dir models

To run inference in the container, for example:

python scripts/cloning_inference.py --model_path models/microsoft_speecht5_tts_hank --input_text "do the things, not because they are easy, but because they are hard" --output_path scripts/output/do_the_things.wav

Notebook Examples

An example of using the CLI for training and inference can be found in the notebook.