FastSpeech2ConformerWithHifiGan

This model combines FastSpeech2Conformer and FastSpeech2ConformerHifiGan into one model for a simpler and more convenient usage.

FastSpeech2Conformer is a non-autoregressive text-to-speech (TTS) model that combines the strengths of FastSpeech2 and the conformer architecture to generate high-quality speech from text quickly and efficiently, and the HiFi-GAN vocoder is used to turn generated mel-spectrograms into speech waveforms.

🤗 Transformers Usage

You can run FastSpeech2Conformer locally with the 🤗 Transformers library.

First install the 🤗 Transformers library and g2p-en:

pip install --upgrade pip
pip install --upgrade transformers g2p-en

Run inference via the Transformers modelling code with the model and hifigan combined


from transformers import FastSpeech2ConformerTokenizer, FastSpeech2ConformerWithHifiGan
import soundfile as sf

tokenizer = FastSpeech2ConformerTokenizer.from_pretrained("espnet/fastspeech2_conformer")
inputs = tokenizer("Hello, my dog is cute.", return_tensors="pt")
input_ids = inputs["input_ids"]

model = FastSpeech2ConformerWithHifiGan.from_pretrained("espnet/fastspeech2_conformer_with_hifigan")
output_dict = model(input_ids, return_dict=True)
waveform = output_dict["waveform"]

sf.write("speech.wav", waveform.squeeze().detach().numpy(), samplerate=22050)