---
license: apache-2.0
language:
- en
library_name: transformers
---

# FastSpeech2ConformerWithHifiGan

This model combines [FastSpeech2Conformer](https://huggingface.co/espnet/fastspeech2_conformer) and [FastSpeech2ConformerHifiGan](https://huggingface.co/espnet/fastspeech2_conformer_hifigan) into a single model for more convenient use.

FastSpeech2Conformer is a non-autoregressive text-to-speech (TTS) model that combines the strengths of FastSpeech2 and the conformer architecture to generate high-quality speech from text quickly and efficiently. The HiFi-GAN vocoder then converts the generated mel-spectrograms into speech waveforms.

## 🤗 Transformers Usage

You can run FastSpeech2Conformer locally with the 🤗 Transformers library.

1. First install the 🤗 [Transformers library](https://github.com/huggingface/transformers) and g2p-en, plus soundfile, which the example below uses to write the WAV file:

```
pip install --upgrade pip
pip install --upgrade transformers g2p-en soundfile
```

2. Run inference via the Transformers modelling code, with the model and HiFi-GAN vocoder combined:

```python
from transformers import FastSpeech2ConformerTokenizer, FastSpeech2ConformerWithHifiGan
import soundfile as sf

# Convert the input text to token IDs
tokenizer = FastSpeech2ConformerTokenizer.from_pretrained("espnet/fastspeech2_conformer")
inputs = tokenizer("Hello, my dog is cute.", return_tensors="pt")
input_ids = inputs["input_ids"]

# Load the combined acoustic model and vocoder, then synthesize the waveform
model = FastSpeech2ConformerWithHifiGan.from_pretrained("espnet/fastspeech2_conformer_with_hifigan")
output_dict = model(input_ids, return_dict=True)
waveform = output_dict["waveform"]

# Save the audio at the 22050 Hz sampling rate
sf.write("speech.wav", waveform.squeeze().detach().numpy(), samplerate=22050)
```
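
The example above writes the waveform with soundfile. If that package is not available, a rough alternative is to write 16-bit PCM audio with Python's standard `wave` module. The sketch below is only an illustration: `write_pcm16_wav` is a hypothetical helper name, not part of Transformers, and it assumes a mono waveform with float samples in the range [-1.0, 1.0] at the 22050 Hz sampling rate used above.

```python
import struct
import wave

SAMPLE_RATE = 22050  # sampling rate used in the inference example above


def write_pcm16_wav(path, samples, sample_rate=SAMPLE_RATE):
    """Write mono float samples as a 16-bit PCM WAV file.

    Hypothetical helper for illustration. Samples are assumed to lie in
    [-1.0, 1.0]; values outside that range are clipped.
    Returns the clip duration in seconds.
    """
    pcm = []
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))  # avoid int16 overflow
        pcm.append(int(round(clipped * 32767)))

    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes per sample = 16-bit
        wav_file.setframerate(sample_rate)
        # WAV stores PCM samples as little-endian signed integers
        wav_file.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))

    return len(pcm) / sample_rate
```

With the `waveform` tensor from the example, a call might look like `write_pcm16_wav("speech.wav", waveform.squeeze().detach().numpy())`. Unlike soundfile, this path quantizes to 16-bit integers by hand, so it trades a dependency for a little extra code.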