Using ESPnet at Hugging Face
espnet is an end-to-end toolkit for speech processing, including automatic speech recognition, text to speech, speech enhancement, diarization and other tasks.
Exploring ESPnet in the Hub
You can find hundreds of espnet
models by filtering at the left of the models page.
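If you prefer to query the Hub programmatically, here is a minimal sketch using the huggingface_hub library (the limit of 10 is arbitrary, and the attribute names assume a recent version of the library):

from huggingface_hub import HfApi

# List models on the Hub tagged with the "espnet" library
api = HfApi()
for model in api.list_models(filter="espnet", limit=10):
    print(model.id)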
All models on the Hub come with useful features:
- An automatically generated model card with a description, a training configuration, licenses and more.
- Metadata tags that help with discoverability and contain information such as the license, language and datasets.
- An interactive widget you can use to try out the model directly in the browser.
- An Inference API that allows you to make inference requests, as sketched below.
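As a hedged sketch of the Inference API route, the request below posts raw audio to the standard inference endpoint of a hypothetical ASR model; you need a valid Hugging Face token:

import requests

# Hypothetical espnet ASR model on the Hub
API_URL = "https://api-inference.huggingface.co/models/username/model_repo"
headers = {"Authorization": "Bearer YOUR_HF_TOKEN"}  # replace with your token

# For ASR models, the response contains the transcription of the audio
with open("sample.wav", "rb") as f:
    response = requests.post(API_URL, headers=headers, data=f.read())
print(response.json())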
Using existing models
For a full guide on loading pre-trained models, we recommend checking out the official guide.
If you’re interested in doing inference, different classes for different tasks offer a from_pretrained method that loads models from the Hub. For example:
- Speech2Text for Automatic Speech Recognition.
- Text2Speech for Text to Speech.
- SeparateSpeech for Audio Source Separation.
Here is an inference example:
import soundfile
from espnet2.bin.tts_inference import Text2Speech
text2speech = Text2Speech.from_pretrained("model_name")
speech = text2speech("foobar")["wav"]
soundfile.write("out.wav", speech.numpy(), text2speech.fs, "PCM_16")
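Analogously, here is a minimal sketch of automatic speech recognition with Speech2Text (the model name and the input file are placeholders):

import soundfile
from espnet2.bin.asr_inference import Speech2Text

speech2text = Speech2Text.from_pretrained("model_name")
# Read a waveform and decode it; the result is an n-best list of hypotheses
speech, rate = soundfile.read("input.wav")
nbests = speech2text(speech)
text, *_ = nbests[0]
print(text)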
If you want to see how to load a specific model, you can click Use in ESPnet and you will be given a working snippet that loads it!
Sharing your models
ESPnet outputs a zip file that can be uploaded to Hugging Face easily. For a full guide on sharing models, we recommend checking out the official guide.
The run.sh script allows you to upload a given model to a Hugging Face repository:
./run.sh --stage 15 --skip_upload_hf false --hf_repo username/model_repo
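Alternatively, if you want to push files to a repository yourself, a minimal sketch with huggingface_hub could look like the following; the repository name and the path to the packed model are hypothetical, and the run.sh route above remains the recommended way since it takes care of the expected repository layout:

from huggingface_hub import HfApi

api = HfApi()
# Upload the packed model to an existing repository on the Hub
api.upload_file(
    path_or_fileobj="exp/model.zip",   # hypothetical path to the zip produced by ESPnet
    path_in_repo="model.zip",
    repo_id="username/model_repo",
    repo_type="model",
)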
Additional resources
- ESPnet docs.
- ESPnet model zoo repository.
- Integration docs.