Text-to-Speech (TTS) is the task of generating natural-sounding speech from text input. A single TTS model can be extended to generate speech for multiple speakers and multiple languages.


About Text-to-Speech

Use Cases

Text-to-Speech (TTS) models can be used in any speech-enabled application that requires converting text to speech.

Voice Assistants

TTS models are used to build voice assistants on smart devices. They are a better alternative to concatenative methods, in which the assistant's speech is assembled from recorded sounds, because the outputs of TTS models carry elements of natural speech such as emphasis.

Announcement Systems

TTS models are widely used in airport and public transportation announcement systems to convert the text of an announcement into speech.


The Hub contains over 100 TTS models that you can use right away by trying out the widgets directly in the browser or calling the models as a service using the Inference API. Here is a simple code snippet to do exactly this:

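import os
import requests

# Read the access token from the environment (HF_API_TOKEN is an assumed variable name).
API_TOKEN = os.environ["HF_API_TOKEN"]
headers = {"Authorization": f"Bearer {API_TOKEN}"}
# Point the request at a text-to-speech model rather than a speech recognition model.
API_URL = "https://api-inference.huggingface.co/models/espnet/kan-bayashi_ljspeech_vits"

def query(payload):
    # Send the text to the hosted model and return the raw HTTP response.
    response = requests.post(API_URL, headers=headers, json=payload)
    return response

# For a TTS model, the response body is expected to hold the generated audio.
output = query({"inputs": "This is a test"})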

You can also use libraries such as ESPnet if you want to run inference yourself.

from espnet2.bin.tts_inference import Text2Speech

model = Text2Speech.from_pretrained("espnet/kan-bayashi_ljspeech_vits")

speech, *_ = model("text to generate speech from")
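Once a waveform has been generated, it usually needs to be written to an audio file. The following is a minimal sketch using only the Python standard library. The helper name save_wav is hypothetical, and the sketch assumes the waveform is a one-dimensional sequence of floats in the range [-1.0, 1.0] with a known sample rate.

```python
import array
import sys
import wave

def save_wav(samples, sample_rate, path):
    """Write mono float samples in [-1.0, 1.0] to a 16-bit PCM WAV file."""
    pcm = array.array("h")  # signed 16-bit integers
    for sample in samples:
        # Clip out-of-range values so they do not wrap around when converted.
        clipped = max(-1.0, min(1.0, float(sample)))
        pcm.append(int(round(clipped * 32767)))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV data is stored little-endian
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm.tobytes())
```

With the ESPnet example above, a call might look like save_wav(speech.tolist(), model.fs, "out.wav"), assuming speech is a one-dimensional tensor and that the loaded model exposes its sample rate as model.fs.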

Useful Resources

Compatible libraries

ESPnet, TensorFlowTTS
Models for Text-to-Speech

Note An end-to-end TTS model trained for a single speaker.

Datasets for Text-to-Speech

Note Thousands of short audio clips of a single speaker.

Spaces using Text-to-Speech

Note An application for end-to-end text-to-speech.

Note An application that contains multiple speech recognition models for various languages and datasets.

Metrics for Text-to-Speech
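Mel Cepstral Distortion
The Mel Cepstral Distortion (MCD) metric measures how far the mel-cepstral coefficients of generated speech are from those of a reference recording. Lower values indicate that the generated speech is spectrally closer to the reference.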
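As an illustration, MCD can be sketched in a few lines of Python. This follows one commonly used formulation, (10 / ln 10) * sqrt(2 * sum of squared coefficient differences), averaged over frames. The function name is hypothetical, and the sketch assumes the two sequences of mel-cepstral frames have already been extracted and time-aligned (for example with dynamic time warping) and that the 0th coefficient, which mostly reflects energy, is skipped.

```python
import math

def mel_cepstral_distortion(reference, generated):
    """Average MCD in dB over aligned frames of mel-cepstral coefficients."""
    if not reference or len(reference) != len(generated):
        raise ValueError("sequences must be non-empty and have the same number of frames")
    scale = 10.0 / math.log(10.0)
    total = 0.0
    for ref_frame, gen_frame in zip(reference, generated):
        # Skip index 0 (energy term) and sum squared differences of the rest.
        squared_diff = sum(
            (r - g) ** 2 for r, g in zip(ref_frame[1:], gen_frame[1:])
        )
        total += scale * math.sqrt(2.0 * squared_diff)
    return total / len(reference)
```

Identical sequences give a distortion of 0.0, and the value grows as the generated coefficients move away from the reference.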