
SpeechT5-base-cs-tts

This is a monolingual Czech SpeechT5 base model pre-trained on 120 thousand hours of Czech speech and a large Czech text corpus of 17.5B words. It was introduced in the paper Zero-Shot vs. Few-Shot Multi-Speaker TTS Using Pre-trained Czech SpeechT5 Model, accepted to the TSD2024 conference. It is meant to be used as a starting checkpoint for Czech TTS fine-tuning.

Important note: This is only a pre-trained model, so it has not yet been trained for the TTS task.

Examples of fine-tuned voices are available at https://jalehecka.github.io/TSD2024.

See our paper for details.

Paper

https://link.springer.com/chapter/10.1007/978-3-031-70566-3_5

Pre-print: http://arxiv.org/abs/2407.17167.

Citation

If you find this model useful, please cite our paper:

@InProceedings{10.1007/978-3-031-70566-3_5,
  author="Lehe{\v{c}}ka, Jan
    and Hanzl{\'i}{\v{c}}ek, Zden{\v{e}}k
    and Matou{\v{s}}ek, Jind{\v{r}}ich
    and Tihelka, Daniel",
  editor="N{\"o}th, Elmar
    and Hor{\'a}k, Ale{\v{s}}
    and Sojka, Petr",
  title="Zero-Shot vs. Few-Shot Multi-speaker TTS Using Pre-trained Czech SpeechT5 Model",
  booktitle="Text, Speech, and Dialogue",
  year="2024",
  publisher="Springer Nature Switzerland",
  address="Cham",
  pages="46--57",
  isbn="978-3-031-70566-3"
}

Usage

This is the TTS variant of the SpeechT5 model, i.e., the input modality is text (processed via the text pre-net) and the output modality is speech (processed via the speech post-net). The model has the same format as microsoft/speecht5_tts and is compatible with the SpeechT5ForTextToSpeech class.

In order to use this model for text-to-speech, it must be fine-tuned on labeled TTS data.

Usage is the same as for microsoft/speecht5_tts.
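Since the checkpoint follows the microsoft/speecht5_tts format, loading it might look like the minimal sketch below. Several things here are assumptions rather than facts from this card: the repository id `fav-kky/SpeechT5-base-cs-tts` is hypothetical, the repository is assumed to ship the tokenizer files that `SpeechT5Processor` needs, the `microsoft/speecht5_hifigan` vocoder is borrowed from the microsoft/speecht5_tts examples, and the all-zero speaker embedding is only a placeholder. The `synthesize` step produces meaningful speech only after the model has been fine-tuned on labeled TTS data.

```python
# Hypothetical repository id: this card does not state it, so replace it
# with the actual Hugging Face id of the checkpoint you use.
MODEL_ID = "fav-kky/SpeechT5-base-cs-tts"
# Assumption: the vocoder used in the microsoft/speecht5_tts examples.
VOCODER_ID = "microsoft/speecht5_hifigan"


def load_checkpoint(model_id=MODEL_ID):
    """Load the processor and model the same way as for microsoft/speecht5_tts."""
    # Imported lazily so that defining this function does not require transformers.
    from transformers import SpeechT5ForTextToSpeech, SpeechT5Processor

    processor = SpeechT5Processor.from_pretrained(model_id)
    model = SpeechT5ForTextToSpeech.from_pretrained(model_id)
    return processor, model


def synthesize(processor, model, text, speaker_embedding, vocoder_id=VOCODER_ID):
    """Return a waveform tensor; only meaningful for a fine-tuned checkpoint."""
    from transformers import SpeechT5HifiGan

    vocoder = SpeechT5HifiGan.from_pretrained(vocoder_id)
    inputs = processor(text=text, return_tensors="pt")
    return model.generate_speech(
        inputs["input_ids"], speaker_embedding, vocoder=vocoder
    )


def demo():
    """Download the checkpoint and run one synthesis call (needs network access)."""
    import torch

    processor, model = load_checkpoint()
    # Placeholder all-zero speaker embedding of the size the model config expects;
    # a real setup would use an embedding of the target speaker.
    embedding = torch.zeros((1, model.config.speaker_embedding_dim))
    waveform = synthesize(processor, model, "Dobrý den.", embedding)
    print(waveform.shape)
```

Calling `demo()` requires `transformers` and `torch` to be installed. For fine-tuning, the object returned by `load_checkpoint` would serve as the starting point for a standard SpeechT5 TTS training setup.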

Related works
