LLM_BENCHMARKS_TEXT = f"""
# About
As many recent Text-to-Speech (TTS) models have shown, synthetic audio can be close to real human speech.
However, traditional evaluation methods for TTS systems need an update to keep pace with these new developments.
Our TTSDS benchmark assesses the quality of synthetic speech by considering factors like prosody, speaker identity, and intelligibility.
By comparing these factors with both real speech and noise datasets, we can better understand how synthetic speech stacks up.
## More information
More details can be found in our paper [*TTSDS -- Text-to-Speech Distribution Score*](https://arxiv.org/abs/2407.12707).
## Reproducibility
To reproduce our results, check out our repository [here](https://github.com/ttsds/ttsds).
## Credits
This benchmark is inspired by [TTS Arena](https://huggingface.co/spaces/TTS-AGI/TTS-Arena), which instead focuses on the subjective evaluation of TTS models.
Our benchmark would not be possible without the many open-source TTS models on Hugging Face and GitHub.
Additionally, our benchmark uses the following datasets:
- [LJSpeech](https://keithito.com/LJ-Speech-Dataset/)
- [LibriTTS](https://www.openslr.org/60/)
- [VCTK](https://datashare.ed.ac.uk/handle/10283/2950)
- [Common Voice](https://commonvoice.mozilla.org/)
- [ESC-50](https://github.com/karolpiczak/ESC-50)
And the following metrics/representations/tools:
- [Wav2Vec2](https://arxiv.org/abs/2006.11477)
- [HuBERT](https://arxiv.org/abs/2106.07447)
- [WavLM](https://arxiv.org/abs/2110.13900)
- [PESQ](https://en.wikipedia.org/wiki/Perceptual_Evaluation_of_Speech_Quality)
- [VoiceFixer](https://arxiv.org/abs/2204.05841)
- [WADA SNR](https://www.cs.cmu.edu/~robust/Papers/KimSternIS08.pdf)
- [Whisper](https://arxiv.org/abs/2212.04356)
- [Masked Prosody Model](https://huggingface.co/cdminix/masked_prosody_model)
- [PyWorld](https://github.com/JeremyCCHsu/Python-Wrapper-for-World-Vocoder)
- [WeSpeaker](https://arxiv.org/abs/2210.17016)
- [D-Vector](https://github.com/yistLin/dvector)
Authors: Christoph Minixhofer, Ondřej Klejch, and Peter Bell of the University of Edinburgh.
"""
EVALUATION_QUEUE_TEXT = """
## How to submit a TTS model to the leaderboard
### 1) Download the evaluation dataset
The evaluation dataset consists of wav/text pairs.
You can download [`speaker_text_pairs.tar.gz`](https://huggingface.co/datasets/ttsds/speaker_text_pairs/blob/main/speaker_text_pairs.tar.gz) from the dataset repository.
The format of the dataset is as follows:
```
eval/
├── 0001.wav
├── 0001.txt
├── 0002.wav
├── 0002.txt
└── ...
```
Please note that the .wav file is the speaker reference and the .txt file is the prompt.
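For reference, here is a minimal sketch of fetching and unpacking the archive with the `huggingface_hub` library (the output path is an assumption; the archive is expected to unpack into `eval/` as shown above):
```python
import tarfile

from huggingface_hub import hf_hub_download

# Download the archive from the Hugging Face dataset repository.
archive_path = hf_hub_download(
    repo_id="ttsds/speaker_text_pairs",
    filename="speaker_text_pairs.tar.gz",
    repo_type="dataset",
)

# Unpack the wav/text pairs into the current directory (creates eval/).
with tarfile.open(archive_path, "r:gz") as tar:
    tar.extractall(".")
```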
### 2) Create your TTS dataset
Generate speech with your TTS model for every pair in the evaluation dataset,
using each .wav file as the speaker reference and each .txt file as the prompt.
Then create a .tar.gz file from the results, making sure to include both the synthesized .wav files and the corresponding .txt files.
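As an illustration, a minimal sketch of this step (the `synthesize` function is a placeholder for your own model, and the directory and archive names are assumptions):
```python
import tarfile
from pathlib import Path

def synthesize(text, speaker_wav):
    # Placeholder: run your TTS model here and return wav bytes.
    raise NotImplementedError

out_dir = Path("my_tts_dataset")
out_dir.mkdir(exist_ok=True)

for txt_path in sorted(Path("eval").glob("*.txt")):
    wav_ref = txt_path.with_suffix(".wav")  # speaker reference
    text = txt_path.read_text().strip()     # prompt
    (out_dir / wav_ref.name).write_bytes(synthesize(text, wav_ref))
    (out_dir / txt_path.name).write_text(text)

# Package the synthesized wav files and their prompts for submission.
with tarfile.open("my_tts_dataset.tar.gz", "w:gz") as tar:
    for f in sorted(out_dir.iterdir()):
        tar.add(f, arcname=f.name)
```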
### 3) Submit your TTS dataset
Submit your dataset below.
"""
CITATION_TEXT = """
@misc{minixhofer2024ttsds,
    title={TTSDS -- Text-to-Speech Distribution Score},
    author={Christoph Minixhofer and Ondřej Klejch and Peter Bell},
    year={2024},
    eprint={2407.12707},
    archivePrefix={arXiv},
    primaryClass={eess.AS},
    url={https://arxiv.org/abs/2407.12707},
}
"""