---
license: mit
tags:
  - speech
  - text
  - cross-modal
  - unified model
  - self-supervised learning
  - SpeechT5
  - Voice Conversion
datasets:
  - CMU ARCTIC
  - bdl
  - clb
  - rms
  - slt
---

# SpeechT5 VC Manifest

| Github | Huggingface |

This manifest is an attempt to recreate the Voice Conversion recipe used for training SpeechT5. It was constructed from four CMU ARCTIC speakers (bdl, clb, rms, and slt), with 932 utterances for training, 100 utterances for validation, and 100 utterances for evaluation.

## Requirements

## Tools

- `manifest/utils` is used to extract speaker embeddings, generate the manifests, and apply the vocoder; hedged sketches of the embedding and vocoder steps follow this list.
- `manifest/arctic*` provides the pre-trained vocoder for each speaker.
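
As an illustration of the speaker-embedding step, here is a minimal sketch of extracting an utterance-level x-vector. It assumes SpeechBrain's pre-trained `spkrec-xvect-voxceleb` model and a hypothetical local WAV path (`arctic_a0001.wav`); the exact model and preprocessing used by `manifest/utils` may differ.

```python
# Minimal sketch: extract an x-vector speaker embedding for one utterance.
# Assumes SpeechBrain's spkrec-xvect-voxceleb model; the actual model and
# preprocessing used by manifest/utils may differ.
import torch
import torchaudio
from speechbrain.pretrained import EncoderClassifier

classifier = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-xvect-voxceleb",
    savedir="pretrained_models/spkrec-xvect-voxceleb",
)

# Load a CMU ARCTIC utterance (hypothetical local path).
wav, sr = torchaudio.load("arctic_a0001.wav")
if sr != 16000:  # the model expects 16 kHz input
    wav = torchaudio.functional.resample(wav, sr, 16000)

with torch.no_grad():
    embedding = classifier.encode_batch(wav)           # shape: (1, 1, 512)
    embedding = torch.nn.functional.normalize(embedding, dim=-1).squeeze()

print(embedding.shape)  # 512-dimensional speaker embedding
```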
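The vocoder step turns predicted log-Mel spectrograms back into waveforms. The per-speaker checkpoints under `manifest/arctic*` are loaded by `manifest/utils`; as a rough illustration of the same idea (not this repo's own loading code), here is how the analogous HiFi-GAN vocoder from the `transformers` SpeechT5 integration is applied.

```python
# Rough sketch of the "apply vocoder" step: convert a predicted log-Mel
# spectrogram into a waveform with a HiFi-GAN vocoder. Uses the transformers
# SpeechT5 HiFi-GAN as a stand-in; the per-speaker checkpoints under
# manifest/arctic* are loaded differently by manifest/utils.
import torch
from transformers import SpeechT5HifiGan

vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

# Dummy spectrogram of shape (time_frames, num_mel_bins=80); in practice this
# comes from the VC model's decoder output.
spectrogram = torch.randn(200, 80)

with torch.no_grad():
    waveform = vocoder(spectrogram)  # 1-D waveform at 16 kHz

print(waveform.shape)
```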

## Reference

If you find our work useful in your research, please cite the following paper:

```bibtex
@inproceedings{ao-etal-2022-speecht5,
    title = {{S}peech{T}5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing},
    author = {Ao, Junyi and Wang, Rui and Zhou, Long and Wang, Chengyi and Ren, Shuo and Wu, Yu and Liu, Shujie and Ko, Tom and Li, Qing and Zhang, Yu and Wei, Zhihua and Qian, Yao and Li, Jinyu and Wei, Furu},
    booktitle = {Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
    month = {May},
    year = {2022},
    pages = {5723--5738},
}
```