---
license: mit
tags:
- speech
- text
- cross-modal
- unified model
- self-supervised learning
- SpeechT5
- Voice Conversion
datasets:
- CMUARCTIC
- bdl
- clb
- rms
- slt
---

## SpeechT5 VC Manifest

| [**Github**](https://github.com/microsoft/SpeechT5) | [**Huggingface**](https://huggingface.co/mechanicalsea/speecht5-vc) |

This manifest is an attempt to recreate the Voice Conversion recipe used for training [SpeechT5](https://aclanthology.org/2022.acl-long.393). It was constructed from four [CMU ARCTIC](http://www.festvox.org/cmu_arctic/) speakers, i.e., bdl, clb, rms, and slt. There are 932 utterances for training, 100 utterances for validation, and 100 utterances for evaluation.
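
As a quick sanity check, the per-split counts can be verified against the manifest files. The file names in the sketch below (`train.tsv`, `valid.tsv`, `test.tsv`) are assumptions for illustration only; adapt them to the actual files shipped with this manifest.

```python
from pathlib import Path

# Hypothetical manifest file names; replace with the actual files in this repository.
SPLITS = {"train": "train.tsv", "valid": "valid.tsv", "test": "test.tsv"}
EXPECTED = {"train": 932, "valid": 100, "test": 100}

for split, fname in SPLITS.items():
    path = Path("manifest") / fname
    # Subtract 1 for the header row if the manifest is a fairseq-style TSV with a header.
    n_utts = sum(1 for _ in path.open()) - 1
    status = "OK" if n_utts == EXPECTED[split] else f"expected {EXPECTED[split]}"
    print(f"{split}: {n_utts} utterances ({status})")
```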

### News

- 8 February 2023: SpeechT5 is integrated as an official model into the Hugging Face Transformers library [[Blog](https://huggingface.co/blog/speecht5)] and [[Demo](https://huggingface.co/spaces/Matthijs/speecht5-vc-demo)].
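
With that integration, voice conversion can be run directly from Transformers. The sketch below assumes the `microsoft/speecht5_vc` and `microsoft/speecht5_hifigan` checkpoints on the Hugging Face Hub, a 16 kHz source utterance, and a pre-computed 512-dim x-vector of the target speaker; it illustrates the Transformers API rather than the fairseq recipe in this repository.

```python
import torch
import soundfile as sf
from transformers import SpeechT5Processor, SpeechT5ForSpeechToSpeech, SpeechT5HifiGan

# Checkpoints published with the Transformers integration (see the blog post above).
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_vc")
model = SpeechT5ForSpeechToSpeech.from_pretrained("microsoft/speecht5_vc")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

# Source utterance: a 16 kHz mono waveform (placeholder path).
speech, sampling_rate = sf.read("source.wav")
inputs = processor(audio=speech, sampling_rate=sampling_rate, return_tensors="pt")

# 512-dim x-vector of the target speaker (placeholder; extract a real one, e.g. with SpeechBrain).
speaker_embeddings = torch.randn(1, 512)

converted = model.generate_speech(inputs["input_values"], speaker_embeddings, vocoder=vocoder)
sf.write("converted.wav", converted.numpy(), samplerate=16000)
```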

### Requirements

- [SpeechBrain](https://github.com/speechbrain/speechbrain) for extracting speaker embeddings (see the sketch after this list).
- [Parallel WaveGAN](https://github.com/kan-bayashi/ParallelWaveGAN) for the vocoder implementation.
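
A minimal sketch of extracting a speaker embedding with SpeechBrain follows, assuming the `speechbrain/spkrec-xvect-voxceleb` x-vector encoder (the exact embedding setup used here is implemented in `manifest/utils`); the audio path is a placeholder.

```python
import torch
import torchaudio
from speechbrain.pretrained import EncoderClassifier

# x-vector speaker encoder; the model used by manifest/utils may differ.
classifier = EncoderClassifier.from_hparams(source="speechbrain/spkrec-xvect-voxceleb")

# Load a 16 kHz utterance of the target speaker (placeholder path).
signal, fs = torchaudio.load("arctic_a0001.wav")

with torch.no_grad():
    embedding = classifier.encode_batch(signal)  # shape: (1, 1, 512)
# L2-normalize, a common convention for x-vector conditioning.
embedding = torch.nn.functional.normalize(embedding, dim=-1).squeeze(1)  # (1, 512)
```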

### Tools

- `manifest/utils` is used to extract speaker embeddings, generate the manifest, and apply the vocoder.
- `manifest/arctic*` provides the pre-trained vocoder for each speaker (a usage sketch follows this list).
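
A rough sketch of applying one of the per-speaker Parallel WaveGAN vocoders is shown below. The checkpoint path is a placeholder for a file under `manifest/arctic*`, and the mel-spectrogram is assumed to match the feature configuration stored alongside that checkpoint.

```python
import torch
import soundfile as sf
from parallel_wavegan.utils import load_model

# Placeholder path; point this at the per-speaker checkpoint under manifest/arctic*.
vocoder = load_model("manifest/arctic_slt/checkpoint.pkl")
vocoder.remove_weight_norm()
vocoder = vocoder.eval()

# Placeholder mel-spectrogram of shape (num_frames, num_mels), matching the vocoder config.
mel = torch.randn(200, 80)
with torch.no_grad():
    wav = vocoder.inference(mel).view(-1)

sf.write("vocoded.wav", wav.cpu().numpy(), samplerate=16000)  # CMU ARCTIC audio is 16 kHz
```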

### Model and Samples

- [`speecht5_vc.pt`](./speecht5_vc.pt) is a re-implementation of the Voice Conversion fine-tuning on the released manifest, **but with a smaller batch size or fewer max updates**, mainly to verify that the manifest is correct.
- `samples` are created by the released fine-tuned model and vocoder.

### Reference

If you find our work useful in your research, please cite the following paper:

```bibtex
@inproceedings{ao-etal-2022-speecht5,
    title = {{S}peech{T}5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing},
    author = {Ao, Junyi and Wang, Rui and Zhou, Long and Wang, Chengyi and Ren, Shuo and Wu, Yu and Liu, Shujie and Ko, Tom and Li, Qing and Zhang, Yu and Wei, Zhihua and Qian, Yao and Li, Jinyu and Wei, Furu},
    booktitle = {Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
    month = {May},
    year = {2022},
    pages = {5723--5738},
}
```