tobiolatunji committed on
Commit
b84a6da
1 Parent(s): d8bbc6a

initial commit -m

Files changed (8)
  1. .gitignore +0 -0
  2. LICENSE.txt +84 -0
  3. README.md +84 -1
  4. config.json +205 -0
  5. dvae.pth +3 -0
  6. mel_stats.pth +3 -0
  7. model.pth +3 -0
  8. vocab.json +0 -0
.gitignore ADDED
File without changes
LICENSE.txt ADDED
@@ -0,0 +1,84 @@
1
+ # Coqui Public Model License 1.0.0
2
+ https://coqui.ai/cpml.txt
3
+
4
+
5
+ This license allows only non-commercial use of a machine learning model and its outputs.
6
+
7
+
8
+ ## Acceptance
9
+
10
+
11
+ In order to get any license under these terms, you must agree to them as both strict obligations and conditions to all your licenses.
12
+
13
+
14
+ ## Licenses
15
+
16
+
17
+ The licensor grants you a copyright license to do everything you might do with the model that would otherwise infringe the licensor's copyright in it, for any non-commercial purpose. The licensor grants you a patent license that covers patent claims the licensor can license, or becomes able to license, that you would infringe by using the model in the form provided by
18
+ the licensor, for any non-commercial purpose.
19
+
20
+
21
+ ## Non-commercial Purpose
22
+
23
+
24
+ Non-commercial purposes include any of the following uses of the model or its output, but only so far as you do not receive any direct or indirect payment arising from the use of the model or its output.
25
+
26
+
27
+ ### Personal use for research, experiment, and testing for the benefit of public knowledge, personal study, private entertainment, hobby projects, amateur pursuits, or religious
28
+ observance.
29
+
30
+
31
+ ### Use by commercial or for-profit entities for testing, evaluation, or non-commercial research and development. Use of the model to train other models for commercial use is not a non-commercial purpose.
32
+
33
+
34
+ ### Use by any charitable organization for charitable purposes, or for testing or evaluation. Use for revenue-generating activity, including projects directly funded by government grants, is not a non-commercial purpose.
35
+
36
+
37
+ ## Notices
38
+
39
+
40
+ You must ensure that anyone who gets a copy of any part of the model, or any modification of the model, or their output, from you also gets a copy of these terms or the URL for them above.
41
+
42
+
43
+ ## No Other Rights
44
+
45
+
46
+ These terms do not allow you to sublicense or transfer any of your licenses to anyone else, or prevent the licensor from granting licenses to anyone else. These terms do not imply
47
+ any other licenses.
48
+
49
+
50
+ ## Patent Defense
51
+
52
+
53
+ If you make any written claim that the model infringes or contributes to infringement of any patent, your licenses for the model granted under these terms ends immediately. If your company makes such a claim, your patent license ends immediately for work on behalf of your company.
54
+
55
+
56
+ ## Violations
57
+
58
+
59
+ The first time you are notified in writing that you have violated any of these terms, or done anything with the model or its output that is not covered by your licenses, your licenses can nonetheless continue if you come into full compliance with these terms, and take practical steps to correct past violations, within 30 days of receiving notice. Otherwise, all your licenses
60
+ end immediately.
61
+
62
+
63
+ ## No Liability
64
+
65
+
66
+ ***As far as the law allows, the model and its output come as is, without any warranty or condition, and the licensor will not be liable to you for any damages arising out of these terms or the use or nature of the model or its output, under any kind of legal claim. If this provision is not enforceable in your jurisdiction, your licenses are void.***
67
+
68
+
69
+ ## Definitions
70
+
71
+
72
+ The **licensor** is the individual or entity offering these terms, and the **model** is the model the licensor makes available under these terms, including any documentation or similar information about the model.
73
+
74
+
75
+ **You** refers to the individual or entity agreeing to these terms.
76
+
77
+
78
+ **Your company** is any legal entity, sole proprietorship, or other kind of organization that you work for, plus all organizations that have control over, are under the control of, or are under common control with that organization. **Control** means ownership of substantially all the assets of an entity, or the power to direct its management and policies by vote, contract, or otherwise. Control can be direct or indirect.
79
+
80
+
81
+ **Your licenses** are all the licenses granted to you under these terms.
82
+
83
+
84
+ **Use** means anything you do with the model or its output requiring one of your licenses.
README.md CHANGED
@@ -1,3 +1,86 @@
1
  ---
2
- license: cc-by-nc-sa-4.0
3
  ---
1
  ---
2
+ license: other
3
+ license_name: coqui-public-model-license
4
+ license_link: https://coqui.ai/cpml
5
+ library_name: coqui
6
+ pipeline_tag: text-to-speech
7
+ widget:
8
+ - text: "Abraham said today is a good day to sound like an African"
9
  ---
10
+
11
+ # Afro-TTS
12
+ Afro-TTS is the first pan-African accented English speech synthesis system, capable of generating speech in 86 African accents. It includes 1000 personas representing the rich phonological diversity across the continent, for applications in education, public health, and automated content creation. Afro-TTS can clone a voice into different African accents from a 6-second audio clip.
13
+ The model was adapted from XTTS, which was developed by [Coqui Studio](https://coqui.ai/).
14
+
15
+ Read more about this model in our paper: https://arxiv.org/abs/2406.11727
16
+
17
+ ### Features
18
+ - Supports 86 unique African accents
19
+ - Voice cloning with just a 6-second audio clip
20
+ - Emotion and style transfer by cloning
21
+ - Multi-accent English speech generation
22
+ - 24kHz sampling rate for high-quality audio
23
+
24
+ ## Performance
25
+
26
+ Afro-TTS achieves near ground truth Mean Opinion Scores (MOS) for naturalness and accentedness. Objective and subjective evaluations demonstrated that the model generates natural-sounding accented speech, bridging the current gap in the representation of African voices in speech synthesis.
27
+
28
+
29
+ ### Languages
30
+ Afro-TTS supports only English for now.
31
+ Stay tuned as we continue to add support for more languages. If you have any language requests, feel free to reach out!
32
+
33
+ ### Code
34
+ The codebase accompanying the paper is available [here](https://github.com/intron-innovation/AfriSpeech-TTS).
35
+
36
+
37
+ ### License
38
+ This model is licensed under [Coqui Public Model License](https://coqui.ai/cpml). There's a lot that goes into a license for generative models, and you can read more of [the origin story of CPML here](https://coqui.ai/blog/tts/cpml).
39
+
40
+ ### Contact
41
+ Come and join our Bioramp community. We're active on the [Masakhane Slack server](https://join.slack.com/t/masakhane-nlp/shared_invite/zt-1zgnxx911-YWvICNas~mpeKDNqiO3r3g) and on our [website](https://bioramp.org/).
42
+ You can also email the authors at sewade.ogun@inria.fr and tobi@intron.io.
43
+
44
+ #### Using Afro-TTS:
45
+
46
+ Install the Coqui TTS package:
47
+
48
+ ```bash
49
+ pip install TTS
50
+ ```
51
+ Run the following code:
52
+ ```python
53
+ from scipy.io.wavfile import write
54
+ from TTS.tts.configs.xtts_config import XttsConfig
55
+ from TTS.tts.models.xtts import Xtts
56
+
57
+ config = XttsConfig()
58
+ config.load_json("intronhealth/afro-tts/config.json")
59
+ model = Xtts.init_from_config(config)
60
+ model.load_checkpoint(config, checkpoint_dir="intronhealth/afro-tts/", eval=True)
61
+ model.cuda()
62
+
63
+ outputs = model.synthesize(
64
+ "Abraham said today is a good day to sound like an African.",
65
+ config,
66
+ speaker_wav="audios/reference_accent.wav",
67
+ gpt_cond_len=3,
68
+ language="en",
69
+ )
70
+
71
+ write("audios/output.wav", 24000, outputs['wav'])
72
+
73
+
74
+ ```
75
+
76
+ ### BibTeX entry and citation info.
77
+ ```
78
+ @misc{ogun20241000,
79
+ title={1000 African Voices: Advancing inclusive multi-speaker multi-accent speech synthesis},
80
+ author={Sewade Ogun and Abraham T. Owodunni and Tobi Olatunji and Eniola Alese and Babatunde Oladimeji and Tejumade Afonja and Kayode Olaleye and Naome A. Etori and Tosin Adewumi},
81
+ year={2024},
82
+ eprint={2406.11727},
83
+ archivePrefix={arXiv},
84
+ primaryClass={eess.AS}
85
+ }
86
+ ```
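The README's usage snippet passes a reference recording as `speaker_wav` and mentions a clip of about 6 seconds. Here is a minimal sketch, assuming an uncompressed PCM WAV file, of checking a clip's duration with Python's standard `wave` module before synthesis. The helper names `wav_duration_seconds` and `write_silence` are hypothetical and are not part of the Coqui TTS API; the silent clip is only a stand-in for a real recording.

```python
import os
import tempfile
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def write_silence(path: str, seconds: int, sample_rate: int = 22050) -> None:
    """Write a mono 16-bit WAV of silence, used here only as demo input."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"\x00\x00" * sample_rate * seconds)


# Demo: create a 6-second placeholder clip and measure it.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "reference_accent.wav")
write_silence(demo_path, seconds=6)
duration = wav_duration_seconds(demo_path)
print(f"{duration:.1f} s")
```

In practice the same check could be run on the real reference recording before it is passed to `model.synthesize`.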
config.json ADDED
@@ -0,0 +1,205 @@
1
+ {
2
+ "output_path": "/srv/storage/talc2@talc-data2.nancy/multispeech/calcul/users/sogun/coqui-ai-TTS/recipes/ljspeech/xtts_v2/run/training",
3
+ "logger_uri": null,
4
+ "run_name": "GPT_XTTS_v2.0_AfroTTS_FT",
5
+ "project_name": "XTTS_trainer",
6
+ "run_description": "\n GPT XTTS training\n ",
7
+ "print_step": 5000,
8
+ "plot_step": 5000,
9
+ "model_param_stats": false,
10
+ "wandb_entity": null,
11
+ "dashboard_logger": "tensorboard",
12
+ "save_on_interrupt": true,
13
+ "log_model_step": 1000,
14
+ "save_step": 5000,
15
+ "save_n_checkpoints": 3,
16
+ "save_checkpoints": true,
17
+ "save_all_best": false,
18
+ "save_best_after": 0,
19
+ "target_loss": null,
20
+ "print_eval": true,
21
+ "test_delay_epochs": 0,
22
+ "run_eval": true,
23
+ "run_eval_steps": null,
24
+ "distributed_backend": "nccl",
25
+ "distributed_url": "tcp://localhost:54321",
26
+ "mixed_precision": false,
27
+ "precision": "fp16",
28
+ "epochs": 1000,
29
+ "batch_size": 2,
30
+ "eval_batch_size": 2,
31
+ "grad_clip": 0.0,
32
+ "scheduler_after_epoch": true,
33
+ "lr": 5e-06,
34
+ "optimizer": "AdamW",
35
+ "optimizer_params": {
36
+ "betas": [
37
+ 0.9,
38
+ 0.96
39
+ ],
40
+ "eps": 1e-08,
41
+ "weight_decay": 0.01
42
+ },
43
+ "lr_scheduler": "MultiStepLR",
44
+ "lr_scheduler_params": {
45
+ "milestones": [
46
+ 900000,
47
+ 2700000,
48
+ 5400000
49
+ ],
50
+ "gamma": 0.5,
51
+ "last_epoch": -1
52
+ },
53
+ "use_grad_scaler": false,
54
+ "allow_tf32": false,
55
+ "cudnn_enable": true,
56
+ "cudnn_deterministic": false,
57
+ "cudnn_benchmark": false,
58
+ "training_seed": 1,
59
+ "model": "xtts",
60
+ "num_loader_workers": 8,
61
+ "num_eval_loader_workers": 0,
62
+ "use_noise_augment": false,
63
+ "audio": {
64
+ "sample_rate": 22050,
65
+ "output_sample_rate": 24000,
66
+ "dvae_sample_rate": 22050
67
+ },
68
+ "use_phonemes": false,
69
+ "phonemizer": null,
70
+ "phoneme_language": null,
71
+ "compute_input_seq_cache": false,
72
+ "text_cleaner": null,
73
+ "enable_eos_bos_chars": false,
74
+ "test_sentences_file": "",
75
+ "phoneme_cache_path": null,
76
+ "characters": null,
77
+ "add_blank": false,
78
+ "batch_group_size": 64,
79
+ "loss_masking": null,
80
+ "min_audio_len": 1,
81
+ "max_audio_len": Infinity,
82
+ "min_text_len": 1,
83
+ "max_text_len": Infinity,
84
+ "compute_f0": false,
85
+ "compute_energy": false,
86
+ "compute_linear_spec": false,
87
+ "precompute_num_workers": 0,
88
+ "start_by_longest": false,
89
+ "shuffle": false,
90
+ "drop_last": false,
91
+ "datasets": [
92
+ {
93
+ "formatter": "",
94
+ "dataset_name": "",
95
+ "path": "",
96
+ "meta_file_train": "",
97
+ "ignored_speakers": null,
98
+ "language": "",
99
+ "phonemizer": "",
100
+ "meta_file_val": "",
101
+ "meta_file_attn_mask": ""
102
+ }
103
+ ],
104
+ "test_sentences": [
105
+ {
106
+ "text": "It took me quite a long time to develop a voice, and now that I have it I'm not going to be silent.",
107
+ "speaker_wav": [
108
+ "/srv/storage/talc2@talc-data2.nancy/multispeech/calcul/users/sogun/AfriSpeech-TTS/afrispeech_16k_trimmed/AfriSpeech-TTS/train/defc5e03-926c-4e0b-a639-c821e5e7db89/14f64f13c57f9a64a2a1521253934a0b_KYA8MaKS.wav"
109
+ ],
110
+ "language": "en"
111
+ },
112
+ {
113
+ "text": "This cake is great. It's so delicious and moist.",
114
+ "speaker_wav": [
115
+ "/srv/storage/talc2@talc-data2.nancy/multispeech/calcul/users/sogun/AfriSpeech-TTS/afrispeech_16k_trimmed/AfriSpeech-TTS/train/defc5e03-926c-4e0b-a639-c821e5e7db89/14f64f13c57f9a64a2a1521253934a0b_KYA8MaKS.wav"
116
+ ],
117
+ "language": "en"
118
+ }
119
+ ],
120
+ "eval_split_max_size": 256,
121
+ "eval_split_size": 0.01,
122
+ "use_speaker_weighted_sampler": false,
123
+ "speaker_weighted_sampler_alpha": 1.0,
124
+ "use_language_weighted_sampler": false,
125
+ "language_weighted_sampler_alpha": 1.0,
126
+ "use_length_weighted_sampler": false,
127
+ "length_weighted_sampler_alpha": 1.0,
128
+ "model_args": {
129
+ "gpt_batch_size": 1,
130
+ "enable_redaction": false,
131
+ "kv_cache": true,
132
+ "gpt_checkpoint": "",
133
+ "clvp_checkpoint": null,
134
+ "decoder_checkpoint": null,
135
+ "num_chars": 255,
136
+ "tokenizer_file": "/srv/storage/talc2@talc-data2.nancy/multispeech/calcul/users/sogun/coqui-ai-TTS/recipes/ljspeech/xtts_v2/run/training/XTTS_v2.0_original_model_files/vocab.json",
137
+ "gpt_max_audio_tokens": 605,
138
+ "gpt_max_text_tokens": 402,
139
+ "gpt_max_prompt_tokens": 70,
140
+ "gpt_layers": 30,
141
+ "gpt_n_model_channels": 1024,
142
+ "gpt_n_heads": 16,
143
+ "gpt_number_text_tokens": 6681,
144
+ "gpt_start_text_token": 261,
145
+ "gpt_stop_text_token": 0,
146
+ "gpt_num_audio_tokens": 1026,
147
+ "gpt_start_audio_token": 1024,
148
+ "gpt_stop_audio_token": 1025,
149
+ "gpt_code_stride_len": 1024,
150
+ "gpt_use_masking_gt_prompt_approach": true,
151
+ "gpt_use_perceiver_resampler": true,
152
+ "input_sample_rate": 22050,
153
+ "output_sample_rate": 24000,
154
+ "output_hop_length": 256,
155
+ "decoder_input_dim": 1024,
156
+ "d_vector_dim": 512,
157
+ "cond_d_vector_in_each_upsampling_layer": true,
158
+ "duration_const": 102400,
159
+ "min_conditioning_length": 66150,
160
+ "max_conditioning_length": 132300,
161
+ "gpt_loss_text_ce_weight": 0.01,
162
+ "gpt_loss_mel_ce_weight": 1.0,
163
+ "debug_loading_failures": false,
164
+ "max_wav_length": 255995,
165
+ "max_text_length": 300,
166
+ "mel_norm_file": "/srv/storage/talc2@talc-data2.nancy/multispeech/calcul/users/sogun/coqui-ai-TTS/recipes/ljspeech/xtts_v2/run/training/XTTS_v2.0_original_model_files/mel_stats.pth",
167
+ "dvae_checkpoint": "/srv/storage/talc2@talc-data2.nancy/multispeech/calcul/users/sogun/coqui-ai-TTS/recipes/ljspeech/xtts_v2/run/training/XTTS_v2.0_original_model_files/dvae.pth",
168
+ "xtts_checkpoint": "/srv/storage/talc2@talc-data2.nancy/multispeech/calcul/users/sogun/coqui-ai-TTS/recipes/ljspeech/xtts_v2/run/training/GPT_XTTS_v2.0_AfroTTS_FT-March-06-2024_12+32AM-581cf506/checkpoint_30000.pth",
169
+ "vocoder": ""
170
+ },
171
+ "model_dir": null,
172
+ "languages": [
173
+ "en",
174
+ "es",
175
+ "fr",
176
+ "de",
177
+ "it",
178
+ "pt",
179
+ "pl",
180
+ "tr",
181
+ "ru",
182
+ "nl",
183
+ "cs",
184
+ "ar",
185
+ "zh-cn",
186
+ "hu",
187
+ "ko",
188
+ "ja",
189
+ "hi"
190
+ ],
191
+ "temperature": 0.85,
192
+ "length_penalty": 1.0,
193
+ "repetition_penalty": 2.0,
194
+ "top_k": 50,
195
+ "top_p": 0.85,
196
+ "num_gpt_outputs": 1,
197
+ "gpt_cond_len": 12,
198
+ "gpt_cond_chunk_len": 4,
199
+ "max_ref_len": 10,
200
+ "sound_norm_refs": false,
201
+ "optimizer_wd_only_on_weights": true,
202
+ "weighted_loss_attrs": {},
203
+ "weighted_loss_multipliers": {},
204
+ "github_branch": "* dev"
205
+ }
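Two details of this config are easy to miss: it contains the bare token `Infinity`, which strict JSON parsers reject but Python's `json` module accepts by default, and the conditioning lengths are stored as plain integers. The sketch below parses an excerpt of the fields shown above and converts the conditioning lengths to seconds. It assumes those lengths are sample counts at the 22050 Hz input rate; the diff does not state the unit. The function name `conditioning_window_seconds` is hypothetical.

```python
import json
import math

# Excerpt of fields copied from the config.json diff above (not the full file).
CONFIG_EXCERPT = """
{
  "max_audio_len": Infinity,
  "audio": {"sample_rate": 22050, "output_sample_rate": 24000},
  "model_args": {
    "min_conditioning_length": 66150,
    "max_conditioning_length": 132300
  }
}
"""


def conditioning_window_seconds(config: dict) -> tuple:
    """Convert the conditioning lengths to seconds.

    Assumption: the lengths are sample counts at the input sample rate.
    """
    sample_rate = config["audio"]["sample_rate"]
    args = config["model_args"]
    return (
        args["min_conditioning_length"] / sample_rate,
        args["max_conditioning_length"] / sample_rate,
    )


# Python's json module parses the non-standard Infinity token as float("inf").
config = json.loads(CONFIG_EXCERPT)
shortest, longest = conditioning_window_seconds(config)
print(shortest, longest, math.isinf(config["max_audio_len"]))
```

Under that assumption the window runs from 3 to 6 seconds, which lines up with the 6-second reference clip mentioned in the README.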
dvae.pth ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b29bc227d410d4991e0a8c09b858f77415013eeb9fba9650258e96095557d97a
3
+ size 210514388
mel_stats.pth ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:1f69422a8a8f344c4fca2f0c6b8d41d2151d6615b7321e48e6bb15ae949b119c
3
+ size 1067
model.pth ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:e90d054911b2fe01f85194ff7c4c5bfb63fd2104bc8995afe04ba392e0038415
3
+ size 5607927893
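The three `.pth` entries above are Git LFS pointer files rather than the weights themselves: each records a spec version, a SHA-256 object ID, and the size in bytes. The following is a minimal sketch of parsing such a pointer and checking a downloaded file against it. The helper names `parse_lfs_pointer` and `matches_pointer` are hypothetical, and the code assumes the `oid` always uses `sha256`, as it does in this commit.

```python
import hashlib
import os


def parse_lfs_pointer(text: str) -> dict:
    """Parse the `key value` lines of a Git LFS pointer file."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    algorithm, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algorithm": algorithm,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path: str, pointer: dict) -> bool:
    """Check a local file's size and SHA-256 digest against a parsed pointer."""
    if os.path.getsize(path) != pointer["size"]:
        return False
    hasher = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in 1 MiB chunks so multi-gigabyte checkpoints fit in memory.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]


# Pointer for mel_stats.pth, copied from the diff above.
MEL_STATS_POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:1f69422a8a8f344c4fca2f0c6b8d41d2151d6615b7321e48e6bb15ae949b119c
size 1067
"""
pointer = parse_lfs_pointer(MEL_STATS_POINTER)
print(pointer["algorithm"], pointer["size"])
```

Calling `matches_pointer("mel_stats.pth", pointer)` on a local copy would then show whether the real file was fetched, or only the pointer text.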
vocab.json ADDED
The diff for this file is too large to render. See raw diff