{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "f79d99ef",
   "metadata": {},
   "source": [
    "# Train your first 🐸 TTS model 💫\n",
    "\n",
    "### 👋 Hello and welcome to Coqui (🐸) TTS\n",
    "\n",
    "The goal of this notebook is to show you a **typical workflow** for **training** and **testing** a TTS model with 🐸.\n",
    "\n",
    "Let's train a very small model on a very small amount of data so we can iterate quickly.\n",
    "\n",
    "In this notebook, we will:\n",
    "\n",
    "1. Download data and format it for 🐸 TTS.\n",
    "2. Configure the training and testing runs.\n",
    "3. Train a new model.\n",
    "4. Test the model and display its performance.\n",
    "\n",
    "So, let's jump right in!\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fa2aec78",
   "metadata": {},
   "outputs": [],
   "source": [
    "## Install Coqui TTS\n",
    "! pip install -U pip\n",
    "! pip install TTS"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "be5fe49c",
   "metadata": {},
   "source": [
    "## ✅ Data Preparation\n",
    "\n",
    "### **First things first**: we need some data.\n",
    "\n",
    "We're training a Text-to-Speech model, so we need some _text_ and some _speech_. Specifically, we want _transcribed speech_: the speech must be divided into audio clips, and each clip needs a transcription. More details about data requirements, such as recording characteristics, background noise and vocabulary coverage, can be found in the [🐸TTS documentation](https://tts.readthedocs.io/en/latest/formatting_your_dataset.html).\n",
    "\n",
    "If you have a single audio file, you need to **split** it into clips. It is also important to use a lossless audio file format to prevent compression artifacts. We recommend the **wav** file format.\n",
    "\n",
    "The data format we will adopt for this tutorial is taken from the widely used **LJSpeech** dataset, where the **wav files** are collected under a folder:\n",
    "\n",
    "<span style=\"color:purple;font-size:15px\">\n",
    "/wavs<br /> \n",
    " &emsp;| - audio1.wav<br /> \n",
    " &emsp;| - audio2.wav<br /> \n",
    " &emsp;| - audio3.wav<br /> \n",
    "  ...<br /> \n",
    "</span>\n",
    "\n",
    "and a **metadata.csv** file will have the audio file name in parallel to the transcript, delimited by `|`: \n",
    " \n",
    "<span style=\"color:purple;font-size:15px\">\n",
    "# metadata.csv <br /> \n",
    "audio1|This is my sentence. <br /> \n",
    "audio2|This is maybe my sentence. <br /> \n",
    "audio3|This is certainly my sentence. <br /> \n",
    "audio4|Let this be your sentence. <br /> \n",
    "...\n",
    "</span>\n",
    "\n",
    "In the end, we should have the following **folder structure**:\n",
    "\n",
    "<span style=\"color:purple;font-size:15px\">\n",
    "/MyTTSDataset <br /> \n",
    "&emsp;| <br /> \n",
    "&emsp;| -> metadata.csv<br /> \n",
    "&emsp;| -> /wavs<br /> \n",
    "&emsp;&emsp;| -> audio1.wav<br /> \n",
    "&emsp;&emsp;| -> audio2.wav<br /> \n",
    "&emsp;&emsp;| ...<br /> \n",
    "</span>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "69501a10-3b53-4e75-ae66-90221d6f2271",
   "metadata": {},
   "source": [
    "🐸TTS already provides tooling for the _LJSpeech_ format. If you use the same format, you can start training your models right away. <br /> \n",
    "\n",
    "After you collect and format your dataset, you need to check two things: whether you need a **_formatter_** and whether you need a **_text_cleaner_**. <br /> The **_formatter_** loads the text file (created above) as a list, and the **_text_cleaner_** performs a sequence of text normalization operations that convert the raw text into its spoken representation (e.g. converting numbers, acronyms, and symbols to their spoken form).\n",
    "\n",
    "If you use a dataset format different from LJSpeech or the other public datasets that 🐸TTS supports, then you need to write your own **_formatter_** and **_text_cleaner_**."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e7f226c8-4e55-48fa-937b-8415d539b17c",
   "metadata": {},
   "source": [
    "## ⏳️ Loading your dataset\n",
    "Load one of the datasets supported by 🐸TTS.\n",
    "\n",
    "We will start by defining the dataset config: we set LJSpeech as our target dataset and define its path.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b3cb0191-b8fc-4158-bd26-8423c2a8ba66",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "# BaseDatasetConfig: defines name, formatter and path of the dataset.\n",
    "from TTS.tts.configs.shared_configs import BaseDatasetConfig\n",
    "\n",
    "output_path = \"tts_train_dir\"\n",
    "if not os.path.exists(output_path):\n",
    "    os.makedirs(output_path)\n",
    "    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ae6b7019-3685-4b48-8917-c152e288d7e3",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Download and extract LJSpeech dataset.\n",
    "\n",
    "!wget -O $output_path/LJSpeech-1.1.tar.bz2 https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2 \n",
    "!tar -xf $output_path/LJSpeech-1.1.tar.bz2 -C $output_path"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "76cd3ab5-6387-45f1-b488-24734cc1beb5",
   "metadata": {},
   "outputs": [],
   "source": [
    "dataset_config = BaseDatasetConfig(\n",
    "    formatter=\"ljspeech\", meta_file_train=\"metadata.csv\", path=os.path.join(output_path, \"LJSpeech-1.1/\")\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ae82fd75",
   "metadata": {},
   "source": [
    "## ✅ Train a new model\n",
    "\n",
    "Let's kick off a training run 🚀🚀🚀.\n",
    "\n",
    "The choice of model architecture depends on your needs and available resources. Each model architecture has its pros and cons that determine the run-time efficiency and the voice quality.\n",
    "We have many recipes under `TTS/recipes/` that provide a good starting point. For this tutorial, we will be using `GlowTTS`."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f5876e46-2aee-4bcf-b6b3-9e3c535c553f",
   "metadata": {},
   "source": [
    "We will begin by initializing the model training configuration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5483ca28-39d6-49f8-a18e-4fb53c50ad84",
   "metadata": {},
   "outputs": [],
   "source": [
    "# GlowTTSConfig: all model related values for training, validating and testing.\n",
    "from TTS.tts.configs.glow_tts_config import GlowTTSConfig\n",
    "config = GlowTTSConfig(\n",
    "    batch_size=32,\n",
    "    eval_batch_size=16,\n",
    "    num_loader_workers=4,\n",
    "    num_eval_loader_workers=4,\n",
    "    run_eval=True,\n",
    "    test_delay_epochs=-1,\n",
    "    epochs=100,\n",
    "    text_cleaner=\"phoneme_cleaners\",\n",
    "    use_phonemes=True,\n",
    "    phoneme_language=\"en-us\",\n",
    "    phoneme_cache_path=os.path.join(output_path, \"phoneme_cache\"),\n",
    "    print_step=25,\n",
    "    print_eval=False,\n",
    "    mixed_precision=True,\n",
    "    output_path=output_path,\n",
    "    datasets=[dataset_config],\n",
    "    save_step=1000,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b93ed377-80b7-447b-bd92-106bffa777ee",
   "metadata": {},
   "source": [
    "Next we will initialize the audio processor which is used for feature extraction and audio I/O."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b1b12f61-f851-4565-84dd-7640947e04ab",
   "metadata": {},
   "outputs": [],
   "source": [
    "from TTS.utils.audio import AudioProcessor\n",
    "ap = AudioProcessor.init_from_config(config)\n",
    "# Modify the sample rate if you use a custom audio dataset:\n",
    "# ap.sample_rate = 22050\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1d461683-b05e-403f-815f-8007bda08c38",
   "metadata": {},
   "source": [
    "Next we will initialize the tokenizer which is used to convert text to sequences of token IDs.  If characters are not defined in the config, default characters are passed to the config."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "014879b7-f18d-44c0-b24a-e10f8002113a",
   "metadata": {},
   "outputs": [],
   "source": [
    "from TTS.tts.utils.text.tokenizer import TTSTokenizer\n",
    "tokenizer, config = TTSTokenizer.init_from_config(config)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "df3016e1-9e99-4c4f-94e3-fa89231fd978",
   "metadata": {},
   "source": [
    "Next we will load data samples. Each sample is a list of ```[text, audio_file_path, speaker_name]```. You can define your custom sample loader returning the list of samples."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cadd6ada-c8eb-4f79-b8fe-6d72850af5a7",
   "metadata": {},
   "outputs": [],
   "source": [
    "from TTS.tts.datasets import load_tts_samples\n",
    "train_samples, eval_samples = load_tts_samples(\n",
    "    dataset_config,\n",
    "    eval_split=True,\n",
    "    eval_split_max_size=config.eval_split_max_size,\n",
    "    eval_split_size=config.eval_split_size,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "db8b451e-1fe1-4aa3-b69e-ab22b925bd19",
   "metadata": {},
   "source": [
    "Now we're ready to initialize the model.\n",
    "\n",
    "Models take a config object and a speaker manager as input. Config defines the details of the model like the number of layers, the size of the embedding, etc. Speaker manager is used by multi-speaker models."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ac2ffe3e-ad0c-443e-800c-9b076ee811b4",
   "metadata": {},
   "outputs": [],
   "source": [
    "from TTS.tts.models.glow_tts import GlowTTS\n",
    "model = GlowTTS(config, ap, tokenizer, speaker_manager=None)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e2832c56-889d-49a6-95b6-eb231892ecc6",
   "metadata": {},
   "source": [
    "Trainer provides a generic API to train all the 🐸TTS models with all its perks like mixed-precision training, distributed training, etc."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0f609945-4fe0-4d0d-b95e-11d7bfb63ebe",
   "metadata": {},
   "outputs": [],
   "source": [
    "from trainer import Trainer, TrainerArgs\n",
    "trainer = Trainer(\n",
    "    TrainerArgs(), config, output_path, model=model, train_samples=train_samples, eval_samples=eval_samples\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5b320831-dd83-429b-bb6a-473f9d49d321",
   "metadata": {},
   "source": [
    "### AND... 3,2,1... START TRAINING 🚀🚀🚀"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d4c07f99-3d1d-4bea-801e-9f33bbff0e9f",
   "metadata": {},
   "outputs": [],
   "source": [
    "trainer.fit()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4cff0c40-2734-40a6-a905-e945a9fb3e98",
   "metadata": {},
   "source": [
    "#### 🚀 Run TensorBoard. 🚀\n",
    "You can monitor the progress of your model both in the notebook and in TensorBoard. TensorBoard also provides figures and sample outputs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5a85cd3b-1646-40ad-a6c2-49323e08eeec",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install tensorboard\n",
    "!tensorboard --logdir=tts_train_dir"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9f6dc959",
   "metadata": {},
   "source": [
    "## ✅ Test the model\n",
    "\n",
    "We made it! 🙌\n",
    "\n",
    "Let's kick off the testing run, which displays performance metrics.\n",
    "\n",
    "We're committing the cardinal sin of ML 😈 (aka testing on our training data), so you don't want to deploy this model into production. In this notebook we're focusing on the workflow itself, so it's forgivable 😇\n",
    "\n",
    "You can see from the test output that our tiny model has overfit to the data, and basically memorized this one sentence.\n",
    "\n",
    "When you start training your own models, make sure your testing data doesn't include your training data 😅"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "99fada7a-592f-4a09-9369-e6f3d82de3a0",
   "metadata": {},
   "source": [
    "Let's get the latest saved checkpoint. "
   ]
  },
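  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6dd47ed5-da8e-4bf9-b524-d686630d6961",
   "metadata": {},
   "outputs": [],
   "source": [
    "import glob, os\n",
    "output_path = \"tts_train_dir\"\n",
    "# Collect checkpoints and configs from all run folders, oldest first.\n",
    "ckpts = sorted(glob.glob(output_path + \"/*/*.pth\"), key=os.path.getmtime)\n",
    "configs = sorted(glob.glob(output_path + \"/*/*.json\"), key=os.path.getmtime)\n",
    "# Use the most recently written files in the `tts` call below.\n",
    "test_ckpt = ckpts[-1]\n",
    "test_config = configs[-1]"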
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dd42bc7a",
   "metadata": {},
   "outputs": [],
   "source": [
    " !tts --text \"Text for TTS\" \\\n",
    "      --model_path $test_ckpt \\\n",
    "      --config_path $test_config \\\n",
    "      --out_path out.wav"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "81cbcb3f-d952-469b-a0d8-8941cd7af670",
   "metadata": {},
   "source": [
    "## 📣 Listen to the synthesized wave 📣"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e0000bd6-6763-4a10-a74d-911dd08ebcff",
   "metadata": {},
   "outputs": [],
   "source": [
    "import IPython\n",
    "IPython.display.Audio(\"out.wav\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "13914401-cad1-494a-b701-474e52829138",
   "metadata": {},
   "source": [
    "## 🎉 Congratulations! 🎉 You have now trained your first TTS model!\n",
    "Follow up with the next tutorials to learn more advanced material."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "950d9fc6-896f-4a2c-86fd-8fd1fcbbb3f7",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}