# Introduction

In this tutorial, we will prepare a dataset using our [TTS Dataset Processing Scripts](https://github.com/NVIDIA/NeMo/tree/main/scripts/dataset_processing/tts) and use it for training a FastPitch model.

**This tutorial uses a different workflow than all other existing TTS tutorials. The scripts and classes used are all experimental and not yet ready for production**.

# License

> Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
>
> Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
>
> http://www.apache.org/licenses/LICENSE-2.0
>
> Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

# Install

In [None]:
BRANCH = 'main'
NEMO_ROOT_DIR = '/content/nemo'

In [None]:
# Install NeMo library. If you are running locally (rather than on Google Colab), comment out the below lines
# and instead follow the instructions at https://github.com/NVIDIA/NeMo#Installation
!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]

In [None]:

# Download local version of NeMo scripts. If you are running locally and want to use your own local NeMo code,
# comment out the below lines and set NEMO_ROOT_DIR to your local path.
!git clone -b $BRANCH https://github.com/NVIDIA/NeMo.git $NEMO_ROOT_DIR

# Dataset Preparation

For this tutorial, we use a subset of the [VCTK](https://datashare.ed.ac.uk/handle/10283/2950) dataset with 5 speakers (p225-p229).

In [None]:
import os
import tarfile
import wget
from pathlib import Path

from nemo.collections.asr.parts.utils.manifest_utils import read_manifest, write_manifest

In [None]:
# Configure nemo paths
NEMO_DIR = Path(NEMO_ROOT_DIR)
NEMO_EXAMPLES_DIR = NEMO_DIR / "examples" / "tts"
NEMO_CONFIG_DIR = NEMO_EXAMPLES_DIR / "conf"
NEMO_SCRIPT_DIR = NEMO_DIR / "scripts" / "dataset_processing" / "tts"

In [None]:
# Create dataset directory
root_dir = Path("/content")
data_root = root_dir / "data"

data_root.mkdir(parents=True, exist_ok=True)

In [None]:
# Download the dataset
dataset_url = "https://vctk-subset.s3.amazonaws.com/vctk_subset_multispeaker.tar.gz"
dataset_tar_filepath = data_root / "vctk.tar.gz"

if not os.path.exists(dataset_tar_filepath):
    wget.download(dataset_url, out=str(dataset_tar_filepath))

In [None]:
# Extract the dataset
with tarfile.open(dataset_tar_filepath) as tar_f:
    tar_f.extractall(data_root)

In [None]:
DATA_DIR = data_root / "vctk_subset_multispeaker"

In [None]:
# Visualize the raw dataset
train_raw_filepath = DATA_DIR / "train.json"
!head $train_raw_filepath

## Manifest Processing

The downloaded manifest uses our traditional format for TTS training. The scripts here require it to be formatted slightly differently.

The `speaker` field used to be an *integer* ID corresponding to an array index that the FastPitch model would query. Now we represent it as a *string* so we can give each speaker a human-friendly name. The mapping from speaker name to speaker index will be provided at training time.

As a best practice, we suggest prefixing the `speaker` field with the name of the dataset so that it is guaranteed to be unique across all datasets (e.g., *vctk_225* instead of *225*).

The `audio_filepath` field used to require an *absolute path* which had to be manually updated depending on where the dataset was on your computer. Absolute paths still work, but now you can optionally provide it as a *relative path*, with the root directory provided as an argument to each script.
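
As a minimal sketch of the two conventions, the snippet below converts one old-style entry to the new style. The field values, the audio root, and the helper name `to_new_format` are hypothetical and chosen only for illustration; the tutorial's own conversion is done by `update_metadata` in the next cell.

```python
import json

def to_new_format(entry, dataset_name, audio_root):
    """Return a copy of a manifest entry with a string speaker name and a relative audio path."""
    new_entry = dict(entry)
    # Prefix the speaker ID with the dataset name so it stays unique across datasets.
    new_entry["speaker"] = f"{dataset_name}_{entry['speaker']}"
    # Store the path relative to the audio root directory, if the file lives under that root.
    prefix = audio_root.rstrip("/") + "/"
    path = entry["audio_filepath"]
    if path.startswith(prefix):
        new_entry["audio_filepath"] = path[len(prefix):]
    return new_entry

# Hypothetical old-style entry: integer speaker ID and absolute audio path.
old_entry = {"audio_filepath": "/data/vctk/audio/p225_001.wav", "text": "Please call Stella.", "speaker": 225}
new_entry = to_new_format(old_entry, dataset_name="vctk", audio_root="/data/vctk/audio")
print(json.dumps(new_entry))
```

The original entry is left unmodified, which makes it easy to compare the two formats side by side.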

In [None]:
def update_metadata(data_type):
    input_filepath = DATA_DIR / f"{data_type}.json"
    output_filepath = DATA_DIR / f"{data_type}_raw.json"

    entries = read_manifest(input_filepath)
    for entry in entries:
        # Provide relative path instead of absolute path
        entry["audio_filepath"] = entry["audio_filepath"].replace("audio/", "")
        # Prepend speaker ID with the name of the dataset: 'vctk'
        entry["speaker"] = f"vctk_{entry['speaker']}"

    write_manifest(output_path=output_filepath, target_manifest=entries, ensure_ascii=False)

In [None]:
update_metadata("dev")
update_metadata("train")

In [None]:
# Visualize updated 'audio_filepath' and 'speaker' fields
train_filepath = DATA_DIR / "train_raw.json"
!head $train_filepath

## Text Preprocessing

First we will process the text transcripts using the script [preprocess_text.py](https://github.com/NVIDIA/NeMo/blob/main/scripts/dataset_processing/tts/preprocess_text.py).

This step mainly passes the text through our NeMo *text normalizer* and then stores the output in the `normalized_text` field. It also has a few optional transformations, such as lowercasing the text.

In [None]:
text_preprocessing_script = NEMO_SCRIPT_DIR / "preprocess_text.py"

# Number of threads to parallelize text processing across
num_workers = 4
# Text normalizer to apply
normalizer_config_filepath = NEMO_CONFIG_DIR / "text" / "normalizer_en.yaml"
# Whether to lowercase output text. We can safely do this here because we will train on IPA phonemes.
# If training on graphemes only, then consider disabling this to leave text with its original capitalization.
lower_case = True
# Whether to overwrite output manifest, if it exists
overwrite_manifest = True
# Batch size for joblib parallelization. Increasing this value might speed up the script, depending on your CPU.
joblib_batch_size = 16

# Python wrapper to invoke the given Python script with the given input args
def run_script(script, args):
    args = ' \\'.join(args)
    cmd = f"python {script} \\{args}"

    print(cmd.replace(" \\", "\n"))
    print()
    !$cmd

def preprocess_text(data_type):
    input_filepath = DATA_DIR / f"{data_type}_raw.json"
    output_filepath = DATA_DIR / f"{data_type}_text.json"

    args = [
        f"--input_manifest={input_filepath}",
        f"--output_manifest={output_filepath}",
        f"--num_workers={num_workers}",
        f"--normalizer_config_path={normalizer_config_filepath}",
        f"--joblib_batch_size={joblib_batch_size}"
    ]
    if lower_case:
        args.append("--lower_case")
    if overwrite_manifest:
        args.append("--overwrite")

    run_script(text_preprocessing_script, args)
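
The `!$cmd` syntax only works inside IPython or Jupyter. As a rough sketch for running the same steps from a plain Python script, the command can be built as an argument list and passed to the standard `subprocess` module. The helper names `build_command` and `run_script_subprocess` are hypothetical and not part of NeMo.

```python
import subprocess
import sys

def build_command(script, args):
    """Build the argument list for running a Python script with the current interpreter."""
    return [sys.executable, str(script), *[str(arg) for arg in args]]

def run_script_subprocess(script, args):
    """Print the command, run it, and raise CalledProcessError on a non-zero exit code."""
    cmd = build_command(script, args)
    print("\n".join(cmd))
    subprocess.run(cmd, check=True)

# Example with placeholder file names; nothing is executed here.
example_cmd = build_command("preprocess_text.py", ["--input_manifest=train_raw.json", "--lower_case"])
print(example_cmd[1:])
```

Passing a list instead of a single string avoids shell quoting issues when paths contain spaces.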

In [None]:
preprocess_text("dev")

In [None]:
preprocess_text("train")

In [None]:
# Visualize the output of the 'normalized_text' field.
train_text_filepath = DATA_DIR / "train_text.json"
!head $train_text_filepath
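
To spot-check the normalizer beyond the first few lines, it can help to list only the entries where normalization changed more than letter case. The sketch below assumes the manifest stores one JSON object per line with `text` and `normalized_text` fields; the helper names and the sample entries are hypothetical.

```python
import json

def read_jsonl(path):
    """Read a manifest with one JSON object per line, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def changed_by_normalization(entries):
    """Return (text, normalized_text) pairs that differ by more than letter case."""
    return [
        (entry["text"], entry["normalized_text"])
        for entry in entries
        if entry["text"].lower() != entry["normalized_text"].lower()
    ]

# Hypothetical in-memory entries standing in for manifest lines.
sample = [
    {"text": "Chapter 2", "normalized_text": "chapter two"},
    {"text": "Hello there", "normalized_text": "hello there"},
]
print(changed_by_normalization(sample))
```

In the notebook, `changed_by_normalization(read_jsonl(train_text_filepath))[:5]` would show the first five such pairs.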

## Audio Preprocessing

Next we process the audio data using [preprocess_audio.py](https://github.com/NVIDIA/NeMo/blob/main/scripts/dataset_processing/tts/preprocess_audio.py).

During this step we apply the following transformations:

1. Resample the audio from 48 kHz to 44.1 kHz so that it is compatible with our default training configuration.
2. Remove long silence from the beginning and end of each audio file. This can be done using an *energy*-based approach, which works on clean audio, or using *voice activity detection (VAD)*, which also works on audio with background or static noise (e.g., from a microphone).
3. Scale the audio so that files have approximately the same volume level.
4. Filter out audio files which are too long or too short.
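
Steps 3 and 4 can be sketched in a few lines of plain Python. This is an illustration only: it assumes that volume scaling means peak normalization, the helper names `peak_normalize` and `within_duration` are hypothetical, and the default values mirror the ones set in the configuration cell of this section.

```python
def peak_normalize(samples, volume_level=0.95):
    """Scale samples so that the largest absolute value equals volume_level."""
    if not 0.0 < volume_level <= 1.0:
        raise ValueError("volume_level must be in (0, 1]")
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # Silent or empty audio is returned unchanged.
    return [volume_level * s / peak for s in samples]

def within_duration(num_samples, sample_rate, min_duration=0.25, max_duration=30.0):
    """Return True if the clip length in seconds lies within [min_duration, max_duration]."""
    duration = num_samples / sample_rate
    return min_duration <= duration <= max_duration

quiet_clip = [0.0, 0.1, -0.2, 0.05]
loud_clip = peak_normalize(quiet_clip)
print(max(abs(s) for s in loud_clip))
print(within_duration(num_samples=4410, sample_rate=44100))
```

The second call checks a 0.1 second clip, which falls below the 0.25 second minimum and would therefore be filtered out.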



In [None]:
import IPython.display as ipd

In [None]:
audio_preprocessing_script = NEMO_SCRIPT_DIR / "preprocess_audio.py"

# Directory with raw audio data
input_audio_dir = DATA_DIR / "audio"
# Directory to write preprocessed audio to
output_audio_dir = DATA_DIR / "audio_preprocessed"
# Whether to overwrite existing audio, if it exists in the output directory
overwrite_audio = True
# Whether to overwrite output manifest, if it exists
overwrite_manifest = True
# Number of threads to parallelize audio processing across
num_workers = 4
# Downsample data from 48 kHz to 44.1 kHz for compatibility
output_sample_rate = 44100
# Format of output audio files. Use "flac" to compress to a smaller file size.
output_format = "flac"
# Method for silence trimming. Can use "energy.yaml" or "vad.yaml".
# We use VAD for VCTK because the audio has background noise.
trim_config_path = NEMO_CONFIG_DIR / "trim" / "vad.yaml"
# Volume level (0, 1] to normalize audio to
volume_level = 0.95
# Filter out audio shorter than min_duration or longer than max_duration seconds.
# We set these bounds relatively low/high, as we can place stricter limits at training time
min_duration = 0.25
max_duration = 30.0
# Output file with entries that are filtered out based on duration
filter_file = DATA_DIR / "filtered.json"

def preprocess_audio(data_type):
    input_filepath = DATA_DIR / f"{data_type}_text.json"
    output_filepath = DATA_DIR / f"{data_type}_manifest.json"

    args = [
        f"--input_manifest={input_filepath}",
        f"--output_manifest={output_filepath}",
        f"--input_audio_dir={input_audio_dir}",
        f"--output_audio_dir={output_audio_dir}",
        f"--num_workers={num_workers}",
        f"--output_sample_rate={output_sample_rate}",
        f"--output_format={output_format}",
        f"--trim_config_path={trim_config_path}",
        f"--volume_level={volume_level}",
        f"--min_duration={min_duration}",
        f"--max_duration={max_duration}",
        f"--filter_file={filter_file}",
    ]
    if overwrite_manifest:
        args.append("--overwrite_manifest")
    if overwrite_audio:
        args.append("--overwrite_audio")

    run_script(audio_preprocessing_script, args)

In [None]:
preprocess_audio("dev")

In [None]:
preprocess_audio("train")

We should listen to a few audio files before and after processing to be sure we configured it correctly.

Note that the processed audio is louder. It is also shorter because we trimmed the leading and trailing silence.

In [None]:
audio_file = "p228_009.wav"
audio_filepath = input_audio_dir / audio_file
processed_audio_filepath = output_audio_dir / audio_file.replace(".wav", ".flac")

print("Original audio.")
ipd.display(ipd.Audio(audio_filepath))

print("Processed audio.")
ipd.display(ipd.Audio(processed_audio_filepath))

## Speaker Mapping

We can use [create_speaker_map.py](https://github.com/NVIDIA/NeMo/blob/main/scripts/dataset_processing/tts/create_speaker_map.py) to easily create a mapping from speaker ID strings to integer indices that will be used at training time.

The script will simply sort the speaker IDs and assign them numbers `[0, num_speakers)` in alphabetical order.
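
The sorting rule described above can be sketched as follows. The helper name `build_speaker_map` is hypothetical, and the snippet only illustrates the idea rather than reproducing the script.

```python
def build_speaker_map(speakers):
    """Map each unique speaker name to an index in [0, num_speakers), in sorted order."""
    return {name: index for index, name in enumerate(sorted(set(speakers)))}

# Speaker names as they might appear across manifest entries, with repeats.
speaker_map = build_speaker_map(["vctk_227", "vctk_225", "vctk_226", "vctk_225"])
print(speaker_map)
```

Because the indices depend only on the sorted set of names, rerunning the mapping on the same manifests gives the same result.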

In [None]:
speaker_map_script = NEMO_SCRIPT_DIR / "create_speaker_map.py"

train_manifest_filepath = DATA_DIR / "train_manifest.json"
dev_manifest_filepath = DATA_DIR / "dev_manifest.json"
speaker_filepath = DATA_DIR / "speakers.json"

args = [
    f"--manifest_path={train_manifest_filepath}",
    f"--manifest_path={dev_manifest_filepath}",
    f"--speaker_map_path={speaker_filepath}"
]

run_script(speaker_map_script, args)

In [None]:
# Visualize the speaker map file.
!head $speaker_filepath

## Feature Computation

Before training FastPitch, we need to compute some features for every audio file. The default [config file](https://github.com/NVIDIA/NeMo/blob/main/examples/tts/conf/feature/feature_44100.yaml) we will use has parameters for computing the **pitch** and **energy** of every audio frame. By default it will also compute a **voiced_mask** indicating which audio frames have no pitch (e.g., because they contain silence).

In [None]:
feature_script = NEMO_SCRIPT_DIR / "compute_features.py"

sample_rate = 44100

if sample_rate == 22050:
    feature_config_filename = "feature_22050.yaml"
elif sample_rate == 44100:
    feature_config_filename = "feature_44100.yaml"
else:
    raise ValueError(f"Unsupported sampling rate {sample_rate}")

feature_config_path = NEMO_CONFIG_DIR / "feature" / feature_config_filename
audio_dir = DATA_DIR / "audio_preprocessed"
feature_dir = DATA_DIR / "features"
num_workers = 4

def compute_features(data_type):
    input_filepath = DATA_DIR / f"{data_type}_manifest.json"

    args = [
        f"--feature_config_path={feature_config_path}",
        f"--manifest_path={input_filepath}",
        f"--audio_dir={audio_dir}",
        f"--feature_dir={feature_dir}",
        f"--num_workers={num_workers}"
    ]

    run_script(feature_script, args)

In [None]:
compute_features("dev")

In [None]:
compute_features("train")

The features are stored in the specified `feature_dir`.

In [None]:
!ls $feature_dir

## Feature Statistics

For training it is beneficial for us to *normalize* our features. The most standard approach is to apply *mean-variance normalization* so that each feature has a mean of 0 and variance of 1. To do this we need to compute the *dataset statistics* with the mean and variance of each feature.

For TTS it also helps to:
* Normalize features using speaker-level statistics.
* Use the `voiced_mask` to set the feature values of non-voiced audio frames to 0.
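
The following sketch shows mean-variance normalization combined with a voiced mask for a single utterance. It assumes that statistics are computed over voiced frames only and uses the population standard deviation; both choices, the helper names, and the sample pitch values are illustrative assumptions, not a description of the NeMo implementation.

```python
import math

def feature_stats(values, voiced_mask):
    """Compute the mean and standard deviation over voiced frames only."""
    voiced = [v for v, is_voiced in zip(values, voiced_mask) if is_voiced]
    if not voiced:
        raise ValueError("No voiced frames to compute statistics from")
    mean = sum(voiced) / len(voiced)
    variance = sum((v - mean) ** 2 for v in voiced) / len(voiced)
    return mean, math.sqrt(variance)

def normalize_feature(values, voiced_mask, mean, std):
    """Apply mean-variance normalization and set non-voiced frames to 0."""
    if std == 0:
        raise ValueError("Standard deviation must be non-zero")
    return [(v - mean) / std if is_voiced else 0.0 for v, is_voiced in zip(values, voiced_mask)]

# Hypothetical pitch track in Hz; unvoiced frames carry a placeholder value of 0.
pitch = [0.0, 100.0, 120.0, 140.0, 0.0]
voiced_mask = [False, True, True, True, False]
mean, std = feature_stats(pitch, voiced_mask)
normalized = normalize_feature(pitch, voiced_mask, mean, std)
print(mean, std)
print(normalized)
```

For speaker-level normalization, the same two functions would be applied with `mean` and `std` taken from that speaker's statistics instead of from a single utterance.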

Using the [compute_feature_stats.py](https://github.com/NVIDIA/NeMo/blob/main/scripts/dataset_processing/tts/compute_feature_stats.py) script we will compute the mean and variance of each feature for each speaker. The input to the script is the same [config file](https://github.com/NVIDIA/NeMo/blob/main/examples/tts/conf/feature/feature_44100.yaml) we used to compute the features.

In [None]:
feature_stats_script = NEMO_SCRIPT_DIR / "compute_feature_stats.py"

train_manifest_filepath = DATA_DIR / "train_manifest.json"
output_stats_path = DATA_DIR / "feature_stats.json"

args = [
    f"--feature_config_path={feature_config_path}",
    f"--manifest_path={train_manifest_filepath}",
    f"--audio_dir={audio_dir}",
    f"--feature_dir={feature_dir}",
    f"--stats_path={output_stats_path}"
]

run_script(feature_stats_script, args)

The output feature statistics file contains the mean and variance of the pitch and energy for the entire dataset (under the key `global`), and for each speaker in the dataset.

In [None]:
!head $output_stats_path

# HiFi-GAN Training

Our standard FastPitch recipe has two parts: the **FastPitch** acoustic model, which predicts a mel spectrogram from text, and the **HiFi-GAN** vocoder, which predicts audio from the mel spectrogram.

We will train HiFi-GAN first so that we can use it to help evaluate the performance of FastPitch as it is being trained.

HiFi-GAN training only requires a manifest with the `audio_filepath` field. All other fields in the manifest are for FastPitch training.

Here we show how to train these models from scratch. You can also fine-tune them from pretrained checkpoints as mentioned in our [FastPitch fine-tuning tutorial](https://github.com/NVIDIA/NeMo/blob/main/tutorials/tts/FastPitch_Finetuning.ipynb), but pretrained checkpoints compatible with these experimental recipes are not yet available on NGC.


In [None]:
import torch

In [None]:
dataset_name = "vctk"
audio_dir = DATA_DIR / "audio_preprocessed"
train_manifest_filepath = DATA_DIR / "train_manifest.json"
dev_manifest_filepath = DATA_DIR / "dev_manifest.json"

In [None]:
hifigan_training_script = NEMO_EXAMPLES_DIR / "hifigan.py"

# The total number of training steps will be (epochs * steps_per_epoch)
epochs = 10
steps_per_epoch = 10

sample_rate = 44100

# Config files specifying all HiFi-GAN parameters
hifigan_config_dir = NEMO_CONFIG_DIR / "hifigan_dataset"

if sample_rate == 22050:
    hifigan_config_filename = "hifigan_22050.yaml"
elif sample_rate == 44100:
    hifigan_config_filename = "hifigan_44100.yaml"
else:
    raise ValueError(f"Unsupported sampling rate {sample_rate}")

# Name of the experiment that will determine where it is saved locally and in TensorBoard and WandB
run_id = "test_run"
exp_dir = root_dir / "exps"
hifigan_exp_output_dir = exp_dir / "HifiGan" / run_id
# Directory where predicted audio will be stored periodically throughout training
hifigan_log_dir = hifigan_exp_output_dir / "logs"

if torch.cuda.is_available():
    accelerator = "gpu"
    batch_size = 16
else:
    accelerator = "cpu"
    batch_size = 2

args = [
    f"--config-path={hifigan_config_dir}",
    f"--config-name={hifigan_config_filename}",
    f"max_epochs={epochs}",
    f"weighted_sampling_steps_per_epoch={steps_per_epoch}",
    f"batch_size={batch_size}",
    f"log_dir={hifigan_log_dir}",
    f"exp_manager.exp_dir={exp_dir}",
    f"+exp_manager.version={run_id}",
    f"trainer.accelerator={accelerator}",
    f"+train_ds_meta.{dataset_name}.manifest_path={train_manifest_filepath}",
    f"+train_ds_meta.{dataset_name}.audio_dir={audio_dir}",
    f"+val_ds_meta.{dataset_name}.manifest_path={dev_manifest_filepath}",
    f"+val_ds_meta.{dataset_name}.audio_dir={audio_dir}",
    f"+log_ds_meta.{dataset_name}.manifest_path={dev_manifest_filepath}",
    f"+log_ds_meta.{dataset_name}.audio_dir={audio_dir}"
]

In [None]:
# If an error occurs, log the entire stacktrace.
os.environ["HYDRA_FULL_ERROR"] = "1"

In [None]:
run_script(hifigan_training_script, args)

During training, the model will automatically save predictions for all files specified in the `log_ds_meta` manifest.

In [None]:
hifigan_log_epoch_dir = hifigan_log_dir / "epoch_10" / dataset_name
!ls $hifigan_log_epoch_dir

This makes it easy to listen to the audio to determine how well the model is performing. We can decide to stop training when either:

* The predicted audio sounds almost exactly the same as the original audio
* The predicted audio stops improving in between epochs.

**Note that the dataset in this tutorial is too small to get good quality audio output.**

In [None]:
audio_filepath = hifigan_log_epoch_dir / "p225_143.wav"
ipd.display(ipd.Audio(audio_filepath))
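
The cells above hard-code `epoch_10`, which matches the `epochs = 10` setting. If the number of epochs changes, a small helper can locate the most recent log directory instead. The sketch below assumes the log directory contains subdirectories named `epoch_<N>`, as in the path used above; the helper name `latest_epoch_dir` is hypothetical, and the demo runs on a temporary directory rather than on real training output.

```python
import re
import tempfile
from pathlib import Path

def latest_epoch_dir(log_dir):
    """Return the epoch_<N> subdirectory with the largest N, or None if there is none."""
    best_epoch, best_dir = -1, None
    for path in Path(log_dir).iterdir():
        match = re.fullmatch(r"epoch_(\d+)", path.name)
        if path.is_dir() and match and int(match.group(1)) > best_epoch:
            best_epoch, best_dir = int(match.group(1)), path
    return best_dir

# Demo on a temporary directory with two epoch folders and one unrelated folder.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["epoch_2", "epoch_10", "notes"]:
        (Path(tmp) / name).mkdir()
    latest_name = latest_epoch_dir(tmp).name
print(latest_name)
```

Comparing the epoch numbers as integers matters here, because a plain string sort would place `epoch_10` before `epoch_2`.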

# FastPitch Training

Finally we can train the FastPitch model itself. The FastPitch training recipe requires:

1. Training manifest(s) with `audio_filepath` and `text` or `normalized_text` fields.
2. Precomputed features such as *pitch* and *energy* specified in the feature [config file](https://github.com/NVIDIA/NeMo/blob/main/examples/tts/conf/feature/feature_44100.yaml).
3. (Optional) Statistics file for normalizing features.
4. (Optional) For a multi-speaker model, the manifest needs a `speaker` field and JSON file mapping speaker IDs to speaker indices.
5. (Optional) To train with IPA phonemes, a [phoneme dictionary](https://github.com/NVIDIA/NeMo/blob/main/scripts/tts_dataset_files/ipa_cmudict-0.7b_nv23.01.txt) and optional [heteronyms file](https://github.com/NVIDIA/NeMo/blob/main/scripts/tts_dataset_files/heteronyms-052722).
6. (Optional) HiFi-GAN checkpoint or [NGC model name](https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/tts/models/hifigan.py#L413) for generating audio predictions during training.
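
Before launching a long training run, it can be worth checking the manifest against requirements 1 and 4. The sketch below is a hypothetical helper, not part of NeMo. It assumes each manifest entry is a dictionary and that the speaker map is a dictionary from speaker name to index.

```python
def validate_entry(entry, speaker_map=None):
    """Return a list of problems found in one FastPitch manifest entry."""
    problems = []
    if "audio_filepath" not in entry:
        problems.append("missing audio_filepath")
    if "text" not in entry and "normalized_text" not in entry:
        problems.append("missing text or normalized_text")
    # Speaker checks only apply to multi-speaker training.
    if speaker_map is not None:
        if "speaker" not in entry:
            problems.append("missing speaker")
        elif entry["speaker"] not in speaker_map:
            problems.append(f"unknown speaker {entry['speaker']}")
    return problems

# Hypothetical speaker map and entries for illustration.
speaker_map = {"vctk_225": 0, "vctk_226": 1}
good = {"audio_filepath": "p225_001.flac", "normalized_text": "please call stella.", "speaker": "vctk_225"}
bad = {"audio_filepath": "p230_001.flac", "speaker": "vctk_230"}
print(validate_entry(good, speaker_map))
print(validate_entry(bad, speaker_map))
```

An empty list means the entry passed all checks; any other result names the fields that need attention.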



In [None]:
fastpitch_training_script = NEMO_EXAMPLES_DIR / "fastpitch.py"

# The total number of training steps will be (epochs * steps_per_epoch)
epochs = 10
steps_per_epoch = 10

num_speakers = 5
sample_rate = 44100

# Config files specifying all FastPitch parameters
fastpitch_config_dir = NEMO_CONFIG_DIR / "fastpitch"

if sample_rate == 22050:
    fastpitch_config_filename = "fastpitch_22050.yaml"
elif sample_rate == 44100:
    fastpitch_config_filename = "fastpitch_44100.yaml"
else:
    raise ValueError(f"Unsupported sampling rate {sample_rate}")

# Metadata files and directories
dataset_file_dir = NEMO_DIR / "scripts" / "tts_dataset_files"
phoneme_dict_path = dataset_file_dir / "ipa_cmudict-0.7b_nv23.01.txt"
heteronyms_path = dataset_file_dir / "heteronyms-052722"

speaker_path = DATA_DIR / "speakers.json"
feature_dir = DATA_DIR / "features"
stats_path = DATA_DIR / "feature_stats.json"

def get_latest_checkpoint(checkpoint_dir):
    output_path = None
    for checkpoint_path in checkpoint_dir.iterdir():
        checkpoint_name = str(checkpoint_path.name)
        # Prefer the final exported .nemo file if one exists
        if checkpoint_name.endswith(".nemo"):
            output_path = checkpoint_path
            break
        # Otherwise fall back to the most recent training checkpoint
        if checkpoint_name.endswith("last.ckpt"):
            output_path = checkpoint_path

    if not output_path:
        raise ValueError(f"Could not find latest checkpoint in {checkpoint_dir}")

    return output_path

# HiFi-GAN model for generating audio predictions from FastPitch output
vocoder_type = "hifigan"
vocoder_checkpoint_path = get_latest_checkpoint(hifigan_exp_output_dir / "checkpoints")

run_id = "test_run"
exp_dir = root_dir / "exps"
fastpitch_exp_output_dir = exp_dir / "FastPitch" / run_id
fastpitch_log_dir = fastpitch_exp_output_dir / "logs"

if torch.cuda.is_available():
    accelerator = "gpu"
    batch_size = 32
else:
    accelerator = "cpu"
    batch_size = 4

args = [
    f"--config-path={fastpitch_config_dir}",
    f"--config-name={fastpitch_config_filename}",
    f"n_speakers={num_speakers}",
    f"speaker_path={speaker_path}",
    f"max_epochs={epochs}",
    f"weighted_sampling_steps_per_epoch={steps_per_epoch}",
    f"phoneme_dict_path={phoneme_dict_path}",
    f"heteronyms_path={heteronyms_path}",
    f"feature_stats_path={stats_path}",
    f"log_dir={fastpitch_log_dir}",
    f"vocoder_type={vocoder_type}",
    f"vocoder_checkpoint_path=\\'{vocoder_checkpoint_path}\\'",
    f"trainer.accelerator={accelerator}",
    f"exp_manager.exp_dir={exp_dir}",
    f"+exp_manager.version={run_id}",
    f"+train_ds_meta.{dataset_name}.manifest_path={train_manifest_filepath}",
    f"+train_ds_meta.{dataset_name}.audio_dir={audio_dir}",
    f"+train_ds_meta.{dataset_name}.feature_dir={feature_dir}",
    f"+val_ds_meta.{dataset_name}.manifest_path={dev_manifest_filepath}",
    f"+val_ds_meta.{dataset_name}.audio_dir={audio_dir}",
    f"+val_ds_meta.{dataset_name}.feature_dir={feature_dir}",
    f"+log_ds_meta.{dataset_name}.manifest_path={dev_manifest_filepath}",
    f"+log_ds_meta.{dataset_name}.audio_dir={audio_dir}",
    f"+log_ds_meta.{dataset_name}.feature_dir={feature_dir}"
]

In [None]:
run_script(fastpitch_training_script, args)

During training, the model will automatically save spectrogram and audio predictions for all files specified in the `log_ds_meta` manifest.

In [None]:
fastpitch_log_epoch_dir = fastpitch_log_dir / "epoch_10" / dataset_name
!ls $fastpitch_log_epoch_dir

This makes it easy to listen to the audio to determine how well the model is performing. We can decide to stop training when either:

* The predicted audio stops improving in between epochs.
* The predicted spectrogram stops changing in between epochs.

**Note that the dataset in this tutorial is too small to get good quality audio output.**

In [None]:
audio_filepath = fastpitch_log_epoch_dir / "p225_143.wav"
spectrogram_filepath = fastpitch_log_epoch_dir / "p225_143_spec.png"

ipd.display(ipd.Audio(audio_filepath))
ipd.display(ipd.Image(spectrogram_filepath))