{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "provenance": [], "gpuType": "T4", "toc_visible": true }, "kernelspec": { "name": "python3", "display_name": "Python 3" }, "language_info": { "name": "python" }, "accelerator": "GPU", "gpuClass": "standard" }, "cells": [ { "cell_type": "markdown", "source": [ "# Introduction" ], "metadata": { "id": "rtBDkKqVGZJ8" } }, { "cell_type": "markdown", "source": [ "In this tutorial, we will prepare a dataset using our [TTS Dataset Processing Scripts](https://github.com/NVIDIA/NeMo/tree/main/scripts/dataset_processing/tts) and use it for training a FastPitch model.\n", "\n", "**This tutorial uses a different workflow than all other existing TTS tutorials. The scripts and classes used are all experimental and not yet ready for production**." ], "metadata": { "id": "pZ2QSsXuGbMe" } }, { "cell_type": "markdown", "source": [ "# License" ], "metadata": { "id": "7X-TwhdTGmlc" } }, { "cell_type": "markdown", "source": [ "> Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.\n", ">\n", "> Licensed under the Apache License, Version 2.0 (the \"License\"); you may not use this file except in compliance with the License. You may obtain a copy of the License at\n", ">\n", "> http://www.apache.org/licenses/LICENSE-2.0\n", ">\n", "> Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License." ], "metadata": { "id": "fCQUeZRPGnoe" } }, { "cell_type": "markdown", "source": [ "# Install" ], "metadata": { "id": "3OZassNG5xff" } }, { "cell_type": "code", "source": [ "BRANCH = 'main'\n", "NEMO_ROOT_DIR = '/content/nemo'" ], "metadata": { "id": "QLLoj7bD0W5f" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "WZvQvPkIhRi3" }, "outputs": [], "source": [ "# Install NeMo library. If you are running locally (rather than on Google Colab), comment out the below lines\n", "# and instead follow the instructions at https://github.com/NVIDIA/NeMo#Installation\n", "!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]" ] }, { "cell_type": "code", "source": [ "\n", "# Download local version of NeMo scripts. If you are running locally and want to use your own local NeMo code,\n", "# comment out the below lines and set NEMO_ROOT_DIR to your local path.\n", "!git clone -b $BRANCH https://github.com/NVIDIA/NeMo.git $NEMO_ROOT_DIR" ], "metadata": { "id": "tvsgWO_WhV3M" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "# Dataset Preparation" ], "metadata": { "id": "fM4QPsLTnzK7" } }, { "cell_type": "markdown", "source": [ "For our tutorial, we use a subset of [VCTK](https://datashare.ed.ac.uk/handle/10283/2950) dataset with 5 speakers (p225-p229)." 
], "metadata": { "id": "tkZC6Dl7KRl6" } }, { "cell_type": "code", "source": [ "import os\n", "import tarfile\n", "import wget\n", "from pathlib import Path\n", "\n", "from nemo.collections.asr.parts.utils.manifest_utils import read_manifest, write_manifest" ], "metadata": { "id": "sYzvAYr2vo1K" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "# Configure nemo paths\n", "NEMO_DIR = Path(NEMO_ROOT_DIR)\n", "NEMO_EXAMPLES_DIR = NEMO_DIR / \"examples\" / \"tts\"\n", "NEMO_CONFIG_DIR = NEMO_EXAMPLES_DIR / \"conf\"\n", "NEMO_SCRIPT_DIR = NEMO_DIR / \"scripts\" / \"dataset_processing\" / \"tts\"" ], "metadata": { "id": "APo1m5M-v3pB" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "# Create dataset directory\n", "root_dir = Path(\"/content\")\n", "data_root = root_dir / \"data\"\n", "\n", "data_root.mkdir(parents=True, exist_ok=True)" ], "metadata": { "id": "aoxN1QsUzX-k" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "# Download the dataset\n", "dataset_url = \"https://vctk-subset.s3.amazonaws.com/vctk_subset_multispeaker.tar.gz\"\n", "dataset_tar_filepath = data_root / \"vctk.tar.gz\"\n", "\n", "if not os.path.exists(dataset_tar_filepath):\n", " wget.download(dataset_url, out=str(dataset_tar_filepath))" ], "metadata": { "id": "mArlQd5Hk36b" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "# Extract the dataset\n", "with tarfile.open(dataset_tar_filepath) as tar_f:\n", " tar_f.extractall(data_root)" ], "metadata": { "id": "p987cjtOy9C7" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "DATA_DIR = data_root / \"vctk_subset_multispeaker\"" ], "metadata": { "id": "Ko6dxYJW0i3G" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "# Visualize the raw dataset\n", "train_raw_filepath = DATA_DIR / \"train.json\"\n", "!head $train_raw_filepath" ], "metadata": { "id": "We5FHYQt5BeO" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "## Manifest Processing" ], "metadata": { "id": "i3jsk2HCMSU5" } }, { "cell_type": "markdown", "source": [ "The downloaded manifest uses our traditional format for TTS training. The scripts here require it to be formatted slightly differently.\n", "\n", "The `speaker` field used to be an *integer* ID corresponding to an array index that the FastPitch model would query. Now we represent it as a *string* so we can give each speaker a human-friendly name. The mapping from speaker name to speaker index will be provided at training time.\n", "\n", "As a best practice, we suggest prepending the `speaker` field with the name of the dataset so that it is guaranteed to be unique across all datasets (eg. *vctk_225*, instead of *225*).\n", "\n", "The `audio_filepath` field used to require an *absolute path* which had to be manually updated depending on where the dataset was on your computer. Absolute paths still work, but now you can optionally provide it as a *relative path*, with the root directory provided as an argument to each script." 
], "metadata": { "id": "N8WuAGJsMHRn" } }, { "cell_type": "code", "source": [ "def update_metadata(data_type):\n", " input_filepath = DATA_DIR / f\"{data_type}.json\"\n", " output_filepath = DATA_DIR / f\"{data_type}_raw.json\"\n", "\n", " entries = read_manifest(input_filepath)\n", " for entry in entries:\n", " # Provide relative path instead of absolute path\n", " entry[\"audio_filepath\"] = entry[\"audio_filepath\"].replace(\"audio/\", \"\")\n", " # Prepend speaker ID with the name of the dataset: 'vctk'\n", " entry[\"speaker\"] = f\"vctk_{entry['speaker']}\"\n", "\n", " write_manifest(output_path=output_filepath, target_manifest=entries, ensure_ascii=False)" ], "metadata": { "id": "zoCRrKQ20VZP" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "update_metadata(\"dev\")\n", "update_metadata(\"train\")" ], "metadata": { "id": "PaCc3GCG1UbH" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "# Visualize updated 'audio_filepath' and 'speaker' fields\n", "train_filepath = DATA_DIR / \"train_raw.json\"\n", "!head $train_filepath" ], "metadata": { "id": "bVLIB3Ip1Aqn" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "## Text Preprocessing" ], "metadata": { "id": "e3jHTOhL1M5_" } }, { "cell_type": "markdown", "source": [ "First we will process the text transcripts using the script [preprocess_text.py](https://github.com/NVIDIA/NeMo/blob/main/scripts/dataset_processing/tts/preprocess_text.py).\n", "\n", "This step mainly passes the text through our NeMo *text normalizer* and then stores the output in the `normalized_text` field. It also has a few optional transformations, such as lowercasing the text." ], "metadata": { "id": "H2rYykFLSR5t" } }, { "cell_type": "code", "source": [ "text_preprocessing_script = NEMO_SCRIPT_DIR / \"preprocess_text.py\"\n", "\n", "# Number of threads to parallelize text processing across\n", "num_workers = 4\n", "# Text normalizer to apply\n", "normalizer_config_filepath = NEMO_CONFIG_DIR / \"text\" / \"normalizer_en.yaml\"\n", "# Whether to lowercase output text. We can safely do this here because we will train on IPA phonemes.\n", "# If training on graphemes only, then consider disabling this to leave text with its original capitalization.\n", "lower_case = True\n", "# Whether to overwrite output manifest, if it exists\n", "overwrite_manifest = True\n", "# Batch size for joblib parallelization. 
Increasing this value might speed up the script, depending on your CPU.\n", "joblib_batch_size = 16\n", "\n", "# Python wrapper to invoke the given bash script with the given input args\n", "def run_script(script, args):\n", " args = ' \\\\'.join(args)\n", " cmd = f\"python {script} \\\\{args}\"\n", "\n", " print(cmd.replace(\" \\\\\", \"\\n\"))\n", " print()\n", " !$cmd\n", "\n", "def preprocess_text(data_type):\n", " input_filepath = DATA_DIR / f\"{data_type}_raw.json\"\n", " output_filepath = DATA_DIR / f\"{data_type}_text.json\"\n", "\n", " args = [\n", " f\"--input_manifest={input_filepath}\",\n", " f\"--output_manifest={output_filepath}\",\n", " f\"--num_workers={num_workers}\",\n", " f\"--normalizer_config_path={normalizer_config_filepath}\",\n", " f\"--joblib_batch_size={joblib_batch_size}\"\n", " ]\n", " if lower_case:\n", " args.append(\"--lower_case\")\n", " if overwrite_manifest:\n", " args.append(\"--overwrite\")\n", "\n", " run_script(text_preprocessing_script, args)" ], "metadata": { "id": "6Z1vRsPd0g2s" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "preprocess_text(\"dev\")" ], "metadata": { "id": "qg6iK3NyrZvx" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "preprocess_text(\"train\")" ], "metadata": { "id": "DkLhSL_n1QAS" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "# Visualize the output of the 'normalized_text' field.\n", "train_text_filepath = DATA_DIR / \"train_text.json\"\n", "!head $train_text_filepath" ], "metadata": { "id": "6qHbl0Cf5kQn" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "## Audio Preprocessing" ], "metadata": { "id": "alrRDWio41qi" } }, { "cell_type": "markdown", "source": [ "Next we process the audio data using [preprocess_audio.py](https://github.com/NVIDIA/NeMo/blob/main/scripts/dataset_processing/tts/preprocess_audio.py).\n", "\n", "During this step we apply the following transformations:\n", "\n", "1. Resample the audio from 48khz to 44.1khz so that it is compatible with our default training configuration.\n", "2. Remove long silence from the beginning and end of each audio file. This can be done using an *energy* based approach which will work on clean audio, or using *voice activity detection (VAD)* which also works on audio with background or static noise (eg. from a microphone).\n", "3. Scale the audio so that files have approximately the same volume level.\n", "4. Filter out audio files which are too long or too short.\n", "\n" ], "metadata": { "id": "4WfEaMwpUsFt" } }, { "cell_type": "code", "source": [ "import IPython.display as ipd" ], "metadata": { "id": "WEvIefjnd7AG" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "audio_preprocessing_script = NEMO_SCRIPT_DIR / \"preprocess_audio.py\"\n", "\n", "# Directory with raw audio data\n", "input_audio_dir = DATA_DIR / \"audio\"\n", "# Directory to write preprocessed audio to\n", "output_audio_dir = DATA_DIR / \"audio_preprocessed\"\n", "# Whether to overwrite existing audio, if it exists in the output directory\n", "overwrite_audio = True\n", "# Whether to overwrite output manifest, if it exists\n", "overwrite_manifest = True\n", "# Number of threads to parallelize audio processing across\n", "num_workers = 4\n", "# Downsample data from 48khz to 44.1khz for compatibility\n", "output_sample_rate = 44100\n", "# Format of output audio files. 
Use \"flac\" to compress to a smaller file size.\n", "output_format = \"flac\"\n", "# Method for silence trimming. Can use \"energy.yaml\" or \"vad.yaml\".\n", "# We use VAD for VCTK because the audio has background noise.\n", "trim_config_path = NEMO_CONFIG_DIR / \"trim\" / \"vad.yaml\"\n", "# Volume level (0, 1] to normalize audio to\n", "volume_level = 0.95\n", "# Filter out audio shorter than min_duration or longer than max_duration seconds.\n", "# We set these bounds relatively low/high, as we can place stricter limits at training time\n", "min_duration = 0.25\n", "max_duration = 30.0\n", "# Output file with entries that are filtered out based on duration\n", "filter_file = DATA_DIR / \"filtered.json\"\n", "\n", "def preprocess_audio(data_type):\n", " input_filepath = DATA_DIR / f\"{data_type}_text.json\"\n", " output_filepath = DATA_DIR / f\"{data_type}_manifest.json\"\n", "\n", " args = [\n", " f\"--input_manifest={input_filepath}\",\n", " f\"--output_manifest={output_filepath}\",\n", " f\"--input_audio_dir={input_audio_dir}\",\n", " f\"--output_audio_dir={output_audio_dir}\",\n", " f\"--num_workers={num_workers}\",\n", " f\"--output_sample_rate={output_sample_rate}\",\n", " f\"--output_format={output_format}\",\n", " f\"--trim_config_path={trim_config_path}\",\n", " f\"--volume_level={volume_level}\",\n", " f\"--min_duration={min_duration}\",\n", " f\"--max_duration={max_duration}\",\n", " f\"--filter_file={filter_file}\",\n", " ]\n", " if overwrite_manifest:\n", " args.append(\"--overwrite_manifest\")\n", " if overwrite_audio:\n", " args.append(\"--overwrite_audio\")\n", "\n", " run_script(audio_preprocessing_script, args)" ], "metadata": { "id": "0kQ1UDnGfdX6" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "preprocess_audio(\"dev\")" ], "metadata": { "id": "ai0zbXSOriuY" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "preprocess_audio(\"train\")" ], "metadata": { "id": "NUKnidQYfgDo" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "We should listen to a few audio files before and after the processing so be sure we configured it correctly.\n", "\n", "Note that the processed audio is louder. It is also shorter because we trimmed the leading and trailing silence." ], "metadata": { "id": "x2yhJtsj2lDR" } }, { "cell_type": "code", "source": [ "audio_file = \"p228_009.wav\"\n", "audio_filepath = input_audio_dir / audio_file\n", "processed_audio_filepath = output_audio_dir / audio_file.replace(\".wav\", \".flac\")\n", "\n", "print(\"Original audio.\")\n", "ipd.display(ipd.Audio(audio_filepath))\n", "\n", "print(\"Processed audio.\")\n", "ipd.display(ipd.Audio(processed_audio_filepath))" ], "metadata": { "id": "_fM3GwJxkjOA" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "## Speaker Mapping" ], "metadata": { "id": "d129p0nrr3PD" } }, { "cell_type": "markdown", "source": [ "We can use [create_speaker_map.py](https://github.com/NVIDIA/NeMo/blob/main/scripts/dataset_processing/tts/create_speaker_map.py) to easily create a mapping from speaker ID strings to integer indices that will be used at training time.\n", "\n", "The script will simply sort the speaker IDs and assign them numbers `[0, num_speakers)` in alphabetical order." 
], "metadata": { "id": "ZJ1MWX3F3X9u" } }, { "cell_type": "code", "source": [ "speaker_map_script = NEMO_SCRIPT_DIR / \"create_speaker_map.py\"\n", "\n", "train_manifest_filepath = DATA_DIR / \"train_manifest.json\"\n", "dev_manifest_filepath = DATA_DIR / \"dev_manifest.json\"\n", "speaker_filepath = DATA_DIR / \"speakers.json\"\n", "\n", "args = [\n", " f\"--manifest_path={train_manifest_filepath}\",\n", " f\"--manifest_path={dev_manifest_filepath}\",\n", " f\"--speaker_map_path={speaker_filepath}\"\n", "]\n", "\n", "run_script(speaker_map_script, args)" ], "metadata": { "id": "b5gdccYhr5Gk" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "# Visualize the speaker map file.\n", "!head $speaker_filepath" ], "metadata": { "id": "CMcC2Nqmt5AR" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "## Feature Computation" ], "metadata": { "id": "jyFxOjy6t8vo" } }, { "cell_type": "markdown", "source": [ "Before training FastPitch, we need to compute some features for every audio file. The default [config file](https://github.com/NVIDIA/NeMo/blob/main/examples/tts/conf/feature/feature_44100.yaml) we will use has parameters for computing the **pitch** and **energy** of every audio frame. Be default it will also compute a **voiced_mask** indicating which audio frames have no pitch (eg. because they contain silence)." ], "metadata": { "id": "QNPpwkM49orB" } }, { "cell_type": "code", "source": [ "feature_script = NEMO_SCRIPT_DIR / \"compute_features.py\"\n", "\n", "sample_rate = 44100\n", "\n", "if sample_rate == 22050:\n", " feature_config_filename = \"feature_22050.yaml\"\n", "elif sample_rate == 44100:\n", " feature_config_filename = \"feature_44100.yaml\"\n", "else:\n", " raise ValueError(f\"Unsupported sampling rate {sample_rate}\")\n", "\n", "feature_config_path = NEMO_CONFIG_DIR / \"feature\" / feature_config_filename\n", "audio_dir = DATA_DIR / \"audio_preprocessed\"\n", "feature_dir = DATA_DIR / \"features\"\n", "num_workers = 4\n", "\n", "def compute_features(data_type):\n", " input_filepath = DATA_DIR / f\"{data_type}_manifest.json\"\n", "\n", " args = [\n", " f\"--feature_config_path={feature_config_path}\",\n", " f\"--manifest_path={input_filepath}\",\n", " f\"--audio_dir={audio_dir}\",\n", " f\"--feature_dir={feature_dir}\",\n", " f\"--num_workers={num_workers}\"\n", " ]\n", "\n", " run_script(feature_script, args)" ], "metadata": { "id": "AI4aLRFbt_NQ" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "compute_features(\"dev\")" ], "metadata": { "id": "kQqPw3uRwEsO" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "compute_features(\"train\")" ], "metadata": { "id": "ct1fN_4pwCu9" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "The features are stored in the specified `feature_dir`." ], "metadata": { "id": "db83_UcOCOIo" } }, { "cell_type": "code", "source": [ "!ls $feature_dir" ], "metadata": { "id": "_8bHP4j56LWG" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "## Feature Statistics" ], "metadata": { "id": "QsuxK1P0x7hZ" } }, { "cell_type": "markdown", "source": [ "For training it is beneficial for us to *normalize* our features. The most standard approach is to apply *mean-variance normalization* so that each feature has a mean of 0 and variance of 1. 
To do this, we need to compute the *dataset statistics* with the mean and variance of each feature.\n", "\n", "For TTS it also helps to:\n", "* Normalize features using speaker-level statistics.\n", "* Use the `voiced_mask` to set the feature values of non-voiced audio frames to 0.\n", "\n", "Using the [compute_feature_stats.py](https://github.com/NVIDIA/NeMo/blob/main/scripts/dataset_processing/tts/compute_feature_stats.py) script, we will compute the mean and variance of each feature for each speaker. The input to the script is the same [config file](https://github.com/NVIDIA/NeMo/blob/main/examples/tts/conf/feature/feature_44100.yaml) we used to compute the features." ], "metadata": { "id": "O8GiAnAMCNeh" } }, { "cell_type": "code", "source": [ "feature_stats_script = NEMO_SCRIPT_DIR / \"compute_feature_stats.py\"\n", "\n", "train_manifest_filepath = DATA_DIR / \"train_manifest.json\"\n", "output_stats_path = DATA_DIR / \"feature_stats.json\"\n", "\n", "args = [\n", "    f\"--feature_config_path={feature_config_path}\",\n", "    f\"--manifest_path={train_manifest_filepath}\",\n", "    f\"--audio_dir={audio_dir}\",\n", "    f\"--feature_dir={feature_dir}\",\n", "    f\"--stats_path={output_stats_path}\"\n", "]\n", "\n", "run_script(feature_stats_script, args)" ], "metadata": { "id": "DC4c1L3CxH-h" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "The output feature statistics file contains the mean and variance of the pitch and energy for the entire dataset (under the key `global`), and for each speaker in the dataset." ], "metadata": { "id": "zos96yaoFho1" } }, { "cell_type": "code", "source": [ "!head $output_stats_path" ], "metadata": { "id": "fOz1cpIdFcG9" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "# HiFi-GAN Training" ], "metadata": { "id": "oRO842MUyODC" } }, { "cell_type": "markdown", "source": [ "Our standard FastPitch setup is a two-part recipe consisting of the **FastPitch** acoustic model, which predicts a mel spectrogram from text, and the **HiFi-GAN** vocoder, which predicts audio from the mel spectrogram.\n", "\n", "We will train HiFi-GAN first so that we can use it to help evaluate the performance of FastPitch as it is being trained.\n", "\n", "HiFi-GAN training only requires a manifest with the `audio_filepath` field. All other fields in the manifest are for FastPitch training.\n", "\n", "Here we show how to train these models from scratch. 
You can also fine-tune them from pretrained checkpoints as mentioned in our [FastPitch fine-tuning tutorial](https://github.com/NVIDIA/NeMo/blob/main/tutorials/tts/FastPitch_Finetuning.ipynb), but pretrained checkpoints compatible with these experimental recipes are not yet available on NGC.\n" ], "metadata": { "id": "E4wUKYOfH8ax" } }, { "cell_type": "code", "source": [ "import torch" ], "metadata": { "id": "pqfl9jAYMJob" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "dataset_name = \"vctk\"\n", "audio_dir = DATA_DIR / \"audio_preprocessed\"\n", "train_manifest_filepath = DATA_DIR / \"train_manifest.json\"\n", "dev_manifest_filepath = DATA_DIR / \"dev_manifest.json\"" ], "metadata": { "id": "jK2rr-Kr6Qg8" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "hifigan_training_script = NEMO_EXAMPLES_DIR / \"hifigan.py\"\n", "\n", "# The total number of training steps will be (epochs * steps_per_epoch)\n", "epochs = 10\n", "steps_per_epoch = 10\n", "\n", "sample_rate = 44100\n", "\n", "# Config files specifying all HiFi-GAN parameters\n", "hifigan_config_dir = NEMO_CONFIG_DIR / \"hifigan_dataset\"\n", "\n", "if sample_rate == 22050:\n", " hifigan_config_filename = \"hifigan_22050.yaml\"\n", "elif sample_rate == 44100:\n", " hifigan_config_filename = \"hifigan_44100.yaml\"\n", "else:\n", " raise ValueError(f\"Unsupported sampling rate {sample_rate}\")\n", "\n", "# Name of the experiment that will determine where it is saved locally and in TensorBoard and WandB\n", "run_id = \"test_run\"\n", "exp_dir = root_dir / \"exps\"\n", "hifigan_exp_output_dir = exp_dir / \"HifiGan\" / run_id\n", "# Directory where predicted audio will be stored periodically throughout training\n", "hifigan_log_dir = hifigan_exp_output_dir / \"logs\"\n", "\n", "if torch.cuda.is_available():\n", " accelerator=\"gpu\"\n", " batch_size = 16\n", "else:\n", " accelerator=\"cpu\"\n", " batch_size = 2\n", "\n", "args = [\n", " f\"--config-path={hifigan_config_dir}\",\n", " f\"--config-name={hifigan_config_filename}\",\n", " f\"max_epochs={epochs}\",\n", " f\"weighted_sampling_steps_per_epoch={steps_per_epoch}\",\n", " f\"batch_size={batch_size}\",\n", " f\"log_dir={hifigan_log_dir}\",\n", " f\"exp_manager.exp_dir={exp_dir}\",\n", " f\"+exp_manager.version={run_id}\",\n", " f\"trainer.accelerator={accelerator}\",\n", " f\"+train_ds_meta.{dataset_name}.manifest_path={train_manifest_filepath}\",\n", " f\"+train_ds_meta.{dataset_name}.audio_dir={audio_dir}\",\n", " f\"+val_ds_meta.{dataset_name}.manifest_path={dev_manifest_filepath}\",\n", " f\"+val_ds_meta.{dataset_name}.audio_dir={audio_dir}\",\n", " f\"+log_ds_meta.{dataset_name}.manifest_path={dev_manifest_filepath}\",\n", " f\"+log_ds_meta.{dataset_name}.audio_dir={audio_dir}\"\n", "]" ], "metadata": { "id": "Vr4D-NB-yQx8" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "# If an error occurs, log the entire stacktrace.\n", "os.environ[\"HYDRA_FULL_ERROR\"] = \"1\"" ], "metadata": { "id": "Bn8lQG0PxWGi" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "run_script(hifigan_training_script, args)" ], "metadata": { "id": "yUxFCNrE3Ywi" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "During training, the model will automatically save predictions for all files specified in the `log_ds_meta` manifest." 
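], "metadata": { "id": "hgCkptNote00" } }, { "cell_type": "markdown", "source": [ "The experiment manager also saves model checkpoints (and, with the default config, TensorBoard logs) under the experiment directory. The optional cell below is a quick sanity check that the HiFi-GAN checkpoints exist; the FastPitch recipe later in this tutorial loads the latest checkpoint from this directory." ], "metadata": { "id": "hgCkptNote01" } }, { "cell_type": "code", "source": [ "# Optional sanity check: list the HiFi-GAN checkpoints saved by the experiment manager.\n", "# FastPitch training below will load the latest checkpoint from this directory.\n", "hifigan_checkpoint_dir = hifigan_exp_output_dir / \"checkpoints\"\n", "!ls $hifigan_checkpoint_dir\n", "\n", "# To monitor training metrics as well, you could point TensorBoard at the experiment\n", "# directory (this assumes TensorBoard logging is enabled in the config, which is the default):\n", "# %load_ext tensorboard\n", "# %tensorboard --logdir /content/exps" ], "metadata": { "id": "hgCkptNote02" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "Now let's look at the audio predictions that were saved during training."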
], "metadata": { "id": "BBPIpS-lL6z9" } }, { "cell_type": "code", "source": [ "hifigan_log_epoch_dir = hifigan_log_dir / \"epoch_10\" / dataset_name\n", "!ls $hifigan_log_epoch_dir" ], "metadata": { "id": "rSFOm1Sg46Lh" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "This makes it easy to listen to the audio to determine how well the model is performing. We can decide to stop training when either:\n", "\n", "* The predicted audio sounds almost exactly the same as the original audio\n", "* The predicted audio stops improving in between epochs.\n", "\n", "**Note that the dataset in this tutorial is too small to get good quality audio output.**" ], "metadata": { "id": "oCJs7oCLMIjD" } }, { "cell_type": "code", "source": [ "audio_filepath = hifigan_log_epoch_dir / \"p225_143.wav\"\n", "ipd.display(ipd.Audio(audio_filepath))" ], "metadata": { "id": "G6k4ymzfJ5Y6" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "# FastPitch Training" ], "metadata": { "id": "lV--2Wph7NPG" } }, { "cell_type": "markdown", "source": [ "Finally we can train the FastPitch model itself. The FastPitch training recipe requires:\n", "\n", "1. Training manifest(s) with `audio_filepath` and `text` or `normalized_text` fields.\n", "2. Precomputed features such as *pitch* and *energy* specified in the feature [config file](https://github.com/NVIDIA/NeMo/blob/main/examples/tts/conf/feature/feature_44100.yaml).\n", "3. (Optional) Statistics file for normalizing features.\n", "4. (Optional) For a multi-speaker model, the manifest needs a `speaker` field and JSON file mapping speaker IDs to speaker indices.\n", "5. (Optional) To train with IPA phonemes, a [phoneme dictionary](https://github.com/NVIDIA/NeMo/blob/main/scripts/tts_dataset_files/ipa_cmudict-0.7b_nv23.01.txt) and optional [heteronyms file](https://github.com/NVIDIA/NeMo/blob/main/scripts/tts_dataset_files/heteronyms-052722)\n", "6. 
(Optional) HiFi-GAN checkpoint or [NGC model name](https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/tts/models/hifigan.py#L413) for generating audio predictions during training.\n", "\n" ], "metadata": { "id": "aOuoPXDhOVD7" } }, { "cell_type": "code", "source": [ "fastpitch_training_script = NEMO_EXAMPLES_DIR / \"fastpitch.py\"\n", "\n", "# The total number of training steps will be (epochs * steps_per_epoch)\n", "epochs = 10\n", "steps_per_epoch = 10\n", "\n", "num_speakers = 5\n", "sample_rate = 44100\n", "\n", "# Config files specifying all FastPitch parameters\n", "fastpitch_config_dir = NEMO_CONFIG_DIR / \"fastpitch\"\n", "\n", "if sample_rate == 22050:\n", " fastpitch_config_filename = \"fastpitch_22050.yaml\"\n", "elif sample_rate == 44100:\n", " fastpitch_config_filename = \"fastpitch_44100.yaml\"\n", "else:\n", " raise ValueError(f\"Unsupported sampling rate {sample_rate}\")\n", "\n", "# Metadata files and directories\n", "dataset_file_dir = NEMO_DIR / \"scripts\" / \"tts_dataset_files\"\n", "phoneme_dict_path = dataset_file_dir / \"ipa_cmudict-0.7b_nv23.01.txt\"\n", "heteronyms_path = dataset_file_dir / \"heteronyms-052722\"\n", "\n", "speaker_path = DATA_DIR / \"speakers.json\"\n", "feature_dir = DATA_DIR / \"features\"\n", "stats_path = DATA_DIR / \"feature_stats.json\"\n", "\n", "def get_latest_checkpoint(checkpoint_dir):\n", " output_path = None\n", " for checkpoint_path in checkpoint_dir.iterdir():\n", " checkpoint_name = str(checkpoint_path.name)\n", " if checkpoint_name.endswith(\".nemo\"):\n", " output_path = checkpoint_path\n", " break\n", " if checkpoint_name.endswith(\"last.ckpt\"):\n", " output_path = checkpoint_path\n", "\n", " if not output_path:\n", " raise ValueError(f\"Could not find latest checkpoint in {checkpoint_dir}\")\n", "\n", " return output_path\n", "\n", "# HiFi-GAN model for generating audio predictions from FastPitch output\n", "vocoder_type = \"hifigan\"\n", "vocoder_checkpoint_path = get_latest_checkpoint(hifigan_exp_output_dir / \"checkpoints\")\n", "\n", "run_id = \"test_run\"\n", "exp_dir = root_dir / \"exps\"\n", "fastpitch_exp_output_dir = exp_dir / \"FastPitch\" / run_id\n", "fastpitch_log_dir = fastpitch_exp_output_dir / \"logs\"\n", "\n", "if torch.cuda.is_available():\n", " accelerator=\"gpu\"\n", " batch_size = 32\n", "else:\n", " accelerator=\"cpu\"\n", " batch_size = 4\n", "\n", "args = [\n", " f\"--config-path={fastpitch_config_dir}\",\n", " f\"--config-name={fastpitch_config_filename}\",\n", " f\"n_speakers={num_speakers}\",\n", " f\"speaker_path={speaker_path}\",\n", " f\"max_epochs={epochs}\",\n", " f\"weighted_sampling_steps_per_epoch={steps_per_epoch}\",\n", " f\"phoneme_dict_path={phoneme_dict_path}\",\n", " f\"heteronyms_path={heteronyms_path}\",\n", " f\"feature_stats_path={stats_path}\",\n", " f\"log_dir={fastpitch_log_dir}\",\n", " f\"vocoder_type={vocoder_type}\",\n", " f\"vocoder_checkpoint_path=\\\\'{vocoder_checkpoint_path}\\\\'\",\n", " f\"trainer.accelerator={accelerator}\",\n", " f\"exp_manager.exp_dir={exp_dir}\",\n", " f\"+exp_manager.version={run_id}\",\n", " f\"+train_ds_meta.{dataset_name}.manifest_path={train_manifest_filepath}\",\n", " f\"+train_ds_meta.{dataset_name}.audio_dir={audio_dir}\",\n", " f\"+train_ds_meta.{dataset_name}.feature_dir={feature_dir}\",\n", " f\"+val_ds_meta.{dataset_name}.manifest_path={dev_manifest_filepath}\",\n", " f\"+val_ds_meta.{dataset_name}.audio_dir={audio_dir}\",\n", " f\"+val_ds_meta.{dataset_name}.feature_dir={feature_dir}\",\n", " 
f\"+log_ds_meta.{dataset_name}.manifest_path={dev_manifest_filepath}\",\n", " f\"+log_ds_meta.{dataset_name}.audio_dir={audio_dir}\",\n", " f\"+log_ds_meta.{dataset_name}.feature_dir={feature_dir}\"\n", "]" ], "metadata": { "id": "8MdMXnOAIFvj" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "run_script(fastpitch_training_script, args)" ], "metadata": { "id": "apl7TvW0TaEG" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "During training, the model will automatically save spectrogram and audio predictions for all files specified in the `log_ds_meta` manifest." ], "metadata": { "id": "Z01Fq7WRl7Di" } }, { "cell_type": "code", "source": [ "faspitch_log_epoch_dir = fastpitch_log_dir / \"epoch_10\" / dataset_name\n", "!ls $faspitch_log_epoch_dir" ], "metadata": { "id": "E8rVKnKN5HDa" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "This makes it easy to listen to the audio to determine how well the model is performing. We can decide to stop training when either:\n", "\n", "* The predicted audio stops improving in between epochs.\n", "* The predicted spectrogram stops changing in between epochs.\n", "\n", "**Note that the dataset in this tutorial is too small to get good quality audio output.**" ], "metadata": { "id": "PeNaxoCzN7Ii" } }, { "cell_type": "code", "source": [ "audio_filepath = faspitch_log_epoch_dir / \"p225_143.wav\"\n", "spectrogram_filepath = faspitch_log_epoch_dir / \"p225_143_spec.png\"\n", "\n", "ipd.display(ipd.Audio(audio_filepath))\n", "ipd.display(ipd.Image(spectrogram_filepath))" ], "metadata": { "id": "ynZdcnKc3CRF" }, "execution_count": null, "outputs": [] } ] }