# Hybrid ASR-TTS Models Tutorial

This tutorial introduces ASR-TTS hybrid models, implemented as `ASRWithTTSModel`, which finetune existing ASR models using an integrated text-to-mel-spectrogram generator.

## ASR-TTS Models: Description

### Problem

Adapting ASR models to a new text domain is a challenging task. Modern end-to-end systems can require hundreds or even thousands of hours of audio-text paired data to reach high recognition accuracy, and acquiring such data for a specific domain can be prohibitively expensive. Text-only data, on the other hand, is widely available.

One of the approaches for efficient adaptation is synthesizing audio data from text and using such data for training the ASR model conventionally. We modify this approach, incorporating TTS and ASR systems into a single model. We use only a lightweight multi-speaker text-to-mel-spectrogram generator (without vocoder) with an optional enhancer that mitigates the mismatch between natural and synthetic spectrograms.

### Architecture

<img width="400px" height="auto"
     src="https://github.com/NVIDIA/NeMo/blob/stable/docs/source/asr/images/hybrid_asr_tts_model.png?raw=true"
     alt="ASR-TTS model architecture"
     style="float: right; margin-left: 20px;">

`ASRWithTTSModel` is a transparent wrapper for three models:
- ASR model (`EncDecCTCModelBPE`, `EncDecRNNTBPEModel` or `EncDecHybridRNNTCTCBPEModel` are supported)
- frozen text-to-mel-spectrogram model (currently, only `FastPitch` model is supported)
- optional frozen enhancer model

The architecture is shown in the figure. 

The model can take text or audio as input during training. In the case of audio input, a mel spectrogram is extracted as usual and passed to the ASR neural network. In the case of textual input, the mel spectrogram generator produces a spectrogram on the fly from the text. The spectrogram is improved by the enhancer (if present) and fed into the ASR model. 
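
The routing described above can be sketched in a few lines. This is a conceptual illustration, not the `ASRWithTTSModel` implementation: the function name `route_batch` and the callables passed into it are hypothetical stand-ins for the real sub-models.

```python
from typing import Callable, Optional, Sequence

Spectrogram = Sequence[Sequence[float]]  # placeholder type: frames x mel bins


def route_batch(
    batch: dict,
    extract_mel: Callable[[object], Spectrogram],
    tts_generate: Callable[[str], Spectrogram],
    asr_forward: Callable[[Spectrogram], str],
    enhance: Optional[Callable[[Spectrogram], Spectrogram]] = None,
) -> str:
    """Hypothetical sketch of how one training example reaches the ASR network."""
    if "audio" in batch:
        # Audio input: extract the mel spectrogram as usual.
        spectrogram = extract_mel(batch["audio"])
    else:
        # Text input: the frozen generator synthesizes a spectrogram on the fly.
        spectrogram = tts_generate(batch["text"])
        if enhance is not None:
            # The optional frozen enhancer reduces the synthetic/natural mismatch.
            spectrogram = enhance(spectrogram)
    return asr_forward(spectrogram)


# Toy stand-ins that only record which path was taken.
trace = []
output = route_batch(
    {"text": "hello world"},
    extract_mel=lambda audio: trace.append("mel") or [[0.0]],
    tts_generate=lambda text: trace.append("tts") or [[0.0]],
    asr_forward=lambda spec: trace.append("asr") or "hello world",
    enhance=lambda spec: trace.append("enhancer") or spec,
)
print(trace)  # ['tts', 'enhancer', 'asr']
```

In both branches the ASR network receives a mel spectrogram, which is why the text-to-mel generator has to share spectrogram parameters with the ASR model.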

### Capabilities and Limitations

This approach can be used to finetune a pretrained ASR model using text-only data. Training new models from scratch is also possible. The text should consist of phrases and sentences and be split into sentences (~45 words maximum, corresponding to ~16.7 seconds of synthesized audio). Using only isolated words is not recommended, since this does not allow the ASR model to learn to recognize new words in context.
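
A quick way to check text against the length limit before building a manifest is sketched below. The helper `filter_by_word_count` is hypothetical and not part of NeMo; it counts whitespace-separated tokens and uses the ~45-word limit mentioned above as its default.

```python
from typing import Iterable, List


def filter_by_word_count(sentences: Iterable[str], min_words: int = 1, max_words: int = 45) -> List[str]:
    """Keep sentences whose whitespace-separated word count is within [min_words, max_words]."""
    kept = []
    for sentence in sentences:
        num_words = len(sentence.split())
        if min_words <= num_words <= max_words:
            kept.append(sentence.strip())
    return kept


candidates = [
    "enter six three two one",  # 5 words: kept
    "",  # 0 words: dropped
    " ".join(["word"] * 60),  # 60 words: dropped, should be split into shorter sentences first
]
print(len(filter_by_word_count(candidates)))  # 1
```

Overlong entries are better split at sentence boundaries than discarded, so that their vocabulary still reaches the model.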

Mixing audio-text pairs from the original domain with the text-only data is recommended to preserve performance on the original data.
Also, fusing BatchNorm (see parameters below) is recommended for the best performance when the proportion of text is large compared to the amount of audio-text pairs in the finetuning process.


### Implementation Details and Experiments

Further details about implementation and experiments can be found in the paper [Text-only domain adaptation for end-to-end ASR using integrated text-to-mel-spectrogram generator](https://arxiv.org/abs/2302.14036).


## Example: Finetuning ASR Model Using Text-Only Data

In this example, we will finetune a pretrained small Conformer-CTC model using text-only data from the AN4 dataset. The [AN4 dataset](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/datasets.html#an4-dataset) is a small dataset of recordings of people spelling out addresses, names, and other entities.

The model is pretrained on LibriSpeech data and performs poorly on AN4 data (`~17.7%` WER on test data).
We will use only the text from the train part to construct text-only training data for our model, and we will achieve good performance on the test part of the AN4 dataset (`~2%` WER).

You can run this notebook either locally (if you have all the dependencies and a GPU) or on Google Colab.

Instructions for setting up Colab are as follows:
1. Open a new Python 3 notebook.
2. Import this notebook from GitHub (File -> Upload Notebook -> "GITHUB" tab -> copy/paste GitHub URL)
3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select "GPU" for hardware accelerator)
4. Run the following cell to set up dependencies.

NOTE: The user is responsible for checking the content of datasets and the applicable licenses and determining if they are suitable for the intended use.

### Install Dependencies

In [None]:
try:
    import google.colab

    IN_COLAB = True
except (ImportError, ModuleNotFoundError):
    IN_COLAB = False

In [None]:
BRANCH = 'main'

In [None]:
# If you're using Google Colab and not running locally, run this cell.

if IN_COLAB:
    ## Install dependencies
    !pip install wget
    !apt-get install sox libsndfile1 ffmpeg
    !pip install text-unidecode

    ## Install NeMo
    !python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]

### Import necessary libraries and utils

In [None]:
import os
from pathlib import Path
import string
import tempfile

from omegaconf import OmegaConf
import lightning.pytorch as pl
import torch
from tqdm.auto import tqdm
import wget

from nemo.collections.asr.models import EncDecCTCModelBPE
from nemo.collections.asr.models.hybrid_asr_tts_models import ASRWithTTSModel
from nemo.collections.asr.parts.utils.manifest_utils import read_manifest, write_manifest
from nemo.collections.tts.models import FastPitchModel, SpectrogramEnhancerModel
from nemo.utils.notebook_utils import download_an4

try:
    from nemo_text_processing.text_normalization.normalize import Normalizer
except ModuleNotFoundError:
    raise ModuleNotFoundError(
        "The package `nemo_text_processing` was not installed in this environment. Please refer to"
        " https://github.com/NVIDIA/NeMo-text-processing and install this package before using "
        "this script"
    )

### Prepare Data

Download and preprocess AN4 data.

In [None]:
DATASETS_DIR = Path("./datasets")  # directory for data
CHECKPOINTS_DIR = Path("./checkpoints/")  # directory for checkpoints

In [None]:
# create directories if necessary
DATASETS_DIR.mkdir(parents=True, exist_ok=True)
CHECKPOINTS_DIR.mkdir(parents=True, exist_ok=True)

In [None]:
download_an4(data_dir=f"{DATASETS_DIR}")

In [None]:
AN4_DATASET = DATASETS_DIR / "an4"

### Construct text-only training data

In [None]:
# read original training data
an4_train_data = read_manifest(AN4_DATASET / "train_manifest.json")

The text-only manifest should contain three fields:
- `text`: target text for the ASR model
- `tts_text`: text to use as a source for the TTS model (unnormalized)
- `tts_text_normalized`: text to use as a source for TTS model (normalized)

If `tts_text_normalized` is not present, `tts_text` will be used, and normalization will be done when loading the dataset.
It is highly recommended to normalize the text and manually create the `tts_text_normalized` field since current normalizers are unsuitable for processing a large amount of text on the fly.
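
For illustration, one line of such a manifest could look like the record built below. The sentence, its hand-written normalized form, and the file name `example_text_manifest.json` are made up for this example; the sketch assumes the usual manifest layout of one JSON object per line and uses only the standard library.

```python
import json
import tempfile
from pathlib import Path

# Illustrative record; the normalized form is written by hand, not produced by a normalizer.
record = {
    "text": "it costs five dollars",  # target for the ASR model
    "tts_text": "It costs $5.",  # raw source text for the TTS model
    "tts_text_normalized": "It costs five dollars.",  # normalized source text for the TTS model
}

with tempfile.TemporaryDirectory() as tmp_dir:
    manifest_path = Path(tmp_dir) / "example_text_manifest.json"
    # Write one JSON object per line.
    with open(manifest_path, "w", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

    # Read the manifest back and check that every record has the expected fields.
    with open(manifest_path, encoding="utf-8") as f:
        loaded = [json.loads(line) for line in f if line.strip()]

required_fields = {"text", "tts_text", "tts_text_normalized"}
missing = [r for r in loaded if not required_fields.issubset(r)]
print(len(loaded), len(missing))  # 1 0
```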

In [None]:
# fill `text` and `tts_text` fields with the source data
textonly_data = []
for record in an4_train_data:
    text = record["text"]
    textonly_data.append({"text": text, "tts_text": text})

In [None]:
WHITELIST_URL = (
    "https://raw.githubusercontent.com/NVIDIA/NeMo-text-processing/main/"
    "nemo_text_processing/text_normalization/en/data/whitelist/lj_speech.tsv"
)


def get_normalizer() -> Normalizer:
    with tempfile.TemporaryDirectory() as data_dir:
        whitelist_path = Path(data_dir) / "lj_speech.tsv"
        if not whitelist_path.exists():
            wget.download(WHITELIST_URL, out=str(data_dir))

        normalizer = Normalizer(
            lang="en",
            input_case="cased",
            whitelist=str(whitelist_path),
            overwrite_cache=True,
            cache_dir=None,
        )
    return normalizer

Construct the `tts_text_normalized` field by applying an English normalizer to the text.

AN4 data doesn't contain numbers, currency, and other entities, so the normalizer is used here only for demonstration purposes.

In [None]:
normalizer = get_normalizer()

In [None]:
for record in tqdm(textonly_data):
    record["tts_text_normalized"] = normalizer.normalize(
        record["tts_text"], verbose=False, punct_pre_process=True, punct_post_process=True
    )

Save the manifest.

In [None]:
write_manifest(AN4_DATASET / "train_text_manifest.json", textonly_data)

### Save pretrained checkpoints

First, we will load pretrained models from NGC and save them as `nemo` checkpoints.
Our hybrid model will be constructed from these checkpoints.
We will use:
- small Conformer-CTC ASR model trained on LibriSpeech data (for finetuning)
- multi-speaker FastPitch TTS model trained on LibriTTS data; its spectrogram parameters are the same as those used in the ASR model
- enhancer trained adversarially on the output of the TTS model and natural spectrograms

In [None]:
ASR_MODEL_PATH = CHECKPOINTS_DIR / "stt_en_conformer_ctc_small_ls.nemo"
TTS_MODEL_PATH = CHECKPOINTS_DIR / "fastpitch.nemo"
ENHANCER_MODEL_PATH = CHECKPOINTS_DIR / "enhancer.nemo"

In [None]:
# asr model: stt_en_conformer_ctc_small_ls
asr_model = EncDecCTCModelBPE.from_pretrained(model_name="stt_en_conformer_ctc_small_ls")
asr_model.save_to(f"{ASR_MODEL_PATH}")

# tts model: tts_en_fastpitch_for_asr_finetuning
tts_model = FastPitchModel.from_pretrained(model_name="tts_en_fastpitch_for_asr_finetuning")
tts_model.save_to(f"{TTS_MODEL_PATH}")

# enhancer model: tts_en_spectrogram_enhancer_for_asr_finetuning
enhancer_model = SpectrogramEnhancerModel.from_pretrained(model_name="tts_en_spectrogram_enhancer_for_asr_finetuning")
enhancer_model.save_to(f"{ENHANCER_MODEL_PATH}")

### Construct hybrid ASR-TTS model 

#### Config Parameters

The hybrid ASR-TTS model consists of three parts:

* ASR model (``EncDecCTCModelBPE``, ``EncDecRNNTBPEModel`` or ``EncDecHybridRNNTCTCBPEModel``)
* TTS Mel Spectrogram Generator (currently, only `FastPitch` model is supported)
* Enhancer model (optional)

The config also allows specifying a text-only dataset.

Main parts of the config:

* ASR model
    * ``asr_model_path``: path to the ASR model checkpoint (`.nemo`) file, loaded only once, then the config of the ASR model is stored in the ``asr_model`` field
    * ``asr_model_type``: needed only when training from scratch. ``rnnt_bpe`` corresponds to ``EncDecRNNTBPEModel``, ``ctc_bpe`` to ``EncDecCTCModelBPE``, ``hybrid_rnnt_ctc_bpe`` to ``EncDecHybridRNNTCTCBPEModel``
    * ``asr_model_fuse_bn``: fusing BatchNorm in the pretrained ASR model, can improve quality in finetuning scenario
* TTS model
    * ``tts_model_path``: path to the pretrained TTS model checkpoint (`.nemo`) file, loaded only once, then the config of the model is stored in the ``tts_model`` field
* Enhancer model
    * ``enhancer_model_path``: optional path to the enhancer model. Loaded only once, the config is stored in the ``enhancer_model`` field
* ``train_ds``
    * ``text_data``: properties related to text-only data
        * ``manifest_filepath``: path (or paths) to text-only dataset manifests
        * ``speakers_filepath``: path (or paths) to the text file containing speaker ids for the multi-speaker TTS model (speakers are sampled randomly during training)
        * ``min_words`` and ``max_words``: parameters to filter text-only manifests by the number of words
        * ``tokenizer_workers``: number of workers for initial tokenization (when loading the data). ``num_CPUs / num_GPUs`` is a recommended value.
    * ``asr_tts_sampling_technique``, ``asr_tts_sampling_temperature``, ``asr_tts_sampling_probabilities``: sampling parameters for text-only and audio-text data (if both specified). Correspond to ``sampling_technique``, ``sampling_temperature``, and ``sampling_probabilities`` parameters of the `nemo.collections.common.data.dataset.ConcatDataset`.
    * all other components are similar to conventional ASR models
* ``validation_ds`` and ``test_ds`` correspond to the underlying ASR model
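
To make the layout concrete, the fragment below mirrors the parameters listed above as a nested Python dictionary. It is a sketch, not the shipped `hybrid_asr_tts.yaml`: every path and numeric value is a placeholder, the `round-robin` sampling technique is an assumption about one of the accepted options, and the helper `flatten` is hypothetical.

```python
# Placeholder values throughout; only the key names come from the list above.
model_config_sketch = {
    "asr_model_path": "checkpoints/asr.nemo",
    "asr_model_fuse_bn": True,
    "tts_model_path": "checkpoints/tts.nemo",
    "enhancer_model_path": None,  # the enhancer is optional
    "train_ds": {
        "manifest_filepath": "data/audio_text_manifest.json",  # audio-text pairs, or None
        "text_data": {
            "manifest_filepath": "data/text_manifest.json",
            "speakers_filepath": "checkpoints/speakers.txt",
            "min_words": 1,
            "max_words": 45,
            "tokenizer_workers": 4,
        },
        # Assumed value; consult the config file for the supported options.
        "asr_tts_sampling_technique": "round-robin",
        "asr_tts_sampling_temperature": None,
        "asr_tts_sampling_probabilities": None,
    },
    "validation_ds": {"manifest_filepath": "data/validation_manifest.json"},
}


def flatten(config: dict, prefix: str = "model") -> dict:
    """Turn the nested dictionary into dotted keys like those used for command-line overrides."""
    flat = {}
    for key, value in config.items():
        dotted = f"{prefix}.{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, dotted))
        else:
            flat[dotted] = value
    return flat


for dotted_key in flatten(model_config_sketch):
    print(dotted_key)
```

The dotted names printed by the loop, such as `model.train_ds.text_data.manifest_filepath`, have the same form as the overrides passed to the training scripts.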

In [None]:
# load config
!wget -P configs/ https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/examples/asr/conf/asr_tts/hybrid_asr_tts.yaml

In [None]:
config = OmegaConf.load("./configs/hybrid_asr_tts.yaml")

In [None]:
NUM_EPOCHS = 10

We will use all available speakers (sampled uniformly).

In [None]:
TTS_SPEAKERS_PATH = Path("./checkpoints/speakers.txt")

with open(TTS_SPEAKERS_PATH, "w", encoding="utf-8") as f:
    for speaker_id in range(tts_model.cfg.n_speakers):
        print(speaker_id, file=f)

In [None]:
config.model.asr_model_path = ASR_MODEL_PATH
config.model.tts_model_path = TTS_MODEL_PATH
config.model.enhancer_model_path = ENHANCER_MODEL_PATH

# fuse BatchNorm automatically in Conformer for better performance
config.model.asr_model_fuse_bn = True

# training data
# constructed dataset
config.model.train_ds.text_data.manifest_filepath = str(AN4_DATASET / "train_text_manifest.json")
# speakers for TTS model
config.model.train_ds.text_data.speakers_filepath = f"{TTS_SPEAKERS_PATH}"
config.model.train_ds.manifest_filepath = None  # audio-text pairs - we don't use them here
config.model.train_ds.batch_size = 8

# validation data
config.model.validation_ds.manifest_filepath = str(AN4_DATASET / "test_manifest.json")
config.model.validation_ds.batch_size = 8

config.trainer.max_epochs = NUM_EPOCHS

config.trainer.devices = 1
config.trainer.strategy = 'auto'  # use 1 device, no need for ddp strategy

OmegaConf.resolve(config)

#### Construct trainer and ASRWithTTSModel

In [None]:
trainer = pl.Trainer(**config.trainer)

In [None]:
hybrid_model = ASRWithTTSModel(config.model)

#### Validate the model

Expect `~17.7%` WER on the AN4 test data.

In [None]:
trainer.validate(hybrid_model)
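
For reference, word error rate (WER) is the word-level edit distance between hypothesis and reference divided by the number of reference words. The function below is a minimal standalone sketch of that definition; it is not the metric implementation used during validation, and the example sentences are made up.

```python
from typing import List


def word_error_rate(references: List[str], hypotheses: List[str]) -> float:
    """Corpus-level WER: total word edit distance / total number of reference words."""
    total_edits = 0
    total_words = 0
    for reference, hypothesis in zip(references, hypotheses):
        ref_words, hyp_words = reference.split(), hypothesis.split()
        # Levenshtein distance over words, keeping only the previous row.
        previous = list(range(len(hyp_words) + 1))
        for i, ref_word in enumerate(ref_words, start=1):
            current = [i]
            for j, hyp_word in enumerate(hyp_words, start=1):
                substitution = previous[j - 1] + (ref_word != hyp_word)
                current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
            previous = current
        total_edits += previous[-1]
        total_words += len(ref_words)
    return total_edits / total_words if total_words else 0.0


refs = ["enter four five six", "go back"]
hyps = ["enter for five six", "go back"]
print(f"{word_error_rate(refs, hyps):.3f}")  # 0.167
```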

#### Train the model

In [None]:
trainer.fit(hybrid_model)

#### Validate the model after training

Expect `~2%` WER on the AN4 test data.

In [None]:
trainer.validate(hybrid_model)

### Save final model. Extract pure ASR model

In [None]:
# save full model: the model can be further used for finetuning
hybrid_model.save_to("checkpoints/finetuned_hybrid_model.nemo")

In [None]:
# extract the resulting ASR model from the hybrid model
hybrid_model.save_asr_model_to("checkpoints/finetuned_asr_model.nemo")
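
The extracted checkpoint is a regular ASR model, so it can be loaded without the TTS components. The sketch below assumes the checkpoint path from the cell above and an audio file of your own; the helper `transcribe_with_finetuned_model` and the audio path are hypothetical, and the exact return type of `transcribe` (plain strings or hypothesis objects) depends on the NeMo version.

```python
from typing import List


def transcribe_with_finetuned_model(checkpoint_path: str, audio_paths: List[str]) -> list:
    """Load the extracted ASR checkpoint and transcribe a list of audio files."""
    # Imported lazily so that defining this helper does not require NeMo to be loaded.
    from nemo.collections.asr.models import EncDecCTCModelBPE

    asr_model = EncDecCTCModelBPE.restore_from(checkpoint_path)
    asr_model.eval()
    # Audio paths are passed positionally, since the keyword name differs between NeMo versions.
    return asr_model.transcribe(audio_paths)


# Example call (replace the audio path with a real file):
# transcribe_with_finetuned_model("checkpoints/finetuned_asr_model.nemo", ["path/to/audio.wav"])
```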

## Using Scripts (examples)

`<NeMo_git_root>/examples/asr/asr_with_tts/` contains scripts for finetuning existing models and training new models from scratch.

### Finetuning Existing Model

To finetune an existing ASR model using text-only data, use the `<NeMo_git_root>/examples/asr/asr_with_tts/speech_to_text_bpe_with_text_finetune.py` script with the corresponding config `<NeMo_git_root>/examples/asr/conf/asr_tts/hybrid_asr_tts.yaml`.

Please specify paths to all the required models (ASR, TTS, and Enhancer checkpoints), along with `train_ds.text_data.manifest_filepath` and `train_ds.text_data.speakers_filepath`.

```shell
python speech_to_text_bpe_with_text_finetune.py \
    model.asr_model_path=<path to ASR model> \
    model.tts_model_path=<path to compatible TTS model> \
    model.enhancer_model_path=<optional path to enhancer model> \
    model.asr_model_fuse_bn=<true recommended if ConformerEncoder with BatchNorm, false otherwise> \
    model.train_ds.manifest_filepath=<path to manifest with audio-text pairs or null> \
    model.train_ds.text_data.manifest_filepath=<path(s) to manifest with train text> \
    model.train_ds.text_data.speakers_filepath=<path(s) to speakers list> \
    model.train_ds.text_data.tokenizer_workers=4 \
    model.validation_ds.manifest_filepath=<path to validation manifest> \
    model.train_ds.batch_size=<batch_size>
```

### Training a New Model from Scratch

```shell
# Optional: add --config-path=<path to dir of configs> --config-name=<name of config without .yaml>
python speech_to_text_bpe_with_text.py \
    ++asr_model_type=<rnnt_bpe, ctc_bpe or hybrid_rnnt_ctc_bpe> \
    ++tts_model_path=<path to compatible tts model> \
    ++enhancer_model_path=<optional path to enhancer model> \
    model.tokenizer.dir=<path to tokenizer> \
    model.tokenizer.type="bpe" \
    model.train_ds.manifest_filepath=<path(s) to manifest with audio-text pairs or null> \
    ++model.train_ds.text_data.manifest_filepath=<path(s) to manifests with train text> \
    ++model.train_ds.text_data.speakers_filepath=<path(s) to speakers list> \
    ++model.train_ds.text_data.min_words=1 \
    ++model.train_ds.text_data.max_words=45 \
    ++model.train_ds.text_data.tokenizer_workers=4 \
    model.validation_ds.manifest_filepath=<path(s) to val/test manifest> \
    model.train_ds.batch_size=<batch size> \
    trainer.max_epochs=<num epochs> \
    trainer.num_nodes=<number of nodes> \
    trainer.accumulate_grad_batches=<grad accumulation>
```

## Training TTS Models for ASR Finetuning

### TTS Model (FastPitch)

A TTS model intended for ASR finetuning should be trained with the same mel spectrogram parameters as the ASR model. Typical parameters are a `10ms` hop length, a `25ms` window length, and a highest frequency band of 8 kHz (for 16 kHz data). Other parameters are the same as for common multi-speaker TTS models.
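
Converted to samples for 16 kHz audio, these settings work out as follows. The snippet is plain arithmetic; the variable names are illustrative and do not correspond to specific config keys.

```python
sample_rate = 16_000  # Hz
hop_length_ms = 10  # stride between consecutive frames
window_length_ms = 25  # analysis window
highest_frequency = sample_rate // 2  # Nyquist frequency for 16 kHz data

hop_length_samples = sample_rate * hop_length_ms // 1000
window_length_samples = sample_rate * window_length_ms // 1000
frames_per_second = sample_rate // hop_length_samples

print(hop_length_samples, window_length_samples, highest_frequency, frames_per_second)
# 160 400 8000 100
```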

We observed two main differences specific to TTS models for ASR:
- adding more speakers and more data improves the final ASR model quality (but not the perceptual quality of the TTS model)
- training for more epochs can also improve the quality of the ASR system (but MSE loss used for the TTS model can be higher than optimal on validation data)

Use the script `<NeMo_git_root>/examples/tts/fastpitch.py` to train a FastPitch model.
More details about the FastPitch model can be found in the [documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/tts/models.html#fastpitch). 

### Enhancer
Use the script `<NeMo_git_root>/examples/tts/spectrogram_enhancer.py` to train an Enhancer model. More details can be found in the
[documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/tts/models.html).

### Models Used in This Tutorial

Some details about the models used in this tutorial can be found on [NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/tts_en_fastpitch_spectrogram_enhancer_for_asr_finetuning).

The system is also described in detail in the paper [Text-only domain adaptation for end-to-end ASR using integrated text-to-mel-spectrogram generator](https://arxiv.org/abs/2302.14036).

## Summary

This tutorial demonstrated the main concepts of hybrid ASR-TTS models, which can be used to finetune ASR models and to train new ones from scratch.
The effectiveness of text-only adaptation was shown by finetuning a small Conformer model on text-only data from the AN4 dataset.