{
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {
                "id": "RYGnI-EZp_nK"
            },
            "source": [
                "# Getting Started: Sample Conversational AI application\n",
                "This notebook shows how to use NVIDIA NeMo (https://github.com/NVIDIA/NeMo) to construct a toy demo which translate Mandarin audio file into English one.\n",
                "\n",
                "The demo demonstrates how to: \n",
                "\n",
                "* Instantiate pre-trained NeMo models from NVIDIA NGC.\n",
                "* Transcribe audio with (Mandarin) speech recognition model.\n",
                "* Translate text with machine translation model.\n",
                "* Generate audio with text-to-speech models."
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {
                "id": "V72HXYuQ_p9a"
            },
            "source": [
                "## Installation\n",
                "NeMo can be installed via simple pip command.\n",
                "This will take about 4 minutes.\n",
                "\n",
                "(The installation method below should work inside your new Conda environment or in an NVIDIA docker container.)"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {
                "id": "efDmTWf1_iYK"
            },
            "outputs": [],
            "source": [
                "BRANCH = 'r1.17.0'\n",
                "!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]\n"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {
                "id": "EyJ5HiiPrPKA"
            },
            "source": [
                "## Import all necessary packages"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {
                "id": "tdUqxeUEA8nw"
            },
            "outputs": [],
            "source": [
                "# Import NeMo and it's ASR, NLP and TTS collections\n",
                "import nemo\n",
                "# Import Speech Recognition collection\n",
                "import nemo.collections.asr as nemo_asr\n",
                "# Import Natural Language Processing colleciton\n",
                "import nemo.collections.nlp as nemo_nlp\n",
                "# Import Speech Synthesis collection\n",
                "import nemo.collections.tts as nemo_tts\n",
                "# We'll use this to listen to audio\n",
                "import IPython"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {
                "id": "bt2EZyU3A1aq"
            },
            "source": [
                "## Instantiate pre-trained NeMo models\n",
                "\n",
                "Every NeMo model has these methods:\n",
                "\n",
                "* ``list_available_models()`` - it will list all models currently available on NGC and their names.\n",
                "\n",
                "* ``from_pretrained(...)`` API downloads and initialized model directly from the NGC using model name.\n"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {
                "id": "YNNHs5Xjr8ox",
                "scrolled": true
            },
            "outputs": [],
            "source": [
                "# Here is an example of all CTC-based models:\n",
                "nemo_asr.models.EncDecCTCModel.list_available_models()\n",
                "# More ASR Models are available - see: nemo_asr.models.ASRModel.list_available_models()"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {
                "id": "1h9nhICjA5Dk",
                "scrolled": true
            },
            "outputs": [],
            "source": [
                "# Speech Recognition model - Citrinet initially trained on Multilingual LibriSpeech English corpus, and fine-tuned on the open source Aishell-2\n",
                "asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name=\"stt_zh_citrinet_1024_gamma_0_25\").cuda()\n",
                "\n",
                "# Neural Machine Translation model\n",
                "nmt_model = nemo_nlp.models.MTEncDecModel.from_pretrained(model_name='nmt_zh_en_transformer6x6').cuda()\n",
                "\n",
                "# Spectrogram generator which takes text as an input and produces spectrogram\n",
                "spectrogram_generator = nemo_tts.models.FastPitchModel.from_pretrained(model_name=\"tts_en_fastpitch\").cuda()\n",
                "\n",
                "# Vocoder model which takes spectrogram and produces actual audio\n",
                "vocoder = nemo_tts.models.HifiGanModel.from_pretrained(model_name=\"tts_en_hifigan\").cuda()"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {
                "id": "KPota-JtsqSY"
            },
            "source": [
                "## Get an audio sample in Mandarin"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {
                "id": "7cGCEKkcLr52"
            },
            "outputs": [],
            "source": [
                "# Download audio sample which we'll try\n",
                "# This is a sample from MCV 6.1 Dev dataset - the model hasn't seen it before\n",
                "# IMPORTANT: The audio must be mono with 16Khz sampling rate\n",
                "audio_sample = 'common_voice_zh-CN_21347786.mp3'\n",
                "!wget 'https://nemo-public.s3.us-east-2.amazonaws.com/zh-samples/common_voice_zh-CN_21347786.mp3'\n",
                "# To listen it, click on the play button below\n",
                "IPython.display.Audio(audio_sample)"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {
                "id": "BaCdNJhhtBfM"
            },
            "source": [
                "## Transcribe audio file\n",
                "We will use speech recognition model to convert audio into text.\n"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {
                "id": "KTA7jM6sL6yC"
            },
            "outputs": [],
            "source": [
                "transcribed_text = asr_model.transcribe([audio_sample])\n",
                "print(transcribed_text)"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {
                "id": "BjYb2TMtttCc"
            },
            "source": [
                "## Translate Chinese text into English\n",
                "NeMo's NMT models have a handy ``.translate()`` method."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {
                "id": "kQTdE4b9Nm9O"
            },
            "outputs": [],
            "source": [
                "english_text = nmt_model.translate(transcribed_text)\n",
                "print(english_text)"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {
                "id": "9Rppc59Ut7uy"
            },
            "source": [
                "## Generate English audio from text\n",
                "Speech generation from text typically has two steps:\n",
                "* Generate spectrogram from the text. In this example we will use FastPitch model for this.\n",
                "* Generate actual audio from the spectrogram. In this example we will use HifiGan model for this.\n"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {
                "id": "wpMYfufgNt15"
            },
            "outputs": [],
            "source": [
                "# A helper function which combines FastPitch and HifiGan to go directly from \n",
                "# text to audio\n",
                "def text_to_audio(text):\n",
                "  parsed = spectrogram_generator.parse(text)\n",
                "  spectrogram = spectrogram_generator.generate_spectrogram(tokens=parsed)\n",
                "  audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)\n",
                "  return audio.to('cpu').detach().numpy()"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "# Listen to generated audio in English\n",
                "IPython.display.Audio(text_to_audio(english_text[0]), rate=22050)"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {
                "id": "LiQ_GQpcBYUs"
            },
            "source": [
                "## Next steps\n",
                "A demo like this is great for prototyping and experimentation. However, for real production deployment, you would want to use a service like [NVIDIA Riva](https://developer.nvidia.com/riva).\n",
                "\n",
                "**NeMo is built for training.** You can fine-tune, or train from scratch on your data all models used in this example. We recommend you checkout the following, more in-depth, tutorials next:\n",
                "\n",
                "* [NeMo fundamentals](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/00_NeMo_Primer.ipynb)\n",
                "* [NeMo models](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/01_NeMo_Models.ipynb)\n",
                "* [Speech Recognition](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/asr/ASR_with_NeMo.ipynb)\n",
                "* [Punctuation and Capitalization](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/nlp/Punctuation_and_Capitalization.ipynb)\n",
                "* [Speech Synthesis](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/tts/Inference_ModelSelect.ipynb)\n",
                "\n",
                "\n",
                "You can find scripts for training and fine-tuning ASR, NLP and TTS models [here](https://github.com/NVIDIA/NeMo/tree/main/examples). "
            ]
        }
    ],
    "metadata": {
        "accelerator": "GPU",
        "colab": {
            "name": "NeMo Getting Started",
            "provenance": [],
            "toc_visible": true
        },
        "kernelspec": {
            "display_name": "Python 3 (ipykernel)",
            "language": "python",
            "name": "python3"
        },
        "language_info": {
            "codemirror_mode": {
                "name": "ipython",
                "version": 3
            },
            "file_extension": ".py",
            "mimetype": "text/x-python",
            "name": "python",
            "nbconvert_exporter": "python",
            "pygments_lexer": "ipython3",
            "version": "3.8.12"
        }
    },
    "nbformat": 4,
    "nbformat_minor": 4
}