{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "RYGnI-EZp_nK" }, "source": [ "# Getting Started: Sample Conversational AI application\n", "This notebook shows how to use NVIDIA NeMo (https://github.com/NVIDIA/NeMo) to construct a toy demo which translate Mandarin audio file into English one.\n", "\n", "The demo demonstrates how to: \n", "\n", "* Instantiate pre-trained NeMo models from NVIDIA NGC.\n", "* Transcribe audio with (Mandarin) speech recognition model.\n", "* Translate text with machine translation model.\n", "* Generate audio with text-to-speech models." ] }, { "cell_type": "markdown", "metadata": { "id": "V72HXYuQ_p9a" }, "source": [ "## Installation\n", "NeMo can be installed via simple pip command.\n", "This will take about 4 minutes.\n", "\n", "(The installation method below should work inside your new Conda environment or in an NVIDIA docker container.)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "efDmTWf1_iYK" }, "outputs": [], "source": [ "BRANCH = 'r1.17.0'\n", "!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]\n" ] }, { "cell_type": "markdown", "metadata": { "id": "EyJ5HiiPrPKA" }, "source": [ "## Import all necessary packages" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "tdUqxeUEA8nw" }, "outputs": [], "source": [ "# Import NeMo and it's ASR, NLP and TTS collections\n", "import nemo\n", "# Import Speech Recognition collection\n", "import nemo.collections.asr as nemo_asr\n", "# Import Natural Language Processing colleciton\n", "import nemo.collections.nlp as nemo_nlp\n", "# Import Speech Synthesis collection\n", "import nemo.collections.tts as nemo_tts\n", "# We'll use this to listen to audio\n", "import IPython" ] }, { "cell_type": "markdown", "metadata": { "id": "bt2EZyU3A1aq" }, "source": [ "## Instantiate pre-trained NeMo models\n", "\n", "Every NeMo model has these methods:\n", "\n", "* ``list_available_models()`` - it will list all models currently available on NGC and their names.\n", "\n", "* ``from_pretrained(...)`` API downloads and initialized model directly from the NGC using model name.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "YNNHs5Xjr8ox", "scrolled": true }, "outputs": [], "source": [ "# Here is an example of all CTC-based models:\n", "nemo_asr.models.EncDecCTCModel.list_available_models()\n", "# More ASR Models are available - see: nemo_asr.models.ASRModel.list_available_models()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "1h9nhICjA5Dk", "scrolled": true }, "outputs": [], "source": [ "# Speech Recognition model - Citrinet initially trained on Multilingual LibriSpeech English corpus, and fine-tuned on the open source Aishell-2\n", "asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name=\"stt_zh_citrinet_1024_gamma_0_25\").cuda()\n", "\n", "# Neural Machine Translation model\n", "nmt_model = nemo_nlp.models.MTEncDecModel.from_pretrained(model_name='nmt_zh_en_transformer6x6').cuda()\n", "\n", "# Spectrogram generator which takes text as an input and produces spectrogram\n", "spectrogram_generator = nemo_tts.models.FastPitchModel.from_pretrained(model_name=\"tts_en_fastpitch\").cuda()\n", "\n", "# Vocoder model which takes spectrogram and produces actual audio\n", "vocoder = nemo_tts.models.HifiGanModel.from_pretrained(model_name=\"tts_en_hifigan\").cuda()" ] }, { "cell_type": "markdown", "metadata": { "id": "KPota-JtsqSY" }, "source": [ "## Get an audio sample in Mandarin" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "7cGCEKkcLr52" }, "outputs": [], "source": [ "# Download audio sample which we'll try\n", "# This is a sample from MCV 6.1 Dev dataset - the model hasn't seen it before\n", "# IMPORTANT: The audio must be mono with 16Khz sampling rate\n", "audio_sample = 'common_voice_zh-CN_21347786.mp3'\n", "!wget 'https://nemo-public.s3.us-east-2.amazonaws.com/zh-samples/common_voice_zh-CN_21347786.mp3'\n", "# To listen it, click on the play button below\n", "IPython.display.Audio(audio_sample)" ] }, { "cell_type": "markdown", "metadata": { "id": "BaCdNJhhtBfM" }, "source": [ "## Transcribe audio file\n", "We will use speech recognition model to convert audio into text.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "KTA7jM6sL6yC" }, "outputs": [], "source": [ "transcribed_text = asr_model.transcribe([audio_sample])\n", "print(transcribed_text)" ] }, { "cell_type": "markdown", "metadata": { "id": "BjYb2TMtttCc" }, "source": [ "## Translate Chinese text into English\n", "NeMo's NMT models have a handy ``.translate()`` method." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "kQTdE4b9Nm9O" }, "outputs": [], "source": [ "english_text = nmt_model.translate(transcribed_text)\n", "print(english_text)" ] }, { "cell_type": "markdown", "metadata": { "id": "9Rppc59Ut7uy" }, "source": [ "## Generate English audio from text\n", "Speech generation from text typically has two steps:\n", "* Generate spectrogram from the text. In this example we will use FastPitch model for this.\n", "* Generate actual audio from the spectrogram. In this example we will use HifiGan model for this.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "wpMYfufgNt15" }, "outputs": [], "source": [ "# A helper function which combines FastPitch and HifiGan to go directly from \n", "# text to audio\n", "def text_to_audio(text):\n", " parsed = spectrogram_generator.parse(text)\n", " spectrogram = spectrogram_generator.generate_spectrogram(tokens=parsed)\n", " audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)\n", " return audio.to('cpu').detach().numpy()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Listen to generated audio in English\n", "IPython.display.Audio(text_to_audio(english_text[0]), rate=22050)" ] }, { "cell_type": "markdown", "metadata": { "id": "LiQ_GQpcBYUs" }, "source": [ "## Next steps\n", "A demo like this is great for prototyping and experimentation. However, for real production deployment, you would want to use a service like [NVIDIA Riva](https://developer.nvidia.com/riva).\n", "\n", "**NeMo is built for training.** You can fine-tune, or train from scratch on your data all models used in this example. We recommend you checkout the following, more in-depth, tutorials next:\n", "\n", "* [NeMo fundamentals](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/00_NeMo_Primer.ipynb)\n", "* [NeMo models](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/01_NeMo_Models.ipynb)\n", "* [Speech Recognition](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/asr/ASR_with_NeMo.ipynb)\n", "* [Punctuation and Capitalization](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/nlp/Punctuation_and_Capitalization.ipynb)\n", "* [Speech Synthesis](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/tts/Inference_ModelSelect.ipynb)\n", "\n", "\n", "You can find scripts for training and fine-tuning ASR, NLP and TTS models [here](https://github.com/NVIDIA/NeMo/tree/main/examples). " ] } ], "metadata": { "accelerator": "GPU", "colab": { "name": "NeMo Getting Started", "provenance": [], "toc_visible": true }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.12" } }, "nbformat": 4, "nbformat_minor": 4 }