{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"\n", "You can run either this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.\n", "\n", "Instructions for setting up Colab are as follows:\n", "1. Open a new Python 3 notebook.\n", "2. Import this notebook from GitHub (File -> Upload Notebook -> \"GITHUB\" tab -> copy/paste GitHub URL)\n", "3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select \"GPU\" for hardware accelerator)\n", "4. Run this cell to set up dependencies.\n", "5. Restart the runtime (Runtime -> Restart Runtime) for any upgraded packages to take effect\n", "\"\"\"\n", "\n", "NEMO_DIR_PATH = \"NeMo\"\n", "BRANCH = 'r1.17.0'\n", "\n", "! git clone https://github.com/NVIDIA/NeMo\n", "%cd NeMo\n", "! python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]\n", "%cd .." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Install dependencies\n", "!apt-get install sox libsndfile1 ffmpeg\n", "!pip install wget\n", "!pip install text-unidecode\n", "!pip install \"matplotlib>=3.3.2\"\n", "!pip install seaborn" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Multispeaker Simulator\n", "\n", "This tutorial shows how to use the speech data simulator to generate synthetic multispeaker audio sessions that can be used to train or evaluate models for multispeaker ASR or speaker diarization. This tool aims to address the lack of labelled multispeaker training data and to help models deal with overlapping speech." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Step 1: Download Required Datasets\n", "\n", "The LibriSpeech dataset and corresponding word alignments are required for generating synthetic multispeaker audio sessions. For simplicity, only the dev-clean dataset is used for generating synthetic sessions." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "# download scripts if not already there \n", "if not os.path.exists('NeMo/scripts'):\n", " print(\"Downloading necessary scripts\")\n", " !mkdir -p NeMo/scripts/dataset_processing\n", " !mkdir -p NeMo/scripts/speaker_tasks\n", " !wget -P NeMo/scripts/dataset_processing/ https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/scripts/dataset_processing/get_librispeech_data.py\n", " !wget -P NeMo/scripts/speaker_tasks/ https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/scripts/speaker_tasks/create_alignment_manifest.py" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!mkdir -p LibriSpeech\n", "!python {NEMO_DIR_PATH}/scripts/dataset_processing/get_librispeech_data.py \\\n", " --data_root LibriSpeech \\\n", " --data_sets dev_clean" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The LibriSpeech forced word alignments are from [this repository](https://github.com/CorentinJ/librispeech-alignments). You can access to the whole LibriSpeech splits at this google drive [link](https://drive.google.com/file/d/1WYfgr31T-PPwMcxuAq09XZfHQO5Mw8fE/view?usp=sharing). We will download the dev-clean part for demo purpose." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!wget -nc https://dldata-public.s3.us-east-2.amazonaws.com/LibriSpeech_Alignments.tar.gz\n", "!tar -xzf LibriSpeech_Alignments.tar.gz\n", "!rm -f LibriSpeech_Alignments.tar.gz" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Step 2: Produce Manifest File with Forced Alignments\n", "\n", "The LibriSpeech manifest file and LibriSpeech forced alignments will now be merged into one manifest file for ease of use when generating synthetic data. \n", "\n", "Here, the input LibriSpeech alignments are first converted to CTM files, and the CTM files are then combined with the base manifest in order to produce the manifest file with word alignments. When using another dataset, the --use_ctm argument can be used to generate the manifest file using alignments in CTM files." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!python NeMo/scripts/speaker_tasks/create_alignment_manifest.py \\\n", " --input_manifest_filepath LibriSpeech/dev_clean.json \\\n", " --base_alignment_path LibriSpeech_Alignments \\\n", " --output_manifest_filepath ./dev-clean-align.json \\\n", " --ctm_output_directory ./ctm_out \\\n", " --libri_dataset_split dev-clean" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Step 3: Download Background Noise Data\n", "\n", "The background noise is constructed from [here](https://www.openslr.org/resources/28/rirs_noises.zip) (although it can be constructed from other background noise datasets instead)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!wget -nc https://www.openslr.org/resources/28/rirs_noises.zip\n", "!unzip -o rirs_noises.zip\n", "!rm -f rirs_noises.zip" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Simulator Overview\n", "\n", "The simulator creates a speech session using utterances from a desired number of speakers. The simulator first selects the LibriSpeech speaker IDs that will be used for the current session, and sets speaker dominance values for each speaker to determine how often each speaker will talk in the session. The session is then constructed by iterating through the following steps:\n", "\n", "* The next speaker is selected (which could be the same speaker again with some probability, and which accounts for the speaker dominance values).\n", "* The sentence length is determined using a probability distribution, and an utterance of the desired length is then constructed by concatenating together (or truncating) LibriSpeech sentences corresponding to the desired speaker. Individual word alignments are used to truncate the last LibriSpeech sentence such that the entire utterance has the desired length.\n", "* Next, either the current utterance is overlapped with a previous utterance or silence is introduced before inserting the current utterance. \n", "\n", "The simulator includes a multi-microphone far-field mode that incorporates synthetic room impulse response generation in order to simulate multi-microphone multispeaker sessions. When using RIR generation, the RIR is computed once per batch of sessions, and then each constructed utterance is convolved with the RIR in order to get the sound recorded by each microphone before adding the utterance to the audio session. 
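"\n", "To make this loop concrete, below is a minimal illustrative sketch of the control flow. It is not the actual `MultiSpeakerSimulator` implementation: real utterances, word alignments, and RIR convolution are omitted, and the distributions and parameter values are arbitrary stand-ins chosen only for illustration.\n", "\n", "```python\n", "import numpy as np\n", "\n", "rng = np.random.default_rng(42)\n", "num_speakers, session_len, overlap_prob, same_spk_prob = 4, 30.0, 0.15, 0.3\n", "\n", "# per-speaker dominance values control how often each speaker talks\n", "dominance = rng.dirichlet(np.ones(num_speakers))\n", "\n", "t, speaker, timeline = 0.0, int(rng.integers(num_speakers)), []\n", "while t < session_len:\n", "    # with some probability the same speaker continues, otherwise sample by dominance\n", "    if rng.random() > same_spk_prob:\n", "        speaker = int(rng.choice(num_speakers, p=dominance))\n", "    # utterance length drawn from a simple distribution (the real simulator\n", "    # concatenates/truncates LibriSpeech sentences using word alignments)\n", "    dur = float(rng.gamma(shape=2.0, scale=1.5))\n", "    # either overlap with the previous utterance or insert a silence gap first\n", "    if timeline and rng.random() < overlap_prob:\n", "        start = max(t - rng.uniform(0.1, 0.5) * dur, 0.0)\n", "    else:\n", "        start = t + float(rng.exponential(0.3))\n", "    timeline.append((speaker, start, start + dur))\n", "    t = start + dur\n", "\n", "for spk, s, e in timeline[:5]:\n", "    print(f'speaker {spk}: {s:6.2f}s - {e:6.2f}s')\n", "```\n",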
"\n", "In this tutorial, only near-field sessions will be generated.\n", "\n", "The simulator also has a speaker enforcement mode, which ensures that the desired number of speakers actually appears in each session: since speaker turns are probabilistic, fewer speakers than requested might otherwise be present. In speaker enforcement mode, the session length or the speaker probabilities may be adjusted to ensure that all speakers are included before the session finishes." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Parameters" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from omegaconf import OmegaConf\n", "ROOT = os.getcwd()\n", "conf_dir = os.path.join(ROOT,'conf')\n", "!mkdir -p $conf_dir\n", "CONFIG_PATH = os.path.join(conf_dir, 'data_simulator.yaml')\n", "if not os.path.exists(CONFIG_PATH):\n", " !wget -P $conf_dir https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/tools/speech_data_simulator/conf/data_simulator.yaml\n", "\n", "config = OmegaConf.load(CONFIG_PATH)\n", "print(OmegaConf.to_yaml(config))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Step 4: Create Background Noise Manifest" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!find RIRS_NOISES/real_rirs_isotropic_noises/*.wav > bg_noise.list\n", "\n", "# this function can also be run using the pathfiles_to_diarize_manifest.py script\n", "from nemo.collections.asr.parts.utils.manifest_utils import create_manifest\n", "create_manifest('bg_noise.list', 'bg_noise.json')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Step 5: Generate Simulated Audio Session\n", "\n", "A single 4-speaker session of 30 seconds is generated as an example." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from nemo.collections.asr.data.data_simulation import MultiSpeakerSimulator\n", "\n", "ROOT = os.getcwd()\n", "data_dir = os.path.join(ROOT,'simulated_data')\n", "config.data_simulator.random_seed=42\n", "config.data_simulator.manifest_filepath=\"./dev-clean-align.json\"\n", "config.data_simulator.outputs.output_dir=data_dir\n", "config.data_simulator.session_config.num_sessions=1\n", "config.data_simulator.session_config.session_length=30\n", "config.data_simulator.background_noise.add_bg=True\n", "config.data_simulator.background_noise.background_manifest=\"./bg_noise.json\"\n", "config.data_simulator.session_params.mean_silence=0.2\n", "config.data_simulator.session_params.mean_silence_var=0.02\n", "config.data_simulator.session_params.mean_overlap=0.15\n", "\n", "lg = MultiSpeakerSimulator(cfg=config)\n", "lg.generate_sessions()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Step 6: Listen to and Visualize Session\n", "\n", "Listen to the audio and visualize the corresponding speaker timestamps (recorded in an RTTM file for each session)." ] },
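{ "cell_type": "markdown", "metadata": {}, "source": [ "Before plotting, you can optionally print the first few lines of the generated RTTM file to see the raw speaker-turn annotations. The path below assumes the default output directory and session naming used in Step 5 and in the next cell." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional: peek at the raw RTTM speaker-turn annotations of the generated session\n", "!head -n 5 simulated_data/multispeaker_session_0.rttm" ] },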
{ "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "import os\n", "import wget\n", "import IPython\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import librosa\n", "from nemo.collections.asr.parts.utils.speaker_utils import rttm_to_labels, labels_to_pyannote_object\n", "\n", "ROOT = os.getcwd()\n", "data_dir = os.path.join(ROOT,'simulated_data')\n", "audio = os.path.join(data_dir,'multispeaker_session_0.wav')\n", "rttm = os.path.join(data_dir,'multispeaker_session_0.rttm')\n", "\n", "sr = 16000\n", "signal, sr = librosa.load(audio, sr=sr)\n", "\n", "fig,ax = plt.subplots(1,1)\n", "fig.set_figwidth(20)\n", "fig.set_figheight(2)\n", "plt.plot(np.arange(len(signal)),signal,'gray')\n", "fig.suptitle('Synthetic Audio Session', fontsize=16)\n", "plt.xlabel('Time (s)', fontsize=18)\n", "ax.margins(x=0)\n", "plt.ylabel('Signal Strength', fontsize=16);\n", "a,_ = plt.xticks();plt.xticks(a,a/sr);\n", "\n", "IPython.display.Audio(audio)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The visualization is useful for seeing the distribution of utterance lengths, the differing speaker dominance values, and the amount of overlap in the session." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# display speaker labels for reference\n", "labels = rttm_to_labels(rttm)\n", "reference = labels_to_pyannote_object(labels)\n", "reference" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Step 7: Get Simulated Data Statistics" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "if not os.path.exists(\"multispeaker_data_analysis.py\"):\n", " !wget https://raw.githubusercontent.com/NVIDIA/NeMo/main/scripts/speaker_tasks/multispeaker_data_analysis.py\n", "\n", "from multispeaker_data_analysis import run_multispeaker_data_analysis\n", "\n", "# target statistics used for generation in Step 5\n", "session_dur = 30\n", "silence_mean = 0.2\n", "silence_var = 0.02\n", "overlap_mean = 0.15\n", "run_multispeaker_data_analysis(data_dir, session_dur=session_dur, silence_var=silence_var, overlap_mean=overlap_mean, precise=True);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Files Produced\n", "\n", "The following files are produced by the simulator:\n", "\n", "* wav files (one per audio session) - the output audio sessions\n", "* rttm files (one per audio session) - the speaker timestamps for the corresponding audio session (used for diarization training)\n", "* json files (one per audio session) - the output manifest file for the corresponding audio session (containing text transcriptions, utterance durations, full paths to audio files, words, and word alignments)\n", "* ctm files (one per audio session) - the word-by-word alignments for the session, with the speaker ID and word for each entry\n", "* txt files (one per audio session) - the full text transcription for the session\n", "* list files (one per file type per batch of sessions) - a list of generated files of the given type (wav, rttm, json, ctm, or txt), used primarily for manifest creation\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!ls simulated_data" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" }, "pycharm": { "stem_cell": { "cell_type": "raw", "metadata": { "collapsed": false }, "source": [] } } }, "nbformat": 4, "nbformat_minor": 4 }