{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"id": "6a5c0357",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
},
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# Ensure datasets is installed from main. Uncomment the following line if you face issues running this script:\n",
"# !pip install git+https://github.com/huggingface/datasets"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "794aaced",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
},
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"from datasets import Audio, interleave_datasets, IterableDataset, load_dataset\n",
"from typing import List, Optional"
]
},
{
"cell_type": "markdown",
"id": "f210ca9a-486b-46a2-a675-2526a9bd83f5",
"metadata": {},
"source": [
"### Define the dataset attributes"
]
},
{
"cell_type": "markdown",
"id": "fc07293f-3ba4-4e89-a4ca-8e39409a8373",
"metadata": {},
"source": [
"In this example, we'll show to combine the Common Voice 11, VoxPopuli, Mulitlingual LibriSpeech and FLEURS datasets for Spanish, giving a training corpus equal to the sum of the individual datasets. This is particularly beneficial in low-resource settings, where any one of the datasets alone might have insufficient data to train a model.\n",
"\n",
"We need to specify the dataset names on the Hub, the corresponding configs and finally the text column names for the transcriptions:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "c53344f3-c315-430a-a2f3-57aea6bb0e17",
"metadata": {},
"outputs": [],
"source": [
"dataset_names = [\"mozilla-foundation/common_voice_11_0\", \"facebook/voxpopuli\", \"facebook/multilingual_librispeech\", \"google/fleurs\"]\n",
"dataset_config_names = [\"es\", \"es\", \"spanish\", \"es_419\"]\n",
"text_column_names = [\"sentence\", \"normalized_text\", \"text\", \"transcription\"]"
]
},
{
"cell_type": "markdown",
"id": "215541f6-ee1c-4104-b43c-fa3f7fce0494",
"metadata": {},
"source": [
"### Define the merging function"
]
},
{
"cell_type": "markdown",
"id": "b722a48b-c576-4a63-b2a2-3c264890a75f",
"metadata": {},
"source": [
"We define a function, `load_multiple_streaming_datasets`, that takes as argument a list of datasets, configs, splits (optional) and text column names (optional). It sets them to a specified sampling rate and interleaves them together, giving one merged dataset. This is all \n",
"done in _streaming mode_: as we iterate over the merged dataset we load samples one-by-one on the fly. No data is\n",
"saved to disk.\n",
"\n",
"We can also specify our strategy for interleaving datasets. The default strategy, `all_exhausted` is an oversampling \n",
"strategy. In this case, the dataset construction is stopped as soon as every samples in every dataset \n",
"has been added at least once. In practice, it means that if a dataset is exhausted, it will return to the \n",
"beginning of this dataset until the stop criterion has been reached. You can specify `stopping_strategy=first_exhausted` \n",
"for a subsampling strategy, i.e the dataset construction is stopped as soon one of the dataset runs out of samples. "
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "61eb4cb1-ee27-4270-a474-1bb33e1df65f",
"metadata": {},
"outputs": [],
"source": [
"def load_multiple_streaming_datasets(\n",
" dataset_names: List,\n",
" dataset_config_names: List,\n",
" splits: Optional[List] = None,\n",
" text_column_names: Optional[List] = None,\n",
" sampling_rate: Optional[int] = 16000,\n",
" stopping_strategy: Optional[str] = \"all_exhausted\",\n",
" **kwargs\n",
") -> IterableDataset:\n",
"\n",
" if len(dataset_names) != len(dataset_config_names):\n",
" raise ValueError(\n",
" f\"Ensure one config is passed for each dataset, got {len(dataset_names)} datasets and\"\n",
" f\" {len(dataset_config_names)} configs.\"\n",
" )\n",
"\n",
" if splits is not None and len(splits) != len(dataset_names):\n",
" raise ValueError(\n",
" f\"Ensure one split is passed for each dataset, got {len(dataset_names)} datasets and {len(splits)} splits.\"\n",
" )\n",
"\n",
" if text_column_names is not None and len(text_column_names) != len(dataset_names):\n",
" raise ValueError(\n",
" f\"Ensure one text column name is passed for each dataset, got {len(dataset_names)} datasets and\"\n",
" f\" {len(text_column_names)} text column names.\"\n",
" )\n",
"\n",
" splits = splits if splits is not None else [\"train\" for i in range(len(dataset_names))]\n",
" text_column_names = (\n",
" text_column_names if text_column_names is not None else [\"text\" for i in range(len(dataset_names))]\n",
" )\n",
"\n",
" all_datasets = []\n",
" # iterate over the datasets we want to interleave\n",
" for i, dataset_name in enumerate(dataset_names):\n",
" dataset = load_dataset(dataset_name, dataset_config_names[i], split=splits[i], streaming=True, **kwargs)\n",
" # resample to specified sampling rate\n",
" dataset = dataset.cast_column(\"audio\", Audio(sampling_rate))\n",
" # normalise columns to [\"audio\", \"sentence\"]\n",
" if text_column_names[i] != \"sentence\":\n",
" dataset = dataset.rename_column(text_column_names[i], \"sentence\")\n",
" dataset = dataset.remove_columns(set(dataset.features.keys()) - set([\"audio\", \"sentence\"]))\n",
" all_datasets.append(dataset)\n",
"\n",
" interleaved_dataset = interleave_datasets(all_datasets, stopping_strategy=stopping_strategy)\n",
" return interleaved_dataset"
]
},
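{
"cell_type": "markdown",
"id": "9c1d4e2f",
"metadata": {},
"source": [
"To make the two stopping strategies concrete, here is a minimal sketch on two toy text-only datasets (`toy_long` and `toy_short` are purely illustrative, not part of our training data):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3f8a6b5c",
"metadata": {},
"outputs": [],
"source": [
"from datasets import Dataset, interleave_datasets\n",
"\n",
"toy_long = Dataset.from_dict({\"sentence\": [\"a\", \"b\", \"c\", \"d\"]})\n",
"toy_short = Dataset.from_dict({\"sentence\": [\"x\", \"y\"]})\n",
"\n",
"# all_exhausted (oversampling): the shorter dataset wraps around until every\n",
"# sample in every dataset has been seen at least once\n",
"oversampled = interleave_datasets([toy_long, toy_short], stopping_strategy=\"all_exhausted\")\n",
"print([sample[\"sentence\"] for sample in oversampled])\n",
"\n",
"# first_exhausted (subsampling): stops as soon as the shorter dataset runs out\n",
"subsampled = interleave_datasets([toy_long, toy_short], stopping_strategy=\"first_exhausted\")\n",
"print([sample[\"sentence\"] for sample in subsampled])"
]
},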
{
"cell_type": "markdown",
"id": "29bc228b-ce9b-4cee-9092-1223ddfa51ad",
"metadata": {},
"source": [
"Let's apply this function to load and merge our four datasets:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "8ae90f83-4ecd-46a3-98be-bd75706e0d88",
"metadata": {},
"outputs": [],
"source": [
"ds = load_multiple_streaming_datasets(dataset_names, dataset_config_names=dataset_config_names, text_column_names=text_column_names, use_auth_token=True)"
]
},
{
"cell_type": "markdown",
"id": "6056a693-1fb0-45f4-ad43-be5f1812c1a5",
"metadata": {},
"source": [
"### Iterate over the dataset"
]
},
{
"cell_type": "markdown",
"id": "7ffe011f-f905-4027-ab67-5c9c3b2b5ac0",
"metadata": {},
"source": [
"We iterate over the dataset, loading and merging samples on the fly. Let's print the transcriptions for the first 10 samples of our merged dataset:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "75b3355a-3c06-4d23-af43-2b93b1ad70b2",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Reading metadata...: 230467it [00:41, 5545.80it/s]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 ¿ Qué tal a tres de cinco ?\n",
"1 y desde luego esa razón no puede tener que ver con la explicación surrealista que hemos escuchado más de una vez de que se trata de una conspiración izquierdista.\n",
"2 para exclamar con voz de acción de gracias y para contar todas tus maravillas jehová la habitación de tu casa he amado y el lugar del tabernáculo de tu gloria no juntes con los pecadores mi alma ni con los hombres de sangres mi vida\n",
"3 el uso de internet y de la red informática mundial permite que los estudiantes tengan acceso a la información en todo momento\n",
"4 vamos , quiero decir , que no soy de citas especiales .\n",
"5 si bien esta lista no es perfecta sí que resulta necesario que las entidades financieras refuercen sus controles.\n",
"6 oye oh jehová mi voz con que á ti clamo y ten misericordia de mí respóndeme mi corazón ha dicho de ti buscad mi rostro tu rostro buscaré oh jehová\n",
"7 los deportes de nieve en descenso como el esquí y la tablanieve son disciplinas populares que consisten en deslizarse con esquís o una tabla fijada a los pies sobre un terreno nevado\n",
"8 fray Lope , en aquel momento , colmaba otro vaso igual :\n",
"9 señora presidenta la competitividad es importante pero no puede ser el único criterio.\n"
]
}
],
"source": [
"for i, sample in enumerate(ds):\n",
" print(i, sample[\"sentence\"])\n",
" if i == 9:\n",
" break"
]
},
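{
"cell_type": "markdown",
"id": "5a2e7c9d",
"metadata": {},
"source": [
"Each merged sample also carries the decoded audio. As a quick sanity check (an illustrative cell; re-iterating triggers the metadata read again, so it may take a moment), we can confirm that the first sample's audio was resampled to the 16 kHz rate we set with `cast_column`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7d4f1b8e",
"metadata": {},
"outputs": [],
"source": [
"sample = next(iter(ds))\n",
"# the Audio feature decodes each sample to a dict with \"array\" and \"sampling_rate\" keys\n",
"print(sample[\"audio\"][\"sampling_rate\"])\n",
"print(sample[\"audio\"][\"array\"].shape)"
]
},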
{
"cell_type": "markdown",
"id": "42d5ad08-b20e-4cba-a1a9-909fdbf030d4",
"metadata": {},
"source": [
"We can see that the transcriptions take several different formats. Those from Common Voice 11 are cased and punctuated. Those from VoxPopuli are punctuated only. Those from Multilingual LibriSpeech and FLEURS are neither cased not punctuated. We need to normalise the transcriptions to a uniform format before training our model. \n",
"\n",
"The following code cell is lifted from the Whisper training notebook: https://github.com/huggingface/community-events/blob/main/whisper-fine-tuning-event/fine-tune-whisper-streaming.ipynb"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "ed20e9cd-31c2-44cb-872b-333378a92fd1",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/sanchitgandhi/venv/lib/python3.8/site-packages/jax/_src/lib/__init__.py:33: UserWarning: JAX on Mac ARM machines is experimental and minimally tested. Please see https://github.com/google/jax/issues/5501 in the event of problems.\n",
" warnings.warn(\"JAX on Mac ARM machines is experimental and minimally tested. \"\n"
]
}
],
"source": [
"from transformers.models.whisper.english_normalizer import BasicTextNormalizer\n",
"\n",
"do_lower_case = True\n",
"do_remove_punctuation = True\n",
"\n",
"normalizer = BasicTextNormalizer()"
]
},
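{
"cell_type": "markdown",
"id": "2b6c8d3a",
"metadata": {},
"source": [
"To illustrate what `BasicTextNormalizer` does before we wire it into our pre-processing function, we can run it on a made-up transcription; it lowercases the text and replaces punctuation with spaces:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8e5d2f41",
"metadata": {},
"outputs": [],
"source": [
"print(normalizer(\"¿Qué tal, a tres de cinco?\"))"
]
},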
{
"cell_type": "markdown",
"id": "01d13029-c24f-4a51-aff2-9251a2ceb4ce",
"metadata": {},
"source": [
"Now we define a function to normalise our transcriptions:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "26e42417-4bd2-46f8-914e-3a6f9f3471ac",
"metadata": {},
"outputs": [],
"source": [
"def normalize_transcriptions(batch):\n",
" # optional pre-processing steps\n",
" transcription = batch[\"sentence\"]\n",
" if do_lower_case:\n",
" transcription = transcription.lower()\n",
" if do_remove_punctuation:\n",
" transcription = normalizer(transcription).strip()\n",
" batch[\"sentence\"] = transcription\n",
" return batch"
]
},
{
"cell_type": "markdown",
"id": "3b1c67fe-be4b-4ee5-9a1f-0d444f2b5c62",
"metadata": {},
"source": [
"Let's apply the data pre-processing steps to our dataset and view the first 10 samples again:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "0babac71-9157-4d0f-a8a8-184547bdf501",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Reading metadata...: 230467it [00:32, 6984.59it/s] \n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 qué tal a tres de cinco \n",
"1 y desde luego esa razón no puede tener que ver con la explicación surrealista que hemos escuchado más de una vez de que se trata de una conspiración izquierdista \n",
"2 para exclamar con voz de acción de gracias y para contar todas tus maravillas jehová la habitación de tu casa he amado y el lugar del tabernáculo de tu gloria no juntes con los pecadores mi alma ni con los hombres de sangres mi vida\n",
"3 el uso de internet y de la red informática mundial permite que los estudiantes tengan acceso a la información en todo momento\n",
"4 vamos quiero decir que no soy de citas especiales \n",
"5 si bien esta lista no es perfecta sí que resulta necesario que las entidades financieras refuercen sus controles \n",
"6 oye oh jehová mi voz con que á ti clamo y ten misericordia de mí respóndeme mi corazón ha dicho de ti buscad mi rostro tu rostro buscaré oh jehová\n",
"7 los deportes de nieve en descenso como el esquí y la tablanieve son disciplinas populares que consisten en deslizarse con esquís o una tabla fijada a los pies sobre un terreno nevado\n",
"8 fray lope en aquel momento colmaba otro vaso igual \n",
"9 señora presidenta la competitividad es importante pero no puede ser el único criterio \n"
]
}
],
"source": [
"ds = ds.map(normalize_transcriptions)\n",
"\n",
"for i, sample in enumerate(ds):\n",
" print(i, sample[\"sentence\"])\n",
" if i == 9:\n",
" break"
]
},
{
"cell_type": "markdown",
"id": "d135627a-a7aa-458c-94b8-57ddeae74a72",
"metadata": {},
"source": [
"This time the transcriptions are in a consistent format. We can use this data to fine-tune our Whisper model. Note that since we've removed punctuation and casing, the Whisper model won't learn to predict these features."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.9"
}
},
"nbformat": 4,
"nbformat_minor": 5
}