{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"id": "6a5c0357",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
},
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# Ensure datasets is installed from main. Uncomment the following line if you face issues running this script:\n",
"# !pip install git+https://github.com/huggingface/datasets"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "794aaced",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
},
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"from datasets import Audio, interleave_datasets, IterableDataset, load_dataset\n",
"from typing import List, Optional"
]
},
{
"cell_type": "markdown",
"id": "f210ca9a-486b-46a2-a675-2526a9bd83f5",
"metadata": {},
"source": [
"### Define the dataset attributes"
]
},
{
"cell_type": "markdown",
"id": "fc07293f-3ba4-4e89-a4ca-8e39409a8373",
"metadata": {},
"source": [
"In this example, we'll show to combine the Common Voice 11, VoxPopuli, Mulitlingual LibriSpeech and FLEURS datasets for Spanish, giving a training corpus equal to the sum of the individual datasets. This is particularly beneficial in low-resource settings, where any one of the datasets alone might have insufficient data to train a model.\n",
"\n",
"We need to specify the dataset names on the Hub, the corresponding configs and finally the text column names for the transcriptions:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "c53344f3-c315-430a-a2f3-57aea6bb0e17",
"metadata": {},
"outputs": [],
"source": [
"dataset_names = [\"mozilla-foundation/common_voice_11_0\", \"facebook/voxpopuli\", \"facebook/multilingual_librispeech\", \"google/fleurs\"]\n",
"dataset_config_names = [\"es\", \"es\", \"spanish\", \"es_419\"]\n",
"text_column_names = [\"sentence\", \"normalized_text\", \"text\", \"transcription\"]"
]
},
{
"cell_type": "markdown",
"id": "215541f6-ee1c-4104-b43c-fa3f7fce0494",
"metadata": {},
"source": [
"### Define the merging function"
]
},
{
"cell_type": "markdown",
"id": "b722a48b-c576-4a63-b2a2-3c264890a75f",
"metadata": {},
"source": [
"We define a function, `load_multiple_streaming_datasets`, that takes as argument a list of datasets, configs, splits (optional) and text column names (optional). It sets them to a specified sampling rate and interleaves them together, giving one merged dataset. This is all \n",
"done in _streaming mode_: as we iterate over the merged dataset we load samples one-by-one on the fly. No data is\n",
"saved to disk.\n",
"\n",
"We can also specify our strategy for interleaving datasets. The default strategy, `all_exhausted` is an oversampling \n",
"strategy. In this case, the dataset construction is stopped as soon as every samples in every dataset \n",
"has been added at least once. In practice, it means that if a dataset is exhausted, it will return to the \n",
"beginning of this dataset until the stop criterion has been reached. You can specify `stopping_strategy=first_exhausted` \n",
"for a subsampling strategy, i.e the dataset construction is stopped as soon one of the dataset runs out of samples. "
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "61eb4cb1-ee27-4270-a474-1bb33e1df65f",
"metadata": {},
"outputs": [],
"source": [
"def load_multiple_streaming_datasets(\n",
" dataset_names: List,\n",
" dataset_config_names: List,\n",
" splits: Optional[List] = None,\n",
" text_column_names: Optional[List] = None,\n",
" sampling_rate: Optional[int] = 16000,\n",
" stopping_strategy: Optional[str] = \"all_exhausted\",\n",
" **kwargs\n",
") -> IterableDataset:\n",
"\n",
" if len(dataset_names) != len(dataset_config_names):\n",
" raise ValueError(\n",
" f\"Ensure one config is passed for each dataset, got {len(dataset_names)} datasets and\"\n",
" f\" {len(dataset_config_names)} configs.\"\n",
" )\n",
"\n",
" if splits is not None and len(splits) != len(dataset_names):\n",
" raise ValueError(\n",
" f\"Ensure one split is passed for each dataset, got {len(dataset_names)} datasets and {len(splits)} splits.\"\n",
" )\n",
"\n",
" if text_column_names is not None and len(text_column_names) != len(dataset_names):\n",
" raise ValueError(\n",
" f\"Ensure one text column name is passed for each dataset, got {len(dataset_names)} datasets and\"\n",
" f\" {len(text_column_names)} text column names.\"\n",
" )\n",
"\n",
" splits = splits if splits is not None else [\"train\" for i in range(len(dataset_names))]\n",
" text_column_names = (\n",
" text_column_names if text_column_names is not None else [\"text\" for i in range(len(dataset_names))]\n",
" )\n",
"\n",
" all_datasets = []\n",
" # iterate over the datasets we want to interleave\n",
" for i, dataset_name in enumerate(dataset_names):\n",
" dataset = load_dataset(dataset_name, dataset_config_names[i], split=splits[i], streaming=True, **kwargs)\n",
" # resample to specified sampling rate\n",
" dataset = dataset.cast_column(\"audio\", Audio(sampling_rate))\n",
" # normalise columns to [\"audio\", \"sentence\"]\n",
" if text_column_names[i] != \"sentence\":\n",
" dataset = dataset.rename_column(text_column_names[i], \"sentence\")\n",
" dataset = dataset.remove_columns(set(dataset.features.keys()) - set([\"audio\", \"sentence\"]))\n",
" all_datasets.append(dataset)\n",
"\n",
" interleaved_dataset = interleave_datasets(all_datasets, stopping_strategy=stopping_strategy)\n",
" return interleaved_dataset"
]
},
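{
"cell_type": "markdown",
"id": "9c1d4e2f",
"metadata": {},
"source": [
"To make the two stopping strategies concrete, here is a minimal sketch on two toy text-only datasets (`toy_long` and `toy_short` are purely illustrative, not part of our training data):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3f8a6b5c",
"metadata": {},
"outputs": [],
"source": [
"from datasets import Dataset, interleave_datasets\n",
"\n",
"toy_long = Dataset.from_dict({\"sentence\": [\"a\", \"b\", \"c\", \"d\"]})\n",
"toy_short = Dataset.from_dict({\"sentence\": [\"x\", \"y\"]})\n",
"\n",
"# all_exhausted (oversampling): the shorter dataset wraps around until every\n",
"# sample in every dataset has been seen at least once\n",
"oversampled = interleave_datasets([toy_long, toy_short], stopping_strategy=\"all_exhausted\")\n",
"print([sample[\"sentence\"] for sample in oversampled])\n",
"\n",
"# first_exhausted (subsampling): stops as soon as the shorter dataset runs out\n",
"subsampled = interleave_datasets([toy_long, toy_short], stopping_strategy=\"first_exhausted\")\n",
"print([sample[\"sentence\"] for sample in subsampled])"
]
},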
{
"cell_type": "markdown",
"id": "29bc228b-ce9b-4cee-9092-1223ddfa51ad",
"metadata": {},
"source": [
"Let's apply this function to load and merge our four datasets:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "8ae90f83-4ecd-46a3-98be-bd75706e0d88",
"metadata": {},
"outputs": [],
"source": [
"ds = load_multiple_streaming_datasets(dataset_names, dataset_config_names=dataset_config_names, text_column_names=text_column_names, use_auth_token=True)"
]
},
{
"cell_type": "markdown",
"id": "6056a693-1fb0-45f4-ad43-be5f1812c1a5",
"metadata": {},
"source": [
"### Iterate over the dataset"
]
},
{
"cell_type": "markdown",
"id": "7ffe011f-f905-4027-ab67-5c9c3b2b5ac0",
"metadata": {},
"source": [
"We iterate over the dataset, loading and merging samples on the fly. Let's print the transcriptions for the first 10 samples of our merged dataset:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "75b3355a-3c06-4d23-af43-2b93b1ad70b2",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Reading metadata...: 230467it [00:41, 5545.80it/s]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 ¿ Qué tal a tres de cinco ?\n",
"1 y desde luego esa razón no puede tener que ver con la explicación surrealista que hemos escuchado más de una vez de que se trata de una conspiración izquierdista.\n",
"2 para exclamar con voz de acción de gracias y para contar todas tus maravillas jehová la habitación de tu casa he amado y el lugar del tabernáculo de tu gloria no juntes con los pecadores mi alma ni con los hombres de sangres mi vida\n",
"3 el uso de internet y de la red informática mundial permite que los estudiantes tengan acceso a la información en todo momento\n",
"4 vamos , quiero decir , que no soy de citas especiales .\n",
"5 si bien esta lista no es perfecta sí que resulta necesario que las entidades financieras refuercen sus controles.\n",
"6 oye oh jehová mi voz con que á ti clamo y ten misericordia de mí respóndeme mi corazón ha dicho de ti buscad mi rostro tu rostro buscaré oh jehová\n",
"7 los deportes de nieve en descenso como el esquí y la tablanieve son disciplinas populares que consisten en deslizarse con esquís o una tabla fijada a los pies sobre un terreno nevado\n",
"8 fray Lope , en aquel momento , colmaba otro vaso igual :\n",
"9 señora presidenta la competitividad es importante pero no puede ser el único criterio.\n"
]
}
],
"source": [
"for i, sample in enumerate(ds):\n",
" print(i, sample[\"sentence\"])\n",
" if i == 9:\n",
" break"
]
},
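{
"cell_type": "markdown",
"id": "5a2e7c9d",
"metadata": {},
"source": [
"Each merged sample also carries the decoded audio. As a quick sanity check (an illustrative cell; re-iterating triggers the metadata read again, so it may take a moment), we can confirm that the first sample's audio was resampled to the 16 kHz rate we set with `cast_column`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7d4f1b8e",
"metadata": {},
"outputs": [],
"source": [
"sample = next(iter(ds))\n",
"# the Audio feature decodes each sample to a dict with \"array\" and \"sampling_rate\" keys\n",
"print(sample[\"audio\"][\"sampling_rate\"])\n",
"print(sample[\"audio\"][\"array\"].shape)"
]
},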
{
"cell_type": "markdown",
"id": "42d5ad08-b20e-4cba-a1a9-909fdbf030d4",
"metadata": {},
"source": [
"We can see that the transcriptions take several different formats. Those from Common Voice 11 are cased and punctuated. Those from VoxPopuli are punctuated only. Those from Multilingual LibriSpeech and FLEURS are neither cased not punctuated. We need to normalise the transcriptions to a uniform format before training our model. \n",
"\n",
"The following code cell is lifted from the Whisper training notebook: https://github.com/huggingface/community-events/blob/main/whisper-fine-tuning-event/fine-tune-whisper-streaming.ipynb"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "ed20e9cd-31c2-44cb-872b-333378a92fd1",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/sanchitgandhi/venv/lib/python3.8/site-packages/jax/_src/lib/__init__.py:33: UserWarning: JAX on Mac ARM machines is experimental and minimally tested. Please see https://github.com/google/jax/issues/5501 in the event of problems.\n",
" warnings.warn(\"JAX on Mac ARM machines is experimental and minimally tested. \"\n"
]
}
],
"source": [
"from transformers.models.whisper.english_normalizer import BasicTextNormalizer\n",
"\n",
"do_lower_case = True\n",
"do_remove_punctuation = True\n",
"\n",
"normalizer = BasicTextNormalizer()"
]
},
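{
"cell_type": "markdown",
"id": "2b6c8d3a",
"metadata": {},
"source": [
"To illustrate what `BasicTextNormalizer` does before we wire it into our pre-processing function, we can run it on a made-up transcription; it lowercases the text and replaces punctuation with spaces:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8e5d2f41",
"metadata": {},
"outputs": [],
"source": [
"print(normalizer(\"¿Qué tal, a tres de cinco?\"))"
]
},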
{
"cell_type": "markdown",
"id": "01d13029-c24f-4a51-aff2-9251a2ceb4ce",
"metadata": {},
"source": [
"Now we define a function to normalise our transcriptions:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "26e42417-4bd2-46f8-914e-3a6f9f3471ac",
"metadata": {},
"outputs": [],
"source": [
"def normalize_transcriptions(batch):\n",
" # optional pre-processing steps\n",
" transcription = batch[\"sentence\"]\n",
" if do_lower_case:\n",
" transcription = transcription.lower()\n",
" if do_remove_punctuation:\n",
" transcription = normalizer(transcription).strip()\n",
" batch[\"sentence\"] = transcription\n",
" return batch"
]
},
{
"cell_type": "markdown",
"id": "3b1c67fe-be4b-4ee5-9a1f-0d444f2b5c62",
"metadata": {},
"source": [
"Let's apply the data pre-processing steps to our dataset and view the first 10 samples again:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "0babac71-9157-4d0f-a8a8-184547bdf501",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Reading metadata...: 230467it [00:32, 6984.59it/s] \n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 qué tal a tres de cinco \n",
"1 y desde luego esa razón no puede tener que ver con la explicación surrealista que hemos escuchado más de una vez de que se trata de una conspiración izquierdista \n",
"2 para exclamar con voz de acción de gracias y para contar todas tus maravillas jehová la habitación de tu casa he amado y el lugar del tabernáculo de tu gloria no juntes con los pecadores mi alma ni con los hombres de sangres mi vida\n",
"3 el uso de internet y de la red informática mundial permite que los estudiantes tengan acceso a la información en todo momento\n",
"4 vamos quiero decir que no soy de citas especiales \n",
"5 si bien esta lista no es perfecta sí que resulta necesario que las entidades financieras refuercen sus controles \n",
"6 oye oh jehová mi voz con que á ti clamo y ten misericordia de mí respóndeme mi corazón ha dicho de ti buscad mi rostro tu rostro buscaré oh jehová\n",
"7 los deportes de nieve en descenso como el esquí y la tablanieve son disciplinas populares que consisten en deslizarse con esquís o una tabla fijada a los pies sobre un terreno nevado\n",
"8 fray lope en aquel momento colmaba otro vaso igual \n",
"9 señora presidenta la competitividad es importante pero no puede ser el único criterio \n"
]
}
],
"source": [
"ds = ds.map(normalize_transcriptions)\n",
"\n",
"for i, sample in enumerate(ds):\n",
" print(i, sample[\"sentence\"])\n",
" if i == 9:\n",
" break"
]
},
{
"cell_type": "markdown",
"id": "d135627a-a7aa-458c-94b8-57ddeae74a72",
"metadata": {},
"source": [
"This time the transcriptions are in a consistent format. We can use this data to fine-tune our Whisper model. Note that since we've removed punctuation and casing, the Whisper model won't learn to predict these features."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.9"
}
},
"nbformat": 4,
"nbformat_minor": 5
}