Transformers

You are viewing v4.22.1 version. A newer version v4.51.3 is available.

Automatic speech recognition

Automatic speech recognition (ASR) converts a speech signal to text. It is an example of a sequence-to-sequence task, going from a sequence of audio inputs to textual outputs. Voice assistants like Siri and Alexa utilize ASR models to assist users.

This guide will show you how to fine-tune Wav2Vec2 on the MInDS-14 dataset to transcribe audio to text.

See the automatic speech recognition task page for more information about its associated models, datasets, and metrics.

Load MInDS-14 dataset

Load the MInDS-14 dataset from the 🤗 Datasets library:

>>> from datasets import load_dataset, Audio

>>> minds = load_dataset("PolyAI/minds14", name="en-US", split="train")

Split this dataset into a train and test set:

>>> minds = minds.train_test_split(test_size=0.2)

Then take a look at the dataset:

>>> minds
DatasetDict({
    train: Dataset({
        features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
        num_rows: 450
    })
    test: Dataset({
        features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
        num_rows: 113
    })
})
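The row counts above follow from test_size=0.2. Here is a minimal sketch of the arithmetic, assuming the test split is rounded up to the next whole row (an assumption about the rounding rule, chosen because it reproduces the numbers shown). The helper split_sizes is hypothetical and not part of 🤗 Datasets:

```python
import math
from typing import Tuple


def split_sizes(num_rows: int, test_size: float) -> Tuple[int, int]:
    """Return (train_rows, test_rows) for a fractional test_size.

    Assumption: the test split is rounded up to a whole number of rows.
    """
    test_rows = math.ceil(num_rows * test_size)
    return num_rows - test_rows, test_rows


# 450 + 113 = 563 rows in the original en-US train split shown above
train_rows, test_rows = split_sizes(563, 0.2)
print(train_rows, test_rows)
```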

While the dataset contains a lot of helpful information, like lang_id and intent_class, you will focus on the audio and transcription columns in this guide. Remove the other columns:

>>> minds = minds.remove_columns(["english_transcription", "intent_class", "lang_id"])

Take a look at the example again:

>>> minds["train"][0]
{'audio': {'array': array([-0.00024414,  0.        ,  0.        , ...,  0.00024414,
          0.00024414,  0.00024414], dtype=float32),
  'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~APP_ERROR/602ba9e2963e11ccd901cd4f.wav',
  'sampling_rate': 8000},
 'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~APP_ERROR/602ba9e2963e11ccd901cd4f.wav',
 'transcription': "hi I'm trying to use the banking app on my phone and currently my checking and savings account balance is not refreshing"}

The audio column contains a 1-dimensional array of the speech signal. Accessing this column automatically loads the audio file and, if a different sampling rate has been requested, resamples it.
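Because the audio dictionary holds both the raw samples and the sampling rate, the duration of a clip is the number of samples divided by the rate. The sketch below uses a synthetic stand-in for one example (the values are made up, not taken from MInDS-14), and duration_seconds is a hypothetical helper:

```python
def duration_seconds(audio: dict) -> float:
    """Length of a clip in seconds: number of samples / samples per second."""
    return len(audio["array"]) / audio["sampling_rate"]


# Synthetic example: 16,000 silent samples recorded at 8,000 Hz
example_audio = {"array": [0.0] * 16000, "sampling_rate": 8000}
print(duration_seconds(example_audio))  # 2.0
```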

Preprocess

Load the Wav2Vec2 processor to process the audio signal and transcribed text:

>>> from transformers import AutoProcessor

>>> processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base")

The MInDS-14 dataset has a sampling rate of 8000 Hz (8 kHz), as the sampling_rate field in the example above shows. The pretrained Wav2Vec2 model expects audio sampled at 16 kHz, so you need to resample the dataset:

>>> minds = minds.cast_column("audio", Audio(sampling_rate=16_000))
>>> minds["train"][0]
{'audio': {'array': array([-2.38064706e-04, -1.58618059e-04, -5.43987835e-06, ...,
          2.78103951e-04,  2.38446111e-04,  1.18740834e-04], dtype=float32),
  'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~APP_ERROR/602ba9e2963e11ccd901cd4f.wav',
  'sampling_rate': 16000},
 'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~APP_ERROR/602ba9e2963e11ccd901cd4f.wav',
 'transcription': "hi I'm trying to use the banking app on my phone and currently my checking and savings account balance is not refreshing"}
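To build intuition for what resampling does, here is a minimal sketch that upsamples a toy signal from 8 kHz to 16 kHz with linear interpolation. This is only an illustration: the resampler used by 🤗 Datasets is assumed to be more sophisticated (for example, band-limited with anti-aliasing), and resample_linear is a hypothetical helper, not a library function:

```python
from typing import List


def resample_linear(samples: List[float], orig_rate: int, target_rate: int) -> List[float]:
    """Resample by linearly interpolating between neighbouring samples."""
    if not samples:
        return []
    n_out = len(samples) * target_rate // orig_rate
    out = []
    for i in range(n_out):
        # Position of output sample i on the original time axis
        pos = i * orig_rate / target_rate
        left = int(pos)
        right = min(left + 1, len(samples) - 1)  # clamp at the last sample
        frac = pos - left
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out


signal_8k = [0.0, 1.0, 0.0, -1.0]
signal_16k = resample_linear(signal_8k, 8000, 16000)
print(len(signal_8k), len(signal_16k))  # 4 8
```

Doubling the sampling rate doubles the number of samples for the same clip, which is why the array in the resampled example differs from the original one.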

The preprocessing function needs to:

Call the audio column to load and resample the audio file.
Extract the input_values from the audio file with the processor, and record their length as input_length.
Tokenize the transcription to create the labels. By default, calling the processor invokes the feature extractor; passing the text argument instructs the processor to call the tokenizer instead.

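>>> def prepare_dataset(batch):
...     # Accessing the audio column loads and resamples the audio file
...     audio = batch["audio"]

...     # Extract input_values with the feature extractor and record their length
...     batch["input_values"] = processor(audio=audio["array"], sampling_rate=audio["sampling_rate"]).input_values[0]
...     batch["input_length"] = len(batch["input_values"])

...     # Tokenize the transcription to create the labels
...     batch["labels"] = processor(text=batch["transcription"]).input_ids
...     return batch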

Use 🤗 Datasets map function to apply the preprocessing function over the entire dataset. You can speed up the map function by increasing the number of processes with num_proc. Remove the columns you don’t need:

>>> encoded_minds = minds.map(prepare_dataset, remove_columns=minds.column_names["train"], num_proc=4)

🤗 Transformers doesn’t have a data collator for automatic speech recognition, so you will need to create one. You can adapt the DataCollatorWithPadding to create a batch of examples for automatic speech recognition. It will also dynamically pad your text and labels to the length of the longest element in its batch, so they are a uniform length. While it is possible to pad your text in the tokenizer function by setting padding=True, dynamic padding is more efficient.

Unlike other data collators, this specific data collator needs to apply a different padding method to input_values and labels. It does so by calling the processor's pad method separately for each, then replacing the label padding with -100 so that it is ignored by the loss:

>>> import torch

>>> from dataclasses import dataclass, field
>>> from typing import Any, Dict, List, Optional, Union


>>> @dataclass
... class DataCollatorCTCWithPadding:

...     processor: AutoProcessor
...     padding: Union[bool, str] = True

...     def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
...         # split inputs and labels since they have to be of different lengths and need
...         # different padding methods
...         input_features = [{"input_values": feature["input_values"]} for feature in features]
...         label_features = [{"input_ids": feature["labels"]} for feature in features]

...         batch = self.processor.pad(input_features, padding=self.padding, return_tensors="pt")

...         labels_batch = self.processor.pad(labels=label_features, padding=self.padding, return_tensors="pt")

...         # replace padding with -100 to ignore loss correctly
...         labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

...         batch["labels"] = labels

...         return batch

Instantiate DataCollatorCTCWithPadding, which creates a batch of examples and dynamically pads them:

>>> data_collator = DataCollatorCTCWithPadding(processor=processor, padding=True)
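The padding logic of the collator can be sketched without any library, using plain Python lists. In this sketch, pad_batch is a hypothetical helper and the toy features are made up; it only mirrors the idea of padding input_values and labels separately to the longest element in the batch, with labels padded by -100:

```python
from typing import Any, Dict, List


def pad_batch(
    features: List[Dict[str, List[Any]]],
    input_pad: float = 0.0,
    label_pad: int = -100,
) -> Dict[str, List[List[Any]]]:
    """Pad input_values and labels independently to their own batch maximum."""
    max_input = max(len(f["input_values"]) for f in features)
    max_label = max(len(f["labels"]) for f in features)
    return {
        "input_values": [
            f["input_values"] + [input_pad] * (max_input - len(f["input_values"]))
            for f in features
        ],
        # -100 marks label positions that the loss should ignore
        "labels": [
            f["labels"] + [label_pad] * (max_label - len(f["labels"]))
            for f in features
        ],
    }


toy_features = [
    {"input_values": [0.1, 0.2, 0.3], "labels": [5, 6]},
    {"input_values": [0.4], "labels": [7, 8, 9]},
]
padded = pad_batch(toy_features)
print(padded["input_values"])  # [[0.1, 0.2, 0.3], [0.4, 0.0, 0.0]]
print(padded["labels"])        # [[5, 6, -100], [7, 8, 9]]
```

Note that the longest input and the longest label sequence can belong to different examples, which is why the two are padded independently.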

Train

PyTorch

Load Wav2Vec2 with AutoModelForCTC. For ctc_loss_reduction, it is often better to use the average instead of the default summation:

>>> from transformers import AutoModelForCTC, TrainingArguments, Trainer

>>> model = AutoModelForCTC.from_pretrained(
...     "facebook/wav2vec2-base",
...     ctc_loss_reduction="mean",
...     pad_token_id=processor.tokenizer.pad_token_id,
... )
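To see why the choice of reduction matters, the sketch below compares the two options on made-up per-example losses. It assumes the reduction follows the semantics of PyTorch's CTC loss, where "mean" divides each loss by its target length before averaging over the batch; reduce_ctc_losses is a hypothetical helper, not the model's actual implementation:

```python
from typing import List


def reduce_ctc_losses(losses: List[float], target_lengths: List[int], reduction: str) -> float:
    """Combine per-example CTC losses into one scalar (assumed PyTorch semantics)."""
    if reduction == "sum":
        return sum(losses)
    if reduction == "mean":
        # Normalize each loss by its target length, then average over the batch
        normalized = [loss / length for loss, length in zip(losses, target_lengths)]
        return sum(normalized) / len(normalized)
    raise ValueError(f"unsupported reduction: {reduction}")


toy_losses = [12.0, 30.0]   # made-up per-example losses
toy_target_lengths = [4, 10]  # made-up label lengths

print(reduce_ctc_losses(toy_losses, toy_target_lengths, "sum"))   # 42.0
print(reduce_ctc_losses(toy_losses, toy_target_lengths, "mean"))  # 3.0
```

With summation, the loss scale grows with the batch size and transcription length, whereas the mean keeps it comparable across batches.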

If you aren’t familiar with fine-tuning a model with the Trainer, take a look at the basic fine-tuning tutorial first!

At this point, only three steps remain:

Define your training hyperparameters in TrainingArguments.
Pass the training arguments to Trainer along with the model, datasets, feature extractor (through the tokenizer argument), and data collator.
Call train() to fine-tune your model.

>>> training_args = TrainingArguments(
...     output_dir="./results",
...     group_by_length=True,
...     per_device_train_batch_size=16,
...     evaluation_strategy="steps",
...     num_train_epochs=3,
...     fp16=True,
...     gradient_checkpointing=True,
...     learning_rate=1e-4,
...     weight_decay=0.005,
...     save_total_limit=2,
... )

>>> trainer = Trainer(
...     model=model,
...     args=training_args,
...     train_dataset=encoded_minds["train"],
...     eval_dataset=encoded_minds["test"],
...     tokenizer=processor.feature_extractor,
...     data_collator=data_collator,
... )

>>> trainer.train()
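The group_by_length=True setting batches examples of similar length together to reduce padding. The sketch below illustrates the effect on made-up input lengths by counting padded positions for batches formed in the given order versus batches formed after sorting by length. This is a simplification: the sampler used by Trainer is assumed to keep some randomness rather than fully sorting, and padding_cost is a hypothetical helper:

```python
from typing import List


def padding_cost(lengths: List[int], batch_size: int) -> int:
    """Total number of padded positions when each batch is padded to its longest item."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        longest = max(batch)
        total += sum(longest - n for n in batch)
    return total


toy_lengths = [10, 100, 12, 98, 11, 99]  # made-up input lengths

unsorted_cost = padding_cost(toy_lengths, batch_size=3)
grouped_cost = padding_cost(sorted(toy_lengths), batch_size=3)
print(unsorted_cost, grouped_cost)  # 267 6
```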

For a more in-depth example of how to fine-tune a model for automatic speech recognition, take a look at this blog post for English ASR and this post for multilingual ASR.
