Transformers documentation

Audio classification

You are viewing v4.18.0 version. A newer version v4.29.1 is available.
Hugging Face's logo
Join the Hugging Face community

and get access to the augmented documentation experience

to get started

Audio classification

Audio classification assigns a label or class to audio data. It is similar to text classification, except an audio input is continuous and must be discretized, whereas text can be split into tokens. Some practical applications of audio classification include identifying intent, speakers, and even animal species by their sounds.

This guide will show you how to fine-tune Wav2Vec2 on the Keyword Spotting subset of the SUPERB benchmark to classify utterances.

See the audio classification task page for more information about its associated models, datasets, and metrics.

Load SUPERB dataset

Load the SUPERB dataset from the 🤗 Datasets library:

>>> from datasets import load_dataset

>>> ks = load_dataset("superb", "ks")

Then take a look at an example:

>>> ks["train"][0]
{'audio': {'array': array([ 0. , 0. , 0. , ..., -0.00592041, -0.00405884, -0.00253296], dtype=float32), 'path': '/root/.cache/huggingface/datasets/downloads/extracted/05734a36d88019a09725c20cc024e1c4e7982e37d7d55c0c1ca1742ea1cdd47f/_background_noise_/doing_the_dishes.wav', 'sampling_rate': 16000}, 'file': '/root/.cache/huggingface/datasets/downloads/extracted/05734a36d88019a09725c20cc024e1c4e7982e37d7d55c0c1ca1742ea1cdd47f/_background_noise_/doing_the_dishes.wav', 'label': 10}

The audio column contains a 1-dimensional array of the speech signal that must be called to load and resample the audio file. The label column is an integer that represents the utterance class. Create a dictionary that maps a label name to an integer and vice versa. The mapping will help the model recover the label name from the label number:

>>> labels = ks["train"].features["label"].names
>>> label2id, id2label = dict(), dict()
>>> for i, label in enumerate(labels):
...     label2id[label] = str(i)
...     id2label[str(i)] = label

Now you can convert the label number to a label name for more information:

>>> id2label[str(10)]

Each keyword - or label - corresponds to a number; 10 indicates silence in the example above.


Load the Wav2Vec2 feature extractor to process the audio signal:

>>> from transformers import AutoFeatureExtractor

>>> feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")

The preprocessing function needs to:

  1. Call the audio column to load and if necessary resample the audio file.
  2. Check the sampling rate of the audio file matches the sampling rate of the audio data a model was pretrained with. You can find this information on the Wav2Vec2 model card.
  3. Set a maximum input length so longer inputs are batched without being truncated.
>>> def preprocess_function(examples):
...     audio_arrays = [x["array"] for x in examples["audio"]]
...     inputs = feature_extractor(
...         audio_arrays, sampling_rate=feature_extractor.sampling_rate, max_length=16000, truncation=True
...     )
...     return inputs

Use 🤗 Datasets map function to apply the preprocessing function over the entire dataset. You can speed up the map function by setting batched=True to process multiple elements of the dataset at once. Remove the columns you don’t need:

>>> encoded_ks =, remove_columns=["audio", "file"], batched=True)


Hide Pytorch content

Load Wav2Vec2 with AutoModelForAudioClassification. Specify the number of labels, and pass the model the mapping between label number and label class:

>>> from transformers import AutoModelForAudioClassification, TrainingArguments, Trainer

>>> num_labels = len(id2label)
>>> model = AutoModelForAudioClassification.from_pretrained(
...     "facebook/wav2vec2-base", num_labels=num_labels, label2id=label2id, id2label=id2label
... )

If you aren’t familiar with fine-tuning a model with the Trainer, take a look at the basic tutorial here!

At this point, only three steps remain:

  1. Define your training hyperparameters in TrainingArguments.
  2. Pass the training arguments to Trainer along with the model, datasets, and feature extractor.
  3. Call train() to fine-tune your model.
>>> training_args = TrainingArguments(
...     output_dir="./results",
...     evaluation_strategy="epoch",
...     save_strategy="epoch",
...     learning_rate=3e-5,
...     num_train_epochs=5,
... )

>>> trainer = Trainer(
...     model=model,
...     args=training_args,
...     train_dataset=encoded_ks["train"],
...     eval_dataset=encoded_ks["validation"],
...     tokenizer=feature_extractor,
... )

>>> trainer.train()

For a more in-depth example of how to fine-tune a model for audio classification, take a look at the corresponding PyTorch notebook.