Audio Classification

Audio classification is the task of assigning a label or class to a given audio. It can be used for recognizing which command a user is giving or the emotion of a statement, as well as identifying a speaker.

Audio Classification Model

About Audio Classification

Use Cases

Command Recognition

Command recognition or keyword spotting classifies utterances into a predefined set of commands. This is often done on-device for fast response time.

As an example, using the Google Speech Commands dataset, given an input, a model can classify which of the following commands the user is typing:

'yes', 'no', 'up', 'down', 'left', 'right', 'on', 'off', 'stop', 'go', 'unknown', 'silence'

Speechbrain models can easily perform this task with just a couple of lines of code!

from speechbrain.pretrained import EncoderClassifier
model = EncoderClassifier.from_hparams(

Language Identification

Datasets such as VoxLingua107 allow anyone to train language identification models for up to 107 languages! This can be extremely useful as a preprocessing step for other systems. Here's an example modeltrained on VoxLingua107.

Emotion recognition

Emotion recognition is self explanatory. In addition to trying the widgets, you can use Inference Endpoints to perform audio classification. Here is a simple example that uses a HuBERT model fine-tuned for this task.

import json
import requests

headers = {"Authorization": f"Bearer {API_TOKEN}"}
API_URL = "https://api-inference.huggingface.co/models/superb/hubert-large-superb-er"

def query(filename):
    with open(filename, "rb") as f:
        data = f.read()
    response = requests.request("POST", API_URL, headers=headers, data=data)
    return json.loads(response.content.decode("utf-8"))

data = query("sample1.flac")
# [{'label': 'neu', 'score': 0.60},
# {'label': 'hap', 'score': 0.20},
# {'label': 'ang', 'score': 0.13},
# {'label': 'sad', 'score': 0.07}]

You can use huggingface.js to infer with audio classification models on Hugging Face Hub.

import { HfInference } from "@huggingface/inference";

const inference = new HfInference(HF_TOKEN);
await inference.audioClassification({
    data: await (await fetch("sample.flac")).blob(),
    model: "facebook/mms-lid-126",

Speaker Identification

Speaker Identification is classifying the audio of the person speaking. Speakers are usually predefined. You can try out this task with this model. A useful dataset for this task is VoxCeleb1.

Solving audio classification for your own data

We have some great news! You can do fine-tuning (transfer learning) to train a well-performing model without requiring as much data. Pretrained models such as Wav2Vec2 and HuBERT exist. Facebook's Wav2Vec2 XLS-R model is a large multilingual model trained on 128 languages and with 436K hours of speech. Similarly, you can also use OpenAI's Whisper trained on up to 4 Million hours of multilingual speech data for this task too!

Useful Resources

Would you like to learn more about the topic? Awesome! Here you can find some curated resources that you may find helpful!


Scripts for training


Compatible libraries

Audio Classification demo
Models for Audio Classification
Browse Models (1,797)
Datasets for Audio Classification
Browse Datasets (134)

Note A benchmark of 10 different audio tasks.

Spaces using Audio Classification

Note An application that can classify music into different genre.

Metrics for Audio Classification
Accuracy is the proportion of correct predictions among the total number of cases processed. It can be computed with: Accuracy = (TP + TN) / (TP + TN + FP + FN) Where: TP: True positive TN: True negative FP: False positive FN: False negative
Recall is the fraction of the positive examples that were correctly labeled by the model as positive. It can be computed with the equation: Recall = TP / (TP + FN) Where TP is the true positives and FN is the false negatives.
Precision is the fraction of correctly labeled positive examples out of all of the examples that were labeled as positive. It is computed via the equation: Precision = TP / (TP + FP) where TP is the True positives (i.e. the examples correctly labeled as positive) and FP is the False positive examples (i.e. the examples incorrectly labeled as positive).
The F1 score is the harmonic mean of the precision and recall. It can be computed with the equation: F1 = 2 * (precision * recall) / (precision + recall)