What libraries can I use for Audio Classification?

The speechbrain, transformers, and transformers.js libraries are compatible with Audio Classification.

What models can I use for Audio Classification?

The speechbrain/google_speech_command_xvector, ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition, and facebook/mms-lid-126 models can be used for Audio Classification.

What datasets can I use for Audio Classification?

The s3prl/superb and agkphysics/AudioSet datasets can be used for Audio Classification.

What metrics can I use for Audio Classification?

The accuracy, recall, precision, and f1 metrics can be used for Audio Classification.

Audio Classification

Audio classification is the task of assigning a label or class to a given audio clip. It can be used to recognize which command a user is giving, to detect the emotion of a statement, or to identify a speaker.

[Figure: an audio input is passed to an audio classification model, which outputs class labels with scores, for example "Down" with a score of 0.800.]

About Audio Classification

Use Cases

Command Recognition

Command recognition or keyword spotting classifies utterances into a predefined set of commands. This is often done on-device for fast response time.

For example, a model trained on the Google Speech Commands dataset can classify an input utterance as one of the following commands:

'yes', 'no', 'up', 'down', 'left', 'right', 'on', 'off', 'stop', 'go', 'unknown', 'silence'
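Whatever the model, the last step of command recognition is typically to turn per-class scores into a single label. The sketch below assumes the model emits one raw score (logit) per command, in the order listed above. The function name `top_command` and the example scores are illustrative and not part of any library.

```python
import math

# Label set from the Google Speech Commands example above.
COMMANDS = ["yes", "no", "up", "down", "left", "right",
            "on", "off", "stop", "go", "unknown", "silence"]

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_command(logits, labels=COMMANDS):
    """Return the most probable label and its probability."""
    if len(logits) != len(labels):
        raise ValueError("expected one score per label")
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]

# Made-up scores in which "down" (index 3) is the strongest class.
example_logits = [0.1, 0.0, 1.2, 3.5, 0.3, 0.2, 0.0, 0.1, 0.4, 0.2, 0.5, 0.0]
label, prob = top_command(example_logits)
print(label, round(prob, 3))
```

On-device systems often add a confidence threshold on top of this, falling back to 'unknown' when no command scores highly enough.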

SpeechBrain models can perform this task with just a few lines of code:

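# Newer SpeechBrain releases (1.0+) expose this class under speechbrain.inference.
from speechbrain.pretrained import EncoderClassifier

# Download the pretrained command recognition model from the Hugging Face Hub.
model = EncoderClassifier.from_hparams(
  source="speechbrain/google_speech_command_xvector"
)
# Returns the class probabilities, the best score, its index, and the text label.
out_prob, score, index, text_lab = model.classify_file("file.wav")
print(text_lab)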

Language Identification

Datasets such as VoxLingua107 allow anyone to train language identification models for up to 107 languages! Language identification can be extremely useful as a preprocessing step for other systems. Here's an example model trained on VoxLingua107.
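As a sketch of the preprocessing idea, the snippet below routes a clip to a language-specific downstream system based on a language identification result. The prediction format (a list of label/score dictionaries), the `route_by_language` helper, the handler names, and the 0.5 confidence threshold are all assumptions made for illustration.

```python
def route_by_language(predictions, handlers, fallback, min_score=0.5):
    """Pick a downstream handler from language-ID predictions.

    predictions: list of {"label": str, "score": float} (assumed format).
    handlers: dict mapping a language label to a downstream system name.
    fallback: value returned when the top score is too low or unsupported.
    """
    if not predictions:
        return fallback
    top = max(predictions, key=lambda p: p["score"])
    if top["score"] < min_score:
        return fallback
    return handlers.get(top["label"], fallback)

# Hypothetical downstream systems keyed by language label.
handlers = {"eng": "english_asr", "fra": "french_asr"}

confident = [{"label": "fra", "score": 0.91}, {"label": "eng", "score": 0.06}]
uncertain = [{"label": "eng", "score": 0.40}, {"label": "fra", "score": 0.35}]

print(route_by_language(confident, handlers, fallback="multilingual_asr"))
print(route_by_language(uncertain, handlers, fallback="multilingual_asr"))
```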

Emotion Recognition

Emotion recognition classifies the emotion expressed in an utterance, such as neutral, happy, angry, or sad. In addition to trying the widgets, you can use Inference Endpoints to perform audio classification. Here is a simple example that uses a HuBERT model fine-tuned for this task.

import os
import requests

# Assumes your Hugging Face access token is stored in the HF_TOKEN environment variable.
API_TOKEN = os.environ["HF_TOKEN"]
API_URL = "https://router.huggingface.co/hf-inference/models/superb/hubert-large-superb-er"
headers = {"Authorization": f"Bearer {API_TOKEN}"}

def query(filename):
    # Send the raw audio bytes and return the parsed JSON response.
    with open(filename, "rb") as f:
        data = f.read()
    response = requests.post(API_URL, headers=headers, data=data)
    response.raise_for_status()
    return response.json()

data = query("sample1.flac")
# [{'label': 'neu', 'score': 0.60},
# {'label': 'hap', 'score': 0.20},
# {'label': 'ang', 'score': 0.13},
# {'label': 'sad', 'score': 0.07}]
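The response is a list of label/score pairs, so picking the predicted emotion is a small post-processing step. The sketch below uses the sample output shown above. The expansion of the abbreviated labels (`neu`, `hap`, `ang`, `sad`) into full names is an assumption about this model's label set, and `top_emotion` is a hypothetical helper.

```python
# Assumed meaning of the abbreviated labels returned by the model.
EMOTION_NAMES = {"neu": "neutral", "hap": "happy", "ang": "angry", "sad": "sad"}

def top_emotion(results, names=EMOTION_NAMES):
    """Return (readable label, score) for the highest-scoring entry."""
    best = max(results, key=lambda r: r["score"])
    # Fall back to the raw label if it is not in the mapping.
    return names.get(best["label"], best["label"]), best["score"]

# Sample response copied from the example above.
sample = [
    {"label": "neu", "score": 0.60},
    {"label": "hap", "score": 0.20},
    {"label": "ang", "score": 0.13},
    {"label": "sad", "score": 0.07},
]

emotion, score = top_emotion(sample)
print(f"{emotion} ({score:.2f})")  # prints "neutral (0.60)"
```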

You can use huggingface.js to run inference with audio classification models on the Hugging Face Hub. The example below uses a language identification model.

import { InferenceClient } from "@huggingface/inference";

const inference = new InferenceClient(HF_TOKEN);
await inference.audioClassification({
    data: await (await fetch("sample.flac")).blob(),
    model: "facebook/mms-lid-126",
});

Speaker Identification

Speaker identification classifies an audio clip according to who is speaking. The set of possible speakers is usually predefined. You can try out this task with this model. A useful dataset for this task is VoxCeleb1.

Solving audio classification for your own data

We have some great news! You can fine-tune a pretrained model (transfer learning) to get a well-performing model with far less data than training from scratch would require. Pretrained models such as Wav2Vec2 and HuBERT are available for this. Facebook's Wav2Vec2 XLS-R is a large multilingual model trained on 436K hours of speech in 128 languages. Similarly, you can use OpenAI's Whisper, trained on up to 4 million hours of multilingual speech data, for this task too!
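Fine-tuning usually starts by replacing the pretrained model's output layer with one sized for your own labels. A minimal sketch of preparing that label information follows. The label names are made up, and `build_label_maps` is a hypothetical helper; the resulting `num_labels`, `label2id`, and `id2label` values correspond to the configuration fields of the same names in Transformers.

```python
def build_label_maps(labels):
    """Build the label <-> id mappings a classification head needs."""
    unique = sorted(set(labels))  # stable, duplicate-free ordering
    label2id = {label: idx for idx, label in enumerate(unique)}
    id2label = {idx: label for label, idx in label2id.items()}
    return label2id, id2label

# Hypothetical labels from a custom dataset (one per training clip).
clip_labels = ["dog_bark", "siren", "dog_bark", "rain", "siren"]

label2id, id2label = build_label_maps(clip_labels)
num_labels = len(label2id)

# Integer targets used during training.
targets = [label2id[name] for name in clip_labels]
print(num_labels, label2id, targets)
```

These values can then be supplied when loading a pretrained checkpoint for audio classification, so that the new output layer matches your classes.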

Useful Resources

Would you like to learn more about the topic? Awesome! Here you can find some curated resources that you may find helpful!

Notebooks

PyTorch

Scripts for training

PyTorch

Documentation

Compatible libraries

speechbrain

Transformers

Transformers.js

Models for Audio Classification

speechbrain/google_speech_command_xvector

Audio Classification • Updated Feb 19, 2024 • 62 • 7

Note An easy-to-use model for command recognition.

ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition

Audio Classification • Updated Oct 24, 2024 • 38.5k • 224

Note An emotion recognition model.

facebook/mms-lid-126

Audio Classification • Updated Jun 13, 2023 • 1.14M • 27

Note A language identification model.

Datasets for Audio Classification

s3prl/superb

Updated Aug 8, 2024 • 997 • 30

Note A benchmark of 10 different audio tasks.

agkphysics/AudioSet

Updated Feb 3, 2024 • 9.82k • 44

Note A dataset of YouTube clips and their sound categories.

Spaces using Audio Classification

kurianbenoy/audioclassification

Note An application that can classify music into different genres.

Metrics for Audio Classification

accuracy: Accuracy is the proportion of correct predictions among the total number of cases processed. It can be computed with: Accuracy = (TP + TN) / (TP + TN + FP + FN), where TP is the number of true positives, TN the number of true negatives, FP the number of false positives, and FN the number of false negatives.

recall: Recall is the fraction of the positive examples that were correctly labeled by the model as positive. It can be computed with the equation: Recall = TP / (TP + FN), where TP is the number of true positives and FN is the number of false negatives.

precision: Precision is the fraction of correctly labeled positive examples out of all of the examples that were labeled as positive. It is computed via the equation: Precision = TP / (TP + FP), where TP is the number of true positives (i.e. the examples correctly labeled as positive) and FP is the number of false positives (i.e. the examples incorrectly labeled as positive).

f1: The F1 score is the harmonic mean of the precision and recall. It can be computed with the equation: F1 = 2 * (precision * recall) / (precision + recall)
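As a worked example, the four formulas above can be computed directly from confusion counts. The sketch below covers the binary case with made-up counts; `classification_metrics` is a hypothetical helper, and returning 0.0 when a denominator is zero is a convention chosen here, not something the definitions prescribe.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Made-up counts for a binary classifier evaluated on 100 clips.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

For multi-class audio classification, precision, recall, and F1 are typically computed per class and then averaged.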