
Emotion Recognition in English Using RAVDESS and Wav2Vec 2.0

This model recognises emotions in English speech. It was trained on RAVDESS, a dataset of English audio recordings, and covers six emotions: anger, disgust, fear, happiness, sadness and surprise.

The model recreates the work of this Greek emotion extractor using a pre-trained Wav2Vec2 model to process the data.

Model Details

Model Description

How to Get Started with the Model

Use the code below to get started with the model.

[More Information Needed]
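
Pending an official snippet, the following is a minimal sketch that assumes the model is loaded through the standard transformers audio-classification pipeline using the repository id of this card; the exact preprocessing used during training is not documented here.

```python
# Minimal sketch (assumption): load the model with the generic transformers
# audio-classification pipeline, using the repository id of this card.
from transformers import pipeline

classifier = pipeline(
    "audio-classification",
    model="AreejB/wav2vec2-xlsr-english-speech-emotion-recognition",
)

# Any local audio file works; the pipeline resamples it to the model's
# expected sampling rate (16 kHz for Wav2Vec2).
predictions = classifier("path/to/recording.wav")
print(predictions)  # e.g. [{'label': 'anger', 'score': 0.91}, ...]
```

The exact label names returned depend on the id2label mapping stored in the model configuration.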

Training Details

Training Data

The RAVDESS dataset was split into training, validation and test sets using a 60/20/20 split.
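
As an illustration of those ratios only (the authors' actual split and any stratification are not described here), a 60/20/20 split can be produced with two calls to scikit-learn's train_test_split:

```python
# Illustrative 60/20/20 split. How the original split was stratified is not
# documented; file names and labels below are placeholders.
from sklearn.model_selection import train_test_split

files = [f"clip_{i:04d}.wav" for i in range(100)]   # placeholder clip names
labels = ["anger", "happiness"] * 50                # placeholder labels

# First keep 60% for training, then split the remaining 40% in half.
train_x, rest_x, train_y, rest_y = train_test_split(files, labels, test_size=0.4, random_state=42)
val_x, test_x, val_y, test_y = train_test_split(rest_x, rest_y, test_size=0.5, random_state=42)
```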

Training Procedure

The fine-tuning process was centred on four hyper-parameters:

  • the batch size (4, 8),
  • gradient accumulation steps (GAS) (2, 4, 6, 8),
  • number of epochs (10, 20) and
  • the learning rate (1e-3, 1e-4, 1e-5).

Each experiment was repeated 10 times.
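
A sketch of how that grid might be expressed with transformers TrainingArguments is shown below; the original training script is not published with this card, so the mapping of the batch setting to per-device batch size and all other settings are assumptions.

```python
# Sketch of the hyper-parameter grid described above (assumed mapping onto
# transformers TrainingArguments; the authors' actual script is not shown here).
from itertools import product
from transformers import TrainingArguments

batch_sizes = [4, 8]
grad_accum_steps = [2, 4, 6, 8]
num_epochs = [10, 20]
learning_rates = [1e-3, 1e-4, 1e-5]

for bs, gas, epochs, lr in product(batch_sizes, grad_accum_steps, num_epochs, learning_rates):
    args = TrainingArguments(
        output_dir=f"runs/bs{bs}_gas{gas}_ep{epochs}_lr{lr}",
        per_device_train_batch_size=bs,
        gradient_accumulation_steps=gas,
        num_train_epochs=epochs,
        learning_rate=lr,
    )
    # Build a Trainer with `args`, the Wav2Vec2 classifier and the
    # train/validation splits, fine-tune, and log validation accuracy/F1.
    # Each configuration was repeated 10 times in the original experiments.
```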

Evaluation

The set of hyper-parameters resulting in the best performance was: a batch size of 4, 4 gradient accumulation steps, 10 epochs and a learning rate of 1e-4.

Testing

The model was retrained on the combined training and validation sets using the best hyper-parameter set. On the test set, it achieved average accuracy and F1 scores of 84.84% (SD 2 and 2.08, respectively).

Results

We retained the model providing the highest performance over the 10 runs.

Emotion      Accuracy   Precision   Recall   F1
Anger        -          96.55       87.50    -
Disgust      -          90.91       93.75    -
Fear         -          96.30       81.25    -
Happiness    -          93.10       84.38    -
Sadness      -          81.58       96.88    -
Surprise     -          77.78       87.50    -
Total        88.54      89.37       88.54    88.62
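
For reference, the per-class precision/recall and overall scores in the table can be computed from test-set predictions with scikit-learn; the snippet below is a sketch with placeholder predictions, not the authors' evaluation code.

```python
# Sketch: compute per-class precision/recall and overall accuracy/F1 as in
# the table above. `y_true` and `y_pred` are placeholder labels.
from sklearn.metrics import accuracy_score, classification_report

emotions = ["anger", "disgust", "fear", "happiness", "sadness", "surprise"]
y_true = ["anger", "fear", "sadness", "surprise"]     # ground truth (placeholder)
y_pred = ["anger", "fear", "happiness", "surprise"]   # model output (placeholder)

print("Accuracy:", accuracy_score(y_true, y_pred))
print(classification_report(y_true, y_pred, labels=emotions, digits=4, zero_division=0))
```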
