---
datasets:
- narad/ravdess
language:
- en
metrics:
- f1
- accuracy
- recall
- precision
pipeline_tag: audio-classification
---

# Emotion Recognition in English Using RAVDESS and Wav2Vec 2.0

<!-- Provide a quick summary of what the model is/does. -->

This model classifies the emotion expressed in English speech recordings. It was trained on RAVDESS, a dataset of English audio recordings, and recognises six emotions: anger, disgust, fear, happiness, sadness and surprise.

The model recreates the work of this [Greek emotion extractor](https://huggingface.co/m3hrdadfi/wav2vec2-xlsr-greek-speech-emotion-recognition/blob/main/README.md), using a pre-trained [Wav2Vec2](https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-english) model as its backbone.


## Model Details

### Model Description

<!-- Provide a longer summary of what this model is. -->


- **Adapted from:** [Emotion Recognition in Greek](https://huggingface.co/m3hrdadfi/wav2vec2-xlsr-greek-speech-emotion-recognition/blob/main/README.md)
- **Model type:** NN with CTC
- **Language(s) (NLP):** English
- **Finetuned from model:** [wav2vec2](https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-english)


## How to Get Started with the Model

Use the code below to get started with the model.

[More Information Needed]
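
Until an official snippet is added, the sketch below shows one way to run inference with the Hugging Face Transformers `audio-classification` pipeline. The repository ID is a placeholder, and the sketch assumes the checkpoint was saved with a standard audio-classification head.

```python
from transformers import pipeline

# Placeholder repository ID -- replace with the actual model ID on the Hub.
MODEL_ID = "your-username/wav2vec2-ravdess-emotion"

# Load the fine-tuned checkpoint as an audio-classification pipeline.
# This assumes the checkpoint includes a standard classification head.
classifier = pipeline("audio-classification", model=MODEL_ID)

# Run inference on a speech recording; the pipeline resamples the audio
# to the feature extractor's expected sampling rate (16 kHz for Wav2Vec2).
predictions = classifier("path/to/recording.wav")

# Each prediction is a dict with an emotion label and a confidence score.
for pred in predictions:
    print(f"{pred['label']}: {pred['score']:.3f}")
```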

## Training Details

### Training Data

<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

The RAVDESS dataset was split into training, validation and test sets in a 60/20/20 ratio.
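
The exact splitting code is not documented here; the sketch below shows one way to reproduce a 60/20/20 split with the Hugging Face `datasets` library. The seed and the absence of stratification are assumptions.

```python
from datasets import load_dataset

# Load RAVDESS from the Hub (dataset ID taken from this card's metadata).
ravdess = load_dataset("narad/ravdess", split="train")

# 60/20/20 train/validation/test split. The seed is illustrative only;
# the split parameters used for the reported results are not documented.
train_vs_rest = ravdess.train_test_split(test_size=0.4, seed=42)
val_vs_test = train_vs_rest["test"].train_test_split(test_size=0.5, seed=42)

splits = {
    "train": train_vs_rest["train"],     # 60%
    "validation": val_vs_test["train"],  # 20%
    "test": val_vs_test["test"],         # 20%
}
```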

### Training Procedure

The fine-tuning process was centred on four hyper-parameters:
- batch size (4, 8),
- gradient accumulation steps (GAS) (2, 4, 6, 8),
- number of epochs (10, 20), and
- learning rate (1e-3, 1e-4, 1e-5).

Each experiment was repeated 10 times.

## Evaluation

The best-performing hyper-parameter configuration was: batch size 4, 4 gradient accumulation steps, 10 epochs and a learning rate of 1e-4.
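
For reference, this configuration maps onto Hugging Face `TrainingArguments` roughly as follows; only the four tuned hyper-parameters come from this card, and the output directory is a placeholder.

```python
from transformers import TrainingArguments

# Best configuration from the hyper-parameter search above.
# Everything not listed here falls back to Transformers defaults.
training_args = TrainingArguments(
    output_dir="wav2vec2-ravdess-emotion",  # placeholder output path
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=10,
    learning_rate=1e-4,
)
```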

## Testing

The model was retrained on the combined training and validation sets using the best hyper-parameter configuration. On the test set it achieved an average accuracy and F1 score of 84.84% (SD 2 and 2.08, respectively).
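
The metric computation itself is not shown in this card; a minimal sketch of how the reported accuracy and F1 could be computed with scikit-learn is given below. The weighted averaging mode and the shape of `runs` are assumptions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def run_metrics(y_true, y_pred):
    """Accuracy and weighted F1 (in %) for a single test-set evaluation."""
    acc = accuracy_score(y_true, y_pred) * 100
    f1 = f1_score(y_true, y_pred, average="weighted") * 100
    return acc, f1

def summarise(runs):
    """Mean and standard deviation across runs; `runs` is assumed to be
    a list of (y_true, y_pred) pairs, one per retraining run."""
    accs, f1s = zip(*(run_metrics(t, p) for t, p in runs))
    return {
        "accuracy": (np.mean(accs), np.std(accs)),
        "f1": (np.mean(f1s), np.std(f1s)),
    }
```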


## Results

We retained the best-performing model across the 10 runs; its per-emotion results are reported below (all values in %).

| Emotion   | Accuracy | Precision | Recall | F1    |
|-----------|:--------:|----------:|-------:|------:|
| Anger     |    –     |     96.55 |  87.50 |     – |
| Disgust   |    –     |     90.91 |  93.75 |     – |
| Fear      |    –     |     96.30 |  81.25 |     – |
| Happiness |    –     |     93.10 |  84.38 |     – |
| Sadness   |    –     |     81.58 |  96.88 |     – |
| Surprise  |    –     |     77.78 |  87.50 |     – |
| Total     |  88.54   |     89.37 |  88.54 | 88.62 |


<!-- ## Citation [optional] -->

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

<!-- **BibTeX:**

[More Information Needed]

**APA:**

[More Information Needed] -->