metadata
license: mit
language:
- en
pipeline_tag: audio-classification
tags:
- wavlm
- msp-podcast
- emotion-recognition
- audio
- speech
- categorical
- lucas
This model was trained on MSP-Podcast as a baseline for the Odyssey 2024 Emotion Recognition competition.
This is the categorical model, which predicts one of eight emotion classes: "Angry", "Sad", "Happy", "Surprise", "Fear", "Disgust", "Contempt" and "Neutral".
Benchmarks
F1 (micro and macro), precision, and recall on the Test 3 and Development sets of the Odyssey competition
Categorical Setup | F1-Mic. | F1-Ma. | Prec. | Rec. |
---|---|---|---|---|
Test 3 | 0.327 | 0.311 | 0.332 | 0.325 |
Development | 0.409 | 0.307 | 0.316 | 0.345 |
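For readers unfamiliar with the two F1 columns, the sketch below shows the standard definitions of micro- and macro-averaged F1 for single-label classification. It is an illustration only: the card does not state how the reported scores were computed, and the helper `f1_micro_macro` and the toy labels are hypothetical.

```python
from collections import Counter

def f1_micro_macro(y_true, y_pred, labels):
    """Return (micro_f1, macro_f1) for single-label predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro: pool the counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro

# Toy data (hypothetical, not from MSP-Podcast).
y_true = ["Neutral", "Neutral", "Neutral", "Sad", "Angry"]
y_pred = ["Neutral", "Neutral", "Sad", "Sad", "Neutral"]
micro, macro = f1_micro_macro(y_true, y_pred, ["Angry", "Sad", "Neutral"])
print(f"micro={micro:.3f} macro={macro:.3f}")
```

Macro F1 weights every class equally, so rare classes that are never predicted correctly pull it down, while micro F1 is dominated by the frequent classes.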
For more details: demo, paper/soon and GitHub.
@InProceedings{Goncalves_2024,
author={L. Goncalves and A. N. Salman and A. {Reddy Naini} and L. Moro-Velazquez and T. Thebaud and L. {Paola Garcia} and N. Dehak and B. Sisman and C. Busso},
title={Odyssey2024 - Speech Emotion Recognition Challenge: Dataset, Baseline Framework, and Results},
booktitle={Odyssey 2024: The Speaker and Language Recognition Workshop},
volume={To appear},
year={2024},
month={June},
address = {Quebec, Canada},
}
Usage
from transformers import AutoModelForAudioClassification
import librosa, torch
#load model
model = AutoModelForAudioClassification.from_pretrained("3loi/SER-Odyssey-Baseline-WavLM-Categorical-Attributes", trust_remote_code=True)
#get mean/std
mean = model.config.mean
std = model.config.std
#load an audio file
audio_path = "/path/to/audio.wav"
raw_wav, _ = librosa.load(audio_path, sr=model.config.sampling_rate)
#normalize the audio by mean/std
norm_wav = (raw_wav - mean) / (std+0.000001)
#generate the mask
mask = torch.ones(1, len(norm_wav))
#batch it (add dim)
wavs = torch.tensor(norm_wav).unsqueeze(0)
#predict
with torch.no_grad():
    pred = model(wavs, mask)
print(model.config.id2label)
print(pred)
#{0: 'Angry', 1: 'Sad', 2: 'Happy', 3: 'Surprise', 4: 'Fear', 5: 'Disgust', 6: 'Contempt', 7: 'Neutral'}
#tensor([[0.0015, 0.3651, 0.0593, 0.0315, 0.0600, 0.0125, 0.0319, 0.4382]])
#convert logits to probability
probabilities = torch.nn.functional.softmax(pred, dim=1)
print(probabilities)
#[[0.0015, 0.3651, 0.0593, 0.0315, 0.0600, 0.0125, 0.0319, 0.4382]]
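To turn the probability vector into readable labels, one option is to rank the classes and look them up in `id2label`. The sketch below is a minimal, dependency-free example: `top_emotions` is a hypothetical helper (not part of the model's API), the label mapping is copied from the printed `model.config.id2label`, and the probabilities are the example values shown above.

```python
# Label mapping as printed by model.config.id2label in the usage snippet.
ID2LABEL = {0: "Angry", 1: "Sad", 2: "Happy", 3: "Surprise",
            4: "Fear", 5: "Disgust", 6: "Contempt", 7: "Neutral"}

def top_emotions(probs, id2label=ID2LABEL, k=3):
    """Return the k most probable (label, probability) pairs, highest first."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return [(id2label[index], prob) for index, prob in ranked[:k]]

# Example values copied from the output above; with the real model,
# pass probabilities[0].tolist() instead.
example_probs = [0.0015, 0.3651, 0.0593, 0.0315, 0.0600, 0.0125, 0.0319, 0.4382]
for label, prob in top_emotions(example_probs):
    print(f"{label}: {prob:.4f}")
```

With the example values, the top three classes are Neutral, Sad, and Fear, in that order.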