---
license: mit
language:
  - en
pipeline_tag: audio-classification
tags:
  - wavlm
  - msp-podcast
  - emotion-recognition
  - audio
  - speech
  - categorical
  - lucas
---

The model was trained on MSP-Podcast as the baseline for the Odyssey 2024 Emotion Recognition competition.
This is the categorical model, which predicts "Angry", "Sad", "Happy", "Surprise", "Fear", "Disgust", "Contempt", and "Neutral".

## Benchmarks

Results on the Test 3 and Development sets of the Odyssey competition (categorical setup):

| Set         | F1-Mic. | F1-Ma. | Prec. | Rec.  |
|-------------|---------|--------|-------|-------|
| Test 3      | 0.327   | 0.311  | 0.332 | 0.325 |
| Development | 0.409   | 0.307  | 0.316 | 0.345 |
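The micro/macro F1, precision, and recall above follow their standard definitions. The snippet below is a minimal, illustrative sketch (not the official challenge scoring script) of how such numbers can be computed from gold and predicted labels with scikit-learn; the `y_true`/`y_pred` lists are hypothetical, and macro averaging is assumed for precision/recall since the table does not say.

```python
# Illustrative only: standard micro/macro F1, precision, and recall via scikit-learn.
# y_true and y_pred are hypothetical example labels, not challenge data.
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = ["Angry", "Sad", "Happy", "Neutral", "Sad"]
y_pred = ["Angry", "Neutral", "Happy", "Neutral", "Sad"]

print("F1-Mic.:", f1_score(y_true, y_pred, average="micro"))
print("F1-Ma.: ", f1_score(y_true, y_pred, average="macro"))
print("Prec.:  ", precision_score(y_true, y_pred, average="macro", zero_division=0))
print("Rec.:   ", recall_score(y_true, y_pred, average="macro", zero_division=0))
```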

For more details, see the demo, the paper (coming soon), and GitHub.

```bibtex
@InProceedings{Goncalves_2024,
  author    = {L. Goncalves and A. N. Salman and A. {Reddy Naini} and L. Moro-Velazquez and T. Thebaud and L. {Paola Garcia} and N. Dehak and B. Sisman and C. Busso},
  title     = {Odyssey2024 - Speech Emotion Recognition Challenge: Dataset, Baseline Framework, and Results},
  booktitle = {Odyssey 2024: The Speaker and Language Recognition Workshop},
  volume    = {To appear},
  year      = {2024},
  month     = {June},
  address   = {Quebec, Canada},
}
```

## Usage

```python
from transformers import AutoModelForAudioClassification
import librosa, torch

# load the model
model = AutoModelForAudioClassification.from_pretrained(
    "3loi/SER-Odyssey-Baseline-WavLM-Categorical-Attributes", trust_remote_code=True
)

# get the mean/std used for waveform normalization
mean = model.config.mean
std = model.config.std

# load an audio file at the model's sampling rate
audio_path = "/path/to/audio.wav"
raw_wav, _ = librosa.load(audio_path, sr=model.config.sampling_rate)

# normalize the audio by mean/std
norm_wav = (raw_wav - mean) / (std + 0.000001)

# generate the attention mask
mask = torch.ones(1, len(norm_wav))

# batch it (add a batch dimension)
wavs = torch.tensor(norm_wav).unsqueeze(0)

# predict
with torch.no_grad():
    pred = model(wavs, mask)

print(model.config.id2label)
print(pred)
# {0: 'Angry', 1: 'Sad', 2: 'Happy', 3: 'Surprise', 4: 'Fear', 5: 'Disgust', 6: 'Contempt', 7: 'Neutral'}
# tensor([[0.0015, 0.3651, 0.0593, 0.0315, 0.0600, 0.0125, 0.0319, 0.4382]])

# convert logits to probabilities
probabilities = torch.nn.functional.softmax(pred, dim=1)
print(probabilities)
# [[0.0015, 0.3651, 0.0593, 0.0315, 0.0600, 0.0125, 0.0319, 0.4382]]
```
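
To turn the scores into a single predicted emotion, the highest-scoring class index can be mapped through `id2label`. A minimal sketch, assuming `pred` is the score tensor printed above:

```python
# take the highest-scoring class and map it to its emotion label
pred_idx = torch.argmax(pred, dim=1).item()
print(model.config.id2label[pred_idx])
# 'Neutral' for the example scores above
```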