This model is fine-tuned on the IEMOCAP dataset. We applied volume normalization and data augmentation (noise injection, pitch shifting, and audio stretching). This is also a speaker-independent model: we use Ses05F in the IEMOCAP dataset as the validation speaker and Ses05M as the test speaker. An illustrative sketch of the augmentation step is given below.
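
The exact augmentation parameters are not specified in this card, so the snippet below is only a minimal sketch of the three augmentations using librosa; the noise scale (0.005), pitch step (2 semitones), and stretch rate (1.1) are illustrative assumptions, not the values used for training.

```
import numpy as np
import librosa

def augment(y, sr):
    # peak-normalize the volume
    y = librosa.util.normalize(y)
    # noise injection (noise scale is an assumed value)
    noisy = y + 0.005 * np.random.randn(len(y))
    # pitch shift by two semitones (assumed amount)
    shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)
    # audio (time) stretching by a factor of 1.1 (assumed rate)
    stretched = librosa.effects.time_stretch(y, rate=1.1)
    return noisy, shifted, stretched
```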

The initial pre-trained model is **facebook/wav2vec2-base**. The fine-tuning dataset only contains the 4 common emotions of IEMOCAP (happy, angry, sad, neutral), *without frustration*. The audio clips are either padded or trimmed to 8 seconds before fine-tuning (see the sketch below).
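
As a minimal sketch of that preprocessing step (assuming zero-padding; the card does not state the exact implementation), each clip can be fixed to 8 seconds at 16 kHz like this:

```
import librosa

target_sampling_rate = 16000
max_length = 8 * target_sampling_rate  # 8 seconds -> 128,000 samples

y, sr = librosa.load("example.wav", sr=target_sampling_rate)
# zero-pad short clips and trim long ones to exactly 8 seconds
y_fixed = librosa.util.fix_length(y, size=max_length)
```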

After **10** epochs of training, the validation accuracy is around **67%**.

To use this model, run the following code in a Python script:

```
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification
import librosa
import torch

target_sampling_rate = 16000
model_name = 'canlinzhang/wav2vec2_speech_emotion_recognition_trained_on_IEMOCAP'
audio_path = your_audio_path  # replace with the path to your audio file

# build the id/label mappings
id2label = {0:'neu', 1:'ang', 2:'sad', 3:'hap'}
label2id = {'neu':0, 'ang':1, 'sad':2, 'hap':3}

# load the feature extractor and the fine-tuned model
feature_extractor = AutoFeatureExtractor.from_pretrained(model_name)

model = AutoModelForAudioClassification.from_pretrained(model_name)

# load the audio and resample it to 16 kHz
y_ini, sr_ini = librosa.load(audio_path, sr=target_sampling_rate)

# extract model inputs from the raw waveform
inputs = feature_extractor(y_ini, sampling_rate=target_sampling_rate, return_tensors="pt")

# run inference without tracking gradients
with torch.no_grad():
    logits = model(**inputs).logits

# map the highest-scoring class id to its emotion label
predicted_class_id = torch.argmax(logits, dim=-1).item()

pred_class = id2label[predicted_class_id]

print(pred_class)
```