|
This model is fine-tuned on the IEMOCAP dataset. We applied volume normalization and data augmentation (noise injection, pitch shifting, and audio stretching). This is also a speaker-independent model: we use Ses05F in the IEMOCAP dataset as the validation speaker and Ses05M as the test speaker.
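
As a rough illustration of this preprocessing, the sketch below shows peak volume normalization and the three augmentation operations using librosa and numpy. The file path, noise level, pitch-shift steps, and stretch rate are illustrative assumptions, not the exact settings used for training.

```
import numpy as np
import librosa

def normalize_volume(y):
    # Peak-normalize the waveform to the [-1, 1] range.
    return librosa.util.normalize(y)

def augment(y, sr, noise_level=0.005, n_steps=2, rate=1.1):
    # Noise injection: add low-amplitude Gaussian noise.
    noisy = y + noise_level * np.random.randn(len(y))
    # Pitch shift: move the pitch up by `n_steps` semitones.
    pitched = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)
    # Audio (time) stretching: speed the clip up by `rate` without changing pitch.
    stretched = librosa.effects.time_stretch(y, rate=rate)
    return noisy, pitched, stretched

# 'example.wav' is a placeholder path.
y, sr = librosa.load('example.wav', sr=16000)
y = normalize_volume(y)
augmented_versions = augment(y, sr)
```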
|
|
|
The initial pre-trained model is facebook/wav2vec2-base. The fine-tuning dataset contains only the four common emotion classes of IEMOCAP (happy, angry, sad, neutral), *without frustration*. The audio clips are either padded or trimmed to 8 seconds before fine-tuning.
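
For reference, padding or trimming a clip to a fixed 8-second length at 16 kHz can be done as in the minimal sketch below (the file path is a placeholder; this is not the exact fine-tuning code):

```
import numpy as np
import librosa

target_sampling_rate = 16000
target_len = 8 * target_sampling_rate  # 8 seconds at 16 kHz

# 'example.wav' is a placeholder path.
y, _ = librosa.load('example.wav', sr=target_sampling_rate)
if len(y) > target_len:
    y = y[:target_len]                       # trim longer clips
else:
    y = np.pad(y, (0, target_len - len(y)))  # zero-pad shorter clips
```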
|
|
|
After **10** epochs of training, the validation accuracy is around **67%**.
|
|
|
To run inference with this model, use the following code in a Python script:
|
|
|
```
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification
import librosa
import torch

target_sampling_rate = 16000
model_name = 'canlinzhang/wav2vec2_speech_emotion_recognition_trained_on_IEMOCAP'
audio_path = your_audio_path  # path to the audio file you want to classify

# build id and label dicts
id2label = {0: 'neu', 1: 'ang', 2: 'sad', 3: 'hap'}
label2id = {'neu': 0, 'ang': 1, 'sad': 2, 'hap': 3}

# load the feature extractor and the fine-tuned model
feature_extractor = AutoFeatureExtractor.from_pretrained(model_name)
model = AutoModelForAudioClassification.from_pretrained(model_name)
model.eval()

# load the audio, resampled to 16 kHz
y_ini, sr_ini = librosa.load(audio_path, sr=target_sampling_rate)

# extract features and run inference
inputs = feature_extractor(y_ini, sampling_rate=target_sampling_rate, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# map the highest-scoring logit to its emotion label
predicted_class_id = torch.argmax(logits).item()
pred_class = id2label[predicted_class_id]
print(pred_class)
```
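
If you also want per-emotion probabilities rather than only the top label, you can apply a softmax to the logits from the snippet above (this reuses the `logits` and `id2label` variables defined there):

```
# Convert logits to per-emotion probabilities.
probs = torch.softmax(logits, dim=-1).squeeze()
for i, p in enumerate(probs.tolist()):
    print(f"{id2label[i]}: {p:.3f}")
```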