README.md · canlinzhang/wav2vec2_speech_emotion_recognition_trained_on_IEMOCAP at 6d19ce7572f4ab9ceee36724d506370baa1d58f3

This model is fine-tuned on the IEMOCAP_speaker_indpt_Ses05F_Ses05M.pickle dataset, which uses Ses05F as the validation speaker and Ses05M as the test speaker. It is therefore a speaker-independent model.
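
As a rough illustration of what a speaker-independent split means, the sketch below assigns examples to train, validation, and test sets based only on their speaker. The record layout (a list of dicts with a `speaker` key) and the helper name `split_by_speaker` are assumptions made for this example; the actual structure of the pickle file is not documented here.

```python
# Hypothetical records; the real pickle's layout may differ.
records = [
    {"speaker": "Ses01F", "label": "neu"},
    {"speaker": "Ses03M", "label": "ang"},
    {"speaker": "Ses05F", "label": "sad"},
    {"speaker": "Ses05M", "label": "hap"},
    {"speaker": "Ses05M", "label": "neu"},
]


def split_by_speaker(records, val_speaker="Ses05F", test_speaker="Ses05M"):
    """Assign each record to train/val/test based only on its speaker."""
    splits = {"train": [], "val": [], "test": []}
    for rec in records:
        if rec["speaker"] == val_speaker:
            splits["val"].append(rec)
        elif rec["speaker"] == test_speaker:
            splits["test"].append(rec)
        else:
            splits["train"].append(rec)
    return splits


splits = split_by_speaker(records)

# The held-out speakers must never appear in the training set.
train_speakers = {rec["speaker"] for rec in splits["train"]}
assert train_speakers.isdisjoint({"Ses05F", "Ses05M"})

print({name: len(items) for name, items in splits.items()})
```

Because the validation and test speakers are excluded from training, accuracy on those sets reflects how the model handles voices it has not heard before.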

The initial pre-trained model is facebook/wav2vec2-base. The fine-tuning dataset contains only 4 common IEMOCAP emotions (happy, angry, sad, neutral) and excludes frustration. No audio augmentation is applied. The audio clips in the fine-tuning dataset are not padded or trimmed to a fixed length beforehand. Instead, the length is set when fine-tuning the transformer, using max_length = 8 sec in the feature extractor.
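
The label filtering and the 8-second limit can be sketched in plain Python. This is only an illustration under stated assumptions: the label codes (including `fru` for frustration) and the helper names `keep_example` and `truncate` are hypothetical. In the transformers library, the feature extractor accepts a `max_length` argument measured in samples together with `truncation=True`, which is one common way to apply such a limit, but whether the author used exactly that call is an assumption.

```python
TARGET_SAMPLING_RATE = 16000  # samples per second
MAX_SECONDS = 8
MAX_LENGTH = TARGET_SAMPLING_RATE * MAX_SECONDS  # 8 s at 16 kHz = 128000 samples

# Assumed label codes for the four emotions that are kept.
KEPT_EMOTIONS = {"hap", "ang", "sad", "neu"}


def keep_example(label):
    """Return True if the example's emotion is one of the four kept classes."""
    return label in KEPT_EMOTIONS


def truncate(samples, max_length=MAX_LENGTH):
    """Cut a waveform to at most max_length samples; shorter clips are unchanged."""
    return samples[:max_length]


# A 10-second clip is cut down to 8 seconds; a 3-second clip keeps its length.
long_clip = [0.0] * (10 * TARGET_SAMPLING_RATE)
short_clip = [0.0] * (3 * TARGET_SAMPLING_RATE)

print(len(truncate(long_clip)), len(truncate(short_clip)))
print(keep_example("ang"), keep_example("fru"))
```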

After 10 epochs of training, the validation accuracy is around 67%.

To use this model, run the following code in a Python script:

from transformers import AutoFeatureExtractor, AutoModelForAudioClassification
import librosa
import torch

target_sampling_rate = 16000
model_name = 'canlinzhang/Sorenson_fine_tune_wav2vec2-on_IEMOCAP_no_aug_no_fru_2'
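my_token = "your_hf_access_token"  # placeholder: replace with your Hugging Face access token
audio_path = "path/to/your_audio.wav"  # placeholder: replace with the path to your audio file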

# Build the id-to-label and label-to-id dicts
id2label = {0:'neu', 1:'ang', 2:'sad', 3:'hap'}
label2id = {'neu':0, 'ang':1, 'sad':2, 'hap':3}

feature_extractor = AutoFeatureExtractor.from_pretrained(model_name)

model = AutoModelForAudioClassification.from_pretrained(model_name, use_auth_token = my_token)

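# Load the audio file and resample it to 16 kHz
y_ini, sr_ini = librosa.load(audio_path, sr=target_sampling_rate)

# Convert the waveform into model inputs
inputs = feature_extractor(y_ini, sampling_rate=target_sampling_rate, return_tensors="pt")

# Run inference without tracking gradients
with torch.no_grad():
    logits = model(**inputs).logits

# Pick the class with the highest logit
predicted_class_ids = torch.argmax(logits, dim=-1).item()

pred_class = id2label[predicted_class_ids]

print(pred_class)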
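
If per-class scores are more useful than a single label, the logits can be turned into probabilities with a softmax. The sketch below uses plain Python and made-up logit values purely for illustration; with the tensors from the script above, `torch.softmax(logits, dim=-1)` performs the same computation.

```python
import math

id2label = {0: "neu", 1: "ang", 2: "sad", 3: "hap"}


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for one audio clip, not real model output.
example_logits = [0.2, 2.1, -0.5, 0.7]

probs = softmax(example_logits)
scores = {id2label[i]: p for i, p in enumerate(probs)}
best = max(scores, key=scores.get)

for label, p in scores.items():
    print(f"{label}: {p:.3f}")
print("predicted:", best)
```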