This model is fine-tuned on the IEMOCAP dataset. We applied volume normalization and data augmentation (noise injection, pitch shifting, and audio stretching). The model is speaker-independent: Ses05F in IEMOCAP is used as the validation speaker and Ses05M as the test speaker.
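
For reference, the augmentation steps listed above can be reproduced with librosa roughly as in the sketch below. This is a minimal illustration, not the exact training code; the noise amplitude, pitch-shift step, and stretch rate are assumptions.

import numpy as np
import librosa

def augment(y, sr=16000):
    # Volume normalization: scale the waveform to unit peak amplitude
    y = y / np.max(np.abs(y))
    # Noise injection: add low-amplitude Gaussian noise (0.005 is an assumed level)
    noisy = y + 0.005 * np.random.randn(len(y))
    # Pitch shift: move the pitch up by two semitones (assumed step)
    shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)
    # Audio stretching: slow the clip down by ~10% without changing pitch (assumed rate)
    stretched = librosa.effects.time_stretch(y, rate=0.9)
    return noisy, shifted, stretched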

The initial pre-trained model is facebook/wav2vec2-base. The fine-tuning dataset contains only the 4 common IEMOCAP emotions (happy, angry, sad, neutral), without frustration. The audio clips are either padded or trimmed to 8 seconds before fine-tuning.
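
Padding or trimming to a fixed 8-second window at the 16 kHz sampling rate can be done as sketched below. This is illustrative only; zero-padding at the end of short clips is an assumption about the preprocessing.

import numpy as np

def pad_or_trim(y, sr=16000, target_sec=8):
    # Fix every clip to exactly 8 s: trim longer clips, zero-pad shorter ones
    target_len = sr * target_sec
    if len(y) >= target_len:
        return y[:target_len]
    return np.pad(y, (0, target_len - len(y)))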

After 10 epochs of training, the validation accuracy is around 67%.

To run inference with this model, use the following code in a Python script:

from transformers import AutoFeatureExtractor, AutoModelForAudioClassification
import librosa
import torch

target_sampling_rate = 16000
model_name = 'canlinzhang/wav2vec2_speech_emotion_recognition_trained_on_IEMOCAP'
audio_path = your_audio_path

# Build id and label dicts
id2label = {0: 'neu', 1: 'ang', 2: 'sad', 3: 'hap'}
label2id = {'neu': 0, 'ang': 1, 'sad': 2, 'hap': 3}

# Load the feature extractor and the fine-tuned classification model
feature_extractor = AutoFeatureExtractor.from_pretrained(model_name)
model = AutoModelForAudioClassification.from_pretrained(model_name)
model.eval()

# Load the audio and resample it to 16 kHz
y_ini, sr_ini = librosa.load(audio_path, sr=target_sampling_rate)

# Extract input features and run the classifier
inputs = feature_extractor(y_ini, sampling_rate=target_sampling_rate, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Map the highest-scoring logit to its emotion label
predicted_class_ids = torch.argmax(logits).item()
pred_class = id2label[predicted_class_ids]
print(pred_class)