|
This model is fine-tuned on the IEMOCAP dataset. We applied volume normalization and data augmentation (noise injection, pitch shifting, and audio stretching). This is also a speaker-independent model: we use Ses05F in the IEMOCAP dataset as the validation speaker and Ses05M as the test speaker.
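
As a rough illustration of this preprocessing, the sketch below shows peak volume normalization and the three augmentation operations using librosa and numpy. The file path, noise level, pitch-shift steps, and stretch rate are illustrative assumptions, not the exact settings used for training.

```
import numpy as np
import librosa

def normalize_volume(y):
    # Peak-normalize the waveform to the [-1, 1] range.
    return librosa.util.normalize(y)

def augment(y, sr, noise_level=0.005, n_steps=2, rate=1.1):
    # Noise injection: add low-amplitude Gaussian noise.
    noisy = y + noise_level * np.random.randn(len(y))
    # Pitch shift: move the pitch up by `n_steps` semitones.
    pitched = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)
    # Audio (time) stretching: speed the clip up by `rate` without changing pitch.
    stretched = librosa.effects.time_stretch(y, rate=rate)
    return noisy, pitched, stretched

# 'example.wav' is a placeholder path.
y, sr = librosa.load('example.wav', sr=16000)
y = normalize_volume(y)
augmented_versions = augment(y, sr)
```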
|
|
|
The initial pre-trained model is facebook/wav2vec2-base. The fine-tuning dataset contains only the four common emotion classes of IEMOCAP (happy, angry, sad, neutral), *without frustration*. The audio clips are either padded or trimmed to 8 seconds before fine-tuning.
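
For reference, padding or trimming a clip to a fixed 8-second length at 16 kHz can be done as in the minimal sketch below (the file path is a placeholder; this is not the exact fine-tuning code):

```
import numpy as np
import librosa

target_sampling_rate = 16000
target_len = 8 * target_sampling_rate  # 8 seconds at 16 kHz

# 'example.wav' is a placeholder path.
y, _ = librosa.load('example.wav', sr=target_sampling_rate)
if len(y) > target_len:
    y = y[:target_len]                       # trim longer clips
else:
    y = np.pad(y, (0, target_len - len(y)))  # zero-pad shorter clips
```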
|
|
|
After **10** epochs of training, the validation accuracy is around **67%**.
|
|
|
To run inference with this model, use the following code in a Python script:
|
|
|
```
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification
import librosa
import torch

target_sampling_rate = 16000
model_name = 'canlinzhang/wav2vec2_speech_emotion_recognition_trained_on_IEMOCAP'
audio_path = your_audio_path  # path to the audio file you want to classify

# build id and label dicts
id2label = {0: 'neu', 1: 'ang', 2: 'sad', 3: 'hap'}
label2id = {'neu': 0, 'ang': 1, 'sad': 2, 'hap': 3}

# load the feature extractor and the fine-tuned model
feature_extractor = AutoFeatureExtractor.from_pretrained(model_name)
model = AutoModelForAudioClassification.from_pretrained(model_name)
model.eval()

# load the audio, resampled to 16 kHz
y_ini, sr_ini = librosa.load(audio_path, sr=target_sampling_rate)

# extract features and run inference
inputs = feature_extractor(y_ini, sampling_rate=target_sampling_rate, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# map the highest-scoring logit to its emotion label
predicted_class_id = torch.argmax(logits).item()
pred_class = id2label[predicted_class_id]
print(pred_class)
```
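
If you also want per-emotion probabilities rather than only the top label, you can apply a softmax to the logits from the snippet above (this reuses the `logits` and `id2label` variables defined there):

```
# Convert logits to per-emotion probabilities.
probs = torch.softmax(logits, dim=-1).squeeze()
for i, p in enumerate(probs.tolist()):
    print(f"{id2label[i]}: {p:.3f}")
```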