canlinzhang
/

wav2vec2_speech_emotion_recognition_trained_on_IEMOCAP


	This model is fine-tuned on the IEMOCAP dataset. We applied volume normalization and data augmentation (noise injection, pitch shifting, and time stretching). The model is speaker-independent: we use Ses05F in the IEMOCAP dataset as the validation speaker and Ses05M as the test speaker.
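Two of the preprocessing steps named above, volume normalization and noise injection, can be sketched with NumPy alone. This is only an illustration: the card does not say which normalization method or noise level was used, so the peak-normalization approach, the `peak` value, and the `snr_db` value below are all assumptions, and the function names are hypothetical.

```python
import numpy as np

def peak_normalize(y, peak=0.95):
    """Scale the waveform so its largest absolute sample equals `peak` (assumed method)."""
    max_abs = np.max(np.abs(y))
    if max_abs == 0:
        return y.copy()  # silent clip: nothing to scale
    return y * (peak / max_abs)

def add_noise(y, snr_db=20.0, rng=None):
    """Add white Gaussian noise at the given signal-to-noise ratio in dB (assumed level)."""
    rng = np.random.default_rng() if rng is None else rng
    signal_power = np.mean(y ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=y.shape)
    return y + noise

# One second of a 440 Hz tone at 16 kHz, used as stand-in audio
sr = 16000
t = np.arange(sr) / sr
y = 0.1 * np.sin(2 * np.pi * 440 * t)

y_norm = peak_normalize(y)
y_noisy = add_noise(y_norm, snr_db=20.0, rng=np.random.default_rng(0))
```

Pitch shifting and time stretching are not shown here; librosa provides `librosa.effects.pitch_shift` and `librosa.effects.time_stretch` for those.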

	The initial pre-trained model is facebook/wav2vec2-base. The fine-tuning dataset contains only four common IEMOCAP emotions (happy, angry, sad, neutral), excluding frustration. Each audio clip is either padded or trimmed to 8 seconds before fine-tuning.
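The pad-or-trim step could look roughly like the sketch below. The card does not state where the padding goes or which part of a long clip is kept, so zero-padding at the end and keeping the first 8 seconds are assumptions, and `pad_or_trim` is a hypothetical helper name.

```python
import numpy as np

def pad_or_trim(y, sr=16000, duration_s=8.0):
    """Return a waveform of exactly sr * duration_s samples."""
    target_len = int(sr * duration_s)
    if len(y) >= target_len:
        # assumed: keep the first 8 seconds of a longer clip
        return y[:target_len]
    # assumed: append zeros (silence) to a shorter clip
    return np.pad(y, (0, target_len - len(y)), mode="constant")

short_clip = pad_or_trim(np.ones(16000 * 3))   # 3 s clip, padded to 8 s
long_clip = pad_or_trim(np.ones(16000 * 10))   # 10 s clip, trimmed to 8 s
```

Applying the same step at inference time may keep inputs closer to the training conditions, though the card does not say whether this is required.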

	After 10 epochs of training, the validation accuracy is around 67%.

	To use this model, run the following code in a Python script:

	```
	from transformers import AutoFeatureExtractor, AutoModelForAudioClassification
	import librosa
	import torch

	target_sampling_rate = 16000
	model_name = 'canlinzhang/wav2vec2_speech_emotion_recognition_trained_on_IEMOCAP'
	audio_path = 'your_audio_path'  # replace with the path to your audio file

	# build id and label dicts
	id2label = {0:'neu', 1:'ang', 2:'sad', 3:'hap'}
	label2id = {'neu':0, 'ang':1, 'sad':2, 'hap':3}

	feature_extractor = AutoFeatureExtractor.from_pretrained(model_name)

	model = AutoModelForAudioClassification.from_pretrained(model_name)

	# load the audio as mono and resample it to 16 kHz
	y_ini, sr_ini = librosa.load(audio_path, sr=target_sampling_rate)

	inputs = feature_extractor(y_ini, sampling_rate=target_sampling_rate, return_tensors="pt")

	# gradients are not needed for inference
	with torch.no_grad():
	    logits = model(**inputs).logits

	# index of the highest-scoring class for the single input clip
	predicted_class_ids = torch.argmax(logits, dim=-1).item()

	pred_class = id2label[predicted_class_ids]

	print(pred_class)
	```
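If per-class scores are wanted rather than a single label, the logits can be passed through a softmax. The sketch below uses NumPy and made-up logit values purely for illustration (they are not model outputs); in the script above, `torch.softmax(logits, dim=-1)` would do the same thing on the real tensor.

```python
import numpy as np

id2label = {0: 'neu', 1: 'ang', 2: 'sad', 3: 'hap'}

def softmax(x):
    """Numerically stable softmax over the last axis."""
    shifted = x - np.max(x, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

# Illustrative logits for one clip (hypothetical values, not produced by the model)
example_logits = np.array([[0.2, 1.5, -0.3, 0.1]])

probs = softmax(example_logits)[0]
scores = {id2label[i]: float(p) for i, p in enumerate(probs)}
top_label = max(scores, key=scores.get)

print(scores)
print(top_label)
```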