canlinzhang committed on
Commit
3dfc212
1 Parent(s): 0969d04

Update README.md

Files changed (1)
  1. README.md +5 -6
README.md CHANGED
@@ -1,10 +1,10 @@
- This model is fine-tuned on the IEMOCAP_speaker_indpt_Ses05F_Ses05M.pickle dataset, which uses Ses05F as the validation speaker and Ses05M as the test speaker. **We apply neither volume normalization nor data augmentation.** So it is a speaker-independent model.
+ This model is fine-tuned on the IEMOCAP dataset. We applied volume normalization and data augmentation (noise injection, pitch shifting, and audio stretching). It is also a speaker-independent model: we use Ses05F in the IEMOCAP dataset as the validation speaker and Ses05M as the test speaker.

- The initial pre-trained model is facebook/wav2vec2-base. The fine-tuning dataset only contains the 4 common emotions of IEMOCAP (happy, angry, sad, neutral), **without frustration**. No audio augmentation is applied. The fine-tuning audios are not padded or trimmed to a fixed length; the length is set during fine-tuning via max_length = 8 sec in the feature extractor.
+ The initial pre-trained model is facebook/wav2vec2-base. The fine-tuning dataset only contains the 4 common emotions of IEMOCAP (happy, angry, sad, neutral), *without frustration*. The audios are padded or trimmed to 8 seconds before fine-tuning.

  After **10** epochs of training, the validation accuracy is around **67%**.

- To use this model, run the following code in a Python script:
+ To use this model, please run the following code in a Python script:

  ```
  from transformers import AutoFeatureExtractor, AutoModelForAudioClassification
@@ -12,8 +12,7 @@ import librosa
  import torch

  target_sampling_rate = 16000
- model_name = 'canlinzhang/Sorenson_fine_tune_wav2vec2-on_IEMOCAP_no_aug_no_fru_2'
- my_token = my_token
+ model_name = 'canlinzhang/wav2vec2_speech_emotion_recognition_trained_on_IEMOCAP'
  audio_path = your_audio_path

  #build id and label dicts
@@ -22,7 +21,7 @@ label2id = {'neu':0, 'ang':1, 'sad':2, 'hap':3}

  feature_extractor = AutoFeatureExtractor.from_pretrained(model_name)

- model = AutoModelForAudioClassification.from_pretrained(model_name, use_auth_token = my_token)
+ model = AutoModelForAudioClassification.from_pretrained(model_name)

  y_ini, sr_ini = librosa.load(audio_path, sr=target_sampling_rate)
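
The hunks above only show the lines that changed, so the usage snippet is incomplete as rendered. Below is a minimal, self-contained sketch of the inference flow the updated README describes, assuming the renamed model is publicly available on the Hub; the feature-extraction call (including the 8-second max_length padding), the forward pass, and the id2label lookup are assumptions filled in here and may differ from the author's exact script.

```
import librosa
import torch
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification

target_sampling_rate = 16000
model_name = 'canlinzhang/wav2vec2_speech_emotion_recognition_trained_on_IEMOCAP'
audio_path = 'your_audio.wav'  # placeholder: replace with a real audio file path

# Label dictionaries from the README (4 common IEMOCAP emotions, no frustration)
label2id = {'neu': 0, 'ang': 1, 'sad': 2, 'hap': 3}
id2label = {v: k for k, v in label2id.items()}

feature_extractor = AutoFeatureExtractor.from_pretrained(model_name)
model = AutoModelForAudioClassification.from_pretrained(model_name)
model.eval()

# Load the audio at 16 kHz, the sampling rate used for fine-tuning
y_ini, sr_ini = librosa.load(audio_path, sr=target_sampling_rate)

# Pad or truncate to 8 seconds, mirroring the max_length = 8 sec setting (assumed here)
inputs = feature_extractor(
    y_ini,
    sampling_rate=target_sampling_rate,
    max_length=8 * target_sampling_rate,
    padding='max_length',
    truncation=True,
    return_tensors='pt',
)

# Run inference and map the predicted class id back to an emotion label
with torch.no_grad():
    logits = model(**inputs).logits
pred_id = int(torch.argmax(logits, dim=-1).item())
print('Predicted emotion:', id2label[pred_id])
```

Padding or truncating to 8 × 16000 samples keeps inference consistent with the fixed 8-second inputs described above; shorter clips are zero-padded and longer ones are cut by the feature extractor.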