BoJack committed on
Commit
7714a5c
1 Parent(s): 11231b8

Upload 8 files

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ logo.png filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,3 +1,157 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: other
+ license_name: model-license
+ license_link: https://github.com/alibaba-damo-academy/FunASR
+ frameworks:
+ - Pytorch
+ tasks:
+ - emotion-recognition
+ widgets:
+ - enable: true
+   version: 1
+   task: emotion-recognition
+   examples:
+   - inputs:
+     - data: git://example/test.wav
+   inputs:
+   - type: audio
+     displayType: AudioUploader
+     validator:
+       max_size: 10M
+     name: input
+   output:
+     displayType: Prediction
+     displayValueMapping:
+       labels: labels
+       scores: scores
+   inferencespec:
+     cpu: 8
+     gpu: 0
+     gpu_memory: 0
+     memory: 4096
+   model_revision: master
+   extendsParameters:
+     extract_embedding: false
+ ---
+
+ <div align="center">
+   <h1>
+     EMOTION2VEC+
+   </h1>
+   <p>
+     emotion2vec+: speech emotion recognition foundation model <br>
+     <b>emotion2vec+ large model</b>
+   </p>
+   <p>
+     <img src="logo.png" style="width: 200px; height: 200px;">
+   </p>
+   <p>
+   </p>
+ </div>
+
+
+ # Guides
+ emotion2vec+ is a series of foundation models for speech emotion recognition (SER). We aim to train a "whisper" for speech emotion recognition, overcoming the effects of language and recording environment through data-driven methods to achieve universal, robust emotion recognition. emotion2vec+ significantly outperforms other highly downloaded open-source models on Hugging Face.
+
+ ![](emotion2vec+radar.png)
+
+ This version (emotion2vec_plus_large) is fine-tuned on large-scale pseudo-labeled data to obtain a large-sized model (~300M parameters). It currently supports the following categories (a Python mapping of these indices is shown after the list):
+ - 0: angry
+ - 1: happy
+ - 2: neutral
+ - 3: sad
+ - 4: unknown
+
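+ For reference, the same index-to-label mapping written as a small Python dict (labels exactly as listed above):
+
+ ```python
+ # Output categories of emotion2vec_plus_large, per the list above.
+ ID2LABEL = {0: "angry", 1: "happy", 2: "neutral", 3: "sad", 4: "unknown"}
+ ```
+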
+ # Model Card
+ GitHub Repo: [emotion2vec](https://github.com/ddlBoJack/emotion2vec)
+
+ |Model|⭐ ModelScope|🤗 Hugging Face|Fine-tuning Data (Hours)|
+ |:---:|:---:|:---:|:---:|
+ |emotion2vec|[Link](https://www.modelscope.cn/models/iic/emotion2vec_base/summary)|[Link](https://huggingface.co/emotion2vec/emotion2vec_base)|/|
+ |emotion2vec+ seed|[Link](https://modelscope.cn/models/iic/emotion2vec_plus_seed/summary)|[Link](https://huggingface.co/emotion2vec/emotion2vec_plus_seed)|201|
+ |emotion2vec+ base|[Link](https://modelscope.cn/models/iic/emotion2vec_plus_base/summary)|[Link](https://huggingface.co/emotion2vec/emotion2vec_plus_base)|4788|
+ |emotion2vec+ large|[Link](https://modelscope.cn/models/iic/emotion2vec_plus_large/summary)|[Link](https://huggingface.co/emotion2vec/emotion2vec_plus_large)|42526|
+
+
+ # Data Iteration
+
+ We offer 3 versions of emotion2vec+, each derived from the data of its predecessor. If you need a model that focuses on speech emotion representation, refer to [emotion2vec: universal speech emotion representation model](https://huggingface.co/emotion2vec/emotion2vec).
+
+ - emotion2vec+ seed: fine-tuned on academic speech emotion data
+ - emotion2vec+ base: fine-tuned on filtered large-scale pseudo-labeled data to obtain the base-sized model (~90M parameters)
+ - emotion2vec+ large: fine-tuned on filtered large-scale pseudo-labeled data to obtain the large-sized model (~300M parameters)
+
+ The iteration process is illustrated below, culminating in the training of the emotion2vec+ large model on 40k hours selected from 160k hours of speech emotion data. Details of the data engineering will be announced later.
+
+ ![](emotion2vec+data.png)
+
+ # Installation
+
+ `pip install -U funasr modelscope`
+
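+ A minimal sanity check after installation (optional; it only confirms that both packages import and reports their installed versions):
+
+ ```python
+ # Confirm funasr and modelscope are installed and importable.
+ from importlib.metadata import version
+
+ import funasr  # noqa: F401
+ import modelscope  # noqa: F401
+
+ print("funasr", version("funasr"), "| modelscope", version("modelscope"))
+ ```
+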
+ # Usage
+
+ input: 16 kHz speech recording
+
+ granularity:
+ - "utterance": extract features over the entire utterance
+ - "frame": extract frame-level features (50 Hz)
+
+ extract_embedding: whether to extract features; set it to False if you only need the classification output
+
+ ## Inference based on ModelScope
+
+ ```python
+ from modelscope.pipelines import pipeline
+ from modelscope.utils.constant import Tasks
+
+ inference_pipeline = pipeline(
+     task=Tasks.emotion_recognition,
+     model="iic/emotion2vec_plus_large")
+
+ rec_result = inference_pipeline('https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_zh.wav', granularity="utterance", extract_embedding=False)
+ print(rec_result)
+ ```
+
+
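+ The pipeline reports class labels with scores (the same `labels`/`scores` fields referenced in the widget configuration above). A minimal sketch for picking the top prediction, assuming the result is a dict with parallel `labels` and `scores` lists (adjust the indexing if your version returns a list of such dicts):
+
+ ```python
+ # Pick the label with the highest score from the pipeline output.
+ labels, scores = rec_result["labels"], rec_result["scores"]
+ top_label, top_score = max(zip(labels, scores), key=lambda pair: pair[1])
+ print(f"predicted emotion: {top_label} ({top_score:.3f})")
+ ```
+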
+ ## Inference based on FunASR
+
+ ```python
+ from funasr import AutoModel
+
+ model = AutoModel(model="iic/emotion2vec_plus_large")
+
+ wav_file = f"{model.model_path}/example/test.wav"
+ res = model.generate(wav_file, output_dir="./outputs", granularity="utterance", extract_embedding=False)
+ print(res)
+ ```
+ Note: the model is downloaded automatically on first use.
+
+
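+ To also obtain the embeddings described in the Usage section, pass `extract_embedding=True`; a frame-level variant is sketched below (illustrative only, reusing `model` and `wav_file` from the block above; the `labels`/`scores` field names follow the output mapping used elsewhere in this card):
+
+ ```python
+ # Frame-level features (50 Hz) plus classification scores; embeddings are
+ # written to output_dir as numpy files when extract_embedding=True.
+ res = model.generate(
+     wav_file,
+     output_dir="./outputs",
+     granularity="frame",
+     extract_embedding=True,
+ )
+ print(res[0]["labels"], res[0]["scores"])
+ ```
+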
+ Input file lists in wav.scp (Kaldi style) are also supported:
+ ```cat wav.scp
+ wav_name1 wav_path1.wav
+ wav_name2 wav_path2.wav
+ ...
+ ```
+
+ Outputs are emotion representations, saved to `output_dir` in numpy format (loadable with `np.load()`).
+
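+ A short sketch for reading those saved representations back (file names depend on the utterance keys, so the glob below is purely illustrative):
+
+ ```python
+ import glob
+
+ import numpy as np
+
+ # Each utterance processed with extract_embedding=True leaves a .npy file in output_dir.
+ for path in sorted(glob.glob("./outputs/*.npy")):
+     emb = np.load(path)
+     print(path, emb.shape)
+ ```
+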
+ # Note
+
+ This repository is the Hugging Face version of emotion2vec; its model parameters are identical to those of the original model and the ModelScope version.
+
+ Original repository: [https://github.com/ddlBoJack/emotion2vec](https://github.com/ddlBoJack/emotion2vec)
+
+ ModelScope repository: [https://github.com/alibaba-damo-academy/FunASR](https://github.com/alibaba-damo-academy/FunASR/tree/funasr1.0/examples/industrial_data_pretraining/emotion2vec)
+
+ Hugging Face repository: [https://huggingface.co/emotion2vec](https://huggingface.co/emotion2vec)
+
+ # Citation
+ ```BibTeX
+ @article{ma2023emotion2vec,
+   title={emotion2vec: Self-Supervised Pre-Training for Speech Emotion Representation},
+   author={Ma, Ziyang and Zheng, Zhisheng and Ye, Jiaxin and Li, Jinchao and Gao, Zhifu and Zhang, Shiliang and Chen, Xie},
+   journal={arXiv preprint arXiv:2312.15185},
+   year={2023}
+ }
+ ```
config.yaml ADDED
@@ -0,0 +1,219 @@
+
+ # network architecture
+ model: Emotion2vec
+ model_conf:
+   _name: data2vec_multi
+   activation_dropout: 0.0
+   adversarial_hidden_dim: 128
+   adversarial_training: false
+   adversarial_weight: 0.1
+   attention_dropout: 0.1
+   average_top_k_layers: 16
+   batch_norm_target_layer: false
+   clone_batch: 12
+   cls_loss: 1.0
+   cls_type: chunk
+   d2v_loss: 1.0
+   decoder_group: false
+   depth: 8
+   dropout_input: 0.0
+   ema_anneal_end_step: 20000
+   ema_decay: 0.9997
+   ema_encoder_only: false
+   ema_end_decay: 1.0
+   ema_same_dtype: true
+   embed_dim: 1024
+   encoder_dropout: 0.1
+   end_drop_path_rate: 0.0
+   end_of_block_targets: false
+   instance_norm_target_layer: true
+   instance_norm_targets: false
+   layer_norm_first: false
+   layer_norm_target_layer: false
+   layer_norm_targets: false
+   layerdrop: 0.0
+   log_norms: true
+   loss_beta: 0.0
+   loss_scale: null
+   mae_init: false
+   max_update: 100000
+   min_pred_var: 0.01
+   min_target_var: 0.1
+   mlp_ratio: 4.0
+   normalize: true
+   modalities:
+     _name: null
+     audio:
+       add_masks: false
+       alibi_max_pos: null
+       alibi_scale: 1.0
+       conv_pos_depth: 5
+       conv_pos_groups: 16
+       conv_pos_pre_ln: false
+       conv_pos_width: 95
+       decoder:
+         add_positions_all: false
+         add_positions_masked: false
+         decoder_dim: 768
+         decoder_groups: 16
+         decoder_kernel: 7
+         decoder_layers: 4
+         decoder_residual: true
+         input_dropout: 0.1
+         projection_layers: 1
+         projection_ratio: 2.0
+       ema_local_encoder: false
+       encoder_zero_mask: true
+       end_drop_path_rate: 0.0
+       extractor_mode: layer_norm
+       feature_encoder_spec: '[(512, 10, 5)] + [(512, 3, 2)] * 4 + [(512,2,2)] + [(512,2,2)]'
+       init_extra_token_zero: true
+       inverse_mask: false
+       keep_masked_pct: 0.0
+       learned_alibi: false
+       learned_alibi_scale: true
+       learned_alibi_scale_per_head: true
+       learned_alibi_scale_per_layer: false
+       local_grad_mult: 1.0
+       mask_channel_length: 64
+       mask_channel_prob: 0.0
+       mask_dropout: 0.0
+       mask_length: 5
+       mask_noise_std: 0.01
+       mask_prob: 0.55
+       mask_prob_adjust: 0.1
+       mask_prob_min: null
+       model_depth: 8
+       num_alibi_heads: 16
+       num_extra_tokens: 10
+       prenet_depth: 4
+       prenet_dropout: 0.1
+       prenet_layerdrop: 0.0
+       remove_masks: false
+       start_drop_path_rate: 0.0
+       type: AUDIO
+       use_alibi_encoder: true
+     image:
+       add_masks: false
+       alibi_dims: 2
+       alibi_distance: manhattan
+       alibi_max_pos: null
+       alibi_scale: 1.0
+       decoder:
+         add_positions_all: false
+         add_positions_masked: false
+         decoder_dim: 384
+         decoder_groups: 16
+         decoder_kernel: 5
+         decoder_layers: 5
+         decoder_residual: true
+         input_dropout: 0.1
+         projection_layers: 1
+         projection_ratio: 2.0
+       ema_local_encoder: false
+       embed_dim: 768
+       enc_dec_transformer: false
+       encoder_zero_mask: true
+       end_drop_path_rate: 0.0
+       fixed_positions: true
+       in_chans: 3
+       init_extra_token_zero: true
+       input_size: 224
+       inverse_mask: false
+       keep_masked_pct: 0.0
+       learned_alibi: false
+       learned_alibi_scale: false
+       learned_alibi_scale_per_head: false
+       learned_alibi_scale_per_layer: false
+       local_grad_mult: 1.0
+       mask_channel_length: 64
+       mask_channel_prob: 0.0
+       mask_dropout: 0.0
+       mask_length: 5
+       mask_noise_std: 0.01
+       mask_prob: 0.7
+       mask_prob_adjust: 0.0
+       mask_prob_min: null
+       model_depth: 8
+       num_alibi_heads: 16
+       num_extra_tokens: 0
+       patch_size: 16
+       prenet_depth: 4
+       prenet_dropout: 0.0
+       prenet_layerdrop: 0.0
+       remove_masks: false
+       start_drop_path_rate: 0.0
+       transformer_decoder: false
+       type: IMAGE
+       use_alibi_encoder: false
+     text:
+       add_masks: false
+       alibi_max_pos: null
+       alibi_scale: 1.0
+       decoder:
+         add_positions_all: false
+         add_positions_masked: false
+         decoder_dim: 384
+         decoder_groups: 16
+         decoder_kernel: 5
+         decoder_layers: 5
+         decoder_residual: true
+         input_dropout: 0.1
+         projection_layers: 1
+         projection_ratio: 2.0
+       dropout: 0.1
+       ema_local_encoder: false
+       encoder_zero_mask: true
+       end_drop_path_rate: 0.0
+       init_extra_token_zero: true
+       inverse_mask: false
+       keep_masked_pct: 0.0
+       layernorm_embedding: true
+       learned_alibi: false
+       learned_alibi_scale: false
+       learned_alibi_scale_per_head: false
+       learned_alibi_scale_per_layer: false
+       learned_pos: true
+       local_grad_mult: 1.0
+       mask_channel_length: 64
+       mask_channel_prob: 0.0
+       mask_dropout: 0.0
+       mask_length: 5
+       mask_noise_std: 0.01
+       mask_prob: 0.7
+       mask_prob_adjust: 0.0
+       mask_prob_min: null
+       max_source_positions: 512
+       model_depth: 8
+       no_scale_embedding: true
+       no_token_positional_embeddings: false
+       num_alibi_heads: 16
+       num_extra_tokens: 0
+       prenet_depth: 4
+       prenet_dropout: 0.0
+       prenet_layerdrop: 0.0
+       remove_masks: false
+       start_drop_path_rate: 0.0
+       type: TEXT
+       use_alibi_encoder: false
+   norm_affine: true
+   norm_eps: 1.0e-05
+   num_heads: 16
+   post_mlp_drop: 0.1
+   recon_loss: 0.0
+   seed: 1
+   shared_decoder: null
+   skip_ema: false
+   start_drop_path_rate: 0.0
+   supported_modality: AUDIO
+
+ tokenizer: CharTokenizer
+ tokenizer_conf:
+   unk_symbol: <unk>
+   split_with_space: true
+
+ scope_map:
+ - 'd2v_model.'
+ - none
+
+
configuration.json ADDED
@@ -0,0 +1,13 @@
+ {
+   "framework": "pytorch",
+   "task": "emotion-recognition",
+   "pipeline": {"type": "funasr-pipeline"},
+   "model": {"type": "funasr"},
+   "file_path_metas": {
+     "init_param": "model.pt",
+     "tokenizer_conf": {"token_list": "tokens.txt"},
+     "config": "config.yaml"},
+   "model_name_in_hub": {
+     "ms": "iic/emotion2vec_base",
+     "hf": ""}
+ }
emotion2vec+data.png ADDED
emotion2vec+radar.png ADDED
example/test.wav ADDED
Binary file (131 kB)
 
logo.png ADDED

Git LFS Details

  • SHA256: 8a1aa31431bfb2bf126d7cf383c8b681b2372c333f1328b342bab5969dc0a569
  • Pointer size: 132 Bytes
  • Size of remote file: 1.85 MB
tokens.txt ADDED
@@ -0,0 +1,9 @@
+ 生气/angry
+ unuse_0
+ unuse_1
+ 开心/happy
+ 中立/neutral
+ unuse_2
+ 难过/sad
+ unuse_3
+ <unk>