BoJack committed on
Commit
7714a5c
1 Parent(s): 11231b8

Upload 8 files

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ logo.png filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,3 +1,157 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: other
+ license_name: model-license
+ license_link: https://github.com/alibaba-damo-academy/FunASR
+ frameworks:
+ - Pytorch
+ tasks:
+ - emotion-recognition
+ widgets:
+ - enable: true
+   version: 1
+   task: emotion-recognition
+   examples:
+   - inputs:
+     - data: git://example/test.wav
+   inputs:
+   - type: audio
+     displayType: AudioUploader
+     validator:
+       max_size: 10M
+     name: input
+   output:
+     displayType: Prediction
+     displayValueMapping:
+       labels: labels
+       scores: scores
+   inferencespec:
+     cpu: 8
+     gpu: 0
+     gpu_memory: 0
+     memory: 4096
+   model_revision: master
+   extendsParameters:
+     extract_embedding: false
+ ---
+
+ <div align="center">
+   <h1>
+     EMOTION2VEC+
+   </h1>
+   <p>
+     emotion2vec+: speech emotion recognition foundation model <br>
+     <b>emotion2vec+ large model</b>
+   </p>
+   <p>
+     <img src="logo.png" style="width: 200px; height: 200px;">
+   </p>
+   <p>
+   </p>
+ </div>
+
+
+ # Guides
+ emotion2vec+ is a series of foundation models for speech emotion recognition (SER). We aim to train a "whisper" for speech emotion recognition, overcoming the effects of language and recording environment through data-driven methods to achieve universal, robust emotion recognition. emotion2vec+ significantly outperforms other highly downloaded open-source models on Hugging Face.
+
+ ![](emotion2vec+radar.png)
+
+ This version (emotion2vec_plus_large) is fine-tuned on large-scale pseudo-labeled data to obtain a large-sized model (~300M parameters). It currently supports the following categories (a Python mapping of these indices is shown after the list):
+ - 0: angry
+ - 1: happy
+ - 2: neutral
+ - 3: sad
+ - 4: unknown
+
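+ For reference, the same index-to-label mapping written as a small Python dict (labels exactly as listed above):
+
+ ```python
+ # Output categories of emotion2vec_plus_large, per the list above.
+ ID2LABEL = {0: "angry", 1: "happy", 2: "neutral", 3: "sad", 4: "unknown"}
+ ```
+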
+ # Model Card
+ GitHub Repo: [emotion2vec](https://github.com/ddlBoJack/emotion2vec)
+
+ |Model|⭐ ModelScope|🤗 Hugging Face|Fine-tuning Data (Hours)|
+ |:---:|:---:|:---:|:---:|
+ |emotion2vec|[Link](https://www.modelscope.cn/models/iic/emotion2vec_base/summary)|[Link](https://huggingface.co/emotion2vec/emotion2vec_base)|/|
+ |emotion2vec+ seed|[Link](https://modelscope.cn/models/iic/emotion2vec_plus_seed/summary)|[Link](https://huggingface.co/emotion2vec/emotion2vec_plus_seed)|201|
+ |emotion2vec+ base|[Link](https://modelscope.cn/models/iic/emotion2vec_plus_base/summary)|[Link](https://huggingface.co/emotion2vec/emotion2vec_plus_base)|4788|
+ |emotion2vec+ large|[Link](https://modelscope.cn/models/iic/emotion2vec_plus_large/summary)|[Link](https://huggingface.co/emotion2vec/emotion2vec_plus_large)|42526|
+
+
+ # Data Iteration
+
+ We offer 3 versions of emotion2vec+, each derived from the data of its predecessor. If you need a model that focuses on speech emotion representation, refer to [emotion2vec: universal speech emotion representation model](https://huggingface.co/emotion2vec/emotion2vec).
+
+ - emotion2vec+ seed: fine-tuned on academic speech emotion data
+ - emotion2vec+ base: fine-tuned on filtered large-scale pseudo-labeled data to obtain the base-sized model (~90M parameters)
+ - emotion2vec+ large: fine-tuned on filtered large-scale pseudo-labeled data to obtain the large-sized model (~300M parameters)
+
+ The iteration process is illustrated below, culminating in the training of the emotion2vec+ large model on 40k hours selected from 160k hours of speech emotion data. Details of the data engineering will be announced later.
+
+ ![](emotion2vec+data.png)
+
+ # Installation
+
+ `pip install -U funasr modelscope`
+
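+ A minimal sanity check after installation (optional; it only confirms that both packages import and reports their installed versions):
+
+ ```python
+ # Confirm funasr and modelscope are installed and importable.
+ from importlib.metadata import version
+
+ import funasr  # noqa: F401
+ import modelscope  # noqa: F401
+
+ print("funasr", version("funasr"), "| modelscope", version("modelscope"))
+ ```
+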
+ # Usage
+
+ input: 16 kHz speech recording
+
+ granularity:
+ - "utterance": extract features over the entire utterance
+ - "frame": extract frame-level features (50 Hz)
+
+ extract_embedding: whether to extract features; set it to False if you only need the classification output
+
+ ## Inference based on ModelScope
+
+ ```python
+ from modelscope.pipelines import pipeline
+ from modelscope.utils.constant import Tasks
+
+ inference_pipeline = pipeline(
+     task=Tasks.emotion_recognition,
+     model="iic/emotion2vec_plus_large")
+
+ rec_result = inference_pipeline('https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_example_zh.wav', granularity="utterance", extract_embedding=False)
+ print(rec_result)
+ ```
+
+
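+ The pipeline reports class labels with scores (the same `labels`/`scores` fields referenced in the widget configuration above). A minimal sketch for picking the top prediction, assuming the result is a dict with parallel `labels` and `scores` lists (adjust the indexing if your version returns a list of such dicts):
+
+ ```python
+ # Pick the label with the highest score from the pipeline output.
+ labels, scores = rec_result["labels"], rec_result["scores"]
+ top_label, top_score = max(zip(labels, scores), key=lambda pair: pair[1])
+ print(f"predicted emotion: {top_label} ({top_score:.3f})")
+ ```
+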
+ ## Inference based on FunASR
+
+ ```python
+ from funasr import AutoModel
+
+ model = AutoModel(model="iic/emotion2vec_plus_large")
+
+ wav_file = f"{model.model_path}/example/test.wav"
+ res = model.generate(wav_file, output_dir="./outputs", granularity="utterance", extract_embedding=False)
+ print(res)
+ ```
+ Note: the model is downloaded automatically on first use.
+
+
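+ To also obtain the embeddings described in the Usage section, pass `extract_embedding=True`; a frame-level variant is sketched below (illustrative only, reusing `model` and `wav_file` from the block above; the `labels`/`scores` field names follow the output mapping used elsewhere in this card):
+
+ ```python
+ # Frame-level features (50 Hz) plus classification scores; embeddings are
+ # written to output_dir as numpy files when extract_embedding=True.
+ res = model.generate(
+     wav_file,
+     output_dir="./outputs",
+     granularity="frame",
+     extract_embedding=True,
+ )
+ print(res[0]["labels"], res[0]["scores"])
+ ```
+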
+ Input file lists in wav.scp (Kaldi style) are also supported:
+ ```cat wav.scp
+ wav_name1 wav_path1.wav
+ wav_name2 wav_path2.wav
+ ...
+ ```
+
+ Outputs are emotion representations, saved to `output_dir` in numpy format (loadable with `np.load()`).
+
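+ A short sketch for reading those saved representations back (file names depend on the utterance keys, so the glob below is purely illustrative):
+
+ ```python
+ import glob
+
+ import numpy as np
+
+ # Each utterance processed with extract_embedding=True leaves a .npy file in output_dir.
+ for path in sorted(glob.glob("./outputs/*.npy")):
+     emb = np.load(path)
+     print(path, emb.shape)
+ ```
+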
+ # Note
+
+ This repository is the Hugging Face version of emotion2vec; its model parameters are identical to those of the original model and the ModelScope version.
+
+ Original repository: [https://github.com/ddlBoJack/emotion2vec](https://github.com/ddlBoJack/emotion2vec)
+
+ ModelScope repository: [https://github.com/alibaba-damo-academy/FunASR](https://github.com/alibaba-damo-academy/FunASR/tree/funasr1.0/examples/industrial_data_pretraining/emotion2vec)
+
+ Hugging Face repository: [https://huggingface.co/emotion2vec](https://huggingface.co/emotion2vec)
+
+ # Citation
+ ```BibTeX
+ @article{ma2023emotion2vec,
+   title={emotion2vec: Self-Supervised Pre-Training for Speech Emotion Representation},
+   author={Ma, Ziyang and Zheng, Zhisheng and Ye, Jiaxin and Li, Jinchao and Gao, Zhifu and Zhang, Shiliang and Chen, Xie},
+   journal={arXiv preprint arXiv:2312.15185},
+   year={2023}
+ }
+ ```
config.yaml ADDED
@@ -0,0 +1,219 @@
+
+ # network architecture
+ model: Emotion2vec
+ model_conf:
+   _name: data2vec_multi
+   activation_dropout: 0.0
+   adversarial_hidden_dim: 128
+   adversarial_training: false
+   adversarial_weight: 0.1
+   attention_dropout: 0.1
+   average_top_k_layers: 16
+   batch_norm_target_layer: false
+   clone_batch: 12
+   cls_loss: 1.0
+   cls_type: chunk
+   d2v_loss: 1.0
+   decoder_group: false
+   depth: 8
+   dropout_input: 0.0
+   ema_anneal_end_step: 20000
+   ema_decay: 0.9997
+   ema_encoder_only: false
+   ema_end_decay: 1.0
+   ema_same_dtype: true
+   embed_dim: 1024
+   encoder_dropout: 0.1
+   end_drop_path_rate: 0.0
+   end_of_block_targets: false
+   instance_norm_target_layer: true
+   instance_norm_targets: false
+   layer_norm_first: false
+   layer_norm_target_layer: false
+   layer_norm_targets: false
+   layerdrop: 0.0
+   log_norms: true
+   loss_beta: 0.0
+   loss_scale: null
+   mae_init: false
+   max_update: 100000
+   min_pred_var: 0.01
+   min_target_var: 0.1
+   mlp_ratio: 4.0
+   normalize: true
+   modalities:
+     _name: null
+     audio:
+       add_masks: false
+       alibi_max_pos: null
+       alibi_scale: 1.0
+       conv_pos_depth: 5
+       conv_pos_groups: 16
+       conv_pos_pre_ln: false
+       conv_pos_width: 95
+       decoder:
+         add_positions_all: false
+         add_positions_masked: false
+         decoder_dim: 768
+         decoder_groups: 16
+         decoder_kernel: 7
+         decoder_layers: 4
+         decoder_residual: true
+         input_dropout: 0.1
+         projection_layers: 1
+         projection_ratio: 2.0
+       ema_local_encoder: false
+       encoder_zero_mask: true
+       end_drop_path_rate: 0.0
+       extractor_mode: layer_norm
+       feature_encoder_spec: '[(512, 10, 5)] + [(512, 3, 2)] * 4 + [(512,2,2)] + [(512,2,2)]'
+       init_extra_token_zero: true
+       inverse_mask: false
+       keep_masked_pct: 0.0
+       learned_alibi: false
+       learned_alibi_scale: true
+       learned_alibi_scale_per_head: true
+       learned_alibi_scale_per_layer: false
+       local_grad_mult: 1.0
+       mask_channel_length: 64
+       mask_channel_prob: 0.0
+       mask_dropout: 0.0
+       mask_length: 5
+       mask_noise_std: 0.01
+       mask_prob: 0.55
+       mask_prob_adjust: 0.1
+       mask_prob_min: null
+       model_depth: 8
+       num_alibi_heads: 16
+       num_extra_tokens: 10
+       prenet_depth: 4
+       prenet_dropout: 0.1
+       prenet_layerdrop: 0.0
+       remove_masks: false
+       start_drop_path_rate: 0.0
+       type: AUDIO
+       use_alibi_encoder: true
+     image:
+       add_masks: false
+       alibi_dims: 2
+       alibi_distance: manhattan
+       alibi_max_pos: null
+       alibi_scale: 1.0
+       decoder:
+         add_positions_all: false
+         add_positions_masked: false
+         decoder_dim: 384
+         decoder_groups: 16
+         decoder_kernel: 5
+         decoder_layers: 5
+         decoder_residual: true
+         input_dropout: 0.1
+         projection_layers: 1
+         projection_ratio: 2.0
+       ema_local_encoder: false
+       embed_dim: 768
+       enc_dec_transformer: false
+       encoder_zero_mask: true
+       end_drop_path_rate: 0.0
+       fixed_positions: true
+       in_chans: 3
+       init_extra_token_zero: true
+       input_size: 224
+       inverse_mask: false
+       keep_masked_pct: 0.0
+       learned_alibi: false
+       learned_alibi_scale: false
+       learned_alibi_scale_per_head: false
+       learned_alibi_scale_per_layer: false
+       local_grad_mult: 1.0
+       mask_channel_length: 64
+       mask_channel_prob: 0.0
+       mask_dropout: 0.0
+       mask_length: 5
+       mask_noise_std: 0.01
+       mask_prob: 0.7
+       mask_prob_adjust: 0.0
+       mask_prob_min: null
+       model_depth: 8
+       num_alibi_heads: 16
+       num_extra_tokens: 0
+       patch_size: 16
+       prenet_depth: 4
+       prenet_dropout: 0.0
+       prenet_layerdrop: 0.0
+       remove_masks: false
+       start_drop_path_rate: 0.0
+       transformer_decoder: false
+       type: IMAGE
+       use_alibi_encoder: false
+     text:
+       add_masks: false
+       alibi_max_pos: null
+       alibi_scale: 1.0
+       decoder:
+         add_positions_all: false
+         add_positions_masked: false
+         decoder_dim: 384
+         decoder_groups: 16
+         decoder_kernel: 5
+         decoder_layers: 5
+         decoder_residual: true
+         input_dropout: 0.1
+         projection_layers: 1
+         projection_ratio: 2.0
+       dropout: 0.1
+       ema_local_encoder: false
+       encoder_zero_mask: true
+       end_drop_path_rate: 0.0
+       init_extra_token_zero: true
+       inverse_mask: false
+       keep_masked_pct: 0.0
+       layernorm_embedding: true
+       learned_alibi: false
+       learned_alibi_scale: false
+       learned_alibi_scale_per_head: false
+       learned_alibi_scale_per_layer: false
+       learned_pos: true
+       local_grad_mult: 1.0
+       mask_channel_length: 64
+       mask_channel_prob: 0.0
+       mask_dropout: 0.0
+       mask_length: 5
+       mask_noise_std: 0.01
+       mask_prob: 0.7
+       mask_prob_adjust: 0.0
+       mask_prob_min: null
+       max_source_positions: 512
+       model_depth: 8
+       no_scale_embedding: true
+       no_token_positional_embeddings: false
+       num_alibi_heads: 16
+       num_extra_tokens: 0
+       prenet_depth: 4
+       prenet_dropout: 0.0
+       prenet_layerdrop: 0.0
+       remove_masks: false
+       start_drop_path_rate: 0.0
+       type: TEXT
+       use_alibi_encoder: false
+   norm_affine: true
+   norm_eps: 1.0e-05
+   num_heads: 16
+   post_mlp_drop: 0.1
+   recon_loss: 0.0
+   seed: 1
+   shared_decoder: null
+   skip_ema: false
+   start_drop_path_rate: 0.0
+   supported_modality: AUDIO
+
+ tokenizer: CharTokenizer
+ tokenizer_conf:
+   unk_symbol: <unk>
+   split_with_space: true
+
+ scope_map:
+ - 'd2v_model.'
+ - none
+
+
configuration.json ADDED
@@ -0,0 +1,13 @@
+ {
+   "framework": "pytorch",
+   "task": "emotion-recognition",
+   "pipeline": {"type": "funasr-pipeline"},
+   "model": {"type": "funasr"},
+   "file_path_metas": {
+     "init_param": "model.pt",
+     "tokenizer_conf": {"token_list": "tokens.txt"},
+     "config": "config.yaml"},
+   "model_name_in_hub": {
+     "ms": "iic/emotion2vec_base",
+     "hf": ""}
+ }
emotion2vec+data.png ADDED
emotion2vec+radar.png ADDED
example/test.wav ADDED
Binary file (131 kB)
 
logo.png ADDED

Git LFS Details

  • SHA256: 8a1aa31431bfb2bf126d7cf383c8b681b2372c333f1328b342bab5969dc0a569
  • Pointer size: 132 Bytes
  • Size of remote file: 1.85 MB
tokens.txt ADDED
@@ -0,0 +1,9 @@
+ 生气/angry
+ unuse_0
+ unuse_1
+ 开心/happy
+ 中立/neutral
+ unuse_2
+ 难过/sad
+ unuse_3
+ <unk>