---
tasks:
- auto-speech-recognition
domain:
- audio
model-type:
- Non-autoregressive
frameworks:
- pytorch
backbone:
- transformer/conformer
metrics:
- CER
license: Apache License 2.0
language:
- cn
tags:
- FunASR
- Paraformer
- Alibaba
- INTERSPEECH 2022
datasets:
  train:
  - 60,000 hour industrial Mandarin task
  test:
  - AISHELL-1 dev/test
  - AISHELL-2 dev_android/dev_ios/dev_mic/test_android/test_ios/test_mic
  - WenetSpeech dev/test_meeting/test_net
  - SpeechIO TIOBE
  - 60,000 hour industrial Mandarin task
indexing:
  results:
  - task:
      name: Automatic Speech Recognition
    dataset:
      name: 60,000 hour industrial Mandarin task
      type: audio                                 # optional
      args: 16k sampling rate, 8404 characters    # optional
    metrics:
    - type: CER
      value: 8.53%   # float
      description: greedy search, without LM, avg.
      args: default
    - type: RTF
      value: 0.0251  # float
      description: GPU inference on V100
      args: batch_size=1
widgets:
- task: auto-speech-recognition
  inputs:
  - type: audio
    name: input
    title: Audio
  examples:
  - name: 1
    title: Example 1
    inputs:
    - name: input
      data: git://example/asr_example.wav
  inferencespec:
    cpu: 8          # number of CPUs
    memory: 4096
finetune-support: True
---
# Model Introduction
This model is based on Paraformer online large (iic/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-online), with the vocabulary replaced and additional Cantonese characters added. It was trained for 1 epoch on 10,000 hours of Mandarin, 100 hours of Cantonese, and 10,000 hours of English audio data. This version is no longer updated; please follow the vocab 11666 version going forward.
## <strong>[FunASR Open-Source Project](https://github.com/alibaba-damo-academy/FunASR)</strong>
<strong>[FunASR](https://github.com/alibaba-damo-academy/FunASR)</strong> aims to build a bridge between academic research on speech recognition and its industrial application. By releasing industrial-grade speech recognition models together with their training and fine-tuning recipes, it lets researchers and developers study and productionize speech recognition models more easily, and promotes the growth of the speech recognition ecosystem. Make speech recognition more fun!
#### Inference with ModelScope
- The streaming speech recognition API can be called as in the following example:
```python
# -*- encoding: utf-8 -*-
# Copyright FunASR (https://github.com/alibaba-damo-academy/FunASR). All Rights Reserved.
# MIT License (https://opensource.org/licenses/MIT)
from funasr import AutoModel
import soundfile
import os
chunk_size = [0, 10, 5]  # [0, 10, 5] -> 600 ms chunks, [0, 8, 4] -> 480 ms
encoder_chunk_look_back = 4  # number of chunks to look back for encoder self-attention
decoder_chunk_look_back = 1  # number of encoder chunks to look back for decoder cross-attention
model = AutoModel(model="dengcunqin/speech_paraformer-large_asr_nat-zh-cantonese-en-16k-vocab8501-online", model_revision="master")

# Decode a whole file in a single call (Mandarin example).
wav_file = os.path.join(model.model_path, "example/asr_example_普通话.wav")
res = model.generate(input=wav_file,
                     chunk_size=chunk_size,
                     encoder_chunk_look_back=encoder_chunk_look_back,
                     decoder_chunk_look_back=decoder_chunk_look_back,
                     )
print(res)

# Stream a file chunk by chunk (Cantonese example).
wav_file = os.path.join(model.model_path, "example/asr_example_粤语.wav")
speech, sample_rate = soundfile.read(wav_file)
chunk_stride = chunk_size[1] * 960  # 960 samples = one 60 ms frame at 16 kHz (600 ms / 480 ms chunks)
cache = {}
total_chunk_num = (len(speech) - 1) // chunk_stride + 1  # ceil(len(speech) / chunk_stride)
for i in range(total_chunk_num):
    speech_chunk = speech[i*chunk_stride:(i+1)*chunk_stride]
    is_final = i == total_chunk_num - 1
    res = model.generate(input=speech_chunk,
                         cache=cache,
                         is_final=is_final,
                         chunk_size=chunk_size,
                         encoder_chunk_look_back=encoder_chunk_look_back,
                         decoder_chunk_look_back=decoder_chunk_look_back,
                         )
    print(res)
```
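The `chunk_size` triple controls streaming latency: the middle value is the number of encoder frames emitted per chunk, and the last value is the lookahead into future frames. As a quick sanity check, the sketch below derives the chunk and lookahead durations from `chunk_size`, assuming 16 kHz input and the 960-sample (60 ms) frame stride implied by `chunk_stride` in the example above; the helper name is ours, not part of the FunASR API:
```python
SAMPLE_RATE = 16000      # Hz (assumed, matching the 16k model)
SAMPLES_PER_FRAME = 960  # 60 ms per encoder frame at 16 kHz (assumed from chunk_stride above)

def chunk_latency_ms(chunk_size):
    """Return (chunk duration, lookahead duration) in milliseconds."""
    to_ms = lambda frames: frames * SAMPLES_PER_FRAME * 1000 // SAMPLE_RATE
    return to_ms(chunk_size[1]), to_ms(chunk_size[2])

print(chunk_latency_ms([0, 10, 5]))  # (600, 300)
print(chunk_latency_ms([0, 8, 4]))   # (480, 240)
```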
## Usage and Scope of Application
Runtime environment
- Runs on Linux-x86_64, macOS, and Windows.
Usage modes
- Direct inference: decode input audio directly and output the target text.
- Fine-tuning: load the trained model and continue training on private or open-source data.
Scope and target scenarios
- Suitable for real-time speech recognition scenarios; see the microphone sketch below.
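For a live real-time setup, the same chunked `generate` loop can be fed from a microphone. The sketch below is illustrative only: it assumes the third-party `sounddevice` package (not used elsewhere in this card) and reuses `model`, `chunk_size`, `encoder_chunk_look_back`, and `decoder_chunk_look_back` from the example above:
```python
# Illustrative microphone-streaming sketch (assumes `pip install sounddevice`
# and the model/config objects defined in the example above).
import sounddevice as sd

chunk_stride = chunk_size[1] * 960   # samples per 600 ms chunk at 16 kHz
num_chunks = 20                      # capture ~12 s of audio, then stop
cache = {}
with sd.InputStream(samplerate=16000, channels=1, dtype="float32") as stream:
    for i in range(num_chunks):
        audio, _overflowed = stream.read(chunk_stride)  # blocks until a full chunk arrives
        res = model.generate(input=audio[:, 0],         # take the mono channel
                             cache=cache,
                             is_final=(i == num_chunks - 1),
                             chunk_size=chunk_size,
                             encoder_chunk_look_back=encoder_chunk_look_back,
                             decoder_chunk_look_back=decoder_chunk_look_back,
                             )
        print(res)
```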
## Model Limitations and Possible Bias
Differences in the feature extraction pipeline and tooling, as well as in the training tools, can introduce small differences in CER (< 0.1%); differences in the GPU inference environment likewise lead to different RTF values.
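For reference, RTF (real-time factor) is conventionally defined as decoding time divided by audio duration, so reproduced figures shift with hardware, batch size, and warm-up. A minimal measurement sketch (the helper is ours, not a FunASR API; single file, batch_size=1):
```python
import time
import soundfile

def measure_rtf(model, wav_path):
    """Return decoding_time / audio_duration for one file (illustrative only)."""
    speech, sample_rate = soundfile.read(wav_path)
    audio_seconds = len(speech) / sample_rate
    start = time.perf_counter()
    model.generate(input=wav_path)   # offline-style decode of the whole file
    return (time.perf_counter() - start) / audio_seconds
```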
## Related Papers and Citation Information
```BibTeX
@inproceedings{gao2022paraformer,
title={Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition},
author={Gao, Zhifu and Zhang, Shiliang and McLoughlin, Ian and Yan, Zhijie},
booktitle={INTERSPEECH},
year={2022}
}
```