---
language:
- "zh"
thumbnail: "Mandarin-wav2vec2.0 fine-tuned on AISHELL-1 dataset"
tags:
- automatic-speech-recognition
- speech
- wav2vec2.0
- audio
datasets:
- AISHELL-1
metrics:
- cer
---

The Mandarin-wav2vec2.0 model is pre-trained on 1,000 hours of the AISHELL-2 dataset. Pre-training details can be found at https://github.com/kehanlu/mandarin-wav2vec2. This model is fine-tuned on the 178-hour AISHELL-1 dataset and is the baseline model in the paper "A context-aware knowledge transferring strategy for CTC-based ASR" ([preprint](https://arxiv.org/abs/2210.06244)).

| Model | dev CER (%) | test CER (%) |
| --- | --- | --- |
| vanilla w2v2-CTC | 4.85 | 5.13 |
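
For reference, CER (character error rate) is the character-level Levenshtein edit distance between the hypothesis and the reference transcript, divided by the reference length. A minimal illustrative sketch of the metric (this `cer` helper is not part of the original repository, and the paper's scores may come from a different scoring tool):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein edit distance / reference length."""
    r, h = list(reference), list(hypothesis)
    # One-row dynamic-programming Levenshtein distance.
    dp = list(range(len(h) + 1))
    for i in range(1, len(r) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(h) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                      # deletion
                        dp[j - 1] + 1,                  # insertion
                        prev + (r[i - 1] != h[j - 1]))  # substitution
            prev = cur
    return dp[len(h)] / len(r)

# One substitution over a 12-character reference: CER = 1/12 ≈ 0.083
print(cer("广州市房地产中介协会分析", "广州市房地产中价协会分析"))
```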

## Usage

**Note:** the model was fine-tuned with the ESPnet toolkit and then converted to a Hugging Face model for easy use.

```python
import torch
import soundfile as sf
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor


class ExtendedWav2Vec2ForCTC(Wav2Vec2ForCTC):
    """
    In ESPnet there is a LayerNorm layer between the encoder output and the
    CTC classification head; wrap lm_head so the converted weights match.
    """
    def __init__(self, config):
        super().__init__(config)
        self.lm_head = torch.nn.Sequential(
            torch.nn.LayerNorm(config.hidden_size),
            self.lm_head,
        )


model = ExtendedWav2Vec2ForCTC.from_pretrained("kehanlu/wav2vec2-mandarin-aishell1")
processor = Wav2Vec2Processor.from_pretrained("kehanlu/wav2vec2-mandarin-aishell1")

# AISHELL-1 recordings are 16 kHz, which matches the model's expected input.
audio_input, sample_rate = sf.read("/path/to/data_aishell/wav/dev/S0724/BAC009S0724W0121.wav")
inputs = processor(audio_input, sampling_rate=sample_rate, return_tensors="pt")

model.eval()
with torch.no_grad():
    logits = model(**inputs).logits

# Greedy CTC decoding: pick the most likely token per frame; batch_decode
# collapses repeats and removes blank tokens.
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
print(transcription[0])

# 广州市房地产中介协会分析
```
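
The model expects 16 kHz input. If your audio is sampled at a different rate, resample it before calling the processor; a minimal sketch using torchaudio (assumed installed; the path is a placeholder):

```python
import torchaudio

waveform, sr = torchaudio.load("/path/to/your_audio.wav")  # (channels, samples)
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, orig_freq=sr, new_freq=16000)
audio_input = waveform[0].numpy()  # first channel as a 1-D array
inputs = processor(audio_input, sampling_rate=16000, return_tensors="pt")
```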

## Licence

The pre-training corpus, AISHELL-2, is supported by the AISHELL Foundation. The resulting model also follows the licence of AISHELL-2: it is free to use for academic purposes and must not be used for any commercial purpose without permission from the AISHELL Foundation (https://www.aishelltech.com/aishell_2).

```bibtex
@article{aishell2,
  title={AISHELL-2: Transforming Mandarin ASR Research Into Industrial Scale},
  author={Du, J. and Na, X. and Liu, X. and Bu, H.},
  journal={arXiv preprint arXiv:1808.10583},
  year={2018}
}
```

If you find this work useful, please cite:

```bibtex
@article{lu2022context,
  title={A context-aware knowledge transferring strategy for CTC-based ASR},
  author={Lu, Ke-Han and Chen, Kuan-Yu},
  journal={arXiv preprint arXiv:2210.06244},
  year={2022}
}
```