---
language: 
  - "zh"
thumbnail: "Mandarin-wav2vec2.0 fine-tuned on AISHELL-1 dataset"
tags:
- automatic-speech-recognition
- speech
- wav2vec2.0
- audio
datasets:
- AISHELL-1
metrics:
- cer

---

The Mandarin-wav2vec2.0 model is pre-trained on 1,000 hours of the AISHELL-2 dataset. Pre-training details are available at https://github.com/kehanlu/mandarin-wav2vec2. This model is fine-tuned on 178 hours of the AISHELL-1 dataset and is the baseline model in the paper "A context-aware knowledge transferring strategy for CTC-based ASR" ([preprint](https://arxiv.org/abs/2210.06244)).

## Results on AISHELL-1
|CER|dev|test|
| - | - | -  |
|vanilla w2v2-CTC | 4.85 | 5.13|
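
For readers unfamiliar with the metric, CER (character error rate) is commonly defined as the character-level edit distance between hypothesis and reference, divided by the total number of reference characters. The sketch below illustrates that standard definition only; the function names `edit_distance` and `cer` are hypothetical, and this is not the scoring script used to produce the numbers in the table.

```python
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """Corpus-level CER: total edits divided by total reference characters."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_chars = sum(len(r) for r in refs)
    return total_edits / total_chars


# One dropped character out of three reference characters.
print(cer(["广州市"], ["广州"]))
```

Multiplying the returned fraction by 100 gives a percentage, which is how CER is usually reported.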

## Usage 
**Note:** The model was fine-tuned with the ESPnet toolkit and then converted to a Hugging Face model for ease of use.

```python
import torch
import soundfile as sf
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

class ExtendedWav2Vec2ForCTC(Wav2Vec2ForCTC):
    """
    ESPnet places a LayerNorm layer between the encoder output and the CTC
    classification head, so the head is wrapped to reproduce that here.
    """
    def __init__(self, config):
        super().__init__(config)
        self.lm_head = torch.nn.Sequential(
                torch.nn.LayerNorm(config.hidden_size),
                self.lm_head
        )
        
model = ExtendedWav2Vec2ForCTC.from_pretrained("kehanlu/mandarin-wav2vec2-aishell1")
processor = Wav2Vec2Processor.from_pretrained("kehanlu/mandarin-wav2vec2-aishell1")

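# Load one utterance; passing sampling_rate lets the processor verify it matches what the model expects.
audio_input, sample_rate = sf.read("/path/to/data_aishell/wav/dev/S0724/BAC009S0724W0121.wav")
inputs = processor(audio_input, sampling_rate=sample_rate, return_tensors="pt")

model.eval()  # disable dropout for inference
with torch.no_grad():
    logits = model(**inputs).logits
    # Greedy decoding: pick the most likely token for each frame, then decode to text.
    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.batch_decode(predicted_ids)
print(transcription[0])

# Expected output: 广州市房地产中介协会分析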
```
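
The `torch.argmax` followed by `processor.batch_decode` in the snippet above amounts to greedy CTC decoding: consecutive repeated tokens are merged and blank tokens are dropped. The following is a minimal pure-Python sketch of that collapsing rule, for illustration only. The helper name `ctc_greedy_collapse` is hypothetical, and `blank_id=0` is an assumption; in the actual model the blank is the tokenizer's padding token, whose id depends on the vocabulary.

```python
from typing import List, Sequence


def ctc_greedy_collapse(ids: Sequence[int], blank_id: int = 0) -> List[int]:
    """Merge consecutive repeats, then drop blanks (blank_id is an assumed value)."""
    out: List[int] = []
    prev = None
    for token in ids:
        # Emit a token only when it differs from the previous frame and is not blank.
        if token != prev and token != blank_id:
            out.append(token)
        prev = token
    return out


# Frame-level ids: a blank between the two 5s keeps them as separate tokens.
frames = [0, 5, 5, 0, 5, 7, 7, 0]
print(ctc_greedy_collapse(frames))
```

A blank between two identical tokens is what allows CTC to emit the same character twice in a row; without it, the repeats would be merged into one.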


## Licence

The pre-training corpus, AISHELL-2, is supported by the AISHELL Foundation. The resulting model also follows the licence of AISHELL-2: it is free to use for academic purposes and should not be used for any commercial purpose without permission from the AISHELL Foundation (https://www.aishelltech.com/aishell_2).

```
@ARTICLE{aishell2,
   author = {{Du}, J. and {Na}, X. and {Liu}, X. and {Bu}, H.},
   title = "{AISHELL-2: Transforming Mandarin ASR Research Into Industrial Scale}",
   journal = {ArXiv},
   eprint = {1808.10583},
   primaryClass = "cs.CL",
   year = 2018,
   month = Aug,
}
```

If you find this useful, please cite:
```
@article{lu2022context,
  title={A context-aware knowledge transferring strategy for CTC-based ASR},
  author={Lu, Ke-Han and Chen, Kuan-Yu},
  journal={arXiv preprint arXiv:2210.06244},
  year={2022}
}
```