---
language:
- "zh"
thumbnail: "Mandarin-wav2vec2.0 fine-tuned on AISHELL-1 dataset"
tags:
- automatic-speech-recognition
- speech
- wav2vec2.0
- audio
datasets:
- AISHELL-1
metrics:
- cer
---
The Mandarin-wav2vec2.0 model is pre-trained on the 1000-hour AISHELL-2 dataset. Pre-training details can be found at https://github.com/kehanlu/mandarin-wav2vec2. This model is fine-tuned on the 178-hour AISHELL-1 dataset and serves as the baseline model in the paper "A context-aware knowledge transferring strategy for CTC-based ASR" ([preprint](https://arxiv.org/abs/2210.06244)).
## Results on AISHELL-1
| CER (%) | dev | test |
| - | - | - |
| vanilla w2v2-CTC | 4.85 | 5.13 |
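
CER here is the character error rate: the character-level Levenshtein distance between hypothesis and reference, divided by the reference length, reported in percent. A minimal self-contained sketch for computing it (the `cer` helper below is illustrative, not the evaluation script used in the paper):

```python
def cer(ref: str, hyp: str) -> float:
    """Character error rate: Levenshtein distance / reference length."""
    prev = list(range(len(hyp) + 1))          # distances for the empty reference prefix
    for i, r in enumerate(ref, 1):
        curr = [i]                            # deleting i reference characters
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,      # deletion
                            curr[j - 1] + 1,  # insertion
                            prev[j - 1] + (r != h)))  # substitution (free on match)
        prev = curr
    return prev[-1] / len(ref)

print(cer("广州市房地产中介协会分析", "广州市房地产中介协会分析"))  # 0.0
```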
## Usage
**Note:** the model was fine-tuned with the ESPnet toolkit and then converted to a Hugging Face model for easy use.
```python
import torch
import soundfile as sf
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor


class ExtendedWav2Vec2ForCTC(Wav2Vec2ForCTC):
    """
    In ESPnet there is a LayerNorm layer between the encoder output and the
    CTC classification head, so the converted model has to add it as well.
    """
    def __init__(self, config):
        super().__init__(config)
        self.lm_head = torch.nn.Sequential(
            torch.nn.LayerNorm(config.hidden_size),
            self.lm_head,
        )


model = ExtendedWav2Vec2ForCTC.from_pretrained("kehanlu/wav2vec2-mandarin-aishell1")
processor = Wav2Vec2Processor.from_pretrained("kehanlu/wav2vec2-mandarin-aishell1")

# Load a 16 kHz AISHELL-1 utterance and extract input features.
audio_input, sample_rate = sf.read("/path/to/data_aishell/wav/dev/S0724/BAC009S0724W0121.wav")
inputs = processor(audio_input, sampling_rate=sample_rate, return_tensors="pt")

# Greedy CTC decoding: take the argmax over the vocabulary at each frame.
model.eval()
with torch.no_grad():
    logits = model(**inputs).logits
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)

print(transcription[0])
# 广州市房地产中介协会分析
```
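
AISHELL recordings are already at the 16 kHz rate the model expects. If your own audio uses a different sampling rate, resample it before calling the processor; a minimal sketch using torchaudio (an extra dependency, not required by the snippet above):

```python
import torch
import torchaudio.functional as F

wav = torch.from_numpy(audio_input).float()
if sample_rate != 16000:
    # Resample to the 16 kHz rate the model was trained on.
    wav = F.resample(wav, orig_freq=sample_rate, new_freq=16000)
    sample_rate = 16000
inputs = processor(wav.numpy(), sampling_rate=sample_rate, return_tensors="pt")
```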
## Licence
The pre-training corpus, AISHELL-2, is provided by the AISHELL Foundation. The resulting model also follows the licence of AISHELL-2: it is free to use for academic purposes and must not be used for any commercial purpose without permission from the AISHELL Foundation (https://www.aishelltech.com/aishell_2).
```bibtex
@article{aishell2,
  title={AISHELL-2: Transforming Mandarin ASR Research Into Industrial Scale},
  author={Du, J. and Na, X. and Liu, X. and Bu, H.},
  journal={arXiv preprint arXiv:1808.10583},
  year={2018}
}
```
If you find this work useful, please cite:
```bibtex
@article{lu2022context,
title={A context-aware knowledge transferring strategy for CTC-based ASR},
author={Lu, Ke-Han and Chen, Kuan-Yu},
journal={arXiv preprint arXiv:2210.06244},
year={2022}
}
```