|
--- |
|
language: |
|
- "zh" |
|
thumbnail: "Mandarin-wav2vec2.0 fine-tuned on AISHELL-1 dataset" |
|
tags: |
|
- automatic-speech-recognition |
|
- speech |
|
- wav2vec2.0 |
|
- audio |
|
datasets: |
|
- AISHELL-1 |
|
metrics: |
|
- cer |
|
|
|
--- |
|
|
|
The Mandarin-wav2vec2.0 model is pre-trained on 1,000 hours of the AISHELL-2 dataset. Pre-training details can be found at https://github.com/kehanlu/mandarin-wav2vec2. The model is then fine-tuned on 178 hours of the AISHELL-1 dataset and serves as the baseline model in the paper "A context-aware knowledge transferring strategy for CTC-based ASR" ([preprint](https://arxiv.org/abs/2210.06244)).
|
|
|
## Results on AISHELL-1 |
|
| Model | dev CER (%) | test CER (%) |
| --- | --- | --- |
| vanilla w2v2-CTC | 4.85 | 5.13 |
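For reference, the character error rate (CER) is the character-level edit distance between the hypothesis and the reference transcript, divided by the reference length. A minimal sketch of the metric in plain Python (the `cer` helper is illustrative, not part of this repository):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance / reference length."""
    ref, hyp = list(reference), list(hypothesis)
    # One-row dynamic-programming Levenshtein distance over characters.
    dp = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(hyp) + 1):
            cur = dp[j]
            dp[j] = min(
                dp[j] + 1,                          # delete ref[i-1]
                dp[j - 1] + 1,                      # insert hyp[j-1]
                prev + (ref[i - 1] != hyp[j - 1]),  # substitute (0 if equal)
            )
            prev = cur
    return dp[-1] / len(ref)

# One substitution over 12 reference characters -> 1/12 ≈ 0.083
print(cer("广州市房地产中介协会分析", "广州市房地产中介协会分晰"))
```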
|
|
|
## Usage |
|
**Note:** the model was fine-tuned with the ESPnet toolkit and then converted to a Hugging Face model for ease of use.
|
|
|
```python
import torch
import soundfile as sf
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor


class ExtendedWav2Vec2ForCTC(Wav2Vec2ForCTC):
    """
    In ESPnet there is a LayerNorm layer between the encoder output and the
    CTC classification head; it is reproduced here so the converted weights
    line up.
    """
    def __init__(self, config):
        super().__init__(config)
        self.lm_head = torch.nn.Sequential(
            torch.nn.LayerNorm(config.hidden_size),
            self.lm_head,
        )


model = ExtendedWav2Vec2ForCTC.from_pretrained("kehanlu/wav2vec2-mandarin-aishell1")
processor = Wav2Vec2Processor.from_pretrained("kehanlu/wav2vec2-mandarin-aishell1")

# Load a 16 kHz mono waveform and convert it to model inputs.
audio_input, sample_rate = sf.read("/path/to/data_aishell/wav/dev/S0724/BAC009S0724W0121.wav")
inputs = processor(audio_input, sampling_rate=sample_rate, return_tensors="pt")

model.eval()
with torch.no_grad():
    logits = model(**inputs).logits

# Greedy CTC decoding: pick the most likely token per frame; batch_decode
# collapses repeats and removes blanks.
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
print(transcription[0])

# 广州市房地产中介协会分析
```
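
AISHELL recordings are sampled at 16 kHz, which is the rate the wav2vec 2.0 feature extractor expects. If your own audio uses a different sampling rate, resample it before calling the processor. A minimal sketch using torchaudio, reusing the `processor` loaded above (the file path is a placeholder):

```python
import torchaudio

# torchaudio.load returns a (channels, frames) tensor and the sample rate.
waveform, orig_sr = torchaudio.load("/path/to/your_audio.wav")
waveform = waveform.mean(dim=0)  # downmix to mono if needed
if orig_sr != 16000:
    waveform = torchaudio.functional.resample(waveform, orig_freq=orig_sr, new_freq=16000)

inputs = processor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
```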
|
|
|
|
|
## Licence |
|
|
|
The pre-training corpus, AISHELL-2, is supported by the AISHELL Foundation. The resulting model also follows the licence of AISHELL-2: it is free to use for academic purposes and must not be used for any commercial purpose without permission from the AISHELL Foundation (https://www.aishelltech.com/aishell_2).
|
|
|
``` |
|
@article{aishell2,
  author  = {Du, J. and Na, X. and Liu, X. and Bu, H.},
  title   = {AISHELL-2: Transforming Mandarin ASR Research Into Industrial Scale},
  journal = {arXiv preprint arXiv:1808.10583},
  year    = {2018}
}
|
``` |
|
|
|
If you find this model useful, please cite:
|
``` |
|
@article{lu2022context, |
|
title={A context-aware knowledge transferring strategy for CTC-based ASR}, |
|
author={Lu, Ke-Han and Chen, Kuan-Yu}, |
|
journal={arXiv preprint arXiv:2210.06244}, |
|
year={2022} |
|
} |
|
``` |