kehanlu committed on
Commit
05c9149
1 Parent(s): d4d9bc7

Create README.md

Files changed (1)
  1. README.md +84 -0
README.md ADDED
@@ -0,0 +1,84 @@
+ ---
+ language:
+ - "zh"
+ thumbnail: "Mandarin-wav2vec2.0 fine-tuned on AISHELL-1 dataset"
+ tags:
+ - automatic-speech-recognition
+ - speech
+ - wav2vec2.0
+ - audio
+ datasets:
+ - AISHELL-1
+ metrics:
+ - cer
+
+ ---
+
+ The Mandarin-wav2vec2.0 model is pre-trained on 1,000 hours of the AISHELL-2 dataset. Pre-training details can be found at https://github.com/kehanlu/mandarin-wav2vec2. This model is fine-tuned on 178 hours of the AISHELL-1 dataset and is the baseline model in the paper "A context-aware knowledge transferring strategy for CTC-based ASR"
+ ([preprint](https://arxiv.org/abs/2210.06244)).
+
+ | CER (%) | dev | test |
+ | - | - | - |
+ | vanilla w2v2-CTC | 4.85 | 5.13 |
+
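For reference, CER is conventionally the character-level edit distance between hypothesis and reference, divided by the number of reference characters. The following is a minimal sketch of that definition, not the scoring script used for the numbers above; the function names `edit_distance` and `cer` are illustrative only.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(references, hypotheses) -> float:
    """Corpus-level CER in percent: total edits / total reference characters."""
    edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    chars = sum(len(r) for r in references)
    return 100.0 * edits / chars
```

Called as `cer(list_of_reference_strings, list_of_model_outputs)`, this pools errors over the whole set rather than averaging per-utterance rates, which is the usual convention for reporting CER on a test set.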
+ ## Usage
+ **Note:** The model was fine-tuned with the ESPnet toolkit and then converted to a Hugging Face model for ease of use.
+
+ ```python
+ import torch
+ import soundfile as sf
+ from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
+
+ class ExtendedWav2Vec2ForCTC(Wav2Vec2ForCTC):
+     """
+     In ESPnet, there is a LayerNorm layer between the encoder output and the CTC classification head.
+     """
+     def __init__(self, config):
+         super().__init__(config)
+         self.lm_head = torch.nn.Sequential(
+             torch.nn.LayerNorm(config.hidden_size),
+             self.lm_head
+         )
+
+ model = ExtendedWav2Vec2ForCTC.from_pretrained("kehanlu/wav2vec2-mandarin-aishell1")
+ processor = Wav2Vec2Processor.from_pretrained("kehanlu/wav2vec2-mandarin-aishell1")
+
+ audio_input, sample_rate = sf.read("/path/to/data_aishell/wav/dev/S0724/BAC009S0724W0121.wav")
+ inputs = processor(audio_input, sampling_rate=sample_rate, return_tensors="pt")
+
+ with torch.no_grad():
+     model.eval()  # inference mode: disables dropout
+     logits = model(**inputs).logits
+     predicted_ids = torch.argmax(logits, dim=-1)  # most likely token per frame
+     transcription = processor.batch_decode(predicted_ids)
+ print(transcription[0])
+
+ # Expected output: 广州市房地产中介协会分析
+ ```
+
+
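The `argmax` followed by `batch_decode` in the usage snippet amounts to greedy CTC decoding: pick the most likely token for each frame, merge consecutive repeats, then drop blank tokens. A minimal pure-Python sketch of that collapse step, assuming a hypothetical blank id of 0 and a toy vocabulary (the real ids come from the processor's tokenizer):

```python
from itertools import groupby


def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Merge consecutive duplicate ids, then remove CTC blanks."""
    merged = [token for token, _ in groupby(frame_ids)]
    return [token for token in merged if token != blank_id]


# Toy example: hypothetical vocabulary, not the model's real one.
vocab = {0: "<blank>", 1: "广", 2: "州", 3: "市"}
frames = [0, 1, 1, 0, 2, 2, 2, 0, 0, 3]
text = "".join(vocab[i] for i in ctc_greedy_collapse(frames))
print(text)  # 广州市
```

Merging repeats before removing blanks matters: a blank between two identical ids is what lets CTC emit the same character twice in a row.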
+ ## Licence
+
+ The pre-training corpus, AISHELL-2, is supported by the AISHELL Foundation. The resulting model also follows the licence of AISHELL-2. It is free to use for academic purposes and should not be used for any commercial purpose without permission from the AISHELL Foundation. (https://www.aishelltech.com/aishell_2)
+
+ ```
+ @ARTICLE{aishell2,
+   author = {{Du}, J. and {Na}, X. and {Liu}, X. and {Bu}, H.},
+   title = "{AISHELL-2: Transforming Mandarin ASR Research Into Industrial Scale}",
+   journal = {ArXiv},
+   eprint = {1808.10583},
+   primaryClass = "cs.CL",
+   year = 2018,
+   month = Aug,
+ }
+ ```
+
+ If you find this useful, please cite:
+ ```
+ @article{lu2022context,
+   title={A context-aware knowledge transferring strategy for CTC-based ASR},
+   author={Lu, Ke-Han and Chen, Kuan-Yu},
+   journal={arXiv preprint arXiv:2210.06244},
+   year={2022}
+ }
+ ```