Update README.md
README.md
CHANGED
@@ -5,4 +5,55 @@ language:
pipeline_tag: automatic-speech-recognition
tags:
- icefall
---
See https://github.com/k2-fsa/icefall/pull/1651

# icefall-asr-ksponspeech-pruned-transducer-stateless7-streaming-2024-06-12
KsponSpeech is a large-scale corpus of spontaneous Korean speech.
It contains 969 hours of open-domain dialogue utterances,
spoken by about 2,000 native Korean speakers in a clean environment.

All data were collected by recording two people conversing freely
on a variety of topics and manually transcribing the utterances.

Each transcription is dual, giving both the orthographic and the pronunciation form,
and includes disfluency tags for spontaneous-speech phenomena such as filler words, repeated words, and word fragments.

The original audio files have a `.pcm` extension.
During preprocessing, they are converted to FLAC and saved as new files.

KsponSpeech is publicly available on an open data hub site run by the Korean government.
The dataset must be downloaded manually.

For more details, please visit:

- Dataset: https://aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=realm&dataSetSn=123
- Paper: https://www.mdpi.com/2076-3417/10/19/6936
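The PCM-to-FLAC conversion mentioned above could look roughly like the sketch below. It is an illustration only, not the preprocessing code of the icefall recipe. It assumes headerless 16-bit little-endian mono PCM at 16 kHz, and it relies on the third-party `numpy` and `soundfile` packages for the FLAC step; the names `read_pcm16` and `pcm_to_flac` are hypothetical.

```python
import struct
from pathlib import Path

SAMPLE_RATE = 16_000  # assumption: 16 kHz mono audio


def read_pcm16(path):
    """Read a headerless 16-bit little-endian PCM file into a list of ints."""
    raw = Path(path).read_bytes()
    raw = raw[: len(raw) - len(raw) % 2]  # drop a trailing odd byte, if any
    return list(struct.unpack(f"<{len(raw) // 2}h", raw))


def pcm_to_flac(pcm_path, flac_path, sample_rate=SAMPLE_RATE):
    """Convert one raw PCM file to FLAC (needs numpy and soundfile installed)."""
    import numpy as np      # third-party, imported lazily
    import soundfile as sf  # third-party, imported lazily

    samples = np.asarray(read_pcm16(pcm_path), dtype=np.int16)
    sf.write(str(flac_path), samples, sample_rate, format="FLAC", subtype="PCM_16")
```

A call such as `pcm_to_flac("utt.pcm", "utt.flac")` (hypothetical file names) would write a new FLAC file next to the original, leaving the `.pcm` file untouched.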
### Streaming Zipformer-Transducer (Pruned Stateless Transducer + Streaming Zipformer)

Number of model parameters: 79,022,891, i.e., 79.02 M

#### Training on KsponSpeech (with MUSAN)
The CERs (%) are:

| decoding method      | chunk size | eval_clean | eval_other | comment             | decoding mode        |
|----------------------|------------|------------|------------|---------------------|----------------------|
| greedy search        | 320ms      | 10.21      | 11.07      | --epoch 30 --avg 9  | simulated streaming  |
| greedy search        | 320ms      | 10.22      | 11.07      | --epoch 30 --avg 9  | chunk-wise           |
| fast beam search     | 320ms      | 10.21      | 11.04      | --epoch 30 --avg 9  | simulated streaming  |
| fast beam search     | 320ms      | 10.25      | 11.08      | --epoch 30 --avg 9  | chunk-wise           |
| modified beam search | 320ms      | 10.13      | 10.88      | --epoch 30 --avg 9  | simulated streaming  |
| modified beam search | 320ms      | 10.1       | 10.93      | --epoch 30 --avg 9  | chunk-wise           |
| greedy search        | 640ms      | 9.94       | 10.82      | --epoch 30 --avg 9  | simulated streaming  |
| greedy search        | 640ms      | 10.04      | 10.85      | --epoch 30 --avg 9  | chunk-wise           |
| fast beam search     | 640ms      | 10.01      | 10.81      | --epoch 30 --avg 9  | simulated streaming  |
| fast beam search     | 640ms      | 10.04      | 10.7       | --epoch 30 --avg 9  | chunk-wise           |
| modified beam search | 640ms      | 9.91       | 10.72      | --epoch 30 --avg 9  | simulated streaming  |
| modified beam search | 640ms      | 9.92       | 10.72      | --epoch 30 --avg 9  | chunk-wise           |

Note: `simulated streaming` means feeding the full utterance at once during decoding with `decode.py`,
while `chunk-wise` means feeding a fixed number of frames at a time with `streaming_decode.py`.
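The two decoding modes differ in how the input is presented, which a toy sketch can illustrate. It only shows how a feature sequence is split into fixed-size chunks and does not reproduce `decode.py` or `streaming_decode.py`. The 10 ms frame shift (so that a 320 ms chunk spans 32 frames) and the `iter_chunks` helper are assumptions made for illustration.

```python
def iter_chunks(frames, chunk_len):
    """Yield consecutive slices of at most `chunk_len` frames."""
    if chunk_len <= 0:
        raise ValueError("chunk_len must be positive")
    for start in range(0, len(frames), chunk_len):
        yield frames[start:start + chunk_len]


FRAME_SHIFT_MS = 10                 # assumption: 10 ms between feature frames
chunk_len = 320 // FRAME_SHIFT_MS   # 32 frames per 320 ms chunk

frames = list(range(100))           # stand-in for 100 feature frames

# Full-utterance input: everything is handed over in a single piece.
full_input = [frames]

# Chunk-wise input: a fixed number of frames at a time; the last chunk may be shorter.
chunked_input = list(iter_chunks(frames, chunk_len))
print([len(c) for c in chunked_input])  # [32, 32, 32, 4]
```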
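For readers unfamiliar with the metric, a character error rate is conventionally the character-level edit distance between hypothesis and reference, divided by the number of reference characters. The sketch below shows that definition in plain Python. It is not the scoring code that produced the table, it omits any text normalization (for example, how spaces are treated), and the function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate in percent for a single reference/hypothesis pair."""
    if not ref:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


# One substituted syllable out of five reference characters.
print(cer("안녕하세요", "안녕하세오"))  # 20.0
```

Over a whole test set, the usual convention is to sum the edit distances and the reference lengths across utterances before dividing, rather than averaging per-utterance rates.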