Update README.md
Commit 10782b9 (parent: 5970086), committed by transiteration

README.md CHANGED
@@ -1,101 +1,9 @@
-
-
-
-
-
-
-
-
-
-- speech
-- audio
-- pytorch
-- stt
----
-
-
-## Model Overview
-
-In order to prepare and experiment with the model, it's necessary to install [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo) [1].\
-\
-This model have been trained on NVIDIA GeForce RTX 2070:\
-Python 3.7.15\
-NumPy 1.21.6\
-PyTorch 1.21.1\
-NVIDIA NeMo 1.7.0
-
-```
-pip3 install nemo_toolkit['all']
-```
-
-## Model Usage:
-
-The model is accessible within the NeMo toolkit [1] and can serve as a pre-trained checkpoint for either making inferences or for fine-tuning on a different dataset.
-
-#### How to Import
-
-```
-import nemo.collections.asr as nemo_asr
-model = nemo_asr.models.EncDecCTCModel.restore_from(restore_path="stt_kz_quartznet15x5.nemo")
-```
-
-#### How to Train
-
-```
-python3 train.py --train_manifest path/to/manifest.json --val_manifest path/to/manifest.json --batch_size BATCH_SIZE --num_epochs NUM_EPOCHS --model_save_path path/to/save/model.nemo
-```
-
-#### How to Evaluate
-
-```
-python3 evaluate.py --model_path /path/to/stt_kz_quartznet15x5.nemo --test_manifest path/to/manifest.json"
-```
-
-#### How to Transcribe Audio File
-
-Sample audio to test the model:
-```
-wget https://asr-kz-example.s3.us-west-2.amazonaws.com/sample_kz.wav
-```
-This line is to transcribe the single audio:
-```
-python3 transcibe.py --model_path /path/to/stt_kz_quartznet15x5.nemo --audio_file_path path/to/audio/file
-```
-
-## Input and Output
-
-This model can take input from mono-channel audio .WAV files with a sample rate of 16,000 KHz.\
-Then, this model gives you the spoken words in a text format for a given audio sample.
-
-## Model Architecture
-
-[QuartzNet 15x5](https://catalog.ngc.nvidia.com/orgs/nvidia/models/quartznet15x5) [2] is a Jasper-like network that uses separable convolutions and larger filter sizes. It has comparable accuracy to Jasper while having much fewer parameters. This particular model has 15 blocks each repeated 5 times.
-
-## Training and Dataset
-
-The model was finetuned to Kazakh speech based on the pre-trained English Model for over several epochs.
-[Kazakh Speech Corpus 2](https://issai.nu.edu.kz/kz-speech-corpus/?version=1.1) (KSC2) [3] is the first industrial-scale open-source Kazakh speech corpus.\
-In total, KSC2 contains around 1.2k hours of high-quality transcribed data comprising over 600k utterances.
-
-## Performance
-The model achieved:\
-Average WER: 13.53%\
-through the applying of **Greedy Decoding**.
-
-## Limitations
-
-Because the GPU has limited power, lightweight model architecture was used for fine-tuning.\
-In general, this makes it faster for inference but might show less overall performance.\
-In addition, if the speech includes technical terms or dialect words the model hasn't learned, it may not work as well.
-
-## Demonstration
-
-For inference and downloading the model, check on Hugging Face Space: [NeMo_STT_KZ_Quartznet15x5](https://huggingface.co/spaces/transiteration/nemo_stt_kz_quartznet15x5)
-
-## References
-
-[1] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)
-
-[2] [QuartzNet 15x5](https://catalog.ngc.nvidia.com/orgs/nvidia/models/quartznet15x5)
-
-[3] [Kazakh Speech Corpus 2](https://issai.nu.edu.kz/kz-speech-corpus/?version=1.1)
+title: stt_kz_quartznet15xt
+emoji: 🎤
+colorFrom: green
+colorTo: blue
+sdk: gradio
+sdk_version: 3.0.5
+app_file: app.py
+pinned: false
+license: mit
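
The removed "Input and Output" section asks for mono-channel WAV files with a sample rate written as "16,000 KHz", which presumably means 16,000 Hz (16 kHz). A minimal sketch, assuming that reading, of how a file could be checked before transcription with Python's standard `wave` module. The function name `check_wav` and the two constants are hypothetical and do not come from the repository:

```python
import wave

# Assumption: the README's "16,000 KHz" is read as 16,000 Hz (16 kHz).
EXPECTED_RATE_HZ = 16_000
EXPECTED_CHANNELS = 1  # mono


def check_wav(path):
    """Return a list of format problems; an empty list means the file matches."""
    problems = []
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
    if channels != EXPECTED_CHANNELS:
        problems.append(f"expected mono audio, got {channels} channels")
    if rate != EXPECTED_RATE_HZ:
        problems.append(f"expected {EXPECTED_RATE_HZ} Hz, got {rate} Hz")
    return problems
```

Calling `check_wav` on a file path returns an empty list when both properties match, and one message per mismatch otherwise.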