---
language:
- ko
license: cc-by-4.0
library_name: nemo
datasets:
- RealCallData
thumbnail: null
tags:
- automatic-speech-recognition
- speech
- audio
- Citrinet1024
- NeMo
- pytorch
model-index:
- name: stt_kr_citrinet1024_PublicCallCenter_1000H_0.22
  results: []
---

## Model Overview

Citrinet-1024 model for Korean automatic speech recognition (ASR), trained on call-center telephone speech and intended for transcribing 16 kHz mono Korean audio.

## NVIDIA NeMo: Training

To train, fine-tune, or play with the model you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend installing it after you have installed the latest PyTorch version.
```shell
pip install nemo_toolkit['all']
```

## How to Use this Model

The model is available for use in the NeMo toolkit [1], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.


### Automatically instantiate the model

```python
import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.ASRModel.from_pretrained("ypluit/stt_kr_citrinet1024_PublicCallCenter_1000H_0.22")
```


### Transcribing using Python
First, get a sample: any Korean telephone speech recording saved as a 16 kHz mono WAV file (referred to below as `sample-kr.wav`).
Then simply do:
```python
asr_model.transcribe(['sample-kr.wav'])
```

### Transcribing many audio files

```shell
python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py  pretrained_name="ypluit/stt_kr_citrinet1024_PublicCallCenter_1000H_0.22"  audio_dir="<DIRECTORY CONTAINING AUDIO FILES>"
```

### Input

This model accepts 16,000 Hz mono-channel audio (WAV files) as input.
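
If your recordings are in a different format or sample rate, the following is a minimal conversion sketch, assuming `librosa` and `soundfile` are installed; the input file name is a placeholder:

```python
import librosa
import soundfile as sf

# Load any recording, downmix to mono, and resample to the 16 kHz this model expects.
audio, _ = librosa.load("call-recording.mp3", sr=16000, mono=True)
sf.write("sample-kr.wav", audio, 16000)
```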

### Output

This model provides transcribed speech as a string for a given audio sample.
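
As a quick check of the output format, the snippet below (building on the `asr_model` instantiated above) prints one transcription per file; file names are placeholders, and newer NeMo releases may wrap results in hypothesis objects rather than plain strings:

```python
files = ["sample-kr.wav", "another-call.wav"]

# transcribe() returns one result per input file, in the same order.
for path, text in zip(files, asr_model.transcribe(files)):
    print(f"{path}: {text}")
```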


## Model Architecture

This model is based on the Citrinet-1024 architecture. See the NeMo toolkit documentation [1] and the Citrinet reference paper for architectural details.

## Training

Trained for about 30 days on two NVIDIA A6000 GPUs.
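
If you want to fine-tune this checkpoint on your own data, the sketch below shows the usual NeMo / PyTorch Lightning setup; the manifest path, batch size, and trainer settings are placeholders, not the configuration used for this checkpoint:

```python
import pytorch_lightning as pl
from omegaconf import OmegaConf
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained(
    "ypluit/stt_kr_citrinet1024_PublicCallCenter_1000H_0.22"
)

# NeMo reads a JSON-lines manifest: one {"audio_filepath", "duration", "text"} object per line.
train_cfg = OmegaConf.create({
    "manifest_filepath": "train_manifest.json",  # placeholder path
    "sample_rate": 16000,
    "batch_size": 16,
    "shuffle": True,
})
asr_model.setup_training_data(train_cfg)

trainer = pl.Trainer(accelerator="gpu", devices=1, max_epochs=10)
asr_model.set_trainer(trainer)
trainer.fit(asr_model)
```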

### Datasets

Private real call-center data (approximately 1,100 hours).

## Performance

Character Error Rate (CER) below 0.13.
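
CER here is the standard character error rate (character-level edit distance divided by reference length). To measure it on your own labelled audio, here is a quick sketch using the third-party `jiwer` package (not bundled with NeMo); the reference text is a placeholder:

```python
import jiwer  # pip install jiwer

reference = "안녕하세요 고객센터입니다"  # ground-truth transcript (placeholder)
hypothesis = asr_model.transcribe(["sample-kr.wav"])[0]

# CER = (substitutions + deletions + insertions) / number of reference characters
print(jiwer.cer(reference, hypothesis))
```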

## Limitations

This model was trained on 650 hours of Korean telephone speech from a customer-service call center. Performance may be poor on general-purpose dialogue and on specific accents.

## References


[1] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)