angel-poc committed on
Commit 0176960
1 parent: 3bfa86d

Update README.md

Files changed (1)
  1. README.md +9 -0
README.md CHANGED
@@ -77,6 +77,15 @@ print(transcriptions)
 
  ```
 
+ ## Training
+ ### Data preparation
+ We processed [Common Voice 11.0](https://commonvoice.mozilla.org/en/datasets) with the NeMo toolkit. We used [get_commonvoice_data.py](https://github.com/NVIDIA/NeMo/blob/main/scripts/dataset_processing/get_commonvoice_data.py) to process the manifests and then cleaned the data.
+
+ After cleaning the dataset and normalizing the `ñ` character to `ny`, we used the following character set to create the final NeMo manifests for training.
+ ```python
+ ['c', ' ', 'ó', 'g', 'a', 'o', 'ü', 'v', 'p', 't', "'", '—', 'f', 'k', 'à', 'ï', 'í', 'ú', 'd', 'l', 'z', 'é', 'w', 'm', 'r', 'n', 'y', '-', 'u', 'i', 'h', 'ç', 'e', '·', 'q', 'è', 'ò', 'j', 'x', 's', 'b']
+ ```
+
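The normalization and charset filtering described above could look roughly like the following. This is a minimal sketch, not the authors' actual cleaning script: the helper names (`normalize_text`, `in_charset`, `clean_manifest_lines`), the lowercasing step, and the choice to drop entries that contain out-of-charset characters are all assumptions. It also assumes NeMo-style JSON-lines manifests with `audio_filepath`, `duration`, and `text` fields.

```python
import json

# Charset copied from the README text above.
CHARSET = set([
    'c', ' ', 'ó', 'g', 'a', 'o', 'ü', 'v', 'p', 't', "'", '—', 'f', 'k',
    'à', 'ï', 'í', 'ú', 'd', 'l', 'z', 'é', 'w', 'm', 'r', 'n', 'y', '-',
    'u', 'i', 'h', 'ç', 'e', '·', 'q', 'è', 'ò', 'j', 'x', 's', 'b',
])


def normalize_text(text: str) -> str:
    """Lowercase the text and map 'ñ' to 'ny'.

    Lowercasing is an assumption of this sketch; the README only
    mentions the 'ñ' -> 'ny' normalization.
    """
    return text.lower().replace("ñ", "ny")


def in_charset(text: str) -> bool:
    """Return True if every character of `text` belongs to CHARSET."""
    return all(ch in CHARSET for ch in text)


def clean_manifest_lines(lines):
    """Yield normalized manifest entries whose text fits the charset.

    `lines` is any iterable of JSON strings, one manifest entry each.
    Entries with out-of-charset characters are skipped (hypothetical policy).
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        entry["text"] = normalize_text(entry["text"])
        if in_charset(entry["text"]):
            yield entry


# Hypothetical sample entries, for illustration only.
sample = [
    json.dumps({"audio_filepath": "a.wav", "duration": 1.2, "text": "Espanya i Catalunya"}),
    json.dumps({"audio_filepath": "b.wav", "duration": 0.8, "text": "España"}),
    json.dumps({"audio_filepath": "c.wav", "duration": 2.0, "text": "hello 123"}),
]
cleaned = list(clean_manifest_lines(sample))
for entry in cleaned:
    print(json.dumps(entry, ensure_ascii=False))
```

In this sketch the third sample entry is dropped because digits are not part of the charset; a real pipeline might instead spell numbers out or strip the offending characters.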
  ## Additional information
 
  ### Author