anteju committed
Commit f086fbf
1 Parent(s): 10d154e

Update README.md

Files changed (1): README.md (+5 -5)
README.md CHANGED
@@ -88,14 +88,14 @@ The tokenizers for these models were built using the text transcripts of the tra
 
 The vocabulary we use contains 27 characters:
 ```python
-['a', 'b', 'c', 'č', 'ć', 'd', 'đ', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'r', 's', 'š', 't', 'u', 'v', 'z', 'ž']
+[' ', 'a', 'b', 'c', 'č', 'ć', 'd', 'đ', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'r', 's', 'š', 't', 'u', 'v', 'z', 'ž']
 ```
 
-Full config can be found inside the .nemo files.
+Full config can be found inside the `.nemo` files.
 
 ### Datasets
 
-All the models in this collection are trained on ParlaSpeech-HR v1.0 Croatian dataset, which contains around 1665 hours of training data after data cleaning, 2.2 hours of developement and 2.3 hours of test data.
+All the models in this collection are trained on ParlaSpeech-HR v1.0 Croatian dataset, which contains around 1665 hours of training data after data cleaning, 2.2 hours of development and 2.3 hours of test data.
 
 ## Performance
 
@@ -103,13 +103,13 @@ The list of the available models in this collection is shown in the following ta
 
 | Version | Tokenizer | Vocabulary Size | Dev WER | Test WER | Train Dataset |
 |---------|-----------------------|-----------------|---------|----------|---------------------|
-| 1.11.0 | SentencePiece Unigram | 128 | X.YZ | X.YZ | ParlaSpeech-HR v1.0 |
+| 1.11.0 | SentencePiece Unigram | 128 | 4.56 | 4.69 | ParlaSpeech-HR v1.0 |
 
 You may use language models (LMs) and beam search to improve the accuracy of the models.
 
 ## Limitations
 
-Since the model is trained just on ParlaSpeech-HR v1.0 dataset, the performance of this model might degrade for speech which includes terms, or vernecular that the model has not been trained on. The model might also perform worse for accented speech.
+Since the model is trained just on ParlaSpeech-HR v1.0 dataset, the performance of this model might degrade for speech which includes terms, or vernacular that the model has not been trained on. The model might also perform worse for accented speech.
 
 ## References
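As the updated line notes, the full configuration travels inside the `.nemo` checkpoint itself. Below is a minimal sketch of pulling it out for inspection, assuming a working NeMo installation; the checkpoint file name `stt_hr_model.nemo` is a placeholder for whichever model from this collection you downloaded.

```python
# Minimal sketch: inspect the config embedded in a .nemo checkpoint.
# The checkpoint file name below is illustrative.
from omegaconf import OmegaConf

import nemo.collections.asr as nemo_asr

# restore_from() unpacks the .nemo archive and rebuilds the model from
# the config and weights stored inside it.
model = nemo_asr.models.ASRModel.restore_from(restore_path="stt_hr_model.nemo")

# model.cfg holds the full config (preprocessor, encoder, decoder,
# tokenizer, optimization settings) as an OmegaConf object.
print(OmegaConf.to_yaml(model.cfg))
```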
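For context on how Dev/Test WER figures like those in the table are produced, here is a hedged usage sketch of running inference with one of these checkpoints. `transcribe()` is NeMo's standard ASR entry point; the audio file name is illustrative, and depending on the NeMo version the returned entries are plain strings or hypothesis objects.

```python
# Hedged sketch: transcribe audio with a restored NeMo ASR model.
# Expects 16 kHz mono WAV input; file names are illustrative.
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.restore_from(restore_path="stt_hr_model.nemo")

# transcribe() takes a list of audio file paths and returns one decoded
# result per file.
hypotheses = model.transcribe(["sample_hr.wav"])
print(hypotheses[0])
```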
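The README's claim that LMs and beam search improve accuracy is easy to check empirically: decode once, rescore, and compare corpus-level WER. Below is a sketch using NeMo's `word_error_rate` helper, assuming line-aligned plain-text reference and hypothesis files (file names illustrative).

```python
# Hedged sketch: corpus-level WER, e.g. to reproduce figures like the
# 4.56 / 4.69 dev/test WER above, or to measure gains from LM rescoring.
# Assumes one utterance per line, with references and hypotheses aligned.
from nemo.collections.asr.metrics.wer import word_error_rate

with open("references.txt", encoding="utf-8") as f:
    references = [line.strip() for line in f]
with open("hypotheses.txt", encoding="utf-8") as f:
    hypotheses = [line.strip() for line in f]

# word_error_rate() accumulates edit distance over the whole corpus and
# normalizes by the total number of reference words.
wer = word_error_rate(hypotheses=hypotheses, references=references)
print(f"WER: {100 * wer:.2f}%")
```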