Commit 9b39732
Author: danielschnell
1 Parent(s): d2e439b
Copied from Clarin: http://hdl.handle.net/20.500.12537/227
Use original Readme.txt => README.md
Signed-off-by: Daniel Schnell <dschnell@grammatek.com>
- .gitattributes +1 -0
- 10_trials_optim_kenlm.scorer +3 -0
- README.md +89 -3
.gitattributes
CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+10_trials_optim_kenlm.scorer filter=lfs diff=lfs merge=lfs -text
10_trials_optim_kenlm.scorer
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:6082bcc551041a630d54c01746b8e8b6d4c2368d9ba7f1e774e32a4b6c95ab11
+size 1043308192
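The three added lines are a Git LFS pointer: the repository stores this small text stub, while the scorer binary itself lives in LFS storage. As a minimal sketch of how such a pointer can be read, the Python below parses the key/value lines and converts the size to GiB. The names `LfsPointer` and `parse_lfs_pointer` are hypothetical helpers written for this illustration, not part of Git LFS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LfsPointer:
    version: str    # URL of the pointer spec version
    algorithm: str  # hash algorithm named in the oid, e.g. "sha256"
    digest: str     # hex digest of the real file
    size: int       # size of the real file in bytes


def parse_lfs_pointer(text: str) -> LfsPointer:
    """Parse the 'key value' lines of a Git LFS pointer file."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Each line is a key, one space, then the value.
        key, _, value = line.partition(" ")
        fields[key] = value.strip()
    algorithm, _, digest = fields["oid"].partition(":")
    return LfsPointer(fields["version"], algorithm, digest, int(fields["size"]))


POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:6082bcc551041a630d54c01746b8e8b6d4c2368d9ba7f1e774e32a4b6c95ab11
size 1043308192
"""

pointer = parse_lfs_pointer(POINTER)
size_gib = pointer.size / 2**30  # bytes to GiB
print(f"{pointer.algorithm} digest of {len(pointer.digest)} hex chars, {size_gib:.2f} GiB")
```

The size field shows that the scorer is just under 1 GiB, which is why it is tracked through LFS rather than committed directly.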
README.md
CHANGED
@@ -1,3 +1,89 @@
-
-
-
+-------------------------------------------------------------------------------
+DeepSpeech Scorer for Icelandic 22.06
+-------------------------------------------------------------------------------
+
+Authors : Carlos Daniel Hernández Mena (carlosm@ru.is).
+
+Language : Icelandic.
+
+Recommended use : speech recognition.
+
+-------------------------------------------------------------------------------
+Description
+-------------------------------------------------------------------------------
+
+"DeepSpeech Scorer for Icelandic 22.06" is a scorer suitable for recognizers
+based on Mozilla's DeepSpeech recognizer [1]. A "scorer" is a single file
+used to perform language modeling. It is composed of two sub-components: a
+KenLM language model and a trie data structure containing all words in the
+vocabulary [2].
+
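As a rough illustration of how a scorer file like this one is typically attached to a recognizer, the sketch below uses the DeepSpeech 0.9 Python bindings (`Model`, `enableExternalScorer`, `stt`). It is not taken from the recipe referenced in this README. The function names `attach_scorer` and `transcribe` and all file paths are assumptions made for the example, and an Icelandic acoustic model has to be obtained separately.

```python
import wave


def attach_scorer(model, scorer_path):
    """Enable an external scorer on an already-loaded DeepSpeech model."""
    # enableExternalScorer loads a packaged scorer (KenLM language model
    # plus vocabulary trie) so the decoder can use it during beam search.
    model.enableExternalScorer(scorer_path)
    return model


def transcribe(acoustic_model_path, scorer_path, wav_path):
    """Transcribe a 16-bit mono WAV file, decoding with the given scorer."""
    # Imported lazily so this sketch can be read and loaded without the
    # optional dependencies (pip install deepspeech numpy).
    import numpy as np
    from deepspeech import Model

    model = attach_scorer(Model(acoustic_model_path), scorer_path)
    with wave.open(wav_path, "rb") as wav:
        # The WAV sample rate is assumed to match model.sampleRate().
        frames = wav.readframes(wav.getnframes())
    audio = np.frombuffer(frames, dtype=np.int16)
    return model.stt(audio)
```

A call might look like `transcribe("acoustic_model.pbmm", "10_trials_optim_kenlm.scorer", "utterance.wav")`, where the first and last file names are placeholders.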
+This scorer was originally created for use with the following DeepSpeech
+recipe, developed by the Language and Voice Lab (LVL) at Reykjavík University
+in 2022:
+
+https://github.com/cadia-lvl/samromur-asr/tree/d5_samromur/d5_samromur
+
+Nevertheless, because resources of this kind are flexible and can be applied
+to other tasks, systems, or code recipes, it was decided to publish this
+scorer as an independent item.
+
+-------------------------------------------------------------------------------
+The Language Model
+-------------------------------------------------------------------------------
+
+The language model was created using the Icelandic Gigaword Corpus [3]. The
+Gigaword corpus contains text from newspaper articles, parliamentary speeches,
+adjudications, books, transcribed radio/television news and more. The
+normalization of the sentences used to generate the language model consists
+of allowing only characters belonging to the Icelandic alphabet, expanding
+numbers and abbreviations, and removing punctuation marks [4]. The resulting
+text is more than 44 million lines long (approximately 5.3 GB), and it was
+used to create the scorer.
+
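The normalization steps described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the pipeline that produced the scorer: lowercasing is an assumption, number and abbreviation expansion is left out, and lines that still contain out-of-alphabet characters are simply rejected. `ICELANDIC_ALPHABET` and `normalize_line` are hypothetical names.

```python
import unicodedata

# The 32 letters of the modern Icelandic alphabet (lowercase).
ICELANDIC_ALPHABET = set("aábdðeéfghiíjklmnoóprstuúvxyýþæö")


def normalize_line(line):
    """Return a normalized line, or None if it cannot be kept."""
    # Assumption: compose accents and lowercase before filtering.
    text = unicodedata.normalize("NFC", line).lower()
    # Replace punctuation (Unicode categories starting with "P") with spaces.
    text = "".join(
        " " if unicodedata.category(ch).startswith("P") else ch for ch in text
    )
    # Collapse runs of whitespace into single spaces.
    text = " ".join(text.split())
    if not text:
        return None
    # Keep the line only if every character is an Icelandic letter or a space.
    # Digits and abbreviations would need expansion first (not shown here).
    if all(ch == " " or ch in ICELANDIC_ALPHABET for ch in text):
        return text
    return None


for sample in ["Halló, heimur!", "Þetta er próf.", "Árið 2022"]:
    print(repr(sample), "->", repr(normalize_line(sample)))
```

Rejecting a line outright is only one possible policy. A real pipeline would expand "2022" into words before this check rather than discard the sentence.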
+-------------------------------------------------------------------------------
+Citation
+-------------------------------------------------------------------------------
+
+When publishing results based on this scorer, please refer to:
+
+Mena, Carlos; "DeepSpeech Scorer for Icelandic 22.06". Web Download.
+Reykjavik University: Language and Voice Lab, 2022.
+
+Contact: Carlos Mena (carlosm@ru.is)
+
+License: CC BY 4.0
+
+-------------------------------------------------------------------------------
+Acknowledgements
+-------------------------------------------------------------------------------
+
+This initiative was funded by the Language Technology Programme for Icelandic
+2019-2023. The programme, which is managed and coordinated by Almannarómur,
+is funded by the Icelandic Ministry of Education, Science and Culture.
+
+-------------------------------------------------------------------------------
+References
+-------------------------------------------------------------------------------
+
+[1] Amodei, D., Ananthanarayanan, S., Anubhai, R., Bai, J., Battenberg,
+E., Case, C., ... & Zhu, Z. (2016, June). Deep Speech 2: End-to-end
+speech recognition in English and Mandarin. In International Conference
+on Machine Learning (pp. 173-182). PMLR.
+
+[2] Mozilla's DeepSpeech online documentation:
+https://deepspeech.readthedocs.io/en/r0.9/Scorer.html
+
+[3] Steingrímsson, S., Helgadóttir, S., Rögnvaldsson, E., Barkarson, S.,
+& Guðnason, J. (2018, May). Risamálheild: A very large Icelandic text
+corpus. In Proceedings of the Eleventh International Conference on
+Language Resources and Evaluation (LREC 2018).
+
+[4] Nikulásdóttir, A. B., Helgadóttir, I. R., Pétursson, M., & Guðnason,
+J. (2018, May). Open ASR for Icelandic: Resources and a baseline system.
+In Proceedings of the Eleventh International Conference on Language
+Resources and Evaluation (LREC 2018).
+
+-------------------------------------------------------------------------------
+-------------------------------------------------------------------------------
+