# wav2vec2-base-cs-80k-ClTRUS

**C**zech **l**anguage **TR**ansformer from **U**nlabeled **S**peech (ClTRUS) is a monolingual Czech Wav2Vec 2.0 base model pre-trained on 80 thousand hours of Czech speech.

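Because the checkpoint has no tokenizer or CTC head, it can be used directly as a speech-representation extractor. The snippet below is a minimal sketch with the Hugging Face `transformers` library; the Hub id is assumed from the model name, the feature-extractor settings are common wav2vec 2.0 defaults rather than values confirmed for this repository, and the silent waveform only stands in for real 16 kHz Czech audio.

```python
# Minimal sketch: extracting frame-level speech representations from the pre-trained encoder.
# Hub id, feature-extractor settings, and the dummy waveform are illustrative assumptions.
import numpy as np
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

MODEL_ID = "fav-kky/wav2vec2-base-cs-80k-ClTRUS"  # assumed repository id

# Standard wav2vec 2.0 front-end: raw 16 kHz mono waveforms, zero-padded in batches.
feature_extractor = Wav2Vec2FeatureExtractor(
    feature_size=1, sampling_rate=16_000, padding_value=0.0, return_attention_mask=True
)
model = Wav2Vec2Model.from_pretrained(MODEL_ID).eval()

waveform = np.zeros(16_000, dtype=np.float32)  # 1 second of silence as a placeholder
inputs = feature_extractor(waveform, sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    outputs = model(inputs.input_values)

# One 768-dimensional vector per ~20 ms frame for a base-sized model.
print(outputs.last_hidden_state.shape)
```
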
This model does not have a tokenizer, as it was pre-trained on audio alone. To use it for speech recognition, a tokenizer must be created and the model must be fine-tuned on labeled text data, for example as sketched below.

**Note:** This is a checkpoint of the model after 4 epochs over the whole dataset. If you need an earlier or later checkpoint, feel free to contact the author (jlehecka(at)kky.zcu.cz).

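The following is a minimal sketch of how a CTC fine-tuning setup could be wired together with the Hugging Face `transformers` library. The Hub id is assumed from the model name, and the toy vocabulary, dummy audio, and transcript are placeholders; the actual recipe (vocabulary, data, hyperparameters) is described in the paper, not here.

```python
# Minimal sketch: attaching a CTC head and tokenizer to the pre-trained encoder.
# Vocabulary, audio, transcript, and hyperparameters are toy placeholders.
import json
import numpy as np
import torch
from transformers import (
    Wav2Vec2CTCTokenizer,
    Wav2Vec2FeatureExtractor,
    Wav2Vec2ForCTC,
    Wav2Vec2Processor,
)

MODEL_ID = "fav-kky/wav2vec2-base-cs-80k-ClTRUS"  # assumed repository id

# Toy character vocabulary; a real one would cover the full Czech alphabet with diacritics.
vocab = {ch: i for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz")}
vocab["|"] = len(vocab)       # word delimiter
vocab["[UNK]"] = len(vocab)
vocab["[PAD]"] = len(vocab)   # also used as the CTC blank token
with open("vocab.json", "w") as f:
    json.dump(vocab, f)

tokenizer = Wav2Vec2CTCTokenizer(
    "vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|"
)
feature_extractor = Wav2Vec2FeatureExtractor(
    feature_size=1, sampling_rate=16_000, padding_value=0.0, return_attention_mask=True
)
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

# Load the pre-trained encoder; the CTC head is newly initialized to match the vocabulary.
model = Wav2Vec2ForCTC.from_pretrained(
    MODEL_ID,
    ctc_loss_reduction="mean",
    pad_token_id=tokenizer.pad_token_id,
    vocab_size=len(tokenizer),
)
model.freeze_feature_encoder()  # the convolutional front-end is commonly kept frozen

# One toy training step on dummy audio with a dummy transcript.
waveform = np.random.randn(16_000).astype(np.float32)
inputs = processor(waveform, sampling_rate=16_000, return_tensors="pt")
labels = tokenizer("dobry den", return_tensors="pt").input_ids
loss = model(input_values=inputs.input_values, labels=labels).loss
loss.backward()
print(float(loss))
```

In practice this would be wrapped in a training loop (e.g. `Trainer`) over a labeled Czech corpus; the snippet only shows how the pieces fit together.
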
## Pretraining data

More than 80 thousand hours of unlabeled Czech speech:
- recordings from radio (22k hours),
- unlabeled data from the VoxPopuli dataset (18.7k hours),
- TV shows (15k hours),
- shadow speakers (12k hours),
- sports (5k hours),
- telephone data (2k hours),
- and a smaller amount of data from several other domains, including the public CommonVoice dataset.

## Speech recognition results

After fine-tuning, the model achieved the following results on public datasets:
- Czech portion of CommonVoice v7.0: **WER = 3.8%**
- Czech portion of VoxPopuli: **WER = 8.8%**

See our paper for details.

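For reference, word error rate on such benchmarks is typically obtained by decoding the test split with a fine-tuned model and comparing against the reference transcripts. The sketch below illustrates this with a hypothetical fine-tuned checkpoint, greedy CTC decoding, and the `datasets`/`jiwer` libraries; it is not the evaluation pipeline from the paper, and the gated CommonVoice dataset requires accepting its terms on the Hugging Face Hub.

```python
# Illustrative WER computation with greedy decoding; not the paper's evaluation setup.
# "your-org/wav2vec2-cs-asr" is a placeholder for a model fine-tuned from this checkpoint.
import torch
from datasets import Audio, load_dataset
from jiwer import wer
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

FINETUNED_ID = "your-org/wav2vec2-cs-asr"  # hypothetical fine-tuned ASR model
processor = Wav2Vec2Processor.from_pretrained(FINETUNED_ID)
model = Wav2Vec2ForCTC.from_pretrained(FINETUNED_ID).eval()

# Czech test split of CommonVoice v7.0 (gated dataset), resampled to 16 kHz.
ds = load_dataset("mozilla-foundation/common_voice_7_0", "cs", split="test")
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

predictions, references = [], []
for sample in ds.select(range(100)):  # small subset to keep the sketch quick
    inputs = processor(sample["audio"]["array"], sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    pred_ids = torch.argmax(logits, dim=-1)
    predictions.append(processor.batch_decode(pred_ids)[0].lower())
    references.append(sample["sentence"].lower())

print(f"WER: {wer(references, predictions):.3f}")
```

Note that text normalization (casing, punctuation) strongly affects WER, so numbers from such a sketch are not directly comparable to the results reported above.
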
## Paper

The preprint of our paper (accepted to INTERSPEECH 2022) is available at http://arxiv.org/abs/2206.07627.

## Citation

If you find this model useful, please cite our paper:

```
@inproceedings{wav2vec2-base-cs-80k-ClTRUS,
  title = {Exploring Capabilities of Monolingual Audio Transformers using Large Datasets in Automatic Speech Recognition of {C}zech},
  author = {Jan Lehe\v{c}ka and Jan \v{S}vec and Ale\v{s} Pra\v{z}\'ak and Josef V. Psutka},
  booktitle = {Interspeech 2022},
  publisher = {{ISCA}},
  year = {2022},
  note = {(in press)},
  url = {https://arxiv.org/abs/2206.07627},
}
```

## Other papers using this model

- [Transformer-based Automatic Speech Recognition of Formal and Colloquial Czech in MALACH Project](https://arxiv.org/abs/2206.07666)