---
license: "cc-by-nc-sa-4.0"
---
# wav2vec2-base-cs-80k-ClTRUS
**C**zech **l**anguage **TR**ansformer from **U**nlabeled **S**peech (ClTRUS) is a monolingual Czech Wav2Vec 2.0 base model pre-trained on 80 thousand hours of Czech speech.
This model does not have a tokenizer, as it was pre-trained on audio alone. To use it for speech recognition, a tokenizer must be created and the model must be fine-tuned on labeled data.
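As a sketch of that first step, a character-level vocabulary for a CTC tokenizer can be built from transcripts. The transcripts, file name, and special-token conventions below follow the common Hugging Face fine-tuning recipe and are illustrative, not part of this model:

```python
import json

# Illustrative transcripts; real fine-tuning would use a labeled Czech corpus.
transcripts = ["dobrý den", "jak se máte"]

# Character-level vocabulary for a CTC tokenizer.
chars = sorted(set("".join(transcripts)))
vocab = {c: i for i, c in enumerate(chars)}
vocab["|"] = vocab.pop(" ")   # use "|" as the word delimiter (common convention)
vocab["[UNK]"] = len(vocab)   # unknown-character token
vocab["[PAD]"] = len(vocab)   # padding token, also used as the CTC blank

with open("vocab.json", "w", encoding="utf-8") as f:
    json.dump(vocab, f, ensure_ascii=False)
```

The resulting `vocab.json` can then back a `Wav2Vec2CTCTokenizer`, after which the model can be fine-tuned with a CTC head (e.g. `Wav2Vec2ForCTC` in the `transformers` library).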
**Note:** This is a checkpoint of the model after 4 epochs over the whole dataset. If you want some earlier or later checkpoints, please feel free to contact the author (jlehecka(at)kky.zcu.cz).
## Pretraining data
More than 80 thousand hours of unlabeled Czech speech:
- recordings from radio (22k hours),
- unlabeled data from VoxPopuli dataset (18.7k hours),
- TV shows (15k hours),
- shadow speakers (12k hours),
- sports (5k hours),
- telephone data (2k hours),
- and a smaller amount of data from several other domains including the public CommonVoice dataset.
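As a quick arithmetic check, the domains enumerated above account for roughly 74.7 thousand hours; the gap to the 80k-hour total is covered by the smaller additional domains:

```python
# Hours of unlabeled speech per domain, in thousands (from the list above).
hours = {
    "radio": 22.0,
    "voxpopuli": 18.7,
    "tv_shows": 15.0,
    "shadow_speakers": 12.0,
    "sports": 5.0,
    "telephone": 2.0,
}
total = sum(hours.values())
print(f"{total:.1f}k hours from the enumerated domains")
```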
## Speech recognition results
After fine-tuning, the model achieved the following results on public datasets:
- Czech portion of CommonVoice v7.0: **WER = 3.8%**
- Czech portion of VoxPopuli: **WER = 8.8%**
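The WER (word error rate) figures above are the standard metric: word-level edit distance between reference and hypothesis, divided by the number of reference words. A minimal implementation for illustration (not the evaluation code from the paper):

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate via word-level edit distance (ref assumed non-empty)."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(r)][len(h)] / len(r)

print(wer("dobrý den jak se máte", "dobrý den jak se mate"))  # 0.2
```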
See our paper for details.
## Paper
The preprint of our paper (accepted to INTERSPEECH 2022) is available at http://arxiv.org/abs/2206.07627
## Citation
If you find this model useful, please cite our paper:
```
@inproceedings{wav2vec2-base-cs-80k-ClTRUS,
  title = {Exploring Capabilities of Monolingual Audio Transformers using Large Datasets in Automatic Speech Recognition of {C}zech},
  author = {Jan Lehe\v{c}ka and Jan \v{S}vec and Ale\v{s} Pra\v{z}\'ak and Josef V. Psutka},
  booktitle = {Interspeech 2022},
  publisher = {{ISCA}},
  year = {2022},
  note = {(in press)},
  url = {https://arxiv.org/abs/2206.07627},
}
```
## Other papers using this model
- [Transformer-based Automatic Speech Recognition of Formal and Colloquial Czech in MALACH Project](https://arxiv.org/abs/2206.07666)