Commit c7522c4 by jlehecka · 1 Parent(s): 97c61c6

Create README.md

Files changed (1)
  1. README.md +74 -0
README.md ADDED
@@ -0,0 +1,74 @@
+ ---
+ language: "sk"
+ tags:
+ - Slovak
+ - KKY
+ - FAV
+ license: "cc-by-nc-sa-4.0"
+ ---
+
+ # wav2vec2-base-sk-17k
+ This is a monolingual Slovak Wav2Vec 2.0 base model pre-trained on more than 17 thousand hours of Slovak speech.
+
+ This model does not have a tokenizer because it was pre-trained on audio alone. To use it for speech recognition, create a tokenizer and fine-tune the model on labeled data.
+
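As a rough illustration of the tokenizer step, a character-level CTC vocabulary could be derived from the fine-tuning transcripts. This is only a sketch: the helper name `build_ctc_vocab`, the special tokens `[PAD]`, `[UNK]` and `|`, and the two sample transcripts are assumptions for illustration, not something shipped with this model.

```python
import json


def build_ctc_vocab(transcripts, pad_token="[PAD]", unk_token="[UNK]", word_delimiter="|"):
    """Build a character-level vocabulary (token -> id) from transcripts."""
    chars = set()
    for text in transcripts:
        chars.update(text.lower())
    # Spaces are represented by the word delimiter token instead.
    chars.discard(" ")
    tokens = [pad_token, unk_token, word_delimiter] + sorted(chars)
    return {token: idx for idx, token in enumerate(tokens)}


# Hypothetical transcripts standing in for a labeled Slovak dataset.
vocab = build_ctc_vocab(["dobrý deň", "ďakujem pekne"])

# Serialized form; written to a JSON file, it can serve as a tokenizer vocabulary.
vocab_json = json.dumps(vocab, ensure_ascii=False)
```

Saved as a JSON file, such a mapping can be passed to `Wav2Vec2CTCTokenizer` from `transformers` as `vocab_file`, together with matching `pad_token`, `unk_token` and `word_delimiter_token` values.
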
+ The model was initialized from [fav-kky/wav2vec2-base-cs-80k-ClTRUS](https://huggingface.co/fav-kky/wav2vec2-base-cs-80k-ClTRUS), so pre-training used transfer learning from Czech to Slovak. See our paper for details.
+
+ ## Pretraining data
+ Almost 18 thousand hours of unlabeled Slovak speech:
+ - unlabeled data from the VoxPopuli dataset (12.2k hours),
+ - recordings from TV shows (4.5k hours),
+ - oral history archives (800 hours),
+ - CommonVoice 13.0 (24 hours).
+
+ ## Usage
+ Inputs must be 16 kHz mono audio files.
+
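One way to guard against feeding the model audio in the wrong format is to inspect the WAV header before loading. The sketch below uses only the standard-library `wave` module; the helper names `check_wav_format` and `write_silence` and the demo file names are hypothetical, and the check assumes uncompressed PCM WAV input.

```python
import os
import tempfile
import wave


def check_wav_format(path, expected_rate=16_000, expected_channels=1):
    """Return (ok, rate, channels) for the PCM WAV file at `path`."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
    return rate == expected_rate and channels == expected_channels, rate, channels


def write_silence(path, rate, channels, seconds=0.1):
    """Write a short silent 16-bit WAV file (demo helper only)."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 16-bit samples
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * channels * int(rate * seconds))


# Demo on two synthetic files: one valid input, one that needs conversion.
demo_dir = tempfile.mkdtemp()
good = os.path.join(demo_dir, "mono_16k.wav")
bad = os.path.join(demo_dir, "stereo_44k.wav")
write_silence(good, 16_000, 1)
write_silence(bad, 44_100, 2)
good_result = check_wav_format(good)
bad_result = check_wav_format(bad)
```

Files that fail the check would need to be downmixed to one channel and resampled to 16 kHz first, for instance with `torchaudio.functional.resample`.
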
+ The model can be used, for example, to extract per-frame contextual embeddings from audio:
+ ```python
+ from transformers import Wav2Vec2Model, Wav2Vec2FeatureExtractor
+ import torch
+ import torchaudio
+
+ feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("fav-kky/wav2vec2-base-sk-17k")
+ model = Wav2Vec2Model.from_pretrained("fav-kky/wav2vec2-base-sk-17k")
+
+ # speech_array has shape (channels, samples); the file must be 16 kHz mono.
+ speech_array, sampling_rate = torchaudio.load("/path/to/audio/file.wav")
+ inputs = feature_extractor(
+     speech_array[0].numpy(),      # 1-D waveform of the single channel
+     sampling_rate=sampling_rate,  # raises an error if the file is not 16 kHz
+     return_tensors="pt"
+ )["input_values"]                 # shape (1, samples)
+
+ # Inference only, so gradients are not needed.
+ with torch.no_grad():
+     output = model(inputs)
+ # Per-frame embeddings with shape (frames, hidden_size).
+ embeddings = output.last_hidden_state[0].numpy()
+ ```
+
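To get a feel for what "per-frame" means in terms of time resolution, the number of output frames can be estimated from the number of input samples. The sketch below assumes this checkpoint uses the default convolutional feature encoder of `Wav2Vec2Config` (kernel sizes 10, 3, 3, 3, 3, 2, 2 and strides 5, 2, 2, 2, 2, 2, 2); the function name `num_output_frames` is hypothetical, and the actual values should be checked against the model's `config.json`.

```python
def num_output_frames(num_samples,
                      kernels=(10, 3, 3, 3, 3, 2, 2),
                      strides=(5, 2, 2, 2, 2, 2, 2)):
    """Number of embedding frames produced for `num_samples` audio samples.

    Assumes unpadded 1-D convolutions with the given kernel sizes and strides.
    """
    length = num_samples
    for kernel, stride in zip(kernels, strides):
        length = (length - kernel) // stride + 1
    return length


# One second of 16 kHz audio.
frames_per_second = num_output_frames(16_000)
```

Under these assumptions, one second of audio yields 49 frames, so each embedding covers a step of roughly 20 ms.
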
+ ## Speech recognition results
+ After fine-tuning, the model achieved the following word error rates (WER) on public datasets:
+ - Slovak portion of CommonVoice v13.0: **WER = 8.82%**
+ - Slovak portion of VoxPopuli: **WER = 8.88%**
+
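For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions, deletions and insertions) between the recognized text and the reference, divided by the number of reference words. The following is a minimal sketch of that definition; the function name `word_error_rate` and the sample sentences are illustrative, and the text normalization and scoring setup used for the reported results may differ.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# One substitution and one deletion against a four-word reference.
example_wer = word_error_rate("dobrý deň vám prajem", "dobrý den vám")
```

For a whole test set, the errors and reference word counts are usually summed over all utterances before dividing, rather than averaging per-utterance rates.
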
+ See our paper for details.
+
+ ## Paper
+ The preprint of our paper (accepted to TSD 2023) is available at TBD
+
+ ## Citation
+ If you find this model useful, please cite our paper:
+ ```
+ @inproceedings{wav2vec2-base-cs-80k-ClTRUS,
+     title = {{Transfer Learning of Transformer-based Speech Recognition Models from Czech to Slovak}},
+     author = {
+         Jan Lehe\v{c}ka and
+         Josef V. Psutka and
+         Josef Psutka
+     },
+     booktitle = {{TSD} 2023},
+     publisher = {{Springer}},
+     year = {2023},
+     note = {(in press)},
+ }
+ ```
+
+ ## Related works
+ - [fav-kky/wav2vec2-base-cs-80k-ClTRUS](https://huggingface.co/fav-kky/wav2vec2-base-cs-80k-ClTRUS)