fav-kky/wav2vec2-base-sk-17k

jlehecka committed on Jun 7, 2023

Commit c7522c4 · 1 Parent(s): 97c61c6

Create README.md

Files changed (1)

README.md +74 -0

README.md ADDED

	@@ -0,0 +1,74 @@

+---
+language: "sk"
+tags:
+- Slovak
+- KKY
+- FAV
+license: "cc-by-nc-sa-4.0"
+---
+# wav2vec2-base-sk-17k
+This is a monolingual Slovak Wav2Vec 2.0 base model pre-trained on 17 thousand hours of Slovak speech.
+This model does not have a tokenizer because it was pre-trained on audio alone. To use it for speech recognition, create a tokenizer and fine-tune the model on labeled data.
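
The tokenizer mentioned above needs a vocabulary before fine-tuning can start. A minimal sketch of building one, assuming a character-level CTC setup; the sample transcripts, the special tokens `[PAD]`, `[UNK]` and `|`, and the helper name `build_vocab` are illustrative assumptions, not taken from the model card or the paper:

```python
import json


def build_vocab(transcripts):
    """Map each character seen in the transcripts to an integer id."""
    # Lowercase everything and drop the space, which is represented by "|" instead.
    chars = sorted(set("".join(transcripts).lower()) - {" "})
    # Assumed special tokens: CTC padding/blank, unknown, and a word delimiter.
    tokens = ["[PAD]", "[UNK]", "|"] + chars
    return {token: idx for idx, token in enumerate(tokens)}


# Hypothetical labeled transcripts from a fine-tuning set.
transcripts = ["dobrý deň", "ako sa máš"]
vocab = build_vocab(transcripts)

# Serialized form that could be written to a vocabulary file.
vocab_json = json.dumps(vocab, ensure_ascii=False, indent=2)
print(vocab_json)
```

In `transformers`, `Wav2Vec2CTCTokenizer` accepts such a vocabulary file together with the `unk_token`, `pad_token`, and `word_delimiter_token` arguments.
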
+The model was initialized from [fav-kky/wav2vec2-base-cs-80k-ClTRUS](https://huggingface.co/fav-kky/wav2vec2-base-cs-80k-ClTRUS), so pre-training used transfer learning from Czech to Slovak; see our paper for details.
+## Pretraining data
+Almost 18 thousand hours of unlabeled Slovak speech:
+- unlabeled data from VoxPopuli dataset (12.2k hours)
+- recordings from TV shows (4.5k hours)
+- oral history archives (800 hours)
+- CommonVoice 13.0 (24 hours)
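
The per-source figures above can be added up as a quick check. This is only a sketch: the listed hours are rounded, so the total is approximate, and the dictionary keys are shorthand labels rather than official dataset names.

```python
# Hours per source, as listed above (rounded values).
hours = {
    "VoxPopuli (unlabeled)": 12_200,
    "TV shows": 4_500,
    "Oral history archives": 800,
    "CommonVoice 13.0": 24,
}

total = sum(hours.values())
print(f"total: {total} hours")

# Share of each source in the pre-training mix.
for name, h in hours.items():
    print(f"{name}: {100 * h / total:.1f}%")
```
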
+## Usage
+Inputs must be 16 kHz mono audio files.
+This model can be used, for example, to extract per-frame contextual embeddings from audio:
+```python
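+import torch
+import torchaudio
+from transformers import Wav2Vec2Model, Wav2Vec2FeatureExtractor
+# Load the feature extractor (input normalization) and the pre-trained encoder.
+feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("fav-kky/wav2vec2-base-sk-17k")
+model = Wav2Vec2Model.from_pretrained("fav-kky/wav2vec2-base-sk-17k")
+# torchaudio returns a (channels, samples) tensor and the file's sampling rate.
+speech_array, sampling_rate = torchaudio.load("/path/to/audio/file.wav")
+# Pass the first (only) channel as a 1-D array; the extractor raises an error
+# if the sampling rate differs from the 16 kHz it expects.
+inputs = feature_extractor(
+    speech_array[0].numpy(),
+    sampling_rate=sampling_rate,
+    return_tensors="pt"
+)["input_values"]
+# Run the encoder without tracking gradients.
+with torch.no_grad():
+    output = model(inputs)
+# Per-frame embeddings with shape (frames, hidden_size).
+embeddings = output.last_hidden_state[0].numpy()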
+```
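
The time dimension of `embeddings` depends on the clip length. A rough sketch of how to estimate it, assuming the model keeps the standard Wav2Vec 2.0 base convolutional feature encoder (kernel sizes 10, 3, 3, 3, 3, 2, 2 and strides 5, 2, 2, 2, 2, 2, 2); the model's actual configuration is not shown here, so check it before relying on these numbers. `num_frames` is a hypothetical helper:

```python
# Assumed feature-encoder layout (standard Wav2Vec 2.0 base configuration).
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def num_frames(num_samples: int) -> int:
    """Number of output frames produced by an unpadded 1-D convolution stack."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        length = (length - kernel) // stride + 1
    return length


# One second of 16 kHz audio.
print(num_frames(16_000))  # 49
```

Under this assumption the combined stride is 320 samples, so each frame advances by 20 ms and one second of audio yields 49 embedding vectors.
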
+## Speech recognition results
+After fine-tuning, the model achieved the following word error rates (WER) on public datasets:
+- Slovak portion of CommonVoice v13.0: **WER = 8.82%**
+- Slovak portion of VoxPopuli: **WER = 8.88%**
+See our paper for details.
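
For readers unfamiliar with the metric, WER is the word-level edit distance between the reference and the recognized text, divided by the number of reference words. The sketch below follows that standard definition; the README does not say which tool or text normalization produced the figures above, and the function name `word_error_rate` and the example sentences are made up for illustration:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four reference words.
print(word_error_rate("dnes je pekný deň", "dnes bol pekný deň"))  # 0.25
```

A corpus-level WER is usually computed by summing the edit distances and the reference lengths over all utterances before dividing, rather than averaging per-utterance rates.
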
+## Paper
+The preprint of our paper (accepted to TSD 2023) is available at TBD
+## Citation
+If you find this model useful, please cite our paper:
+```
+@inproceedings{wav2vec2-base-cs-80k-ClTRUS,
+  title = {{Transfer Learning of Transformer-based Speech Recognition Models from Czech to Slovak}},
+  author = {
+    Jan Lehe\v{c}ka and
+    Josef V. Psutka and
+    Josef Psutka
+  },
+  booktitle = {{TSD} 2023},
+  publisher = {{Springer}},
+  year = {2023},
+  note = {(in press)},
+}
+```
+## Related work
+- [fav-kky/wav2vec2-base-cs-80k-ClTRUS](https://huggingface.co/fav-kky/wav2vec2-base-cs-80k-ClTRUS)