sadrasabouri committed
Commit a0b45c2
1 Parent(s): 6671ee0
Update README.md
README.md CHANGED
@@ -51,11 +51,11 @@ model-index:
 The base model fine-tuned on 108 hours of Commonvoice on 16kHz sampled speech audio. When using the model
 make sure that your speech input is also sampled at 16Khz.
 
-#[Paper](https://arxiv.org/abs/2006.11477)
+# [Paper](https://arxiv.org/abs/2006.11477)
 
-#Authors: Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli
+# Authors: Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli
 
-
+# **Abstract**
 
 #We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can #outperform the best semi-supervised methods while being conceptually simpler. wav2vec 2.0 masks the speech input in the latent space and #solves a contrastive task defined over a quantization of the latent representations which are jointly learned. Experiments using all #labeled data of Librispeech achieve 1.8/3.3 WER on the clean/other test sets. When lowering the amount of labeled data to one hour, wav2vec #2.0 outperforms the previous state of the art on the 100 hour subset while using 100 times less labeled data. Using just ten minutes of #labeled data and pre-training on 53k hours of unlabeled data still achieves 4.8/8.2 WER. This demonstrates the feasibility of speech #recognition with limited amounts of labeled data.
 
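The hunk's substantive change is inserting a space after each `#` so that the `[Paper]`, `Authors`, and new `**Abstract**` lines render as headings in CommonMark-style renderers instead of as literal text. Beyond that, the README's note that input must be sampled at 16 kHz is the one operational requirement here. A minimal transcription sketch follows, assuming the standard `transformers` wav2vec2 CTC API and `torchaudio` for resampling; the model ID and the `speech.wav` path are placeholders, not names taken from this repository.

```python
import torch
import torchaudio
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

# Placeholder ID: substitute the actual repository this README describes.
model_id = "<this-repo-id>"
processor = Wav2Vec2Processor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)

# Load audio and resample to 16 kHz, the rate the model was fine-tuned on.
waveform, sample_rate = torchaudio.load("speech.wav")  # hypothetical file
if sample_rate != 16_000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16_000)

# Normalize the raw waveform, run the model, and greedily decode CTC logits.
inputs = processor(waveform.squeeze(0).numpy(), sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids)[0])
```

Resampling up front matters because the model's convolutional feature extractor was trained on 16 kHz input; audio at another rate will still run, but transcription quality degrades silently.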
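The abstract's pre-training recipe (mask latent frames, then solve a contrastive task over quantized targets) can be made concrete with a toy sketch. This is a simplified, hypothetical stand-in for the paper's objective, an InfoNCE-style loss over cosine similarities, with random tensors in place of learned latents; the names and shapes below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(context, targets, distractors, temperature=0.1):
    # context:     (T, D) transformer outputs at masked time steps
    # targets:     (T, D) true quantized latent for each masked step
    # distractors: (T, K, D) negatives sampled from other time steps
    pos = F.cosine_similarity(context, targets, dim=-1) / temperature                  # (T,)
    neg = F.cosine_similarity(context.unsqueeze(1), distractors, dim=-1) / temperature  # (T, K)
    logits = torch.cat([pos.unsqueeze(1), neg], dim=1)  # true target is class 0
    labels = torch.zeros(logits.size(0), dtype=torch.long)
    return F.cross_entropy(logits, labels)

# Toy call: random tensors stand in for learned representations.
T, K, D = 8, 10, 256
loss = contrastive_loss(torch.randn(T, D), torch.randn(T, D), torch.randn(T, K, D))
print(loss.item())
```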