sadrasabouri committed
Commit a0b45c2
1 Parent(s): 6671ee0
Update README.md
README.md CHANGED
@@ -51,11 +51,11 @@ model-index:
 The base model fine-tuned on 108 hours of Commonvoice on 16kHz sampled speech audio. When using the model
 make sure that your speech input is also sampled at 16Khz.
 
-#[Paper](https://arxiv.org/abs/2006.11477)
+# [Paper](https://arxiv.org/abs/2006.11477)
 
-#Authors: Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli
+# Authors: Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli
 
-
+# **Abstract**
 
 #We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can #outperform the best semi-supervised methods while being conceptually simpler. wav2vec 2.0 masks the speech input in the latent space and #solves a contrastive task defined over a quantization of the latent representations which are jointly learned. Experiments using all #labeled data of Librispeech achieve 1.8/3.3 WER on the clean/other test sets. When lowering the amount of labeled data to one hour, wav2vec #2.0 outperforms the previous state of the art on the 100 hour subset while using 100 times less labeled data. Using just ten minutes of #labeled data and pre-training on 53k hours of unlabeled data still achieves 4.8/8.2 WER. This demonstrates the feasibility of speech #recognition with limited amounts of labeled data.
 
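The hunk's substantive change is inserting a space after each `#` so that the `[Paper]`, `Authors`, and new `**Abstract**` lines render as headings in CommonMark-style renderers instead of as literal text. Beyond that, the README's note that input must be sampled at 16 kHz is the one operational requirement here. A minimal transcription sketch follows, assuming the standard `transformers` wav2vec2 CTC API and `torchaudio` for resampling; the model ID and the `speech.wav` path are placeholders, not names taken from this repository.

```python
import torch
import torchaudio
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

# Placeholder ID: substitute the actual repository this README describes.
model_id = "<this-repo-id>"
processor = Wav2Vec2Processor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)

# Load audio and resample to 16 kHz, the rate the model was fine-tuned on.
waveform, sample_rate = torchaudio.load("speech.wav")  # hypothetical file
if sample_rate != 16_000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16_000)

# Normalize the raw waveform, run the model, and greedily decode CTC logits.
inputs = processor(waveform.squeeze(0).numpy(), sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids)[0])
```

Resampling up front matters because the model's convolutional feature extractor was trained on 16 kHz input; audio at another rate will still run, but transcription quality degrades silently.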
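The abstract's pre-training recipe (mask latent frames, then solve a contrastive task over quantized targets) can be made concrete with a toy sketch. This is a simplified, hypothetical stand-in for the paper's objective, an InfoNCE-style loss over cosine similarities, with random tensors in place of learned latents; the names and shapes below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(context, targets, distractors, temperature=0.1):
    # context:     (T, D) transformer outputs at masked time steps
    # targets:     (T, D) true quantized latent for each masked step
    # distractors: (T, K, D) negatives sampled from other time steps
    pos = F.cosine_similarity(context, targets, dim=-1) / temperature                  # (T,)
    neg = F.cosine_similarity(context.unsqueeze(1), distractors, dim=-1) / temperature  # (T, K)
    logits = torch.cat([pos.unsqueeze(1), neg], dim=1)  # true target is class 0
    labels = torch.zeros(logits.size(0), dtype=torch.long)
    return F.cross_entropy(logits, labels)

# Toy call: random tensors stand in for learned representations.
T, K, D = 8, 10, 256
loss = contrastive_loss(torch.randn(T, D), torch.randn(T, D), torch.randn(T, K, D))
print(loss.item())
```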