---
language: fa
datasets:
- common_voice_6_1
tags:
- audio
- automatic-speech-recognition
license: apache-2.0
#widget:
#- example_title: Librispeech sample 1
#  src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
#- example_title: Librispeech sample 2
#  src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
model-index:
- name: Sharif-wav2vec2
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice Corpus 6.1 (clean)
      type: common_voice_6_1
      config: clean
      split: test
      args:
        language: fa
    metrics:
    - name: Test WER
      type: wer
      value: 6.0
  #- task:
  #    name: Automatic Speech Recognition
  #    type: automatic-speech-recognition
  #  dataset:
  #    name: LibriSpeech (other)
  #    type: librispeech_asr
  #    config: other
  #    split: test
  #    args:
  #      language: en
  #  metrics:
  #  - name: Test WER
  #    type: wer
  #    value: 8.6
---

# Sharif-wav2vec2

[Sharif-wav2vec2](https://huggingface.co/SLPL/Sharif-wav2vec2/)

Before using the model, you may need to install the following dependencies:

```shell
pip -q install pyctcdecode
python -m pip -q install pypi-kenlm
```

Then you can use it with:

```python
import torchaudio
import torch
import librosa
import numpy as np
from transformers import AutoProcessor, AutoModelForCTC

# load the processor and the fine-tuned CTC model
processor = AutoProcessor.from_pretrained("SLPL/Sharif-wav2vec2")
model = AutoModelForCTC.from_pretrained("SLPL/Sharif-wav2vec2")

# read the audio file and convert it to a 1-D NumPy array
speech_array, sampling_rate = torchaudio.load("test.wav")
speech_array = speech_array.squeeze().numpy()

# resample to the rate the feature extractor expects
# (keyword arguments are required by recent librosa releases)
speech_array = librosa.resample(
    np.asarray(speech_array),
    orig_sr=sampling_rate,
    target_sr=processor.feature_extractor.sampling_rate)

features = processor(
    speech_array,
    sampling_rate=processor.feature_extractor.sampling_rate,
    return_tensors="pt",
    padding=True)

input_values = features.input_values
attention_mask = features.attention_mask

with torch.no_grad():
    logits = model(input_values, attention_mask=attention_mask).logits

# decode the logits into text
prediction = processor.batch_decode(logits.numpy()).text

print(prediction[0])
# تست
```

# [Paper](https://arxiv.org/abs/2006.11477)

The base model was fine-tuned on 108 hours of Common Voice speech audio sampled at 16 kHz. When using the model, make sure that your speech input is also sampled at 16 kHz.

Authors: Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli

**Abstract**

We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler. wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations which are jointly learned. Experiments using all labeled data of Librispeech achieve 1.8/3.3 WER on the clean/other test sets. When lowering the amount of labeled data to one hour, wav2vec 2.0 outperforms the previous state of the art on the 100 hour subset while using 100 times less labeled data. Using just ten minutes of labeled data and pre-training on 53k hours of unlabeled data still achieves 4.8/8.2 WER. This demonstrates the feasibility of speech recognition with limited amounts of labeled data.

The original model can be found under https://github.com/pytorch/fairseq/tree/master/examples/wav2vec#wav2vec-20.
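
The 16 kHz requirement stated above can be checked before any audio reaches the model. The following is a minimal sketch that reads the sample rate from a WAV header using only Python's standard `wave` module. The names `TARGET_SAMPLE_RATE`, `wav_sample_rate`, and `needs_resampling` are hypothetical helpers introduced for illustration, and the sketch assumes an uncompressed PCM WAV file; other formats would need a different reader.

```python
import wave

# Assumption: 16 kHz, the rate this model card says the model expects.
TARGET_SAMPLE_RATE = 16_000


def wav_sample_rate(path):
    """Return the sample rate (in Hz) stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, target_rate=TARGET_SAMPLE_RATE):
    """Return True if the file's sample rate differs from the target rate."""
    return wav_sample_rate(path) != target_rate
```

If `needs_resampling("test.wav")` returns `True`, the audio should be resampled to 16 kHz before it is passed to the processor.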
# Usage

To transcribe Persian audio files, the model can be used as a standalone acoustic model as follows:

```python
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
import librosa
import torch

# load model and processor
processor = Wav2Vec2Processor.from_pretrained("SLPL/Sharif-wav2vec2")
model = Wav2Vec2ForCTC.from_pretrained("SLPL/Sharif-wav2vec2")

# read a sound file as 16 kHz mono; "test.wav" is a placeholder path
speech, sampling_rate = librosa.load("test.wav", sr=16_000)

# extract input features
input_values = processor(speech, sampling_rate=sampling_rate, return_tensors="pt", padding="longest").input_values  # Batch size 1

# retrieve logits
with torch.no_grad():
    logits = model(input_values).logits

# take argmax and decode
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
```

## Evaluation

This code snippet shows how to evaluate **facebook/wav2vec2-base-960h** on LibriSpeech's "clean" and "other" test data.

```python
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch
from jiwer import wer

# pass "other" instead of "clean" to evaluate on the "other" test set
librispeech_eval = load_dataset("librispeech_asr", "clean", split="test")

model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h").to("cuda")
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")

def map_to_pred(batch):
    # map() is called once per example here, so batch["audio"] is a single dict
    input_values = processor(batch["audio"]["array"], return_tensors="pt", padding="longest").input_values
    with torch.no_grad():
        logits = model(input_values.to("cuda")).logits

    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.batch_decode(predicted_ids)
    batch["transcription"] = transcription[0]
    return batch

result = librispeech_eval.map(map_to_pred, remove_columns=["audio"])

print("WER:", wer(result["text"], result["transcription"]))
```

*Result (WER)* for **facebook/wav2vec2-base-960h**:

| "clean" | "other" |
|---|---|
| 3.4 | 8.6 |
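
The WER metric used in the evaluation above is the word-level edit distance (substitutions, deletions, and insertions) between a reference and a hypothesis, divided by the number of reference words. The following is a minimal sketch of that definition for a single sentence pair, written without `jiwer`. The function name `word_error_rate` is hypothetical, and it assumes simple whitespace tokenization with no text normalization, so its values may differ from a library that normalizes or aggregates over a whole corpus.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous_row[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row

    return previous_row[-1] / len(ref_words)
```

The function returns a fraction, so it would be multiplied by 100 for comparison with percentage-style figures such as those in the table.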