---
language: en
datasets:
- patrickvonplaten/librispeech_asr_dummy
metrics:
- wer
tags:
- audio
- automatic-speech-recognition
- en
- speech
---

# Fine-tuned facebook/wav2vec2-base model for speech recognition in English

Fine-tuned [facebook/wav2vec2-base](https://huggingface.co/facebook/wav2vec2-base) on English using the train and validation splits of [zodata](https://www.kaggle.com/datasets/mohamedk0emad/zodata). The dataset has 307,912 transcribed voice samples; we used 6,158 samples for training and 6,036 samples for testing. The result on the test samples, measured by word error rate (WER), is:

Test WER: 0.340

When using this model, make sure that your speech input is sampled at 16 kHz.

This model was fine-tuned thanks to the GPU credits provided by [Kaggle](https://www.kaggle.com/).

# Usage

To transcribe audio files, the model can be used as a standalone acoustic model as follows:

```python
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
from datasets import load_dataset
import torch

# load model and processor
processor = Wav2Vec2Processor.from_pretrained("souzan/zomodel")
model = Wav2Vec2ForCTC.from_pretrained("souzan/zomodel")

# load dummy dataset and read soundfiles
ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")

# extract input features from the raw waveform (16 kHz audio, batch size 1)
input_values = processor(ds[0]["audio"]["array"], sampling_rate=16_000, return_tensors="pt", padding="longest").input_values

# retrieve logits
logits = model(input_values).logits

# take argmax and decode
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
```

## Evaluation

This code snippet shows how to evaluate **souzan/zomodel** on LibriSpeech's "clean" test data. To evaluate on the "other" test data, pass `"other"` instead of `"clean"` to `load_dataset`.

```python
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch
from jiwer import wer

librispeech_eval = load_dataset("librispeech_asr", "clean", split="test")

model = Wav2Vec2ForCTC.from_pretrained("souzan/zomodel").to("cuda")
processor = Wav2Vec2Processor.from_pretrained("souzan/zomodel")

def map_to_pred(batch):
    # with batched=True and batch_size=1, batch["audio"] is a list holding one example
    input_values = processor(batch["audio"][0]["array"], sampling_rate=16_000, return_tensors="pt", padding="longest").input_values
    with torch.no_grad():
        logits = model(input_values.to("cuda")).logits

    predicted_ids = torch.argmax(logits, dim=-1)
    # batch_decode returns a list with one transcription, matching the batch size
    transcription = processor.batch_decode(predicted_ids)
    batch["transcription"] = transcription
    return batch

result = librispeech_eval.map(map_to_pred, batched=True, batch_size=1, remove_columns=["audio"])

print("WER:", wer(result["text"], result["transcription"]))
```
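
For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions, deletions and insertions) between the reference and the predicted transcription, divided by the number of reference words. The block below is a minimal pure-Python sketch of that definition, not the `jiwer` implementation. The function name `word_error_rate` is hypothetical, and the sketch assumes plain whitespace tokenization with no text normalization and a single reference/hypothesis pair.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # (initially none) and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# one substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because the denominator is the reference length, a hypothesis with many inserted words can give a WER above 1.0, so lower is better and the value is not an accuracy.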