--- language: en datasets: - librispeech_asr tags: - audio - automatic-speech-recognition - hf-asr-leaderboard license: apache-2.0 widget: - example_title: Librispeech sample 1 src: https://cdn-media.huggingface.co/speech_samples/sample1.flac - example_title: Librispeech sample 2 src: https://cdn-media.huggingface.co/speech_samples/sample2.flac model-index: - name: patrickvonplaten/wav2vec2-large-960h-lv60-self-4-gram results: - task: name: Automatic Speech Recognition type: automatic-speech-recognition dataset: name: Librispeech (clean) type: librispeech_asr args: en metrics: - name: Test WER type: wer value: 2.59 --- # Wav2Vec2-Base-960h + 4-gram This model is identical to [Facebook's Wav2Vec2-Large-960h-lv60-self](https://huggingface.co/facebook/wav2vec2-large-960h-lv60-self), but is augmented with an English 4-gram. The `4-gram.arpa.gz` of [Librispeech's official ngrams](https://www.openslr.org/11) is used. ## Evaluation This code snippet shows how to evaluate **patrickvonplaten/wav2vec2-large-960h-lv60-self-4-gram** on LibriSpeech's "clean" and "other" test data. ```python from datasets import load_dataset from transformers import AutoModelForCTC, AutoProcessor import torch from jiwer import wer model_id = "patrickvonplaten/wav2vec2-large-960h-lv60-self-4-gram" librispeech_eval = load_dataset("librispeech_asr", "other", split="test") model = AutoModelForCTC.from_pretrained(model_id).to("cuda") processor = AutoProcessor.from_pretrained(model_id) def map_to_pred(batch): inputs = processor(batch["audio"]["array"], sampling_rate=16_000, return_tensors="pt") inputs = {k: v.to("cuda") for k,v in inputs.items()} with torch.no_grad(): logits = model(**inputs).logits transcription = processor.batch_decode(logits.cpu().numpy()).text[0] batch["transcription"] = transcription return batch result = librispeech_eval.map(map_to_pred, remove_columns=["audio"]) print(wer(result["text"], result["transcription"])) ``` *Result (WER)*: | "clean" | "other" | |---|---| | 1.84 | 3.71 |