metadata

language: en
datasets:
  - librispeech_asr
tags:
  - audio
  - automatic-speech-recognition
  - hf-asr-leaderboard
license: apache-2.0
widget:
  - example_title: Librispeech sample 1
    src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
  - example_title: Librispeech sample 2
    src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
model-index:
  - name: patrickvonplaten/wav2vec2-base-960h-4-gram
    results:
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: Librispeech (clean)
          type: librispeech_asr
          args: en
        metrics:
          - name: Test WER
            type: wer
            value: 2.59

Wav2Vec2-Base-960h + 4-gram

This model is identical to Facebook's Wav2Vec2-Base-960h, but is augmented with an English 4-gram language model. The language model is built from the 4-gram.arpa.gz file of LibriSpeech's official n-grams.
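To give a feel for what the 4-gram adds, here is a minimal, self-contained sketch. It is not the decoder used by this model: the real setup applies the language model during beam search over the CTC logits, and real ARPA models use more sophisticated smoothing and backoff. The toy corpus, the add-one smoothing, the acoustic scores, and the weight `alpha` are all assumptions made up for illustration.

```python
import math
from collections import Counter

N = 4  # n-gram order: each word is predicted from the 3 words before it


def ngrams(words, n=N):
    """Return all n-grams of a sentence, padded with start/end markers."""
    padded = ["<s>"] * (n - 1) + words + ["</s>"]
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]


def train(corpus, n=N):
    """Count n-grams and their (n-1)-word contexts in a toy corpus."""
    ngram_counts, context_counts, vocab = Counter(), Counter(), {"</s>"}
    for sentence in corpus:
        words = sentence.split()
        vocab.update(words)
        for gram in ngrams(words, n):
            ngram_counts[gram] += 1
            context_counts[gram[:-1]] += 1
    return ngram_counts, context_counts, len(vocab)


def log_prob(sentence, model, n=N):
    """Log10 probability of a sentence under the toy model (add-one smoothing)."""
    ngram_counts, context_counts, vocab_size = model
    total = 0.0
    for gram in ngrams(sentence.split(), n):
        numerator = ngram_counts[gram] + 1
        denominator = context_counts[gram[:-1]] + vocab_size
        total += math.log10(numerator / denominator)
    return total


def rescore(candidates, model, alpha=0.5):
    """Pick the candidate with the best acoustic score + alpha * LM score."""
    return max(candidates, key=lambda c: c[1] + alpha * log_prob(c[0], model))[0]


# Hypothetical training text and hypothetical acoustic scores.
toy_corpus = ["the cat sat on the mat", "the cat ate the fish", "a dog sat on the rug"]
toy_lm = train(toy_corpus)
candidates = [("the cat sat on the mat", -4.1), ("the cat sat on the matt", -4.0)]

best = rescore(candidates, toy_lm)
print(best)
```

In this toy setup the acoustically preferred candidate ends in the unseen word "matt"; once the language model score is added, the candidate made of word sequences seen in the corpus wins instead.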

Evaluation

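This code snippet shows how to evaluate patrickvonplaten/wav2vec2-base-960h-4-gram on LibriSpeech's "clean" and "other" test data. As written it loads the "other" configuration; pass "clean" to load_dataset instead to evaluate on the clean test set.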

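import torch
from datasets import load_dataset
from jiwer import wer
from transformers import AutoModelForCTC, AutoProcessor

model_id = "patrickvonplaten/wav2vec2-base-960h-4-gram"

# Test split of the "other" configuration.
librispeech_eval = load_dataset("librispeech_asr", "other", split="test")

model = AutoModelForCTC.from_pretrained(model_id).to("cuda")
# Decoding with the language model requires the pyctcdecode and kenlm packages.
processor = AutoProcessor.from_pretrained(model_id)

def map_to_pred(batch):
    # Extract input features from the raw 16 kHz waveform.
    inputs = processor(batch["audio"]["array"], sampling_rate=16_000, return_tensors="pt")

    # Move the input tensors to the same device as the model.
    inputs = {k: v.to("cuda") for k, v in inputs.items()}

    # Run inference without tracking gradients.
    with torch.no_grad():
        logits = model(**inputs).logits

    # Beam-search decode the logits with the 4-gram language model.
    transcription = processor.batch_decode(logits.cpu().numpy()).text[0]
    batch["transcription"] = transcription
    return batch

result = librispeech_eval.map(map_to_pred, remove_columns=["audio"])

# jiwer returns WER as a fraction; multiply by 100 to get a percentage.
print(wer(result["text"], result["transcription"]))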

Result (WER, in %):

"clean"	"other"
2.59	6.46
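For readers unfamiliar with the metric, the following is a minimal sketch of how a word error rate can be computed: the word-level edit distance (substitutions, deletions, insertions) summed over all utterances, divided by the total number of reference words. The function name `corpus_wer` and the example sentences are hypothetical, and the sketch is only assumed to match the corpus-level aggregation used by the evaluation library; it is not that library's implementation.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance between two lists of words."""
    # prev[j] holds the distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def corpus_wer(references, hypotheses):
    """Total word errors divided by total reference words, as a fraction."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    errors = 0
    ref_word_count = 0
    for reference, hypothesis in zip(references, hypotheses):
        ref_words, hyp_words = reference.split(), hypothesis.split()
        errors += edit_distance(ref_words, hyp_words)
        ref_word_count += len(ref_words)
    if ref_word_count == 0:
        raise ValueError("references contain no words")
    return errors / ref_word_count


# Hypothetical references and transcriptions: one deletion, one substitution.
references = ["the cat sat on the mat", "hello world"]
hypotheses = ["the cat sat on mat", "hello word"]

score = corpus_wer(references, hypotheses)
print(f"{100 * score:.2f}%")  # prints "25.00%"
```

The values in the table above are percentages, so a fractional score such as 0.0646 corresponds to a WER of 6.46.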