---
language: en
datasets:
- librispeech_asr
tags:
- speech
- audio
- automatic-speech-recognition
- hf-asr-leaderboard
license: apache-2.0
model-index:
- name: wav2vec2-conformer-rel-pos-large-960h-ft-4-gram
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Librispeech (clean)
      type: librispeech_asr
      args: en
    metrics:
    - name: Test WER
      type: wer
      value: 1.94
---

# Wav2Vec2-Conformer-Large-960h with Relative Position Embeddings + 4-gram

This model is identical to [Facebook's wav2vec2-conformer-rel-pos-large-960h-ft](https://huggingface.co/facebook/wav2vec2-conformer-rel-pos-large-960h-ft), but is
augmented with an English 4-gram language model. The `4-gram.arpa.gz` file from [Librispeech's official n-grams](https://www.openslr.org/11) is used.
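
To give an intuition for what the n-gram adds, here is a minimal sketch of combining an acoustic score with a language-model score. Everything in it is hypothetical: the candidate transcriptions, their acoustic log-probabilities, the toy bigram table, and the `alpha`/`beta` weights are made up for illustration. The actual processor applies the 4-gram during beam search over the CTC outputs, whereas this sketch only rescores two finished hypotheses.

```python
# Hypothetical acoustic log-probabilities for two candidate transcriptions
# of the same audio (made-up numbers, for illustration only).
candidates = {
    "THEIR IS A CAT": -3.0,
    "THERE IS A CAT": -3.2,
}

# Toy bigram log-probabilities standing in for a real n-gram model.
bigram_logp = {
    ("<s>", "THERE"): -1.0,
    ("THERE", "IS"): -0.5,
    ("IS", "A"): -0.5,
    ("A", "CAT"): -1.0,
}


def lm_logp(sentence, bigrams, unk_logp=-5.0):
    """Sum bigram log-probabilities; unseen bigrams get a fixed penalty."""
    words = ["<s>"] + sentence.split()
    return sum(bigrams.get(pair, unk_logp) for pair in zip(words, words[1:]))


def rescore(cands, bigrams, alpha=0.5, beta=0.0):
    """Return the candidate with the best combined score.

    alpha weights the language model, beta is a per-word bonus
    (both are illustrative parameters, not this model's settings).
    """
    def score(item):
        text, acoustic = item
        return acoustic + alpha * lm_logp(text, bigrams) + beta * len(text.split())

    return max(cands.items(), key=score)[0]


print(rescore(candidates, bigram_logp, alpha=0.0))  # acoustic score only
print(rescore(candidates, bigram_logp, alpha=0.5))  # acoustic + language model
```

With `alpha=0.0` the acoustically preferred but ungrammatical candidate wins; with the toy language model switched on, the grammatical one does.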
 
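## Evaluation

The code snippet below evaluates **patrickvonplaten/wav2vec2-conformer-rel-pos-large-960h-ft-4-gram** on LibriSpeech's "other" test set. To evaluate on the "clean" test set, pass `"clean"` instead of `"other"` to `load_dataset`. Decoding with the language model requires the `pyctcdecode` and `kenlm` packages.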
 
```python
from datasets import load_dataset
from transformers import AutoModelForCTC, AutoProcessor
import torch
from jiwer import wer

model_id = "patrickvonplaten/wav2vec2-conformer-rel-pos-large-960h-ft-4-gram"

librispeech_eval = load_dataset("librispeech_asr", "other", split="test")

model = AutoModelForCTC.from_pretrained(model_id).to("cuda")
processor = AutoProcessor.from_pretrained(model_id)

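def map_to_pred(batch):
    # Convert the raw 16 kHz waveform into model inputs
    inputs = processor(batch["audio"]["array"], sampling_rate=16_000, return_tensors="pt")

    # Move all input tensors to the same device as the model
    inputs = {k: v.to("cuda") for k, v in inputs.items()}

    with torch.no_grad():
        logits = model(**inputs).logits

    # Decode the logits with the 4-gram language model and keep the best transcription
    transcription = processor.batch_decode(logits.cpu().numpy()).text[0]
    batch["transcription"] = transcription
    return batch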

result = librispeech_eval.map(map_to_pred, remove_columns=["audio"])

# jiwer returns the WER as a fraction; multiply by 100 to get a percentage
print(wer(result["text"], result["transcription"]))
```

*Result (WER, in %)*:

| "clean" | "other" |
|---|---|
| 1.94 | 3.54 |
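
For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions, deletions, insertions) between reference and prediction, divided by the number of reference words. The sketch below is a plain-Python illustration of that definition, not the `jiwer` implementation; `word_error_rate` is a hypothetical helper, and it assumes edits and reference words are summed over the whole corpus before dividing.

```python
def word_error_rate(references, hypotheses):
    """Corpus-level WER: total word edits / total reference words."""
    total_edits = 0
    total_words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words, hyp_words = ref.split(), hyp.split()
        # Levenshtein distance over words, computed one row at a time
        prev = list(range(len(hyp_words) + 1))
        for i, r in enumerate(ref_words, start=1):
            curr = [i]
            for j, h in enumerate(hyp_words, start=1):
                curr.append(min(
                    prev[j] + 1,             # deletion
                    curr[j - 1] + 1,         # insertion
                    prev[j - 1] + (r != h),  # substitution (or match)
                ))
            prev = curr
        total_edits += prev[-1]
        total_words += len(ref_words)
    return total_edits / total_words


# One substitution ("QUIK") and one deletion ("FOX") over four reference words
score = word_error_rate(["THE QUICK BROWN FOX"], ["THE QUIK BROWN"])
print(f"{100 * score:.2f}")  # prints 50.00
```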