---
language: en
datasets:
- librispeech_asr
tags:
- speech
- audio
- automatic-speech-recognition
- hf-asr-leaderboard
license: apache-2.0
model-index:
- name: wav2vec2-conformer-rope-large-960h-ft-4-gram
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: LibriSpeech (clean)
      type: librispeech_asr
      config: clean
      split: test
      args: 
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 1.88
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: LibriSpeech (other)
      type: librispeech_asr
      config: other
      split: test
      args: 
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 3.57
---

# Wav2Vec2-Conformer-Large-960h with Rotary Position Embeddings + 4-gram

This model is identical to [Facebook's wav2vec2-conformer-rope-large-960h-ft](https://huggingface.co/facebook/wav2vec2-conformer-rope-large-960h-ft), but is augmented with an English 4-gram language model. The `4-gram.arpa.gz` file from [LibriSpeech's official n-grams](https://www.openslr.org/11) is used.
 
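## Evaluation

This code snippet shows how to evaluate **patrickvonplaten/wav2vec2-conformer-rope-large-960h-ft-4-gram** on LibriSpeech's "clean" and "other" test data. As written, it loads the "other" test set; pass `"clean"` as the dataset config to evaluate on the clean set instead.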
 
```python
# Loading the processor with its language model requires the `pyctcdecode` and `kenlm` packages.
from datasets import load_dataset
from transformers import AutoModelForCTC, AutoProcessor
import torch
from jiwer import wer

model_id = "patrickvonplaten/wav2vec2-conformer-rope-large-960h-ft-4-gram"

# Use "clean" instead of "other" to evaluate on the clean test set.
librispeech_eval = load_dataset("librispeech_asr", "other", split="test")

model = AutoModelForCTC.from_pretrained(model_id).to("cuda")
processor = AutoProcessor.from_pretrained(model_id)

def map_to_pred(batch):
    # Convert the raw waveform into model inputs (LibriSpeech audio is sampled at 16 kHz).
    inputs = processor(batch["audio"]["array"], sampling_rate=16_000, return_tensors="pt")

    inputs = {k: v.to("cuda") for k, v in inputs.items()}

    with torch.no_grad():
        logits = model(**inputs).logits

    # Decode the logits with the 4-gram language model; `.text` holds the decoded strings.
    transcription = processor.batch_decode(logits.cpu().numpy()).text[0]
    batch["transcription"] = transcription
    return batch

result = librispeech_eval.map(map_to_pred, remove_columns=["audio"])

# jiwer returns WER as a fraction; multiply by 100 to get a percentage.
print(wer(result["text"], result["transcription"]))
```

*Result (WER, in %)*:

| "clean" | "other" |
|---|---|
| 1.88 | 3.57 |
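
For readers unfamiliar with the metric, the following is a minimal sketch of how a word error rate can be computed: the word-level edit distance (substitutions, deletions, insertions) summed over all utterances, divided by the total number of reference words. The function name `word_error_rate` is hypothetical and this is not jiwer's implementation; it only assumes whitespace tokenization and corpus-level aggregation.

```python
def word_error_rate(references, hypotheses):
    """Corpus-level WER: total word edits divided by total reference words."""
    total_edits = 0
    total_words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words, hyp_words = ref.split(), hyp.split()
        # Row-by-row Levenshtein distance computed over words instead of characters.
        prev = list(range(len(hyp_words) + 1))
        for i, r in enumerate(ref_words, start=1):
            curr = [i]
            for j, h in enumerate(hyp_words, start=1):
                cost = 0 if r == h else 1
                curr.append(min(prev[j] + 1,          # deletion
                                curr[j - 1] + 1,      # insertion
                                prev[j - 1] + cost))  # substitution or match
            prev = curr
        total_edits += prev[-1]
        total_words += len(ref_words)
    return total_edits / total_words
```

The returned value is a fraction, so it would need to be multiplied by 100 to be compared with the percentages in the table above.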