patrickvonplaten committed on
Commit e8e5b69
1 Parent(s): 1e9cbae

Update README.md

Files changed (1)
  1. README.md +20 -59
README.md CHANGED
@@ -9,7 +9,7 @@ tags:
 - hf-asr-leaderboard
 license: apache-2.0
 model-index:
-- name: wav2vec2-conformer-rel-pos-large-960h-ft
+- name: wav2vec2-conformer-rel-pos-large-960h-ft-4-gram
   results:
   - task:
       name: Automatic Speech Recognition
@@ -21,89 +21,50 @@ model-index:
     metrics:
     - name: Test WER
       type: wer
-      value: 1.85
+      value: --
 ---
 
-# Wav2Vec2-Conformer-Large-960h with Relative Position Embeddings
+# Wav2Vec2-Conformer-Large-960h with Relative Position Embeddings + 4-gram
 
-[Facebook's Wav2Vec2 Conformer (TODO-add link)]()
-
-Wav2Vec2 Conformer with relative position embeddings, pretrained and fine-tuned on 960 hours of Librispeech on 16kHz sampled speech audio. When using the model, make sure that your speech input is also sampled at 16kHz.
-
-[Paper (TODO)](https://arxiv.org/abs/2006.11477)
-
-Authors: ...
-
-**Abstract**
-
-...
-
-The original model can be found under https://github.com/pytorch/fairseq/tree/master/examples/wav2vec#wav2vec-20.
-
-
-# Usage
-
-To transcribe audio files, the model can be used as a standalone acoustic model as follows:
-
-```python
-from transformers import Wav2Vec2Processor, Wav2Vec2ConformerForCTC
-from datasets import load_dataset
-import torch
-
-# load model and processor
-processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-conformer-rel-pos-large-960h-ft")
-model = Wav2Vec2ConformerForCTC.from_pretrained("facebook/wav2vec2-conformer-rel-pos-large-960h-ft")
-
-# load dummy dataset and read soundfiles
-ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
-
-# tokenize
-input_values = processor(ds[0]["audio"]["array"], return_tensors="pt", padding="longest").input_values
-
-# retrieve logits
-logits = model(input_values).logits
-
-# take argmax and decode
-predicted_ids = torch.argmax(logits, dim=-1)
-transcription = processor.batch_decode(predicted_ids)
-```
-
-## Evaluation
-
-This code snippet shows how to evaluate **facebook/wav2vec2-conformer-rel-pos-large-960h-ft** on LibriSpeech's "clean" and "other" test data.
-
-
-```python
-from datasets import load_dataset
-from transformers import Wav2Vec2ConformerForCTC, Wav2Vec2Processor
-import torch
-from jiwer import wer
-
-librispeech_eval = load_dataset("librispeech_asr", "clean", split="test")
-
-model = Wav2Vec2ConformerForCTC.from_pretrained("facebook/wav2vec2-conformer-rel-pos-large-960h-ft").to("cuda")
-processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-conformer-rel-pos-large-960h-ft")
-
-def map_to_pred(batch):
-    inputs = processor(batch["audio"]["array"], return_tensors="pt", padding="longest")
-    input_values = inputs.input_values.to("cuda")
-    attention_mask = inputs.attention_mask.to("cuda")
-
-    with torch.no_grad():
-        logits = model(input_values, attention_mask=attention_mask).logits
-
-    predicted_ids = torch.argmax(logits, dim=-1)
-    transcription = processor.batch_decode(predicted_ids)[0]
-    batch["transcription"] = transcription
-    return batch
-
-result = librispeech_eval.map(map_to_pred, remove_columns=["audio"])
-
-print("WER:", wer(result["text"], result["transcription"]))
-```
-
-*Result (WER)*:
-
-| "clean" | "other" |
-|---|---|
-| 1.85 | 3.82 |
+This model is identical to [Facebook's wav2vec2-conformer-rel-pos-large-960h-ft](https://huggingface.co/facebook/wav2vec2-conformer-rel-pos-large-960h-ft), but is
+augmented with an English 4-gram language model. The `4-gram.arpa.gz` from [Librispeech's official ngrams](https://www.openslr.org/11) is used.
+
+## Evaluation
+
+This code snippet shows how to evaluate **patrickvonplaten/wav2vec2-conformer-rel-pos-large-960h-ft-4-gram** on LibriSpeech's "clean" and "other" test data.
+
+```python
+from datasets import load_dataset
+from transformers import AutoModelForCTC, AutoProcessor
+import torch
+from jiwer import wer
+
+model_id = "patrickvonplaten/wav2vec2-conformer-rel-pos-large-960h-ft-4-gram"
+
+librispeech_eval = load_dataset("librispeech_asr", "other", split="test")
+
+model = AutoModelForCTC.from_pretrained(model_id).to("cuda")
+processor = AutoProcessor.from_pretrained(model_id)
+
+def map_to_pred(batch):
+    inputs = processor(batch["audio"]["array"], sampling_rate=16_000, return_tensors="pt")
+
+    inputs = {k: v.to("cuda") for k, v in inputs.items()}
+
+    with torch.no_grad():
+        logits = model(**inputs).logits
+
+    transcription = processor.batch_decode(logits.cpu().numpy()).text[0]
+    batch["transcription"] = transcription
+    return batch
+
+result = librispeech_eval.map(map_to_pred, remove_columns=["audio"])
+
+print(wer(result["text"], result["transcription"]))
+```
+
+*Result (WER)*:
+
+| "clean" | "other" |
+|---|---|
+| -- | -- |