andreagasparini
committed
Commit d02ba4f • 1 Parent(s): 54074b1
Fixes evaluation instructions and updates WER scores
Hi, I was trying to evaluate the model on LibriSpeech's "clean" and "other" test data following the code snippet in the model card, but I got a `TypeError` in the `map_to_pred` function because the transcriptions are stored in the batch wrapped in lists instead of as plain strings (e.g. `["transcription example"]` instead of `"transcription example"`):

`TypeError: expected string or bytes-like object`

After fixing the error I recomputed the WER and updated the scores without approximating them. I think the same should be done for other wav2vec2-based models (e.g. facebook/wav2vec2-large-960h-lv60).
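To make the failure mode concrete, here is a minimal sketch with made-up toy sentences (not actual LibriSpeech data): jiwer's default text transforms apply regex operations that expect plain strings, so a column of one-element lists triggers the `TypeError` above, and unwrapping with `[0]` resolves it.

```python
# Minimal sketch of the bug with toy data (hypothetical sentences, not
# LibriSpeech): jiwer's regex-based transforms expect plain strings.
from jiwer import wer

references = ["a transcription example", "another transcription example"]

# What the old snippet stored: processor.batch_decode() returns a list,
# so every entry in the "transcription" column is a one-element list.
broken = [["a transcription example"], ["another transcription example"]]

# wer(references, broken)  # TypeError: expected string or bytes-like object

# The fix in this commit: unwrap with [0] so plain strings are stored.
fixed = [hypothesis[0] for hypothesis in broken]
print("WER:", wer(references, fixed))  # -> WER: 0.0
```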
README.md CHANGED

@@ -24,7 +24,7 @@ model-index:
     metrics:
     - name: Test WER
       type: wer
-      value: 1.9
+      value: 1.86
   - task:
       name: Automatic Speech Recognition
       type: automatic-speech-recognition
@@ -38,7 +38,7 @@ model-index:
     metrics:
     - name: Test WER
       type: wer
-      value: 3.9
+      value: 3.88
 ---
 
 # Wav2Vec2-Large-960h-Lv60 + Self-Training
@@ -85,9 +85,9 @@ To transcribe audio files the model can be used as a standalone acoustic model a
 transcription = processor.batch_decode(predicted_ids)
 ```
 
-
+## Evaluation
 
-
+This code snippet shows how to evaluate **facebook/wav2vec2-large-960h-lv60-self** on LibriSpeech's "clean" and "other" test data.
 
 ```python
 from datasets import load_dataset
@@ -103,14 +103,14 @@ processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-960h-lv60
 
 def map_to_pred(batch):
     inputs = processor(batch["audio"]["array"], return_tensors="pt", padding="longest")
-    input_values = inputs.input_values.to("cuda")
+    input_values = inputs.input_values.to("cuda")
     attention_mask = inputs.attention_mask.to("cuda")
 
     with torch.no_grad():
         logits = model(input_values, attention_mask=attention_mask).logits
 
     predicted_ids = torch.argmax(logits, dim=-1)
-    transcription = processor.batch_decode(predicted_ids)
+    transcription = processor.batch_decode(predicted_ids)[0]
     batch["transcription"] = transcription
     return batch
 
@@ -123,4 +123,4 @@ print("WER:", wer(result["text"], result["transcription"]))
 
 | "clean" | "other" |
 |---|---|
-| 1.9 | 3.9 |
+| 1.86 | 3.88 |
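For reference, here is the fixed evaluation snippet assembled from the hunks above. The context lines not shown in the diff (imports, dataset and model loading, and the final `map`/`wer` call) are reconstructed from the hunk headers and the usual wav2vec2 model-card scaffolding, so treat those parts as an assumption rather than a verbatim copy of the README; running it requires a CUDA-capable GPU.

```python
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch
from jiwer import wer

# "clean" test split; swap in "other" for the second WER figure
librispeech_eval = load_dataset("librispeech_asr", "clean", split="test")

model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h-lv60-self").to("cuda")
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")

def map_to_pred(batch):
    inputs = processor(batch["audio"]["array"], return_tensors="pt", padding="longest")
    input_values = inputs.input_values.to("cuda")
    attention_mask = inputs.attention_mask.to("cuda")

    with torch.no_grad():
        logits = model(input_values, attention_mask=attention_mask).logits

    predicted_ids = torch.argmax(logits, dim=-1)
    # batch_decode returns a list; [0] unwraps it so the stored column
    # holds plain strings and jiwer's wer() does not raise a TypeError
    transcription = processor.batch_decode(predicted_ids)[0]
    batch["transcription"] = transcription
    return batch

result = librispeech_eval.map(map_to_pred, remove_columns=["audio"])

print("WER:", wer(result["text"], result["transcription"]))
```

Running this once per configuration, with `"clean"` swapped for `"other"` in `load_dataset`, is how the 1.86 and 3.88 figures in the table were recomputed.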