---
language: es
datasets:
  - common_voice
metrics:
  - wer
  - cer
tags:
  - audio
  - automatic-speech-recognition
  - speech
  - xlsr-fine-tuning-week
license: apache-2.0
---

# Wav2Vec2-Large-XLSR-53-Spanish-With-LM

This is a copy of the Wav2Vec2-Large-XLSR-53-Spanish model, extended with language model support.

This model card can be seen as a demo of the pyctcdecode integration with Transformers introduced in this PR. The PR explains in detail how the integration works.

In a nutshell: the PR adds a new `Wav2Vec2ProcessorWithLM` class as a drop-in replacement for `Wav2Vec2Processor`.

Compared to the existing ASR pipeline, the only changes are the ones marked in the diff below:

```diff
-from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
+from transformers import Wav2Vec2ForCTC, Wav2Vec2ProcessorWithLM
from datasets import load_dataset, Audio
import torch

# stream the Spanish Common Voice test split; its audio is 48 kHz,
# so resample it to the 16 kHz the model expects
ds = load_dataset("common_voice", "es", split="test", streaming=True)
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

sample = next(iter(ds))

model = Wav2Vec2ForCTC.from_pretrained("patrickvonplaten/wav2vec2-large-xlsr-53-spanish-with-lm")
-processor = Wav2Vec2Processor.from_pretrained("patrickvonplaten/wav2vec2-large-xlsr-53-spanish-with-lm")
+processor = Wav2Vec2ProcessorWithLM.from_pretrained("patrickvonplaten/wav2vec2-large-xlsr-53-spanish-with-lm")

input_values = processor(sample["audio"]["array"], sampling_rate=16_000, return_tensors="pt").input_values

with torch.no_grad():
    logits = model(input_values).logits

-prediction_ids = torch.argmax(logits, dim=-1)
-transcription = processor.batch_decode(prediction_ids)
+# the LM-boosted processor runs pyctcdecode beam search directly on the logits
+transcription = processor.batch_decode(logits.numpy()).text

print(transcription)
```
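Since `Wav2Vec2ProcessorWithLM` still exposes the underlying CTC tokenizer, greedy decoding and LM-boosted beam-search decoding can be compared side by side. The snippet below is a small illustrative sketch (not part of the original card) that reuses `model`, `processor`, and `input_values` from the diff above:

```python
import torch

# assumes `model`, `processor`, and `input_values` are defined as in the diff above
with torch.no_grad():
    logits = model(input_values).logits

# 1) plain greedy CTC decoding via the wrapped tokenizer (no language model)
pred_ids = torch.argmax(logits, dim=-1)
greedy_text = processor.tokenizer.batch_decode(pred_ids)

# 2) beam-search decoding with the n-gram language model via pyctcdecode
lm_text = processor.batch_decode(logits.numpy()).text

print("greedy :", greedy_text)
print("with LM:", lm_text)
```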
| Model | WER | CER |
| ----- | --- | --- |
| jonatasgrosman/wav2vec2-large-xlsr-53-spanish | 8.81% | 2.70% |
| pcuenq/wav2vec2-large-xlsr-53-es | 10.55% | 3.20% |
| facebook/wav2vec2-large-xlsr-53-spanish | 16.99% | 5.40% |
| mrm8488/wav2vec2-large-xlsr-53-spanish | 19.20% | 5.96% |
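The table above lists WER/CER of other publicly available Spanish XLSR checkpoints. A rough sketch of how such numbers can be computed for this model is shown below. It is not the exact script behind the table: it assumes the `wer`/`cer` metrics from the `evaluate` library, reuses the resampled streaming dataset `ds`, `model`, and `processor` from the diff above, and omits the text normalization (lower-casing, punctuation stripping) that reported figures typically apply.

```python
import torch
from evaluate import load

wer_metric = load("wer")
cer_metric = load("cer")

predictions, references = [], []

# iterate over a small subset of the streamed test split for a quick check
for sample in ds.take(100):
    inputs = processor(sample["audio"]["array"], sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    predictions.append(processor.batch_decode(logits.numpy()).text[0])
    # note: reference transcripts are usually normalized before scoring
    references.append(sample["sentence"])

print("WER:", wer_metric.compute(predictions=predictions, references=references))
print("CER:", cer_metric.compute(predictions=predictions, references=references))
```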