patrickvonplaten's picture
Update README.md
1d7b83d
metadata
language: multilingual
datasets:
  - common_voice
  - multilingual_librispeech
  - covost2
tags:
  - speech
  - xls_r
  - automatic-speech-recognition
pipeline_tag: automatic-speech-recognition
license: apache-2.0

Wav2Vec2-XLS-R-300M-21-EN

Facebook's Wav2Vec2 XLS-R fine-tuned for Speech Translation.

model image

This is a SpeechEncoderDecoderModel model. The encoder was warm-started from the facebook/wav2vec2-xls-r-300m checkpoint and the decoder from the facebook/mbart-large-50 checkpoint. Consequently, the encoder-decoder model was fine-tuned on 21 {lang} -> en translation pairs of the Covost2 dataset.

The model can translate from the following spoken languages {lang} -> en (English):

{fr, de, es, ca, it, ru, zh-CN, pt, fa, et, mn, nl, tr, ar, sv-SE, lv, sl, ta, ja, id, cy} -> en

For more information, please refer to Section 5.1.2 of the official XLS-R paper.

Usage

As this a standard sequence to sequence transformer model, you can use the generate method to generate the transcripts by passing the speech features to the model.

You can use the model directly via the ASR pipeline

from datasets import load_dataset
from transformers import pipeline

# replace following lines to load an audio file of your choice
librispeech_en = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
audio_file = librispeech_en[0]["file"]

asr = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-xls-r-300m-21-to-en", feature_extractor="facebook/wav2vec2-xls-r-300m-21-to-en")

translation = asr(audio_file)

or step-by-step as follows:

import torch
from transformers import Speech2Text2Processor, SpeechEncoderDecoder
from datasets import load_dataset

model = SpeechEncoderDecoder.from_pretrained("facebook/wav2vec2-xls-r-300m-21-to-en")
processor = Speech2Text2Processor.from_pretrained("facebook/wav2vec2-xls-r-300m-21-to-en")

ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")

inputs = processor(ds[0]["audio"]["array"], sampling_rate=16_000, return_tensors="pt")
generated_ids = model.generate(input_ids=inputs["input_features"], attention_mask=inputs["attention_mask"])
transcription = processor.batch_decode(generated_ids)

Results {lang} -> en

See the row of XLS-R (0.3B) for the performance on Covost2 for this model.

results image