Incorrect config file

#5 opened by shrey-jasuja

The configuration attached to this model is that of mBART-50, which makes the model completely unusable as-is.

Hey @shrey-jasuja, this is a SpeechEncoderDecoderModel, which uses a speech encoder and a text (mbart) decoder. As stated in the model card:

The encoder was warm-started from the facebook/wav2vec2-xls-r-1b checkpoint and the decoder from the facebook/mbart-large-50 checkpoint. Consequently, the encoder-decoder model was fine-tuned on 21 {lang} -> en translation pairs of the Covost2 dataset.
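
To make that concrete, here is a rough sketch of how the pieces can be loaded and run directly. Treat the mBART-50 tokenizer on the decoder side as an assumption (that mismatch is exactly what this thread is about), and `speech` as a placeholder for a 1-D 16 kHz mono audio array:

import torch
from transformers import SpeechEncoderDecoderModel, Wav2Vec2FeatureExtractor, MBart50Tokenizer

# Speech encoder-decoder checkpoint: wav2vec2 encoder + mbart decoder
model = SpeechEncoderDecoderModel.from_pretrained("facebook/wav2vec2-xls-r-2b-21-to-en")
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-xls-r-2b-21-to-en")
# Assumed decoder-side tokenizer (loaded from the mbart-large-50 repo)
tokenizer = MBart50Tokenizer.from_pretrained("facebook/mbart-large-50")

# `speech` is a 1-D float array of 16 kHz mono audio (not defined here)
inputs = feature_extractor(speech, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    generated_ids = model.generate(inputs.input_values)
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])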

I understand, but the inference code in its current form doesn't work. The tokenizer needs to be defined explicitly. The following changes worked for me:

from transformers import MBart50Tokenizer, Wav2Vec2FeatureExtractor, pipeline

# Load the mBART-50 tokenizer explicitly, since the tokenizer config shipped with the model repo is unusable
tokenizer = MBart50Tokenizer.from_pretrained("facebook/mbart-large-50")
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-xls-r-2b-21-to-en")

asr = pipeline(
    "automatic-speech-recognition",
    model="facebook/wav2vec2-xls-r-2b-21-to-en",
    tokenizer=tokenizer,
    feature_extractor=feature_extractor,
    device=0,
)

# `item` is a sample from an audio dataset (see below)
audio = item["file"]
translation = asr(audio)["text"]
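
For reference, `item` above is just a sample from an audio dataset. Something like the following would define it; this is purely illustrative, using the small LibriSpeech dummy split as a stand-in for any dataset with an audio file column:

from datasets import load_dataset

# Illustrative only: any dataset with an audio file column works here
ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
item = ds[0]
print(asr(item["file"])["text"])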

Pinging @sanchit-gandhi for advice :)
