Only English transcriptions on Dutch transcribe task?

#13
by RikRaes - opened

When running the transcribe task on the Dutch Common Voice data (downloaded locally), I only get English transcriptions from the tiny, base, and small models, which are the ones I have tested so far. I assume there is a mistake in my code or in the way I use the pipeline. Could anyone help? My code is below.
pipe_whisper = pipeline(model="openai/whisper-tiny", device=device, tokenizer=WhisperTokenizer.from_pretrained("openai/whisper-tiny", language="Dutch", task="transcribe"))
df["transcription_whisper"] = df["path"].progress_apply(lambda path: pipe_whisper(DATA_COMMON_VOICE_PATH/path))

Hey! This means one of three things:

  • the model is translating instead of transcribing
  • the model is bad at transcribing Dutch
  • the task is not passed to the model properly

You should try forwarding the task to Whisper through the pipeline itself: pipe = pipeline(....., generate_kwargs={"task": "transcribe", "language": "Dutch"})
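
For reference, here is a minimal sketch of what that could look like in full. The helper names (whisper_generate_kwargs, build_asr_pipeline) and the audio path are made up for illustration, and the language is written in lowercase on the assumption that some transformers versions match language names case-sensitively. It is not tested against the Common Voice setup from the question.

```python
def whisper_generate_kwargs(language="dutch", task="transcribe"):
    """Build the generation options that pin Whisper's language and task.

    Hypothetical helper: it only assembles the dict passed as generate_kwargs.
    """
    return {"language": language, "task": task}


def build_asr_pipeline(model_id="openai/whisper-tiny", device=-1):
    """Create a speech recognition pipeline that forwards language and task.

    Hypothetical helper; device=-1 runs on CPU.
    """
    # Imported lazily so the kwargs helper works without transformers installed.
    from transformers import pipeline

    return pipeline(
        "automatic-speech-recognition",
        model=model_id,
        device=device,
        generate_kwargs=whisper_generate_kwargs(),
    )


# Example usage (downloads the model on first run; the path is a placeholder):
#   pipe = build_asr_pipeline()
#   result = pipe("path/to/clip.mp3")
#   print(result["text"])
```

The same dict can also be passed per call, e.g. pipe(audio, generate_kwargs=...), if you want to switch language or task without rebuilding the pipeline.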
