Only English transcriptions on Dutch transcribe task?
When I run the transcribe task on the Dutch Common Voice data (downloaded locally), I only get English transcriptions from the tiny, base, and small models, which are the ones I have tested so far. I assume there is a mistake in my code or in the way I use the pipeline. Could anyone help? My code is below.

from transformers import pipeline, WhisperTokenizer

pipe_whisper = pipeline(model="openai/whisper-tiny", device=device, tokenizer=WhisperTokenizer.from_pretrained("openai/whisper-tiny", language="Dutch", task="transcribe"))
df["transcription_whisper"] = df["path"].progress_apply(lambda path: pipe_whisper(DATA_COMMON_VOICE_PATH/path))
Hey! This means one of three things:
- the model translates instead of transcribing
- the model is bad at transcribing Dutch
- the task is not fed to the model properly
You should try forwarding the task to Whisper through generate_kwargs:
pipe = pipeline(....., generate_kwargs={"task": "transcribe", "language": "Dutch"})
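
For reference, here is a minimal sketch of what that could look like end to end. The helper names (whisper_generate_kwargs, build_dutch_pipeline), the lowercase "dutch" spelling, the CPU default, and the sample file name are my own choices for illustration and are not from the original post, so adapt them to your setup.

```python
def whisper_generate_kwargs(language="dutch", task="transcribe"):
    """Return decoding options that pin Whisper to one language and one task.

    Hypothetical helper: it only builds the dict that is handed to the
    pipeline, so it can be checked without downloading a model.
    """
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unknown Whisper task: {task!r}")
    return {"language": language, "task": task}


def build_dutch_pipeline(model_name="openai/whisper-tiny", device=-1):
    """Build an ASR pipeline that is told to transcribe Dutch.

    Assumption: device=-1 (CPU) is used as a default here; pass your own
    device if you have a GPU.
    """
    # Imported lazily so the helper above works without loading transformers.
    from transformers import pipeline

    return pipeline(
        "automatic-speech-recognition",
        model=model_name,
        device=device,
        # Forwarded to model.generate() on every call.
        generate_kwargs=whisper_generate_kwargs(),
    )


# Usage (downloads the model on first run; "sample.mp3" is a placeholder):
# pipe = build_dutch_pipeline()
# print(pipe("sample.mp3")["text"])
```

Note that the pipeline returns a dict, so in the DataFrame code you probably want pipe_whisper(...)["text"] rather than the whole result. If the output is still English after this, that points more toward the first two possibilities (the model translating or simply being weak on Dutch), and trying a larger checkpoint would be the next thing to check.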