Not getting English transcription from the model

#3
by tusharagarwal3 - opened

Hi,

I am trying to get English text from Hindi audio using the whisper-hindi-large-v2 model. Even when I specify the language as "en" (I also tried "english"), the model still doesn't return English output. It returns text in some Indic language, most likely Hindi.

Please let me know if I am doing something wrong.

[Screenshot: Screenshot 2023-04-28 at 2.13.57 PM.png]

I found this related issue on GitHub: https://github.com/huggingface/transformers/issues/21937.

Any workaround for this?

I am new to this, so I might be missing something.

Also, could you help me access the model embeddings? I need them for downstream tasks.


Hi @tusharagarwal3,

Detailed instructions for extracting embeddings from the various layers are provided here: https://github.com/vasistalodagala/whisper-finetune#extract-embeddings-from-whisper-models
Hope that helps.
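As a rough illustration of what such extraction can look like with Hugging Face transformers, here is a minimal sketch. It assumes the model is published on the Hub as `vasista22/whisper-hindi-large-v2` (an assumed repo id), that `audio` holds 16 kHz mono samples, and that a mean-pooled encoder layer is a suitable embedding for your downstream task. The function names `num_valid_frames` and `extract_encoder_embeddings` are hypothetical; the linked repository remains the reference for the author's own approach.

```python
# Whisper pads/trims every input to 30 s of 16 kHz audio. The encoder emits
# 1500 frames for that window, i.e. one frame per 320 input samples (20 ms).
SAMPLES_PER_ENCODER_FRAME = 320
MAX_ENCODER_FRAMES = 1500


def num_valid_frames(num_samples: int) -> int:
    """Approximate number of encoder frames that cover real (unpadded) audio."""
    frames = -(-num_samples // SAMPLES_PER_ENCODER_FRAME)  # ceiling division
    return min(frames, MAX_ENCODER_FRAMES)


def extract_encoder_embeddings(audio, layer: int = -1,
                               model_name: str = "vasista22/whisper-hindi-large-v2"):
    """Return a mean-pooled embedding taken from one encoder layer.

    `audio` is assumed to be a 1-D sequence of 16 kHz mono float samples.
    `model_name` is an assumed Hub repo id, not confirmed by the thread.
    """
    # Imported lazily so the helper above works without these packages.
    import torch
    from transformers import WhisperFeatureExtractor, WhisperModel

    feature_extractor = WhisperFeatureExtractor.from_pretrained(model_name)
    model = WhisperModel.from_pretrained(model_name).eval()

    inputs = feature_extractor(audio, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        outputs = model.get_encoder()(inputs.input_features,
                                      output_hidden_states=True)

    # hidden_states[0] is the input to the first encoder layer; the remaining
    # entries are the outputs of each layer (the last one is layer-normed).
    hidden = outputs.hidden_states[layer]  # shape: (1, 1500, d_model)
    n = num_valid_frames(len(audio))
    # Average only the frames that cover real audio, not the padding.
    return hidden[:, :n, :].mean(dim=1).squeeze(0)  # shape: (d_model,)
```

Because Whisper pads every clip to 30 s, averaging all 1500 frames would mix padding into the embedding; the sketch averages only the frames that approximately cover the real audio (a 1-second clip keeps the first 50 frames). Choosing `layer` other than `-1` gives embeddings from intermediate layers.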


Since this is an Automatic Speech Recognition (ASR) model, passing Hindi audio as input gives a Hindi transcription.
This model was fine-tuned for 57,000 steps on about 3,500 hours of data, which is substantial, so it has most likely become strongly biased toward generating Hindi tokens (which, in a way, is the desired outcome of this ASR fine-tuning). As a result, it does not do well on the translation task.

To obtain an English translation of Hindi audio, you could try one of these three methods:

  1. Use OpenAI's original Whisper model and set decoder_prompt_ids to (language="hi", task="translate"). Hindi->English accuracy is quite low, though.
  2. Use a speech translation model. There are not many good options for Indian languages, though.
  3. Use a cascade system: get the Hindi output from a good ASR model (you could keep using whisper-hindi-large-v2 for this), then pass that output to a machine translation model. Facebook's NLLB (https://huggingface.co/facebook/nllb-200-distilled-600M) is good enough for Hindi to English (with roughly 20-30% error). In my view, this is the better option for getting an English translation of Hindi audio.
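Methods 1 and 3 could look roughly like the sketch below, built on transformers `pipeline` objects. This is not the author's code: the function names and the `split_sentences` helper are hypothetical, the Hub id `vasista22/whisper-hindi-large-v2` is assumed, reading from a file path assumes ffmpeg is installed, and splitting on the danda (।) assumes the ASR output contains punctuation. Recent transformers releases accept `language` and `task` as generation arguments; in early-2023 releases the equivalent was passing `forced_decoder_ids` built with `processor.get_decoder_prompt_ids(language="hindi", task="translate")`.

```python
import re

# NLLB-200 language codes: Hindi in Devanagari script, English in Latin script.
SRC_LANG = "hin_Deva"
TGT_LANG = "eng_Latn"


def split_sentences(text: str) -> list[str]:
    """Split on whitespace that follows a danda (।), '?', '!' or '.'."""
    parts = re.split(r"(?<=[।?!.])\s+", text.strip())
    return [part for part in parts if part]


def translate_with_whisper(audio_path: str,
                           model_name: str = "openai/whisper-large-v2") -> str:
    """Method 1: original multilingual Whisper, Hindi speech -> English text."""
    from transformers import pipeline  # imported lazily (heavy dependency)

    asr = pipeline("automatic-speech-recognition", model=model_name,
                   chunk_length_s=30)
    # `language` names the *source* language; task="translate" asks for English.
    result = asr(audio_path,
                 generate_kwargs={"language": "hi", "task": "translate"})
    return result["text"]


def cascade_translate(audio_path: str,
                      asr_model: str = "vasista22/whisper-hindi-large-v2",
                      mt_model: str = "facebook/nllb-200-distilled-600M") -> str:
    """Method 3: Hindi ASR, then Hindi -> English machine translation."""
    from transformers import pipeline  # imported lazily (heavy dependency)

    asr = pipeline("automatic-speech-recognition", model=asr_model,
                   chunk_length_s=30)
    hindi_text = asr(audio_path)["text"]

    sentences = split_sentences(hindi_text)
    if not sentences:
        return ""

    mt = pipeline("translation", model=mt_model,
                  src_lang=SRC_LANG, tgt_lang=TGT_LANG)
    # Passing a list returns one {"translation_text": ...} dict per sentence.
    outputs = mt(sentences)
    return " ".join(out["translation_text"] for out in outputs)
```

Usage would be along the lines of `cascade_translate("hindi_clip.wav")`, where the file name is a placeholder. Translating sentence by sentence keeps each NLLB input short; if the ASR output has no punctuation, the whole transcript is simply sent as one input.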
