Training for translation

#43
by tirtohadi - opened

May I ask how to fine-tune Whisper for translation from English to another language? Mainly, I want to know what the dataset should look like. There is a tutorial for Whisper fine-tuning where the example dataset pairs audio with its original-language text. To train for translation, do I pair the audio with the translated text? Many thanks for the help.

Also, do you recommend using Whisper for audio translation, or other models? Appreciate it.

Hey @tirtohadi - you can use the same fine-tuning tutorial as provided. Simply train on pairs of (English audio, translated text). In the tokenizer and processor, you should set the task to translate, and the language to your target language. Whisper should work quite well for this task, especially with the new large-v3 version.
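
Concretely, that setup looks something like the sketch below, assuming Hindi as the target language and the small checkpoint (swap in your own target language and model size):

from transformers import WhisperProcessor

# Load the processor with the translation task. Following the advice above,
# the language is set to the TARGET language the model should output.
processor = WhisperProcessor.from_pretrained(
    "openai/whisper-small", language="hindi", task="translate"
)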

While training, how do I save the model as a pickle file, “pytorch_model.bin”?
When I push the model to the repo using “trainer.push_to_hub(**kwargs)” [ex: CKSINGH/whisper-small-hi-firefox], I don't see the pickle file pushed alongside it.
How can I save the pickle file? I would need it to integrate the model into an LM.
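
For reference, a minimal sketch of saving the weights locally, assuming the Seq2SeqTrainer from the tutorial is in scope as trainer (note that recent transformers versions save model.safetensors by default instead of the pickled pytorch_model.bin):

# Write the weights (pytorch_model.bin) and config to a local directory.
trainer.save_model("./whisper-small-hi")

# Equivalently, save the underlying model directly:
trainer.model.save_pretrained("./whisper-small-hi")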

Thank you @sanchit-gandhi, let me try it out.

Hi @sanchit-gandhi, I tried to follow the tutorial without modification in Google Colab but got this error: ImportError: Using the Trainer with PyTorch requires accelerate>=0.20.1: Please run pip install transformers[torch] or pip install accelerate -U

This is when trying to run the following code:

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-hi",  # change to a repo name of your choice
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,  # increase by 2x for every 2x decrease in batch size
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=4000,
    gradient_checkpointing=True,
    fp16=True,
    evaluation_strategy="steps",
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=1000,
    eval_steps=1000,
    logging_steps=25,
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=True,
)

Could you give a little guidance? Thanks once again for the effort of putting together the tutorial.

By the way, I have run !pip install transformers[torch] and !pip install accelerate -U in the Colab, but it made no difference.

@tirtohadi after running !pip install accelerate -U, restart the Colab session and it will work.
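
In a notebook that amounts to something like the following (the os.kill call is one common way to force a Colab runtime restart; Runtime > Restart runtime from the menu does the same):

# Upgrade accelerate, then restart the runtime so the freshly
# installed version is actually imported.
!pip install accelerate -U

import os
os.kill(os.getpid(), 9)  # kills the current process; Colab reconnects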

I faced the same issue too, but a fix for the root cause is needed.

Thanks for the update. @aaditya I am going to run it on my own local machine for now, so I didn't try it with Colab.

@sanchit-gandhi
How will the WER metric be computed given the translation task? I was wondering whether it is smarter to fine-tune a dialect of a language with a translation (to English) task instead of transcription. My data is gathered from some .srt files (film subtitles), and sometimes the text matches the meaning of the audio without being literally the same, so I thought that the translation task might be focused more on the meaning. Am I wrong?
But at the same time I think that computing the WER can be misleading in this situation.

Thanks for the nice tutorial on HF, by the way.
PS. The audio is in an Italian dialect, and the transcription field is in Italian.
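
For what it's worth, the tutorial's compute_metrics computes WER as a plain word-level edit distance between predicted and reference strings, so it does penalize valid paraphrases. A minimal sketch with the evaluate library (the example sentences are made up):

import evaluate

wer_metric = evaluate.load("wer")

# Same meaning, different wording: WER still counts a substitution.
predictions = ["the film starts at nine"]
references = ["the movie starts at nine"]

print(100 * wer_metric.compute(predictions=predictions, references=references))
# 20.0 -> one substituted word out of five reference words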

@tirtohadi hey, how did the fine-tuning on translation go? How big of a dataset did you end up using?
To be precise, could you share the exact fine-tuning translation script used/mentioned?

I think that 10 hours of audio in the new language will be enough.
