Training for translation

#43
by tirtohadi - opened

May I ask how to fine-tune Whisper for translation from English to another language? Mainly, I want to know what the dataset should look like. There is a tutorial for Whisper fine-tuning where the example dataset pairs audio with its original-language text. To train for translation, do I pair the audio with the translated text? Many thanks for the help.

Also, do you recommend using Whisper for audio translation, or other models? Appreciate it.

Hey @tirtohadi - you can use the same fine-tuning tutorial as provided. Simply train on pairs of (English audio, translated text). In the tokenizer and processor, you should set the task to translate, and the language to your target language. Whisper should work quite well for this task, especially with the new large-v3 version.
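
Concretely, that setup looks something like the sketch below, assuming Hindi as the target language and the small checkpoint (swap in your own target language and model size):

from transformers import WhisperProcessor

# Load the processor with the translation task. Following the advice above,
# the language is set to the TARGET language the model should output.
processor = WhisperProcessor.from_pretrained(
    "openai/whisper-small", language="hindi", task="translate"
)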

While training, how do I save the model as a pickle file, “pytorch_model.bin”?
When I push the model to the repo using “trainer.push_to_hub(**kwargs)” [ex: CKSINGH/whisper-small-hi-firefox], I don't see the pickle file pushed alongside it.
How can I save the pickle file? I would need it to integrate the model into an LM.
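
For reference, a minimal sketch of saving the weights locally, assuming the Seq2SeqTrainer from the tutorial is in scope as trainer (note that recent transformers versions save model.safetensors by default instead of the pickled pytorch_model.bin):

# Write the weights (pytorch_model.bin) and config to a local directory.
trainer.save_model("./whisper-small-hi")

# Equivalently, save the underlying model directly:
trainer.model.save_pretrained("./whisper-small-hi")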

Thank you @sanchit-gandhi, let me try it out.

Hi @sanchit-gandhi, I tried to follow the tutorial without modification in Google Colab but got this error: ImportError: Using the Trainer with PyTorch requires accelerate>=0.20.1: Please run pip install transformers[torch] or pip install accelerate -U

This is when trying to run the following code:

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-hi",  # change to a repo name of your choice
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,  # increase by 2x for every 2x decrease in batch size
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=4000,
    gradient_checkpointing=True,
    fp16=True,
    evaluation_strategy="steps",
    per_device_eval_batch_size=8,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=1000,
    eval_steps=1000,
    logging_steps=25,
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    push_to_hub=True,
)

Could you give a little guidance? Thanks once again for the effort of putting together the tutorial.

By the way, I have run !pip install transformers[torch] and !pip install accelerate -U in the Colab, but it made no difference.

@tirtohadi after running !pip install accelerate -U, restart the Colab session and it will work.
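
In a notebook that amounts to something like the following (the os.kill call is one common way to force a Colab runtime restart; Runtime > Restart runtime from the menu does the same):

# Upgrade accelerate, then restart the runtime so the freshly
# installed version is actually imported.
!pip install accelerate -U

import os
os.kill(os.getpid(), 9)  # kills the current process; Colab reconnects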

I faced the same issue too, but a fix for the root cause is needed.

Thanks for the update. @aaditya I am going to run it on my own local machine for now, so I didn't try it with Colab.

@sanchit-gandhi
How will the WER metric be computed given the translation task? I was wondering whether it is smarter to fine-tune a dialect of a language with a translation (to English) task instead of transcription. My data is gathered from some .srt files (film subtitles), and sometimes the text matches the meaning of the audio without being literally the same, so I thought that the translation task might be focused more on the meaning. Am I wrong?
But at the same time I think that computing the WER can be misleading in this situation.

Thanks for the nice tutorial on HF, by the way.
PS. The audio is in an Italian dialect, and the transcription field is in Italian.
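
For what it's worth, the tutorial's compute_metrics computes WER as a plain word-level edit distance between predicted and reference strings, so it does penalize valid paraphrases. A minimal sketch with the evaluate library (the example sentences are made up):

import evaluate

wer_metric = evaluate.load("wer")

# Same meaning, different wording: WER still counts a substitution.
predictions = ["the film starts at nine"]
references = ["the movie starts at nine"]

print(100 * wer_metric.compute(predictions=predictions, references=references))
# 20.0 -> one substituted word out of five reference words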

@tirtohadi hey, how did the fine-tuning on translation go? How big of a dataset did you end up using?
To be precise, could you share the exact fine-tuning translation script used/mentioned?

I think that 10 hours of audio in the new language will be enough.
