Fine-tuning for two languages

#78
by artyomboyko - opened

Good day! What is the correct way to fine-tune a model for automatic speech recognition in two languages? Are there any peculiarities to be aware of?

If I fine-tune on Russian first and then on English, will the second fine-tuning degrade the recognition quality for Russian?

What is the right way to do this?

Hey @artyomboyko - there is a similar discussion in this thread: https://huggingface.co/spaces/openai/whisper/discussions/6#643d8bc551e2958ef6cd69ef

Does the thread provided here answer your question?

Good evening, @sanchit-gandhi. You might remember me - I'm on the team translating the Transformers audio course into Russian.

This thread partially answers my questions - thank you very much. I had read this blog post before, but I still have a few questions:

  1. Does anything in the code used to fine-tune the model need to be changed as well?
  2. If the Russian part of the dataset is smaller than the English part, could that cause the model to confuse the languages (recognizing Russian words as English, or vice versa)? Is it worth making the dataset parts for the different languages the same size?

@sanchit-gandhi Good evening, Sanchit. I will try to implement what is described in the link you provided. If I have any questions, may I ask you for help?

Hey @artyomboyko ! Thanks for your efforts translating the Transformers audio course to Russian! Answering your questions in-line below:

  1. No code in the Trainer needs to be changed - only the prefix token ids (i.e. the language tokens). These token ids form part of the target labels during training, so the model learns to recognise each language directly from the language ids in the labels (see the sketch after this list). E.g. if we train it on:

<|startoftranscript|><|en|><|transcribe|> Some English text.<|endoftext|>

then the model learns from the <|en|> token that the source audio is in English and that it should transcribe into English. Whereas in the following example, the <|ru|> token indicates that the source audio is in Russian:

<|startoftranscript|><|ru|><|transcribe|> Some Russian text.<|endoftext|>

This is all the information the model needs to learn the language identification task, and also which language to transcribe in.
2. I don't think so - the Whisper pre-training set is hugely imbalanced across languages (see page 27 of the paper), yet the model still learns to identify languages accurately.
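
A minimal sketch of how those per-language prefix tokens could be set when preparing a mixed Russian/English dataset - assuming each example carries a `language` column, and with the checkpoint and column names as placeholders:

```python
from transformers import WhisperProcessor

# hypothetical checkpoint - swap in the model you are fine-tuning
processor = WhisperProcessor.from_pretrained("openai/whisper-small")

def prepare_example(batch):
    # compute log-Mel input features from the raw audio
    audio = batch["audio"]
    batch["input_features"] = processor.feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]

    # point the tokenizer at this sample's language so its labels start with
    # <|startoftranscript|><|ru|><|transcribe|> or <|startoftranscript|><|en|><|transcribe|>
    processor.tokenizer.set_prefix_tokens(language=batch["language"], task="transcribe")
    batch["labels"] = processor.tokenizer(batch["sentence"]).input_ids
    return batch
```

Applied with `dataset.map(prepare_example)`, this bakes the correct language token into each sample's labels, so nothing in the Trainer itself needs to change.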

@sanchit-gandhi Good morning, Sanchit. I have already started implementing this in code. Thanks a lot for the great course and for your help. No more questions so far. I will leave a link to the repositories here as soon as I finish implementing and debugging the code. If you have time, we can discuss it.

@sanchit-gandhi Good evening. While preparing the English part of the dataset I get the following error (I still have ~500 GB of free space on the disk where the WSL2 host is stored):
[screenshots of the error: 1.png, 2.png, 3.png, 4.png]

Can you tell me what the problem is? There is still free disk space.

@sanchit-gandhi Good evening. I solved the problem with the lack of disk space and prepared the dataset; the result is in the picture below. Since I prepare the dataset this way to reduce the preparation time before fine-tuning, can you tell me whether I need to add an additional column containing the language identifier?

[screenshot: image.png]
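
If it helps, here is a minimal sketch of one way to add that language column and mix the two parts - assuming Common Voice as the source (the dataset name and splits are placeholders):

```python
from datasets import load_dataset, interleave_datasets

# hypothetical source dataset - substitute your own prepared splits
ru = load_dataset("mozilla-foundation/common_voice_11_0", "ru", split="train")
en = load_dataset("mozilla-foundation/common_voice_11_0", "en", split="train")

# tag every example with its language so a per-sample prepare function
# (like prepare_example() above) can pick the matching prefix tokens
ru = ru.add_column("language", ["russian"] * len(ru))
en = en.add_column("language", ["english"] * len(en))

# interleave so that training batches see both languages throughout
mixed = interleave_datasets([ru, en], stopping_strategy="all_exhausted")
```

With `stopping_strategy="all_exhausted"`, the smaller Russian part is oversampled until the English part is exhausted, which also softens the size imbalance discussed earlier.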
