How to adapt for a low-resource language?

#73
by Imran1 - opened

I'm trying to fine-tune the Whisper model; the training loss is 0.007 and the validation loss is 0.46.
Is there any way to adapt the Whisper model for a low-resource language?

@Imran1 the new w2v-BERT 2.0 (https://huggingface.co/blog/fine-tune-w2v2-bert) has shown really promising results on low-resource languages, where training data is quite scarce.

It is also reported to be 10x to 30x faster and 2.5x more resource-efficient than Whisper v3, while reaching the same fine-tuned Word Error Rate (WER).
I understand your struggle with training Whisper in a low-resource setting, but given its architecture and training objective, Whisper simply needs tons of data to perform really well. You'd be better off training w2v-BERT 2.0 or Meta's MMS model.
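If it helps, here is a minimal sketch of how the w2v-BERT 2.0 fine-tuning setup looks, roughly following the blog post linked above. The `vocab.json` path is an assumption; the blog builds the tokenizer vocabulary from your own dataset's transcripts:

```python
from transformers import (
    SeamlessM4TFeatureExtractor,
    Wav2Vec2BertForCTC,
    Wav2Vec2BertProcessor,
    Wav2Vec2CTCTokenizer,
)

# feature extractor that ships with the w2v-BERT 2.0 checkpoint
feature_extractor = SeamlessM4TFeatureExtractor.from_pretrained("facebook/w2v-bert-2.0")

# character-level CTC tokenizer built from your own vocab.json (path is an assumption)
tokenizer = Wav2Vec2CTCTokenizer(
    "./vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|"
)

processor = Wav2Vec2BertProcessor(feature_extractor=feature_extractor, tokenizer=tokenizer)

# the CTC head is freshly initialized with the size of your language's vocabulary
model = Wav2Vec2BertForCTC.from_pretrained(
    "facebook/w2v-bert-2.0",
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
)
```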

@StephennFernandes thanks for the reply.
Unfortunately I tried that one as well and I'm getting an out-of-memory error.

The batch size for both training and evaluation is 2, and gradient accumulation is 4.

Could you tell me which GPU you are using to train the model?

It might be an internal package error, probably a torch/transformers issue; I ran into the same thing in the past. Try installing and running everything in a clean conda env.
Also try to mitigate the OOM error by monitoring your GPU VRAM with command-line tools like nvitop/nvtop/nvidia-smi to see how much memory each process consumes, and kill any other running processes before training so you can use the whole GPU memory.

Also ensure the fp16=True flag is set so that you are training in fp16, and that your GPU actually supports it.
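Here is a quick sketch for checking free VRAM and fp16 support from inside the notebook before launching training (both are standard torch.cuda utilities):

```python
import torch

# (free_bytes, total_bytes) for the current CUDA device
free, total = torch.cuda.mem_get_info()
print(f"free: {free / 1e9:.2f} GB of {total / 1e9:.2f} GB")

# fp16 tensor cores need compute capability >= 7.0 (a T4 is 7.5, so it qualifies)
print(torch.cuda.get_device_name(0))
print("fp16-friendly:", torch.cuda.get_device_capability(0) >= (7, 0))
```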

@StephennFernandes I am using a Google Colab T4 GPU. Let me share my training arguments.

```python
training_args = TrainingArguments(
    output_dir=repo_name,
    group_by_length=True,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    evaluation_strategy="steps",
    num_train_epochs=10,
    # gradient_checkpointing=True,
    fp16=True,
    # save_steps=600,
    # eval_steps=300,
    # logging_steps=300,
    learning_rate=1e-5,
    # warmup_steps=500,
    # save_total_limit=2,
    push_to_hub=True,
)
```

@StephennFernandes
I tried, but I can't solve the issue.
```
!pip install -qU datasets
!pip install -q git+https://github.com/huggingface/transformers.git
!pip install -q torchaudio
!pip install -q jiwer
!pip install accelerate -U
```
(attached screenshot: nvidia.PNG)
I got the same error.
```
---------------------------------------------------------------------------
OutOfMemoryError                          Traceback (most recent call last)
in <cell line: 1>()
----> 1 trainer.train()

25 frames
/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
   2231         # remove once script supports set_grad_enabled
   2232         _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 2233     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
   2234
   2235

OutOfMemoryError: CUDA out of memory. Tried to allocate 2.97 GiB. GPU 0 has a total capacty of 14.75 GiB of which 2.45 GiB is free. Process 3293 has 12.29 GiB memory in use. Of the allocated memory 9.08 GiB is allocated by PyTorch, and 3.08 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```

Hey, it looks like you need more VRAM to train the model. Try the following tricks to fine-tune with more efficient GPU utilization:
https://github.com/huggingface/community-events/tree/main/whisper-fine-tuning-event#deepspeed
https://github.com/Vaibhavs10/fast-whisper-finetuning#faster-whisper-finetuning-with-lora-powered-by--peft

As a last option, you could also set up Hugging Face Accelerate on a TPU notebook and train the model in script mode; I believe Kaggle notebooks provide a TPU v3-8 with longer training hours. Try keeping batch_size as big as you can and set gradient_accumulation_steps to 2-4 depending on your batch size, for better training.
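To give a rough idea of the LoRA route from the second link above, here is a minimal sketch. The 8-bit loading flag, rank, and target_modules vary across transformers/peft versions, so treat it as a starting point rather than the repo's exact code:

```python
import os

# allocator hint from the OOM message; the exact value is something to experiment with
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# load the base model in 8-bit so it fits in ~16 GB of VRAM (requires bitsandbytes)
model = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-large-v2", load_in_8bit=True, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# LoRA trains only small low-rank adapters on the attention projections
lora_config = LoraConfig(
    r=32, lora_alpha=64, target_modules=["q_proj", "v_proj"], lora_dropout=0.05, bias="none"
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically ~1% of the full parameter count
```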

@StephennFernandes thank you.
Someone suggested the w2v-BERT model for low-resource languages. I want to confirm: will LoRA work for the w2v-BERT model?

Yes, as I recommended w2v-BERT 2.0 in a couple of replies above, it is ideal: compared to Whisper v3 it is 10x to 30x faster and 2.5x more resource-efficient.

The notebook linked above does LoRA fine-tuning on Whisper, and I am pretty sure the same code can work for w2v-BERT 2.0 as well: LoRA is just a low-rank matrix factorization applied to the transformer layers, and since the encoder architectures are broadly similar it should work fine. Even Meta AI's MMS model supports adapter modules by default, so everything is almost interoperable.
You could also try QLoRA, but the quantization and de-quantization on every forward pass makes training quite slow; in extremely VRAM-constrained environments, though, this approach works great with almost no performance degradation.
You might just need to experiment with the code a bit to get good results while keeping training efficient within the limited GPU VRAM you have.
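A hedged sketch of what the same idea could look like for w2v-BERT 2.0. The target_modules names here are an assumption about the Wav2Vec2-BERT attention layer naming, so inspect model.named_modules() on your checkpoint and adjust before relying on it:

```python
from transformers import Wav2Vec2BertForCTC, Wav2Vec2CTCTokenizer
from peft import LoraConfig, get_peft_model

# tokenizer built from your own vocab.json, as in the fine-tune-w2v2-bert blog (path is an assumption)
tokenizer = Wav2Vec2CTCTokenizer(
    "./vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|"
)

model = Wav2Vec2BertForCTC.from_pretrained(
    "facebook/w2v-bert-2.0",
    ctc_loss_reduction="mean",
    pad_token_id=tokenizer.pad_token_id,
    vocab_size=len(tokenizer),
)

# "linear_q" / "linear_v" are assumed attention projection names -- verify with model.named_modules()
lora_config = LoraConfig(
    r=16, lora_alpha=32, target_modules=["linear_q", "linear_v"], lora_dropout=0.05, bias="none"
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```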

@StephennFernandes thank you.
