Facing a problem when fine-tuning
Hi, I have been following your work on dysarthria and followed this notebook on your GitHub (https://github.com/jmaczan/asr-dysarthria/blob/wav2vec2-large-960h-lv60-self/wav2vec2-large-xls-r-300m-dysarthria-big-dataset.ipynb) to perform my own fine-tuning. But even when using the same dataset (UASpeech), I get empty predictions during inference: the model consistently predicts token id 28, which maps to [PAD] and decodes to an empty string. I'm wondering if you faced the same problem. Could you also share which base model you used? I'd appreciate any help or details you could share. Thank you.
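For reference, this is roughly how I run inference (a minimal sketch using the standard transformers Wav2Vec2 CTC API; the checkpoint path and the silent dummy audio are placeholders for my real fine-tuned model and UASpeech clips):

```python
# Minimal sketch of my inference path; placeholders marked in comments.
import numpy as np
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

checkpoint = "path/to/my-fine-tuned-checkpoint"  # placeholder
processor = Wav2Vec2Processor.from_pretrained(checkpoint)
model = Wav2Vec2ForCTC.from_pretrained(checkpoint)
model.eval()

audio = np.zeros(16_000, dtype=np.float32)  # placeholder: 1 s of silence
inputs = processor(audio, sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits

pred_ids = torch.argmax(logits, dim=-1)
print(pred_ids)                          # every frame is 28 for me
print(processor.tokenizer.pad_token_id)  # also 28, i.e. [PAD]
print(processor.batch_decode(pred_ids))  # -> [""]
```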
Hi, thanks for getting in touch
I haven't faced this particular problem. But even though the loss is low on my models, they still generalize poorly: they work only somewhat OK on single, short words and produce utter gibberish on longer sequences of speech.
I used this base model https://huggingface.co/facebook/wav2vec2-xls-r-300m
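In case it helps, loading it for fine-tuning looked roughly like this (a sketch rather than the exact notebook code; the processor path is a placeholder for the tokenizer/feature extractor you build from your dataset's vocabulary, and the settings follow the usual XLS-R CTC recipe):

```python
# Rough sketch of loading the base checkpoint for CTC fine-tuning; the
# processor path is a placeholder for a tokenizer/feature extractor built
# from the dataset vocabulary.
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("path/to/your/processor")  # placeholder

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-xls-r-300m",
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,  # must match the tokenizer
    vocab_size=len(processor.tokenizer),            # head sized to your vocab
)
model.freeze_feature_encoder()  # common when fine-tuning XLS-R on small data
```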
Later this year, I am going to take another run at this problem. I am at the beginning of writing code for automated hyperparameter search; I also want to include both the UASpeech and TORGO datasets during training, use data augmentation, and possibly try another base model (NVIDIA Parakeet or similar). I saw a nice paper recently (https://arxiv.org/pdf/2401.00662) and I want to build on their work as well.
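For the dataset part, the merge could look something like this (a sketch assuming both corpora are prepared as Hugging Face datasets with matching audio/text columns; the paths are placeholders, since neither corpus is published on the Hub):

```python
# Sketch of merging UASpeech and TORGO for training; dataset paths are
# placeholders for locally prepared datasets with the same columns.
from datasets import Audio, concatenate_datasets, load_dataset

uaspeech = load_dataset("path/to/uaspeech", split="train")  # placeholder
torgo = load_dataset("path/to/torgo", split="train")        # placeholder

# Resample both to 16 kHz so they match the wav2vec2 feature extractor
uaspeech = uaspeech.cast_column("audio", Audio(sampling_rate=16_000))
torgo = torgo.cast_column("audio", Audio(sampling_rate=16_000))

combined = concatenate_datasets([uaspeech, torgo]).shuffle(seed=42)
```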
If you'd like to collaborate on research, or if you need anything else regarding this project, please let me know.
Thank you so much for getting back to me. I'm actively working on this project, and I'm currently trying the approach you mentioned: training on both the UASpeech and TORGO datasets with the facebook/wav2vec2-xls-r-300m base model. However, I've run into an issue where the model produces blank predictions on the test set, and I'm working on identifying the root cause.
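The first thing I'm checking is whether the model config and tokenizer agree on vocabulary size and pad token id, since a mismatch there is one plausible way for everything to decode to [PAD] (the checkpoint path below is a placeholder):

```python
# Sanity check: model config and tokenizer should agree on vocab size and
# pad token id; the checkpoint path is a placeholder.
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

checkpoint = "path/to/my-fine-tuned-checkpoint"  # placeholder
processor = Wav2Vec2Processor.from_pretrained(checkpoint)
model = Wav2Vec2ForCTC.from_pretrained(checkpoint)

print(len(processor.tokenizer), model.config.vocab_size)
print(processor.tokenizer.pad_token_id, model.config.pad_token_id)
assert model.config.vocab_size == len(processor.tokenizer)
assert model.config.pad_token_id == processor.tokenizer.pad_token_id
```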
After resolving this, I also plan to experiment with data augmentation and compare the performance of Wav2Vec2 with DeepSpeech to see how the models perform on longer sequences of dysarthric speech.
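For the augmentation step, I'm considering something along these lines (a sketch with the audiomentations library; the probabilities and ranges are just starting guesses I'd tune):

```python
# Sketch of waveform-level augmentation for dysarthric speech; parameter
# values are illustrative, and the dummy clip is a placeholder.
import numpy as np
from audiomentations import AddGaussianNoise, Compose, PitchShift, TimeStretch

augment = Compose([
    AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.015, p=0.5),
    TimeStretch(min_rate=0.9, max_rate=1.1, p=0.3),
    PitchShift(min_semitones=-2, max_semitones=2, p=0.3),
])

samples = np.zeros(16_000, dtype=np.float32)  # placeholder for a real clip
augmented = augment(samples=samples, sample_rate=16_000)
```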
I’ll definitely share my results or any new findings with you if I manage to solve the issue or make progress. If you could share more details on your original training using the base model, it would be of great help. Thanks again!
Did you try using my training script? It's very messy and there are multiple versions of it, but it should work just fine if you manage to get to a low loss: https://github.com/jmaczan/asr-dysarthria/blob/main/training/wav2vec2-large-xls-r-300m-dysarthria-big-dataset.ipynb
Yup, I have tried using your training script, but unfortunately I'm still facing the same error and am still trying to figure it out.
I am working on automated hyperparameter search now (https://github.com/jmaczan/asr-dysarthria/blob/main/training/hyperparameter_search.py), and once it's done, I'll retrain the model, so I will let you know if I stumble on the same issue as you did.
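The rough shape of the search is something like this (illustrative, not the exact contents of hyperparameter_search.py; the search space is a guess and the datasets are assumed to be prepared elsewhere):

```python
# Sketch of automated hyperparameter search using transformers' built-in
# Optuna backend; search space and model_init are illustrative.
from transformers import Trainer, TrainingArguments, Wav2Vec2ForCTC

def model_init():
    # fresh model per trial so runs don't share weights
    return Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-xls-r-300m")

def hp_space(trial):
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True),
        "per_device_train_batch_size": trial.suggest_categorical(
            "per_device_train_batch_size", [4, 8, 16]
        ),
        "warmup_steps": trial.suggest_int("warmup_steps", 0, 1000),
    }

trainer = Trainer(
    model_init=model_init,
    args=TrainingArguments(output_dir="hp-search"),
    train_dataset=train_dataset,  # prepared elsewhere, as is the CTC data collator
    eval_dataset=eval_dataset,
)

best = trainer.hyperparameter_search(
    direction="minimize",  # minimize eval loss (or WER via a custom objective)
    backend="optuna",
    hp_space=hp_space,
    n_trials=20,
)
print(best)
```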