Model fine-tuning official repository

#5
by gorinars - opened

We’ve just released the SpeechBrain fine-tuning code with example notebooks to help you get started with training and fine-tuning your own models! Let's add some diversity to the leaderboard!

Code: https://github.com/Ubenwa/cryceleb2023

Feedback/questions appreciated.
Have fun

Thank you very much, @gorinars . Could you also share the final metrics of the baseline model? When I run train.ipynb, the validation loss doesn't decrease and the validation accuracy is around 0%.

Hi @dhoa

Unfortunately, we did not keep the exact logs for the model that is uploaded to https://huggingface.co/Ubenwa/ecapa-voxceleb-ft-cryceleb
The code and configs were a bit messy and we did not want to alter the model after releasing the code.

That said, to double-check, I launched the open-sourced code with the default config and stored a checkpoint from the 586th epoch.
It is uploaded along with train_log to https://huggingface.co/Ubenwa/ecapa-voxceleb-ft2-cryceleb

In my fine-tuning experiment, the loss was not decreasing on val either. I evaluated EER on val for a couple of checkpoints with non-zero accuracy and selected the one that performed best.
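For reference, this is roughly how you could compute EER from verification scores on the dev pairs (just an illustration, not the exact evaluation script; `labels` and `scores` below are made-up same-infant labels and cosine scores):

```python
# Rough sketch: EER from binary labels and similarity scores (illustrative only)
import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(labels, scores):
    """Return the equal error rate given 0/1 labels and similarity scores."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    # EER is the point where false positive and false negative rates cross
    idx = np.nanargmin(np.abs(fnr - fpr))
    return (fpr[idx] + fnr[idx]) / 2

# Dummy example
labels = np.array([1, 1, 0, 0, 1, 0])
scores = np.array([0.9, 0.7, 0.4, 0.2, 0.6, 0.5])
print(f"EER: {compute_eer(labels, scores):.3f}")
```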

A few things to note:

  1. The number of samples per infant is much smaller than in typical speaker recognition datasets. I do not think it is possible to train a decent classifier in terms of val accuracy/loss. However, it is quite easy to train embeddings that are better than the non-fine-tuned model
  2. The training split, with birth-period recordings in train and discharge recordings in val, is super hard. We really encourage you to play with that and potentially ensemble models in the end. A random split will likely result in higher classification accuracy (not sure about EER). Also, there is a bunch of babies with only birth or only discharge recordings that are completely ignored
  3. In our experiments, data augmentation and shorter chunks sometimes resulted in better val loss (but again, not necessarily better embeddings)
  4. AM-softmax in ECAPA forces classes to be separated by a large margin, which can make it hard to train
  5. Speaker recognition may not be the closest domain to transfer from with such a small amount of data (VoxCeleb is just a popular model that we picked for a naive baseline). You can check, for example, a model pre-trained on VGGSound (https://huggingface.co/Ubenwa/sb-ecapa-vggsound) or on Chinese speaker ID (https://huggingface.co/LanceaKing/spkrec-ecapa-cnceleb); see the sketch after this list
  6. There is no evidence that newborn verification is an easy task. Arguably, we do not claim it is doable in the current setting, but that is basically the challenge :)
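If you want to try one of those alternative checkpoints, here is a rough sketch of loading a different pre-trained ECAPA model and extracting embeddings with SpeechBrain. This is an illustration only: the audio path is made up, and whether a given repo loads directly this way depends on how its hyperparams are packaged.

```python
# Illustrative sketch: extract embeddings from an alternative pre-trained ECAPA model
import torchaudio
from speechbrain.pretrained import EncoderClassifier

# e.g. the CN-Celeb speaker-ID model mentioned above
encoder = EncoderClassifier.from_hparams(
    source="LanceaKing/spkrec-ecapa-cnceleb",
    savedir="pretrained_models/spkrec-ecapa-cnceleb",
)

signal, sr = torchaudio.load("some_cry_segment.wav")  # hypothetical file
embedding = encoder.encode_batch(signal)               # shape: [1, 1, emb_dim]
```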

Hope it helps!

Thanks a lot for a very detailed answer, @gorinars !

No problem!
Btw, I just tried submitting https://huggingface.co/Ubenwa/ecapa-voxceleb-ft2-cryceleb from the default notebook (without test-time augmentation), and it looks like it also gets 0.28125 on the public leaderboard.
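For anyone curious, scoring pairs without test-time augmentation boils down to plain cosine similarity between embeddings. A minimal illustration follows; the embeddings, pair IDs, and file name are made up here, and the official notebook handles the real formats:

```python
# Minimal sketch: score verification pairs with cosine similarity (no TTA)
import pandas as pd
import torch
import torch.nn.functional as F

# Hypothetical precomputed embeddings, one per recording id
embeddings = {
    "rec_a": torch.randn(192),
    "rec_b": torch.randn(192),
    "rec_c": torch.randn(192),
}
pairs = pd.DataFrame({"id1": ["rec_a", "rec_b"], "id2": ["rec_b", "rec_c"]})

pairs["score"] = [
    F.cosine_similarity(embeddings[r.id1], embeddings[r.id2], dim=0).item()
    for r in pairs.itertuples()
]
pairs.to_csv("submission.csv", index=False)
```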

@gorinars any reason why e2e training wasn't explored? I understand it could be complex and compute-heavy, but I wanted to know if there's any other reason for it.

@Phaedrus33 well, we tried to put together something quick, familiar to folks, and simple for the baseline; otherwise it would not be called a baseline :)

@Phaedrus33 also, just to avoid confusion: E2E would be a valid approach as long as it follows the rules listed on the challenge page

  • The evaluation follows a common open-set evaluation protocol for speaker verification.
  • Make sure to use test data only for the verification task. Specifically:
    -- Test verification pairs should be processed independently of one another (no use of other test data is allowed for scoring or normalization purposes)
    -- Test data cannot be used for training (including unsupervised/self-supervised techniques)

So, for example, training an E2E model with a siamese-like loss on train/dev is OK, but adding test data when training a supervised classifier is not OK, as it breaks the "open-set" condition.
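To make that concrete, here is a rough sketch of a siamese-style objective that only touches train/dev data. The encoder, batch, and pair labels are placeholders, not the actual baseline code:

```python
# Illustrative sketch: siamese-style training objective on train/dev pairs only
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(80, 192))   # stand-in for an ECAPA encoder
criterion = nn.CosineEmbeddingLoss(margin=0.2)

# Hypothetical batch: two recordings per pair, target +1 (same infant) / -1 (different)
x1 = torch.randn(8, 80)
x2 = torch.randn(8, 80)
target = torch.tensor([1, 1, -1, -1, 1, -1, 1, -1], dtype=torch.float)

emb1, emb2 = encoder(x1), encoder(x2)
loss = criterion(emb1, emb2, target)
loss.backward()
```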
Let me know if you have other questions
