EER very high in the training set with the public baseline

#4 opened by dhoa

I made some modifications to the baseline notebook to compute the EER on the training set and found that the score is quite poor (0.5). According to the CryCeleb paper, the development and test sets contain relatively easy data. Still, I find it strange that, despite being trained on this very training set, the model seems unable to memorize anything.

You can find the code below:
https://github.com/dienhoa/cryceleb_nb/blob/master/cryceleb.ipynb
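
For reference, the core of my EER computation looks roughly like this (a simplified sketch, not the exact notebook code; as in the baseline, I score birth vs. discharge embeddings with cosine similarity):

```python
import numpy as np
from scipy.optimize import brentq
from scipy.interpolate import interp1d
from sklearn.metrics import roc_curve

def compute_eer(scores, labels):
    """Equal Error Rate from verification scores (label 1 = same baby, 0 = different)."""
    fpr, tpr, _ = roc_curve(labels, scores)
    # EER is the operating point where false acceptance equals false rejection
    return brentq(lambda x: 1.0 - x - interp1d(fpr, tpr)(x), 0.0, 1.0)

def cosine_score(emb_a, emb_b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(emb_a, emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))
```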

Did I do something wrong here? Thanks


Interesting!
At first glance, I do not see an issue in your code, but I do not recall ever testing EER on the train set. 0.5 is surprising indeed... Perhaps @learningcry may have some ideas.
Also, the fine-tuning code will soon be available, so it may be easier to reproduce and analyze this more deeply.
We should point out that the current fine-tuning was not extensively tuned, so significant improvements are likely possible from better fine-tuning, different models, etc.

Thanks for the response @gorinars. I have some questions about how the dataset was split.

As mentioned in the paper:

It’s important to emphasize that the dev and test infants were not chosen randomly. Instead, they were randomly sampled from the top 200 infants with the highest cosine similarities between their birth and discharge embeddings, as calculated using the baseline model described in Section V. We opted for these relatively easier pairs due to the difficulty in recognizing an infant in an unseen recording within this dataset.

  1. I guess you ran some EER/similarity calculations for the infants (possibly with data not included in the released dataset) so that you could select the 200 top-scoring ones for dev and test?
  2. Is the baseline mentioned here the ECAPA-TDNN fine-tuned on the training set?

Thanks

Thanks for your questions @dhoa! We should probably add a bit more detail to the paper to avoid confusion. For now, let me answer here:

  1. 200 infants were selected for dev/test using the non-finetuned ECAPA (row 1 in Table V). You can easily adapt the notebook to use this model by loading speechbrain/spkrec-ecapa-voxceleb instead of Ubenwa/ecapa-voxceleb-ft-cryceleb (see the sketch after this list).

  2. If you are referring to this paragraph of the paper, then the answer is no; we will clarify this in the next revision of the paper. Data selection was done with the first (non-finetuned) baseline.
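
For item 1, a minimal sketch of swapping in the non-finetuned model (this uses SpeechBrain's standard EncoderClassifier API; the waveform below is just a placeholder):

```python
import torch
from speechbrain.pretrained import EncoderClassifier

# load the non-finetuned ECAPA (row 1 in Table V) instead of the
# fine-tuned Ubenwa/ecapa-voxceleb-ft-cryceleb checkpoint
encoder = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    run_opts={"device": "cuda" if torch.cuda.is_available() else "cpu"},
)

signal = torch.rand(1, 16000)  # placeholder: 1 s of audio at 16 kHz
embedding = encoder.encode_batch(signal)  # shape (1, 1, 192)
```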

Another thing to point out is that fine-tuning was done on the birth-only subset of the train split while keeping the discharge recordings for validation (see the sketch at the end of this message). Overall validation accuracy was quite low because many babies were really hard to classify.
This may be attributed to the fact that some identities have too little data per period.
It could also be explained by the sheer difficulty of the challenge... We are mostly given one recording for training and one for validation from the same class (out of 348). At the same time, cry contains a lot of variability (for example, the reason for crying: hunger, pain, etc.).
The above reasons might actually explain the 50% EER you get on the train set.
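
Concretely, the split looks roughly like this (a sketch only; the column names are my shorthand for the metadata.csv layout used by the baseline notebook):

```python
import pandas as pd

# sketch of the fine-tuning split described above; column names
# ("period", "split") are assumed from the baseline metadata layout
metadata = pd.read_csv("metadata.csv")
train = metadata[metadata["split"] == "train"]

train_birth = train[train["period"] == "B"]    # birth recordings -> used for fine-tuning
val_discharge = train[train["period"] == "D"]  # discharge recordings -> held out for validation
```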

Yes, it's a tough one to classify. None of my tricks are doing the job right now :)) I'm playing with a Siamese network, but the EER just keeps bouncing around.
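
For context, here is roughly the kind of Siamese scoring head I'm experimenting with (just a sketch, not my exact code; the 192-dim frozen ECAPA embeddings as input and the contrastive loss are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseHead(nn.Module):
    """Small projection head that scores pairs of (frozen) cry embeddings."""
    def __init__(self, emb_dim=192, hidden=128):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(emb_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
        )

    def forward(self, emb_a, emb_b):
        a, b = self.proj(emb_a), self.proj(emb_b)
        return F.cosine_similarity(a, b, dim=-1)  # similarity score in [-1, 1]

def contrastive_loss(score, label, margin=0.5):
    # label: 1 = same baby, 0 = different baby
    pos = label * (1.0 - score)                                 # pull same-baby pairs together
    neg = (1.0 - label) * torch.clamp(score - margin, min=0.0)  # push different pairs below margin
    return (pos + neg).mean()
```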
