pretrained model for audio emotion classification

#1
by PranavB - opened

Is there any pre-trained model for audio emotion classification? If not, is anyone interested in collaborating with me to build one?

Massachusetts Institute of Technology org

Cc'ing @sanchit-gandhi here

Hey @PranavB! I've tried fine-tuning AST for speech-related tasks and unfortunately the performance is not very good 😅 My conclusion is that there's too big a domain mismatch between the AST pre-training data (generic audio sounds) and speech. You can see the checkpoint I trained for language identification on the FLEURS dataset here: https://huggingface.co/sanchit-gandhi/ast-fleurs-langid-max-length-2048/tensorboard
Eval accuracy is only 17%...

IMO there's much more promise in fine-tuning Whisper, e.g. on FLEURS I get 88% eval accuracy after just 3 epochs: https://huggingface.co/sanchit-gandhi/whisper-medium-fleurs-lang-id

See related PRs here: https://github.com/huggingface/transformers/pull/21754
And here: https://github.com/huggingface/transformers/pull/21756

I think emotion classification would be cool! You can probably copy over the scripts that I used for Whisper language identification and swap the dataset for an emotion classification one.
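To give a rough idea of what swapping the task looks like, here is a minimal sketch of one training step with `WhisperForAudioClassification` from transformers. It is not the script referred to above. The label list and the tiny random config are assumptions made so the snippet runs offline, and the random tensor stands in for log-mel features that a `WhisperFeatureExtractor` would normally produce.

```python
import torch
from transformers import WhisperConfig, WhisperForAudioClassification

torch.manual_seed(0)

# Hypothetical label set, for illustration only.
EMOTIONS = ["angry", "happy", "neutral", "sad"]

# Tiny randomly initialised config so the sketch runs without downloads.
# For real fine-tuning you would load a pre-trained checkpoint instead, e.g.
# WhisperForAudioClassification.from_pretrained(
#     "openai/whisper-medium", num_labels=len(EMOTIONS))
config = WhisperConfig(
    d_model=32,
    encoder_layers=1,
    encoder_attention_heads=2,
    encoder_ffn_dim=64,
    decoder_layers=1,
    decoder_attention_heads=2,
    decoder_ffn_dim=64,
    num_mel_bins=80,
    max_source_positions=50,
    classifier_proj_size=16,
    num_labels=len(EMOTIONS),
)
model = WhisperForAudioClassification(config)

# Stand-in batch: (batch, num_mel_bins, frames). The encoder expects
# frames == 2 * max_source_positions because of its stride-2 convolution.
input_features = torch.randn(2, config.num_mel_bins, 2 * config.max_source_positions)
labels = torch.tensor([0, 3])  # indices into EMOTIONS

# One optimisation step; passing labels makes the model return a
# cross-entropy loss alongside the logits.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
model.train()
outputs = model(input_features=input_features, labels=labels)
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()

logits = outputs.logits
predicted = [EMOTIONS[i] for i in logits.argmax(dim=-1).tolist()]
```

With a real checkpoint and dataset, the same forward pass would typically sit inside a `Trainer` or a full training loop rather than a single manual step.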

@PranavB were you able to create an emotion classifier for input audio?

Hey @PranavB
You can try to get last_hidden_state from the fine-tuned AST and use these embeddings to train an nn.Linear for your classification task. This should be quick to check, and it often gives good results :)
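A minimal sketch of that linear-probe idea, assuming the embeddings have already been extracted: the random tensor below is a stand-in for `last_hidden_state`, and the helper names (`mean_pool`, `train_linear_probe`), the mean-pooling step, and the hyperparameters are illustrative choices, not something taken from this thread.

```python
import torch
from torch import nn

torch.manual_seed(0)

def mean_pool(last_hidden_state: torch.Tensor) -> torch.Tensor:
    """Average over the sequence axis: (batch, seq_len, hidden) -> (batch, hidden)."""
    return last_hidden_state.mean(dim=1)

def train_linear_probe(features, labels, num_classes, epochs=200, lr=0.1):
    """Fit a single nn.Linear on frozen features with cross-entropy loss."""
    probe = nn.Linear(features.shape[1], num_classes)
    optimizer = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(probe(features), labels)
        loss.backward()
        optimizer.step()
    return probe

# Synthetic stand-in for model(**inputs).last_hidden_state computed under
# torch.no_grad(): 64 clips, 10 frames, hidden size 16, 4 classes.
num_classes, hidden_size = 4, 16
labels = torch.randint(0, num_classes, (64,))
class_signal = nn.functional.one_hot(labels, hidden_size).float().unsqueeze(1)
fake_hidden = 0.1 * torch.randn(64, 10, hidden_size) + class_signal

features = mean_pool(fake_hidden)
probe = train_linear_probe(features, labels, num_classes)

# Accuracy on the training features, only as a sanity check for the sketch.
with torch.no_grad():
    accuracy = (probe(features).argmax(dim=-1) == labels).float().mean().item()
```

Because the backbone stays frozen, the embeddings can be computed once and cached, which is what makes this quick to try before committing to full fine-tuning.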

I think AST is likely to struggle here since it's pre-trained on generic audio sounds (rather than speech) - I would strongly advocate for using Whisper!

@sanchit-gandhi thank you for the suggestions. I'll follow your tips.

Hi @sanchit-gandhi ,
Could you kindly provide me with the link to your code for Whisper language identification? I believe it would greatly assist me in my current project which involves emotion classification, as I am exploring similar concepts and techniques.

Additionally, I have posted a related question in the Hugging Face forum, which I believe aligns with your expertise. Here is the link to my question:
https://discuss.huggingface.co/t/fine-tuning-whisper-for-audio-classification/44735

I would greatly appreciate it if you could take a moment to review it and provide your valuable suggestions.
Thank you so much
