Phoneme recognition

#86
by dg96 - opened

Is it possible to use Whisper to output a phoneme transcription instead of a text transcription?

Hi @sanchit-gandhi
Thank you for pointing me towards the discussions page.

If I understand correctly, Whisper currently cannot output phoneme transcriptions. However, one response said that one could train a Whisper model on audio + phoneme transcriptions instead of the recommended audio + text transcriptions. Is this possible? I ask because, to fine-tune Whisper on audio + phoneme transcriptions, I would be using the pretrained feature extractor and tokenizer as per your blog https://huggingface.co/blog/fine-tune-whisper.
Please let me know your thoughts on this.

Thanks!

Hey @dg96 - that's a cool proposition! I think we could fine-tune Whisper for phoneme transcriptions. The feature extractor can stay the same (we can pre-process the audio in the same way as before). We'd need to change the tokenizer to handle the new vocabulary. Namely, what we need to do is build a new tokenizer over the possible phonemes. For this, you can follow this guide: https://huggingface.co/learn/nlp-course/chapter6/8?fw=pt
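To make the "vocabulary over the possible phonemes" idea concrete, here is a minimal pure-Python sketch of the mapping such a tokenizer would hold. The phoneme inventory, the special-token names, and the helper functions `build_vocab`, `encode`, and `decode` are all assumptions for illustration. They are not part of Whisper or Transformers, and this does not replace building a real tokenizer with the guide above.

```python
# Hypothetical phoneme inventory (ARPAbet-style symbols, for illustration only).
PHONEMES = ["AA", "AE", "AH", "B", "D", "HH", "K", "L", "OW", "T"]
# Hypothetical special tokens; a real Whisper tokenizer defines its own set.
SPECIAL_TOKENS = ["<pad>", "<unk>", "<bos>", "<eos>"]


def build_vocab(phonemes, special_tokens):
    """Assign a unique integer id to every special token and phoneme."""
    vocab = {}
    for token in list(special_tokens) + sorted(set(phonemes)):
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab


def encode(phoneme_string, vocab):
    """Turn a space-separated phoneme string into ids, wrapped in <bos>/<eos>."""
    ids = [vocab["<bos>"]]
    for symbol in phoneme_string.split():
        # Symbols outside the inventory fall back to <unk>.
        ids.append(vocab.get(symbol, vocab["<unk>"]))
    ids.append(vocab["<eos>"])
    return ids


def decode(ids, vocab, skip_tokens=SPECIAL_TOKENS):
    """Turn ids back into a space-separated phoneme string, dropping special tokens."""
    id_to_token = {index: token for token, index in vocab.items()}
    tokens = [id_to_token[i] for i in ids]
    return " ".join(t for t in tokens if t not in skip_tokens)


vocab = build_vocab(PHONEMES, SPECIAL_TOKENS)
ids = encode("HH AH L OW", vocab)
print(ids)
print(decode(ids, vocab))
```

The point is that the vocabulary becomes tiny (a few dozen symbols plus special tokens) compared with Whisper's sub-word vocabulary, which is why the embedding layer has to be resized afterwards.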

You should then have a tokenizer that you can load with HF Transformers:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(...)

Once we have built our new tokenizer, we need to make sure that the Whisper embedding layer has exactly one entry per token in the new vocabulary:

# resize the token embeddings to match the size of the phoneme vocabulary
model.resize_token_embeddings(len(tokenizer))

Once we've done that, the Whisper model will now be set to predict phonemes instead of sub-word tokens. You can then fine-tune the model on an (audio, phoneme) dataset in exactly the same way as the fine-tuning blog describes. You might want to change the compute_metrics function to a more applicable metric for phoneme prediction than WER.
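One candidate for a more applicable metric is a phoneme error rate (PER): the edit distance between predicted and reference phoneme sequences, divided by the total number of reference phonemes. The sketch below is a plain-Python illustration, assuming the decoded predictions and labels are strings of space-separated phonemes. The function names `edit_distance` and `phoneme_error_rate` are made up for this example and are not part of the blog's code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (insert/delete/substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_token in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_token in enumerate(hyp, start=1):
            cost = 0 if ref_token == hyp_token else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phoneme_error_rate(references, predictions):
    """Total edits divided by total reference phonemes over a batch of strings."""
    total_edits = 0
    total_ref_phonemes = 0
    for ref, hyp in zip(references, predictions):
        ref_tokens = ref.split()
        hyp_tokens = hyp.split()
        total_edits += edit_distance(ref_tokens, hyp_tokens)
        total_ref_phonemes += len(ref_tokens)
    if total_ref_phonemes == 0:
        raise ValueError("references contain no phonemes")
    return total_edits / total_ref_phonemes


# One substituted phoneme out of four reference phonemes.
print(phoneme_error_rate(["HH AH L OW"], ["HH AE L OW"]))
```

Inside a `compute_metrics` function, the decoded prediction and label strings could be passed to `phoneme_error_rate` where the WER computation would otherwise go.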

I am not an expert on Whisper, but a related use case needs timing data as well. For example, to control a 3D animated character's facial expressions, you need the phonemes and the timing of each phoneme. Otherwise the lip sync can drift out of alignment.
