Adding new tokens to a fine-tuned model

#64
by Santi69 - opened

I am fine-tuning a model for a dialect of Arabic, following the great tutorial https://huggingface.co/blog/fine-tune-whisper by @sanchit-gandhi.

The problem with Arabic is that some words differ from one dialect to another, and Whisper is focused on Modern Standard Arabic (MSA), not on its many dialects.

I would like to know whether it is possible to add new tokens (or new words), for example by editing the vocab.json file or by creating a dictionary of words specific to my dialect, and then loading the result with "WhisperTokenizer.from_pretrained" and so on.

Would this be possible?
Does anyone have a short tutorial or instructions on how to add new words to a fine-tuned Whisper model?
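
To make the question concrete, here is a rough sketch of what I have in mind. It uses the generic transformers calls add_tokens and resize_token_embeddings; I have not confirmed that this is the right approach for Whisper, and I do not know how the new token IDs interact with Whisper's special and timestamp tokens. The function names clean_word_list and extend_whisper_vocab are my own, and the checkpoint name in the usage comment is only an assumption.

```python
def clean_word_list(words):
    """Strip whitespace, drop empty entries, and de-duplicate while keeping order."""
    seen = set()
    cleaned = []
    for word in words:
        word = word.strip()
        if word and word not in seen:
            seen.add(word)
            cleaned.append(word)
    return cleaned


def extend_whisper_vocab(checkpoint, dialect_words):
    """Load a Whisper checkpoint, add words as new tokens, and resize the embeddings.

    This is a hypothetical helper, not something taken from the tutorial.
    """
    # Imported here so clean_word_list can be used without transformers installed.
    from transformers import WhisperForConditionalGeneration, WhisperTokenizer

    tokenizer = WhisperTokenizer.from_pretrained(checkpoint)
    model = WhisperForConditionalGeneration.from_pretrained(checkpoint)

    # add_tokens skips entries already in the vocabulary and returns how many were added.
    num_added = tokenizer.add_tokens(clean_word_list(dialect_words))

    # The embedding matrix must match the new vocabulary size. The added rows
    # are freshly initialized, so they only become useful after fine-tuning.
    model.resize_token_embeddings(len(tokenizer))
    return tokenizer, model, num_added


# Usage (assumed checkpoint name; my_dialect_words would be my own word list):
# tokenizer, model, num_added = extend_whisper_vocab("openai/whisper-small", my_dialect_words)
```

If this is roughly right, I assume I would then save the tokenizer with save_pretrained and continue with the fine-tuning steps from the tutorial, but I would appreciate confirmation.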

Thank you all very much for your help.
