Hinglish Words

#9
by aayush-kala1 - opened

Hi, we have a use case where most of the input text is in Hinglish. Since the model was trained on English data, it classifies legitimate Hinglish words and sentences as gibberish. Is there any way I can retrain the model with Hinglish words in addition to English words?

Yes, try fine-tuning the model on your own dataset, with examples in each of the four categories: clean, mild gibberish, word salad, and noise. It might not work as well because the tokenizer is optimized for English, but with enough training, and with the embeddings kept trainable as well, it should do the trick.

Do let me know what you come up with, excited for the project.
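For anyone preparing such a dataset, here is a minimal sketch of the data side only: mapping the four category names to integer ids and holding out an evaluation split per category. The function name `encode_and_split`, the id order, and the sample rows are all assumptions made for illustration; the id order in particular should be checked against the model's own config before training.

```python
import random
from collections import defaultdict

# The four categories named in the reply above. The id order is an assumption;
# verify it against the model's own label mapping before fine-tuning.
LABELS = ["clean", "mild gibberish", "word salad", "noise"]
LABEL2ID = {name: i for i, name in enumerate(LABELS)}


def encode_and_split(examples, eval_fraction=0.2, seed=0):
    """Map label names to ids and make a per-label (stratified) train/eval split."""
    by_label = defaultdict(list)
    for text, label in examples:
        if label not in LABEL2ID:
            raise ValueError(f"unknown label: {label!r}")
        by_label[label].append({"text": text, "label": LABEL2ID[label]})

    # Every category needs at least one example, otherwise the model never sees it.
    missing = [name for name in LABELS if name not in by_label]
    if missing:
        raise ValueError(f"no examples for: {missing}")

    rng = random.Random(seed)
    train, evaluation = [], []
    for name in LABELS:
        rows = by_label[name]
        rng.shuffle(rows)
        # Hold out at least one row per label when there is more than one row.
        n_eval = max(1, int(len(rows) * eval_fraction)) if len(rows) > 1 else 0
        evaluation.extend(rows[:n_eval])
        train.extend(rows[n_eval:])
    return train, evaluation


# Hypothetical Hinglish rows, for illustration only.
examples = [
    ("kal milte hain office ke baad", "clean"),
    ("aaj mausam bahut accha hai", "clean"),
    ("mujhe kal wo cheez dena jo tha", "mild gibberish"),
    ("kya tum kal kaha jaoge kyu", "mild gibberish"),
    ("paani kitab chalo neela kal", "word salad"),
    ("gaadi khana upar meetha jaldi", "word salad"),
    ("asdkj qwpoe zxmn", "noise"),
    ("ghjk lkjh qqwe rty", "noise"),
]

train_rows, eval_rows = encode_and_split(examples, eval_fraction=0.5)
print(len(train_rows), len(eval_rows))
```

Splitting per category keeps rare classes from vanishing from the evaluation set, which matters when the Hinglish data is small.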

Is there any way I could get your training data, so that I could combine it with ours and then fine-tune the model?

I might not be able to share it, as I am bound by an NDA. In any case, the model is already trained on that data, so as long as you don't fine-tune for so long that it forgets what it learned earlier, it should be fine!
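One way to act on the "not too long" advice, sketched under assumptions: keep a small held-out set of English examples, measure the original model's accuracy on it before fine-tuning, and stop once accuracy on that set drops too far. The helper name `should_stop` and the `max_drop` threshold are hypothetical, and the accuracy values below are made up for illustration.

```python
def should_stop(english_acc_history, baseline_acc, max_drop=0.02):
    """Return True once the latest held-out English accuracy has fallen
    more than max_drop below the accuracy measured before fine-tuning."""
    if not english_acc_history:
        return False  # nothing measured yet, keep training
    return baseline_acc - english_acc_history[-1] > max_drop


# Hypothetical numbers: accuracy on the English hold-out set after each epoch.
baseline = 0.95
history = []
stopped_at = None
for epoch, acc in enumerate([0.95, 0.94, 0.90, 0.85], start=1):
    history.append(acc)
    if should_stop(history, baseline):
        stopped_at = epoch
        break

print(stopped_at)
```

The threshold is a judgment call: a looser `max_drop` trades some English accuracy for more Hinglish training.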
