Is there any way to increase the vocabulary of the tokenizer and use it to fine-tune the model on a new language?

#120
by Tejaswi006 - opened

Hi, I'm trying to fine-tune Mistral on my mother tongue, Tamil, but after fine-tuning, the output doesn't make any sense. I found out that the tokenizer is not able to handle Tamil well. So, is there any way to increase the vocab of the tokenizer?

Hey @Tejaswi006 ,

I just tried the base Mistral-Instruct model on some Tamil text from Wikipedia, and looking at the results, it doesn't seem to understand the language much.

However, since it's able to generate text in the Tamil script, the tokenizer should work as-is. I think this calls for more training on Tamil corpora rather than tokenizer modifications.
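
You can check this quickly: Mistral's SentencePiece tokenizer has byte fallback, so Tamil text gets split into many small byte-level pieces but still round-trips losslessly. A minimal sketch (the checkpoint ID is my assumption; use whichever Mistral repo you're working with):

from transformers import AutoTokenizer

# Assumed checkpoint; any Mistral tokenizer should behave the same way here
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

text = "தமிழ் ஒரு செம்மொழி"  # "Tamil is a classical language"
tokens = tokenizer.tokenize(text)
ids = tokenizer.encode(text, add_special_tokens=False)

# Expect many short, byte-level pieces, but decoding should return the input unchanged
print(len(tokens), tokens)
print(tokenizer.decode(ids) == text)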

[Screenshots: Mistral-Instruct generations on Tamil Wikipedia text]
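
For reference, this is roughly how I produced the outputs above (a sketch; the checkpoint, prompt, and generation settings are my assumptions):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Any Tamil prompt works; this one asks "What is Tamil?"
messages = [{"role": "user", "content": "தமிழ் என்றால் என்ன?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=200)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))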

If you still want to add some new tokens to the tokenizer, you should be able to do so as follows:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # assumed; use the repo you are fine-tuning
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

new_tokens = ["new_tok1", "my_new-tok2"]
num_added_toks = tokenizer.add_tokens(new_tokens)
print("We have added", num_added_toks, "tokens")

# Notice: resize_token_embeddings expects the full size of the new vocabulary, i.e., the length of the tokenizer.
model.resize_token_embeddings(len(tokenizer))
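
One caveat: the embedding rows appended by resize_token_embeddings are randomly initialized, so the new tokens carry no meaning until you train on data that contains them. A common trick, sketched below under the assumption that the variables from the snippet above are still in scope, is to initialize the new rows to the mean of the existing embeddings:

import torch

with torch.no_grad():
    # New rows were appended at the end of the input embedding matrix
    emb = model.get_input_embeddings().weight
    emb[-num_added_toks:] = emb[:-num_added_toks].mean(dim=0)

    # Mistral's LM head is not tied to the input embeddings, so initialize it as well
    head = model.get_output_embeddings().weight
    head[-num_added_toks:] = head[:-num_added_toks].mean(dim=0)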


Thanks, I will look into it. Was the method you used instruct fine-tuning?

I was trying to train the model on the Amharic language, but it generates text which does not make any sense.
