Model vocab size and tokenizer vocab size are not the same. Is this a problem for training?

#4
by GokhanAI - opened

Which one is correct, and how do I solve this problem?
model.config.vocab_size = 157824
tokenizer.vocab_size = 157797

We tried to solve the problem like this:

"""
if model.config.vocab_size != tokenizer.vocab_size:
    model.resize_token_embeddings(len(tokenizer))
"""

But could this solution cause problems during training? It also reduces the embedding vocab size.
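
For illustration, a minimal sketch (assuming the same model and tokenizer as above) of a safer guard that only grows the embedding matrix and never shrinks it:

# Sketch: only resize when the tokenizer actually needs more rows.
# Shrinking would silently discard the trailing (padded) embedding rows.
embedding_rows = model.get_input_embeddings().weight.shape[0]
if len(tokenizer) > embedding_rows:
    model.resize_token_embeddings(len(tokenizer))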

Dynamo AI org

Hey GokhanAI!

You don't need to resize the embedding matrix. It's fine if the model embedding matrix is longer than the tokenizer's vocab size. We pad the model embedding matrix to a multiple of 64 to take advantage of Ampere GPUs, which is why model.config.vocab_size > tokenizer.vocab_size.
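
For illustration, a quick check (a sketch; same model and tokenizer as in the question) that the embedding matrix is the tokenizer vocab padded up to a multiple of 64:

embedding_rows = model.get_input_embeddings().weight.shape[0]
print(embedding_rows)                    # 157824
print(tokenizer.vocab_size)              # 157797
print(embedding_rows % 64 == 0)          # True: padded for Ampere tensor cores
assert embedding_rows >= len(tokenizer)  # every token id has an embedding row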

Dynamo AI org

Please do let me know if that works, otherwise I can provide a short snippet on how to further fine-tune the model for a downstream task.

Thank you for your reply.

I want to implement the /huggingface/alignment-handbook SFT and DPO pipelines, adapting the dataset they suggest to Turkish. I aim to build a chat structure. Do you think I can succeed with this? I'm afraid I don't know how well it will work in Turkish, but I would like your thoughts and advice.

Dynamo AI org

Hey Gokhan! Yes, I believe it should work well in Turkish. However, this depends heavily on your dataset and its quality. One chat structure you could consider following is HuggingFaceH4/zephyr-7b-alpha.

That would look something like:

messages = [
    {
        "role": "system",
        "content": "Sen arkadaş canlısı bir sohbet robotusun.",  # "You are a friendly chatbot."
    },
    {"role": "user", "content": "Adın ne senin??"},  # "What's your name??"
]
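
To turn those messages into the zephyr-style prompt string, something like this should work (a sketch; it assumes your tokenizer defines a chat template, as HuggingFaceH4/zephyr-7b-alpha's does):

from transformers import AutoTokenizer

# Load a tokenizer that ships a chat template; swap in your own model's
# tokenizer if it defines one.
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-alpha")

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,  # append the header for the assistant's reply
)
print(prompt)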

Please do let me know if that works, otherwise I can provide a short snippet on how to further fine-tune the model for a downstream task.

Hello, training with /huggingface/alignment-handbook did not give good results at all. I also tried different SFT methods, but the model either produces no answer or produces completely meaningless sentences. I trained on the data in Turkish format. How can you help with this? Do you have any SFT/DPO code you would recommend?
