Fine-tuning dataset template

#98
by Lalith16 - opened

Can anyone help me with the dataset for the Mistral 7B instruct model? I want to see the dataset template for fine-tuning.

Hello Lalith,

I recently fine-tuned this model. I highly recommend you follow the format the creators used to train the model. As they state in their README, here is how you should organize the template for your custom dataset:

text = "<s>[INST] What is your favorite condiment? [/INST] Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavor to whatever I'm cooking up in the kitchen!</s>"

First, do not forget to add the beginning (<s>) and end (</s>) tokens to your prompt format. Then, between the [INST] and [/INST] tokens, you should put the text for whatever your fine-tuned model is supposed to accomplish; inside these tokens, give both your instruction and your input. After [/INST], put the response (your label, ground truth, or whatever you want your model to produce as output).
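
For illustration, here is a minimal sketch of how you might assemble that string from your own instruction/response pairs (the function name and arguments are placeholders, not something from the model's README):

def format_example(instruction: str, response: str) -> str:
    # Wrap the instruction in [INST]...[/INST] and bracket the whole
    # turn with the <s> / </s> sequence tokens, as in the template above.
    return f"<s>[INST] {instruction} [/INST] {response}</s>"

print(format_example(
    "What is your favorite condiment?",
    "Well, I'm quite partial to a good squeeze of fresh lemon juice.",
))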

I hope this helps!

Thank you for this. I have done this for my supervised learning dataset and tried fine-tuning, but there are other problems with training loss and the accuracy of results for smaller datasets (under 5000 questions and answers).

@halilergul1 I have done fine-tuning following exactly what you have mentioned. What I noticed is that both the base instruct model and the fine-tuned instruct model repeat the instruction. Here is an example:
"[INST] write 5 points about India [/INST]". The model would repeat this and then begin the generation.

@Lalith16
You can also try the code below, adding more examples, to see how the built-in chat template formats the data points.

from transformers import AutoTokenizer

# model_path should point to the Mistral 7B Instruct checkpoint you are using
tokenizer = AutoTokenizer.from_pretrained(model_path)
messages = [
    {"role": "user", "content": "What is your favourite condiment?"},
    {"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"},
    {"role": "user", "content": "Do you have mayonnaise recipes?"},
    {"role": "assistant", "content": "Not sure"},
    {"role": "user", "content": "Do you have mayonnaise recipes?"},
    {"role": "assistant", "content": "Not sure!"},
]
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
print(tokenizer.decode(tokenized_chat[0]))

Output:
[INST] What is your favourite condiment? [/INST]Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen! [INST] Do you have mayonnaise recipes? [/INST]Not sure [INST] Do you have mayonnaise recipes? [/INST]Not sure!

Hello @shriML,

1- First, it is important to make sure you run inference correctly. I recommend using PyTorch's inference mode, like the following:

with torch.inference_mode():
    outputs = model.generate(**inputs, max_new_tokens=256, temperature=0.0001)

2- Repeating the instructions was a common issue back when I was fine-tuning LLaMA 1 with Alpaca-LoRA, and I heard of others experiencing the same thing. The main rationale is the amount of data you train on: for the weights to be updated correctly, you really need a lot of data. There is no specific threshold for sure, but I would recommend having at least 5000 data instances.

3- Another reason I suspect is the target modules you update while using LoRA or QLoRA. Make sure to update the target modules recommended by the community. I do not know your fine-tuning method for sure (whether you use LoRA, an adapter, etc.), but this could be a potential reason as well.
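
For reference, here is a minimal sketch of such a configuration with the peft library, assuming `model` is the already-loaded base model (the rank, alpha, and dropout values are illustrative, not recommendations):

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,                  # illustrative LoRA rank
    lora_alpha=32,         # illustrative scaling factor
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    # the attention projections are the modules most commonly targeted
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora_config)  # wrap the base model with LoRA adapters
model.print_trainable_parameters()          # sanity-check how many weights will train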

Hope this helps!

Hey @halilergul1, thanks for the response. I am currently doing full-model fine-tuning using the Stanford Alpaca code via FSDP torchrun.

  1. I tried with inference_mode() but saw no improvement. Here is what I tried:

    with torch.inference_mode():
        generated_ids = model.generate(
            **model_inputs,
            max_new_tokens=1000,
            do_sample=True,
            top_k=100,
            top_p=0.95,
            no_repeat_ngram_size=2,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id,
            temperature=0.01,
        )

  2. I do have around 5000 training data points. Not sure if the complete model training is causing the issue.

  3. I had tried PEFT LoRA as well for Mistral Instruct, updating the target modules below. The training loss was quite close to zero, but when prompted with a training input, the output was nowhere near the expected response.

    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
        "lm_head",
    ]

Also, any idea why the base instruct model would repeat the instruction? Could you point me to any resources for fine-tuning LLMs? I found the Stanford Alpaca code to be a great resource.

Hi again @shriML, sorry for the late reply; things are quite busy.

Well, I am not an expert for sure, but here are my ideas:
1- This means your problem has almost nothing to do with how you run inference.
2- It is always to your advantage to increase the number of training points for such huge models, but if you have at least 5000, I do not think lack of data is the problem. Also, which type of quantization do you use to load the model parameters, and through which method do you update them? Sometimes this can also be the source of problems like the one you face.
3- PEFT LoRA worked well in my case, so it is interesting that it did not work for you. Also, do not expect training performance (loss convergence, etc.) to be reciprocated by good model responses; training and generation are quite different when it comes to LLMs. I also recommend going back to your dataset preparation step and checking, one by one, whether the data and the way you feed it to the model during training are in the appropriate format. In my other fine-tuning attempts, I found that if the data loading and formatting during training are not well organized, the model either repeats itself like a parrot or produces garbage.
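
As a quick sanity check of that last point, you can decode a few already-tokenized training samples and read them back. This sketch assumes a `train_dataset` whose items carry an `input_ids` field; adjust the names to your pipeline:

# Decode a few training samples to verify the <s>[INST] ... [/INST] ... </s>
# format survived preprocessing intact (the names here are placeholders).
for i in range(3):
    print(tokenizer.decode(train_dataset[i]["input_ids"]))
    print("-" * 40)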

Hello, can you please give an example dataset row with the beginning (<s>) and end (</s>) tokens and the [INST], [/INST] tags? Also, if we are talking about a custom local dataset, should it be a sequence of JSON objects?
For example, something like this:

{"text": "[INST] Question1 [/INST] Answer1 "}
{"text": "[INST] Question2 [/INST] Answer2 "}
{"text": "[INST] Question3 [/INST] Answer3 "}

Hi,

The QA structure you wrote as an example is how your data samples should look. Let me give a basic example. Suppose you create a custom dataset for sentiment classification. Here is roughly how a single data object should look:

{"text": "<s>[INST] Determine the sentiment type of the following sentence: I hate this movie! [/INST] Negative </s>"}

As for your second question, what do you mean exactly by "sequences of JSON objects"? You can store the data in whatever format you like, such as a dataframe, dictionary, or JSON. BUT, the truth of the matter is that when you feed a data row/sample/object to your model during training, you must feed it as a string/text object, iteratively, via something like a dataloader.

OK, perfect. Regarding my second question, you are totally right, thanks.

"BUT, the truth of the matter is when you feed your data row/sample/object into your training model, then you must feed it like string/text objects via dataloader like-thing iteratively."
Maybe this is what im doing wrong, if you could share a related code block for mistral or any other model to see if we are on the same page, it would be very helpfull.

Any single code block would be of limited use for your case; there is hardly one method that fits all cases. But as a rule of thumb, the Hugging Face Dataset object is quite useful, and I used it myself. Check this out: https://huggingface.co/docs/datasets/v2.2.1/en/access
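
As a starting point, a minimal sketch of loading the JSON-lines layout shown above with the datasets library might look like this (the file name train.jsonl is a placeholder):

from datasets import load_dataset

# Load a local JSON-lines file where each line is {"text": "<s>[INST] ... [/INST] ... </s>"}
dataset = load_dataset("json", data_files="train.jsonl", split="train")
print(dataset[0]["text"])  # each sample is a plain string, ready for the tokenizer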
