Dataset format for fine tuning

by andreaKIM - opened

Hello. Is there a recommended format for fine-tuning this model?
Can I use the Mistral model's prompt format, or is there another recommended prompt format?

I used the following prompt to fine-tune:
<|system|>\n {instruction} \n<|user|>\n{query}\n<|assistant|>\n{response}
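
In case it helps anyone reading, here is a minimal sketch of how one training string could be built from that template. It assumes the \n in the template stands for a real newline; the build_example helper and its eos_token parameter are hypothetical names for illustration, not part of any library. Appending an end-of-sequence marker is optional here and is an assumption on my side, not something the template above includes.

```python
# Template copied from the post above; "\n" is assumed to mean a real newline.
TEMPLATE = "<|system|>\n {instruction} \n<|user|>\n{query}\n<|assistant|>\n{response}"


def build_example(instruction, query, response, eos_token=""):
    """Fill the template with one sample.

    eos_token is a hypothetical optional suffix (for example "</s>");
    by default nothing is appended, matching the template as posted.
    """
    text = TEMPLATE.format(instruction=instruction, query=query, response=response)
    return text + eos_token


example = build_example(
    instruction="Answer briefly.",
    query="What is a right angle?",
    response="An angle of 90 degrees.",
)
print(example)
```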

I had problems making the model stop generating content. So I found the solution in this link (

This change, made before starting the training, solved my problem:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
# Previously: tokenizer.pad_token = tokenizer.eos_token

tokenizer.pad_token = tokenizer.unk_token  # <---- changed
tokenizer.padding_side = "right"  # <---- changed

Hi! Are there any limits on the length of the inputs/outputs when fine-tuning, like the limits found in OpenAI models?

It's my first major fine-tuning, so some of what I say may not make sense. As for the input length, depending on the configuration you need a multi-GPU setup, otherwise you'll be limited. As for the output, I noticed that the more thorough the fine-tuning (more checkpoints, more training time), the longer the outputs have been. But I repeat, this is my first major fine-tuning, and until now I was stuck on the model not generating the eos_token.

I used the following prompt to fine-tune:
<|system|>\n {instruction} \n<|user|>\n{query}\n<|assistant|>\n{response}

What did your prepared dataset look like for finetuning? Was it a .csv file with a single column in this format?

Yes, single column:
"<|system|>\n {instruction} \n<|user|>\n{query}\n<|assistant|>\n{response}"
"<|system|>\n {instruction} \n<|user|>\n{query}\n<|assistant|>\n{response}"
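
For anyone preparing a similar file, this is a rough sketch of writing such a single-column CSV with Python's standard csv module. The column name "text", the samples list, and the write_single_column_csv helper are assumptions made for illustration; use whatever column name your training script expects. Every field is quoted so that the newlines inside each prompt survive a round trip.

```python
import csv
import io

# Template from this thread; "\n" is assumed to mean a real newline.
TEMPLATE = "<|system|>\n {instruction} \n<|user|>\n{query}\n<|assistant|>\n{response}"

# Hypothetical samples, only here to make the sketch runnable.
samples = [
    {"instruction": "Answer briefly.", "query": "q1", "response": "a1"},
    {"instruction": "Answer briefly.", "query": "q2", "response": "a2"},
]


def write_single_column_csv(rows, fileobj, column="text"):
    """Write one formatted prompt per row under a single header column."""
    writer = csv.writer(fileobj, quoting=csv.QUOTE_ALL)
    writer.writerow([column])
    for row in rows:
        writer.writerow([TEMPLATE.format(**row)])


# An in-memory buffer keeps the sketch self-contained; for a real file use
# open("train.csv", "w", newline="", encoding="utf-8") instead.
buffer = io.StringIO(newline="")
write_single_column_csv(samples, buffer)
print(buffer.getvalue())
```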

How much RAM is needed to run this model locally?

Can someone tell me what format I should use in order to fine tune the model to answer questions from a specific document?
For example, is the following correct?

data = [
    {"role": "system", "content": '''Text:

A right triangle is a triangle with a right angle.
A right angle equals 90 degrees.
Based on the above Text, answer the following question. Your answer should come from the Text only. Do not answer questions that are irrelevant to the given Text.'''},
    {"role": "user", "content": "question1"},
    {"role": "assistant", "content": "answer1."},
    {"role": "user", "content": "question2"},
    {"role": "assistant", "content": "answer2."},
]
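
To connect a role/content list like this with the <|system|> / <|user|> / <|assistant|> layout mentioned earlier in the thread, here is a small hand-written sketch that flattens the messages into one training string. It only approximates that layout; render_chat and ROLE_TAGS are hypothetical names, and the "</s>" end-of-sequence default is an assumption. If the tokenizer ships its own chat template, that should take precedence over this.

```python
# Role tags as used in the prompt format quoted in this thread.
ROLE_TAGS = {
    "system": "<|system|>",
    "user": "<|user|>",
    "assistant": "<|assistant|>",
}


def render_chat(messages, eos_token="</s>"):
    """Flatten a list of {"role", "content"} dicts into one prompt string.

    eos_token is an assumed end-of-sequence marker appended after every turn.
    """
    parts = []
    for message in messages:
        tag = ROLE_TAGS[message["role"]]  # raises KeyError on an unknown role
        parts.append(f"{tag}\n{message['content']}{eos_token}")
    return "\n".join(parts)


messages = [
    {"role": "system", "content": "Answer from the given Text only."},
    {"role": "user", "content": "question1"},
    {"role": "assistant", "content": "answer1."},
]
print(render_chat(messages))
```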