Finetuning data format (alpaca or chat)

#33
by halilergul1 - opened

Hi,

I'm a little confused about how to properly feed custom data for fine-tuning this model on a specific task. I am familiar with the Alpaca format (instruction, input, response) and the Mistral format (where the prompt is surrounded by [INST] and [/INST] tokens).

Does anybody have an idea what to use for this model? Should it be the following style:

<start_of_turn>user
please write a hello world program<end_of_turn>
<start_of_turn>model

I assume "model" after second start_of_turn token represents the answer or response of model?

Thanks a lot!

Hi, the format should be used as described in the instructions:

<start_of_turn>user
please write a hello world program<end_of_turn>
<start_of_turn>model

Note the use of the angle brackets (< >) around the special tokens and the exact newlines/spacing; both are needed for best performance.
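For fine-tuning, each training example should contain the full exchange, with both turns closed by <end_of_turn>. A minimal sketch, assuming you pack examples as plain text (your trainer may instead accept a messages list and apply the chat template itself):

example = (
    "<start_of_turn>user\n"
    "please write a hello world program<end_of_turn>\n"
    "<start_of_turn>model\n"
    'print("Hello, world!")<end_of_turn>\n'
)

The <end_of_turn> closing the model turn matters, so the model learns where a response ends.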

"chat_template": "{% if messages[0]['role'] == 'system' %}{{ raise_exception('System role not supported') }}{% endif
%}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{
raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if
(message['role'] == 'assistant') %}{% set role = 'model' %}{% else %}{% set role = message['role'] %}{% endif %}{{ '
' + role + '\n' + message['content'] | trim + '\n' }}{% endfor %}{% if add_generation_prompt
%}{{'model\n'}}{% endif %}",
"clean_up_tokenization_spaces": false,
"eos_token": "",
"legacy": null,
"model_max_length": 1000000000000000019884624838656,
"pad_token": "",
"sp_model_kwargs": {},
"spaces_between_special_tokens": false,
"tokenizer_class": "GemmaTokenizer",
"unk_token": "",
"use_default_system_prompt": false
}
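Three things to note in that template: it rejects a leading system message, it requires roles to alternate user/assistant, and it renders the "assistant" role as "model". So in a hypothetical messages list like

messages = [
    {"role": "user", "content": "please write a hello world program"},
    {"role": "assistant", "content": 'print("Hello, world!")'},
]

the assistant entry is rendered as a <start_of_turn>model turn, while a leading {"role": "system", ...} entry raises "System role not supported".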

prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
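For a self-contained run, here is a sketch (the google/gemma-7b-it checkpoint name is an assumption; substitute whichever Gemma variant you are tuning):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b-it")

chat = [{"role": "user", "content": "please write a hello world program"}]

# tokenize=False returns the formatted string; add_generation_prompt=True
# appends the trailing "<start_of_turn>model\n" so the reply starts there.
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
print(prompt)
# <start_of_turn>user
# please write a hello world program<end_of_turn>
# <start_of_turn>model

Depending on the checkpoint, a leading <bos> token may also be prepended to the rendered string.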

You have to compose the prompt in the format shown above.

Thanks a lot


Yes, indeed, hope this helped!

suryabhupa changed discussion status to closed
