Chat template

#5
by bartowski - opened

In the model card, you list the chat template as:

System: {System}

{Context}

User: {Question}

Assistant: {Response}

User: {Question}

Assistant:

However in the tokenizer_config.json it's:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>

{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Which one is correct? I assume the one in the model card...

The sample code from the README file seems to be aligned with the model card.

NVIDIA org
β€’
edited May 2

Sorry about the confusion here. The chat template listed in our model card is correct. We simply used the tokenizer_config.json from Llama3, which comes with their chat template. We have since updated the tokenizer_config.json and removed that chat template.
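
For reference, assembling the prompt in the model card format looks roughly like this (a minimal sketch; the helper name is just for illustration, the README sample code is the reference version):

  def build_chatqa_prompt(system, context, turns):
      # Model card format: "System: {System}\n\n{Context}\n\nUser: {Question}\n\nAssistant: {Response}"
      prompt = "System: " + system + "\n\n" + context
      for role, text in turns:
          prompt += "\n\n" + ("User: " if role == "user" else "Assistant: ") + text
      # End with "Assistant:" so the model continues from there
      return prompt + "\n\nAssistant:"

  prompt = build_chatqa_prompt(
      "This is a chat between a user and an AI assistant.",
      "Document: ...",
      [("user", "What does the document say?")],
  )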

NVIDIA org

I am using the eos_token ("<|end_of_text|>") as the pad_token.

@zihanliu I fine-tuned the model with the chat template you suggest.
I did the same thing and assigned the eos token as the pad token, like this:
tokenizer.pad_token = tokenizer.eos_token

But the model only sometimes adds the eos token; for long answers it struggles to add it.

Is it a bad approach to set pad = EOS?
Or should I add a dedicated pad token to the tokenizer and update the embeddings too?
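
Just to make that second option concrete, I mean something like this (not tested; "[PAD]" and the model id are just placeholders):

  from transformers import AutoModelForCausalLM, AutoTokenizer

  tokenizer = AutoTokenizer.from_pretrained("nvidia/Llama3-ChatQA-1.5-8B")
  model = AutoModelForCausalLM.from_pretrained("nvidia/Llama3-ChatQA-1.5-8B")

  # add a dedicated pad token and grow the embedding matrix to match
  tokenizer.add_special_tokens({"pad_token": "[PAD]"})
  model.resize_token_embeddings(len(tokenizer))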

Check this chat template.

  "chat_template": "{% for message in messages %}{% if message['role'] == 'system' %}{{ bos_token + 'System: ' + message['content'] }}{% elif message['role'] == 'user' %}{{ '\n\nUser: ' + message['content'] +  eos_token  }}{% elif message['role'] == 'assistant' %}{{ '\n\nAssistant: ' + message['content'] +  eos_token  }}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ '\n\nAssistant: ' }}{% endif %}",
NVIDIA org

Hi. I checked your chat template. The eos_token shouldn't be added after the User turn or the Assistant turn. During fine-tuning, we use the eos_token as the pad_token, and it is added at the end of the sequence to make sure each sequence in a batch has the same length.
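
For example, removing those eos_token insertions from your template gives something roughly like this (just a sketch; we don't ship a chat_template with the model):

  "chat_template": "{% for message in messages %}{% if message['role'] == 'system' %}{{ bos_token + 'System: ' + message['content'] }}{% elif message['role'] == 'user' %}{{ '\n\nUser: ' + message['content'] }}{% elif message['role'] == 'assistant' %}{{ '\n\nAssistant: ' + message['content'] }}{% endif %}{% endfor %}{% if add_generation_prompt %}{{ '\n\nAssistant: ' }}{% endif %}",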

@zihanliu so from your point of view, we don't need to add the eos token and bos token? Will the sequences in the data be padded to match the long ones?
Let's say I have a dataset like this:

[ " Ai assistant"]
[" Ai friendly assistant BLA BLA "]
[" ABC"]

Does the tokenizer pick the longest sequence and pad the remaining ones to match it?
I mean, how does the tokenizer decide how to make the whole batch the same length?

Last question: by doing this, I am afraid the model will get stuck and fail to add the eos token or end the conversation.
Sometimes the model doesn't know how to stop the answer.

Thank you.

NVIDIA org
β€’
edited May 7

Hi, let me try to answer your questions:

  1. You need to set a maximum sequence length (4k/8k). The tokenizer will then pad all sequences to that maximum length. If a sample is longer than the maximum sequence length, it will be truncated.
  2. The model is trained to generate the eos_token when the output is finished. However, for the padding tokens, we set the loss_mask to 0 so that the padding tokens are not trained on (see the sketch below).
  3. We do need the bos_token at the beginning of a sequence, the same as the Llama3 models.

Hope these can help :)
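
To make point 2 concrete, the padding and loss-masking step looks roughly like this (a simplified sketch, not our exact training code; the model id is just an example):

  from transformers import AutoTokenizer

  tokenizer = AutoTokenizer.from_pretrained("nvidia/Llama3-ChatQA-1.5-8B")
  tokenizer.pad_token = tokenizer.eos_token   # pad with the eos token, as above

  texts = ["System: ...\n\nUser: hi\n\nAssistant: Hello!" + tokenizer.eos_token]
  max_length = 4096  # point 1: fixed maximum sequence length

  batch = tokenizer(texts, padding="max_length", truncation=True,
                    max_length=max_length, return_tensors="pt")

  labels = batch["input_ids"].clone()
  # point 2: "loss_mask = 0" on padding positions, i.e. set their labels to -100
  # so the cross-entropy loss ignores them and the padding tokens are never trained on
  labels[batch["attention_mask"] == 0] = -100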

@zihanliu point 2 is not clear.
How do I set the loss_mask to 0?

Hahaha, I was confused. I understand now.
I think setting packing to true will also handle this issue.
@zihanliu thank you. Have a nice day 😊

When I use vLLM, I get the warning "No chat template provided. Chat API will not work."
And the result is below:

Q: hi
R:
hi
<|im_end|>
<|im_start|>user

hi<|im_end|>
<|im_start|>assistant
<|im_end|>
<|im_start|>user

hi<|im_end|>
<|im_start|>assistant
<|im_end|>
<|im_start|>user

hi<|im_end|>
<|im_start|>assistant
<|im_end|>
<|im_start|>user

.......
Why?

@zihanliu Hi, I hope you are doing well. I trained the model. The model is adding the eos token and not getting stuck in the response.
But the model returns very short answers, just 25 or 30 tokens.

However, my dataset consists of answers of 250 to 1000 tokens.

I increased the max new tokens to 1000 and 2000 in the generation config, but the model still produces only 25 or 30 tokens.

Can you explain why this happens?

NVIDIA org

Hi @donglai1,
We didn't provide a chat template since additional context usually needs to be added, which makes it hard to fit into a chat template. You can refer to our sample code to get the prompt template for the model.
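
In vLLM that warning just means no chat_template is available, so the Chat API can't format messages; you can build the prompt string yourself and use plain completion instead. Roughly (a sketch using vLLM's offline API; adapt it to your serving setup):

  from vllm import LLM, SamplingParams

  # Build the prompt in the model-card format instead of relying on a chat template
  prompt = ("System: This is a chat between a user and an AI assistant.\n\n"
            "User: hi\n\nAssistant:")

  llm = LLM(model="nvidia/Llama3-ChatQA-1.5-8B")
  params = SamplingParams(temperature=0.0, max_tokens=128,
                          stop=["<|end_of_text|>", "\n\nUser:"])
  outputs = llm.generate([prompt], params)
  print(outputs[0].outputs[0].text)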

NVIDIA org
β€’
edited May 8

Hi @Imran1,
ChatQA is trained to provide a full but concise response to the question. How large is your dataset? If it is small, the model might still follow its original output format. Also, it depends on the question; some questions do not necessarily need a very long answer.

@zihanliu the data has 28k samples.
Each answer has 250-1000 tokens.

When I ask it to write a simple CV it generates one, but on the domain topics the answers are too short, just a few tokens.
