end-of-sequence token in fine-tuning dataset


As I understand it, the basic prompt format for the instruction fine-tuned (chat) checkpoint is the following:

<s>[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{instruction} [/INST] {response} </s>


Now, when we create the strings we use for fine-tuning, we do not insert <s> at the start of the string, because the tokenizer takes care of this for us before the text is fed into the model (see below):

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-chat-hf", use_auth_token="...")
>>> tokens = tokenizer("this is a test")
>>> tokenized_string = tokenizer.decode(tokens.input_ids, skip_special_tokens=False)
>>> tokenized_string
'<s> this is a test'

We noticed that the tokenizer does not automatically append the end-of-sequence (EOS) token (see just above). Thus, as a precaution, we insert </s> at the end of the response as an end-of-sequence token. We can then use this token as a stopping criterion at inference time.

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-chat-hf", use_auth_token="...")
>>> tokenizer.decode(tokenizer.eos_token_id)
'</s>'

Since the tokenizer adds <s> but not </s>, is it then correct to use the following as the input to the tokenizer?

[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{instruction} [/INST] {response} </s>
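One way to sanity-check this is to tokenize the assembled string and confirm that the first id is the BOS token (added by the tokenizer) and the last id is the EOS token (parsed from the literal </s>, which the tokenizer recognizes in input text by default). A minimal sketch, with system_prompt, instruction, and response standing in as placeholder strings:

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-chat-hf", use_auth_token="...")
>>> system_prompt, instruction, response = "You are helpful.", "Say hi.", "Hi!"
>>> text = f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{instruction} [/INST] {response} </s>"
>>> ids = tokenizer(text).input_ids
>>> ids[0] == tokenizer.bos_token_id, ids[-1] == tokenizer.eos_token_id
(True, True)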
