Special token (</s>) not generated by the model.generate() method

#47
by Pradeep1995 - opened

I fine-tuned the mistralai/Mistral-7B-Instruct-v0.2 model on a dataset with the following format:

sentence1</s>sentence2
sentence3</s>sentence4

After tuning, I run inference by prompting with the first part only (i.e., sentence1 or sentence3).
So I expect the response to be structured like </s>sentence2 or </s>sentence4.

But the fine-tuned model produces only sentence2 or sentence4, without generating the </s> special token.

How do I change the code so that model.generate() also produces the </s> token?

Hi @Pradeep1995
How do you verify that </s> is not generated? Can you make sure you decode all tokens with skip_special_tokens=False?
Also, it is possible that the model does not attend to these tokens during training; could you inspect the attention mask in your training setup and make sure the </s> token is correctly attended to?
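For example, something along these lines (a minimal sketch; output_tokens stands for the raw ids returned by model.generate()):

# Decode without stripping special tokens, so </s> shows up in the text if it was generated
text = tokenizer.decode(output_tokens[0], skip_special_tokens=False)
print(text)

# Or check the raw ids directly for the EOS token id
print(tokenizer.eos_token_id in output_tokens[0].tolist())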

@ybelkada
Before training, I initiated the tokenizer as follows

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2", trust_remote_code=True, use_fast=False)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

I haven't set anything like skip_special_tokens=False/True on the tokenizer before training.

Also, after training, I tried decoding the inference output both ways with the same tokenizer as above:

tokenizer.decode(output_tokens, skip_special_tokens=True)
and
tokenizer.decode(output_tokens, skip_special_tokens=False)

But the model is not generating the special token, so there is nothing to decode.

Is my method correct?

It's more about the training: did you add the EOS and BOS tokens to every prompt? Also, setting eos = pad for training seems wrong; you will always ignore the EOS token, but the model needs to pay attention to it when you train.
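As a sketch of what that could look like (assuming you keep the Mistral tokenizer, which ships an <unk> token you can repurpose for padding):

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
# Pad with <unk> instead of reusing </s>, so EOS tokens are not treated as padding
tokenizer.pad_token = tokenizer.unk_token
tokenizer.padding_side = "right"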

sentence1</s>sentence2
sentence3</s>sentence4
.....
.....

This is the format of my training data. I didn't explicitly add anything like EOS or BOS to the training data, other than the </s> in the middle of each sample.
What I want is for the model to generate the special token (</s>) during inference in the middle of the sentence rather than at the end.
How can I modify the code for that? Please share a snippet if possible.
@ybelkada

I see. I think by default the DataCollatorForLanguageModeling masks out the EOS token during training. Can you share your training snippet?
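You can check this quickly; a minimal sketch that reuses the pad_token = eos_token setup from above:

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
tokenizer.pad_token = tokenizer.eos_token

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
batch = collator([tokenizer("sentence1</s>sentence2")])

# Every label position whose id equals pad_token_id (== eos_token_id here) is set to -100,
# so the mid-sample </s> contributes no loss and the model never learns to emit it
print(batch["labels"])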

import torch
from peft import LoraConfig, AutoPeftModelForCausalLM, prepare_model_for_kbit_training, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from trl import SFTTrainer

model_name = "mistralai/Mistral-7B-Instruct-v0.2"

# Model
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
)

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True, use_fast=False)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

peft_model = .....
peft_config = .....
training_arguments = ....

# Dataset format: sentence1</s>sentence2, sentence3</s>sentence4, ...etc
data = dataset

# SFTTrainer
trainer = SFTTrainer(
    model=peft_model,
    train_dataset=data,
    peft_config=peft_config,
    dataset_text_field="prompt",
    max_seq_length=3000,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,  # defined elsewhere in the script
)
trainer.train()

@ybelkada please check

@Pradeep1995
Thanks!
Is your dataset already formatted as sentence1</s>sentence2, sentence3</s>sentence4, ...etc? If that's the case, you need to set packing=False. The other solution is to use a different token than </s> to separate the sentences, since </s> is already used as the EOS token.
Does this issue also help: https://github.com/huggingface/trl/issues/1283 ?
Let me know how it goes!
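For reference, a minimal sketch of both options (illustrative only; [SEP] is just an example separator token):

# Option 1: keep </s> as the separator but disable packing
trainer = SFTTrainer(
    model=peft_model,
    train_dataset=data,
    peft_config=peft_config,
    dataset_text_field="prompt",
    max_seq_length=3000,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=False,  # packed samples would blur the per-sample </s> boundaries
)

# Option 2: use a dedicated separator token instead of </s>,
# e.g. "sentence1[SEP]sentence2", and register it with the tokenizer
tokenizer.add_special_tokens({"additional_special_tokens": ["[SEP]"]})
base_model.resize_token_embeddings(len(tokenizer))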
