FT Mistral Generates Slowly

#112 · opened by yixliu1

I did a full-parameter fine-tune of a Mistral-7B model. Here is the relevant part of my fine-tuning (FT) code:
from transformers import TrainingArguments

training_arguments = TrainingArguments(
    output_dir="",
    num_train_epochs=5,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    optim="paged_adamw_32bit",
    save_steps=1000,
    logging_steps=30,
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=False,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.3,
    group_by_length=True,
    lr_scheduler_type="constant",
    report_to="wandb",
)

Setting the SFT parameters:

from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    max_seq_length=None,
    dataset_text_field="text",
    tokenizer=tokenizer,
    args=training_arguments,
    packing=False,
)
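
Training itself is then started with the standard call (omitted above for brevity; nothing beyond this is done):

trainer.train()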

Here is what I got:

[screenshot of the training run attached in the original post]

Here is my inference code:
from transformers import AutoTokenizer, AutoModelForCausalLM

# model_id points to the fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.padding_side = "right"
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

def mixtral_inf(text, token_len=512):
    inputs = tokenizer(text, return_tensors="pt").to("cuda")
    outputs = model.generate(
        **inputs,
        max_new_tokens=token_len,
        do_sample=True,
        repetition_penalty=1.0,
        temperature=0.8,
        top_p=0.75,
        top_k=40,
    )
    return tokenizer.decode(outputs[0])
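
For reference, a single call can be timed like this (the prompt string below is just a placeholder):

import time

prompt = "..."  # placeholder prompt text
start = time.time()
result = mixtral_inf(prompt, token_len=128)
print(f"generated in {time.time() - start:.1f}s")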

My generation speed is very slow: it takes about 40-50 s to produce one response with max_new_tokens=128, which is much slower than Mixtral-8x7B. I wonder whether I got the wrong model files or whether my inference method is wrong.

Also, it keeps warning me: "A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set padding_side='left' when initializing the tokenizer." I am a bit confused by this. I didn't set a padding side during training, so is "right" the default? And if right-padding was fine during training, why should I switch it to left for generation?
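
For completeness, the change the warning is asking for would just be the following when loading the tokenizer for generation (as far as I can tell, left padding should only matter once padded, batched inputs are passed to generate):

# what the warning suggests: pad on the left for decoder-only generation
tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side="left")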
