Inference time issue

#59 · opened by amnasher

Hello, I have fine-tuned the Falcon model "ybelkada/falcon-7b-sharded-bf16", but inference is taking far too long: a single prompt took 28 minutes when I assigned a token size of 700. How can I resolve this issue?
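
For context, the model-loading code is not shown in the post. Below is a minimal sketch of the setup assumed here, following the common 4-bit QLoRA-style recipe used with this sharded Falcon checkpoint; `MODEL_NAME` and `DEVICE` are placeholder names, not taken from the original post.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_NAME = "ybelkada/falcon-7b-sharded-bf16"
DEVICE = "cuda"  # placeholder; running generation on CPU alone can explain multi-minute latencies

# 4-bit quantization config, as commonly used when fine-tuning this checkpoint
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map="auto",       # if layers are offloaded to CPU/disk, generation slows down drastically
    trust_remote_code=True,  # Falcon required custom model code in older transformers versions
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
```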

```python
generation_config = model.generation_config
generation_config.max_new_tokens = 200
generation_config.temperature = 0.7
generation_config.top_p = 1.0
generation_config.num_return_sequences = 1
generation_config.pad_token_id = tokenizer.eos_token_id
generation_config.eos_token_id = tokenizer.eos_token_id
```

This is my code.

prompt= f"""

Human : Query
Assistant :
"""
encoding = tokenizer(prompt, return_tensors = 'pt').to(DEVICE)
with torch.inference_mode():
outputs = model.generate(
input_ids = encoding.input_ids,
attention_mask = encoding.attention_mask,
generation_config = generation_config,
)

print(tokenizer.decode(outputs[0], skip_special_tokens = True))
