Performance issue with HF 4-bit

#8
by FinancialSupport - opened

Hi! It may be 99% a problem on my part, but this model runs about 3x slower than Mistral 7B on my 4060 Ti 16 GB (with no offloading to RAM).
Here is the code I'm using for inference; I'm trying to generate a batch of 10 answers from one prompt.
I really want to use this model, but for some reason I don't understand, it performs very poorly :(

import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig


def load_model(path):
    start_time = time.time()

    # I love Tim Dettmers
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    model = AutoModelForCausalLM.from_pretrained(path, quantization_config=bnb_config)
    tokenizer = AutoTokenizer.from_pretrained(path)

    end_time = time.time()  # End the timer
    print(f"Model loading: {end_time - start_time} seconds")
    return model, tokenizer


def llm_batch_generate(model, tokenizer, prompt, device, max_token, num_sequences):
    start_time = time.time()
    chat = [
        {"role": "user", "content": f"{prompt}"},
    ]
    model_inputs = tokenizer.apply_chat_template(
        chat, tokenize=True, add_generation_prompt=True, return_tensors="pt"
    )
    model_inputs = model_inputs.to(device)
    generated_ids = model.generate(
        model_inputs,
        max_new_tokens=max_token + 100,
        do_sample=True,
        temperature=1.5,
        pad_token_id=tokenizer.eos_token_id,
        num_return_sequences=num_sequences,  # number of sequences to return per prompt
    )

    answers = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)

    # Strip the prompt portion from each decoded answer.
    # Note: this assumes the chat template uses '[/INST]'; adjust if different.
    answers = [answer.split('[/INST]')[1] if '[/INST]' in answer else answer for answer in answers]

    # If only one sequence was generated, return a string instead of a list
    if len(answers) == 1:
        answers = answers[0]

    end_time = time.time()  # End the timer
    print(f"Batch inference: {end_time - start_time} seconds")

    return answers

I also have a 4060 Ti 16 GB, so I checked for myself:
OpenHermes-2.5-Mistral-7B runs at about 3.3 tokens/s in 4-bit.
Nous-Hermes-2-SOLAR-10.7B runs at about 2.5 tokens/s in 4-bit.

That is roughly what I'd expect given their relative sizes. Could it be that the two models simply give answers of different lengths in your test case, which throws off your timing?
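To rule that out, you can time generation and divide by the number of newly generated tokens instead of timing whole answers. A minimal sketch (not from the original post; the helper name is hypothetical, and model/tokenizer are assumed to come from load_model above):

import time

def tokens_per_second(model, tokenizer, prompt, device="cuda", max_new_tokens=128):
    # Build the same chat-formatted input as in llm_batch_generate
    chat = [{"role": "user", "content": prompt}]
    input_ids = tokenizer.apply_chat_template(
        chat, tokenize=True, add_generation_prompt=True, return_tensors="pt"
    ).to(device)

    start = time.time()
    output_ids = model.generate(
        input_ids,
        max_new_tokens=max_new_tokens,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
    )
    elapsed = time.time() - start

    # Count only the newly generated tokens, not the prompt
    new_tokens = output_ids.shape[-1] - input_ids.shape[-1]
    return new_tokens / elapsed

Comparing models on this number takes answer length out of the equation.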

By the way, afaik llama.cpp or exllama are much faster than transformers for inference anyway, so I'd generally use one of those if you are only doing inference.
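For example, with the llama-cpp-python bindings and a GGUF quant of the model, inference would look roughly like this (a sketch only; the GGUF file name below is a placeholder, not a file mentioned in the thread):

# pip install llama-cpp-python (built with CUDA support for GPU offloading)
from llama_cpp import Llama

llm = Llama(
    model_path="nous-hermes-2-solar-10.7b.Q4_K_M.gguf",  # placeholder path to a local GGUF quant
    n_gpu_layers=-1,  # offload all layers to the GPU
    n_ctx=4096,
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=256,
    temperature=0.8,
)
print(out["choices"][0]["message"]["content"])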

Thanks for the test. It must be something on my part, because I tried the same thing on an AWS EC2 instance and came to the same conclusion (SOLAR is way slower than it should be!).
I'll take the advice on using llama.cpp or exllama and look up how to change my code accordingly.

FinancialSupport changed discussion status to closed
NousResearch org

(quoting the benchmark above: OpenHermes-2.5-Mistral-7B at about 3.3 tokens/s and Nous-Hermes-2-SOLAR-10.7B at about 2.5 tokens/s in 4-bit)

Both seem slow to me here
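If anyone wants to rule out silent CPU offload as the cause, a quick sanity check along these lines can help (hypothetical diagnostics, not from the thread; hf_device_map is only set when the model was loaded with a device_map):

import torch

# Where did accelerate place the layers? (only present if device_map was used)
print(getattr(model, "hf_device_map", "no device_map used"))

# Every parameter should live on the GPU, not on the CPU
print({p.device for p in model.parameters()})

# Rough check that the 4-bit weights actually fit in VRAM
print(f"{torch.cuda.memory_allocated() / 1e9:.2f} GB allocated")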
