Generation is not good
I think this 4-bit version is not working well, as the generation contains too many random or garbage tokens.
@jmjzz Can you please share what script you used to run the model? With the script provided in this repo, I am getting the error: RuntimeError: "LayerNormKernelImpl" not implemented for 'Half'.
> I think this 4-bit version is not working well, as the generation contains too many random or garbage tokens.
Can you give us an example of the input you used?
The generation for the base prompt looks good to me @jmjzz:
I see. I'm also using the base prompt given on the DBRX Hugging Face page. Did you make any modifications?
This is the exact script I used @jmjzz:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
tokenizer = AutoTokenizer.from_pretrained("PrunaAI/dbrx-instruct-bnb-4bit", trust_remote_code=True, token="hf_YOUR_TOKEN")
model = AutoModelForCausalLM.from_pretrained("PrunaAI/dbrx-instruct-bnb-4bit", device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True, token="hf_YOUR_TOKEN")
input_text = "What does it take to build a great LLM?"
messages = [{"role": "user", "content": input_text}]
input_ids = tokenizer.apply_chat_template(messages, return_dict=True, tokenize=True, add_generation_prompt=True, return_tensors="pt").to("cuda")
outputs = model.generate(**input_ids, max_new_tokens=200)
print(tokenizer.decode(outputs[0]))
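(Note: loading in torch.bfloat16 with device_map="auto" is presumably also what avoids the RuntimeError: "LayerNormKernelImpl" not implemented for 'Half' mentioned earlier in this thread, since that error typically shows up when half-precision LayerNorm ends up running on CPU.)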
@johnrachwanpruna I see, the generation looks good now. But loading the model takes about 30 minutes, which is significantly slower than loading Mixtral 8x7B.
@jmjzz For me, running the code snippet I showed you takes only about 30 seconds.
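(One likely explanation for the gap: the first from_pretrained call downloads the quantized checkpoint from the Hub, so load time depends heavily on network speed and on whether the weights are already in the local cache.)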
@johnrachwanpruna Thanks, I think I solved the problem. BTW, I feel the 4-bit DBRX is weaker than the default Mixtral 8x7B after running some evaluations. Have you tried evaluating it on any benchmarks?
We have not tried to benchmark the quantized models at the moment.
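For anyone who wants to run such a comparison themselves, here is a minimal sketch using EleutherAI's lm-evaluation-harness (assuming a v0.4+ install); the task list, batch size, and metrics printed are illustrative choices, not something used by anyone in this thread:

# Minimal sketch, not the maintainers' setup: score the 4-bit checkpoint with
# EleutherAI's lm-evaluation-harness (pip install lm_eval, v0.4+ assumed).
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args="pretrained=PrunaAI/dbrx-instruct-bnb-4bit,trust_remote_code=True,dtype=bfloat16",
    tasks=["hellaswag", "arc_easy"],  # illustrative tasks; swap in the benchmarks you care about
    batch_size=1,
)

# Print per-task metrics; rerun the same call with a Mixtral 8x7B checkpoint to compare.
for task, metrics in results["results"].items():
    print(task, metrics)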