You are using a model of type mixtral_aqlm to instantiate a model of type mixtral. This is not supported for all configurations of models and can yield errors.
When I try to run this model using the huggingface transformers library, I get this warning. Is it safe to ignore?
Please use this with AutoModelForCausalLM:
from transformers import AutoTokenizer, AutoModelForCausalLM

quantized_model = AutoModelForCausalLM.from_pretrained(
    "BlackSamorez/Mixtral-8x7b-AQLM-2Bit-1x16-hf",
    trust_remote_code=True, torch_dtype="auto"
).cuda()
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-v0.1")
from transformers import AutoTokenizer, AutoModelForCausalLM

quantized_model = AutoModelForCausalLM.from_pretrained(
    "ISTA-DASLab/Mixtral-8x7b-AQLM-2Bit-1x16-hf",
    torch_dtype="auto", device_map="auto", low_cpu_mem_usage=True, trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("ISTA-DASLab/Mixtral-8x7b-AQLM-2Bit-1x16-hf")
%%time
output = quantized_model.generate(tokenizer("Who invented the electric lamp?", return_tensors="pt")["input_ids"].cuda(), min_new_tokens=128, max_new_tokens=128)
print(tokenizer.decode(output[0]))
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
Setting pad_token_id to eos_token_id:2 for open-end generation.
Who invented the electric lamp?
Thomas Edison invented the electric lamp.
What is the difference between a light bulb and a lamp?
A light bulb is a device that produces light. A lamp is a device that contains a light bulb.
What is the difference between a light bulb and a lamp?
A light bulb is a device that produces light. A lamp is a device that contains a light bulb.
What is the difference between a light bulb and a lamp?
A light bulb is a device that produces light. A lamp is a device that contains a
CPU times: user 1min 27s, sys: 916 ms, total: 1min 28s
Wall time: 1min 31s
baaaaaaad
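The attention-mask warning above is easy to silence: pass the full tokenizer output to generate() so it gets an explicit attention_mask, and set pad_token_id. The repetition is also at least partly forced by min_new_tokens=128, which keeps the model generating past the point where it would naturally stop. A minimal sketch (untested, assuming the same quantized_model and tokenizer objects as above):

inputs = tokenizer("Who invented the electric lamp?", return_tensors="pt").to(quantized_model.device)
output = quantized_model.generate(
    **inputs,                              # input_ids plus attention_mask
    pad_token_id=tokenizer.eos_token_id,   # silences the pad_token_id warning
    max_new_tokens=128,                    # no min_new_tokens, so the model can stop on its own
)
print(tokenizer.decode(output[0], skip_special_tokens=True))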
from transformers import AutoTokenizer, AutoModelForCausalLM

quantized_model = AutoModelForCausalLM.from_pretrained(
    "ISTA-DASLab/Meta-Llama-3-8B-Instruct-AQLM-2Bit-1x16",
    torch_dtype="auto", device_map="auto", low_cpu_mem_usage=True, trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
%%time
output = quantized_model.generate(tokenizer("The relationship between humans and AI ", return_tensors="pt")["input_ids"].cuda(), min_new_tokens=128, max_new_tokens=128)
print(tokenizer.decode(output[0]))
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
Setting pad_token_id to eos_token_id:128001 for open-end generation.
/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py:1964: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
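To suppress that compilation warning, the environment variable it mentions can be set before the AQLM CUDA kernels are built, for example (the value here is an assumption; use your GPU's compute capability):

import os
os.environ["TORCH_CUDA_ARCH_LIST"] = "8.0"  # e.g. 8.0 for A100; set this before loading the model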
<|begin_of_text|>The relationship between humans and AI Thedef solve49. Ã0Question: Ã0Question: Ã0Question: Ã0Question: Ã0Question: Ã0Question: Ã0Question: Ã0Question: Ã0Question: Ã0Question: Ã0Question: Ã0Question: Ã0Question: Ã0Question: Ã0Question: Ã0Question: Ã0Question: Ã0Question: Ã0Question: Ã0Question: Ã0Question: Ã0Question: Ã0Question: Ã0Question: Ã0Question: Ã0Question: Ã0Question: Ã0Question: Ã0Question: Ã0Question: Ã0Question
CPU times: user 11.6 s, sys: 418 ms, total: 12 s
Wall time: 35.8 s
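One possible cause of the garbage output here (besides the 2-bit quantization itself): Meta-Llama-3-8B-Instruct is an instruct-tuned model, and a raw text prompt without the chat template often degrades badly. A hedged sketch of the same call using the chat template, reusing the quantized_model and tokenizer loaded above:

import torch

messages = [{"role": "user", "content": "The relationship between humans and AI"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(quantized_model.device)
output = quantized_model.generate(
    input_ids,
    attention_mask=torch.ones_like(input_ids),  # explicit mask avoids the warning above
    pad_token_id=tokenizer.eos_token_id,
    max_new_tokens=128,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))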