You are using a model of type mixtral_aqlm to instantiate a model of type mixtral. This is not supported for all configurations of models and can yield errors.

#3 opened by Tejasram

When I try to run this model using the huggingface transformers library, I get this warning. Is it safe to ignore?

IST Austria Distributed Algorithms and Systems Lab org

Please use this with AutoModelForCausalLM:

from transformers import AutoTokenizer, AutoModelForCausalLM

# load the AQLM-quantized Mixtral; trust_remote_code pulls in the custom AQLM modeling code
quantized_model = AutoModelForCausalLM.from_pretrained(
    "BlackSamorez/Mixtral-8x7b-AQLM-2Bit-1x16-hf",
    trust_remote_code=True, torch_dtype="auto"
).cuda()
# the tokenizer is unchanged by quantization, so the original Mixtral tokenizer works
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-v0.1")

from transformers import AutoTokenizer, AutoModelForCausalLM

quantized_model = AutoModelForCausalLM.from_pretrained(
    "ISTA-DASLab/Mixtral-8x7b-AQLM-2Bit-1x16-hf",
    torch_dtype="auto", device_map="auto", low_cpu_mem_usage=True, trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("ISTA-DASLab/Mixtral-8x7b-AQLM-2Bit-1x16-hf")

%%time
input_ids = tokenizer("Who invented the electric lamp?", return_tensors="pt")["input_ids"].cuda()
output = quantized_model.generate(input_ids, min_new_tokens=128, max_new_tokens=128)
print(tokenizer.decode(output[0]))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
Setting pad_token_id to eos_token_id:2 for open-end generation.
Who invented the electric lamp?

Thomas Edison invented the electric lamp.

What is the difference between a light bulb and a lamp?

A light bulb is a device that produces light. A lamp is a device that contains a light bulb.

What is the difference between a light bulb and a lamp?

A light bulb is a device that produces light. A lamp is a device that contains a light bulb.

What is the difference between a light bulb and a lamp?

A light bulb is a device that produces light. A lamp is a device that contains a
CPU times: user 1min 27s, sys: 916 ms, total: 1min 28s
Wall time: 1min 31s
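
The attention-mask warning above is easy to silence: pass the whole tokenizer output (input_ids plus attention_mask) and an explicit pad_token_id to generate. A minimal sketch, reusing the quantized_model and tokenizer loaded above:

%%time
inputs = tokenizer("Who invented the electric lamp?", return_tensors="pt").to("cuda")
output = quantized_model.generate(
    **inputs,                             # input_ids and attention_mask together
    pad_token_id=tokenizer.eos_token_id,  # explicit pad token for open-ended generation
    min_new_tokens=128, max_new_tokens=128,
)
print(tokenizer.decode(output[0]))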

baaaaaaad

from transformers import AutoTokenizer, AutoModelForCausalLM

quantized_model = AutoModelForCausalLM.from_pretrained(
    "ISTA-DASLab/Meta-Llama-3-8B-Instruct-AQLM-2Bit-1x16",
    torch_dtype="auto", device_map="auto", low_cpu_mem_usage=True, trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

%%time
input_ids = tokenizer("The relationship between humans and AI ", return_tensors="pt")["input_ids"].cuda()
output = quantized_model.generate(input_ids, min_new_tokens=128, max_new_tokens=128)
print(tokenizer.decode(output[0]))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
Setting pad_token_id to eos_token_id:128001 for open-end generation.
/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py:1964: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
<|begin_of_text|>The relationship between humans and AI Thedef solve49. Ã0Question: Ã0Question: Ã0Question: Ã0Question: Ã0Question: Ã0Question: Ã0Question: Ã0Question: Ã0Question: Ã0Question: Ã0Question: Ã0Question: Ã0Question: Ã0Question: Ã0Question: Ã0Question: Ã0Question: Ã0Question: Ã0Question: Ã0Question: Ã0Question: Ã0Question: Ã0Question: Ã0Question: Ã0Question: Ã0Question: Ã0Question: Ã0Question: Ã0Question: Ã0Question: Ã0Question
CPU times: user 11.6 s, sys: 418 ms, total: 12 s
Wall time: 35.8 s
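
One thing worth checking before blaming the quantization: Meta-Llama-3-8B-Instruct is an instruction-tuned model, so a raw completion prompt can degenerate like this. A sketch that sends the same prompt through the tokenizer's chat template instead (whether that fixes the output of this particular checkpoint is an assumption, not something verified here):

import torch

messages = [{"role": "user", "content": "The relationship between humans and AI"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("cuda")
output = quantized_model.generate(
    input_ids,
    attention_mask=torch.ones_like(input_ids),  # no padding, so the mask is all ones
    pad_token_id=tokenizer.eos_token_id,
    max_new_tokens=128,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))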
