Inquiry about Generation Speed
I've been running into slow generation recently and was wondering if anyone else has encountered the same thing. Generation is noticeably slower than I would expect. Here is how I load and run the model:
model_path = "mixtral"
model = AutoModelForCausalLM.from_pretrained(
model_path,device_map="auto", max_memory=max_memory_mapping
)
tokenizer =AutoTokenizer.from_pretrained(model_path)
.........................
output_ids = model.generate(input_ids=input_ids.cuda(),
do_sample=True,
temperature=0.4,
top_k=50,
max_new_tokens=300,)
Same here, the generation is very slow.
If the model is offloaded to the CPU, then of course it's going to be slow :/ The model implementation did not change, so nothing should be slower unless you are computing the loss (which was not working on parallel devices). Make sure output_router_logits is set to False in the config.
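To check whether that is what is happening, you can inspect the device map that accelerate built and flip the config flag. A minimal sketch, assuming the model was loaded with device_map="auto" as above:

# Modules mapped to "cpu" or "disk" are offloaded and will slow generation down
offloaded = {name: device for name, device in model.hf_device_map.items()
             if device in ("cpu", "disk")}
if offloaded:
    print("Offloaded modules:", offloaded)

# Disable router logits so the auxiliary load-balancing loss is not computed
model.config.output_router_logits = False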
@Boyue27 your model is most likely offloaded to CPU or disk, as Arthur said. You need to load your model in half precision or 4-bit precision so that it fits entirely on your GPU device.
For float16:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "mixtral"
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map="auto", max_memory=max_memory_mapping, torch_dtype=torch.float16
)
tokenizer = AutoTokenizer.from_pretrained(model_path)
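Note that max_memory_mapping is not defined in these snippets; it is the max_memory dict that accelerate uses as a per-device memory budget. A minimal sketch, assuming a single GPU (the sizes are placeholders, adapt them to your hardware):

# Hypothetical budget: GPU 0 gets up to 22 GiB, anything beyond that spills over to CPU RAM
max_memory_mapping = {0: "22GiB", "cpu": "64GiB"}

You can also drop the max_memory argument entirely and let device_map="auto" infer the limits.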
For 4-bit precision (after installing bitsandbytes with pip install bitsandbytes):
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "mixtral"
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map="auto", max_memory=max_memory_mapping, load_in_4bit=True
)
tokenizer = AutoTokenizer.from_pretrained(model_path)
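If it still feels slow after loading like this, it helps to measure throughput directly. A rough sketch, assuming the model and tokenizer from the snippet above and a CUDA GPU (the prompt is only a placeholder):

import time

inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda")

start = time.time()
output_ids = model.generate(
    **inputs, do_sample=True, temperature=0.4, top_k=50, max_new_tokens=300
)
elapsed = time.time() - start

# Count only the newly generated tokens when computing tokens per second
new_tokens = output_ids.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.1f} tokens/sec")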