Very long response time

#130 opened by farbodKMSE

Hello,

After downloading and loading the model with:

from transformers import pipeline

text2text_generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-v0.1",
    max_length=20,  # note: max_length counts the prompt tokens too
)

it takes 15 minutes for text2text_generator("where is the capitale of germany?") to generate: [{'generated_text': 'where is the capitale of germany?\n\nBerlin is the capitale of germany'}]

Am I doing something wrong? Is there a way to reduce this response time?

My setup is: MacBook Pro 2018, CPU: 2.9 GHz Intel Core i9, memory: 32 GB DDR4, graphics: Intel UHD Graphics 630 1536 MB.

deleted

Time to get a better machine, I think.

Thank you for your response.
If I want to deploy this model as part of an application on a server, what kind of setup should I ask for?

deleted
edited Mar 6

I won't say I'm 'the' expert, but you need to look into an NVIDIA GPU. Running this stuff on a CPU is going to be painful at best. I run this sort of thing on an old 12 GB Titan, and it's still a world of difference between that and even a decent CPU. You can get far better than I have these days for not much budget.
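
For reference, a minimal sketch of what GPU loading could look like with transformers. The half-precision and device settings here are my assumptions (for a CUDA card with enough VRAM), not something the thread prescribes:

import torch
from transformers import pipeline

# Half precision roughly halves memory use and speeds up GPU inference
generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-v0.1",
    torch_dtype=torch.float16,  # assumption: a GPU with fp16 support
    device_map="auto",          # place the model on the available GPU
)
print(generator("Where is the capital of Germany?", max_new_tokens=20))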

Oh, and for local dev work, you might consider a GGUF format. It will run faster and be good enough.
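
Something like this with llama-cpp-python, for example. The model file name is a placeholder for whatever quantized GGUF conversion you download; check what's actually published:

from llama_cpp import Llama

# Assumes a 4-bit GGUF file already downloaded locally (hypothetical file name)
llm = Llama(model_path="./mistral-7b-v0.1.Q4_K_M.gguf", n_ctx=2048)
out = llm("Where is the capital of Germany?", max_tokens=20)
print(out["choices"][0]["text"])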

I used a Google Colab GPU and it took 10 minutes to generate the string 'A list of colors: red, blue, green, yellow, orange, purple, pink,' with:

model_inputs = tokenizer(["A list of colors: red, blue"], return_tensors="pt").to("cuda")
generated_ids = model.generate(**model_inputs, pad_token_id=tokenizer.eos_token_id)
tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
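
If VRAM on the free Colab GPU is the bottleneck, 4-bit quantization via bitsandbytes may help; this is a sketch of one option, not something suggested in the thread:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit weights shrink the 7B model to a few GB of VRAM
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
)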
