CPU VS GPU computation time for Mixtral-8x7B-Instruct-v0.1

#85
by kmukeshreddy - opened

I created a base prompt and set a maximum token limit, then ran the prompt on both a CPU and a GPU. To my surprise, the computation time was the same for both runs. Has anyone else encountered this result, or does anyone have insight into why this might happen? (Typically, a GPU should perform these computations much faster than a CPU.)

If you are using Hugging Face Transformers, you must move both the model and the input IDs to CUDA.
Do this with model.cuda() and input_ids.cuda().
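
A minimal sketch of what that looks like, assuming the Transformers library and the model from the thread title; the prompt text and generation parameters are placeholders:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

# Move the model's weights to the GPU; if you skip this,
# generation runs entirely on the CPU.
model = model.cuda()

prompt = "Explain the difference between CPU and GPU inference."
inputs = tokenizer(prompt, return_tensors="pt")

# The input tensors must live on the same device as the model.
input_ids = inputs["input_ids"].cuda()

outputs = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```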

Yes, the issue was that the full model did not fit on the GPUs I have.
Once I quantized the model, GPU inference is fast as expected.
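
For reference, one common way to do this is 4-bit quantization via bitsandbytes, which cuts the memory footprint to roughly a quarter of fp16 (Mixtral-8x7B's fp16 weights need on the order of 90 GB). A sketch, assuming the bitsandbytes package is installed; the exact quantization settings are an example, not what the poster used:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"

# Load the weights in 4-bit precision so the model fits in GPU memory.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # place (and shard) the quantized weights across available GPUs
)

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```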

kmukeshreddy changed discussion status to closed
