I want to build a chatbot using the Mixtral-8x7B-Instruct-v0.1 model, but inference is so slow that it is unusable as a chatbot. How can I fix this?

#211
by rising620 - opened

I have already tried several optimizations, such as flash attention and model quantization, but inference is still too slow for a chatbot.
Please help me figure out how to speed up this model.
