Long latency and low GPU utilization

#8
by Scott0612 - opened

I am running this model with the MLX example code mistral.py. I downloaded the model, tokenizer, and config files. The model loads fine, but GPU utilization stays in the single digits, and the first 10 output tokens took about 5 minutes. I'm using an M2 MacBook Air with 16 GB of memory. What is the issue?
