Long latency and low GPU utilization

by Scott0612 - opened Dec 22, 2023

I am running this model with the MLX example script mistral.py. I downloaded the model, tokenizer, and config files. The model loads fine, but GPU utilization stays in the single digits, and the first 10 output tokens took about 5 minutes. I'm using an M2 MacBook Air with 16 GB of memory. What is the issue?
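For context, here is a rough sketch of the memory arithmetic that may be relevant. The post does not state the model's size or precision, so the parameter count (about 7.2 billion) and the 16-bit weights below are assumptions, and the function name is only illustrative. If the weights alone take up most of the 16 GB of unified memory, the system may start swapping, which would be consistent with low GPU utilization and very slow token generation. This is a possible explanation to check, not a confirmed diagnosis.

```python
# Back-of-the-envelope estimate of weight memory versus available unified memory.
# ASSUMPTIONS (not stated in the post): ~7.2e9 parameters, 16-bit weights.

def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Return the memory needed for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30

ASSUMED_PARAMS = 7.2e9   # hypothetical parameter count
TOTAL_RAM_GIB = 16       # from the post: M2 MacBook Air with 16 GB

for bits in (16, 4):     # 16-bit weights versus a hypothetical 4-bit quantized copy
    weights = weight_memory_gib(ASSUMED_PARAMS, bits)
    share = weights / TOTAL_RAM_GIB
    print(f"{bits}-bit weights: {weights:.1f} GiB ({share:.0%} of {TOTAL_RAM_GIB} GiB)")
```

The estimate covers weights only. Activations, the key-value cache, and whatever else is running on the machine all need memory on top of it.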
