example code doesn't work at all

#2
by cloudyu - opened

The output is <pad> tokens only.
Prompt: Write me a poem about Machine Learning.

mlx 0.15.2
mlx-lm 0.15.0

MLX Community org

The example code should work fine:

from mlx_lm import load, generate

model, tokenizer = load("mlx-community/gemma-2-27b-it-8bit")
response = generate(model, tokenizer, prompt="hello", verbose=True)
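
Note: the mlx_lm.generate CLI wraps the prompt in Gemma's chat template (hence the <start_of_turn> markers in the output below), while the snippet above passes the raw string. A minimal sketch of doing the same from Python, assuming the tokenizer returned by load() exposes the underlying Hugging Face apply_chat_template method, as in the mlx-lm README examples:

from mlx_lm import load, generate

model, tokenizer = load("mlx-community/gemma-2-27b-it-8bit")

# Format the prompt with the model's chat template, as the CLI does.
messages = [{"role": "user", "content": "hello"}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

response = generate(model, tokenizer, prompt=prompt, verbose=True)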
prince-canuma changed discussion status to closed

Reproducible here:

% mlx_lm.generate --model "mlx-community/gemma-2-27b-it-8bit" --prompt "Hello"
Fetching 11 files: 100%|█████████████████████| 11/11 [00:00<00:00, 31152.83it/s]
==========
Prompt: <bos><start_of_turn>user
Hello<end_of_turn>
<start_of_turn>model

<pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>
==========
Prompt: 0.538 tokens-per-sec
Generation: 1.840 tokens-per-sec

% python3 prince.py 
Fetching 11 files: 100%|█████████████████████| 11/11 [00:00<00:00, 34820.64it/s]
==========
Prompt: hello
<pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>
==========
Prompt: 0.124 tokens-per-sec
Generation: 2.043 tokens-per-sec

Yep, a very bad experience.
It doesn't work, but someone still tells you it works.

The example code should work fine:

from mlx_lm import load, generate

model, tokenizer = load("mlx-community/gemma-2-27b-it-8bit")
response = generate(model, tokenizer, prompt="hello", verbose=True)

Did you really test the code?

A very bad experience.
It doesn't work, but someone keeps telling you it works.

I have previously noticed differences with mlx-vlm (and PaliGemma) vs. the official demo on HF as well, but didn't have time to pursue this further. Perhaps there is an underlying MLX issue? I am using macOS 14.3 on an M3 Max.

By contrast, the 9B-FP16 variant does work:

% mlx_lm.generate --model "mlx-community/gemma-2-9b-it-fp16" --prompt "Hello"
Fetching 9 files: 100%|████████████████████████| 9/9 [00:00<00:00, 17614.90it/s]

==========
Prompt: <bos><start_of_turn>user
Hello<end_of_turn>
<start_of_turn>model

Hello! 👋

How can I help you today? 😊

==========
Prompt: 6.337 tokens-per-sec
Generation: 13.758 tokens-per-sec

MLX Community org

I'm sorry @cloudyu @ndurner,

It was an oversight on my part.

There is a tiny bug in the 27B version; it should be fixed soon:
https://github.com/ml-explore/mlx-examples/pull/857

prince-canuma changed discussion status to open
MLX Community org

Fixed ✅

pip install -U mlx-lm
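
After upgrading, you can confirm which version pip installed and re-run the original command (the exact fixed release number isn't stated here, so just check it is newer than the one that failed):

% pip show mlx-lm
% mlx_lm.generate --model "mlx-community/gemma-2-27b-it-8bit" --prompt "Hello"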

prince-canuma changed discussion status to closed

This is an issue again: after version 0.19.1 the output is all <pad> tokens again. It works only up to 0.19.0.
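
Until that regression is fixed, a possible workaround is to pin the last release reported to work:

% pip install "mlx-lm==0.19.0"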
