MPS support / quantization

#39
by tonimelisma - opened

I'm trying to run this with the transformers library on an M1 Macbook Pro.

With bfloat16, I get:
"TypeError: BFloat16 is not supported on MPS"

With float16, I get:
"NotImplementedError: The operator 'aten::isin.Tensor_Tensor_out' is not currently implemented for the MPS device. If you want this op to be added in priority during the prototype phase of this feature, please comment on https://github.com/pytorch/pytorch/issues/77764. As a temporary fix, you can set the environment variable PYTORCH_ENABLE_MPS_FALLBACK=1 to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS."

Is there a quantized model somewhere that I should be using instead? Any chance of running this model on the Apple GPU with the Hugging Face libraries?


Curious, did you ever get this working?

Meta Llama org

Hi @tonimelisma
For running quantized Llama on Apple devices, I'd advise using MLX: https://huggingface.co/collections/mlx-community/llama-3-662156b069a5d33b3328603c cc @awni @prince-canuma

Yup, should be easy to do and reasonably fast with MLX:

  1. pip install mlx-lm
  2. mlx_lm.generate --model mlx-community/Meta-Llama-3-8B-Instruct-4bit --prompt "hello"

More docs here
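From Python, the same thing looks roughly like this (a sketch using mlx_lm's load/generate API with the 4-bit model from the command above):

```python
from mlx_lm import load, generate

# Downloads and loads the 4-bit quantized Llama 3 8B Instruct from the mlx-community collection.
model, tokenizer = load("mlx-community/Meta-Llama-3-8B-Instruct-4bit")

# Generate a short completion; verbose=True prints tokens and timing as they stream.
text = generate(model, tokenizer, prompt="hello", max_tokens=64, verbose=True)
print(text)
```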

Yes, MLX and llama.cpp work fine. I was inquiring whether the Hugging Face libraries would work, too.
