Using the Apple M1 chip causes an error (kernel death)

#13
by tomwjhtom - opened

I don't have an Nvidia GPU, so I tried to use the M1 on my MacBook Air. However, executing the code below leads to kernel death.

import torch
from torch import autocast
from diffusers import StableDiffusionPipeline

model_id = "CompVis/stable-diffusion-v1-4"
device = "mps"


pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16, revision="fp16", use_auth_token=True)
pipe = pipe.to(device)

prompt = "a photo of an astronaut riding a horse on mars"
image = pipe(prompt, guidance_scale=7.5)["sample"][0]  

Note that pipe.to(device) executes successfully.
Has anyone made M1 work yet? My PyTorch version is '1.13.0.dev20220823'.

We're working on exactly this! Pinging @pcuenq and @apolinario here as well

Please also check announcements on Twitter - we'll publish something about that soon!

In my case, on an Apple M1, with the code

# make sure you're logged in with `huggingface-cli login`
import os
import torch
from torch import autocast
from diffusers import StableDiffusionPipeline, LMSDiscreteScheduler

# To swap out the noise scheduler, pass it to from_pretrained:
lms = LMSDiscreteScheduler(
    beta_start=0.00085, 
    beta_end=0.012, 
    beta_schedule="scaled_linear"
)

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f'running on {device}')
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-3", 
    scheduler=lms,
    torch_dtype=torch.float16,
    revision="fp16",
    use_auth_token=True,
    cache_dir=os.getenv("cache_dir", "./models")
).to(device)

prompt = "a photo of an astronaut riding a horse on mars"
with autocast(device):
    image = pipe(prompt)["sample"][0]  
    
image.save("astronaut_rides_horse.png")

I get the following error

Traceback (most recent call last):
  File "diffuser.py", line 27, in <module>
    image = pipe(prompt)["sample"][0]  
  ...
  File "/Documents/Projects/bloom/.venv/lib/python3.7/site-packages/torch/nn/functional.py", line 2503, in layer_norm
    return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: "LayerNormKernelImpl" not implemented for 'Half'

Indeed, it doesn't work out of the box yet for device mps; however, with device = cpu it should work, @loretoparisi. Can you try removing the with autocast(device) block? Autocast doesn't work for CPU as of now.

Thanks, I slightly modified the code like

prompt = "a photo of an astronaut riding a horse on mars"
samples = 2
steps = 45
scale = 7.5
if device=='cuda':
    with autocast(device):
        image = pipe(
            [prompt]*samples,
            num_inference_steps=steps,
            guidance_scale=scale,
            )["sample"][0]
else:
    image = pipe(prompt)["sample"][0]

but I'm still getting the same error:

Traceback (most recent call last):
  File "diffuser.py", line 39, in <module>
    image = pipe(prompt)["sample"][0]
  File "/Projects/bloom/.venv/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
...
  File "/Projects/bloom/.venv/lib/python3.7/site-packages/torch/nn/functional.py", line 2503, in layer_norm
    return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: "LayerNormKernelImpl" not implemented for 'Half'

@loretoparisi, oh, this is probably because you are trying to load the fp16 version of the model, which also doesn't work on CPU 😅
Try this for pipe

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", #better model btw 
    scheduler=lms,
    use_auth_token=True,
    cache_dir=os.getenv("cache_dir", "./models")
).to(device)

Thank you, it works on Apple M1 after removing autocast and fp16!

Here is the code for other people's convenience

# make sure you're logged in with `huggingface-cli login`
import os
import torch
from torch import autocast
from diffusers import StableDiffusionPipeline, LMSDiscreteScheduler

# To swap out the noise scheduler, pass it to from_pretrained:
lms = LMSDiscreteScheduler(
    beta_start=0.00085, 
    beta_end=0.012, 
    beta_schedule="scaled_linear"
)

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f'running on {device}')

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", #better model btw 
    scheduler=lms,
    use_auth_token=True,
    cache_dir=os.getenv("cache_dir", "./models")
).to(device)


prompt = "a photo of an astronaut riding a horse on mars"
samples = 2
steps = 45
scale = 7.5
image = pipe(prompt, num_inference_steps=steps, guidance_scale=scale)["sample"][0]
image.save("astronaut_rides_horse.png")

It definitely works, very slowly though.
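
If CPU generation is too slow for experimenting, one knob worth trying (a hedged suggestion, not something from this thread) is lowering the number of denoising steps; num_inference_steps is a standard pipeline argument and fewer steps trade image quality for time:

# Fewer denoising steps finish sooner, at some cost in image quality.
# The default is 50; 20-30 is often enough for a quick test.
image = pipe(prompt, num_inference_steps=25, guidance_scale=7.5)["sample"][0]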

On an M1 (not M1 Max) I get TypeError: Cannot convert a MPS Tensor to float64 dtype as the MPS framework doesn't support float64. Please use float32 instead. if I don't specify revision and torch_dtype. The script python scripts/txt2img.py works to create images though, so it's an issue with diffusers and not stable-diffusion I think.

Can you say how you specify revision and torch_dtype?

I think txt2img.py is using CPU if CUDA is not available.

To sgt101, I think you're running in CPU mode, because of the line that says device = 'cuda' if torch.cuda.is_available() else 'cpu'
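
As an aside, a small sketch of a device pick that also considers MPS (assuming a PyTorch build recent enough to ship torch.backends.mps) would be:

import torch

# Prefer CUDA, then the Apple Silicon MPS backend, then fall back to CPU.
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"
print(f"running on {device}")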

I'm pretty sure my txt2img is using MPS (magnusviri's fork) because it takes about a minute to run instead of upwards of 30 minutes, among other things. I have

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", 
    torch_dtype=torch.float16, revision="fp16",
    use_auth_token=True,
).to("mps")

But that errors out with

0it [00:00, ?it/s]loc("mps_add"("(mpsFileLoc): /AppleInternal/Library/BuildRoots/20d6c351-ee94-11ec-bcaf-7247572f23b4/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm":219:0)): error: input types 'tensor<2x1280xf32>' and 'tensor<*xf16>' are not broadcast compatible
LLVM ERROR: Failed to infer result type(s).
Abort trap: 6
/Users/fragmede/miniforge3/envs/ldm/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

Figured it out! I opened a PR so Hugging Face can merge my fix into diffusers to get MPS working.

It works after upgrading diffusers.

pip install -U diffusers
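
If you're unsure whether the upgrade took effect, a quick sanity check (assuming a PyTorch build that includes the MPS backend) is:

import torch
import diffusers

print(diffusers.__version__)              # should show the freshly installed release
print(torch.backends.mps.is_available())  # True on Apple Silicon with a recent PyTorch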
pcuenq changed discussion status to closed

Remove torch_dtype=torch.float16 and it works for me.
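
Putting the pieces of this thread together, a minimal full-precision sketch for MPS might look like this (assuming an up-to-date diffusers with the MPS fix and that you're logged in with huggingface-cli login; not an official recipe):

from diffusers import StableDiffusionPipeline

# Load in the default float32 precision: no revision="fp16", no torch_dtype=torch.float16.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    use_auth_token=True,
)
pipe = pipe.to("mps")

prompt = "a photo of an astronaut riding a horse on mars"
# No autocast: run the pipeline directly on the MPS device.
image = pipe(prompt, guidance_scale=7.5)["sample"][0]
image.save("astronaut_rides_horse.png")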

Hey!

I have the same problem:

loc("varianceEps"("(mpsFileLoc): /AppleInternal/Library/BuildRoots/a0876c02-1788-11ed-b9c4-96898e02b808/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm":219:0)): error: input types 'tensor<1x77x1xf16>' and 'tensor<1xf32>' are not broadcast compatible
LLVM ERROR: Failed to infer result type(s).

But I don't understand how to fix it... I'm an architect and I don't have coding skills... Can you please explain the method a bit more (if there is one)?

Thanks a lot in advance!
