RuntimeError: Device does not support shared memory of 98304bytes

#9
by leojames - opened

When I run the demo, the Triton kernel fails because it cannot get the shared memory it requests on my CUDA GPU. Has anyone run into the same problem?

Traceback (most recent call last):
File "/export/anaconda3/envs/code/lib/python3.10/site-packages/gradio/routes.py", line 412, in run_predict
output = await app.get_blocks().process_api(
File "/export/anaconda3/envs/code/lib/python3.10/site-packages/gradio/blocks.py", line 1299, in process_api
result = await self.call_function(
File "/export/anaconda3/envs/code/lib/python3.10/site-packages/gradio/blocks.py", line 1021, in call_function
prediction = await anyio.to_thread.run_sync(
File "/export/anaconda3/envs/code/lib/python3.10/site-packages/anyio/to_thread.py", line 31, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "/export/anaconda3/envs/code/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
return await future
File "/export/anaconda3/envs/code/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 867, in run
result = context.run(func, *args)
File "/export/code/gradio_code.py", line 13, in to_black
y = model.generate(x, max_length=500, do_sample=True, top_p=0.95, top_k=4, temperature=0.2, num_return_sequences=1, eos_token_id=tokenizer.eos_token_id)
File "/export/anaconda3/envs/code/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/export/anaconda3/envs/code/lib/python3.10/site-packages/transformers/generation/utils.py", line 1485, in generate
return self.sample(
File "/export/anaconda3/envs/code/lib/python3.10/site-packages/transformers/generation/utils.py", line 2524, in sample
outputs = self(
File "/export/anaconda3/envs/code/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/replit-code-v1-3b/replit_lm.py", line 355, in forward
x, past_key_value = block(x,
File "/export/anaconda3/envs/code/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/replit-code-v1-3b/gpt_blocks.py", line 81, in forward
b, _, past_key_value = self.attn(a,
File "/export/anaconda3/envs/code/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/replit-code-v1-3b/attention.py", line 333, in forward
context, attn_weights = self.attn_fn(
File "/root/.cache/huggingface/modules/transformers_modules/replit-code-v1-3b/attention.py", line 227, in triton_flash_attn_fn
attn_output = flash_attn_triton.flash_attn_func(query, key, value,
File "/export/anaconda3/envs/code/lib/python3.10/site-packages/torch/autograd/function.py", line 506, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/export/anaconda3/envs/code/lib/python3.10/site-packages/flash_attn/flash_attn_triton.py", line 810, in forward
o, lse, ctx.softmax_scale = _flash_attn_forward(
File "/export/anaconda3/envs/code/lib/python3.10/site-packages/flash_attn/flash_attn_triton.py", line 623, in _flash_attn_forward
_fwd_kernel[grid](
File "/export/anaconda3/envs/code/lib/python3.10/site-packages/triton/runtime/jit.py", line 106, in launcher
return self.run(*args, grid=grid, **kwargs)
File "/export/anaconda3/envs/code/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 200, in run
return self.fn.run(*args, **kwargs)
File "", line 41, in _fwd_kernel
File "/export/anaconda3/envs/code/lib/python3.10/site-packages/triton/compiler.py", line 1256, in compile
asm, shared, kernel_name = _compile(fn, signature, device, constants, configs[0], num_warps, num_stages,
File "/export/anaconda3/envs/code/lib/python3.10/site-packages/triton/compiler.py", line 901, in _compile
name, asm, shared_mem = _triton.code_gen.compile_ttir(backend, module, device, num_warps, num_stages, extern_libs, cc)
RuntimeError: Device does not support shared memory of 98304bytes

Hey! Can you share more details about your environment? All package versions, NVIDIA GPU details, CUDA drivers, etc. will be a helpful first step.

Also, could you check the nvidia-smi output once you load up the model from your script? Getting a sense of how much GPU memory is used out of the available total will be helpful as well.
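A stdlib-only way to collect the requested details in one go (a minimal sketch; the package list and helper name are my own choices, not from this thread):

```python
import importlib.metadata as md
import platform
import subprocess

def env_report(pkgs=("torch", "transformers", "triton", "flash-attn", "gradio")):
    """Print Python, package versions, and nvidia-smi output for a bug report."""
    lines = [f"python {platform.python_version()}"]
    for pkg in pkgs:
        try:
            lines.append(f"{pkg} {md.version(pkg)}")
        except md.PackageNotFoundError:
            lines.append(f"{pkg} (not installed)")
    try:
        # Captures driver version, CUDA version, and per-GPU memory usage.
        smi = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
        lines.append(smi.stdout)
    except FileNotFoundError:
        lines.append("nvidia-smi not found on PATH")
    return "\n".join(lines)

print(env_report())
```

Run this after loading the model so the memory figures reflect the loaded weights.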

I used an NVIDIA P100 GPU, which is based on the Pascal architecture, so it does not support this kind of acceleration. Even after switching to an A30, generation is still slow.
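That matches the error: the Triton kernel asks for 98304 bytes (96 KB) of shared memory per thread block, while Pascal-class GPUs top out at 48 KB per block. A quick way to check what your device offers before hitting the kernel launch (a rough sketch; the per-architecture limits in the table are approximate opt-in maxima from NVIDIA's documentation, and the helper name is mine):

```python
import torch

# Approximate max shared memory per thread block (KB) by compute capability.
# Pascal: 48, Volta: 96, Turing: 64, A100/A30 (8.0): 163, GA10x (8.6): ~99.
SHARED_MEM_KB = {(6, 0): 48, (6, 1): 48, (7, 0): 96, (7, 5): 64, (8, 0): 163, (8, 6): 99}

def check_shared_mem(required_bytes=98304):
    """Return (device name, whether the request fits) or None without CUDA."""
    if not torch.cuda.is_available():
        return None
    props = torch.cuda.get_device_properties(0)
    limit_kb = SHARED_MEM_KB.get((props.major, props.minor), 48)  # conservative fallback
    return props.name, required_bytes <= limit_kb * 1024
```

On a P100 (compute capability 6.0) this reports the 96 KB request as unsupported, which is exactly the failure above.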

Replit org

Closing, as the issue has been solved by moving to a more recent GPU.
For additional performance, we recommend double-checking that you are using Triton flash attention (as described in the README) and bfloat16.
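For reference, a loading sketch along those lines (hedged: the `attn_config` key and `'triton'` value are my recollection of the model's README, so verify against it; the model id is from this thread):

```python
import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "replit/replit-code-v1-3b"

# trust_remote_code is required because the model ships custom modeling code.
config = AutoConfig.from_pretrained(MODEL_ID, trust_remote_code=True)
# Assumed from the README: switch the attention implementation to Triton.
config.attn_config["attn_impl"] = "triton"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    config=config,
    torch_dtype=torch.bfloat16,  # bfloat16 as recommended above
    trust_remote_code=True,
).to("cuda")
```

Note that bfloat16 and the Triton kernel both assume an Ampere-or-newer GPU, consistent with the resolution of this thread.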

pirroh changed discussion status to closed
