Issue training with Triton

#40 by MikeyBelllissimo - opened

I have tried so many different combinations of library versions, but I keep getting the same error on the backward pass when it's sent to the kernel. Right now I'm using triton_pre_mlir and flash-attn==1.0.3.post0, but I also tried with triton==2.0.0.dev20221202, and with and without the flash_attn_triton.py file, and I still hit the same error. I'm using CUDA 11.7 on a 3090 and I'm able to train without issue when attn_impl is set to torch, but I would like to use triton for obvious reasons (a minimal sketch of that attn_impl toggle follows the traceback). I am getting the following error:

Traceback (most recent call last):
File "", line 21, in _bwd_kernel
KeyError: ('2-.-0-.-0--394352f6a8351feaac334fbb8cc63fa4-46c7c5d46afed8316facd72e7e581bec-ee7112c0f04b05ca1104709529fc7c00-39e3c68a052760cc345a9147b0d68f7d-5c5e32ff210f3b7f56c98ca29917c25e-06f0df2d61979d629033f4a22eff5198-4ac47e74762ba6a774cceea0e1e75ae6-13b7ffc189bd9fba7696034bbcfee151', (torch.float16, torch.float16, torch.float16, torch.float32, torch.float16, torch.float32, torch.float16, torch.float16, torch.float32, torch.float32, 'fp32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32'), ('vector', True, 128, False, False, False, True, 128, 128), (True, True, True, True, True, True, True, True, True, True, (False,), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (False, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (True, False), (False, False), (False, False), (True, False), (True, False), (False, False), (False, False)))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/leucha/Ai/LLMs/MPT-7B/mpt-lora/finetune.py", line 327, in
fire.Fire(train)
File "/home/leucha/ls/envs/MPTFlash/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/home/leucha/ls/envs/MPTFlash/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/home/leucha/ls/envs/MPTFlash/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/home/leucha/Ai/LLMs/MPT-7B/mpt-lora/finetune.py", line 316, in train
trainer.train(resume_from_checkpoint=resume_from_checkpoint)
File "/home/leucha/ls/envs/MPTFlash/lib/python3.10/site-packages/transformers/trainer.py", line 1664, in train
return inner_training_loop(
File "/home/leucha/ls/envs/MPTFlash/lib/python3.10/site-packages/transformers/trainer.py", line 1940, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/home/leucha/ls/envs/MPTFlash/lib/python3.10/site-packages/transformers/trainer.py", line 2745, in training_step
self.scaler.scale(loss).backward()
File "/home/leucha/ls/envs/MPTFlash/lib/python3.10/site-packages/torch/_tensor.py", line 487, in backward
torch.autograd.backward(
File "/home/leucha/ls/envs/MPTFlash/lib/python3.10/site-packages/torch/autograd/init.py", line 200, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/home/leucha/ls/envs/MPTFlash/lib/python3.10/site-packages/torch/autograd/function.py", line 274, in apply
return user_fn(self, *args)
File "/home/leucha/Ai/LLMs/MPT-7B/mpt-lora/flash_attn_triton.py", line 482, in backward
_flash_attn_backward(do, q, k, v, o, lse, dq, dk, dv, bias=bias, causal=ctx.causal, softmax_scale=ctx.softmax_scale)
File "/home/leucha/Ai/LLMs/MPT-7B/mpt-lora/flash_attn_triton.py", line 396, in _flash_attn_backward
_bwd_kernel[grid](q, k, v, bias, do, dq_accum, dk, dv, lse, delta, softmax_scale, q.stride(0), q.stride(2), q.stride(1), k.stride(0), k.stride(2), k.stride(1), v.stride(0), v.stride(2), v.stride(1), *bias_strides, do.stride(0), do.stride(2), do.stride(1), dq_accum.stride(0), dq_accum.stride(2), dq_accum.stride(1), dk.stride(0), dk.stride(2), dk.stride(1), dv.stride(0), dv.stride(2), dv.stride(1), nheads, seqlen_q, seqlen_k, seqlen_q_rounded, d, seqlen_q // 32, seqlen_k // 32, bias_type, causal, BLOCK_HEADDIM)
File "/home/leucha/ls/envs/MPTFlash/lib/python3.10/site-packages/triton_pre_mlir/runtime/jit.py", line 106, in launcher
return self.run(*args, grid=grid, **kwargs)
File "/home/leucha/ls/envs/MPTFlash/lib/python3.10/site-packages/triton_pre_mlir/runtime/autotuner.py", line 73, in run
timings = {config: self._bench(*args, config=config, **kwargs)
File "/home/leucha/ls/envs/MPTFlash/lib/python3.10/site-packages/triton_pre_mlir/runtime/autotuner.py", line 73, in
timings = {config: self._bench(*args, config=config, **kwargs)
File "/home/leucha/ls/envs/MPTFlash/lib/python3.10/site-packages/triton_pre_mlir/runtime/autotuner.py", line 63, in _bench
return do_bench(kernel_call)
File "/home/leucha/ls/envs/MPTFlash/lib/python3.10/site-packages/triton_pre_mlir/testing.py", line 140, in do_bench
fn()
File "/home/leucha/ls/envs/MPTFlash/lib/python3.10/site-packages/triton_pre_mlir/runtime/autotuner.py", line 62, in kernel_call
self.fn.run(*args, num_warps=config.num_warps, num_stages=config.num_stages, **current)
File "/home/leucha/ls/envs/MPTFlash/lib/python3.10/site-packages/triton_pre_mlir/runtime/autotuner.py", line 200, in run
return self.fn.run(*args, **kwargs)
File "", line 43, in _bwd_kernel
RuntimeError: Triton Error [CUDA]: invalid argument
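
For reference, here is roughly how I'm toggling attn_impl (a minimal sketch based on the MPT-7B model card; treat the exact config keys and dtype as illustrative rather than a verified recipe):

# Minimal sketch of the attn_impl toggle, following the MPT-7B model card;
# exact config keys may differ between revisions.
import torch
import transformers

name = 'mosaicml/mpt-7b'

config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
config.attn_config['attn_impl'] = 'triton'  # 'torch' trains fine; 'triton' hits the error above

model = transformers.AutoModelForCausalLM.from_pretrained(
    name,
    config=config,
    torch_dtype=torch.float16,  # the Triton kernel expects fp16/bf16 inputs
    trust_remote_code=True,
)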

Mosaic ML, Inc. org

My guess is that your environment is not set up correctly.
See here for setup instructions.

Specifically: you have a non-compatible version of triton installed.
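
If it helps with comparing environments, a small version dump like the following is a quick sanity check; it only prints what is installed and does not prove the Triton kernels will work:

# Version dump only; it does not verify kernel compatibility.
import importlib
import torch

print('torch      :', torch.__version__)
print('torch CUDA :', torch.version.cuda)  # CUDA version torch was built with
print('GPU        :', torch.cuda.get_device_name(0))

for mod in ('triton', 'triton_pre_mlir', 'flash_attn'):
    try:
        m = importlib.import_module(mod)
        print(f'{mod:16s}:', getattr(m, '__version__', 'unknown'))
    except ImportError:
        print(f'{mod:16s}: not installed')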

Yeah, I've been getting the same thing with the exact dependencies listed in the LLM Foundry. I've tried so many different things that the only cause I can even guess at is multiple CUDA installations. Do you have any other CUDA versions besides 11.7? I've got several versions in my /usr/local folder, so give that a check. If you've only got 11.7 in there, then I'm not sure what the issue could be.
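
If you want to script that check, something like this lists whatever toolkits are sitting under /usr/local (the usual Ubuntu location; adjust the path if your install differs) next to the CUDA version torch was built with:

# Lists CUDA toolkits under /usr/local and the CUDA version torch was built with.
import glob
import torch

print('toolkits in /usr/local   :', sorted(glob.glob('/usr/local/cuda*')) or 'none')
print('torch built against CUDA :', torch.version.cuda)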

Also facing the same KeyError, has anyone solved it?
My GPU is an Nvidia T4:
driver: 515.105.01
Nvidia toolkit: V11.7.99
CUDA: 11.7
triton: 2.0.0.dev20221202
torch: 1.13.1+cu117
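
One more data point worth recording when comparing these setups is the GPU compute capability, since the Triton kernels are compiled and autotuned per architecture; a purely diagnostic sketch:

# Purely diagnostic: prints the GPU architecture. The T4 above is sm_75
# (Turing), while the 3090 and A6000 in this thread are sm_86 (Ampere).
import torch

major, minor = torch.cuda.get_device_capability(0)
print(f'device            : {torch.cuda.get_device_name(0)}')
print(f'compute capability: sm_{major}{minor}')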

Yeah, I'm also using the recommended libraries. I even ran it through their Docker image and still got the same error. I only have 11.7 installed too, so I don't think that's the source. I'm doing LoRA with the peft library (roughly the setup sketched below), which adds the inputs_embeds argument. Is this the common denominator with the rest of you, by chance?

I'm not so sure that this would cause it but I figure it can't hurt to see if that's the common denominator before I dig deep into the kernel.
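
For anyone checking the common-denominator theory, this is roughly the peft setup I mean; the target module name is my guess for MPT's fused QKV projection, so verify it against the model's named_modules() before relying on it:

# Rough sketch of the LoRA wrapping mentioned above. 'Wqkv' (MPT's fused QKV
# projection) is an assumption; verify it against model.named_modules().
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=['Wqkv'],  # assumption: MPT fuses Q, K and V into one Linear
    lora_dropout=0.05,
    bias='none',
    task_type='CAUSAL_LM',
)
model = get_peft_model(model, lora_config)  # 'model' loaded as in the earlier sketch
model.print_trainable_parameters()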

@MikeyBelllissimo I'm actually seeing this issue on inference, so I'm not using any extra libraries. Exact same error, except mine arises in _fwd_kernel() instead of the _bwd_kernel() in the original post.

That’s odd, my forward pass is working fine. Have you tried the updated attention.py and flash_attn_triton.py files and installed triton_pre_mlir? (There's a quick import check sketched below.)

I’m thinking mine might be caused by a race condition. Working on some other projects right now as well but will try and debug when I get a chance.
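
If it helps, here is a quick way to confirm triton_pre_mlir is actually importable; my reading of the updated flash_attn_triton.py is that it does import triton_pre_mlir as triton, but treat that import path as an assumption and check the file in your checkout:

# Confirms the pre-MLIR Triton fork is importable; assumes the updated
# flash_attn_triton.py does `import triton_pre_mlir as triton`.
try:
    import triton_pre_mlir
    print('triton_pre_mlir OK:', getattr(triton_pre_mlir, '__version__', 'unknown'))
except ImportError as err:
    print('triton_pre_mlir missing:', err)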

@MikeyBelllissimo are you explicitly installing triton_pre_mlir, as in pip install triton_pre_mlir triton? (I'm asking because I'm unsure what exactly I need to install to use triton.)

I'm seeing this too, even after installing triton_pre_mlir explicitly, on an A6000 with Ubuntu 20.

Relevant Python packages and versions:

Package                  Version
------------------------ ------------
accelerate               0.24.1
datasets                 2.14.6
einops                   0.7.0
huggingface-hub          0.17.3
nvidia-cublas-cu11       11.10.3.66
nvidia-cuda-cupti-cu11   11.7.101
nvidia-cuda-nvrtc-cu11   11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11        8.5.0.96
nvidia-cufft-cu11        10.9.0.58
nvidia-curand-cu11       10.2.10.91
nvidia-cusolver-cu11     11.4.0.1
nvidia-cusparse-cu11     11.7.4.91
nvidia-nccl-cu11         2.14.3
nvidia-nvtx-cu11         11.7.91
safetensors              0.4.0
tokenizers               0.14.1
torch                    2.0.0
tqdm                     4.66.1
transformers             4.34.1
triton                   2.0.0
triton-pre-mlir          2.0.0