Flash Attention in CUDA from Scratch

A CUDA kernel built on Deep-ML and published to the Kernel Hub.

Build a tiled, IO-aware Flash Attention implementation in CUDA, starting from elementary GPU primitives and progressing to a fused online-softmax attention kernel. Along the way you implement a naive attention baseline and the online softmax math, then finish with a causal variant suitable for autoregressive models.
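The online softmax math mentioned above can be sketched on the CPU before moving to CUDA. The following is a minimal pure-Python illustration, not the kernel shipped in this repo: it processes one query row at a time, and the function names (`naive_attention_row`, `online_attention_row`, `causal_attention`) and the `tile` parameter are hypothetical, chosen only for this example. The running maximum `m`, running denominator `l`, and unnormalised accumulator `acc` are the state a fused kernel would typically keep per query row while streaming key/value tiles.

```python
import math


def naive_attention_row(q, keys, values):
    """Baseline: softmax(q . K^T / sqrt(d)) @ V for one query row."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    d_v = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(d_v)]


def online_attention_row(q, keys, values, tile=2):
    """Same result as the baseline, visiting keys/values one tile at a time."""
    scale = 1.0 / math.sqrt(len(q))
    d_v = len(values[0])
    m = float("-inf")   # running maximum of all scores seen so far
    l = 0.0             # running sum of exp(score - m)
    acc = [0.0] * d_v   # running unnormalised output
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_tile]
        m_new = max(m, max(scores))
        # Rescale the old state to the new maximum (this factor is 0.0 on the first tile).
        correction = math.exp(m - m_new)
        weights = [math.exp(s - m_new) for s in scores]
        l = l * correction + sum(weights)
        acc = [a * correction + sum(w * v[j] for w, v in zip(weights, v_tile))
               for j, a in enumerate(acc)]
        m = m_new
    return [a / l for a in acc]


def causal_attention(queries, keys, values, tile=2):
    """Causal variant: query i attends only to keys 0..i.

    Assumption for this sketch: masking is done by truncating the key range;
    a GPU kernel would more likely skip or mask whole tiles instead.
    """
    return [online_attention_row(q, keys[:i + 1], values[:i + 1], tile)
            for i, q in enumerate(queries)]
```

Because the online version only ever needs one tile of keys and values at a time, the full score matrix is never materialised, which is the property the tiled CUDA kernel relies on.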

Usage

from kernels import get_kernel

kernel = get_kernel("Deep-ML/flash-attention-in-cuda-from-scratch")

This repo follows the kernel-builder layout. Build the binaries with nix build . (or the kernel-builder Docker image) before loading.


Generated from a completed Deep-ML project.
