Flash Attention in CUDA from Scratch

A CUDA kernel built on Deep-ML and published to the Kernel Hub.

Build a tiled, IO-aware Flash Attention implementation in CUDA, starting from elementary GPU primitives and progressing to a fused online-softmax attention kernel. Along the way you implement a naive attention baseline, work through the online-softmax math, and finish with a causal variant suitable for autoregressive models.
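The online-softmax step is the part that makes tiling possible, so a small reference sketch may help. The plain-Python code below is not the CUDA kernel from this repo; it is an illustrative sketch for a single query row, and the names `naive_attention`, `online_attention`, `tile`, and `last_key` are hypothetical. It processes keys one tile at a time, keeping a running maximum, a running denominator, and a running weighted sum, and rescales the earlier partial results whenever the maximum grows. Passing `last_key` restricts attention to keys `0..last_key`, which is one simple way to express the causal mask for a query at that position.

```python
import math


def _score(q, k):
    # Scaled dot product q.k / sqrt(d).
    return sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(len(q))


def naive_attention(q, K, V, last_key=None):
    """Baseline: materialize all scores, then softmax, then weight V."""
    n = len(K) if last_key is None else last_key + 1
    scores = [_score(q, K[j]) for j in range(n)]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [
        sum(w * V[j][c] for j, w in enumerate(weights)) / total
        for c in range(len(V[0]))
    ]


def online_attention(q, K, V, tile=2, last_key=None):
    """Tiled attention with online softmax (illustrative sketch)."""
    n = len(K) if last_key is None else last_key + 1
    m = float("-inf")         # running maximum of the scores seen so far
    l = 0.0                   # running softmax denominator
    acc = [0.0] * len(V[0])   # running unnormalized output

    for start in range(0, n, tile):
        idx = range(start, min(start + tile, n))
        scores = [_score(q, K[j]) for j in idx]

        m_new = max(m, max(scores))
        scale = math.exp(m - m_new)  # 0.0 on the first tile, since m is -inf

        # Rescale earlier partial results to the new maximum.
        l *= scale
        acc = [a * scale for a in acc]

        # Fold in the current tile.
        for j, s in zip(idx, scores):
            p = math.exp(s - m_new)
            l += p
            for c in range(len(acc)):
                acc[c] += p * V[j][c]

        m = m_new

    return [a / l for a in acc]


if __name__ == "__main__":
    q = [0.5, -1.0, 2.0]
    K = [[1.0, 0.0, 1.0], [0.0, 2.0, -1.0], [3.0, 1.0, 0.5], [-1.0, 0.5, 0.25]]
    V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
    print(naive_attention(q, K, V))
    print(online_attention(q, K, V, tile=2))
```

The two functions should agree up to floating-point rounding for any tile size. In a CUDA version, each tile of K and V would typically be staged through shared memory, so the full score matrix is never written to global memory; how this repo's kernel arranges that is not shown here.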

Usage

from kernels import get_kernel

kernel = get_kernel("Deep-ML/flash-attention-in-cuda-from-scratch")

This repo follows the kernel-builder layout. Build the binaries with nix build . (or the kernel-builder Docker image) before loading.

Generated from a completed Deep-ML project.
