Flash Attention in CUDA from Scratch
A CUDA kernel built on Deep-ML and published to the Kernel Hub.
Build a tiled, IO-aware Flash Attention implementation in CUDA, starting from elementary GPU primitives and progressing to a fused online-softmax attention kernel. Along the way you implement a naive attention baseline, work through the online softmax math, and finish with a causal variant suitable for autoregressive models.
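As a rough illustration of the online softmax idea mentioned above, here is a minimal CPU sketch in plain Python. It is not the CUDA kernel from this repo: the function names, the single-query setup, and `block_size` are assumptions made for illustration. It shows how a running maximum and a running sum allow keys and values to be visited one tile at a time while still matching the naive result.

```python
import math

def naive_attention(q, keys, values):
    """Reference: softmax(q . k_j / sqrt(d)) applied as weights over the values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]

def online_attention(q, keys, values, block_size=2):
    """Same output, but keys/values are processed tile by tile (hypothetical sketch)."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = float("-inf")  # largest score seen so far
    running_sum = 0.0            # softmax denominator, relative to running_max
    acc = [0.0] * dim            # unnormalized output, relative to running_max
    for start in range(0, len(keys), block_size):
        k_tile = keys[start:start + block_size]
        v_tile = values[start:start + block_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum.
        # On the first tile this is exp(-inf) == 0.0, which is harmless.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vi for a, vi in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]
```

In this sketch, a causal variant for the query at position i would amount to passing only `keys[:i + 1]` and `values[:i + 1]`; a fused GPU kernel would typically apply that mask per tile instead.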
Usage
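# Requires the kernels package: pip install kernels
from kernels import get_kernel

# Fetch the kernel from the Kernel Hub by its repo id.
kernel = get_kernel("Deep-ML/flash-attention-in-cuda-from-scratch")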
This repo follows the kernel-builder layout. Build the binaries with nix build . (or the kernel-builder Docker image) before loading.
Generated from a completed Deep-ML project.