Hydra
Hydra is an experimental bounded-residency attention kernel for long-context decode. It keeps sink tokens, recent tokens, and selected older pages resident instead of forcing each decode step to attend over the full KV cache.
Source code: https://github.com/newjordan/hydra/tree/main/hf-kernels/hydra
This release is intentionally narrow. It is not a general replacement for full attention, and it does not claim universal speedups or broad quality preservation. The current target is fit and usability for specific long-context inference workloads where the full-attention path is memory-bound.
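The residency idea described above can be illustrated with a small, pure-Python sketch. This is not Hydra's selection policy: the function name, the parameters (num_sink, num_recent, page_size, selected_pages), and the values used are all hypothetical and chosen only to show how a bounded resident set is composed from sinks, a recent window, and a few older pages.

```python
def resident_positions(kv_len, num_sink, num_recent, page_size, selected_pages):
    """Return the sorted KV positions kept resident for one decode step.

    Hypothetical illustration only; these knobs are not part of Hydra's API.
    """
    # Sink tokens: the first few positions of the sequence.
    keep = set(range(min(num_sink, kv_len)))
    # Recent tokens: a trailing window ending at the newest position.
    keep.update(range(max(0, kv_len - num_recent), kv_len))
    # Selected older pages: fixed-size blocks chosen by some external policy.
    for page in selected_pages:
        start = page * page_size
        keep.update(range(start, min(start + page_size, kv_len)))
    return sorted(keep)


positions = resident_positions(
    kv_len=8192, num_sink=4, num_recent=256, page_size=64, selected_pages=[10, 37]
)
print(len(positions), "of 8192 positions resident")  # 388 of 8192 positions resident
```

With these made-up settings the decode step touches 388 positions instead of 8192, so the resident set is bounded by the configuration and not by the context length.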
Usage
After the kernel is published:
import torch
from kernels import get_kernel
hydra = get_kernel("Frosty40/hydra")
q = torch.randn(1, 32, 1, 128, device="cuda", dtype=torch.bfloat16)
k = torch.randn(1, 8, 8192, 128, device="cuda", dtype=torch.bfloat16)
v = torch.randn(1, 8, 8192, 128, device="cuda", dtype=torch.bfloat16)
out = hydra.hydra(q, k, v)
print(out.shape)
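For a rough sense of the memory involved, the footprint of the k and v tensors in the example above can be computed by hand. The helper below is a hypothetical sketch, not part of the hydra package; it assumes bf16 storage (2 bytes per element) and counts a single layer only.

```python
def kv_cache_bytes(batch, kv_heads, seq_len, head_dim, bytes_per_elem=2):
    """Bytes needed for K plus V of one layer (bf16 is 2 bytes per element)."""
    per_tensor = batch * kv_heads * seq_len * head_dim * bytes_per_elem
    return 2 * per_tensor  # one K tensor and one V tensor


# Shapes taken from the usage example: k and v are (1, 8, 8192, 128) in bf16.
full = kv_cache_bytes(batch=1, kv_heads=8, seq_len=8192, head_dim=128)
print(full / 2**20, "MiB per layer")  # 32.0 MiB per layer
```

The figure scales linearly with sequence length, and a full-attention decode step reads all of it for every generated token.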
For local development from the public source checkout:
from pathlib import Path
import sys
sys.path.insert(0, str(Path("hf-kernels") / "hydra" / "torch-ext"))
import hydra
readme_example.py uses the local source packet by default so it can run before
publication. Set HYDRA_USE_HUB=1 after publication to exercise the Hub-loaded
path.
API
hydra.hydra(
q,
k,
v,
*,
is_causal=True,
sliding_window=None,
policy_layer_idx=None,
precision="high",
)
Current constraints:
- CUDA tensors only
- bf16 q, k, and v
- shape (B, H, T, D) with D=128
- causal attention only
- decode path supports Tq == 1 with arbitrary Tkv
- prefill path requires T % BLOCK_SIZE == 0
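The constraints above can be checked before calling the kernel. The following is a minimal sketch of such a pre-flight check, not something the hydra package provides: check_hydra_inputs is a hypothetical helper, dtype and device are passed as plain strings so the sketch runs without torch, block_size is a caller-supplied stand-in because the value of BLOCK_SIZE is not stated here, and applying the prefill rule to the query length is an assumption.

```python
def check_hydra_inputs(q_shape, k_shape, v_shape, *, dtype, device,
                       block_size, is_causal=True):
    """Return a list of violated constraints; an empty list means none found.

    Hypothetical helper. block_size stands in for the kernel's BLOCK_SIZE,
    whose actual value is not given in this README.
    """
    problems = []
    if device != "cuda":
        problems.append("CUDA tensors only")
    if dtype != "bfloat16":
        problems.append("q, k, and v must be bf16")
    for name, shape in (("q", q_shape), ("k", k_shape), ("v", v_shape)):
        if len(shape) != 4:
            problems.append(f"{name} must have shape (B, H, T, D)")
        elif shape[-1] != 128:
            problems.append(f"{name} must have D=128")
    if not is_causal:
        problems.append("causal attention only")
    # Decode is Tq == 1; anything else is treated as prefill (assumption:
    # the T in the prefill rule refers to the query length).
    if len(q_shape) == 4 and q_shape[2] != 1 and q_shape[2] % block_size != 0:
        problems.append("prefill requires T % BLOCK_SIZE == 0")
    return problems


kv = (1, 8, 8192, 128)
ok = check_hydra_inputs((1, 32, 1, 128), kv, kv,
                        dtype="bfloat16", device="cuda", block_size=64)
bad = check_hydra_inputs((1, 32, 100, 128), kv, kv,
                         dtype="bfloat16", device="cuda", block_size=64)
print(ok)   # []
print(bad)  # ['prefill requires T % BLOCK_SIZE == 0']
```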
Evidence Boundary
Submission-facing evidence must come from checked artifacts, not prose notes. Treat evidence in three separate scopes:
- kernel/package validation: tests, CUDA parity logs, kernel-builder logs, and isolated decode benchmarks for this source packet
- broad Hydra research campaign: capacity, quality, sparse-attention comparison, edge/OOM, diagnostic, and model-family reports from the staging repo
- exact-model proof-of-concept: checked Qwen/Qwen3.6-35B-A3B-FP8 rows for named GPUs only
The exact-Qwen proof-of-concept appendix in the staging repo is under:
results/raw/qwen3p6_35b_a3b_fp8/
results/reports/QWEN3P6_FP8_EVIDENCE_TABLE.md
Each cited row must include all three:
- fit/headroom: GPU, context length, memory allocated/reserved, and OOM state
- quality/correctness: prompt/task ID and generated answer artifact
- speed/usability: wall time, generated tokens, tokens/sec, and comparison target
Do not cite proxy models, loader-only probes, failed dependency checks, or non-matching model runs as Hydra benchmark results. Do not describe the exact-Qwen proof-of-concept subset as the full Hydra validation campaign.
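The three-part row requirement above can be expressed as a simple completeness check. This is a sketch under stated assumptions: EvidenceRow, its field names, and missing_fields are hypothetical and do not reflect the actual schema of the staging repo's evidence table, and the values in the example are placeholders, not results.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class EvidenceRow:
    """Hypothetical record for one cited row; field names are illustrative."""
    # fit/headroom
    gpu: Optional[str] = None
    context_length: Optional[int] = None
    memory_allocated_gib: Optional[float] = None
    memory_reserved_gib: Optional[float] = None
    oom: Optional[bool] = None
    # quality/correctness
    task_id: Optional[str] = None
    answer_artifact: Optional[str] = None
    # speed/usability
    wall_time_s: Optional[float] = None
    generated_tokens: Optional[int] = None
    tokens_per_sec: Optional[float] = None
    comparison_target: Optional[str] = None


def missing_fields(row):
    """Names of fields still unset; a citable row should return an empty list."""
    return [f.name for f in fields(row) if getattr(row, f.name) is None]


# Placeholder row with only two fields filled in, so it is not yet citable.
row = EvidenceRow(gpu="example-gpu", context_length=8192)
missing = missing_fields(row)
print(missing)
```

A row that is missing any field from any of the three groups would be rejected before it reaches a report.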
Current Proof-Of-Concept Scope
The current exact-Qwen artifact-backed proof-of-concept scope is:
| GPU | Model | Scope |
|---|---|---|
| RTX PRO 6000 WS | Qwen/Qwen3.6-35B-A3B-FP8 | 32k/80k/160k repeat packet, 160k c96 warm packet, and frontier/headroom sweeps |
| RTX 3090 | Qwen/Qwen3.6-35B-A3B-FP8 | 2k/3k/4k/6k/8k fit probes and completed 10k/12k/14k edge sweep |
The 3090 result should be framed as fit/usability evidence, not a speedup claim. Token rates are slow in the long-context edge rows. The broader Hydra campaign includes additional GPUs, tasks, and comparison lanes outside this exact-model appendix.
Validation Required Before Merge
Minimum gates for source changes:
cd hf-kernels/hydra
python3 -m pytest -q tests
nix run .#ci-test
python3 benchmarks/benchmark_hydra_decode.py --repo .
python3 readme_example.py
Run the CUDA tests on real GPUs. Local syntax checks are not enough for a kernel submission.
Benchmark Snapshot
The current 8192-token decode smoke/benchmark matrix is intentionally reported as kernel/package evidence, not as a universal speedup claim.
| GPU | Package smoke decode | HF benchmark mean |
|---|---|---|
| RTX 3060 | 0.2574 ms | 0.3229 ms |
| RTX 3070 | 0.1474 ms | 0.2532 ms |
| RTX 3080 | 0.2051 ms | 0.3157 ms |
| RTX 3090 | 0.1492 ms | 0.3107 ms |
| RTX 4070 Ti | 0.1261 ms | 0.2215 ms |
| RTX 4090 | 0.1132 ms | 0.2245 ms |
| A100 SXM4 | 0.1408 ms | 0.2568 ms |
| RTX PRO 6000 Blackwell | 0.1158 ms | 0.1371 ms |
| RTX A6000 | builder smoke 0.2166 ms | 0.3230 ms |
The final kernel-builder gate passed on a Vast RTX A6000 with
BUILDER_VARIANT=torch210-cxx11-cu128-x86_64-linux: local pytest 6 passed,
decode smoke 0.2166 ms/iter, builder pytest 4 passed, 2 skipped, exit
status 0.