atlas-gdn
Hand-tuned Gated DeltaNet kernels for the linear-attention path of Qwen3.6 hybrid models on NVIDIA GB10 (DGX Spark, SM121).
What's inside
| Op | Use |
|---|---|
| gdn_decode | Single-token recurrent decode (FP32 Q/K/V, BF16 out) |
| gdn_prefill | Multi-token prefill (BF16 throughout) |
| gdn_chunk2 / gdn_chunk3 | MTP K=2/3 chunkwise verify (Qwen3.6 NVFP4 specialized) |
| gdn_wy2 / wy3 / wy4 | 2-pass WY-chunkwise verify (general K=2/3/4) |
| causal_conv1d_fwd | Depthwise causal Conv1d (SSM input projection) |
| causal_conv1d_update | Single-step Conv1d update (decode) |
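
As a rough guide to what a single-token recurrent decode computes per head, the sketch below implements one step of the gated delta rule in plain Python. It follows the recurrence published for Gated DeltaNet, `S_t = alpha_t * S_{t-1} (I - beta_t k_t k_t^T) + beta_t v_t k_t^T` with output `o_t = S_t q_t`. The function name, argument names, and list-of-lists state layout are illustrative assumptions; the kernels' actual state layout, gate parameterization, and any q/k normalization or scaling are not described here and may differ.

```python
def gdn_decode_step(S, q, k, v, alpha, beta):
    """One gated delta rule step for a single head (reference sketch).

    S     : state matrix as d_v rows of d_k floats (hypothetical layout)
    q, k  : vectors of length d_k
    v     : vector of length d_v
    alpha : scalar decay gate in (0, 1]
    beta  : scalar write strength in [0, 1]
    Returns (new_S, out) where out has length d_v.
    """
    d_k = len(k)
    new_S, out = [], []
    for i in range(len(S)):
        # Decay the old state row.
        decayed = [alpha * S[i][j] for j in range(d_k)]
        # What the decayed state currently predicts for key k.
        pred = sum(decayed[j] * k[j] for j in range(d_k))
        # Delta rule: move the prediction toward v[i] by a factor beta.
        delta = beta * (v[i] - pred)
        row = [decayed[j] + delta * k[j] for j in range(d_k)]
        new_S.append(row)
        # Read out with the query.
        out.append(sum(row[j] * q[j] for j in range(d_k)))
    return new_S, out


# With a unit-norm key and beta = 1, the state stores v exactly for that key.
S0 = [[0.0, 0.0], [0.0, 0.0]]
S1, o1 = gdn_decode_step(S0, q=[1.0, 0.0], k=[1.0, 0.0], v=[3.0, 5.0],
                         alpha=1.0, beta=1.0)
```

A reference like this is mainly useful for checking kernel outputs on small shapes; it makes no attempt to match the kernels' precision choices (FP32 inputs, BF16 output).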
Hardware
These kernels target only NVIDIA GB10 (compute capability 12.1,
sm_121f). They will not load on any other GPU. GB10 has:
- Unified LPDDR5X memory (~273 GB/s), so these kernels are bandwidth-bound rather than occupancy-bound
- No multi-CTA clusters (ClusterShape forced to 1×1×1)
- No cvt.rn.satfinite.e2m1x2.f32 PTX instruction (software E2M1 conversion path)
- Cooperative-only scheduling (no Pingpong)
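
Without the packed conversion instruction, FP32 values have to be rounded to E2M1 in software. The following is a minimal Python sketch of that rounding, not the code the kernels use. E2M1 (FP4) represents the magnitudes 0, 0.5, 1, 1.5, 2, 3, 4 and 6 plus a sign bit; the sketch rounds to nearest with ties going to the even code and saturates at ±6. The function names, the NaN mapping, and the nibble order in the packed byte are assumptions made for illustration.

```python
import math

# Magnitude for each 3-bit code (2 exponent bits, 1 mantissa bit).
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def f32_to_e2m1(x):
    """Round a float to a 4-bit E2M1 code: sign bit in bit 3, magnitude in bits 0-2."""
    if math.isnan(x):
        return 0x7  # assumption: NaN maps to +6.0, since E2M1 has no NaN encoding
    sign = 0x8 if math.copysign(1.0, x) < 0 else 0x0
    mag = min(abs(x), E2M1_VALUES[-1])  # saturate to the largest finite value
    best = 0
    for code in range(1, 8):
        d_best = abs(mag - E2M1_VALUES[best])
        d_code = abs(mag - E2M1_VALUES[code])
        # Nearest value wins; on an exact tie prefer the even code (mantissa bit 0).
        if d_code < d_best or (d_code == d_best and code % 2 == 0):
            best = code
    return sign | best


def e2m1_to_f32(code):
    """Decode a 4-bit E2M1 code back to a Python float."""
    value = E2M1_VALUES[code & 0x7]
    return -value if code & 0x8 else value


def pack_e2m1x2(a, b):
    """Pack two converted values into one byte (assumed order: a high nibble, b low)."""
    return (f32_to_e2m1(a) << 4) | f32_to_e2m1(b)


packed = pack_e2m1x2(1.0, -0.5)
```

The linear scan over eight values keeps the tie-breaking rule easy to read; a device implementation would more likely use comparisons against fixed thresholds or bit manipulation on the FP32 encoding.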
build.toml pins cuda-capabilities = ["12.1"] so the build matrix
yields a single SM121 binary; no fallback binaries are produced.
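
Because there is no fallback binary, calling code may want to check the device before selecting these kernels. The snippet below is a small sketch of such a guard; the function names and the `"reference"` backend label are hypothetical and not part of this package.

```python
# Compute capability this package is built for (12.1, i.e. SM121).
REQUIRED_CAPABILITY = (12, 1)


def can_use_atlas_gdn(capability):
    """Return True only for an exact (major, minor) match with SM121."""
    return tuple(capability) == REQUIRED_CAPABILITY


def select_gdn_backend(capability):
    """Pick a backend label; "reference" stands in for whatever fallback the caller has."""
    return "atlas-gdn" if can_use_atlas_gdn(capability) else "reference"
```

With PyTorch, `torch.cuda.get_device_capability()` returns a `(major, minor)` tuple that can be passed in directly.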
Models tested
| Model | Layers using these kernels |
|---|---|
| Qwen/Qwen3.6-27B (dense, hybrid) | 48 GDN layers |
| Qwen/Qwen3.6-35B-A3B (sparse MoE, hybrid) | 30 GDN layers |
Provenance
Sources are extracted from the Atlas inference engine
(https://github.com/Avarok-Cybersecurity/atlas, AGPL-3.0). The GDN
NVFP4 variant ships with __launch_bounds__ annotations specific to
Qwen3.6 hidden dimensions (k_dim=128, v_dim=128, 16/32 K/V heads).
License
AGPL-3.0-only.
- OS: linux
- Arch: aarch64