Entrit committed on
Commit 51e3123 · verified · 1 Parent(s): 79e1a6b

initial public release: code, README, KNOWN_ISSUES

Files changed (6)
  1. KNOWN_ISSUES.md +64 -0
  2. README.md +116 -0
  3. build.sh +52 -0
  4. trit_gemv.cu +292 -0
  5. trit_gemv_lib.py +280 -0
  6. trit_gemv_standalone.cu +598 -0
KNOWN_ISSUES.md ADDED
@@ -0,0 +1,64 @@
1
+ # Known issues — tritllm-kernel
2
+
3
+ These issues surfaced during a pre-release code review. None affect the published paper benchmark numbers (those were obtained on shapes that respect the contract), but anyone using these kernels with new shapes, custom launch parameters, or as a drop-in inference primitive should be aware of them.
4
+
5
+ ## BLOCKER — must respect or fix before relying on the kernel
6
+
7
+ ### 1. Implicit one-warp-per-block launch contract
8
+ **Where:** [`trit_gemv.cu:190-237` (`trit_gemv_uniform`)](trit_gemv.cu#L190), [`trit_gemv.cu:245-290` (`trit_gemv_variable`)](trit_gemv.cu#L245)
9
+
10
+ The kernels use `lane = threadIdx.x` directly as the lane index and reduce with a full-warp mask `__shfl_down_sync(0xFFFFFFFF, ...)`. This is correct only when `blockDim.x == 32`.
11
+
12
+ If launched with `blockDim.x > 32`:
13
+ - Threads with `threadIdx.x >= 32` will compute `idx = lane*2+i` past the 64-element group bound and read out-of-bounds.
14
+ - All threads with lane 0 across multiple warps race to write `y[row]`.
15
+
16
+ **Fix in caller:** always launch with `blockDim.x == 32`. The host-side wrappers in `trit_gemv_standalone.cu` do this correctly. Direct callers from custom code must respect it.
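+
+ For reference, a minimal host-side launch sketch that respects the contract (pointer names are illustrative device buffers):
+
+ ```cpp
+ // One block per output row, exactly one warp (32 threads) per block.
+ dim3 grid(out_features);
+ dim3 block(WARP_SIZE);   // WARP_SIZE == 32; this is the whole contract
+ trit_gemv_uniform<<<grid, block>>>(packed_trits, scales, x, y,
+                                    in_features, out_features, depth);
+ ```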
17
+
18
+ **Future fix in kernel:** add `assert(blockDim.x == WARP_SIZE)` at kernel entry, or rewrite to handle multi-warp blocks correctly.
19
+
20
+ ### 2. `in_features` not a multiple of `GROUP_SIZE` is silently dropped
21
+ **Where:** [`trit_gemv.cu:194`](trit_gemv.cu#L194), [`trit_gemv.cu:259`](trit_gemv.cu#L259)
22
+
23
+ ```cpp
24
+ int num_groups = in_features / GROUP_SIZE;
25
+ ```
26
+
27
+ Integer division truncates. If `in_features % 64 != 0`, the trailing partial group is silently skipped and that fragment of the dot product is missing from the output.
28
+
29
+ **Fix in caller:** pad the input weight matrix (and activations) with zero rows to the next multiple of 64 before quantizing. The codec output already does this for Qwen, Llama, and Mistral architectures, all of which have `hidden_dim` divisible by 64.
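+
+ A minimal padding sketch (host side, before quantization; `GROUP_SIZE` is 64):
+
+ ```cpp
+ // Round in_features up to the next multiple of GROUP_SIZE so the
+ // truncating division above never drops a trailing partial group.
+ int padded_in = ((in_features + GROUP_SIZE - 1) / GROUP_SIZE) * GROUP_SIZE;
+ // Zero-pad W along the input dimension and x with (padded_in - in_features)
+ // zeros; zero weights contribute nothing, so results are unchanged.
+ ```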
30
+
31
+ **Future fix in kernel:** add `assert(in_features % GROUP_SIZE == 0)` at kernel entry, or write a tail-handling path.
32
+
33
+ ## SHOULD-FIX
34
+
35
+ ### 3. C API performs no input validation
36
+ **Where:** `trit_gemv_standalone.cu`, all `extern "C"` functions
37
+
38
+ `trit_gemv_d2_fast`, `trit_gemv_d2_dp4a`, `trit_gemv_d3_native`, etc. accept null pointers, mismatched `rows`/`cols`/`num_groups`, and incorrectly packed buffers without complaint. Bad inputs become device faults or OOB reads.
39
+
40
+ For a public ctypes-facing library this is a sharp edge. We will add a validation pass in a future revision; for now, callers must guarantee the validity of their arguments themselves.
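+
+ Until that lands, a caller-side guard along these lines is cheap insurance (a sketch for the d2 fast path; adapt to your wrapper):
+
+ ```cpp
+ // Check the documented preconditions before crossing the C boundary.
+ assert(pt && ws && xt_e && xt_o && xs && y);           // no null device pointers
+ assert(cols % GROUP_SIZE == 0);                        // see issue 2
+ assert(num_groups == cols / GROUP_SIZE && rows > 0);   // consistent shape triple
+ ```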
41
+
42
+ ### 4. `get_gpu_name(char* buf, int buflen)` has no null/length guard
43
+ **Where:** [`trit_gemv_standalone.cu:586`](trit_gemv_standalone.cu#L586)
44
+
45
+ Calling with `buf == nullptr` or `buflen <= 0` is immediate UB on the host side. Trivial fix; pending.
46
+
47
+ ### 5. CUDA error returns are not surfaced
48
+ **Where:** several places in `trit_gemv_standalone.cu` where `set_l2_persist`, kernel launches, and helper calls drop `cudaError_t` returns
49
+
50
+ If a kernel launch fails (e.g., bad shapes that pass the (missing) input validation), the failure is silent until the next `cudaDeviceSynchronize()` or `cudaGetLastError()`. The public functions return `void` and have no error-reporting path.
51
+
52
+ Workaround: call `cuda_sync()` after each operation and check `cudaGetLastError()` from your wrapper.
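+
+ A sketch of that workaround from a C or C++ caller (the same pattern applies from Python through the ctypes wrapper):
+
+ ```cpp
+ trit_gemv_d2_fast(pt, ws, xt_e, xt_o, xs, y, cols, rows, num_groups, 1);
+ cudaDeviceSynchronize();                          // what cuda_sync() does
+ cudaError_t err = cudaGetLastError();
+ if (err != cudaSuccess)
+     fprintf(stderr, "trit_gemv_d2_fast: %s\n", cudaGetErrorString(err));
+ ```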
53
+
54
+ ### 6. Reduction wastes 31 lanes per group
55
+ **Where:** [`trit_gemv.cu:223-232`](trit_gemv.cu#L223), [`trit_gemv.cu:279-286`](trit_gemv.cu#L279)
56
+
57
+ After the warp reduction, only lane 0 multiplies by the group scale and accumulates into `row_acc`; the other 31 lanes sit idle during the scale/add step. This is correct, but it leaves performance on the table relative to the deferred-reduction design used in `k_d3_hardened` (`trit_gemv_standalone.cu:493`).
58
+
59
+ The headline 7.8× number is from the deferred-reduction path, so this only matters if you use the educational `trit_gemv_uniform` / `trit_gemv_variable` kernels directly.
60
+
61
+ ## NIT
62
+
63
+ ### 7. Multiple prototype kernels in production file
64
+ `trit_gemv_standalone.cu` contains v9, v27, v28, v29, `k_d3_hardened`, plus the non-deferred kernels — a development history rather than a clean public surface. The `k_v29_pipeline` / `trit_pipeline` path was broken (it passed nullptr for required arrays) and was removed in a commit prior to this release. The remaining prototypes (`k_v27`, `k_v29`, `k_v28`) are still wired through public C functions; they work, but the API surface is wider than needed. A future revision will trim this to one canonical entry point per depth.
README.md ADDED
@@ -0,0 +1,116 @@
1
+ ---
2
+ license: apache-2.0
3
+ tags:
4
+ - cuda
5
+ - quantization
6
+ - ternary
7
+ - llm-inference
8
+ - kernel
9
+ ---
10
+
11
+ # tritllm-kernel
12
+
13
+ Multiply-free ternary GEMV CUDA kernel for the codec from
14
+ **"Balanced Ternary Post-Training Quantization for Large Language Models"** (Stentzel, 2026).
15
+
16
+ The headline number from the paper: **7.8× speedup** over cuBLAS FP16 GEMV on RTX 4090 in the memory-bound regime, projected to full-model token generation throughput from per-layer benchmarks.
17
+
18
+ > **These are kernel-only projections, not end-to-end serving throughput.** They exclude attention, KV cache, sampling, and tokenizer overhead. See Section 7 of the paper for methodology.
19
+
20
+ ## What it is
21
+
22
+ A standalone CUDA shared library (`libtrit_gemv.so` / `.dll`) callable via `ctypes` from any language, with no PyTorch dependency. The same algorithm is also wrapped via PyTorch's pybind11 in `trit_gemv_wrapper.cu` for benchmarking.
23
+
24
+ The core trick: each ternary weight (-1, 0, +1) reduces a multiply-accumulate to a conditional add/subtract/skip. The kernel uses the `dp4a` intrinsic on int4-packed weights and pre-interleaved int8 activations to perform four weight-times-activation products per instruction.
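+
+ As a scalar illustration of that trick (not the actual kernel, which operates on packed words via `dp4a`):
+
+ ```c
+ /* One output row, one group of 64 weights: multiply-free accumulation. */
+ float acc = 0.0f;
+ for (int k = 0; k < 64; k++) {
+     int8_t w = w_ternary[k];           /* -1, 0, or +1 */
+     if (w > 0)      acc += x[k];       /* add */
+     else if (w < 0) acc -= x[k];       /* subtract; w == 0 is skipped */
+ }
+ y_row_partial = acc * group_scale;     /* one FP multiply per 64 weights */
+ ```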
25
+
26
+ ## Build
27
+
28
+ ```bash
29
+ cd kernel
30
+ ./build.sh
31
+ ```
32
+
33
+ The build script targets SM 70/75/80/86/89/90 in one fat binary, and adds SM 120 when the installed CUDA toolkit supports it, so the `.so` runs on V100, T4, A100, RTX 30/40, H100, and (with a recent enough toolkit) RTX 50-series GPUs without recompilation.
34
+
35
+ Required: `nvcc` (CUDA 11.8 or newer) and a C++ compiler.
36
+
37
+ ## Performance (Qwen2.5-7B, d=2, 3.47 bpw, 3.3 GB model)
38
+
39
+ | GPU | L2 cache | Tokens/sec | Speedup vs FP16 cuBLAS | Effective BW |
40
+ |---|---|---|---|---|
41
+ | RTX 4090 | 72 MB | 588 | 7.8× | 1940 GB/s |
42
+ | RTX 3090 | 6 MB | 192 | 3.4× | 633 GB/s |
43
+ | RTX 4080 Laptop | 64 MB | 133 | 5.8× | 439 GB/s |
44
+ | A100 80GB | 40 MB | 201 | 4.2× | 663 GB/s |
45
+
46
+ These are per-layer GEMV benchmarks projected to full-model token-generation throughput. L2 cache size correlates strongly with speedup because each `d=2` layer fits entirely in L2 on the RTX 4090, giving an effective bandwidth of roughly 2× the card's DRAM bandwidth.
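+
+ The Python wrapper (`TritGEMV.gemv_adaptive` in `trit_gemv_lib.py`) uses the same observation to choose a path at call time; the heuristic boils down to:
+
+ ```c
+ /* Keep the compact int4 weights resident in L2 when the layer fits,
+    otherwise fall back to the pre-expanded int8 DRAM path. */
+ size_t weight_bytes = (size_t)rows * num_groups * 8 * sizeof(int32_t);
+ int fits_in_l2 = weight_bytes < 0.75 * get_l2_cache_bytes();  /* 25% headroom for x, scales */
+ ```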
47
+
48
+ See `kernel/bench_*.py` for the benchmark drivers.
49
+
50
+ ## Launch contract
51
+
52
+ The kernels in `trit_gemv.cu` and `trit_gemv_standalone.cu` assume:
53
+
54
+ | Constraint | Why | What happens if violated |
55
+ |---|---|---|
56
+ | `blockDim.x == 32` (one warp per block) | Kernels use `__shfl_down_sync(0xFFFFFFFF, ...)` and lane-0 reduction | OOB index reads + race on `y[row]` |
57
+ | `in_features % 64 == 0` | Group size is fixed at 64 weights | Trailing partial group is silently dropped — incorrect output for that row |
58
+ | Weight, scale, and activation buffers are device-resident and properly aligned | Kernel uses `__ldg` for cached loads | UB / device fault |
59
+
60
+ If your model has `in_features` not divisible by 64, pad the weight matrix to the next multiple of 64 with zero rows before quantizing.
61
+
62
+ ## API surface
63
+
64
+ C ABI in `trit_gemv_standalone.cu`:
65
+
66
+ ```c
67
+ // Best-tested d=2 path (champion for 4090)
68
+ void trit_gemv_d2_fast(
69
+ const int32_t* pt, // [rows * num_groups * 8] int4-packed weights
70
+ const float* ws, // [rows * num_groups] scales
71
+ const int32_t* xt_e, // [num_groups * 8] even nibble activations
72
+ const int32_t* xt_o, // [num_groups * 8] odd nibble activations
73
+ const float* xs, // [num_groups] activation scales
74
+ float* y, // [rows] output
75
+ int cols, int rows, int num_groups,
76
+ int use_l2_persist // 0 = off, 1 = enable L2 persistence
77
+ );
78
+
79
+ // Native-trit packed d=3 (no int4 intermediate)
80
+ void trit_gemv_d3_native(
81
+ const int32_t* pt, // [rows * num_groups * 13] trit-packed
82
+ const float* sc,
83
+ const float* x,
84
+ float* y,
85
+ int cols, int rows, int depth
86
+ );
87
+
88
+ // L2 cache size query (for deciding whether to enable persist)
89
+ int get_l2_cache_bytes();
90
+ void get_gpu_name(char* buf, int buflen);
91
+ void cuda_sync();
92
+ ```
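+
+ A minimal call sequence (a sketch; the device buffers are assumed to be allocated and packed per the shape comments above):
+
+ ```c
+ int num_groups = cols / 64;                  /* cols must be a multiple of 64 */
+ trit_gemv_d2_fast(pt, ws, xt_e, xt_o, xs, y,
+                   cols, rows, num_groups, /*use_l2_persist=*/1);
+ cuda_sync();                                 /* flush and surface any launch error */
+ ```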
93
+
94
+ ## Known issues
95
+
96
+ Documented in [KNOWN_ISSUES.md](KNOWN_ISSUES.md). Summary:
97
+
98
+ - **Launch contract is implicit, not enforced.** Kernels are correct only with `blockDim.x == 32`. There are no runtime asserts; the contract is guarded only by the host-side wrappers in `trit_gemv_standalone.cu`. Direct callers must respect it.
99
+ - **`in_features` not a multiple of 64 silently fails.** No assert. Pad your matrix.
100
+ - **C API has no input validation.** Null pointers, wrong dimensions, and buffer-shape mismatches become device faults or OOB reads. This is a public-API hardening item we have not yet completed.
101
+ - **CUDA error returns are not surfaced to the caller** in some helper paths. If a kernel launch fails, `cuda_sync()` will see it but the public functions return `void`.
102
+
103
+ ## Citation
104
+
105
+ ```
106
+ @article{stentzel2026ternaryptq,
107
+ title = {Balanced Ternary Post-Training Quantization for Large Language Models},
108
+ author = {Stentzel, Eric},
109
+ year = 2026,
110
+ note = {Entrit Systems}
111
+ }
112
+ ```
113
+
114
+ ## License
115
+
116
+ Apache-2.0.
build.sh ADDED
@@ -0,0 +1,52 @@
1
+ #!/bin/bash
2
+ # Build libtrit_gemv.so — standalone CUDA kernel library
3
+ # No PyTorch, no Python, no framework dependency.
4
+ # Just nvcc + CUDA runtime.
5
+ #
6
+ # Fat binary: compiles for all major GPU architectures.
7
+ # The right kernel is selected at runtime based on the GPU.
8
+
9
+ set -e
10
+ cd "$(dirname "$0")"
11
+
12
+ # Detect nvcc
13
+ NVCC=$(which nvcc 2>/dev/null || echo "/usr/local/cuda/bin/nvcc")
14
+ if [ ! -x "$NVCC" ]; then
15
+ echo "ERROR: nvcc not found. Install CUDA toolkit."
16
+ exit 1
17
+ fi
18
+
19
+ echo "Using nvcc: $NVCC"
20
+ $NVCC --version | head -1
21
+
22
+ # Architecture targets (fat binary)
23
+ # Volta (V100), Turing (2080), Ampere (3090/A100),
24
+ # Ada (4080/4090), Hopper (H100), Blackwell (5070+)
25
+ ARCHS=""
26
+ ARCHS="$ARCHS -gencode=arch=compute_70,code=sm_70" # V100
27
+ ARCHS="$ARCHS -gencode=arch=compute_75,code=sm_75" # 2080
28
+ ARCHS="$ARCHS -gencode=arch=compute_80,code=sm_80" # A100, 3080
29
+ ARCHS="$ARCHS -gencode=arch=compute_86,code=sm_86" # 3090
30
+ ARCHS="$ARCHS -gencode=arch=compute_89,code=sm_89" # 4080, 4090
31
+ ARCHS="$ARCHS -gencode=arch=compute_90,code=sm_90" # H100
32
+
33
+ # Blackwell — only if nvcc supports it (CUDA 12.8+)
34
+ if $NVCC --help 2>&1 | grep -q "compute_120"; then
35
+ ARCHS="$ARCHS -gencode=arch=compute_120,code=sm_120"
36
+ echo "Including Blackwell (sm_120)"
37
+ fi
38
+
39
+ echo "Building libtrit_gemv.so..."
40
+ $NVCC -O3 --use_fast_math \
41
+ -shared -Xcompiler -fPIC \
42
+ $ARCHS \
43
+ -o libtrit_gemv.so \
44
+ trit_gemv_standalone.cu
45
+
46
+ ls -la libtrit_gemv.so
47
+ echo "Done! Library ready at $(pwd)/libtrit_gemv.so"
48
+ echo ""
49
+ echo "Usage from Python:"
50
+ echo " from trit_gemv_lib import TritGEMV"
51
+ echo " lib = TritGEMV()"
52
+ echo " lib.gemv_d2(weights, scales, x_int8, x_scales, output, K, M, ng)"
trit_gemv.cu ADDED
@@ -0,0 +1,292 @@
1
+ /*
2
+ * TritLLM CUDA Kernel — Ternary GEMV (Matrix-Vector Multiply)
3
+ *
4
+ * Core operation: y = W_ternary @ x
5
+ * Where W_ternary is packed ternary weights with per-group scales.
6
+ *
7
+ * Each group of 64 weights has:
8
+ * - A depth (1-4 trits per weight)
9
+ * - A FP16 scale factor
10
+ * - Packed trit values (2 bits per trit: 00=0, 01=+1, 10=-1, 11=unused)
11
+ *
12
+ * The key: NO floating-point multiply in the inner loop.
13
+ * Ternary MAC = conditional add/subtract.
14
+ */
15
+
16
+ #include <cuda_fp16.h>
17
+ #include <cuda_runtime.h>
18
+ #include <stdint.h>
19
+
20
+ #define GROUP_SIZE 64
21
+ #define WARP_SIZE 32
22
+
23
+ // Trit encoding: 2 bits per trit
24
+ // 00 = 0, 01 = +1, 10 = -1
25
+ #define TRIT_ZERO 0
26
+ #define TRIT_POS 1
27
+ #define TRIT_NEG 2
28
+
29
+ /*
30
+ * Depth 1 (3 levels: {-1, 0, +1}): 1 trit per weight, 2 bits per weight
31
+ * Pack 16 trits per uint32 (16 * 2 = 32 bits)
32
+ * Group of 64 = 4 uint32s
33
+ *
34
+ * Inner loop: read trit, branch-free conditional accumulate
35
+ */
36
+ __device__ __forceinline__ float trit_mac_d1(
37
+ const uint32_t* __restrict__ packed, // 4 uint32s = 64 trits
38
+ const float* __restrict__ x, // 64 activations
39
+ int lane // warp lane (0-31)
40
+ ) {
41
+ float acc = 0.0f;
42
+
43
+ // Each thread in warp handles 2 elements (64 / 32 = 2)
44
+ #pragma unroll
45
+ for (int i = 0; i < 2; i++) {
46
+ int idx = lane * 2 + i;
47
+ int word = idx / 16; // which uint32 (0-3)
48
+ int bit_offset = (idx % 16) * 2; // bit position within word
49
+
50
+ uint32_t trit = (packed[word] >> bit_offset) & 0x3;
51
+ float val = x[idx];
52
+
53
+ // Branch-free: acc += (trit == 1) * val - (trit == 2) * val
54
+ acc += ((trit == TRIT_POS) - (trit == TRIT_NEG)) * val;
55
+ }
56
+
57
+ return acc;
58
+ }
59
+
60
+ /*
61
+ * Depth 2 (9 levels: {-4..+4}): 2 trits per weight, 4 bits per weight
62
+ * Level = sign(trit1) * 3 + sign(trit0), each sign in {-1, 0, +1} (maps to -4..+4)
63
+ * Pack 8 values per uint32 (8 * 4 = 32 bits)
64
+ * Group of 64 = 8 uint32s
65
+ */
66
+ __device__ __forceinline__ float trit_mac_d2(
67
+ const uint32_t* __restrict__ packed, // 8 uint32s = 64 values
68
+ const float* __restrict__ x,
69
+ int lane
70
+ ) {
71
+ float acc = 0.0f;
72
+
73
+ #pragma unroll
74
+ for (int i = 0; i < 2; i++) {
75
+ int idx = lane * 2 + i;
76
+ int word = idx / 8;
77
+ int bit_offset = (idx % 8) * 4;
78
+
79
+ uint32_t bits = (packed[word] >> bit_offset) & 0xF;
80
+ // Decode: trit1 = bits >> 2, trit0 = bits & 0x3
81
+ // value = (trit1_sign * 3 + trit0_sign)
82
+ // where trit_sign: 00->0, 01->+1, 10->-1
83
+ int t0 = (int)(bits & 0x3);
84
+ int t1 = (int)((bits >> 2) & 0x3);
85
+ int sign0 = (t0 == TRIT_POS) - (t0 == TRIT_NEG);
86
+ int sign1 = (t1 == TRIT_POS) - (t1 == TRIT_NEG);
87
+ int level = sign1 * 3 + sign0; // -4 to +4
88
+
89
+ // Still no FP multiply — integer * float is one instruction
90
+ // level is small integer, compiler optimizes to repeated add
91
+ acc += level * x[idx];
92
+ }
93
+
94
+ return acc;
95
+ }
96
+
97
+ /*
98
+ * Depth 3 (27 levels: {-13..+13}): 3 trits per weight, 6 bits per weight
99
+ * Pack 5 values per uint32 (5 * 6 = 30 bits, 2 wasted)
100
+ * Group of 64 = 13 uint32s (64 values, last uint32 has 4 values)
101
+ */
102
+ __device__ __forceinline__ float trit_mac_d3(
103
+ const uint32_t* __restrict__ packed, // 13 uint32s
104
+ const float* __restrict__ x,
105
+ int lane
106
+ ) {
107
+ float acc = 0.0f;
108
+
109
+ #pragma unroll
110
+ for (int i = 0; i < 2; i++) {
111
+ int idx = lane * 2 + i;
112
+ int word = idx / 5;
113
+ int pos = idx % 5;
114
+ int bit_offset = pos * 6;
115
+
116
+ uint32_t bits = (packed[word] >> bit_offset) & 0x3F;
117
+ int t0 = (int)(bits & 0x3);
118
+ int t1 = (int)((bits >> 2) & 0x3);
119
+ int t2 = (int)((bits >> 4) & 0x3);
120
+ int s0 = (t0 == TRIT_POS) - (t0 == TRIT_NEG);
121
+ int s1 = (t1 == TRIT_POS) - (t1 == TRIT_NEG);
122
+ int s2 = (t2 == TRIT_POS) - (t2 == TRIT_NEG);
123
+ int level = s2 * 9 + s1 * 3 + s0; // -13 to +13
124
+
125
+ acc += level * x[idx];
126
+ }
127
+
128
+ return acc;
129
+ }
130
+
131
+ /*
132
+ * Depth 4 (81 levels: {-40..+40}): 4 trits per weight, 8 bits per weight
133
+ * Pack 4 values per uint32 (4 * 8 = 32 bits, perfect)
134
+ * Group of 64 = 16 uint32s
135
+ */
136
+ __device__ __forceinline__ float trit_mac_d4(
137
+ const uint32_t* __restrict__ packed, // 16 uint32s
138
+ const float* __restrict__ x,
139
+ int lane
140
+ ) {
141
+ float acc = 0.0f;
142
+
143
+ #pragma unroll
144
+ for (int i = 0; i < 2; i++) {
145
+ int idx = lane * 2 + i;
146
+ int word = idx / 4;
147
+ int bit_offset = (idx % 4) * 8;
148
+
149
+ uint32_t bits = (packed[word] >> bit_offset) & 0xFF;
150
+ int t0 = (int)(bits & 0x3);
151
+ int t1 = (int)((bits >> 2) & 0x3);
152
+ int t2 = (int)((bits >> 4) & 0x3);
153
+ int t3 = (int)((bits >> 6) & 0x3);
154
+ int s0 = (t0 == TRIT_POS) - (t0 == TRIT_NEG);
155
+ int s1 = (t1 == TRIT_POS) - (t1 == TRIT_NEG);
156
+ int s2 = (t2 == TRIT_POS) - (t2 == TRIT_NEG);
157
+ int s3 = (t3 == TRIT_POS) - (t3 == TRIT_NEG);
158
+ int level = s3 * 27 + s2 * 9 + s1 * 3 + s0;
159
+
160
+ acc += level * x[idx];
161
+ }
162
+
163
+ return acc;
164
+ }
165
+
166
+ /*
167
+ * Main GEMV kernel: y[out_features] = W[out_features, in_features] @ x[in_features]
168
+ *
169
+ * W is stored as packed ternary groups:
170
+ * - packed_trits: variable-length packed trit data per group
171
+ * - scales: FP16 scale per group
172
+ * - depths: uint8 depth per group (1-4)
173
+ * - group_offsets: byte offset into packed_trits for each group
174
+ *
175
+ * One warp per output row, iterating over groups along the input dimension.
176
+ * Warp reduction gives the final dot product.
177
+ */
178
+
179
+ // Simplified version: uniform depth across all groups in a tensor
180
+ // (variable-depth version below)
181
+ __global__ void trit_gemv_uniform(
182
+ const uint32_t* __restrict__ packed_trits, // packed trit data
183
+ const float* __restrict__ scales, // [out_features * num_groups] FP16 stored as float
184
+ const float* __restrict__ x, // [in_features]
185
+ float* __restrict__ y, // [out_features]
186
+ int in_features,
187
+ int out_features,
188
+ int depth // uniform depth 1-4
189
+ ) {
190
+ int row = blockIdx.x; // one block per output row
191
+ if (row >= out_features) return;
192
+
193
+ int lane = threadIdx.x; // lane within warp (0-31)
194
+ int num_groups = in_features / GROUP_SIZE;
195
+
196
+ // Words per group depends on depth
197
+ int words_per_group;
198
+ switch (depth) {
199
+ case 1: words_per_group = 4; break; // 64 * 2 / 32
200
+ case 2: words_per_group = 8; break; // 64 * 4 / 32
201
+ case 3: words_per_group = 13; break; // ceil(64 * 6 / 32)
202
+ case 4: words_per_group = 16; break; // 64 * 8 / 32
203
+ default: words_per_group = 4; break;
204
+ }
205
+
206
+ float row_acc = 0.0f;
207
+
208
+ for (int g = 0; g < num_groups; g++) {
209
+ int group_offset = (row * num_groups + g) * words_per_group;
210
+ const uint32_t* group_data = &packed_trits[group_offset];
211
+ const float* group_x = &x[g * GROUP_SIZE];
212
+ float scale = scales[row * num_groups + g];
213
+
214
+ float group_acc;
215
+ switch (depth) {
216
+ case 1: group_acc = trit_mac_d1(group_data, group_x, lane); break;
217
+ case 2: group_acc = trit_mac_d2(group_data, group_x, lane); break;
218
+ case 3: group_acc = trit_mac_d3(group_data, group_x, lane); break;
219
+ case 4: group_acc = trit_mac_d4(group_data, group_x, lane); break;
220
+ default: group_acc = 0.0f; break;
221
+ }
222
+
223
+ // Warp reduction
224
+ #pragma unroll
225
+ for (int offset = 16; offset > 0; offset >>= 1) {
226
+ group_acc += __shfl_down_sync(0xFFFFFFFF, group_acc, offset);
227
+ }
228
+
229
+ // Lane 0 accumulates the scaled result
230
+ if (lane == 0) {
231
+ row_acc += group_acc * scale;
232
+ }
233
+ }
234
+
235
+ // Write output
236
+ if (lane == 0) {
237
+ y[row] = row_acc;
238
+ }
239
+ }
240
+
241
+ /*
242
+ * Variable-depth version: each group can have a different depth.
243
+ * Uses a depth map and offset table to handle mixed-depth tensors.
244
+ */
245
+ __global__ void trit_gemv_variable(
246
+ const uint32_t* __restrict__ packed_trits,
247
+ const float* __restrict__ scales,
248
+ const uint8_t* __restrict__ depth_map, // [num_groups_per_row] depth per group
249
+ const int* __restrict__ group_offsets, // [num_groups_per_row + 1] word offsets
250
+ const float* __restrict__ x,
251
+ float* __restrict__ y,
252
+ int in_features,
253
+ int out_features
254
+ ) {
255
+ int row = blockIdx.x;
256
+ if (row >= out_features) return;
257
+
258
+ int lane = threadIdx.x;
259
+ int num_groups = in_features / GROUP_SIZE;
260
+
261
+ float row_acc = 0.0f;
262
+
263
+ for (int g = 0; g < num_groups; g++) {
264
+ int depth = depth_map[g];
265
+ int word_offset = group_offsets[g] + row * group_offsets[num_groups]; // row stride
266
+ const uint32_t* group_data = &packed_trits[word_offset];
267
+ const float* group_x = &x[g * GROUP_SIZE];
268
+ float scale = scales[row * num_groups + g];
269
+
270
+ float group_acc;
271
+ switch (depth) {
272
+ case 1: group_acc = trit_mac_d1(group_data, group_x, lane); break;
273
+ case 2: group_acc = trit_mac_d2(group_data, group_x, lane); break;
274
+ case 3: group_acc = trit_mac_d3(group_data, group_x, lane); break;
275
+ case 4: group_acc = trit_mac_d4(group_data, group_x, lane); break;
276
+ default: group_acc = 0.0f; break;
277
+ }
278
+
279
+ #pragma unroll
280
+ for (int offset = 16; offset > 0; offset >>= 1) {
281
+ group_acc += __shfl_down_sync(0xFFFFFFFF, group_acc, offset);
282
+ }
283
+
284
+ if (lane == 0) {
285
+ row_acc += group_acc * scale;
286
+ }
287
+ }
288
+
289
+ if (lane == 0) {
290
+ y[row] = row_acc;
291
+ }
292
+ }
trit_gemv_lib.py ADDED
@@ -0,0 +1,280 @@
1
+ """Framework-agnostic trit GEMV library.
2
+
3
+ Loads the pre-compiled libtrit_gemv.so via ctypes.
4
+ Works with PyTorch, JAX, CuPy, or raw CUDA pointers.
5
+
6
+ Compile the library once:
7
+ cd kernel/
8
+ ./build.sh
9
+
10
+ Then use from any framework:
11
+ from trit_gemv_lib import TritGEMV
12
+ lib = TritGEMV()
13
+
14
+ # PyTorch
15
+ lib.gemv_d2(pt_tensor, ws_tensor, xt_tensor, xs_tensor, y_tensor, cols, rows, ng)
16
+
17
+ # Raw pointers (CuPy, JAX, etc.)
18
+ lib.gemv_d2_ptr(pt_ptr, ws_ptr, xt_ptr, xs_ptr, y_ptr, cols, rows, ng)
19
+ """
20
+ import ctypes
21
+ import os
22
+ import subprocess
23
+ import sys
24
+
25
+ # Find the library
26
+ _LIB_NAMES = ['libtrit_gemv.so', 'libtrit_gemv.dll', 'trit_gemv.so']
27
+ _SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
28
+
29
+
30
+ def _find_lib():
31
+ for name in _LIB_NAMES:
32
+ path = os.path.join(_SCRIPT_DIR, name)
33
+ if os.path.exists(path):
34
+ return path
35
+ return None
36
+
37
+
38
+ def _build_lib():
39
+ """Auto-compile if not found."""
40
+ build_script = os.path.join(_SCRIPT_DIR, 'build.sh')
41
+ if os.path.exists(build_script):
42
+ print("Building libtrit_gemv.so...", flush=True)
43
+ subprocess.run(['bash', build_script], cwd=_SCRIPT_DIR, check=True)
44
+ else:
45
+ # Inline build
46
+ cu_file = os.path.join(_SCRIPT_DIR, 'trit_gemv_standalone.cu')
47
+ out_file = os.path.join(_SCRIPT_DIR, 'libtrit_gemv.so')
48
+ if not os.path.exists(cu_file):
49
+ raise FileNotFoundError(f"Cannot find {cu_file}")
50
+
51
+ # Detect GPU architecture
52
+ try:
53
+ import torch
54
+ cc = torch.cuda.get_device_capability(0)
55
+ arch = f"compute_{cc[0]}{cc[1]}"
56
+ sm = f"sm_{cc[0]}{cc[1]}"
57
+ gencode = f"-gencode=arch={arch},code={sm}"
58
+ except Exception:
59
+ # Default to common architectures
60
+ gencode = " ".join([
61
+ f"-gencode=arch=compute_{a},code=sm_{a}"
62
+ for a in ["70", "75", "80", "86", "89", "90"]
63
+ ])
64
+
65
+ cmd = f"nvcc -O3 --use_fast_math -shared -Xcompiler -fPIC {gencode} -o {out_file} {cu_file}"
66
+ print(f"Compiling: {cmd}", flush=True)
67
+ subprocess.run(cmd, shell=True, check=True)
68
+
69
+ return _find_lib()
70
+
71
+
72
+ class TritGEMV:
73
+ """Framework-agnostic trit GEMV kernel."""
74
+
75
+ def __init__(self, lib_path=None):
76
+ if lib_path is None:
77
+ lib_path = _find_lib()
78
+ if lib_path is None:
79
+ lib_path = _build_lib()
80
+ if lib_path is None:
81
+ raise RuntimeError("Cannot find or build libtrit_gemv.so")
82
+
83
+ self._lib = ctypes.CDLL(lib_path)
84
+
85
+ # Set up function signatures
86
+ # d2 dp4a (champion)
87
+ self._lib.trit_gemv_d2_dp4a.argtypes = [
88
+ ctypes.c_void_p, # pt (int32*)
89
+ ctypes.c_void_p, # ws (float*)
90
+ ctypes.c_void_p, # xt (int32*)
91
+ ctypes.c_void_p, # xs (float*)
92
+ ctypes.c_void_p, # y (float*)
93
+ ctypes.c_int, # cols
94
+ ctypes.c_int, # rows
95
+ ctypes.c_int, # num_groups
96
+ ctypes.c_int, # use_l2_persist
97
+ ]
98
+ self._lib.trit_gemv_d2_dp4a.restype = None
99
+
100
+ # d3 native trit
101
+ self._lib.trit_gemv_d3_native.argtypes = [
102
+ ctypes.c_void_p, # pt
103
+ ctypes.c_void_p, # sc
104
+ ctypes.c_void_p, # x
105
+ ctypes.c_void_p, # y
106
+ ctypes.c_int, # cols
107
+ ctypes.c_int, # rows
108
+ ctypes.c_int, # depth
109
+ ]
110
+ self._lib.trit_gemv_d3_native.restype = None
111
+
112
+ # d3 int8 dp4a (no decode, DRAM-bound path)
113
+ self._lib.trit_gemv_d3_int8_dp4a.argtypes = [
114
+ ctypes.c_void_p, # wt (int32*)
115
+ ctypes.c_void_p, # ws (float*)
116
+ ctypes.c_void_p, # xt (int32*)
117
+ ctypes.c_void_p, # xs (float*)
118
+ ctypes.c_void_p, # y (float*)
119
+ ctypes.c_int, # cols
120
+ ctypes.c_int, # rows
121
+ ctypes.c_int, # num_groups
122
+ ctypes.c_int, # use_l2_persist
123
+ ]
124
+ self._lib.trit_gemv_d3_int8_dp4a.restype = None
125
+
126
+ # Utility
127
+ self._lib.get_l2_cache_bytes.restype = ctypes.c_int
128
+ self._lib.cuda_sync.restype = None
129
+
130
+ buf = ctypes.create_string_buffer(256)
131
+ self._lib.get_gpu_name(buf, 256)
132
+ self.gpu_name = buf.value.decode()
133
+ self.l2_bytes = self._lib.get_l2_cache_bytes()
134
+
135
+ def sync(self):
136
+ self._lib.cuda_sync()
137
+
138
+ def _get_ptr(self, tensor):
139
+ """Extract GPU pointer from any framework's tensor."""
140
+ if hasattr(tensor, 'data_ptr'):
141
+ # PyTorch
142
+ return tensor.data_ptr()
143
+ elif hasattr(tensor, '__cuda_array_interface__'):
144
+ # CuPy, JAX, Numba
145
+ return tensor.__cuda_array_interface__['data'][0]
146
+ elif isinstance(tensor, int):
147
+ # Raw pointer
148
+ return tensor
149
+ else:
150
+ raise TypeError(f"Cannot extract GPU pointer from {type(tensor)}")
151
+
152
+ def gemv_d2(self, pt, ws, xt, xs, y, cols, rows, num_groups, l2_persist=True):
153
+ """D2 GEMV with int4 packing + dp4a.
154
+
155
+ Args:
156
+ pt: int32 tensor [rows * num_groups * 8] — int4 packed weights
157
+ ws: float32 tensor [rows * num_groups] — weight scales
158
+ xt: int32 tensor [num_groups * 16] — int8 packed activations
159
+ xs: float32 tensor [num_groups] — activation scales
160
+ y: float32 tensor [rows] — output (written in-place)
161
+ cols: input dimension (K)
162
+ rows: output dimension (M)
163
+ num_groups: K // 64
164
+ l2_persist: enable L2 cache persistence (default True)
165
+ """
166
+ self._lib.trit_gemv_d2_dp4a(
167
+ self._get_ptr(pt), self._get_ptr(ws),
168
+ self._get_ptr(xt), self._get_ptr(xs),
169
+ self._get_ptr(y), cols, rows, num_groups,
170
+ 1 if l2_persist else 0,
171
+ )
172
+
173
+ def gemv_adaptive(self, pt_int4, ws, xt, xs, y, cols, rows, num_groups,
174
+ pt_int8=None):
175
+ """Hardware-aware GEMV: auto-selects best kernel based on L2 cache.
176
+
177
+ If the int4 weight data fits in L2 → uses d2 int4 + dp4a (5x FP16)
178
+ If not → uses pre-expanded int8 + dp4a (2x FP16, no decode overhead)
179
+
180
+ Args:
181
+ pt_int4: int32 tensor — int4 packed weights (always stored, compact)
182
+ ws: weight scales
183
+ xt, xs: quantized activations
184
+ y: output
185
+ pt_int8: optional pre-expanded int8 weights for DRAM path.
186
+ If None and needed, expanded on-the-fly (one-time cost).
187
+ """
188
+ weight_bytes = rows * num_groups * 8 * 4 # int4: 8 words per group
189
+ l2_margin = self.l2_bytes * 0.75 # leave 25% for x, scales, other data
190
+
191
+ if weight_bytes < l2_margin:
192
+ # Fits in L2 → use compact int4, decode inline at L2 speed
193
+ self._lib.trit_gemv_d2_dp4a(
194
+ self._get_ptr(pt_int4), self._get_ptr(ws),
195
+ self._get_ptr(xt), self._get_ptr(xs),
196
+ self._get_ptr(y), cols, rows, num_groups, 1)
197
+ else:
198
+ # Doesn't fit L2 → use int8 for zero-decode DRAM speed
199
+ if pt_int8 is None:
200
+ raise ValueError(
201
+ f"Layer ({weight_bytes/1e6:.0f} MB) exceeds L2 ({self.l2_bytes/1e6:.0f} MB). "
202
+ f"Provide pre-expanded pt_int8 for DRAM path. "
203
+ f"Use TritGEMV.expand_int4_to_int8(pt_int4) at model load time."
204
+ )
205
+ self._lib.trit_gemv_d3_int8_dp4a(
206
+ self._get_ptr(pt_int8), self._get_ptr(ws),
207
+ self._get_ptr(xt), self._get_ptr(xs),
208
+ self._get_ptr(y), cols, rows, num_groups, 0)
209
+
210
+ @staticmethod
211
+ def expand_int4_to_int8(pt_int4, device='cuda'):
212
+ """Pre-expand int4 packed weights to int8 for DRAM-bound layers.
213
+
214
+ Called once at model load. Uses 2x more VRAM but eliminates decode overhead.
215
+ int4: 8 words per group → int8: 16 words per group
216
+
217
+ Args:
218
+ pt_int4: int32 tensor [n_groups * 8] — int4 packed
219
+ Returns:
220
+ int32 tensor [n_groups * 16] — int8 packed (dp4a compatible)
221
+ """
222
+ import torch
223
+ n_words = pt_int4.shape[0]
224
+ n_groups = n_words // 8
225
+
226
+ # Each int4 word has 8 nibbles → 8 int8 values → 2 int8x4 words
227
+ pt_int8 = torch.zeros(n_groups * 16, dtype=torch.int32, device=device)
228
+
229
+ # Expand on GPU (vectorized)
230
+ for g in range(n_groups):
231
+ for w in range(8):
232
+ word = pt_int4[g * 8 + w].item()
233
+ for nib in range(8):
234
+ val = (word >> (nib * 4)) & 0xF
235
+ if val & 0x8:
236
+ val = val | 0xFFFFFFF0 # sign extend
237
+ val = val & 0xFF
238
+ out_col = w * 8 + nib
239
+ out_word = out_col // 4
240
+ out_byte = out_col % 4
241
+ pt_int8[g * 16 + out_word] |= (val << (out_byte * 8))
242
+
243
+ return pt_int8
244
+
245
+ def gemv_d3(self, pt, sc, x, y, cols, rows, depth=3):
246
+ """D3 GEMV with native trit packing.
247
+
248
+ Args:
249
+ pt: int32 tensor [rows * ng * 13] — trit packed weights
250
+ sc: float32 tensor [rows * ng] — scales
251
+ x: float32 tensor [cols] — activations
252
+ y: float32 tensor [rows] — output
253
+ """
254
+ self._lib.trit_gemv_d3_native(
255
+ self._get_ptr(pt), self._get_ptr(sc),
256
+ self._get_ptr(x), self._get_ptr(y),
257
+ cols, rows, depth,
258
+ )
259
+
260
+ def gemv_d3_int8(self, wt, ws, xt, xs, y, cols, rows, num_groups, l2_persist=True):
261
+ """D3 GEMV with int8 level packing + dp4a (same quality as d3, dp4a speed).
262
+
263
+ Args:
264
+ wt: int32 tensor [rows * num_groups * 16] — int8 packed levels
265
+ ws: float32 tensor [rows * num_groups] — weight scales
266
+ xt: int32 tensor [num_groups * 16] — int8 packed activations
267
+ xs: float32 tensor [num_groups * 16] — per-word x scales
268
+ y: float32 tensor [rows] — output
269
+ """
270
+ if not hasattr(self._lib, 'trit_gemv_d3_int8_dp4a'):
271
+ raise RuntimeError("d3 int8 not in this build — rebuild libtrit_gemv.so")
272
+ self._lib.trit_gemv_d3_int8_dp4a(
273
+ self._get_ptr(wt), self._get_ptr(ws),
274
+ self._get_ptr(xt), self._get_ptr(xs),
275
+ self._get_ptr(y), cols, rows, num_groups,
276
+ 1 if l2_persist else 0,
277
+ )
278
+
279
+ def __repr__(self):
280
+ return f"TritGEMV(gpu='{self.gpu_name}', l2={self.l2_bytes/1e6:.0f}MB)"
trit_gemv_standalone.cu ADDED
@@ -0,0 +1,598 @@
1
+ /*
2
+ * Standalone trit GEMV kernel — no PyTorch dependency.
3
+ * Compiles with nvcc to a shared library (.so/.dll).
4
+ * Called from Python via ctypes, or from C/C++ directly.
5
+ *
6
+ * Compile:
7
+ * nvcc -O3 --use_fast_math -shared -Xcompiler -fPIC \
8
+ * -gencode=arch=compute_70,code=sm_70 \
9
+ * -gencode=arch=compute_75,code=sm_75 \
10
+ * -gencode=arch=compute_80,code=sm_80 \
11
+ * -gencode=arch=compute_86,code=sm_86 \
12
+ * -gencode=arch=compute_89,code=sm_89 \
13
+ * -gencode=arch=compute_90,code=sm_90 \
14
+ * -gencode=arch=compute_100,code=sm_100 \
15
+ * -gencode=arch=compute_120,code=sm_120 \
16
+ * -o libtrit_gemv.so trit_gemv_standalone.cu
17
+ *
18
+ * Supports: Volta(V100), Turing(2080), Ampere(3090/A100), Ada(4080/4090),
19
+ * Hopper(H100), Blackwell(5070/5090) — all in one binary.
20
+ *
21
+ * API: C functions with extern "C" — callable from any language.
22
+ */
23
+
24
+ #include <cuda_runtime.h>
25
+ #include <stdint.h>
+ #include <string.h>   // strncpy, memset
26
+
27
+ #define GROUP_SIZE 64
28
+ #define WARP_SIZE 32
29
+ #define TRIT_POS 1
30
+ #define TRIT_NEG 2
31
+
32
+ // Forward declarations
33
+ static void set_l2_persist(void* ptr, size_t bytes);
34
+ static void clear_l2_persist();
35
+
36
+ // ============================================================
37
+ // V27: D2 int4-packed + dp4a + L2 persist (champion kernel)
38
+ // ============================================================
39
+
40
+ // ============================================================
41
+ // V28: Branchless interleaved nibble decode (7 instructions for 8 weights)
42
+ //
43
+ // The trick: extract even nibbles (0,2,4,6) and odd nibbles (1,3,5,7)
44
+ // as separate byte vectors using mask + shift. Sign-extend all 4 bytes
45
+ // simultaneously with XOR + SUB (zero branches).
46
+ //
47
+ // x activations are pre-interleaved to match: x_evens has values at
48
+ // positions 0,2,4,6 and x_odds has 1,3,5,7. Pre-interleave is done
49
+ // once at activation quantization time (negligible cost).
50
+ //
51
+ // Instructions per 8 weights:
52
+ // v27: 32 (loop with branches)
53
+ // v28: 14 (7 expand + 2 dp4a + 3 load + 2 scale)
54
+ // Balance BW shift: 3.4 → 5.6 TB/s on A100 → crosses into memory-bound!
55
+ // ============================================================
56
+
57
+ #define V28_RPB 16
58
+ #define V28_WPG 8
59
+ #define V28_BS (V28_RPB * WARP_SIZE)
60
+
61
+ __global__ void k_v28(
62
+ const uint32_t* __restrict__ pt, // int4 packed: 8 weights per uint32
63
+ const float* __restrict__ ws, // weight scales [rows * ng]
64
+ const uint32_t* __restrict__ xt_e, // x int8 EVEN positions [ng * 8]
65
+ const uint32_t* __restrict__ xt_o, // x int8 ODD positions [ng * 8]
66
+ const float* __restrict__ xs, // x scales [ng]
67
+ float* __restrict__ y,
68
+ int cols, int rows, int num_groups
69
+ ) {
70
+ int wid = threadIdx.x / WARP_SIZE;
71
+ int lane = threadIdx.x % WARP_SIZE;
72
+ int row = blockIdx.x * V28_RPB + wid;
73
+ if (row >= rows) return;
74
+
75
+ const uint32_t* row_w = &pt[row * num_groups * V28_WPG];
76
+ const float* row_ws = &ws[row * num_groups];
77
+
78
+ float acc = 0.0f;
79
+ int total_words = num_groups * V28_WPG;
80
+
81
+ for (int base = 0; base < total_words; base += WARP_SIZE) {
82
+ int w = base + lane;
83
+ if (w < total_words) {
84
+ // COALESCED load
85
+ uint32_t word = __ldg(&row_w[w]);
86
+ int g = w >> 3; // group index (shift, 1 cycle)
87
+ int word_in_group = w & 7; // word within group (mask, 1 cycle)
88
+
89
+ // === BRANCHLESS INT4→INT8 EXPANSION (7 instructions) ===
90
+ // Extract even nibbles (weights 0,2,4,6) into bytes
91
+ uint32_t evens = word & 0x0F0F0F0F; // AND (1 op)
92
+ evens = (evens ^ 0x08080808) - 0x08080808; // XOR+SUB (2 ops)
93
+
94
+ // Extract odd nibbles (weights 1,3,5,7) into bytes
95
+ uint32_t odds = (word >> 4) & 0x0F0F0F0F; // SHR+AND (2 ops)
96
+ odds = (odds ^ 0x08080808) - 0x08080808; // XOR+SUB (2 ops)
97
+ // Total: 7 instructions for 8 sign-extended int8 values
98
+
99
+ // dp4a against pre-interleaved x
100
+ // x_evens[g*8 + word_in_group] has activations at even positions
101
+ // x_odds[g*8 + word_in_group] has activations at odd positions
102
+ int x_idx = g * 8 + word_in_group;
103
+ uint32_t xe = __ldg(&xt_e[x_idx]);
104
+ uint32_t xo = __ldg(&xt_o[x_idx]);
105
+
106
+ int dp_e = __dp4a((int)evens, (int)xe, 0);
107
+ int dp_o = __dp4a((int)odds, (int)xo, 0);
108
+
109
+ float combined_scale = __ldg(&row_ws[g]) * __ldg(&xs[g]);
110
+ acc += (float)(dp_e + dp_o) * combined_scale;
111
+ }
112
+ }
113
+
114
+ #pragma unroll
115
+ for (int o = 16; o > 0; o >>= 1)
116
+ acc += __shfl_down_sync(0xFFFFFFFF, acc, o);
117
+
118
+ if (lane == 0) y[row] = acc;
119
+ }
120
+
121
+ // ============================================================
122
+ // V29: BIAS TRICK — zero sign extension, unsigned weights + correction
123
+ //
124
+ // Store weights as (level + 4) → range 0-8 (unsigned).
125
+ // Zero-extend nibbles: AND only, no XOR, no SUB.
126
+ // dp4a gives biased result. Subtract precomputed correction.
127
+ //
128
+ // Decode: 3 instructions (AND, SHR, AND) vs v28's 7
129
+ // Correction: 1 SUB + 1 LD per word (precomputed x_bias)
130
+ // Net: 10 instructions per word vs v28's 14
131
+ // ============================================================
132
+
133
+ #define V29_RPB 16
134
+ #define V29_WPG 8
135
+ #define V29_BS (V29_RPB * WARP_SIZE)
136
+
137
+ __global__ void k_v29(
138
+ const uint32_t* __restrict__ pt, // UNSIGNED int4: (level+4) packed, 0-8 per nibble
139
+ const float* __restrict__ ws, // weight scales [rows * ng]
140
+ const uint32_t* __restrict__ xt_e, // x int8 EVEN [ng * 8]
141
+ const uint32_t* __restrict__ xt_o, // x int8 ODD [ng * 8]
142
+ const int* __restrict__ x_bias, // precomputed 4×(sum of 8 x values) per word position [ng * 8]
143
+ const float* __restrict__ xs, // x scales [ng]
144
+ float* __restrict__ y,
145
+ int cols, int rows, int num_groups
146
+ ) {
147
+ int wid = threadIdx.x / WARP_SIZE;
148
+ int lane = threadIdx.x % WARP_SIZE;
149
+ int row = blockIdx.x * V29_RPB + wid;
150
+ if (row >= rows) return;
151
+
152
+ const uint32_t* row_w = &pt[row * num_groups * V29_WPG];
153
+ const float* row_ws = &ws[row * num_groups];
154
+
155
+ float acc = 0.0f;
156
+ int total_words = num_groups * V29_WPG;
157
+
158
+ for (int base = 0; base < total_words; base += WARP_SIZE) {
159
+ int w = base + lane;
160
+ if (w < total_words) {
161
+ uint32_t word = __ldg(&row_w[w]);
162
+ int g = w >> 3;
163
+ int wig = w & 7;
164
+
165
+ // BIAS DECODE: 3 instructions total (no XOR, no SUB)
166
+ uint32_t evens = word & 0x0F0F0F0F; // AND
167
+ uint32_t odds = (word >> 4) & 0x0F0F0F0F; // SHR + AND
168
+ // Values 0-8 are valid positive int8 — no sign extension needed
169
+
170
+ int x_idx = g * 8 + wig;
171
+ uint32_t xe = __ldg(&xt_e[x_idx]);
172
+ uint32_t xo = __ldg(&xt_o[x_idx]);
173
+
174
+ // dp4a: biased result (includes +4 per weight)
175
+ int dp = __dp4a((int)evens, (int)xe, 0)
176
+ + __dp4a((int)odds, (int)xo, 0);
177
+
178
+ // Subtract precomputed bias: 4 × sum of 8 x values for this word
179
+ int bias = __ldg(&x_bias[x_idx]);
180
+ dp -= bias;
181
+
182
+ float combined_scale = __ldg(&row_ws[g]) * __ldg(&xs[g]);
183
+ acc += (float)dp * combined_scale;
184
+ }
185
+ }
186
+
187
+ #pragma unroll
188
+ for (int o = 16; o > 0; o >>= 1)
189
+ acc += __shfl_down_sync(0xFFFFFFFF, acc, o);
190
+
191
+ if (lane == 0) y[row] = acc;
192
+ }
193
+
194
+ // v29 wrapper moved to extern "C" block below
195
+
196
+ // Keep v27 as fallback (doesn't need interleaved x)
197
+ __device__ __forceinline__ uint32_t extract_int4x4_to_int8x4(uint32_t word, int start) {
198
+ uint32_t result = 0;
199
+ #pragma unroll
200
+ for (int i = 0; i < 4; i++) {
201
+ int shift = (start + i) * 4;
202
+ int nibble = (word >> shift) & 0xF;
203
+ int val = (nibble & 0x8) ? (nibble | 0xFFFFFFF0) : nibble;
204
+ result |= ((uint32_t)(val & 0xFF)) << (i * 8);
205
+ }
206
+ return result;
207
+ }
208
+
209
+ #define V27_RPB 16
210
+ #define V27_WPG 8
211
+ #define V27_BS (V27_RPB * WARP_SIZE)
212
+
213
+ __global__ void k_v27(
214
+ const uint32_t* __restrict__ pt,
215
+ const float* __restrict__ ws,
216
+ const uint32_t* __restrict__ xt,
217
+ const float* __restrict__ xs,
218
+ float* __restrict__ y,
219
+ int cols, int rows, int num_groups
220
+ ) {
221
+ int wid = threadIdx.x / WARP_SIZE;
222
+ int lane = threadIdx.x % WARP_SIZE;
223
+ int row = blockIdx.x * V27_RPB + wid;
224
+ if (row >= rows) return;
225
+
226
+ const uint32_t* row_w = &pt[row * num_groups * V27_WPG];
227
+ const float* row_ws = &ws[row * num_groups];
228
+
229
+ float acc = 0.0f;
230
+ int total_words = num_groups * V27_WPG;
231
+
232
+ for (int base = 0; base < total_words; base += WARP_SIZE) {
233
+ int w = base + lane;
234
+ if (w < total_words) {
235
+ uint32_t word = __ldg(&row_w[w]);
236
+ int g = w >> 3;
237
+ int word_in_group = w & 7;
238
+
239
+ uint32_t lo = extract_int4x4_to_int8x4(word, 0);
240
+ uint32_t hi = extract_int4x4_to_int8x4(word, 4);
241
+
242
+ int x_base = g * 16 + word_in_group * 2;
243
+ uint32_t x_lo = __ldg(&xt[x_base]);
244
+ uint32_t x_hi = __ldg(&xt[x_base + 1]);
245
+
246
+ int dp_lo = __dp4a((int)lo, (int)x_lo, 0);
247
+ int dp_hi = __dp4a((int)hi, (int)x_hi, 0);
248
+
249
+ float combined_scale = __ldg(&row_ws[g]) * __ldg(&xs[g]);
250
+ acc += (float)(dp_lo + dp_hi) * combined_scale;
251
+ }
252
+ }
253
+
254
+ #pragma unroll
255
+ for (int o = 16; o > 0; o >>= 1)
256
+ acc += __shfl_down_sync(0xFFFFFFFF, acc, o);
257
+
258
+ if (lane == 0) y[row] = acc;
259
+ }
260
+
261
+ // ============================================================
262
+ // V9-style: trit-packed d3 (for models stored in trit format)
263
+ // ============================================================
264
+
265
+ #define V9R 4
266
+ #define V9W 2
267
+ #define V9BS (V9R * V9W * WARP_SIZE)
268
+
269
+ __device__ __forceinline__ float mac_wide_d3(
270
+ const uint32_t* __restrict__ p, const float* __restrict__ x, int tid
271
+ ) {
272
+ float acc = 0.0f;
273
+ #pragma unroll
274
+ for (int i = 0; i < 4; i++) {
275
+ int idx = tid * 4 + i;
276
+ int w = idx / 5, pos = idx % 5;
277
+ uint32_t bits = (__ldg(&p[w]) >> (pos * 6)) & 0x3F;
278
+ int t0 = bits & 3, t1 = (bits >> 2) & 3, t2 = (bits >> 4) & 3;
279
+ int lv = ((t2==TRIT_POS)-(t2==TRIT_NEG))*9 + ((t1==TRIT_POS)-(t1==TRIT_NEG))*3
280
+ + ((t0==TRIT_POS)-(t0==TRIT_NEG));
281
+ acc += lv * __ldg(&x[idx]);
282
+ }
283
+ return acc;
284
+ }
285
+
286
+ __global__ void k_v9(
287
+ const uint32_t* __restrict__ pt, const float* __restrict__ sc,
288
+ const float* __restrict__ x, float* __restrict__ y,
289
+ int in_f, int out_f, int depth
290
+ ) {
291
+ __shared__ float parts[V9R * V9W];
292
+ int base = blockIdx.x * V9R;
293
+ int wid = threadIdx.x / WARP_SIZE, lane = threadIdx.x % WARP_SIZE;
294
+ int lr = wid / V9W, rw = wid % V9W, row = base + lr;
295
+ int ng = in_f / GROUP_SIZE;
296
+ const int w = 13;
297
+
298
+ int half = lane / 16;
299
+ int tid_in_group = lane % 16;
300
+
301
+ float partial = 0.0f;
302
+ if (row < out_f) {
303
+ for (int g_pair = rw; g_pair < (ng + 1) / 2; g_pair += V9W) {
304
+ int g = g_pair * 2 + half;
305
+ if (g < ng) {
306
+ float ga = mac_wide_d3(&pt[(row * ng + g) * w],
307
+ &x[g * GROUP_SIZE], tid_in_group);
308
+ unsigned mask = half ? 0xFFFF0000u : 0x0000FFFFu;
309
+ #pragma unroll
310
+ for (int o = 8; o > 0; o >>= 1)
311
+ ga += __shfl_down_sync(mask, ga, o);
312
+ if (tid_in_group == 0)
313
+ partial += ga * __ldg(&sc[row * ng + g]);
314
+ }
315
+ }
316
+ }
317
+
318
+ float my_partial = (lane == 0 || lane == 16) ? partial : 0.0f;
319
+ my_partial += __shfl_xor_sync(0xFFFFFFFF, my_partial, 16);
320
+ if (lane == 0) parts[wid] = my_partial;
321
+ __syncthreads();
322
+ if (lane == 0 && rw == 0 && row < out_f) {
323
+ float s = 0;
324
+ for (int i = 0; i < V9W; i++) s += parts[lr * V9W + i];
325
+ y[row] = s;
326
+ }
327
+ }
328
+
329
+ // ============================================================
330
+ // L2 persistence helpers
331
+ // ============================================================
332
+
333
+ static void set_l2_persist(void* ptr, size_t bytes) {
334
+ cudaStreamAttrValue attr;
335
+ attr.accessPolicyWindow.base_ptr = ptr;
336
+ attr.accessPolicyWindow.num_bytes = bytes;
337
+ attr.accessPolicyWindow.hitRatio = 1.0f;
338
+ attr.accessPolicyWindow.hitProp = cudaAccessPropertyPersisting;
339
+ attr.accessPolicyWindow.missProp = cudaAccessPropertyStreaming;
340
+ cudaStreamSetAttribute(0, cudaStreamAttributeAccessPolicyWindow, &attr);
341
+ }
342
+
343
+ static void clear_l2_persist() {
344
+ cudaStreamAttrValue attr;
345
+ memset(&attr, 0, sizeof(attr));
346
+ cudaStreamSetAttribute(0, cudaStreamAttributeAccessPolicyWindow, &attr);
347
+ }
348
+
349
+ // ============================================================
350
+ // C API — callable from any language via dlopen/ctypes/FFI
351
+ // ============================================================
352
+
353
+ extern "C" {
354
+
355
+ // v27: d2 int4-packed + dp4a (champion for GPU)
356
+ // pt: [rows * ng * 8] int32 (int4 packed weights)
357
+ // ws: [rows * ng] float32 (weight scales)
358
+ // xt: [ng * 16] int32 (int8 packed activations)
359
+ // xs: [ng] float32 (activation scales)
360
+ // y: [rows] float32 (output)
361
+ void trit_gemv_d2_dp4a(
362
+ const int32_t* pt, const float* ws,
363
+ const int32_t* xt, const float* xs,
364
+ float* y, int cols, int rows, int num_groups,
365
+ int use_l2_persist
366
+ ) {
367
+ if (use_l2_persist) {
368
+ set_l2_persist((void*)pt, (size_t)rows * num_groups * 8 * sizeof(int32_t));
369
+ }
370
+ k_v27<<<(rows + V27_RPB - 1) / V27_RPB, V27_BS>>>(
371
+ (const uint32_t*)pt, ws, (const uint32_t*)xt, xs, y, cols, rows, num_groups);
372
+ if (use_l2_persist) {
373
+ clear_l2_persist();
374
+ }
375
+ }
376
+
377
+ // v9: trit-packed d3 (for native trit format)
378
+ // pt: [rows * ng * 13] int32 (trit packed weights)
379
+ // sc: [rows * ng] float32 (scales)
380
+ // x: [cols] float32 (activations)
381
+ // y: [rows] float32 (output)
382
+ void trit_gemv_d3_native(
383
+ const int32_t* pt, const float* sc,
384
+ const float* x, float* y,
385
+ int cols, int rows, int depth
386
+ ) {
387
+ k_v9<<<(rows + V9R - 1) / V9R, V9BS>>>(
388
+ (const uint32_t*)pt, sc, x, y, cols, rows, depth);
389
+ }
390
+
391
+ // v29: d2 unsigned int4 + bias trick (no sign extension)
392
+ void trit_gemv_d2_bias(
393
+ const int32_t* pt, const float* ws,
394
+ const int32_t* xt_e, const int32_t* xt_o,
395
+ const int32_t* x_bias, const float* xs,
396
+ float* y, int cols, int rows, int num_groups,
397
+ int use_l2_persist
398
+ ) {
399
+ if (use_l2_persist) {
400
+ set_l2_persist((void*)pt, (size_t)rows * num_groups * 8 * sizeof(int32_t));
401
+ }
402
+ k_v29<<<(rows + V29_RPB - 1) / V29_RPB, V29_BS>>>(
403
+ (const uint32_t*)pt, ws,
404
+ (const uint32_t*)xt_e, (const uint32_t*)xt_o,
405
+ (const int*)x_bias, xs,
406
+ y, cols, rows, num_groups);
407
+ if (use_l2_persist) {
408
+ clear_l2_persist();
409
+ }
410
+ }
411
+
412
+ // v28: d2 int4 + branchless interleaved decode + dp4a
413
+ // xt_e/xt_o: pre-interleaved x (even/odd nibble positions)
414
+ // Each has ng*8 uint32 words (4 int8 values per word, 8 words per group)
415
+ void trit_gemv_d2_fast(
416
+ const int32_t* pt, const float* ws,
417
+ const int32_t* xt_e, const int32_t* xt_o, const float* xs,
418
+ float* y, int cols, int rows, int num_groups,
419
+ int use_l2_persist
420
+ ) {
421
+ if (use_l2_persist) {
422
+ set_l2_persist((void*)pt, (size_t)rows * num_groups * 8 * sizeof(int32_t));
423
+ }
424
+ k_v28<<<(rows + V28_RPB - 1) / V28_RPB, V28_BS>>>(
425
+ (const uint32_t*)pt, ws,
426
+ (const uint32_t*)xt_e, (const uint32_t*)xt_o, xs,
427
+ y, cols, rows, num_groups);
428
+ if (use_l2_persist) {
429
+ clear_l2_persist();
430
+ }
431
+ }
432
+
433
+ // v21f: d3 int8-packed + dp4a (same format as v21f in wrapper)
434
+ // wt: [rows * ng * 16] int32 (int8 packed weight levels)
435
+ // ws: [rows * ng] float32 (weight scales)
436
+ // xt: [ng * 16] int32 (int8 packed activations)
437
+ // xs: [ng] float32 (activation scales)
438
+ #define V21F_RPB 4
439
+ #define V21F_BS (V21F_RPB * WARP_SIZE)
440
+
441
+ __global__ void k_v21f_standalone(
442
+ const uint32_t* __restrict__ wt,
443
+ const float* __restrict__ ws,
444
+ const uint32_t* __restrict__ xt,
445
+ const float* __restrict__ xs,
446
+ float* __restrict__ y,
447
+ int cols, int rows, int num_groups
448
+ ) {
449
+ int wid = threadIdx.x / WARP_SIZE;
450
+ int lane = threadIdx.x % WARP_SIZE;
451
+ int row = blockIdx.x * V21F_RPB + wid;
452
+ if (row >= rows) return;
453
+
454
+ const uint32_t* row_w = &wt[row * num_groups * 16];
455
+ const float* row_ws = &ws[row * num_groups];
456
+ float acc = 0.0f;
457
+ int total_words = num_groups * 16;
458
+
459
+ for (int base = 0; base < total_words; base += WARP_SIZE) {
460
+ int w = base + lane;
461
+ if (w < total_words) {
462
+ uint32_t w_word = __ldg(&row_w[w]);
463
+ uint32_t x_word = __ldg(&xt[w]);
464
+ int dp = __dp4a((int)w_word, (int)x_word, 0);
465
+ int g = w >> 4; // 16 words per group
466
+ acc += (float)dp * __ldg(&row_ws[g]) * __ldg(&xs[g]);
467
+ }
468
+ }
469
+
470
+ #pragma unroll
471
+ for (int o = 16; o > 0; o >>= 1)
472
+ acc += __shfl_down_sync(0xFFFFFFFF, acc, o);
473
+
474
+ if (lane == 0) y[row] = acc;
475
+ }
476
+
477
+ // ============================================================
478
+ // D3 HARDENED: int8 dp4a, 16 RPB, L2 persist, deferred reduction
479
+ //
480
+ // d3 levels: -13 to +13 (27 values), stored as int8 (1 byte each)
481
+ // 16 words per group, 4 int8 values per word = 64 values/group
482
+ // Division by 16 = shift (no div-by-13 problem!)
483
+ //
484
+ // This is the SIMPLEST kernel — no decode at all.
485
+ // The int8 values go DIRECTLY into dp4a.
486
+ // Pure memory-bound on every GPU.
487
+ // ============================================================
488
+
489
+ #define D3H_RPB 16
490
+ #define D3H_WPG 16 // 16 uint32 words per group (4 int8 each = 64 values)
491
+ #define D3H_BS (D3H_RPB * WARP_SIZE)
492
+
493
+ __global__ void k_d3_hardened(
494
+ const uint32_t* __restrict__ wt, // int8 packed: 16 words per group
495
+ const float* __restrict__ ws, // weight scales [rows * ng]
496
+ const uint32_t* __restrict__ xt, // x int8 packed: 16 words per group
497
+ const float* __restrict__ xs, // x scales [ng]
498
+ float* __restrict__ y,
499
+ int cols, int rows, int num_groups
500
+ ) {
501
+ int wid = threadIdx.x / WARP_SIZE;
502
+ int lane = threadIdx.x % WARP_SIZE;
503
+ int row = blockIdx.x * D3H_RPB + wid;
504
+ if (row >= rows) return;
505
+
506
+ const uint32_t* row_w = &wt[row * num_groups * D3H_WPG];
507
+ const float* row_ws = &ws[row * num_groups];
508
+
509
+ float acc = 0.0f;
510
+ int total_words = num_groups * D3H_WPG;
511
+
512
+ for (int base = 0; base < total_words; base += WARP_SIZE) {
513
+ int w = base + lane;
514
+ if (w < total_words) {
515
+ // COALESCED load — 32 threads × 4 bytes = 128 bytes
516
+ uint32_t w_word = __ldg(&row_w[w]);
517
+ uint32_t x_word = __ldg(&xt[w]);
518
+
519
+ // dp4a: 4× int8 multiply-accumulate — ZERO decode
520
+ int dp = __dp4a((int)w_word, (int)x_word, 0);
521
+
522
+ // Group index: SHIFT (16 is power of 2!)
523
+ int g = w >> 4;
524
+
525
+ // Deferred: accumulate per-thread, ONE reduction at end
526
+ acc += (float)dp * __ldg(&row_ws[g]) * __ldg(&xs[g]);
527
+ }
528
+ }
529
+
530
+ // ONE warp reduction
531
+ #pragma unroll
532
+ for (int o = 16; o > 0; o >>= 1)
533
+ acc += __shfl_down_sync(0xFFFFFFFF, acc, o);
534
+
535
+ if (lane == 0) y[row] = acc;
536
+ }
537
+
538
+ // d3 hardened: uses k_d3_hardened (16 RPB, deferred reduction)
539
+ void trit_gemv_d3_int8_dp4a(
540
+ const int32_t* wt, const float* ws,
541
+ const int32_t* xt, const float* xs,
542
+ float* y, int cols, int rows, int num_groups,
543
+ int use_l2_persist
544
+ ) {
545
+ if (use_l2_persist) {
546
+ set_l2_persist((void*)wt, (size_t)rows * num_groups * 16 * sizeof(int32_t));
547
+ }
548
+ k_d3_hardened<<<(rows + D3H_RPB - 1) / D3H_RPB, D3H_BS>>>(
549
+ (const uint32_t*)wt, ws, (const uint32_t*)xt, xs, y, cols, rows, num_groups);
550
+ if (use_l2_persist) {
551
+ clear_l2_persist();
552
+ }
553
+ }
554
+
555
+ // Run the same layer N times back-to-back to measure pipeline / L2 reuse benefit
556
+ void trit_gemv_pipeline_bench(
557
+ const int32_t* pt, const float* ws,
558
+ const int32_t* xt_e, const int32_t* xt_o, const float* xs,
559
+ float* y, int cols, int rows, int num_groups,
560
+ int n_repeats, int use_l2_persist
561
+ ) {
562
+ if (use_l2_persist) {
563
+ set_l2_persist((void*)pt, (size_t)rows * num_groups * 8 * sizeof(int32_t));
564
+ }
565
+ // Launch n_repeats sequential v28 kernels in the SAME stream
566
+ // This measures the pipeline benefit: back-to-back launches share L2
567
+ for (int i = 0; i < n_repeats; i++) {
568
+ k_v28<<<(rows + V28_RPB - 1) / V28_RPB, V28_BS>>>(
569
+ (const uint32_t*)pt, ws,
570
+ (const uint32_t*)xt_e, (const uint32_t*)xt_o, xs,
571
+ y, cols, rows, num_groups);
572
+ }
573
+ if (use_l2_persist) {
574
+ clear_l2_persist();
575
+ }
576
+ }
577
+
578
+ // Query L2 cache size (for deciding whether to use L2 persist)
579
+ int get_l2_cache_bytes() {
580
+ cudaDeviceProp prop;
581
+ cudaGetDeviceProperties(&prop, 0);
582
+ return prop.l2CacheSize;
583
+ }
584
+
585
+ // Query GPU name
586
+ void get_gpu_name(char* buf, int buflen) {
587
+ cudaDeviceProp prop;
588
+ cudaGetDeviceProperties(&prop, 0);
589
+ strncpy(buf, prop.name, buflen - 1);
590
+ buf[buflen - 1] = '\0';
591
+ }
592
+
593
+ // Synchronize (for timing from Python)
594
+ void cuda_sync() {
595
+ cudaDeviceSynchronize();
596
+ }
597
+
598
+ } // extern "C"