Entrit committed
Commit 7c251e6 · verified · 1 parent: 51e3123

fix: address codex review BLOCKERs and SHOULD-FIXes; update KNOWN_ISSUES

Files changed (4)
  1. KNOWN_ISSUES.md +40 -64
  2. README.md +13 -6
  3. trit_gemv.cu +11 -0
  4. trit_gemv_standalone.cu +66 -4
KNOWN_ISSUES.md CHANGED
@@ -1,64 +1,40 @@
- # Known issues — tritllm-kernel
-
- Surfaced during a pre-release code review. None affect the published paper benchmark numbers (those were obtained on shapes that respect the contract), but anyone using these kernels with new shapes, custom launch parameters, or as a drop-in inference primitive should be aware.
-
- ## BLOCKER: must respect or fix before relying on the kernel
-
- ### 1. Implicit one-warp-per-block launch contract
- **Where:** [`trit_gemv.cu:190-237` (`trit_gemv_uniform`)](trit_gemv.cu#L190), [`trit_gemv.cu:245-290` (`trit_gemv_variable`)](trit_gemv.cu#L245)
-
- The kernels use `lane = threadIdx.x` directly as the lane index and reduce with a full-warp mask `__shfl_down_sync(0xFFFFFFFF, ...)`. This is correct only when `blockDim.x == 32`.
-
- If launched with `blockDim.x > 32`:
- - Threads with `threadIdx.x >= 32` will compute `idx = lane*2+i` past the 64-element group bound and read out-of-bounds.
- - All threads with lane 0 across multiple warps race to write `y[row]`.
-
- **Fix in caller:** always launch with `blockDim.x == 32`. The host-side wrappers in `trit_gemv_standalone.cu` do this correctly. Direct callers from custom code must respect it.
-
- **Future fix in kernel:** add `assert(blockDim.x == WARP_SIZE)` at kernel entry, or rewrite to handle multi-warp blocks correctly.
-
- ### 2. `in_features` not a multiple of `GROUP_SIZE` is silently dropped
- **Where:** [`trit_gemv.cu:194`](trit_gemv.cu#L194), [`trit_gemv.cu:259`](trit_gemv.cu#L259)
-
- ```cpp
- int num_groups = in_features / GROUP_SIZE;
- ```
-
- Integer division truncates. If `in_features % 64 != 0`, the trailing partial group is silently skipped and that fragment of the dot product is missing from the output.
-
- **Fix in caller:** pad the input weight matrix (and activations) with zero rows to the next multiple of 64 before quantizing. The codec output already does this for Qwen, Llama, and Mistral architectures, all of which have `hidden_dim` divisible by 64.
-
- **Future fix in kernel:** add `assert(in_features % GROUP_SIZE == 0)` at kernel entry, or write a tail-handling path.
-
- ## SHOULD-FIX
-
- ### 3. C API performs no input validation
- **Where:** `trit_gemv_standalone.cu`, all `extern "C"` functions
-
- `trit_gemv_d2_fast`, `trit_gemv_d2_dp4a`, `trit_gemv_d3_native`, etc. accept null pointers, mismatched `rows`/`cols`/`num_groups`, and incorrectly packed buffers without complaint. Bad inputs become device faults or OOB reads.
-
- For a public ctypes-facing library this is a sharp edge. We will add a validation pass in a future revision; for now, callers must guarantee their arguments.
-
- ### 4. `get_gpu_name(char* buf, int buflen)` has no null/length guard
- **Where:** [`trit_gemv_standalone.cu:700`](trit_gemv_standalone.cu#L700)
-
- Calling with `buf == nullptr` or `buflen <= 0` is immediate UB on the host side. Trivial fix; pending.
-
- ### 5. CUDA error returns are not surfaced
- **Where:** several places in `trit_gemv_standalone.cu` where `set_l2_persist`, kernel launches, and helper calls drop `cudaError_t` returns
-
- If a kernel launch fails (e.g., bad shapes that pass the (missing) input validation), the failure is silent until the next `cudaDeviceSynchronize()` or `cudaGetLastError()`. The public functions return `void` and have no error-reporting path.
-
- Workaround: call `cuda_sync()` after each operation and check `cudaGetLastError()` from your wrapper.
-
- ### 6. Reduction wastes 31 lanes per group
- **Where:** [`trit_gemv.cu:223-232`](trit_gemv.cu#L223), [`trit_gemv.cu:279-286`](trit_gemv.cu#L279)
-
- After the warp reduction, only lane 0 multiplies by the group scale and accumulates into `row_acc`. The other 31 lanes idle for the scale/add path. This is correct, just leaves performance on the table relative to the deferred-reduction design used in `k_d3_hardened` (`trit_gemv_standalone.cu:493`).
-
- The headline 7.8× number is from the deferred-reduction path, so this only matters if you use the educational `trit_gemv_uniform` / `trit_gemv_variable` kernels directly.
-
- ## NIT
-
- ### 7. Multiple prototype kernels in production file
- `trit_gemv_standalone.cu` contains v9, v27, v28, v29, `k_d3_hardened`, plus the non-deferred kernels — a development history rather than a clean public surface. The `k_v29_pipeline` / `trit_pipeline` path was broken (passed nullptr for required arrays) and was removed in a commit prior to this release. The remaining prototypes (`k_v27`, `k_v29`, `k_v28`) are still wired through public C functions; they work, but the API surface is wider than needed. A future revision will trim to one canonical entry per depth.
 
+ # Known limitations — tritllm-kernel
+
+ Items previously raised in code review have been addressed:
+
+ - The implicit one-warp-per-block launch contract in the educational kernels
+   is now an early-return guard: kernels return without writing if launched
+   with `blockDim.x != 32` or `in_features % 64 != 0`.
+ - The dead `trit_pipeline` / `k_v29_pipeline` path was removed.
+ - The C API now validates pointers, dimensions, and the
+   `cols / GROUP_SIZE == num_groups` invariant, and reports the result via
+   `trit_gemv_get_last_error()`. CUDA launch errors are captured into the same
+   channel.
+ - `get_gpu_name(buf, buflen)` now refuses null pointers and `buflen <= 0`.
+
+ This document lists what remains.
+
+ ## Design tradeoff (not a bug)
+
+ ### Lane-0 scale-and-add in `trit_gemv_uniform` / `trit_gemv_variable`
+ **Where:** [`trit_gemv.cu:223-232, 279-286`](trit_gemv.cu#L223)
+
+ After the warp reduction in the educational kernels, only lane 0 multiplies
+ the group sum by the scale and accumulates into `row_acc`. The other 31 lanes
+ are idle for the scale/add path. This is correct, just slow — the published
+ paper benchmarks are produced by the deferred-reduction kernel
+ `k_d3_hardened` in `trit_gemv_standalone.cu`, which does not have this
+ limitation.
+
+ The `trit_gemv_uniform` / `trit_gemv_variable` kernels in `trit_gemv.cu` are
+ kept as a smaller, single-file reference implementation that is easier to read
+ and reason about. If you need maximum throughput, use the C API in
+ `trit_gemv_standalone.cu`.
+
+ ## Future cleanup
+
+ The C API in `trit_gemv_standalone.cu` exposes several historical kernel
+ variants (`v9`, `v27`, `v28`, `v29`, plus `k_d3_hardened` via
+ `trit_gemv_d3_int8_dp4a`). They all work, but the public API is wider than
+ needed. A future release will trim to one canonical entry point per depth
+ (`trit_gemv_d1`, `trit_gemv_d2`, `trit_gemv_d3`, `trit_gemv_d4`).
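
Reviewer note: a minimal sketch of the lane-0 pattern described above, with `partial` and `g` standing in for the kernel's local variables (illustrative only; the real loop lives in `trit_gemv.cu`):

```cuda
// Warp-reduce one 64-element group's partial dot product; the full-warp
// shuffle mask is why the launch contract requires blockDim.x == 32.
float v = partial;                          // this lane's share of the group sum
for (int off = 16; off > 0; off >>= 1)
    v += __shfl_down_sync(0xFFFFFFFF, v, off);
if (lane == 0)
    row_acc += v * scales[g];               // the other 31 lanes idle here
```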
README.md CHANGED
@@ -91,14 +91,21 @@ void get_gpu_name(char* buf, int buflen);
  void cuda_sync();
  ```

- ## Known issues
-
- Documented in [KNOWN_ISSUES.md](KNOWN_ISSUES.md). Summary:
-
- - **Launch contract is implicit, not enforced.** Kernels are correct only with `blockDim.x == 32`. There are no runtime asserts; the contract is guarded only by the host-side wrappers in this file. Direct callers must respect it.
- - **`in_features` not a multiple of 64 silently fails.** No assert. Pad your matrix.
- - **C API has no input validation.** Null pointers, wrong dimensions, and buffer-shape mismatches become device faults or OOB reads. This is a public-API hardening item we have not yet completed.
- - **CUDA error returns are not surfaced to the caller** in some helper paths. If a kernel launch fails, `cuda_sync()` will see it but the public functions return `void`.
+ ## Error reporting
+
+ All `extern "C"` entry points return `void`, so per-call status is delivered through a separate channel:
+
+ ```c
+ int trit_gemv_get_last_error();
+ ```
+
+ Returns `0` on success. Negative values are host-side argument-validation failures (`TRIT_ERR_NULL_PTR`, `TRIT_ERR_BAD_DIM`, `TRIT_ERR_BAD_GROUP`, `TRIT_ERR_BAD_BUFFER`). Positive values are `cudaError_t` codes captured from the most recent kernel launch.
+
+ The host-side validator in each entry point checks pointer non-null, positive dimensions, `cols % 64 == 0`, and `cols / 64 == num_groups`. If validation fails, no kernel is launched, the error is recorded, and the call returns silently.
+
+ ## Known limitations
+
+ The educational kernels in `trit_gemv.cu` use a lane-0 scale-and-add reduction that idles 31 lanes per group. This is a deliberate readability tradeoff — the headline 7.8× number is from the deferred-reduction `k_d3_hardened` kernel in `trit_gemv_standalone.cu`. See [KNOWN_ISSUES.md](KNOWN_ISSUES.md) for details and a planned API-cleanup item.

  ## Citation
 
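Reviewer note: a hedged caller sketch of the new error channel. The `trit_gemv_d3_native` argument order is inferred from its launch site in this diff, and the `d_*` device pointers are assumed to be set up elsewhere:

```c
/* Call an entry point, then poll the per-call status channel. */
trit_gemv_d3_native(d_pt, d_sc, d_x, d_y, cols, rows, /*depth=*/3);
int err = trit_gemv_get_last_error();
if (err < 0) {
    /* host-side validation failure: TRIT_ERR_NULL_PTR, TRIT_ERR_BAD_DIM, ... */
} else if (err > 0) {
    /* launch failure: err is a cudaError_t; see cudaGetErrorString((cudaError_t)err) */
}
```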
trit_gemv.cu CHANGED
@@ -178,6 +178,10 @@ __device__ __forceinline__ float trit_mac_d4(

  // Simplified version: uniform depth across all groups in a tensor
  // (variable-depth version below)
+ // Launch contract: blockDim.x == 32 (one warp per block), in_features % 64 == 0.
+ // The kernel uses lane = threadIdx.x and a full-warp shuffle mask, so larger
+ // blocks would alias the lane index and race on y[row]. Trailing partial groups
+ // are an unsupported shape, not silently dropped.
  __global__ void trit_gemv_uniform(
      const uint32_t* __restrict__ packed_trits, // packed trit data
      const float* __restrict__ scales,          // [num_groups] FP16 stored as float
@@ -187,6 +191,9 @@ __global__ void trit_gemv_uniform(
      int out_features,
      int depth // uniform depth 1-4
  ) {
+     if (blockDim.x != WARP_SIZE) return;  // launch contract: 1 warp/block
+     if (in_features % GROUP_SIZE) return; // launch contract: K mod 64 == 0
+
      int row = blockIdx.x; // one block per output row
      if (row >= out_features) return;

@@ -242,6 +249,7 @@ __global__ void trit_gemv_uniform(
   * Variable-depth version: each group can have a different depth.
   * Uses a depth map and offset table to handle mixed-depth tensors.
   */
+ // Launch contract: blockDim.x == 32 (one warp per block), in_features % 64 == 0.
  __global__ void trit_gemv_variable(
      const uint32_t* __restrict__ packed_trits,
      const float* __restrict__ scales,
@@ -252,6 +260,9 @@ __global__ void trit_gemv_variable(
      int in_features,
      int out_features
  ) {
+     if (blockDim.x != WARP_SIZE) return;
+     if (in_features % GROUP_SIZE) return;
+
      int row = blockIdx.x;
      if (row >= out_features) return;
 
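Reviewer note: the guards above turn a contract violation into a no-op instead of UB. A minimal conforming launch, sketched under the assumption that the elided middle parameters are the activation and output pointers (`d_x`, `d_y`):

```cuda
// One block per output row, exactly one warp per block: the only shape
// the early-return guards accept; in_features must be a multiple of 64.
dim3 grid(out_features);
dim3 block(32);  // WARP_SIZE; any other width now returns without writing
trit_gemv_uniform<<<grid, block>>>(d_packed, d_scales, d_x, d_y,
                                   in_features, out_features, depth);
```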
trit_gemv_standalone.cu CHANGED
@@ -350,8 +350,47 @@ static void clear_l2_persist() {
  // C API — callable from any language via dlopen/ctypes/FFI
  // ============================================================

+ // Error codes for the last_error reporting channel.
+ // 0 = success
+ // negative = host-side argument validation failure (no kernel was launched)
+ // positive = cudaError_t value from a kernel launch or runtime call
+ #define TRIT_OK              0
+ #define TRIT_ERR_NULL_PTR   -1
+ #define TRIT_ERR_BAD_DIM    -2
+ #define TRIT_ERR_BAD_GROUP  -3  // num_groups != cols / GROUP_SIZE
+ #define TRIT_ERR_BAD_BUFFER -4  // buf too small / invalid
+
+ // Last-error slot. Set by every public entrypoint; read via trit_gemv_get_last_error().
+ static int g_last_error = TRIT_OK;
+
+ // Host-side argument validation. Sets g_last_error and returns 1 (truthy) on
+ // failure, for use in `if (trit_validate_gemv(...)) return;`; returns 0 on success.
+ static inline int trit_validate_gemv(
+     const void* pt, const void* ws, const void* y,
+     int cols, int rows, int num_groups
+ ) {
+     if (!pt || !ws || !y) { g_last_error = TRIT_ERR_NULL_PTR; return 1; }
+     if (cols <= 0 || rows <= 0 || num_groups <= 0) { g_last_error = TRIT_ERR_BAD_DIM; return 1; }
+     if (cols % GROUP_SIZE != 0) { g_last_error = TRIT_ERR_BAD_DIM; return 1; }
+     if (cols / GROUP_SIZE != num_groups) { g_last_error = TRIT_ERR_BAD_GROUP; return 1; }
+     return 0;
+ }
+
+ // Capture cudaGetLastError() after a kernel launch into g_last_error.
+ static inline void trit_capture_launch_status() {
+     cudaError_t e = cudaGetLastError();
+     g_last_error = (e == cudaSuccess) ? TRIT_OK : (int)e;
+ }
+
  extern "C" {

+ // Returns the error code from the most recent public-API call.
+ // 0 means success. Negative codes are host-side validation failures
+ // (TRIT_ERR_*); positive codes are cudaError_t values from CUDA itself.
+ int trit_gemv_get_last_error() {
+     return g_last_error;
+ }
+
  // v27: d2 int4-packed + dp4a (champion for GPU)
  // pt: [rows * ng * 8] int32 (int4 packed weights)
  // ws: [rows * ng] float32 (weight scales)
@@ -364,11 +403,14 @@ void trit_gemv_d2_dp4a(
      float* y, int cols, int rows, int num_groups,
      int use_l2_persist
  ) {
+     if (trit_validate_gemv(pt, ws, y, cols, rows, num_groups)) return;
+     if (!xt || !xs) { g_last_error = TRIT_ERR_NULL_PTR; return; }
      if (use_l2_persist) {
          set_l2_persist((void*)pt, (size_t)rows * num_groups * 8 * sizeof(int32_t));
      }
      k_v27<<<(rows + V27_RPB - 1) / V27_RPB, V27_BS>>>(
          (const uint32_t*)pt, ws, (const uint32_t*)xt, xs, y, cols, rows, num_groups);
+     trit_capture_launch_status();
      if (use_l2_persist) {
          clear_l2_persist();
      }
@@ -384,8 +426,13 @@ void trit_gemv_d3_native(
      const float* x, float* y,
      int cols, int rows, int depth
  ) {
+     if (!pt || !sc || !x || !y) { g_last_error = TRIT_ERR_NULL_PTR; return; }
+     if (cols <= 0 || rows <= 0) { g_last_error = TRIT_ERR_BAD_DIM; return; }
+     if (cols % GROUP_SIZE != 0) { g_last_error = TRIT_ERR_BAD_DIM; return; }
+     if (depth < 1 || depth > 4) { g_last_error = TRIT_ERR_BAD_DIM; return; }
      k_v9<<<(rows + V9R - 1) / V9R, V9BS>>>(
          (const uint32_t*)pt, sc, x, y, cols, rows, depth);
+     trit_capture_launch_status();
  }

  // v29: d2 unsigned int4 + bias trick (no sign extension)
@@ -396,6 +443,8 @@ void trit_gemv_d2_bias(
      float* y, int cols, int rows, int num_groups,
      int use_l2_persist
  ) {
+     if (trit_validate_gemv(pt, ws, y, cols, rows, num_groups)) return;
+     if (!xt_e || !xt_o || !x_bias || !xs) { g_last_error = TRIT_ERR_NULL_PTR; return; }
      if (use_l2_persist) {
          set_l2_persist((void*)pt, (size_t)rows * num_groups * 8 * sizeof(int32_t));
      }
@@ -404,6 +453,7 @@ void trit_gemv_d2_bias(
          (const uint32_t*)xt_e, (const uint32_t*)xt_o,
          (const int*)x_bias, xs,
          y, cols, rows, num_groups);
+     trit_capture_launch_status();
      if (use_l2_persist) {
          clear_l2_persist();
      }
@@ -418,6 +468,8 @@ void trit_gemv_d2_fast(
      float* y, int cols, int rows, int num_groups,
      int use_l2_persist
  ) {
+     if (trit_validate_gemv(pt, ws, y, cols, rows, num_groups)) return;
+     if (!xt_e || !xt_o || !xs) { g_last_error = TRIT_ERR_NULL_PTR; return; }
      if (use_l2_persist) {
          set_l2_persist((void*)pt, (size_t)rows * num_groups * 8 * sizeof(int32_t));
      }
@@ -425,6 +477,7 @@ void trit_gemv_d2_fast(
          (const uint32_t*)pt, ws,
          (const uint32_t*)xt_e, (const uint32_t*)xt_o, xs,
          y, cols, rows, num_groups);
+     trit_capture_launch_status();
      if (use_l2_persist) {
          clear_l2_persist();
      }
@@ -542,11 +595,14 @@ void trit_gemv_d3_int8_dp4a(
      float* y, int cols, int rows, int num_groups,
      int use_l2_persist
  ) {
+     if (trit_validate_gemv(wt, ws, y, cols, rows, num_groups)) return;
+     if (!xt || !xs) { g_last_error = TRIT_ERR_NULL_PTR; return; }
      if (use_l2_persist) {
          set_l2_persist((void*)wt, (size_t)rows * num_groups * 16 * sizeof(int32_t));
      }
      k_d3_hardened<<<(rows + D3H_RPB - 1) / D3H_RPB, D3H_BS>>>(
          (const uint32_t*)wt, ws, (const uint32_t*)xt, xs, y, cols, rows, num_groups);
+     trit_capture_launch_status();
      if (use_l2_persist) {
          clear_l2_persist();
      }
@@ -559,17 +615,20 @@ void trit_gemv_pipeline_bench(
      float* y, int cols, int rows, int num_groups,
      int n_repeats, int use_l2_persist
  ) {
+     if (trit_validate_gemv(pt, ws, y, cols, rows, num_groups)) return;
+     if (!xt_e || !xt_o || !xs || n_repeats <= 0) { g_last_error = TRIT_ERR_NULL_PTR; return; }
      if (use_l2_persist) {
          set_l2_persist((void*)pt, (size_t)rows * num_groups * 8 * sizeof(int32_t));
      }
-     // Launch n_repeats sequential v28 kernels in the SAME stream
-     // This measures the pipeline benefit: back-to-back launches share L2
+     // Launch n_repeats sequential v28 kernels in the SAME stream — measures
+     // the L2-reuse benefit of back-to-back launches sharing weights.
      for (int i = 0; i < n_repeats; i++) {
          k_v28<<<(rows + V28_RPB - 1) / V28_RPB, V28_BS>>>(
              (const uint32_t*)pt, ws,
              (const uint32_t*)xt_e, (const uint32_t*)xt_o, xs,
              y, cols, rows, num_groups);
      }
+     trit_capture_launch_status();
      if (use_l2_persist) {
          clear_l2_persist();
      }
@@ -582,10 +641,13 @@ int get_l2_cache_bytes() {
      return prop.l2CacheSize;
  }

- // Query GPU name
+ // Query GPU name. `buf` must be a writable buffer of `buflen >= 1` bytes.
+ // On invalid input, the call is a no-op and g_last_error is set.
  void get_gpu_name(char* buf, int buflen) {
+     if (!buf || buflen <= 0) { g_last_error = TRIT_ERR_BAD_BUFFER; return; }
      cudaDeviceProp prop;
-     cudaGetDeviceProperties(&prop, 0);
+     cudaError_t e = cudaGetDeviceProperties(&prop, 0);
+     if (e != cudaSuccess) { g_last_error = (int)e; buf[0] = '\0'; return; }
      strncpy(buf, prop.name, buflen - 1);
      buf[buflen - 1] = '\0';
  }
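
Reviewer note: a quick negative test of the new validator. Values are hypothetical, `GROUP_SIZE` is 64 per the invariant above, and the `trit_gemv_d2_dp4a` argument order is inferred from its launch site:

```c
#include <assert.h>

/* A deliberately mismatched num_groups must trip the host-side validator:
   no kernel is launched and the error channel reports TRIT_ERR_BAD_GROUP (-3). */
int cols = 4096, rows = 1024;
trit_gemv_d2_dp4a(d_pt, d_ws, d_xt, d_xs, d_y,
                  cols, rows, /*num_groups=*/cols / 64 + 1, /*use_l2_persist=*/0);
assert(trit_gemv_get_last_error() == -3);  /* TRIT_ERR_BAD_GROUP */
```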