Entrit committed
Commit 7c251e6 · verified · 1 parent: 51e3123

fix: address codex review BLOCKERs and SHOULD-FIXes; update KNOWN_ISSUES

Files changed (4)
  1. KNOWN_ISSUES.md +40 -64
  2. README.md +13 -6
  3. trit_gemv.cu +11 -0
  4. trit_gemv_standalone.cu +66 -4
KNOWN_ISSUES.md CHANGED
@@ -1,64 +1,40 @@
- # Known issues — tritllm-kernel
-
- Surfaced during a pre-release code review. None affect the published paper benchmark numbers (those were obtained on shapes that respect the contract), but anyone using these kernels with new shapes, custom launch parameters, or as a drop-in inference primitive should be aware.
-
- ## BLOCKER: must respect or fix before relying on the kernel
-
- ### 1. Implicit one-warp-per-block launch contract
- **Where:** [`trit_gemv.cu:190-237` (`trit_gemv_uniform`)](trit_gemv.cu#L190), [`trit_gemv.cu:245-290` (`trit_gemv_variable`)](trit_gemv.cu#L245)
-
- The kernels use `lane = threadIdx.x` directly as the lane index and reduce with a full-warp mask `__shfl_down_sync(0xFFFFFFFF, ...)`. This is correct only when `blockDim.x == 32`.
-
- If launched with `blockDim.x > 32`:
- - Threads with `threadIdx.x >= 32` will compute `idx = lane*2+i` past the 64-element group bound and read out-of-bounds.
- - All threads with lane 0 across multiple warps race to write `y[row]`.
-
- **Fix in caller:** always launch with `blockDim.x == 32`. The host-side wrappers in `trit_gemv_standalone.cu` do this correctly. Direct callers from custom code must respect it.
-
- **Future fix in kernel:** add `assert(blockDim.x == WARP_SIZE)` at kernel entry, or rewrite to handle multi-warp blocks correctly.
-
- ### 2. `in_features` not a multiple of `GROUP_SIZE` is silently dropped
- **Where:** [`trit_gemv.cu:194`](trit_gemv.cu#L194), [`trit_gemv.cu:259`](trit_gemv.cu#L259)
-
- ```cpp
- int num_groups = in_features / GROUP_SIZE;
- ```
-
- Integer division truncates. If `in_features % 64 != 0`, the trailing partial group is silently skipped and that fragment of the dot product is missing from the output.
-
- **Fix in caller:** pad the input weight matrix (and activations) with zero rows to the next multiple of 64 before quantizing. The codec output already does this for Qwen, Llama, and Mistral architectures, all of which have `hidden_dim` divisible by 64.
-
- **Future fix in kernel:** add `assert(in_features % GROUP_SIZE == 0)` at kernel entry, or write a tail-handling path.
-
- ## SHOULD-FIX
-
- ### 3. C API performs no input validation
- **Where:** `trit_gemv_standalone.cu`, all `extern "C"` functions
-
- `trit_gemv_d2_fast`, `trit_gemv_d2_dp4a`, `trit_gemv_d3_native`, etc. accept null pointers, mismatched `rows`/`cols`/`num_groups`, and incorrectly packed buffers without complaint. Bad inputs become device faults or OOB reads.
-
- For a public ctypes-facing library this is a sharp edge. We will add a validation pass in a future revision; for now, callers must guarantee their arguments.
-
- ### 4. `get_gpu_name(char* buf, int buflen)` has no null/length guard
- **Where:** [`trit_gemv_standalone.cu:700`](trit_gemv_standalone.cu#L700)
-
- Calling with `buf == nullptr` or `buflen <= 0` is immediate UB on the host side. Trivial fix; pending.
-
- ### 5. CUDA error returns are not surfaced
- **Where:** several places in `trit_gemv_standalone.cu` where `set_l2_persist`, kernel launches, and helper calls drop `cudaError_t` returns
-
- If a kernel launch fails (e.g., bad shapes that pass the (missing) input validation), the failure is silent until the next `cudaDeviceSynchronize()` or `cudaGetLastError()`. The public functions return `void` and have no error-reporting path.
-
- Workaround: call `cuda_sync()` after each operation and check `cudaGetLastError()` from your wrapper.
-
- ### 6. Reduction wastes 31 lanes per group
- **Where:** [`trit_gemv.cu:223-232`](trit_gemv.cu#L223), [`trit_gemv.cu:279-286`](trit_gemv.cu#L279)
-
- After the warp reduction, only lane 0 multiplies by the group scale and accumulates into `row_acc`. The other 31 lanes idle for the scale/add path. This is correct, just leaves performance on the table relative to the deferred-reduction design used in `k_d3_hardened` (`trit_gemv_standalone.cu:493`).
-
- The headline 7.8× number is from the deferred-reduction path, so this only matters if you use the educational `trit_gemv_uniform` / `trit_gemv_variable` kernels directly.
-
- ## NIT
-
- ### 7. Multiple prototype kernels in production file
- `trit_gemv_standalone.cu` contains v9, v27, v28, v29, `k_d3_hardened`, plus the non-deferred kernels — a development history rather than a clean public surface. The `k_v29_pipeline` / `trit_pipeline` path was broken (passed nullptr for required arrays) and was removed in a commit prior to this release. The remaining prototypes (`k_v27`, `k_v29`, `k_v28`) are still wired through public C functions; they work, but the API surface is wider than needed. A future revision will trim to one canonical entry per depth.
 
+ # Known limitations — tritllm-kernel
+
+ Items previously raised in code review have been addressed:
+
+ - The implicit one-warp-per-block launch contract in the educational kernels
+   is now an early-return guard: kernels return without writing if launched
+   with `blockDim.x != 32` or `in_features % 64 != 0`.
+ - The dead `trit_pipeline` / `k_v29_pipeline` path was removed.
+ - The C API now validates pointers, dimensions, and the
+   `cols / GROUP_SIZE == num_groups` invariant, and reports the result via
+   `trit_gemv_get_last_error()`. CUDA launch errors are captured into the same
+   channel.
+ - `get_gpu_name(buf, buflen)` now refuses null pointers and `buflen <= 0`.
+
+ This document lists what remains.
+
+ ## Design tradeoff (not a bug)
+
+ ### Lane-0 scale-and-add in `trit_gemv_uniform` / `trit_gemv_variable`
+ **Where:** [`trit_gemv.cu:223-232, 279-286`](trit_gemv.cu#L223)
+
+ After the warp reduction in the educational kernels, only lane 0 multiplies
+ the group sum by the scale and accumulates into `row_acc`. The other 31 lanes
+ are idle for the scale/add path. This is correct, just slow — the published
+ paper benchmarks are produced by the deferred-reduction kernel
+ `k_d3_hardened` in `trit_gemv_standalone.cu`, which does not have this
+ limitation.
+
+ The `trit_gemv_uniform` / `trit_gemv_variable` kernels in `trit_gemv.cu` are
+ kept as a smaller, single-file reference implementation that is easier to read
+ and reason about. If you need maximum throughput, use the C API in
+ `trit_gemv_standalone.cu`.
+
+ ## Future cleanup
+
+ The C API in `trit_gemv_standalone.cu` exposes several historical kernel
+ variants (`v9`, `v27`, `v28`, `v29`, plus `k_d3_hardened` via
+ `trit_gemv_d3_int8_dp4a`). They all work, but the public API is wider than
+ needed. A future release will trim to one canonical entry point per depth
+ (`trit_gemv_d1`, `trit_gemv_d2`, `trit_gemv_d3`, `trit_gemv_d4`).
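
Reviewer note: a minimal sketch of the lane-0 pattern described above, with `partial` and `g` standing in for the kernel's local variables (illustrative only; the real loop lives in `trit_gemv.cu`):

```cuda
// Warp-reduce one 64-element group's partial dot product; the full-warp
// shuffle mask is why the launch contract requires blockDim.x == 32.
float v = partial;                          // this lane's share of the group sum
for (int off = 16; off > 0; off >>= 1)
    v += __shfl_down_sync(0xFFFFFFFF, v, off);
if (lane == 0)
    row_acc += v * scales[g];               // the other 31 lanes idle here
```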
README.md CHANGED
@@ -91,14 +91,21 @@ void get_gpu_name(char* buf, int buflen);
  void cuda_sync();
  ```

- ## Known issues
-
- Documented in [KNOWN_ISSUES.md](KNOWN_ISSUES.md). Summary:
-
- - **Launch contract is implicit, not enforced.** Kernels are correct only with `blockDim.x == 32`. There are no runtime asserts; the contract is guarded only by the host-side wrappers in this file. Direct callers must respect it.
- - **`in_features` not a multiple of 64 silently fails.** No assert. Pad your matrix.
- - **C API has no input validation.** Null pointers, wrong dimensions, and buffer-shape mismatches become device faults or OOB reads. This is a public-API hardening item we have not yet completed.
- - **CUDA error returns are not surfaced to the caller** in some helper paths. If a kernel launch fails, `cuda_sync()` will see it but the public functions return `void`.
+ ## Error reporting
+
+ All `extern "C"` entry points return `void`, so per-call status is delivered through a separate channel:
+
+ ```c
+ int trit_gemv_get_last_error();
+ ```
+
+ Returns `0` on success. Negative values are host-side argument-validation failures (`TRIT_ERR_NULL_PTR`, `TRIT_ERR_BAD_DIM`, `TRIT_ERR_BAD_GROUP`, `TRIT_ERR_BAD_BUFFER`). Positive values are `cudaError_t` codes captured from the most recent kernel launch.
+
+ The host-side validator in each entry point checks pointer non-null, positive dimensions, `cols % 64 == 0`, and `cols / 64 == num_groups`. If validation fails, no kernel is launched, the error is recorded, and the call returns silently.
+
+ ## Known limitations
+
+ The educational kernels in `trit_gemv.cu` use a lane-0 scale-and-add reduction that idles 31 lanes per group. This is a deliberate readability tradeoff — the headline 7.8× number is from the deferred-reduction `k_d3_hardened` kernel in `trit_gemv_standalone.cu`. See [KNOWN_ISSUES.md](KNOWN_ISSUES.md) for details and a planned API-cleanup item.

  ## Citation
 
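Reviewer note: a hedged caller sketch of the new error channel. The `trit_gemv_d3_native` argument order is inferred from its launch site in this diff, and the `d_*` device pointers are assumed to be set up elsewhere:

```c
/* Call an entry point, then poll the per-call status channel. */
trit_gemv_d3_native(d_pt, d_sc, d_x, d_y, cols, rows, /*depth=*/3);
int err = trit_gemv_get_last_error();
if (err < 0) {
    /* host-side validation failure: TRIT_ERR_NULL_PTR, TRIT_ERR_BAD_DIM, ... */
} else if (err > 0) {
    /* launch failure: err is a cudaError_t; see cudaGetErrorString((cudaError_t)err) */
}
```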
trit_gemv.cu CHANGED
@@ -178,6 +178,10 @@ __device__ __forceinline__ float trit_mac_d4(

  // Simplified version: uniform depth across all groups in a tensor
  // (variable-depth version below)
+ // Launch contract: blockDim.x == 32 (one warp per block), in_features % 64 == 0.
+ // The kernel uses lane = threadIdx.x and a full-warp shuffle mask, so larger
+ // blocks would alias the lane index and race on y[row]. Trailing partial groups
+ // are an unsupported shape, not silently dropped.
  __global__ void trit_gemv_uniform(
      const uint32_t* __restrict__ packed_trits, // packed trit data
      const float* __restrict__ scales,          // [num_groups] FP16 stored as float
@@ -187,6 +191,9 @@ __global__ void trit_gemv_uniform(
      int out_features,
      int depth // uniform depth 1-4
  ) {
+     if (blockDim.x != WARP_SIZE) return;  // launch contract: 1 warp/block
+     if (in_features % GROUP_SIZE) return; // launch contract: K mod 64 == 0
+
      int row = blockIdx.x; // one block per output row
      if (row >= out_features) return;

@@ -242,6 +249,7 @@ __global__ void trit_gemv_uniform(
   * Variable-depth version: each group can have a different depth.
   * Uses a depth map and offset table to handle mixed-depth tensors.
   */
+ // Launch contract: blockDim.x == 32 (one warp per block), in_features % 64 == 0.
  __global__ void trit_gemv_variable(
      const uint32_t* __restrict__ packed_trits,
      const float* __restrict__ scales,
@@ -252,6 +260,9 @@ __global__ void trit_gemv_variable(
      int in_features,
      int out_features
  ) {
+     if (blockDim.x != WARP_SIZE) return;
+     if (in_features % GROUP_SIZE) return;
+
      int row = blockIdx.x;
      if (row >= out_features) return;
 
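Reviewer note: the guards above turn a contract violation into a no-op instead of UB. A minimal conforming launch, sketched under the assumption that the elided middle parameters are the activation and output pointers (`d_x`, `d_y`):

```cuda
// One block per output row, exactly one warp per block: the only shape
// the early-return guards accept; in_features must be a multiple of 64.
dim3 grid(out_features);
dim3 block(32);  // WARP_SIZE; any other width now returns without writing
trit_gemv_uniform<<<grid, block>>>(d_packed, d_scales, d_x, d_y,
                                   in_features, out_features, depth);
```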
trit_gemv_standalone.cu CHANGED
@@ -350,8 +350,47 @@ static void clear_l2_persist() {
  // C API — callable from any language via dlopen/ctypes/FFI
  // ============================================================

+ // Error codes for the last_error reporting channel.
+ // 0 = success
+ // negative = host-side argument validation failure (no kernel was launched)
+ // positive = cudaError_t value from a kernel launch or runtime call
+ #define TRIT_OK              0
+ #define TRIT_ERR_NULL_PTR   -1
+ #define TRIT_ERR_BAD_DIM    -2
+ #define TRIT_ERR_BAD_GROUP  -3  // num_groups != cols / GROUP_SIZE
+ #define TRIT_ERR_BAD_BUFFER -4  // buf too small / invalid
+
+ // Last-error slot. Set by every public entrypoint; read via trit_gemv_get_last_error().
+ static int g_last_error = TRIT_OK;
+
+ // Host-side argument validation. Sets g_last_error and returns 1 (truthy) on
+ // failure, for use in `if (trit_validate_gemv(...)) return;`; returns 0 on success.
+ static inline int trit_validate_gemv(
+     const void* pt, const void* ws, const void* y,
+     int cols, int rows, int num_groups
+ ) {
+     if (!pt || !ws || !y) { g_last_error = TRIT_ERR_NULL_PTR; return 1; }
+     if (cols <= 0 || rows <= 0 || num_groups <= 0) { g_last_error = TRIT_ERR_BAD_DIM; return 1; }
+     if (cols % GROUP_SIZE != 0) { g_last_error = TRIT_ERR_BAD_DIM; return 1; }
+     if (cols / GROUP_SIZE != num_groups) { g_last_error = TRIT_ERR_BAD_GROUP; return 1; }
+     return 0;
+ }
+
+ // Capture cudaGetLastError() after a kernel launch into g_last_error.
+ static inline void trit_capture_launch_status() {
+     cudaError_t e = cudaGetLastError();
+     g_last_error = (e == cudaSuccess) ? TRIT_OK : (int)e;
+ }
+
  extern "C" {

+ // Returns the error code from the most recent public-API call.
+ // 0 means success. Negative codes are host-side validation failures
+ // (TRIT_ERR_*); positive codes are cudaError_t values from CUDA itself.
+ int trit_gemv_get_last_error() {
+     return g_last_error;
+ }
+
  // v27: d2 int4-packed + dp4a (champion for GPU)
  // pt: [rows * ng * 8] int32 (int4 packed weights)
  // ws: [rows * ng] float32 (weight scales)
@@ -364,11 +403,14 @@ void trit_gemv_d2_dp4a(
      float* y, int cols, int rows, int num_groups,
      int use_l2_persist
  ) {
+     if (trit_validate_gemv(pt, ws, y, cols, rows, num_groups)) return;
+     if (!xt || !xs) { g_last_error = TRIT_ERR_NULL_PTR; return; }
      if (use_l2_persist) {
          set_l2_persist((void*)pt, (size_t)rows * num_groups * 8 * sizeof(int32_t));
      }
      k_v27<<<(rows + V27_RPB - 1) / V27_RPB, V27_BS>>>(
          (const uint32_t*)pt, ws, (const uint32_t*)xt, xs, y, cols, rows, num_groups);
+     trit_capture_launch_status();
      if (use_l2_persist) {
          clear_l2_persist();
      }
@@ -384,8 +426,13 @@ void trit_gemv_d3_native(
      const float* x, float* y,
      int cols, int rows, int depth
  ) {
+     if (!pt || !sc || !x || !y) { g_last_error = TRIT_ERR_NULL_PTR; return; }
+     if (cols <= 0 || rows <= 0) { g_last_error = TRIT_ERR_BAD_DIM; return; }
+     if (cols % GROUP_SIZE != 0) { g_last_error = TRIT_ERR_BAD_DIM; return; }
+     if (depth < 1 || depth > 4) { g_last_error = TRIT_ERR_BAD_DIM; return; }
      k_v9<<<(rows + V9R - 1) / V9R, V9BS>>>(
          (const uint32_t*)pt, sc, x, y, cols, rows, depth);
+     trit_capture_launch_status();
  }

  // v29: d2 unsigned int4 + bias trick (no sign extension)
@@ -396,6 +443,8 @@ void trit_gemv_d2_bias(
      float* y, int cols, int rows, int num_groups,
      int use_l2_persist
  ) {
+     if (trit_validate_gemv(pt, ws, y, cols, rows, num_groups)) return;
+     if (!xt_e || !xt_o || !x_bias || !xs) { g_last_error = TRIT_ERR_NULL_PTR; return; }
      if (use_l2_persist) {
          set_l2_persist((void*)pt, (size_t)rows * num_groups * 8 * sizeof(int32_t));
      }
@@ -404,6 +453,7 @@ void trit_gemv_d2_bias(
          (const uint32_t*)xt_e, (const uint32_t*)xt_o,
          (const int*)x_bias, xs,
          y, cols, rows, num_groups);
+     trit_capture_launch_status();
      if (use_l2_persist) {
          clear_l2_persist();
      }
@@ -418,6 +468,8 @@ void trit_gemv_d2_fast(
      float* y, int cols, int rows, int num_groups,
      int use_l2_persist
  ) {
+     if (trit_validate_gemv(pt, ws, y, cols, rows, num_groups)) return;
+     if (!xt_e || !xt_o || !xs) { g_last_error = TRIT_ERR_NULL_PTR; return; }
      if (use_l2_persist) {
          set_l2_persist((void*)pt, (size_t)rows * num_groups * 8 * sizeof(int32_t));
      }
@@ -425,6 +477,7 @@ void trit_gemv_d2_fast(
          (const uint32_t*)pt, ws,
          (const uint32_t*)xt_e, (const uint32_t*)xt_o, xs,
          y, cols, rows, num_groups);
+     trit_capture_launch_status();
      if (use_l2_persist) {
          clear_l2_persist();
      }
@@ -542,11 +595,14 @@ void trit_gemv_d3_int8_dp4a(
      float* y, int cols, int rows, int num_groups,
      int use_l2_persist
  ) {
+     if (trit_validate_gemv(wt, ws, y, cols, rows, num_groups)) return;
+     if (!xt || !xs) { g_last_error = TRIT_ERR_NULL_PTR; return; }
      if (use_l2_persist) {
          set_l2_persist((void*)wt, (size_t)rows * num_groups * 16 * sizeof(int32_t));
      }
      k_d3_hardened<<<(rows + D3H_RPB - 1) / D3H_RPB, D3H_BS>>>(
          (const uint32_t*)wt, ws, (const uint32_t*)xt, xs, y, cols, rows, num_groups);
+     trit_capture_launch_status();
      if (use_l2_persist) {
          clear_l2_persist();
      }
@@ -559,17 +615,20 @@ void trit_gemv_pipeline_bench(
      float* y, int cols, int rows, int num_groups,
      int n_repeats, int use_l2_persist
  ) {
+     if (trit_validate_gemv(pt, ws, y, cols, rows, num_groups)) return;
+     if (!xt_e || !xt_o || !xs || n_repeats <= 0) { g_last_error = TRIT_ERR_NULL_PTR; return; }
      if (use_l2_persist) {
          set_l2_persist((void*)pt, (size_t)rows * num_groups * 8 * sizeof(int32_t));
      }
-     // Launch n_repeats sequential v28 kernels in the SAME stream
-     // This measures the pipeline benefit: back-to-back launches share L2
+     // Launch n_repeats sequential v28 kernels in the SAME stream — measures
+     // the L2-reuse benefit of back-to-back launches sharing weights.
      for (int i = 0; i < n_repeats; i++) {
          k_v28<<<(rows + V28_RPB - 1) / V28_RPB, V28_BS>>>(
              (const uint32_t*)pt, ws,
              (const uint32_t*)xt_e, (const uint32_t*)xt_o, xs,
              y, cols, rows, num_groups);
      }
+     trit_capture_launch_status();
      if (use_l2_persist) {
          clear_l2_persist();
      }
@@ -582,10 +641,13 @@ int get_l2_cache_bytes() {
      return prop.l2CacheSize;
  }

- // Query GPU name
+ // Query GPU name. `buf` must be a writable buffer of `buflen >= 1` bytes.
+ // On invalid input, the call is a no-op and g_last_error is set.
  void get_gpu_name(char* buf, int buflen) {
+     if (!buf || buflen <= 0) { g_last_error = TRIT_ERR_BAD_BUFFER; return; }
      cudaDeviceProp prop;
-     cudaGetDeviceProperties(&prop, 0);
+     cudaError_t e = cudaGetDeviceProperties(&prop, 0);
+     if (e != cudaSuccess) { g_last_error = (int)e; buf[0] = '\0'; return; }
      strncpy(buf, prop.name, buflen - 1);
      buf[buflen - 1] = '\0';
  }
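
Reviewer note: a quick negative test of the new validator. Values are hypothetical, `GROUP_SIZE` is 64 per the invariant above, and the `trit_gemv_d2_dp4a` argument order is inferred from its launch site:

```c
#include <assert.h>

/* A deliberately mismatched num_groups must trip the host-side validator:
   no kernel is launched and the error channel reports TRIT_ERR_BAD_GROUP (-3). */
int cols = 4096, rows = 1024;
trit_gemv_d2_dp4a(d_pt, d_ws, d_xt, d_xs, d_y,
                  cols, rows, /*num_groups=*/cols / 64 + 1, /*use_l2_persist=*/0);
assert(trit_gemv_get_last_error() == -3);  /* TRIT_ERR_BAD_GROUP */
```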