Known limitations — tritllm-codec
Items previously raised in code review have been addressed in the current release. This document lists only the deliberate design tradeoffs that the codec review surfaced; none of them are bugs.
Design tradeoffs
Scale codebook upper bound = max(group_abs_maxes)
Where: quantize_model_v2.py, trit_quantize_scales, log_max = np.max(...)
The 27-entry log-spaced scale codebook spans [log_min, log_max] where
log_max is taken to be the maximum group magnitude in the matrix. This is
intentional — an earlier 99.9th-percentile bound (commit prior to 0c16d24)
clipped large-scale outlier groups and lost their resolution.
The downside: a single extreme-scale outlier group can stretch the log-spaced range and reduce scale resolution for the bulk of normal-magnitude groups in the same matrix.
We have not seen this cause measurable quality regressions on Qwen2.5, Llama-3.1, or Mistral-7B. If you observe unexpectedly high PPL on a new model family with heavy-tailed scale distributions, this is the first place to look.
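To make the resolution effect concrete, here is a minimal sketch of a 27-entry log-spaced codebook with and without one outlier group. It is not the code in trit_quantize_scales: the helper names are hypothetical, the magnitudes are made up, and taking log_min as the log of the smallest group magnitude is an assumption.

```python
import math

def log_codebook(group_abs_maxes, n=27):
    """Return n log-spaced scales spanning [min, max] of the group magnitudes.

    Assumption: log_min is the log of the smallest group magnitude.
    """
    log_min = math.log(min(group_abs_maxes))
    log_max = math.log(max(group_abs_maxes))
    step = (log_max - log_min) / (n - 1)
    return [math.exp(log_min + i * step) for i in range(n)]

def adjacent_ratio(codebook):
    """Multiplicative gap between neighbouring codebook entries."""
    return codebook[1] / codebook[0]

# Made-up data: 20 "normal" groups between 0.01 and 0.029, plus one outlier.
bulk = [0.01 * (1 + 0.1 * i) for i in range(20)]
with_outlier = bulk + [10.0]

cb_bulk = log_codebook(bulk)
cb_outlier = log_codebook(with_outlier)

ratio_bulk = adjacent_ratio(cb_bulk)
ratio_outlier = adjacent_ratio(cb_outlier)

# How many codebook entries still fall inside the bulk range?
entries_for_bulk = sum(1 for s in cb_outlier if s <= max(bulk))

print(f"gap without outlier: {ratio_bulk:.3f}x")
print(f"gap with outlier:    {ratio_outlier:.3f}x")
print(f"entries covering the bulk: {entries_for_bulk} of {len(cb_outlier)}")
```

With these made-up numbers the gap between adjacent entries grows from about 4% to about 30%, and only 5 of the 27 entries remain inside the range occupied by the normal-magnitude groups.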
We did not change this in the current release because changing it would alter
the bit-exact output of the codec and invalidate published paper numbers; a
future v3 may replace np.max with a soft-cap (e.g. min(max, 4 * p99)) that
is robust to single extreme outliers without giving up large-scale fidelity.
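The soft-cap idea mentioned above could look roughly like the following. This is a sketch of the min(max, 4 * p99) expression only, not a committed v3 design: the function names, the factor parameter, and the linear-interpolation percentile are all assumptions.

```python
import math

def percentile(values, q):
    """Linearly interpolated percentile (q in [0, 100]) of a non-empty list."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0
    lo = math.floor(pos)
    hi = math.ceil(pos)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

def soft_cap_upper_bound(group_abs_maxes, factor=4.0):
    """Hypothetical bound: the true max, unless it exceeds factor * p99."""
    p99 = percentile(group_abs_maxes, 99.0)
    return min(max(group_abs_maxes), factor * p99)

# One extreme outlier: the bound is capped at 4 * p99.
capped = soft_cap_upper_bound([1.0] * 999 + [1000.0])

# A moderately large group: the true max is kept, as with np.max today.
kept = soft_cap_upper_bound([1.0] * 999 + [3.0])
```

In the first case the bound drops from 1000.0 to 4.0; in the second the maximum of 3.0 is below the cap and is preserved. Any group above the cap would need to be clamped to the top codebook entry, which is the fidelity cost such a change would have to be measured against.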
Scale candidate set is fixed at 4 order statistics
Where: quantize_model_v2.py, compute_best_scale_4cand
The MSE-best scale is selected from four fixed order statistics — indices
[gs-6, gs-4, gs-2, gs-1] of sorted |w|. This is a deliberate compute /
quality tradeoff (≈50× speedup over an exhaustive sweep, <1% PPL gap measured
on Qwen2.5-7B), not a bug. The function name and docstring now reflect this.
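The selection can be sketched as follows. This is an illustration, not the body of compute_best_scale_4cand: it assumes gs is the group size, that |w| is sorted in ascending order, and that each weight is rounded to an integer multiple of scale / levels and clipped to [-levels, levels]. The quantizer, the levels parameter, and both function names are hypothetical.

```python
def quantize_mse(weights, scale, levels=1):
    """MSE after rounding weights to multiples of scale / levels.

    Assumption: symmetric rounding quantizer clipped to [-levels, levels].
    """
    if scale == 0.0:
        # Every weight quantizes to zero.
        return sum(w * w for w in weights) / len(weights)
    step = scale / levels
    err = 0.0
    for w in weights:
        q = max(-levels, min(levels, round(w / step)))
        err += (w - q * step) ** 2
    return err / len(weights)

def best_scale_4cand(weights, levels=1):
    """Pick the MSE-best scale among four order statistics of sorted |w|."""
    gs = len(weights)
    if gs < 6:
        raise ValueError("group must contain at least 6 weights")
    mags = sorted(abs(w) for w in weights)
    candidates = [mags[i] for i in (gs - 6, gs - 4, gs - 2, gs - 1)]
    return min(candidates, key=lambda s: quantize_mse(weights, s, levels))

# Made-up group of 16 weights with one mild outlier at 1.5.
group = [0.0] * 8 + [1.0] * 7 + [1.5]
best = best_scale_4cand(group)
```

For this group the candidates are 1.0, 1.0, 1.0, and 1.5, and the sketch selects 1.0: clipping the single 1.5 costs less error than stretching the scale for the seven weights at 1.0. An exhaustive sweep would evaluate many more candidate scales per group, which is where the speedup quoted above comes from.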