ASHQ1 — Autonomous Selective Hybrid Quantization

⚠️ Experimental. ASHQ1 is a personal research project that I will be refining over time. Use at your own risk. Results may vary between architectures and fine-tunes. Feedback and contributions welcome.

ASHQ1 is a post-training quantization method for GGUF models that uses an imatrix-driven priority queue to maximise theoretical quality per megabyte. Instead of uniform bit-depth or heuristic layer-blocking, it treats tied tensor groups as monolithic entities and greedily upgrades them by strict mathematical utility — the product of summed importance and theoretical MSE reduction, divided by size cost.

Results

Target	Model	Arch	MTP	Actual	PPL (ctx=1024)	vs Q6_K
5500	Qwable-9B	Qwen3.5	—	5,503 MiB	7.4334	—
5700	Qwythos-9B	Qwen3.5	yes	5,713 MiB	7.5411	−0.047

At the 5700 MiB target, ASHQ1 scores 0.047 PPL lower than uniform Q6_K while being 19% smaller (5,713 vs 7,076 MiB).

Real-World Validation

ASHQ1's theoretical quality advantage transfers to real agentic coding. We tested Ornith-1.0-9B ASHQ1 6500 (6.4 GB, 33% smaller than Q8_0) as the backend for Pi, an autonomous coding agent that uses llama.cpp as its LLM backend.

At temperature 0.6, the model was tasked with building a complete personal finance dashboard as a single HTML file — Canvas charts, budget tracker, dark mode, transaction filtering, upcoming bills, responsive layout. The agent worked autonomously: planned the architecture, wrote the entire ~1100-line file, caught its own bugs (date.now → date.getTime), fixed dark mode logic, ran Node.js validation, and iterated until all checks passed. The final finance-dashboard.html was a polished, production-quality single-page app — no external dependencies, no hallucinations, no broken features.

This is not cherry-picked: it was the first test we ran. The result agrees with the benchmarks: ASHQ1 preserves enough quality that a 6.4 GB quant can drive an autonomous coding agent to build a complete, working application from scratch.

How It Works

1. Floor Assignment

Every tensor starts at a minimum tier by class. SSM params and norms lock at F16. Embeddings start at Q5_K. Weight matrices start at Q4_K (or IQ4_XS for QAT models). MTP heads get Q5_K.
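
The floor rules above can be sketched as a small lookup. This is a minimal illustration, not the project's code: the class names (`ssm`, `norm`, `embedding`, `weight`, `mtp`) and the `floor_for` helper are hypothetical, and the real tables (TENSOR_CLASS, CLASS_HARD_FLOORS in `constants.py`) may be keyed differently.

```python
# Hypothetical class names; the real TENSOR_CLASS / CLASS_HARD_FLOORS
# tables in constants.py may use different keys.
FLOORS = {
    "ssm": "F16",        # SSM params lock at F16
    "norm": "F16",       # norms lock at F16
    "embedding": "Q5_K",
    "weight": "Q4_K",
    "mtp": "Q5_K",
}


def floor_for(tensor_class: str, is_qat: bool = False) -> str:
    """Return the starting (minimum) tier for a tensor class."""
    if tensor_class == "weight" and is_qat:
        return "IQ4_XS"  # QAT models use IQ4_XS as the weight-matrix floor
    return FLOORS[tensor_class]
```

For example, `floor_for("weight")` gives `Q4_K`, while `floor_for("weight", is_qat=True)` gives `IQ4_XS`.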

2. Depth-Weighted Importance

The imatrix in_sum2 array accumulates squared input activations per column, which serves as a proxy for how much each weight contributes to the output variance. This raw importance is then scaled by layer position:

Layer 0: 2.0× (embedding proximity)
Last 5 layers: 1.5× (output proximity)
Middle: 1.0×
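
As a rough sketch, the multipliers above map to a function of the layer index. The name `depth_weight` is hypothetical, and two details are assumptions: layers are taken to be 0-indexed, and the layer-0 rule takes precedence when a model has five or fewer layers.

```python
def depth_weight(layer: int, n_layers: int) -> float:
    """Return the importance multiplier for a 0-indexed layer."""
    if layer == 0:
        return 2.0  # embedding proximity
    if layer >= n_layers - 5:
        return 1.5  # one of the last 5 layers: output proximity
    return 1.0      # middle of the stack
```

With `n_layers = 32`, layers 27 through 31 receive 1.5× and layers 1 through 26 receive 1.0×.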

3. Tied Group Detection

Tensors whose in_sum2 arrays are numerically identical (compared with np.allclose) are treated as tied (shared weights). They form a single upgrade group: all members upgrade together as one unit. Group importance is the sum of its members' importance, which prevents large groups from being starved of budget.
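
A minimal sketch of the grouping step follows. It assumes in_sum2 is available as a mapping from tensor name to a sequence of floats. To stay dependency-free it swaps in `math.isclose` per element as a stand-in for the `np.allclose` check that `imatrix_reader.py` uses, so the tolerance semantics differ slightly. All function names here are hypothetical.

```python
import math


def arrays_match(a, b, rel_tol=1e-5, abs_tol=1e-8):
    """Element-wise closeness check (stdlib stand-in for np.allclose)."""
    return len(a) == len(b) and all(
        math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol)
        for x, y in zip(a, b)
    )


def build_tied_groups(in_sum2):
    """Group tensor names whose in_sum2 arrays match.

    in_sum2 maps tensor name -> sequence of floats.
    Returns a list of groups, each a list of tensor names.
    """
    groups = []  # each entry: (representative array, member names)
    for name, arr in in_sum2.items():
        for rep, members in groups:
            if arrays_match(rep, arr):
                members.append(name)
                break
        else:
            groups.append((arr, [name]))  # no match: start a new group
    return [members for _, members in groups]


def group_importance(members, timp):
    """Group importance is the sum of its members' importance."""
    return sum(timp[m] for m in members)
```

Summing rather than averaging means a group's utility grows with its size cost, so a large tied group competes on equal terms with a single tensor.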

4. Priority Queue Drain

All possible single-tier upgrades are pushed into a max-heap:

utility/MiB = sum(timp[group]) × (MSE(cur) − MSE(next)) / (size(next) − size(cur))

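MSE per tier is theoretical: MSE = 2^(-2 × bpw), where bpw is the effective value listed as MSE_BPW in the Tier Reference. K-quants get +0.1 effective bpw vs IQ-quants at the same real bpw (Q4_K: 4.50 effective, IQ4_NL: 4.40 effective, both 4.50 real), so IQ4_NL→Q4_K is a free quality gain: lower MSE at zero size cost.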

The queue pops the highest-utility upgrade, applies it, pushes the next upgrade for that group, and drains until the budget is exhausted. A final pass catches any remaining zero-cost upgrades.
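
The drain loop can be sketched with Python's `heapq`. This is an illustration under stated assumptions, not the code in `classifier.py`: the four-tier ladder, the `drain` signature, and the choice to skip an upgrade that does not fit (rather than stop at the first one) are all hypothetical. The free-upgrade pass is omitted, so every step on the ladder has a strictly positive size cost.

```python
import heapq

# (tier, real bpw, effective bpw for the MSE model), lowest to highest.
# Values come from the Tier Reference table; the ladder itself is assumed.
LADDER = [
    ("Q4_K", 4.5, 4.5),
    ("Q5_K", 5.5, 5.5),
    ("Q6_K", 6.5625, 6.5625),
    ("Q8_0", 8.5, 8.5),
]


def mse(eff_bpw):
    """Theoretical MSE = 2^(-2 * bpw)."""
    return 2.0 ** (-2.0 * eff_bpw)


def size_mib(n_params, bpw):
    """Size in MiB of n_params weights stored at bpw bits per weight."""
    return n_params * bpw / 8 / 2**20


def drain(groups, budget_mib):
    """Greedily upgrade groups by utility per MiB.

    groups maps name -> (summed importance, parameter count > 0).
    Returns (name -> tier, MiB used).
    """
    level = {g: 0 for g in groups}
    used = sum(size_mib(n, LADDER[0][1]) for _, n in groups.values())
    heap = []

    def push(g):
        i = level[g]
        if i + 1 >= len(LADDER):
            return  # already at the top tier
        imp, n = groups[g]
        cost = size_mib(n, LADDER[i + 1][1]) - size_mib(n, LADDER[i][1])
        gain = imp * (mse(LADDER[i][2]) - mse(LADDER[i + 1][2]))
        # heapq is a min-heap, so negate utility to pop the largest first
        heapq.heappush(heap, (-gain / cost, g, cost))

    for g in groups:
        push(g)
    while heap:
        _, g, cost = heapq.heappop(heap)
        if used + cost > budget_mib:
            continue  # does not fit; a cheaper upgrade may still fit
        used += cost
        level[g] += 1
        push(g)  # queue this group's next single-tier upgrade
    return {g: LADDER[level[g]][0] for g in groups}, used
```

Because each group has at most one pending upgrade in the heap, the queue never holds more entries than there are groups.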

Why It Works

Problem	ASHQ1 Solution
Uniform quant wastes bits on low-importance tensors	Priority queue allocates budget where it matters
Heuristic hand-tuning doesn't scale	Single knob: `--size` in MiB
Hand-tuned SHQ hybrids need days of PPL sweeps	Queue converges in ~1 sec for any budget
Large tied groups starved by per-tensor logic	`sum(timp)` prevents 32× group penalty
IQ4_NL→Q4_K at same bpw is a no-op	Free-upgrade pass catches zero-cost quality gains
No PPL-per-budget curve needed	Queue optimises for MSE directly

Supported Architectures

Arch	Detection	Features
`qwen35`	SSM + QKV	Hybrid attention, SSM layers, GQA, MTP support
`mellum2`	MoE (`exps` tensors)	Mixture of Experts, GQA, router F16
`gemma4`	Layer-scale norms	QAT support, Q4_K attention floor

MTP (Multi-Token Prediction) heads are handled explicitly: MTP tensors deploy at Q5_K and are excluded from the classifier's budget (their cost is subtracted from the target upfront). Tensor names with nextn.* or layers beyond n_layers are detected as MTP at runtime.
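
The detection and budget rules above might look roughly like this. The sketch assumes a `blk.<n>.` name prefix with 0-indexed layers (so an index of n_layers or higher counts as "beyond") and treats any name containing `nextn.` as MTP. Both helper names are hypothetical, and `model_reader.py` detects the prefix at runtime rather than hard-coding it.

```python
import re

# Assumed tensor-name prefix; the real reader detects it at runtime.
_BLK = re.compile(r"^blk\.(\d+)\.")


def is_mtp(name: str, n_layers: int) -> bool:
    """True if a tensor belongs to an MTP head."""
    if "nextn." in name:
        return True
    m = _BLK.match(name)
    return m is not None and int(m.group(1)) >= n_layers


def classifier_budget(target_mib: float, mtp_sizes_mib) -> float:
    """Subtract the fixed Q5_K cost of MTP tensors from the target upfront."""
    return target_mib - sum(mtp_sizes_mib)
```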

Code Structure

File	Role
`main.py`	CLI entry point, orchestration, `--show-floors`, multiple `--imatrix` support
`model_reader.py`	Reads GGUF, detects architecture/prefix/n_layers/MTP at runtime
`imatrix_reader.py`	Parses imatrix GGUF, detects tied groups via `np.allclose(in_sum2)`, combines multiple imatrix files
`classifier.py`	Floor assignment → tied group building → priority queue drain → free upgrade pass
`config_generator.py`	Generates `--tensor-type` regex rules from classified tensors (valid ECMAScript regex with pipe-alternated ranges)
`quantizer.py`	Subprocess wrapper around `llama-quantize`
`constants.py`	TENSOR_CLASS mapping, CLASS_HARD_FLOORS, CLASS_MAX_TIER, MSE_BPW, TIER_BPW, ARCH_FEATURES

Usage

Quantization

pip install -r requirements.txt

# Dry run (~1 sec)
python main.py --model model.gguf --imatrix imatrix.gguf --size 6800

# Actual quant (~10 min)
python main.py --model model.gguf --imatrix imatrix.gguf --size 6800 --run

# Show hard floors
python main.py --show-floors

# Multiple imatrix (combined with max/mean)
python main.py --model model.gguf --imatrix i1.gguf --imatrix i2.gguf \
  --imatrix-method max --size 6800 --run

The llama-quantize binary path is set in quantizer.py:6.
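
For the multiple-imatrix case, `--imatrix-method` selects how per-tensor statistics are merged. A minimal sketch of an element-wise max or mean follows. It assumes every file covers the same tensors with the same shapes, and that values are combined as-is; whether the real `imatrix_reader.py` normalises by sample counts first is not stated here. The function name is hypothetical.

```python
def combine_imatrix(stats_list, method="max"):
    """Combine several imatrix statistics element-wise.

    stats_list is a list of dicts mapping tensor name -> list of floats.
    Assumes all dicts share the same names and array lengths.
    """
    if method not in ("max", "mean"):
        raise ValueError(f"unknown method: {method}")
    if method == "max":
        reduce = max
    else:
        reduce = lambda vals: sum(vals) / len(vals)
    combined = {}
    for name in stats_list[0]:
        arrays = [stats[name] for stats in stats_list]
        # zip(*arrays) yields one tuple per element position
        combined[name] = [reduce(vals) for vals in zip(*arrays)]
    return combined
```

Taking the max keeps a tensor important if any calibration set found it important, while the mean smooths out dataset-specific spikes.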

Inference (llama-server)

Recommended server flags for serving ASHQ1 quants:

./build/bin/llama-server \
  -m model-ASHQ1.gguf \
  -c 50000 \
  --jinja \
  -fit off \
  -ngl 99 \
  --flash-attn on \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --port 8080 \
  --mmap \
  --temp 1.0 \
  --top-p 0.95 \
  --min-p 0 \
  --top-k 20 \
  --seed -1 \
  --parallel 1

Tier Reference

Tier	BPW	MSE_BPW
F16	16.0	16.0
Q8_0	8.50	8.50
Q6_K	6.5625	6.5625
Q5_K	5.50	5.50
Q4_K	4.50	4.50
IQ4_NL	4.50	4.40
IQ4_XS	4.25	4.00
Q3_K	3.4375	3.4375

Quantization Configs

Generated configs are valid llama-quantize arguments with ECMAScript-compatible regex patterns. Each --tensor-type rule matches a group of tensors that share the same target tier, with layers grouped into contiguous ranges:

(blk|BLK)\.(3|7|11|15|19|23|27|31)\.attn_k=Q8_0 — specific attention layers at Q8_0
(blk|BLK)\.((?:22|23|24|25|26))\.ffn_gate=Q6_K — range of FFN layers at Q6_K
.*output_norm.*=F16 — global catch-all

Rules are sorted by specificity (specific layers, high tiers first) because llama-quantize uses first-match-wins.
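
A simplified sketch of rule generation and first-match lookup is shown below. The helper names and the `(layer, suffix) -> tier` input shape are hypothetical. The sketch sorts rules alphabetically rather than by specificity and tier, and it assumes patterns are applied with search semantics, using Python's `re` as a stand-in for the ECMAScript engine.

```python
import re
from collections import defaultdict


def layer_rules(assignments):
    """Build pipe-alternated rules from per-layer tier assignments.

    assignments maps (layer index, tensor suffix) -> tier,
    e.g. (3, "attn_k") -> "Q8_0".
    """
    grouped = defaultdict(list)
    for (layer, suffix), tier in assignments.items():
        grouped[(suffix, tier)].append(layer)
    rules = []
    for (suffix, tier), layers in sorted(grouped.items()):
        alt = "|".join(str(n) for n in sorted(layers))
        rules.append(f"(blk|BLK)\\.({alt})\\.{suffix}={tier}")
    return rules


def first_match(rules, tensor_name):
    """Return the tier of the first rule whose pattern matches, else None."""
    for rule in rules:
        pattern, tier = rule.rsplit("=", 1)
        if re.search(pattern, tensor_name):
            return tier
    return None
```

Under first-match-wins, a catch-all such as `.*output_norm.*=F16` has to come after the layer-specific rules, otherwise a broad pattern could shadow a more specific one.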
