ASHQ1 β Autonomous Selective Hybrid Quantization
β οΈ Experimental. ASHQ1 is a personal research project that I will be refining over time. Use at your own risk. Results may vary between architectures and fine-tunes. Feedback and contributions welcome.
ASHQ1 is a post-training quantization method for GGUF models that uses an imatrix-driven priority queue to maximise theoretical quality per megabyte. Instead of uniform bit-depth or heuristic layer-blocking, it treats tied tensor groups as monolithic entities and greedily upgrades them by strict mathematical utility β the product of summed importance and theoretical MSE reduction, divided by size cost.
Results
| Target | Model | Arch | MTP | Actual | PPL (ctx=1024) | vs Q6_K |
|---|---|---|---|---|---|---|
| 5500 | Qwable-9B | Qwen3.5 | β | 5,503 MiB | 7.4334 | β |
| 5700 | Qwythos-9B | Qwen3.5 | yes | 5,713 MiB | 7.5411 | β0.047 |
ASHQ1 at 5700 MiB beats uniform Q6_K by 0.047 PPL at 19% smaller (5713 vs 7076 MiB).
Real-World Validation
ASHQ1's theoretical quality advantage transfers to real agentic coding. We tested Ornith-1.0-9B ASHQ1 6500 (6.4 GB, 33% smaller than Q8_0) as the backend for Pi, an autonomous coding agent that uses llama.cpp as its LLM backend.
At temperature 0.6, the model was tasked with building a complete personal finance dashboard as a single HTML file β Canvas charts, budget tracker, dark mode, transaction filtering, upcoming bills, responsive layout. The agent worked autonomously: planned the architecture, wrote the entire ~1100-line file, caught its own bugs (date.now β date.getTime), fixed dark mode logic, ran Node.js validation, and iterated until all checks passed. The final finance-dashboard.html was a polished, production-quality single-page app β no external dependencies, no hallucinations, no broken features.
This is not cherry-picked. It's the first test we ran. The benchmarks didn't lie β ASHQ1 preserves enough quality that a 6.4 GB quant can drive an autonomous coding agent to build complete, working applications from scratch.
How It Works
1. Floor Assignment
Every tensor starts at a minimum tier by class. SSM params and norms lock at F16. Embeddings start at Q5_K. Weight matrices start at Q4_K (or IQ4_XS for QAT models). MTP heads get Q5_K.
2. Depth-Weighted Importance
Imatrix in_sum2 measures how much each weight contributes to the output variance. This raw importance gets scaled by layer position:
- Layer 0: 2.0Γ (embedding proximity)
- Last 5 layers: 1.5Γ (output proximity)
- Middle: 1.0Γ
3. Tied Group Detection
Tensors with numerically identical in_sum2 arrays are tied (shared weights). They form a single upgrade group β all members upgrade together as one unit. Group importance is the sum of its members' importance, preventing large groups from being starved of budget.
4. Priority Queue Drain
All possible single-tier upgrades are pushed into a max-heap:
utility/MiB = sum(timp[group]) Γ (MSE(cur) β MSE(next)) / (size(next) β size(cur))
MSE per tier is theoretical: MSE = 2^(-2 Γ bpw). K-quants get +0.1 effective bpw vs IQ-quants at the same real bpw, so IQ4_NLβQ4_K is a free quality gain.
The queue pops the highest-utility upgrade, applies it, pushes the next upgrade for that group, and drains until the budget is exhausted. A final pass catches any remaining zero-cost upgrades.
Why It Works
| Problem | ASHQ1 Solution |
|---|---|
| Uniform quant wastes bits on low-importance tensors | Priority queue allocates budget where it matters |
| Heuristic hand-tuning doesn't scale | Single knob: --size in MiB |
| Hand-tuned SHQ hybrids need days of PPL sweeps | Queue converges in ~1 sec for any budget |
| Large tied groups starved by per-tensor logic | sum(timp) prevents 32Γ group penalty |
| IQ4_NLβQ4_K at same bpw is a no-op | Free-upgrade pass catches zero-cost quality gains |
| No PPL-per-budget curve needed | Queue optimises for MSE directly |
Supported Architectures
| Arch | Detection | Features |
|---|---|---|
qwen35 |
SSM + QKV | Hybrid attention, SSM layers, GQA, MTP support |
mellum2 |
MoE (exps tensors) |
Mixture of Experts, GQA, router F16 |
gemma4 |
Layer-scale norms | QAT support, Q4_K attention floor |
MTP (Multi-Token Prediction) heads are handled explicitly: MTP tensors deploy at Q5_K and are excluded from the classifier's budget (their cost is subtracted from the target upfront). Tensor names with nextn.* or layers beyond n_layers are detected as MTP at runtime.
Code Structure
| File | Role |
|---|---|
main.py |
CLI entry point, orchestration, --show-floors, multiple --imatrix support |
model_reader.py |
Reads GGUF, detects architecture/prefix/n_layers/MTP at runtime |
imatrix_reader.py |
Parses imatrix GGUF, detects tied groups via np.allclose(in_sum2), combines multiple imatrix |
classifier.py |
Floor assignment β tied group building β priority queue drain β free upgrade pass |
config_generator.py |
Generates --tensor-type regex rules from classified tensors (valid ECMAScript regex with pipe-alternated ranges) |
quantizer.py |
Subprocess wrapper around llama-quantize |
constants.py |
TENSOR_CLASS mapping, CLASS_HARD_FLOORS, CLASS_MAX_TIER, MSE_BPW, TIER_BPW, ARCH_FEATURES |
Usage
Quantization
pip install -r requirements.txt
# Dry run (βΌ1 sec)
python main.py --model model.gguf --imatrix imatrix.gguf --size 6800
# Actual quant (βΌ10 min)
python main.py --model model.gguf --imatrix imatrix.gguf --size 6800 --run
# Show hard floors
python main.py --show-floors
# Multiple imatrix (combined with max/mean)
python main.py --model model.gguf --imatrix i1.gguf --imatrix i2.gguf \
--imatrix-method max --size 6800 --run
The llama-quantize binary path is set in quantizer.py:6.
Inference (llama-server)
Recommended server flags for serving ASHQ1 quants:
./build/bin/llama-server \
-m model-ASHQ1.gguf \
-c 50000 \
--jinja \
-fit off \
-ngl 99 \
--flash-attn on \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--port 8080 \
--mmap \
--temp 1.0 \
--top-p 0.95 \
--min-p 0 \
--top-k 20 \
--seed -1 \
--parallel 1
Tier Reference
| Tier | BPW | MSE_BPW |
|---|---|---|
| F16 | 16.0 | 16.0 |
| Q8_0 | 8.50 | 8.50 |
| Q6_K | 6.5625 | 6.5625 |
| Q5_K | 5.50 | 5.50 |
| Q4_K | 4.50 | 4.50 |
| IQ4_NL | 4.50 | 4.40 |
| IQ4_XS | 4.25 | 4.00 |
| Q3_K | 3.4375 | 3.4375 |
Quantization Configs
Generated configs are valid llama-quantize arguments with ECMAScript-compatible regex patterns. Each --tensor-type rule matches a group of tensors that share the same target tier, with layers grouped into contiguous ranges:
(blk|BLK)\.(3|7|11|15|19|23|27|31)\.attn_k=Q8_0β specific attention layers at Q8_0(blk|BLK)\.((?:22|23|24|25|26))\.ffn_gate=Q6_Kβ range of FFN layers at Q6_K.*output_norm.*=F16β global catch-all
Rules are sorted by specificity (specific layers, high tiers first) because llama-quantize uses first-match-wins.