ASHQ1 β€” Autonomous Selective Hybrid Quantization

⚠️ Experimental. ASHQ1 is a personal research project that I will be refining over time. Use at your own risk. Results may vary between architectures and fine-tunes. Feedback and contributions welcome.

ASHQ1 is a post-training quantization method for GGUF models that uses an imatrix-driven priority queue to maximise theoretical quality per megabyte. Instead of uniform bit-depth or heuristic layer-blocking, it treats tied tensor groups as monolithic entities and greedily upgrades them by strict mathematical utility β€” the product of summed importance and theoretical MSE reduction, divided by size cost.

Results

Target Model Arch MTP Actual PPL (ctx=1024) vs Q6_K
5500 Qwable-9B Qwen3.5 β€” 5,503 MiB 7.4334 β€”
5700 Qwythos-9B Qwen3.5 yes 5,713 MiB 7.5411 βˆ’0.047

ASHQ1 at 5700 MiB beats uniform Q6_K by 0.047 PPL at 19% smaller (5713 vs 7076 MiB).

Real-World Validation

ASHQ1's theoretical quality advantage transfers to real agentic coding. We tested Ornith-1.0-9B ASHQ1 6500 (6.4 GB, 33% smaller than Q8_0) as the backend for Pi, an autonomous coding agent that uses llama.cpp as its LLM backend.

At temperature 0.6, the model was tasked with building a complete personal finance dashboard as a single HTML file β€” Canvas charts, budget tracker, dark mode, transaction filtering, upcoming bills, responsive layout. The agent worked autonomously: planned the architecture, wrote the entire ~1100-line file, caught its own bugs (date.now β†’ date.getTime), fixed dark mode logic, ran Node.js validation, and iterated until all checks passed. The final finance-dashboard.html was a polished, production-quality single-page app β€” no external dependencies, no hallucinations, no broken features.

This is not cherry-picked. It's the first test we ran. The benchmarks didn't lie β€” ASHQ1 preserves enough quality that a 6.4 GB quant can drive an autonomous coding agent to build complete, working applications from scratch.

How It Works

1. Floor Assignment

Every tensor starts at a minimum tier by class. SSM params and norms lock at F16. Embeddings start at Q5_K. Weight matrices start at Q4_K (or IQ4_XS for QAT models). MTP heads get Q5_K.

2. Depth-Weighted Importance

Imatrix in_sum2 measures how much each weight contributes to the output variance. This raw importance gets scaled by layer position:

  • Layer 0: 2.0Γ— (embedding proximity)
  • Last 5 layers: 1.5Γ— (output proximity)
  • Middle: 1.0Γ—

3. Tied Group Detection

Tensors with numerically identical in_sum2 arrays are tied (shared weights). They form a single upgrade group β€” all members upgrade together as one unit. Group importance is the sum of its members' importance, preventing large groups from being starved of budget.

4. Priority Queue Drain

All possible single-tier upgrades are pushed into a max-heap:

utility/MiB = sum(timp[group]) Γ— (MSE(cur) βˆ’ MSE(next)) / (size(next) βˆ’ size(cur))

MSE per tier is theoretical: MSE = 2^(-2 × bpw). K-quants get +0.1 effective bpw vs IQ-quants at the same real bpw, so IQ4_NL→Q4_K is a free quality gain.

The queue pops the highest-utility upgrade, applies it, pushes the next upgrade for that group, and drains until the budget is exhausted. A final pass catches any remaining zero-cost upgrades.

Why It Works

Problem ASHQ1 Solution
Uniform quant wastes bits on low-importance tensors Priority queue allocates budget where it matters
Heuristic hand-tuning doesn't scale Single knob: --size in MiB
Hand-tuned SHQ hybrids need days of PPL sweeps Queue converges in ~1 sec for any budget
Large tied groups starved by per-tensor logic sum(timp) prevents 32Γ— group penalty
IQ4_NL→Q4_K at same bpw is a no-op Free-upgrade pass catches zero-cost quality gains
No PPL-per-budget curve needed Queue optimises for MSE directly

Supported Architectures

Arch Detection Features
qwen35 SSM + QKV Hybrid attention, SSM layers, GQA, MTP support
mellum2 MoE (exps tensors) Mixture of Experts, GQA, router F16
gemma4 Layer-scale norms QAT support, Q4_K attention floor

MTP (Multi-Token Prediction) heads are handled explicitly: MTP tensors deploy at Q5_K and are excluded from the classifier's budget (their cost is subtracted from the target upfront). Tensor names with nextn.* or layers beyond n_layers are detected as MTP at runtime.

Code Structure

File Role
main.py CLI entry point, orchestration, --show-floors, multiple --imatrix support
model_reader.py Reads GGUF, detects architecture/prefix/n_layers/MTP at runtime
imatrix_reader.py Parses imatrix GGUF, detects tied groups via np.allclose(in_sum2), combines multiple imatrix
classifier.py Floor assignment β†’ tied group building β†’ priority queue drain β†’ free upgrade pass
config_generator.py Generates --tensor-type regex rules from classified tensors (valid ECMAScript regex with pipe-alternated ranges)
quantizer.py Subprocess wrapper around llama-quantize
constants.py TENSOR_CLASS mapping, CLASS_HARD_FLOORS, CLASS_MAX_TIER, MSE_BPW, TIER_BPW, ARCH_FEATURES

Usage

Quantization

pip install -r requirements.txt

# Dry run (∼1 sec)
python main.py --model model.gguf --imatrix imatrix.gguf --size 6800

# Actual quant (∼10 min)
python main.py --model model.gguf --imatrix imatrix.gguf --size 6800 --run

# Show hard floors
python main.py --show-floors

# Multiple imatrix (combined with max/mean)
python main.py --model model.gguf --imatrix i1.gguf --imatrix i2.gguf \
  --imatrix-method max --size 6800 --run

The llama-quantize binary path is set in quantizer.py:6.

Inference (llama-server)

Recommended server flags for serving ASHQ1 quants:

./build/bin/llama-server \
  -m model-ASHQ1.gguf \
  -c 50000 \
  --jinja \
  -fit off \
  -ngl 99 \
  --flash-attn on \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --port 8080 \
  --mmap \
  --temp 1.0 \
  --top-p 0.95 \
  --min-p 0 \
  --top-k 20 \
  --seed -1 \
  --parallel 1

Tier Reference

Tier BPW MSE_BPW
F16 16.0 16.0
Q8_0 8.50 8.50
Q6_K 6.5625 6.5625
Q5_K 5.50 5.50
Q4_K 4.50 4.50
IQ4_NL 4.50 4.40
IQ4_XS 4.25 4.00
Q3_K 3.4375 3.4375

Quantization Configs

Generated configs are valid llama-quantize arguments with ECMAScript-compatible regex patterns. Each --tensor-type rule matches a group of tensors that share the same target tier, with layers grouped into contiguous ranges:

  • (blk|BLK)\.(3|7|11|15|19|23|27|31)\.attn_k=Q8_0 β€” specific attention layers at Q8_0
  • (blk|BLK)\.((?:22|23|24|25|26))\.ffn_gate=Q6_K β€” range of FFN layers at Q6_K
  • .*output_norm.*=F16 β€” global catch-all

Rules are sorted by specificity (specific layers, high tiers first) because llama-quantize uses first-match-wins.

References

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support