Zamba2-2.7B-Instruct-HXQ

Zamba2-2.7B-Instruct compressed with HXQ (HelixCode vector quantization).

Available as both HuggingFace safetensors (via helix-substrate) and native GGUF (via llama.cpp HXQ fork).

This is the first hybrid Mamba2+Transformer model with HXQ runtime benchmarks.

GGUF Runtime Benchmark (RTX 3090)

Benchmarked against standard GGUF K-quants on RTX 3090, full GPU offload (-ngl 99), using the hxq-affine-type branch at commit 580e9a2.

Decode Speed (tg128, 3 runs)

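| Format  | Size    | bpw  | tok/s | vs Q4_K_M | vs Q6_K |
|---------|---------|------|-------|-----------|---------|
| Q4_K_M  | 2.11 GB | 4.5  | 47.30 | 100%      | 105.2%  |
| HXQ_AF6 | 2.79 GB | 6.27 | 45.87 | 97.0%     | 102.0%  |
| Q6_K    | 2.93 GB | 6.58 | 44.98 | 95.1%     | 100%    |
| Q5_K_M  | 2.51 GB | 5.62 | 43.38 | 91.7%     | 96.4%   |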

Perplexity (WikiText-2, 654 chunks, ctx=512)

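| Format  | bpw  | PPL    | PPL delta vs Q4_K_M |
|---------|------|--------|---------------------|
| Q6_K    | 6.58 | 22.573 | -0.705 (best)       |
| HXQ_AF6 | 6.27 | 22.653 | -0.626 (2nd)        |
| Q5_K_M  | 5.62 | 22.743 | -0.535              |
| Q4_K_M  | 4.5  | 23.278 | baseline            |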

Prefill (pp512, 3 runs)

| Format  | tok/s  | vs Q4_K_M |
|---------|--------|-----------|
| Q4_K_M  | 325.24 | 100%      |
| Q6_K    | 322.19 | 99.1%     |
| Q5_K_M  | 321.32 | 98.8%     |
| HXQ_AF6 | 318.55 | 97.9%     |

Summary: HXQ_AF6 decodes faster than both Q6_K (+2.0%) and Q5_K_M (+5.7%) while being smaller than Q6_K (2.79 vs 2.93 GB). Its PPL is second-best, only 0.080 behind Q6_K. Prefill throughput differs by about 2% at most across all formats (HXQ_AF6 is the slowest at 97.9% of Q4_K_M), because prefill on this model is SSM-dominated rather than matmul-dominated. This is the first HXQ runtime benchmark on a hybrid Mamba2+Transformer architecture.
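
The relative figures in the summary can be re-derived from the tables. The snippet below is a minimal sketch that only copies the published tok/s and PPL values; the helper name `speedup_pct` is illustrative and not part of any HXQ tooling.

```python
# Decode throughput (tok/s) copied from the tg128 table.
decode_tps = {"Q4_K_M": 47.30, "HXQ_AF6": 45.87, "Q6_K": 44.98, "Q5_K_M": 43.38}
# Perplexity copied from the WikiText-2 table.
ppl = {"Q6_K": 22.573, "HXQ_AF6": 22.653, "Q5_K_M": 22.743, "Q4_K_M": 23.278}


def speedup_pct(value, baseline):
    """Percentage by which `value` exceeds `baseline` (negative if slower)."""
    return 100.0 * (value / baseline - 1.0)


hxq_vs_q6 = speedup_pct(decode_tps["HXQ_AF6"], decode_tps["Q6_K"])
hxq_vs_q5 = speedup_pct(decode_tps["HXQ_AF6"], decode_tps["Q5_K_M"])
ppl_gap_to_q6 = ppl["HXQ_AF6"] - ppl["Q6_K"]

print(f"HXQ_AF6 vs Q6_K decode:   {hxq_vs_q6:+.1f}%")
print(f"HXQ_AF6 vs Q5_K_M decode: {hxq_vs_q5:+.1f}%")
print(f"HXQ_AF6 PPL gap to Q6_K:  {ppl_gap_to_q6:.3f}")
```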

Reproducibility

All claims are within-run comparisons using the same dataset, llama.cpp commit, and hardware. Do not compare these PPL numbers with numbers from other runs using different model variants, dataset files, or build configurations.

Note: Zamba2 absolute PPL (about 22) is higher than Qwen (8-10) on this dataset because of its different tokenizer and training distribution. The metric that matters here is the relative ranking within this run.

Receipt with SHA256 artifact hashes, exact commands, and dataset provenance: hxq_runtime_3090_zamba2_2.7b_20260509

Install and Run

Option 1: Native GGUF (llama.cpp)

```bash
# Build llama.cpp with HXQ + Zamba2 support
git clone -b hxq-affine-type https://github.com/echo313unfolding/llama.cpp.git
cd llama.cpp && mkdir build && cd build
cmake .. -DGGML_CUDA=ON && make -j$(nproc) llama-cli

# Run
./bin/llama-cli -m zamba2-2.7b-instruct-hxq-affine6.gguf \
  -ngl 99 -p "Explain the theory of relativity in simple terms:" -n 128
```

Option 2: HuggingFace (Python)

```bash
pip install "helix-substrate[hf]"
```

```python
import helix_substrate  # registers the HXQ quantizer with HuggingFace
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("EchoLabs33/zamba2-2.7b-instruct-hxq")
tokenizer = AutoTokenizer.from_pretrained("EchoLabs33/zamba2-2.7b-instruct-hxq")

inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Safetensors Benchmark

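| Metric                            | Dense (BF16) | HXQ (safetensors)      |
|-----------------------------------|--------------|------------------------|
| Size                              | 5.1 GB       | 2.8 GB                 |
| Perplexity (WikiText-2, 2048 ctx) | 5.33         | 5.68 (+6.59%)          |
| Compression ratio                 | 1x           | 1.83x                  |
| Compressed modules                | 0            | 147 HelixLinear layers |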

Note: The safetensors PPL (5.68) and GGUF PPL (22.653) use different evaluation configurations (ctx=2048/stride=512 vs ctx=512/654 chunks) and different tokenization. They are not directly comparable.
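
To illustrate why the evaluation configuration matters, here is a minimal sketch of a strided perplexity evaluation. It assumes the safetensors figure was measured with the common sliding-window scheme (each window sees up to `ctx` tokens of context but only scores tokens that no earlier window scored); the function names are illustrative and do not come from helix-substrate or llama.cpp. The GGUF figure instead comes from 654 fixed chunks at ctx=512, so each scored token sees far less context on average.

```python
import math


def sliding_windows(n_tokens, ctx, stride):
    """Yield (begin, end, n_scored) for each window of a strided evaluation."""
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + ctx, n_tokens)
        # Only tokens beyond the previous window's end are newly scored;
        # the earlier tokens in the window serve purely as context.
        n_scored = end - prev_end
        yield begin, end, n_scored
        prev_end = end
        if end == n_tokens:
            break


def perplexity(total_nll, n_scored_tokens):
    """Perplexity is exp of the mean negative log-likelihood per scored token."""
    return math.exp(total_nll / n_scored_tokens)


# Overlapping windows (ctx=2048, stride=512) on a toy 3000-token stream.
overlapping = list(sliding_windows(3000, ctx=2048, stride=512))
# Non-overlapping chunks (stride equal to ctx=512) on a toy 1024-token stream.
chunked = list(sliding_windows(1024, ctx=512, stride=512))
print(overlapping)
print(chunked)
```

With overlapping windows most scored tokens are conditioned on well over a thousand preceding tokens, which tends to lower perplexity relative to short independent chunks.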

Good to Know

  • GPU and CPU supported — runs on any CUDA GPU or CPU via standard PyTorch. Native GGUF runs via llama.cpp.
  • Hybrid architecture — 45 Mamba2 layers + 9 shared Transformer layers. SSM tensors (ssm_in, ssm_out, ssm_mix) and shared FFN all compressed.
  • Fine-tunable via LoRA — compressed weights remain frozen, but LoRA adapters attach to each HelixLinear layer via HelixLinearSTE. See helix-substrate for training infrastructure.
  • Requires helix-substrate for safetensors path — the quantizer is not built into transformers.
  • Requires llama.cpp HXQ fork for GGUF path — standard llama.cpp does not have HXQ type support yet. The hxq-affine-type branch also includes Zamba2 architecture support.
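
The LoRA point above can be sketched generically. The class below is a plain NumPy illustration of a low-rank adapter on top of a frozen weight matrix; it is not the HelixLinearSTE API, and the class name, `rank`, and `alpha` defaults are assumptions chosen for the example. In the real model the frozen weight would be the compressed HelixLinear layer rather than a dense array.

```python
import numpy as np


class FrozenLinearWithLoRA:
    """Generic LoRA sketch (hypothetical, not the helix-substrate implementation)."""

    def __init__(self, frozen_weight, rank=8, alpha=16.0, seed=0):
        out_features, in_features = frozen_weight.shape
        rng = np.random.default_rng(seed)
        self.frozen_weight = frozen_weight  # stays fixed during fine-tuning
        # Trainable low-rank factors: A is small random, B starts at zero,
        # so the adapter contributes nothing before training.
        self.A = rng.normal(scale=0.01, size=(rank, in_features))
        self.B = np.zeros((out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        base = x @ self.frozen_weight.T
        update = (x @ self.A.T) @ self.B.T
        return base + self.scaling * update

    def num_trainable(self):
        return self.A.size + self.B.size


rng = np.random.default_rng(1)
W = rng.normal(size=(32, 64))
layer = FrozenLinearWithLoRA(W, rank=4)
x = rng.normal(size=(2, 64))
y_initial = layer.forward(x)
print(layer.num_trainable(), "trainable values vs", W.size, "frozen values")
```

Because `B` is zero at initialisation, the adapted layer starts out numerically identical to the frozen one, and only the small factors `A` and `B` would receive gradient updates.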

What is HXQ?

HXQ is a weight compression codec based on vector quantization with per-group affine correction:

  • Each weight matrix is replaced by a 256-entry codebook + uint8 index matrix + per-group affine scale/offset
  • The compressed form is the executable — codebook[indices] * scale + offset during matmul, no decompression step
  • Works on any nn.Linear regardless of architecture (Transformer, Mamba, MLP)
  • No calibration data required — codebooks are fit from the weights alone via k-means
  • 6.27 bits per weight in the GGUF affine-6 format
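
The bullets above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the HXQ implementation: it assumes scalar codebook entries, a group size of 64 consecutive weights, quantile-initialised 1-D k-means, and a per-group least-squares fit for scale and offset. All function names are hypothetical. It uses 8-bit indices as in the first bullet and does not model the 6.27 bpw GGUF packing.

```python
import numpy as np


def fit_codebook(values, k=256, iters=10):
    """1-D k-means (Lloyd's algorithm) on a flat array; no calibration data."""
    centroids = np.quantile(values, np.linspace(0.0, 1.0, k))
    for _ in range(iters):
        idx = np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)
        sums = np.bincount(idx, weights=values, minlength=k)
        counts = np.bincount(idx, minlength=k)
        # Keep the old centroid wherever a cluster ended up empty.
        centroids = np.where(counts > 0, sums / np.maximum(counts, 1), centroids)
    return centroids


def quantize(weight, group_size=64, k=256):
    assert weight.shape[1] % group_size == 0, "groups must tile each row"
    flat = weight.astype(np.float64).ravel()
    codebook = fit_codebook(flat, k=k)
    indices = np.abs(flat[:, None] - codebook[None, :]).argmin(axis=1).astype(np.uint8)
    approx = codebook[indices].reshape(-1, group_size)
    target = flat.reshape(-1, group_size)
    # Per-group least squares: target ~= scale * approx + offset.
    a_mean = approx.mean(axis=1, keepdims=True)
    t_mean = target.mean(axis=1, keepdims=True)
    var = ((approx - a_mean) ** 2).sum(axis=1, keepdims=True)
    cov = ((approx - a_mean) * (target - t_mean)).sum(axis=1, keepdims=True)
    scale = np.where(var > 0, cov / np.where(var > 0, var, 1.0), 1.0)
    offset = t_mean - scale * a_mean
    return codebook, indices.reshape(weight.shape), scale, offset


def dequantize(codebook, indices, scale, offset, group_size=64):
    """codebook[indices] * scale + offset, applied group by group."""
    grouped = codebook[indices].reshape(-1, group_size)
    return (grouped * scale + offset).reshape(indices.shape)


rng = np.random.default_rng(0)
W = rng.normal(scale=0.02, size=(64, 128))
codebook, indices, scale, offset = quantize(W)
W_hat = dequantize(codebook, indices, scale, offset)
rel_err = np.linalg.norm(W - W_hat) / np.linalg.norm(W)
print(f"relative reconstruction error: {rel_err:.4f}")
```

The sketch materialises `W_hat` only to measure the error; a fused kernel would instead evaluate `codebook[indices] * scale + offset` inside the matmul, as the second bullet describes.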

Companion Models

Same codec, multiple architectures:

| Model                    | Architecture           | GGUF | Safetensors |
|--------------------------|------------------------|------|-------------|
| qwen2.5-7b-instruct-hxq  | Transformer            | Yes  | Yes         |
| qwen2.5-coder-3b-hxq     | Transformer (code)     | Yes  | Yes         |
| qwen2.5-3b-instruct-hxq  | Transformer            | Yes  | Yes         |
| qwen2.5-coder-1.5b-hxq   | Transformer (code)     | Yes  | Yes         |
| qwen2.5-14b-instruct-hxq | Transformer            | Yes  | Yes         |
| qwen2.5-sentinel-3b-hxq  | Transformer (security) | --   | Yes         |

Citation

@software{hxq_2026,
  title={HXQ: Vector Quantization with Per-Group Affine Correction for Neural Network Weight Compression},
  author={Echo Labs},
  year={2026},
  url={https://github.com/echo313unfolding/helix-substrate}
}

License

Apache 2.0 (inherited from Zyphra/Zamba2-2.7B-instruct).
