Instructions to use EchoLabs33/zamba2-2.7b-instruct-hxq with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use EchoLabs33/zamba2-2.7b-instruct-hxq with Transformers:

# Use a pipeline as a high-level helper
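import helix_substrate  # registers the HXQ quantizer; install with: pip install "helix-substrate[hf]"
from transformers import pipeline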

pipe = pipeline("text-generation", model="EchoLabs33/zamba2-2.7b-instruct-hxq")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
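import helix_substrate  # registers the HXQ quantizer; install with: pip install "helix-substrate[hf]"
from transformers import AutoTokenizer, AutoModelForCausalLM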

tokenizer = AutoTokenizer.from_pretrained("EchoLabs33/zamba2-2.7b-instruct-hxq")
model = AutoModelForCausalLM.from_pretrained("EchoLabs33/zamba2-2.7b-instruct-hxq")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

llama-cpp-python

How to use EchoLabs33/zamba2-2.7b-instruct-hxq with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="EchoLabs33/zamba2-2.7b-instruct-hxq",
    filename="zamba2-2.7b-instruct-hxq-affine6.gguf",
)

llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": "What is the capital of France?"
        }
    ]
)

Notebooks
Google Colab
Kaggle
Local Apps

llama.cpp

How to use EchoLabs33/zamba2-2.7b-instruct-hxq with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf EchoLabs33/zamba2-2.7b-instruct-hxq
# Run inference directly in the terminal:
llama-cli -hf EchoLabs33/zamba2-2.7b-instruct-hxq

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf EchoLabs33/zamba2-2.7b-instruct-hxq
# Run inference directly in the terminal:
llama-cli -hf EchoLabs33/zamba2-2.7b-instruct-hxq

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf EchoLabs33/zamba2-2.7b-instruct-hxq
# Run inference directly in the terminal:
./llama-cli -hf EchoLabs33/zamba2-2.7b-instruct-hxq

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf EchoLabs33/zamba2-2.7b-instruct-hxq
# Run inference directly in the terminal:
./build/bin/llama-cli -hf EchoLabs33/zamba2-2.7b-instruct-hxq

Use Docker

docker model run hf.co/EchoLabs33/zamba2-2.7b-instruct-hxq

LM Studio
Jan

vLLM

How to use EchoLabs33/zamba2-2.7b-instruct-hxq with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "EchoLabs33/zamba2-2.7b-instruct-hxq"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "EchoLabs33/zamba2-2.7b-instruct-hxq",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/EchoLabs33/zamba2-2.7b-instruct-hxq

SGLang

How to use EchoLabs33/zamba2-2.7b-instruct-hxq with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "EchoLabs33/zamba2-2.7b-instruct-hxq" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "EchoLabs33/zamba2-2.7b-instruct-hxq",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "EchoLabs33/zamba2-2.7b-instruct-hxq" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "EchoLabs33/zamba2-2.7b-instruct-hxq",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Ollama
How to use EchoLabs33/zamba2-2.7b-instruct-hxq with Ollama:
```
ollama run hf.co/EchoLabs33/zamba2-2.7b-instruct-hxq
```

Unsloth Studio

How to use EchoLabs33/zamba2-2.7b-instruct-hxq with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for EchoLabs33/zamba2-2.7b-instruct-hxq to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for EchoLabs33/zamba2-2.7b-instruct-hxq to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for EchoLabs33/zamba2-2.7b-instruct-hxq to start chatting

Docker Model Runner
How to use EchoLabs33/zamba2-2.7b-instruct-hxq with Docker Model Runner:
```
docker model run hf.co/EchoLabs33/zamba2-2.7b-instruct-hxq
```

Lemonade

How to use EchoLabs33/zamba2-2.7b-instruct-hxq with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull EchoLabs33/zamba2-2.7b-instruct-hxq

Run and chat with the model

lemonade run user.zamba2-2.7b-instruct-hxq-{{QUANT_TAG}}

List all available models

lemonade list

Zamba2-2.7B-Instruct-HXQ

Zamba2-2.7B-Instruct compressed with HXQ (HelixCode vector quantization).

Available as both HuggingFace safetensors (via helix-substrate) and native GGUF (via llama.cpp HXQ fork).

First hybrid Mamba2+Transformer architecture with HXQ runtime benchmarks.

GGUF Runtime Benchmark (RTX 3090)

Benchmarked against standard GGUF K-quants on RTX 3090, full GPU offload (-ngl 99), using the hxq-affine-type branch at commit 580e9a2.

Decode Speed (tg128, 3 runs)

Format	Size	bpw	tok/s	vs Q4	vs Q6
Q4_K_M	2.11 GB	4.5	47.30	100%	105.2%
HXQ_AF6	2.79 GB	6.27	45.87	97.0%	102.0%
Q6_K	2.93 GB	6.58	44.98	95.1%	100%
Q5_K_M	2.51 GB	5.62	43.38	91.7%	96.4%

Perplexity (WikiText-2, 654 chunks, ctx=512)

Format	bpw	PPL	vs Q4
Q6_K	6.58	22.573	-0.705 (best)
HXQ_AF6	6.27	22.653	-0.626 (2nd)
Q5_K_M	5.62	22.743	-0.535
Q4_K_M	4.5	23.278	baseline

Prefill (pp512, 3 runs)

Format	tok/s	vs Q4
Q4_K_M	325.24	100%
Q6_K	322.19	99.1%
Q5_K_M	321.32	98.8%
HXQ_AF6	318.55	97.9%

Summary: HXQ_AF6 decodes faster than both Q6_K (+2.0%) and Q5_K_M (+5.7%) while being smaller than Q6_K (2.79 vs 2.93 GB). Its PPL is second-best, only 0.080 behind Q6_K. Prefill throughput is within about 2.1% across all formats (SSM-dominated, not matmul-dominated). This is the first HXQ runtime benchmark on a hybrid Mamba2+Transformer architecture.
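The relative figures quoted in the summary can be rechecked directly from the decode and perplexity tables. The short sketch below does that arithmetic; the helper name `speedup_pct` and the dictionary names are illustrative only and are not part of any HXQ tooling.

```python
# Decode throughput (tok/s) and perplexity copied from the tables above.
decode_tps = {"Q4_K_M": 47.30, "HXQ_AF6": 45.87, "Q6_K": 44.98, "Q5_K_M": 43.38}
ppl = {"Q6_K": 22.573, "HXQ_AF6": 22.653, "Q5_K_M": 22.743, "Q4_K_M": 23.278}


def speedup_pct(fmt, baseline, table=decode_tps):
    """Relative throughput change of `fmt` versus `baseline`, in percent."""
    return (table[fmt] / table[baseline] - 1.0) * 100.0


hxq_vs_q6 = speedup_pct("HXQ_AF6", "Q6_K")
hxq_vs_q5 = speedup_pct("HXQ_AF6", "Q5_K_M")
ppl_gap = ppl["HXQ_AF6"] - ppl["Q6_K"]  # positive means HXQ_AF6 is worse

print(f"HXQ_AF6 vs Q6_K decode:   {hxq_vs_q6:+.1f}%")
print(f"HXQ_AF6 vs Q5_K_M decode: {hxq_vs_q5:+.1f}%")
print(f"PPL gap to Q6_K:          {ppl_gap:+.3f}")
```

Because every value comes from the same run, these ratios are meaningful even though the absolute numbers are specific to this hardware and build.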

Reproducibility

All claims are within-run comparisons using the same dataset, llama.cpp commit, and hardware. Do not compare these PPL numbers with numbers from other runs using different model variants, dataset files, or build configurations.

Note: Zamba2 absolute PPL (~22) is higher than Qwen (~8-10) on this dataset due to different tokenizer and training distribution. The important metric is relative ranking within this run.

Receipt with SHA256 artifact hashes, exact commands, and dataset provenance: hxq_runtime_3090_zamba2_2.7b_20260509

Install and Run

Option 1: Native GGUF (llama.cpp)

# Build llama.cpp with HXQ + Zamba2 support
git clone -b hxq-affine-type https://github.com/echo313unfolding/llama.cpp.git
cd llama.cpp && mkdir build && cd build
cmake .. -DGGML_CUDA=ON && make -j$(nproc) llama-cli

# Run
./bin/llama-cli -m zamba2-2.7b-instruct-hxq-affine6.gguf \
  -ngl 99 -p "Explain the theory of relativity in simple terms:" -n 128

Option 2: HuggingFace (Python)

pip install "helix-substrate[hf]"

import helix_substrate  # registers the HXQ quantizer with HuggingFace
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("EchoLabs33/zamba2-2.7b-instruct-hxq")
tokenizer = AutoTokenizer.from_pretrained("EchoLabs33/zamba2-2.7b-instruct-hxq")

inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Safetensors Benchmark

Metric	Dense (BF16)	HXQ (safetensors)
Size	5.1 GB	2.8 GB
Perplexity (WikiText-2, 2048 ctx)	5.33	5.68 (+6.59%)
Compression ratio	1x	1.83x
Compressed modules	0	147 HelixLinear layers

Note: The safetensors PPL (5.68) and GGUF PPL (22.653) use different evaluation configurations (ctx=2048/stride=512 vs ctx=512/654 chunks) and different tokenization. They are not directly comparable.
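To illustrate why context length and stride matter, here is a minimal sketch of the generic strided (sliding-window) perplexity scheme. It is an assumption-laden illustration, not the exact procedure behind either number above: the function names are hypothetical, and llama.cpp's perplexity tool may score tokens within each chunk differently.

```python
import math


def scoring_windows(n_tokens, ctx, stride):
    """Yield (begin, end, n_scored) windows for strided perplexity evaluation.

    Each window feeds tokens[begin:end] to the model but only the last
    n_scored tokens contribute to the loss, so no token is counted twice.
    """
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + ctx, n_tokens)
        yield begin, end, end - prev_end
        prev_end = end
        if end == n_tokens:
            break


def perplexity(total_nll, n_scored):
    """Perplexity is exp of the mean negative log-likelihood per scored token."""
    return math.exp(total_nll / n_scored)


# Toy corpus of 10 tokens: overlapping windows versus back-to-back chunks.
overlapping = list(scoring_windows(10, ctx=4, stride=2))
chunked = list(scoring_windows(10, ctx=4, stride=4))
print(overlapping)
print(chunked)
```

With overlapping windows most scored tokens see more left context than with back-to-back chunks, which generally lowers the measured perplexity. That is one reason numbers from different configurations should not be compared.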

Good to Know

GPU and CPU supported — runs on any CUDA GPU or CPU via standard PyTorch. Native GGUF runs via llama.cpp.
Hybrid architecture — 45 Mamba2 layers + 9 shared Transformer layers. SSM tensors (ssm_in, ssm_out, ssm_mix) and shared FFN all compressed.
Fine-tunable via LoRA — compressed weights remain frozen, but LoRA adapters attach to each HelixLinear layer via HelixLinearSTE. See helix-substrate for training infrastructure.
Requires helix-substrate for safetensors path — the quantizer is not built into transformers.
Requires llama.cpp HXQ fork for GGUF path — standard llama.cpp does not have HXQ type support yet. The hxq-affine-type branch also includes Zamba2 architecture support.
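The LoRA point above can be pictured with a generic sketch: the base weight stays frozen and only a low-rank update is trained. This is plain LoRA written in NumPy for illustration; it is not helix-substrate's HelixLinearSTE, whose API is not shown on this page, and the class name, rank, and alpha values are assumptions.

```python
import numpy as np


class LoRALinear:
    """Frozen base weight plus a trainable low-rank update.

    Forward pass: y = x @ W.T + (alpha / rank) * (x @ A.T) @ B.T
    """

    def __init__(self, frozen_weight, rank=8, alpha=16.0, seed=0):
        out_features, in_features = frozen_weight.shape
        rng = np.random.default_rng(seed)
        self.weight = frozen_weight  # stands in for the compressed layer; never updated
        self.A = rng.standard_normal((rank, in_features)) * 0.01  # trainable
        self.B = np.zeros((out_features, rank))  # trainable, zero-initialised
        self.scaling = alpha / rank

    def forward(self, x):
        base = x @ self.weight.T
        update = (x @ self.A.T) @ self.B.T
        return base + self.scaling * update

    def trainable_parameters(self):
        return self.A.size + self.B.size


rng = np.random.default_rng(1)
layer = LoRALinear(rng.standard_normal((32, 64)), rank=4)
x = rng.standard_normal((2, 64))
y = layer.forward(x)
print(y.shape, layer.trainable_parameters())
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly, and training only has to learn the correction.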

What is HXQ?

HXQ is a weight compression codec based on vector quantization with per-group affine correction:

Each weight matrix is replaced by a 256-entry codebook + uint8 index matrix + per-group affine scale/offset
The compressed form is the executable — codebook[indices] * scale + offset during matmul, no decompression step
Works on any nn.Linear regardless of architecture (Transformer, Mamba, MLP)
No calibration data required — codebooks are fit from the weights alone via k-means
6.27 bits per weight in the GGUF affine-6 format
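The points above can be sketched in NumPy as a toy encoder and decoder. Several details are assumptions made only for illustration and are not confirmed by this page: a scalar (one value per entry) codebook, a group size of 64, quantile initialisation for k-means, and a least-squares fit for the affine terms. The function names are hypothetical, and the real runtime fuses the lookup into the matmul rather than materialising the weights as `reconstruct` does here.

```python
import numpy as np


def fit_codebook(values, k=256, iters=10):
    """Fit a k-entry scalar codebook to the weights with 1-D k-means (Lloyd)."""
    values = values.astype(np.float64)
    # Start from evenly spaced quantiles so centroids follow the weight density.
    centroids = np.quantile(values, np.linspace(0.0, 1.0, k))
    for _ in range(iters):
        nearest = np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)
        counts = np.bincount(nearest, minlength=k)
        sums = np.bincount(nearest, weights=values, minlength=k)
        used = counts > 0
        centroids[used] = sums[used] / counts[used]  # empty clusters keep their value
    return centroids.astype(np.float32)


def compress(weight, group_size=64, k=256):
    """Return (codebook, uint8 indices, per-group scale, per-group offset)."""
    assert weight.size % group_size == 0, "toy sketch: size must divide evenly"
    flat = weight.astype(np.float64).ravel()
    codebook = fit_codebook(flat, k)
    indices = np.abs(flat[:, None] - codebook[None, :]).argmin(axis=1).astype(np.uint8)
    approx = codebook[indices].astype(np.float64).reshape(-1, group_size)
    target = flat.reshape(-1, group_size)
    # Per-group least-squares fit of: target ~ approx * scale + offset
    a_mean = approx.mean(axis=1, keepdims=True)
    t_mean = target.mean(axis=1, keepdims=True)
    var = ((approx - a_mean) ** 2).sum(axis=1, keepdims=True)
    cov = ((approx - a_mean) * (target - t_mean)).sum(axis=1, keepdims=True)
    scale = np.where(var > 0, cov / np.where(var > 0, var, 1.0), 1.0)
    offset = t_mean - scale * a_mean
    return (codebook, indices.reshape(weight.shape),
            scale.astype(np.float32), offset.astype(np.float32))


def reconstruct(codebook, indices, scale, offset, group_size=64):
    """codebook[indices] * scale + offset, applied group by group."""
    grouped = codebook[indices].reshape(-1, group_size)
    return (grouped * scale + offset).reshape(indices.shape)


rng = np.random.default_rng(0)
w = rng.standard_normal((32, 64)).astype(np.float32)
codebook, indices, scale, offset = compress(w)
w_hat = reconstruct(codebook, indices, scale, offset)
print("mean squared error:", float(np.mean((w - w_hat) ** 2)))
```

No calibration data appears anywhere in the sketch: the codebook and the affine terms are both derived from the weight matrix itself.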

Companion Models

Same codec, multiple architectures:

Model	Architecture	GGUF	Safetensors
qwen2.5-7b-instruct-hxq	Transformer	Yes	Yes
qwen2.5-coder-3b-hxq	Transformer (code)	Yes	Yes
qwen2.5-3b-instruct-hxq	Transformer	Yes	Yes
qwen2.5-coder-1.5b-hxq	Transformer (code)	Yes	Yes
qwen2.5-14b-instruct-hxq	Transformer	Yes	Yes
qwen2.5-sentinel-3b-hxq	Transformer (security)	--	Yes

Citation

@software{hxq_2026,
  title={HXQ: Vector Quantization with Per-Group Affine Correction for Neural Network Weight Compression},
  author={Echo Labs},
  year={2026},
  url={https://github.com/echo313unfolding/helix-substrate}
}

License

Apache 2.0 (inherited from Zyphra/Zamba2-2.7B-instruct).


Safetensors

Model size: 3B params
Tensor types: I64, F32, BF16, F16

Model tree for EchoLabs33/zamba2-2.7b-instruct-hxq

Base model: Zyphra/Zamba2-2.7B
Finetuned: Zyphra/Zamba2-2.7B-instruct
Quantized (1): this model

Collection including EchoLabs33/zamba2-2.7b-instruct-hxq

HXQ Compressed Models: Calibration-free VQ-256 compression. Transformers, SSMs, hybrids. Beats GPTQ/AWQ quality. Zero calibration data. (10 items, updated Mar 30)

Evaluation results

Perplexity (GGUF, ctx=512) on WikiText-2 test set (self-reported): 22.653