Zamba2-2.7B-Instruct-HXQ
Zamba2-2.7B-Instruct compressed with HXQ (HelixCode vector quantization).
Available as both HuggingFace safetensors (via helix-substrate) and native GGUF (via the llama.cpp HXQ fork). First hybrid Mamba2+Transformer architecture with HXQ runtime benchmarks.
GGUF Runtime Benchmark (RTX 3090)
Benchmarked against standard GGUF K-quants on RTX 3090, full GPU offload (-ngl 99), using the hxq-affine-type branch at commit 580e9a2.
Decode Speed (tg128, 3 runs)
| Format | Size | bpw | tok/s | vs Q4 | vs Q6 |
|---|---|---|---|---|---|
| Q4_K_M | 2.11 GB | 4.5 | 47.30 | 100% | 105.2% |
| HXQ_AF6 | 2.79 GB | 6.27 | 45.87 | 97.0% | 102.0% |
| Q6_K | 2.93 GB | 6.58 | 44.98 | 95.1% | 100% |
| Q5_K_M | 2.51 GB | 5.62 | 43.38 | 91.7% | 96.4% |
Perplexity (WikiText-2, 654 chunks, ctx=512)
| Format | bpw | PPL | vs Q4 |
|---|---|---|---|
| Q6_K | 6.58 | 22.573 | -0.705 (best) |
| HXQ_AF6 | 6.27 | 22.653 | -0.626 (2nd) |
| Q5_K_M | 5.62 | 22.743 | -0.535 |
| Q4_K_M | 4.5 | 23.278 | baseline |
Prefill (pp512, 3 runs)
| Format | tok/s | vs Q4 |
|---|---|---|
| Q4_K_M | 325.24 | 100% |
| Q6_K | 322.19 | 99.1% |
| Q5_K_M | 321.32 | 98.8% |
| HXQ_AF6 | 318.55 | 97.9% |
Summary: HXQ_AF6 decodes faster than both Q6_K (+2.0%) and Q5_K_M (+5.7%) while being smaller than Q6_K (2.79 vs 2.93 GB). Its PPL is second-best, only 0.080 behind Q6_K. Prefill throughput differs by about 2% at most across all formats (the workload is SSM-dominated, not matmul-dominated). This is the first HXQ runtime benchmark on a hybrid Mamba2+Transformer architecture.
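The percentages in the summary can be rechecked from the table values. The following is a small sketch that only re-derives the ratios from the numbers reported above; the dictionary and helper names are illustrative and not part of any HXQ tooling.

```python
# Values copied from the decode, prefill, and perplexity tables above.
decode_tok_s = {"Q4_K_M": 47.30, "HXQ_AF6": 45.87, "Q6_K": 44.98, "Q5_K_M": 43.38}
prefill_tok_s = {"Q4_K_M": 325.24, "Q6_K": 322.19, "Q5_K_M": 321.32, "HXQ_AF6": 318.55}
ppl = {"Q6_K": 22.573, "HXQ_AF6": 22.653, "Q5_K_M": 22.743, "Q4_K_M": 23.278}


def pct_faster(a, b):
    """Percentage by which throughput a exceeds throughput b."""
    return (a / b - 1.0) * 100.0


# Decode speed of HXQ_AF6 relative to the two nearest K-quants.
vs_q6 = pct_faster(decode_tok_s["HXQ_AF6"], decode_tok_s["Q6_K"])
vs_q5 = pct_faster(decode_tok_s["HXQ_AF6"], decode_tok_s["Q5_K_M"])

# Perplexity gap to the best format (lower PPL is better).
ppl_gap = ppl["HXQ_AF6"] - ppl["Q6_K"]

# Gap between the slowest and fastest prefill result, in percent.
prefill_spread = (1.0 - min(prefill_tok_s.values()) / max(prefill_tok_s.values())) * 100.0

print(f"decode vs Q6_K:   {vs_q6:+.1f}%")
print(f"decode vs Q5_K_M: {vs_q5:+.1f}%")
print(f"PPL gap to Q6_K:  {ppl_gap:.3f}")
print(f"prefill spread:   {prefill_spread:.1f}%")
```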
Reproducibility
All claims are within-run comparisons using the same dataset, llama.cpp commit, and hardware. Do not compare these PPL numbers with numbers from other runs using different model variants, dataset files, or build configurations.
Note: Zamba2 absolute PPL (22) is higher than Qwen (8-10) on this dataset due to different tokenizer and training distribution. The important metric is relative ranking within this run.
Receipt with SHA256 artifact hashes, exact commands, and dataset provenance: hxq_runtime_3090_zamba2_2.7b_20260509
Install and Run
Option 1: Native GGUF (llama.cpp)
# Build llama.cpp with HXQ + Zamba2 support
git clone -b hxq-affine-type https://github.com/echo313unfolding/llama.cpp.git
cd llama.cpp && mkdir build && cd build
cmake .. -DGGML_CUDA=ON && make -j$(nproc) llama-cli
# Run
./bin/llama-cli -m zamba2-2.7b-instruct-hxq-affine6.gguf \
-ngl 99 -p "Explain the theory of relativity in simple terms:" -n 128
Option 2: HuggingFace (Python)
pip install "helix-substrate[hf]"
import helix_substrate # registers the HXQ quantizer with HuggingFace
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("EchoLabs33/zamba2-2.7b-instruct-hxq")
tokenizer = AutoTokenizer.from_pretrained("EchoLabs33/zamba2-2.7b-instruct-hxq")
inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Safetensors Benchmark
| | Dense (BF16) | HXQ (safetensors) |
|---|---|---|
| Size | 5.1 GB | 2.8 GB |
| Perplexity (WikiText-2, 2048 ctx) | 5.33 | 5.68 (+6.59%) |
| Compression ratio | 1x | 1.83x |
| Compressed modules | 0 | 147 HelixLinear layers |
Note: The safetensors PPL (5.68) and GGUF PPL (22.653) use different evaluation configurations (ctx=2048/stride=512 vs ctx=512/654 chunks) and different tokenization. They are not directly comparable.
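To see why perplexity figures from different evaluation setups do not line up, recall the standard definition: perplexity is the exponential of the mean negative log-likelihood per token. The sketch below uses made-up log-probabilities (not measurements from this model) to show that the same total log-likelihood spread over a different number of tokens yields a different PPL.

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood (natural log) per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Hypothetical numbers: the same text scored with a total log-likelihood
# of -12.0, once split into 4 tokens and once into 8 tokens.
coarse = [-3.0] * 4   # 4 tokens, -3.0 each
fine = [-1.5] * 8     # 8 tokens, -1.5 each

ppl_coarse = perplexity(coarse)
ppl_fine = perplexity(fine)
print(f"4 tokens: PPL = {ppl_coarse:.2f}")
print(f"8 tokens: PPL = {ppl_fine:.2f}")
```

Because the average is taken per token, a change of tokenizer, context length, or chunking changes what is being averaged, which is why only within-run rankings are meaningful.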
Good to Know
- GPU and CPU supported: runs on any CUDA GPU or CPU via standard PyTorch. Native GGUF runs via llama.cpp.
- Hybrid architecture: 45 Mamba2 layers + 9 shared Transformer layers. SSM tensors (ssm_in, ssm_out, ssm_mix) and the shared FFN are all compressed.
- Fine-tunable via LoRA: compressed weights remain frozen, but LoRA adapters attach to each HelixLinear layer via HelixLinearSTE. See helix-substrate for training infrastructure.
- Requires helix-substrate for the safetensors path: the quantizer is not built into transformers.
- Requires the llama.cpp HXQ fork for the GGUF path: standard llama.cpp does not have HXQ type support yet. The hxq-affine-type branch also includes Zamba2 architecture support.
What is HXQ?
HXQ is a weight compression codec based on vector quantization with per-group affine correction:
- Each weight matrix is replaced by a 256-entry codebook + uint8 index matrix + per-group affine scale/offset
- The compressed form is the executable: codebook[indices] * scale + offset during matmul, no decompression step
- Works on any nn.Linear regardless of architecture (Transformer, Mamba, MLP)
- No calibration data required: codebooks are fit from the weights alone via k-means
- 6.27 bits per weight in the GGUF affine-6 format
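The codec described above can be illustrated with a toy, pure-Python sketch. This is not the helix-substrate implementation: the scalar codebook entries, the group size of 64, the least-squares affine fit, and all function names are assumptions made for illustration. The sketch also materializes the restored weights, whereas the real kernels are described as applying the lookup during matmul.

```python
import bisect
import random


def nearest(centers, v):
    """Index of the centroid closest to v (centers must be sorted)."""
    i = bisect.bisect_left(centers, v)
    if i == 0:
        return 0
    if i == len(centers):
        return len(centers) - 1
    return i if centers[i] - v < v - centers[i - 1] else i - 1


def fit_codebook(values, k, iters=5, seed=0):
    """Fit k scalar centroids to the weights with plain 1-D k-means."""
    rng = random.Random(seed)
    centers = sorted(rng.sample(values, k))
    for _ in range(iters):
        sums, counts = [0.0] * k, [0] * k
        for v in values:
            j = nearest(centers, v)
            sums[j] += v
            counts[j] += 1
        # Empty clusters keep their previous centroid.
        centers = sorted(
            sums[j] / counts[j] if counts[j] else centers[j] for j in range(k)
        )
    return centers


def fit_affine(targets, approx):
    """Least-squares scale/offset so that approx * scale + offset ~ targets."""
    n = len(targets)
    mt, ma = sum(targets) / n, sum(approx) / n
    var = sum((a - ma) ** 2 for a in approx)
    if var == 0.0:
        return 1.0, mt - ma
    cov = sum((a - ma) * (t - mt) for a, t in zip(approx, targets))
    scale = cov / var
    return scale, mt - scale * ma


def encode(weights, k=256, group=64):
    """Return (codebook, indices, per-group affine); group=64 is assumed."""
    codebook = fit_codebook(weights, k)
    indices = [nearest(codebook, w) for w in weights]  # each fits in a uint8
    affine = []
    for g in range(0, len(weights), group):
        approx = [codebook[i] for i in indices[g:g + group]]
        affine.append(fit_affine(weights[g:g + group], approx))
    return codebook, indices, affine


def decode(codebook, indices, affine, group=64):
    """Reconstruct weights as codebook[index] * scale + offset."""
    out = []
    for pos, i in enumerate(indices):
        scale, offset = affine[pos // group]
        out.append(codebook[i] * scale + offset)
    return out


rng = random.Random(1)
weights = [rng.gauss(0.0, 0.02) for _ in range(1024)]  # synthetic weights
codebook, indices, affine = encode(weights)
restored = decode(codebook, indices, affine)
mse = sum((w - r) ** 2 for w, r in zip(weights, restored)) / len(weights)
print(f"entries: {len(codebook)}, groups: {len(affine)}, mse: {mse:.3e}")
```

Note that only the weights themselves are used to fit the codebook and the affine terms, which matches the "no calibration data" property listed above.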
Companion Models
Same codec, multiple architectures:
| Model | Architecture | GGUF | Safetensors |
|---|---|---|---|
| qwen2.5-7b-instruct-hxq | Transformer | Yes | Yes |
| qwen2.5-coder-3b-hxq | Transformer (code) | Yes | Yes |
| qwen2.5-3b-instruct-hxq | Transformer | Yes | Yes |
| qwen2.5-coder-1.5b-hxq | Transformer (code) | Yes | Yes |
| qwen2.5-14b-instruct-hxq | Transformer | Yes | Yes |
| qwen2.5-sentinel-3b-hxq | Transformer (security) | -- | Yes |
Citation
@software{hxq_2026,
title={HXQ: Vector Quantization with Per-Group Affine Correction for Neural Network Weight Compression},
author={Echo Labs},
year={2026},
url={https://github.com/echo313unfolding/helix-substrate}
}
License
Apache 2.0 (inherited from Zyphra/Zamba2-2.7B-instruct).