Qwable-v1-AWQ

AWQ 4-bit (W4A16) quantization of lordx64/Qwable-v1 — a 35B-total / 3B-active text generation Mixture-of-Experts model (Qwen3_5MoeForConditionalGeneration, Qwen3.6 family, with hybrid linear / full attention). Per the base model card it is text-only and aimed at reasoning, agentic tool-use, and coding (see Capabilities).

  • Variant: AWQ weight-only (W4A16), with int4 symmetric weights, group size 128, activation-aware scaling; activations stay BF16
  • Disk size: ~22 GB (vs ~72 GB in BF16, about 3.3× smaller)
  • Quantized by: sahilchachra
  • Tooling: llm-compressor AWQ (oneshot), activation-aware, calibrated on general instruct chat (UltraChat-200k)

Note on what is quantized: only the linear weights that hold the bulk of the parameters are taken to int4, namely the 256-way routed experts, the shared experts, and the full-attention projections. The linear-attention (Gated-Delta-Net, Mamba-style) layers, the MoE routers, embeddings, lm_head, the MTP head and all norms are kept in BF16 for stability. The architecture also carries a vision tower (Qwen3_5MoeForConditionalGeneration), which is likewise kept in BF16; because the base model is documented as text-only, this quantization neither adds nor validates any image capability. The headline variant name reflects the dominant (expert/attention) quantization; the on-disk size is the sum of the int4 tensors (plus their per-group scales) and the tensors left in BF16.
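
As a rough illustration of how the on-disk size mixes int4 and BF16 tensors, the sketch below estimates checkpoint size from a parameter count and an int4 fraction. It assumes one 16-bit scale per group of 128 weights and no zero-point (symmetric); the `int4_fraction` argument is hypothetical and is not a figure taken from this checkpoint.

```python
def bits_per_weight(weight_bits=4, group_size=128, scale_bits=16):
    """Effective storage bits per quantized weight.

    Assumption: one scale of `scale_bits` is stored per group, and no
    zero-point is stored because the scheme is symmetric.
    """
    return weight_bits + scale_bits / group_size


def estimate_size_gb(total_params, int4_fraction, dense_bits=16):
    """Estimate checkpoint size in GB (1 GB = 1e9 bytes).

    `int4_fraction` is the share of parameters stored as int4 (hypothetical
    input); the remainder is assumed to be stored in BF16.
    """
    avg_bits = int4_fraction * bits_per_weight() + (1 - int4_fraction) * dense_bits
    return total_params * avg_bits / 8 / 1e9


# Sweep a few hypothetical int4 fractions for a 35B-parameter model.
for fraction in (0.0, 0.5, 0.9, 1.0):
    print(f"int4 fraction {fraction:.1f}: ~{estimate_size_gb(35e9, fraction):.1f} GB")
```

The estimate ignores file headers and any small metadata tensors, so it is only a back-of-envelope check against the sizes quoted above.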

Capabilities

Unchanged from the base model: quantization changes weight precision, not the prompt format or intended behavior, although output quality has not been benchmarked (see Notes). Per the base model card:

  • Reasoning — thinks in explicit <think>…</think> chains-of-thought.
  • Agentic tool-use — emits <tool_use> XML blocks for file/shell operations (activates with agent-style system prompts or prior <tool_result> turns).
  • Coding — designed for agentic coding tasks with multi-turn agent interactions.
  • Context length: 4096 tokens (training) / 16384 tokens (serving).
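
Because the model interleaves reasoning, tool calls, and answer text, client code usually has to separate them. The following is a minimal sketch of that split, assuming each block is closed with a matching `</think>` or `</tool_use>` tag; the helper name `split_output` and the inner tool XML used in the sample are hypothetical, so check the base card for the exact schema.

```python
import re

# Non-greedy matches so several blocks in one response are kept separate.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
TOOL_USE_RE = re.compile(r"<tool_use>(.*?)</tool_use>", re.DOTALL)


def split_output(text):
    """Split raw model text into reasoning, raw tool-call bodies, and the rest."""
    thoughts = [m.strip() for m in THINK_RE.findall(text)]
    tool_calls = [m.strip() for m in TOOL_USE_RE.findall(text)]
    # Whatever is left after removing both block types is the visible answer.
    answer = TOOL_USE_RE.sub("", THINK_RE.sub("", text)).strip()
    return {"thoughts": thoughts, "tool_calls": tool_calls, "answer": answer}


# Hypothetical sample; the <name> element is illustrative only.
sample = "<think>plan</think>Listing files.<tool_use><name>ls</name></tool_use>"
parts = split_output(sample)
print(parts["thoughts"], parts["tool_calls"], parts["answer"])
```

The tool-call bodies are returned unparsed on purpose, since the element names inside `<tool_use>` are not specified here.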

See the base card for limitations (narrow training distribution, tool-name differences, reasoning inherited from the Opus-4.7 distill).

Smoke test

Loaded and run with transformers on an NVIDIA Thor (Blackwell) device. The model loads, runs the hybrid linear-attention + int4 MoE path, and produces coherent text from a chat-templated prompt. A structure census confirms only the intended decoder Linears are int4 (routed experts, shared expert, full-attention q/k/v/o) with the routers, linear-attention, vision, MTP and norms left in BF16. This is a functional smoke test only — it is not a quality benchmark.
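
A structure census of this kind can be approximated from tensor names alone, without loading the model. The sketch below is not the script used for the smoke test; it assumes quantized Linears expose a `weight_packed` tensor (the naming used by the compressed-tensors pack-quantized format) and that the checkpoint ships the usual `model.safetensors.index.json` shard index. The function names are hypothetical.

```python
import json

PACKED_SUFFIX = ".weight_packed"  # assumed marker of an int4-packed Linear
DENSE_SUFFIX = ".weight"          # ordinary unquantized weight tensor


def census(tensor_names):
    """Return (quantized_modules, dense_modules) as sets of module names."""
    quantized, dense = set(), set()
    for name in tensor_names:
        if name.endswith(PACKED_SUFFIX):
            quantized.add(name[: -len(PACKED_SUFFIX)])
        elif name.endswith(DENSE_SUFFIX):
            dense.add(name[: -len(DENSE_SUFFIX)])
        # Scales, shapes and biases are ignored; they do not define a module's precision.
    return quantized, dense


def load_tensor_names(index_path):
    """Read tensor names from a sharded-checkpoint index file."""
    with open(index_path) as f:
        return list(json.load(f)["weight_map"])


# Usage, assuming the checkpoint was downloaded to ./Qwable-v1-AWQ:
# quantized, dense = census(load_tensor_names("Qwable-v1-AWQ/model.safetensors.index.json"))
# print(len(quantized), "int4 modules;", len(dense), "unquantized modules")
```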

Test device

  • GPU: NVIDIA Thor (Blackwell)
  • CPU / memory: 14-core ARM (aarch64), 122 GB unified memory
  • Software: JetPack / L4T R38.4 (Ubuntu 24.04), CUDA 13.0, driver 580, kernel 6.8.12-tegra

What's quantized

| Quantized → int4 (AWQ W4A16) | Kept in BF16 |
| --- | --- |
| Routed experts (mlp.experts.*.{gate,up,down}_proj, 40 layers × 256 experts) | Linear / Gated-Delta-Net layers (*.linear_attn.*) |
| Shared experts (mlp.shared_expert.{gate,up,down}_proj) | MoE routers (mlp.gate), shared-expert gates |
| Full-attention projections (self_attn.{q,k,v,o}_proj) | Embeddings, lm_head, MTP head, all norms |
| | Vision tower (model.visual.*), present in the arch, unused for text |
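
The split above can be written down as name-matching rules, which is handy when checking a checkpoint or writing a quantization ignore list. This is a sketch based only on the patterns listed here; the full module prefixes in the example (such as `model.language_model.layers.3`) and the `linear_attn` submodule name are hypothetical.

```python
import re

# Module-name patterns expected to be int4, taken from the list above.
INT4_PATTERNS = [
    r"\.mlp\.experts\.\d+\.(gate|up|down)_proj$",   # routed experts
    r"\.mlp\.shared_expert\.(gate|up|down)_proj$",  # shared experts
    r"\.self_attn\.(q|k|v|o)_proj$",                # full-attention projections
]


def expected_precision(module_name):
    """Return "int4" or "bf16" for a module name, per the rules above."""
    if module_name.startswith("model.visual."):
        return "bf16"  # the vision tower is never quantized
    if any(re.search(pattern, module_name) for pattern in INT4_PATTERNS):
        return "int4"
    return "bf16"  # routers, linear attention, embeddings, lm_head, MTP, norms


# Hypothetical module names for illustration.
for name in (
    "model.language_model.layers.3.mlp.experts.17.up_proj",
    "model.language_model.layers.3.mlp.gate",
    "model.language_model.layers.0.linear_attn.in_proj",
):
    print(name, "->", expected_precision(name))
```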

Usage (vLLM)

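from vllm import LLM, SamplingParams

# Load the compressed-tensors W4A16 checkpoint; activations run in BF16.
llm = LLM(model="sahilchachra/Qwable-v1-AWQ", dtype="bfloat16", max_model_len=16384, trust_remote_code=True)
# generate() sends the prompt as raw text; it does not apply the chat template.
out = llm.generate(["Hello!"], SamplingParams(temperature=0.7, top_p=0.9, max_tokens=128))
# Print the text of the first completion.
print(out[0].outputs[0].text)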

Runs on GPUs with compressed-tensors W4A16 support (vLLM unpacks the int4 weights for you).

Notes

  • Weight-only AWQ (W4A16): weights are int4 (group size 128, symmetric, activation-aware scales), activations remain BF16.
  • Format: pack-quantized (compressed-tensors), per-expert layout — the standard layout vLLM consumes for quantized MoE.
  • Loading requires compressed-tensors and a recent transformers (the qwen3_5_moe architecture).
  • Smoke-tested only; not formally benchmarked for quality.
  • Sibling quantization: sahilchachra/Qwable-v1-NVFP4A16 (NVFP4 for Blackwell GPUs).
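
To make the W4A16 terms above concrete, here is a minimal pure-Python sketch of symmetric group-wise int4 quantization and nibble packing. It is a simplification: it omits the activation-aware scale search that AWQ performs, and the packing order shown is illustrative rather than the exact bit layout used by compressed-tensors. All function names are hypothetical.

```python
def quantize_group(weights, bits=4):
    """Symmetric quantization of one group: a single scale, no zero-point."""
    qmax = 2 ** (bits - 1) - 1   # 7 for int4
    qmin = -(2 ** (bits - 1))    # -8 for int4
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return q, scale


def dequantize_group(q, scale):
    """Recover approximate weights; this is what the matmul effectively sees."""
    return [v * scale for v in q]


def pack_int4(values):
    """Pack eight signed int4 values into one 32-bit word (illustrative order)."""
    assert len(values) == 8
    word = 0
    for i, v in enumerate(values):
        word |= (v & 0xF) << (4 * i)  # two's-complement nibble
    return word


def unpack_int4(word):
    """Inverse of pack_int4: recover the eight signed values."""
    out = []
    for i in range(8):
        nibble = (word >> (4 * i)) & 0xF
        out.append(nibble - 16 if nibble >= 8 else nibble)
    return out


# A real group would hold 128 weights; four are used here for brevity.
q, scale = quantize_group([7.0, -3.0, 0.0, 1.0])
print(q, scale, dequantize_group(q, scale))
```

With group size 128, each run of 128 weights along the input dimension would share one such scale, which is where the extra fraction of a bit per weight comes from.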

Original model

See lordx64/Qwable-v1 for full lineage, intended use, and limitations. License (AGPL-3.0) is inherited from the base model.
