TMax-27B-AWQ-BF16-INT4

An AWQ W4A16 (4-bit weight, 16-bit activation) quantization of allenai/tmax-27b, produced with llm-compressor and stored in the compressed-tensors pack-quantized format.

The quantization recipe mirrors cyankiwi/Qwen3.6-27B-AWQ-BF16-INT4: INT4 weights, group size 32, asymmetric, MSE observer — with a grafted MTP head so multi-token-prediction speculative decoding works (see below).

Quantization details

Property Value
Method AWQ (activation-aware) via llm-compressor 0.12.0
Format compressed-tensors / pack-quantized
Weight precision INT4 (num_bits=4, type=int)
Group size 32
Symmetric No (asymmetric, with zero-point)
Observer MSE
Activations BF16 (unquantized)
Calibration 512 samples, 2048 tokens, HuggingFaceH4/ultrachat_200k
Architecture Qwen3_5ForConditionalGeneration, language_model_only: true

What is quantized

Quantized to INT4: the full-attention and feed-forward Linear layers — self_attn.{q,k,v,o}_proj in the full-attention layers (indices 3, 7, 11, …, 63) and all mlp.{gate,up,down}_proj, in both the language model and the MTP layer.

Kept in BF16 (excluded from quantization):

  • All Gated DeltaNet (linear_attn.*) projections — the linear-attention layers of the Qwen3.5/3.6 hybrid architecture.
  • lm_head, the MTP fusion (mtp.fc), and all norms.

AWQ smoothing was scoped to the full-attention layers, since the DeltaNet layers do not expose q/k/v/o_proj to balance against.

Multi-Token Prediction (MTP)

allenai/tmax-27b ships without an MTP head (it was removed during training, together with the vision head), yet its config still declares mtp_num_hidden_layers: 1. Serving the bare model with MTP therefore runs speculation against random weights and yields a ~0% acceptance rate.

This repo grafts the MTP head from base Qwen3.6-27B (the same head cyankiwi/Qwen3.6-27B-AWQ-BF16-INT4 uses), quantized in the identical INT4/g32/asymmetric format. All dimensions match exactly (tmax is Qwen3.6-27B fine-tuned), so the graft is drop-in.

Measured acceptance rate (MTP=3, greedy): ~85% overall.

Draft position Acceptance
0 ~95%
1 ~88%
2 ~72%

Note: the MTP head is base Qwen3.6-27B's, not a tmax-native one (none exists — it was stripped from the release). It performs excellently despite tmax's DPPO distribution shift; the steeper decay at position 2 reflects that mismatch. Disable speculation (--speculative-config omitted) for a plain, still-correct model.

Usage (vLLM)

The model has no vision weights (the vision head was removed in training), so it serves as a language model. Vision is disabled via language_model_only: true in the config — no extra flag needed.

vllm serve bannert/tmax-27b-AWQ-BF16-INT4 \
  --served-model-name tmax-27b \
  --tensor-parallel-size 2 \
  --max-model-len 262144 \
  --kv-cache-dtype fp8 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_xml \
  --reasoning-parser qwen3 \
  --quantization compressed-tensors \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 3}'

Hardware notes

  • The INT4 weights are ~13 GB, but the BF16 Gated-DeltaNet layers add substantial memory. On 2× 24 GB GPUs (e.g. RTX 3090) use --tensor-parallel-size 2; a single 24 GB card overflows at large context.
  • The group-size-32 + asymmetric + zero-point kernels (CutlassW4A8 / Machete) require compute capability ≥ 9.0 (Hopper/Blackwell). On Ampere (RTX 3090, cc 8.6) vLLM falls back to Marlin/Triton, which work for this model's layer dimensions.

Base model

TMax 27B is a terminal-agent model trained with DPPO on top of Qwen 3.6 27B by Ai2. See the base model card for details, benchmarks, and intended use.

License

Apache 2.0, inherited from the base model. Intended for research and educational use per Ai2's Responsible Use Guidelines.

Downloads last month
-
Safetensors
Model size
29B params
Tensor type
I64
·
I32
·
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for bannert/tmax-27b-AWQ-BF16-INT4

Base model

Qwen/Qwen3.6-27B
Finetuned
allenai/tmax-27b
Quantized
(5)
this model

Paper for bannert/tmax-27b-AWQ-BF16-INT4