TMax-27B-AWQ-BF16-INT4

An AWQ W4A16 (4-bit weight, 16-bit activation) quantization of allenai/tmax-27b, produced with llm-compressor and stored in the compressed-tensors pack-quantized format.

The quantization recipe mirrors cyankiwi/Qwen3.6-27B-AWQ-BF16-INT4: INT4 weights, group size 32, asymmetric, MSE observer — with a grafted MTP head so multi-token-prediction speculative decoding works (see below).

Quantization details

Property	Value
Method	AWQ (activation-aware) via llm-compressor 0.12.0
Format	`compressed-tensors` / `pack-quantized`
Weight precision	INT4 (`num_bits=4`, `type=int`)
Group size	32
Symmetric	No (asymmetric, with zero-point)
Observer	MSE
Activations	BF16 (unquantized)
Calibration	512 samples, 2048 tokens, `HuggingFaceH4/ultrachat_200k`
Architecture	`Qwen3_5ForConditionalGeneration`, `language_model_only: true`

What is quantized

Quantized to INT4: the full-attention and feed-forward Linear layers — self_attn.{q,k,v,o}_proj in the full-attention layers (indices 3, 7, 11, …, 63) and all mlp.{gate,up,down}_proj, in both the language model and the MTP layer.

Kept in BF16 (excluded from quantization):

All Gated DeltaNet (linear_attn.*) projections — the linear-attention layers of the Qwen3.5/3.6 hybrid architecture.
lm_head, the MTP fusion (mtp.fc), and all norms.

AWQ smoothing was scoped to the full-attention layers, since the DeltaNet layers do not expose q/k/v/o_proj to balance against.

Multi-Token Prediction (MTP)

allenai/tmax-27b ships without an MTP head (it was removed during training, together with the vision head), yet its config still declares mtp_num_hidden_layers: 1. Serving the bare model with MTP therefore runs speculation against random weights and yields a ~0% acceptance rate.

This repo grafts the MTP head from base Qwen3.6-27B (the same head cyankiwi/Qwen3.6-27B-AWQ-BF16-INT4 uses), quantized in the identical INT4/g32/asymmetric format. All dimensions match exactly (tmax is Qwen3.6-27B fine-tuned), so the graft is drop-in.

Measured acceptance rate (MTP=3, greedy): ~85% overall.

Draft position	Acceptance
0	~95%
1	~88%
2	~72%

Note: the MTP head is base Qwen3.6-27B's, not a tmax-native one (none exists — it was stripped from the release). It performs excellently despite tmax's DPPO distribution shift; the steeper decay at position 2 reflects that mismatch. Disable speculation (--speculative-config omitted) for a plain, still-correct model.

Usage (vLLM)

The model has no vision weights (the vision head was removed in training), so it serves as a language model. Vision is disabled via language_model_only: true in the config — no extra flag needed.

vllm serve bannert/tmax-27b-AWQ-BF16-INT4 \
  --served-model-name tmax-27b \
  --tensor-parallel-size 2 \
  --max-model-len 262144 \
  --kv-cache-dtype fp8 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_xml \
  --reasoning-parser qwen3 \
  --quantization compressed-tensors \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 3}'

Hardware notes

The INT4 weights are ~13 GB, but the BF16 Gated-DeltaNet layers add substantial memory. On 2× 24 GB GPUs (e.g. RTX 3090) use --tensor-parallel-size 2; a single 24 GB card overflows at large context.
The group-size-32 + asymmetric + zero-point kernels (CutlassW4A8 / Machete) require compute capability ≥ 9.0 (Hopper/Blackwell). On Ampere (RTX 3090, cc 8.6) vLLM falls back to Marlin/Triton, which work for this model's layer dimensions.

Base model

TMax 27B is a terminal-agent model trained with DPPO on top of Qwen 3.6 27B by Ai2. See the base model card for details, benchmarks, and intended use.

License

Apache 2.0, inherited from the base model. Intended for research and educational use per Ai2's Responsible Use Guidelines.

Downloads last month: -

Safetensors

Model size

29B params

Tensor type

I64

I32

BF16

Model tree for bannert/tmax-27b-AWQ-BF16-INT4

Base model

Qwen/Qwen3.6-27B

Finetuned

allenai/tmax-27b

Quantized

(5)

this model

Paper for bannert/tmax-27b-AWQ-BF16-INT4

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

Paper • 2306.00978 • Published Jun 1, 2023 • 13