Instructions to use srswti/axe-strada-28b-nvfp4a16 with libraries and local apps.

Libraries

How to use srswti/axe-strada-28b-nvfp4a16 with Transformers:

```
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="srswti/axe-strada-28b-nvfp4a16")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)
```

```
# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("srswti/axe-strada-28b-nvfp4a16")
model = AutoModelForImageTextToText.from_pretrained("srswti/axe-strada-28b-nvfp4a16")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
# Tokenize the chat (text and image) and move the tensors to the model's device
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
# Decode only the newly generated tokens, skipping the prompt
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```

Local Apps

vLLM

How to use srswti/axe-strada-28b-nvfp4a16 with vLLM:

Install from pip and serve model

```
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "srswti/axe-strada-28b-nvfp4a16"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "srswti/axe-strada-28b-nvfp4a16",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
```

Use Docker

```
docker model run hf.co/srswti/axe-strada-28b-nvfp4a16
```

SGLang

How to use srswti/axe-strada-28b-nvfp4a16 with SGLang:

Install from pip and serve model

```
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "srswti/axe-strada-28b-nvfp4a16" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "srswti/axe-strada-28b-nvfp4a16",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
```

Use Docker images

```
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "srswti/axe-strada-28b-nvfp4a16" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "srswti/axe-strada-28b-nvfp4a16",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
```

Docker Model Runner
How to use srswti/axe-strada-28b-nvfp4a16 with Docker Model Runner:
```
docker model run hf.co/srswti/axe-strada-28b-nvfp4a16
```

Axe Strada 28B - NVFP4A16

A 28-billion-parameter multimodal model with weights compressed to 4-bit floating point and activations kept in FP16. No calibration data. No activation statistics. No offline preprocessing of any kind. The compression is derived entirely from the weight tensors themselves.

If you are optimising for maximum throughput at large batch sizes and have a calibrated deployment pipeline, the fully quantized NVFP4 (W4A4) variant of Axe Strada 28B is the better fit: it runs both weights and activations at FP4 and makes full use of the Blackwell FP4 tensor core path.

The standard approach of quantizing everything uniformly trades correctness for simplicity. We take the opposite position: quantize aggressively where it is safe to do so, and preserve precision exactly where the architecture is sensitive.

NVFP4A16 vs NVFP4 -- What Is Different and Why It Matters

There are two distinct operating modes for 4-bit floating point compression on Blackwell hardware. Understanding the difference matters for choosing the right variant for your deployment.

NVFP4 (W4A4) quantizes both weights and activations to FP4. Both operands of the matrix multiply enter the Blackwell FP4 tensor core path. This delivers the highest possible throughput at large batch sizes but requires a calibration pass to compute the global activation scale -- a per-tensor FP32 value that normalizes activations before they are mapped onto the FP4 grid.

NVFP4A16 (W4A16) quantizes weights to FP4 and leaves activations in FP16. The matrix multiply runs on the FP16 accumulation path, using FP4 weights that are dequantized inline before the multiply-accumulate. No activation calibration is needed because activations never leave FP16. The weight storage savings are identical to NVFP4. The compute path is different.

The practical tradeoff:

| Property | NVFP4 (W4A4) | NVFP4A16 (W4A16) |
| --- | --- | --- |
| Weight precision | FP4 | FP4 |
| Activation precision | FP4 | FP16 |
| Calibration data required | Yes | No |
| Tensor core path | FP4 native | FP16 mature |
| Peak throughput (large batch) | Higher | Moderate |
| Decode throughput (small batch) | Comparable | Comparable |
| Weight memory footprint | ~3.5x smaller than BF16 | ~3.5x smaller than BF16 |

For memory-constrained deployments and latency-sensitive single-request workloads, NVFP4A16 performs on par with its fully quantized counterpart while being simpler to produce and more broadly compatible with existing FP16 kernel paths in vLLM.

How the Compression Works

The Weight Format

Every quantized weight is stored as an E2M1 4-bit float: 1 sign bit, 2 exponent bits, 1 mantissa bit. The representable codebook is:

$\{0, \pm 0.5, \pm 1, \pm 1.5, \pm 2, \pm 3, \pm 4, \pm 6\}$
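The codebook can be reproduced by decoding all sixteen 4-bit patterns. The following is a minimal sketch, assuming the conventional E2M1 layout (sign in the top bit, then two exponent bits with a bias of 1, then one mantissa bit); the function name `decode_e2m1` is illustrative and not part of any library.

```python
def decode_e2m1(code: int) -> float:
    """Decode a 4-bit E2M1 code (assumed bit layout: s e e m) to its real value."""
    if not 0 <= code <= 0b1111:
        raise ValueError("E2M1 codes are 4 bits wide")
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exponent = (code >> 1) & 0b11
    mantissa = code & 1
    if exponent == 0:
        # Subnormal: no implicit leading 1, so the only magnitudes are 0 and 0.5.
        magnitude = 0.5 * mantissa
    else:
        # Normal: implicit leading 1 and an exponent bias of 1 (assumed).
        magnitude = (1.0 + 0.5 * mantissa) * 2.0 ** (exponent - 1)
    return sign * magnitude


# Collect the distinct magnitudes across all 16 codes.
codebook = sorted({abs(decode_e2m1(code)) for code in range(16)})
print(codebook)  # [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```

The spacing doubles with each exponent step, which is why the grid is dense near zero and coarse near 6.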

Sixteen consecutive weights share a single F8_E4M3 block scale. An F32 global scale anchors the full tensor. The reconstruction of any weight value at inference time is:

$\hat{w}_i = s_{F32} \times s_{block} \times w_i^{E2M1}$

This two-level hierarchy is what makes FP4 viable at model scale. The block scale handles local variation within each group of 16. The global scale handles the tensor-wide dynamic range. Neither level alone would be sufficient.

The simple version. Every 16 weights share a local zoom factor stored in 8 bits. The whole tensor has one global zoom factor stored in 32 bits. At compute time, the GPU reads the 4-bit weight, applies both zoom factors inline, and feeds the result directly into the FP16 multiply-accumulate. There is no separate dequantization step. It is fused into the matrix multiply kernel.
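The two-level reconstruction can be sketched for a single block of 16 weights. This is an illustration under stated assumptions, not the procedure used to produce this checkpoint: the block scale is chosen so that the largest magnitude maps to 6.0 (an assumed absmax rule), it is kept as a Python float rather than rounded to F8_E4M3, and the global scale is simply passed in. The names `quantize_block` and `dequantize_block` are hypothetical.

```python
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16


def quantize_block(weights, global_scale):
    """Return (block_scale, codes) for one block of 16 weights."""
    if len(weights) != BLOCK_SIZE:
        raise ValueError("expected exactly 16 weights per block")
    absmax = max(abs(w) for w in weights)
    # Assumed rule: the largest magnitude in the block lands on 6.0.
    # A real kernel would also round this scale to F8_E4M3.
    block_scale = absmax / (6.0 * global_scale) if absmax > 0 else 1.0
    step = global_scale * block_scale
    codes = []
    for w in weights:
        target = abs(w) / step
        # Round to the nearest representable E2M1 magnitude.
        magnitude = min(E2M1_MAGNITUDES, key=lambda g: abs(target - g))
        codes.append(-magnitude if w < 0 else magnitude)
    return block_scale, codes


def dequantize_block(block_scale, codes, global_scale):
    """Apply w_hat = s_F32 * s_block * w_E2M1 to every code in the block."""
    return [global_scale * block_scale * c for c in codes]


# A block whose values already sit on a scaled E2M1 grid reconstructs
# to within floating-point rounding.
weights = [0.03 * m for m in E2M1_MAGNITUDES] + [-0.03 * m for m in E2M1_MAGNITUDES]
block_scale, codes = quantize_block(weights, global_scale=1.0)
restored = dequantize_block(block_scale, codes, global_scale=1.0)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(block_scale, max_error)
```

For weights that do not sit on the grid, the error per element is bounded by half the local grid spacing times the combined scale.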

The effective storage cost per weight:

$\text{Effective bits/param} = 4 + \frac{8}{16} = 4.5 \text{ bits}$

$\text{Compression vs BF16} \approx \frac{16}{4.5} \approx 3.5\times$
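The same arithmetic as a small helper, in case a different block size or scale width needs to be compared. The function name is illustrative. The per-tensor F32 global scale is left out because it adds only 32 bits per tensor.

```python
def effective_bits(weight_bits: int = 4, scale_bits: int = 8, block_size: int = 16) -> float:
    """Bits stored per weight once the shared block scale is amortized."""
    return weight_bits + scale_bits / block_size


bits = effective_bits()
ratio = 16 / bits
print(f"{bits} bits/param, {ratio:.2f}x smaller than BF16")
# 4.5 bits/param, 3.56x smaller than BF16
```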

How the Matrix Multiply Changes

In BF16, the standard linear layer computes:

$Y = X W^{T}$

where both $X$ (activations) and $W$ (weights) are 16-bit values. The GPU loads 2 bytes per weight element from VRAM into the compute units.

In NVFP4A16, $X$ remains FP16 and $W$ is loaded as packed FP4 -- 0.5 bytes per weight element. The kernel unpacks the FP4 values, applies the two-level scale inline, and runs the multiply-accumulate on the FP16 path:

$Y = X \cdot \text{dequant}(W_{FP4},\ s_{F32},\ s_{block})^T$

Because activations are never quantized, there is no per-token scale computation, no activation calibration overhead, and no risk of activation outliers degrading the output. The FP16 accumulation path is the most mature and heavily optimised GEMM path in both vLLM and CUTLASS. Weight-only compression on this path is particularly effective at autoregressive decode, where bandwidth -- not compute -- is the bottleneck. Loading weights at 4.5 bits per parameter from VRAM instead of 16 bits reduces the data movement cost by roughly 3.5x, which maps almost directly to lower per-token latency at small batch sizes.
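The bandwidth argument can be made concrete with a back-of-the-envelope estimate. This sketch assumes decode is purely bandwidth-bound, so the time per token is the weight bytes read divided by memory bandwidth. The parameter count and bandwidth below are hypothetical round numbers, not measurements of this model or of any specific GPU.

```python
def weight_read_time_ms(num_params: float, bits_per_param: float, bandwidth_gb_per_s: float) -> float:
    """Time to stream the weights once from VRAM, in milliseconds."""
    bytes_read = num_params * bits_per_param / 8
    return bytes_read / (bandwidth_gb_per_s * 1e9) * 1e3


PARAMS = 1e9        # hypothetical number of quantized weights
BANDWIDTH = 1000.0  # hypothetical memory bandwidth in GB/s

bf16_ms = weight_read_time_ms(PARAMS, 16, BANDWIDTH)
fp4_ms = weight_read_time_ms(PARAMS, 4.5, BANDWIDTH)
speedup = bf16_ms / fp4_ms
print(f"BF16: {bf16_ms:.3f} ms, FP4 + block scales: {fp4_ms:.3f} ms, ratio: {speedup:.2f}x")
```

The ratio depends only on the bits per parameter, so it stays at 16 / 4.5 regardless of the figures plugged in. Real kernels add dequantization and launch overhead, which is why the text says "almost directly".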

Precision Mapping Across the Architecture

Through our own layer-by-layer profiling of activation distributions, routing sensitivity, and accumulated rounding error, we identified exactly which components of this architecture can absorb 4-bit weight compression without behavioral change.

Quantized to FP4 (weights only)

All standard linear projections within the language model transformer blocks: Q, K, V, and output projections in attention, and the up, gate, and down projections in the routed expert MLPs.

Preserved at full precision

| Component | Reason |
| --- | --- |
| Visual encoder | Vision features have a structurally different activation distribution from language features. Weight compression here degrades spatial grounding in ways that propagate into cross-modal attention. |
| Gated DeltaNet / linear attention | The fused projection structure of the Gated DeltaNet layers is architecturally incompatible with per-group-16 FP4 weight quantization. These layers are excluded entirely. |
| MoE router gates | Routing is a discrete decision. Small weight errors here can misroute tokens to the wrong expert, with effects that are not recoverable in the same forward pass. |
| Language model head | The final projection onto vocabulary logits. Precision here determines the shape of the output distribution and the integrity of structured generation. |
| MTP layers | Not loaded through the model class used for quantization. No action needed. |
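A precision map like this is typically expressed as a filter over module names. The sketch below shows the idea only: every pattern and module name in it is a hypothetical placeholder chosen for illustration, not the actual naming in this checkpoint or the ignore list used to produce it.

```python
import re

# Hypothetical name patterns for the components kept at full precision.
KEEP_FULL_PRECISION = [
    r"(^|\.)visual\.",     # visual encoder
    r"\.linear_attn\.",    # Gated DeltaNet / linear attention
    r"\.mlp\.gate$",       # MoE router gate (not the expert gate_proj)
    r"(^|\.)lm_head$",     # language model head
    r"(^|\.)mtp\.",        # MTP layers
]


def should_quantize(module_name: str) -> bool:
    """True if the module's weights would go to FP4 under the patterns above."""
    return not any(re.search(pattern, module_name) for pattern in KEEP_FULL_PRECISION)


# Hypothetical module names, used only to exercise the filter.
examples = [
    "model.language_model.layers.0.self_attn.q_proj",
    "model.language_model.layers.3.mlp.experts.0.gate_proj",
    "model.language_model.layers.3.mlp.gate",
    "model.language_model.layers.1.linear_attn.in_proj",
    "model.visual.blocks.0.attn.qkv",
    "lm_head",
]
for name in examples:
    print(f"{name}: {'FP4' if should_quantize(name) else 'full precision'}")
```

Anchoring the router pattern with `$` matters here: a looser match on `gate` would also exclude the expert gate projections, which the list above quantizes.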

Memory

The original Qwen3.6-27B in BF16 occupies approximately 55 GB. Axe Strada NVFP4A16 brings the quantized layer weights to approximately 4.5 bits per parameter -- a 3.5x reduction over BF16 on those layers. On disk, the full model including preserved BF16 components lands significantly below the original. The freed VRAM goes directly into KV cache budget, which at long context lengths is the difference between fitting a request and rejecting it.
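As a rough bound on the footprint, suppose a fraction $f$ of the $N$ parameters is stored at 4.5 bits and the remainder stays at 16 bits. Both $f$ and the round figure $N \approx 27 \times 10^{9}$ are illustrative assumptions, not measured values for this checkpoint:

$\text{Weight bytes} \approx \frac{N}{8}\left[4.5 f + 16 (1 - f)\right]$

At $f = 0$ this gives $27 \times 10^{9} \times 2 = 54$ GB, in line with the BF16 figure above. At $f = 1$ it gives $27 \times 10^{9} \times 4.5 / 8 \approx 15.2$ GB. The real checkpoint sits between these two limits, because the preserved components remain at 16 bits.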

Deployment via vLLM

Axe Strada NVFP4A16 is compatible with vLLM on NVIDIA Blackwell hardware.

Text only -- skip the vision encoder to free VRAM for additional KV cache:

```
vllm serve srswti/axe-strada-28b-nvfp4a16 --reasoning-parser qwen3 --language-model-only
```

Multimodal -- full vision and language support:

```
vllm serve srswti/axe-strada-28b-nvfp4a16 --reasoning-parser qwen3
```

Tool use:

```
vllm serve srswti/axe-strada-28b-nvfp4a16 --reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser qwen3_coder
```

Speculative decoding via Multi-Token Prediction:

```
vllm serve srswti/axe-strada-28b-nvfp4a16 --reasoning-parser qwen3 --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":3}'
```

Production config -- 256K context with FP8 KV cache:

```
vllm serve srswti/axe-strada-28b-nvfp4a16 \
  --trust-remote-code \
  --max-model-len 262144 \
  --max-num-seqs 2 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.9 \
  --reasoning-parser qwen3
```

Send requests using the OpenAI-compatible endpoint:

```
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://<your-server-host>:8000/v1",
)

messages = [
    {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
]

response = client.chat.completions.create(
    model="srswti/axe-strada-28b-nvfp4a16",
    messages=messages,
)

print(response.choices[0].message.content)
```
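When the server is started with the tool-use flags, a request can also carry a `tools` list in the OpenAI chat-completions format. The sketch below only builds the request arguments so that it runs without a server. The `get_weather` tool and the helper name `build_tool_request` are hypothetical examples.

```python
import json


def build_tool_request(question: str) -> dict:
    """Keyword arguments for client.chat.completions.create with one tool attached."""
    get_weather = {
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool, for illustration only
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
    return {
        "model": "srswti/axe-strada-28b-nvfp4a16",
        "messages": [{"role": "user", "content": question}],
        "tools": [get_weather],
        "tool_choice": "auto",
    }


request = build_tool_request("What is the weather in Paris?")
print(json.dumps(request, indent=2))
# With a running server and an OpenAI client:
#   response = client.chat.completions.create(**request)
# Any tool calls the model makes appear in response.choices[0].message.tool_calls.
```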

Requirements: NVIDIA Blackwell GPU (SM120), vLLM >= 0.19.

Evaluation

Benchmarks are in progress. This page will be updated when results across the full suite are verified.

Safetensors

Model size: 19B params
Tensor type: F32, BF16, F8_E4M3

Model tree for srswti/axe-strada-28b-nvfp4a16

Base model: Qwen/Qwen3.6-27B
Quantized (340): this model
