Axe Strada 28B - NVFP4A16
A 28 billion parameter multimodal model with weights compressed to 4-bit floating point and activations kept at full FP16. No calibration data. No activation statistics. No offline preprocessing of any kind. The compression is derived entirely from the weight tensors themselves.
If you are optimising for maximum throughput at large batch sizes and have a calibrated deployment pipeline, the fully quantized NVFP4 (W4A4) variant of Axe Strada 28B runs both weights and activations at FP4 and makes full use of the Blackwell FP4 tensor core path.
The standard approach of quantizing everything uniformly trades correctness for simplicity. We take the opposite position: quantize aggressively where it is safe to do so, and preserve precision exactly where the architecture is sensitive.
NVFP4A16 vs NVFP4 -- What Is Different and Why It Matters
There are two distinct operating modes for 4-bit floating point compression on Blackwell hardware. Understanding the difference matters for choosing the right variant for your deployment.
NVFP4 (W4A4) quantizes both weights and activations to FP4. Both operands of the matrix multiply enter the Blackwell FP4 tensor core path. This delivers the highest possible throughput at large batch sizes but requires a calibration pass to compute the global activation scale -- a per-tensor FP32 value that normalizes activations before they are mapped onto the FP4 grid.
NVFP4A16 (W4A16) quantizes weights to FP4 and leaves activations in FP16. The matrix multiply runs on the FP16 accumulation path, using FP4 weights that are dequantized inline before the multiply-accumulate. No activation calibration is needed because activations never leave FP16. The weight storage savings are identical to NVFP4. The compute path is different.
The practical tradeoff:
| Property | NVFP4 (W4A4) | NVFP4A16 (W4A16) |
|---|---|---|
| Weight precision | FP4 | FP4 |
| Activation precision | FP4 | FP16 |
| Calibration data required | Yes | No |
| Tensor core path | Native FP4 | FP16 (mature) |
| Peak throughput (large batch) | Higher | Moderate |
| Decode throughput (small batch) | Comparable | Comparable |
| Weight memory footprint | ~3.5x smaller than BF16 | ~3.5x smaller than BF16 |
For memory-constrained deployments and latency-sensitive single-request workloads, NVFP4A16 performs on par with its fully quantized counterpart while being simpler to produce and more broadly compatible with existing FP16 kernel paths in vLLM.
How the Compression Works
The Weight Format
Every quantized weight is stored as an E2M1 4-bit float: 1 sign bit, 2 exponent bits, 1 mantissa bit. The representable codebook is:

$$\{0,\ \pm 0.5,\ \pm 1,\ \pm 1.5,\ \pm 2,\ \pm 3,\ \pm 4,\ \pm 6\}$$
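Sixteen consecutive weights share a single F8_E4M3 block scale. An F32 global scale anchors the full tensor. The reconstruction of any weight value at inference time is:

$$\hat{w}_i = q_i \cdot s_{\text{block}} \cdot s_{\text{global}}$$

where $q_i$ is the stored E2M1 value, $s_{\text{block}}$ is the F8_E4M3 scale of the block that contains weight $i$, and $s_{\text{global}}$ is the F32 per-tensor scale.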
This two-level hierarchy is what makes FP4 viable at model scale. The block scale handles local variation within each group of 16. The global scale handles the tensor-wide dynamic range. Neither level alone would be sufficient.
The simple version. Every 16 weights share a local zoom factor stored in 8 bits. The whole tensor has one global zoom factor stored in 32 bits. At compute time, the GPU reads the 4-bit weight, applies both zoom factors inline, and feeds the result directly into the FP16 multiply-accumulate. There is no separate dequantization step. It is fused into the matrix multiply kernel.
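As an illustration of the two-level reconstruction described above, here is a minimal pure-Python sketch that unpacks 4-bit codes and applies both scales. It is not the fused GPU kernel and not the checkpoint's actual storage layout: the nibble order (low nibble first), the helper names, and the example scale values are assumptions chosen for readability.

```python
# E2M1 magnitudes indexed by the low 3 bits (2 exponent bits, 1 mantissa bit).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # weights sharing one block scale


def decode_e2m1(code: int) -> float:
    """Map a 4-bit code (0..15) to its value; bit 3 is the sign bit."""
    magnitude = E2M1_MAGNITUDES[code & 0x7]
    return -magnitude if code & 0x8 else magnitude


def unpack_nibbles(packed: bytes) -> list[int]:
    """Split each byte into two 4-bit codes (low nibble first: an assumption)."""
    codes = []
    for byte in packed:
        codes.append(byte & 0x0F)
        codes.append(byte >> 4)
    return codes


def dequantize(packed: bytes, block_scales: list[float], global_scale: float) -> list[float]:
    """Reconstruct weights as code_value * block_scale * global_scale."""
    codes = unpack_nibbles(packed)
    if len(codes) != BLOCK_SIZE * len(block_scales):
        raise ValueError("expected one block scale per 16 weights")
    return [
        decode_e2m1(code) * block_scales[i // BLOCK_SIZE] * global_scale
        for i, code in enumerate(codes)
    ]


# One block: codes 0..15 packed two per byte, with illustrative scale values.
packed = bytes((2 * k) | ((2 * k + 1) << 4) for k in range(8))
weights = dequantize(packed, block_scales=[2.0], global_scale=0.25)
print(weights)
```

In a real kernel the same arithmetic happens in registers while the weight tile is being loaded, so the full-precision weight matrix is never written back to memory.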
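The effective storage cost per weight:

$$4\ \text{bits} + \frac{8\ \text{bits}}{16\ \text{weights}} = 4.5\ \text{bits per weight}$$

The F32 global scale is amortized over the whole tensor and is negligible.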
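The storage arithmetic can be checked with a few lines of Python. This is a back-of-the-envelope sketch: the function name and the 4096 x 4096 layer shape are hypothetical, and padding or container overhead in the actual checkpoint files is ignored.

```python
def bits_per_weight(num_weights: int, block_size: int = 16, code_bits: int = 4,
                    block_scale_bits: int = 8, global_scale_bits: int = 32) -> float:
    """Average storage per weight: 4-bit codes, one 8-bit scale per block, one 32-bit scale per tensor."""
    num_blocks = -(-num_weights // block_size)  # ceiling division
    total_bits = (num_weights * code_bits
                  + num_blocks * block_scale_bits
                  + global_scale_bits)
    return total_bits / num_weights


# Hypothetical 4096 x 4096 projection matrix.
bpw = bits_per_weight(4096 * 4096)
print(f"{bpw:.4f} bits per weight, {16 / bpw:.2f}x smaller than a 16-bit format")
```

The ratio comes out slightly above 3.5, which is where the roughly 3.5x figure for the quantized layers comes from.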
How the Matrix Multiply Changes
In BF16, the standard linear layer computes:

$$Y = X W^{\top}$$
where both $X$ (activations) and $W$ (weights) are 16-bit values. The GPU loads 2 bytes per weight element from VRAM into the compute units.
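In NVFP4A16, $X$ remains FP16 and $W$ is loaded as packed FP4 -- 0.5 bytes per weight element. The kernel unpacks the FP4 values, applies the two-level scale inline, and runs the multiply-accumulate on the FP16 path:

$$Y = X \hat{W}^{\top}, \qquad \hat{W}_{ij} = q_{ij} \cdot s_{\text{block}(i,j)} \cdot s_{\text{global}}$$

Here $q_{ij}$ is the stored FP4 value, $s_{\text{block}(i,j)}$ is the scale of the 16-weight block that contains it, and $s_{\text{global}}$ is the per-tensor scale.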
Because activations are never quantized, there is no per-token scale computation, no activation calibration overhead, and no risk of activation outliers degrading the output. The FP16 accumulation path is the most mature and heavily optimised GEMM path in both vLLM and CUTLASS. Weight-only compression on this path is particularly effective at autoregressive decode, where bandwidth -- not compute -- is the bottleneck. Loading 4-bit weights from VRAM instead of 16-bit weights reduces the data movement cost by 3.5x, which maps almost directly to faster per-token latency at small batch sizes.
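To make the weight-only path concrete, the sketch below quantizes a small random weight matrix using nothing but the weights themselves, then compares the output of the linear layer against the unquantized reference. It is a simplified stand-in, not the procedure used to produce this checkpoint: it uses plain absmax scaling with round-to-nearest, keeps block scales as Python floats instead of rounding them to F8_E4M3, and fixes the global scale at 1. The matrix sizes and the random seed are arbitrary.

```python
import random

E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16


def fake_quantize_block(block):
    """Quantize then dequantize one block of weights (absmax scaling, round to nearest)."""
    amax = max(abs(w) for w in block)
    if amax == 0.0:
        return [0.0] * len(block)
    scale = amax / 6.0  # the largest weight in the block maps to the largest code
    out = []
    for w in block:
        target = abs(w) / scale
        nearest = min(E2M1_MAGNITUDES, key=lambda m: abs(target - m))
        out.append(scale * nearest if w >= 0 else -scale * nearest)
    return out


def fake_quantize_row(row):
    """Apply block quantization along the input dimension of one weight row."""
    out = []
    for start in range(0, len(row), BLOCK_SIZE):
        out.extend(fake_quantize_block(row[start:start + BLOCK_SIZE]))
    return out


def linear(x, weight):
    """y = x @ weight^T for one activation vector; activations are left untouched."""
    return [sum(xi * wi for xi, wi in zip(x, row)) for row in weight]


random.seed(0)
in_features, out_features = 64, 8
weight = [[random.gauss(0.0, 0.02) for _ in range(in_features)]
          for _ in range(out_features)]
x = [random.gauss(0.0, 1.0) for _ in range(in_features)]

weight_q = [fake_quantize_row(row) for row in weight]
y_ref = linear(x, weight)
y_q = linear(x, weight_q)

max_weight_err = max(abs(a - b)
                     for row, row_q in zip(weight, weight_q)
                     for a, b in zip(row, row_q))
max_output_err = max(abs(a - b) for a, b in zip(y_ref, y_q))
print(f"max weight error: {max_weight_err:.5f}")
print(f"max output error: {max_output_err:.5f}")
```

Because the widest gap in the codebook is between 4 and 6, the rounding error of any weight is bounded by one sixth of the largest magnitude in its block. No activation data enters the procedure at any point.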
Precision Mapping Across the Architecture
Through our own layer-by-layer profiling of activation distributions, routing sensitivity, and accumulated rounding error, we identified exactly which components of this architecture can absorb 4-bit weight compression without behavioral change.
Quantized to FP4 (weights only)
All standard linear projections within the language model transformer blocks: Q, K, V, and output projections in attention, and the up, gate, and down projections in the routed expert MLPs.
Preserved at full precision
| Component | Reason |
|---|---|
| Visual encoder | Vision features have a structurally different activation distribution from language features. Weight compression here degrades spatial grounding in ways that propagate into cross-modal attention. |
| Gated DeltaNet / linear attention | The fused projection structure of the Gated DeltaNet layers is architecturally incompatible with per-group-16 FP4 weight quantization. These layers are excluded entirely. |
| MoE router gates | Routing is a discrete decision. Small weight errors here can misroute tokens to the wrong expert, with effects that are not recoverable in the same forward pass. |
| Language model head | The final projection onto vocabulary logits. Precision here determines the shape of the output distribution and the integrity of structured generation. |
| MTP layers | Not loaded through the model class used for quantization. No action needed. |
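
One way to express the mapping above is a name-based filter over the model's linear modules. The sketch below is illustrative only: the module name patterns and the example names are hypothetical and would need to be matched against the names in the actual checkpoint.

```python
import re

# Hypothetical name patterns for components kept at full precision.
KEEP_FULL_PRECISION = [
    re.compile(r"(^|\.)visual(\.|$)"),       # visual encoder
    re.compile(r"(^|\.)linear_attn(\.|$)"),  # Gated DeltaNet / linear attention
    re.compile(r"(^|\.)mlp\.gate$"),         # MoE router gate
    re.compile(r"(^|\.)lm_head$"),           # language model head
]


def should_quantize(module_name: str) -> bool:
    """Return True if a linear module's weights may be stored as FP4."""
    return not any(p.search(module_name) for p in KEEP_FULL_PRECISION)


# Hypothetical module names, for illustration.
example_names = [
    "model.visual.blocks.0.attn.qkv",
    "model.language_model.layers.0.linear_attn.in_proj",
    "model.language_model.layers.3.self_attn.q_proj",
    "model.language_model.layers.3.mlp.gate",
    "model.language_model.layers.3.mlp.experts.0.up_proj",
    "lm_head",
]
for name in example_names:
    print(f"{name}: {'FP4' if should_quantize(name) else 'full precision'}")
```

Anchoring the router pattern at the end of the name keeps it from also matching the gate projection inside each expert, which is meant to be quantized.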
Memory
Original Qwen3.6-27B in BF16 occupies approximately 55 GB. Axe Strada NVFP4A16 brings the quantized layer weights to approximately 4.5 bits per parameter -- a 3.5x reduction over BF16 on those layers. On disk, the full model including preserved BF16 components lands significantly below the original. The freed VRAM goes directly into KV cache budget, which at long context lengths is the difference between fitting a request and rejecting it.
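A rough way to reason about the total footprint is to blend the two storage costs by the share of parameters that is quantized. The sketch below is an estimate under stated assumptions: the parameter count is the nominal 27 billion from the base model name, and the quantized fractions in the loop are hypothetical values, since this card does not state the actual split.

```python
def weight_footprint_gb(total_params: float, quantized_fraction: float,
                        quantized_bits: float = 4.5, full_bits: float = 16.0) -> float:
    """Estimate weight storage in GB for a model with a mix of FP4 and 16-bit layers."""
    if not 0.0 <= quantized_fraction <= 1.0:
        raise ValueError("quantized_fraction must be between 0 and 1")
    avg_bits = quantized_fraction * quantized_bits + (1.0 - quantized_fraction) * full_bits
    return total_params * avg_bits / 8 / 1e9


TOTAL_PARAMS = 27e9  # nominal count, taken from the base model name

# Hypothetical quantized fractions; 0.0 reproduces the 16-bit baseline.
for fraction in (0.0, 0.5, 0.8, 1.0):
    size = weight_footprint_gb(TOTAL_PARAMS, fraction)
    print(f"quantized fraction {fraction:.1f}: about {size:.1f} GB")
```

With a fraction of 0.0 the estimate lands at 54 GB, consistent with the approximate BF16 size quoted above.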
Deployment via vLLM
Axe Strada NVFP4A16 is compatible with vLLM on NVIDIA Blackwell hardware.
Text only -- skip the vision encoder to free VRAM for additional KV cache:
vllm serve srswti/axe-strada-28b-nvfp4a16 --reasoning-parser qwen3 --language-model-only
Multimodal -- full vision and language support:
vllm serve srswti/axe-strada-28b-nvfp4a16 --reasoning-parser qwen3
Tool use:
vllm serve srswti/axe-strada-28b-nvfp4a16 --reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser qwen3_coder
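With the server started in tool use mode, a request can carry function definitions in the OpenAI-compatible `tools` field. The sketch below builds such a request with the standard library only. The `get_weather` tool, the question, and the local base URL are assumptions for illustration, and the network call is left commented out so the snippet runs without a server.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumption: server reachable on the local machine


def build_tool_request(question: str) -> dict:
    """Build a chat completion payload that offers one hypothetical tool."""
    return {
        "model": "srswti/axe-strada-28b-nvfp4a16",
        "messages": [{"role": "user", "content": question}],
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "get_weather",  # hypothetical tool
                    "description": "Look up the current weather for a city.",
                    "parameters": {
                        "type": "object",
                        "properties": {"city": {"type": "string"}},
                        "required": ["city"],
                    },
                },
            }
        ],
        "tool_choice": "auto",
    }


def post_chat_completion(payload: dict) -> dict:
    """POST the payload to the chat completions endpoint and return the parsed reply."""
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


payload = build_tool_request("What is the weather in Paris?")
print(json.dumps(payload, indent=2))

# With a server running:
# reply = post_chat_completion(payload)
# print(reply["choices"][0]["message"].get("tool_calls"))
```

If the model decides to call the tool, the reply message carries a `tool_calls` list instead of plain text content. The caller runs the function and sends the result back in a follow-up message with role `tool`.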
Speculative decoding via Multi-Token Prediction:
vllm serve srswti/axe-strada-28b-nvfp4a16 --reasoning-parser qwen3 --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":3}'
Production config -- 256K context with FP8 KV cache:
vllm serve srswti/axe-strada-28b-nvfp4a16 \
    --trust-remote-code \
    --max-model-len 262144 \
    --max-num-seqs 2 \
    --kv-cache-dtype fp8 \
    --gpu-memory-utilization 0.9 \
    --reasoning-parser qwen3
Send requests using the OpenAI-compatible endpoint:
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://<your-server-host>:8000/v1",
)

messages = [
    {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
]

response = client.chat.completions.create(
    model="srswti/axe-strada-28b-nvfp4a16",
    messages=messages,
)
print(response.choices[0].message.content)
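When the server runs in multimodal mode (without `--language-model-only`), an image can be attached by using the list form of `content` in the OpenAI-compatible chat format. The sketch below only builds the message. The helper name and the image URL are placeholders, not part of this model's API.

```python
import json


def build_image_message(prompt: str, image_url: str) -> dict:
    """Build one user message that pairs a text prompt with an image reference."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }


# Placeholder URL; substitute an image the server can reach.
image_messages = [
    build_image_message(
        "Describe this image in one sentence.",
        "https://example.com/picture.jpg",
    )
]
print(json.dumps(image_messages, indent=2))
```

The resulting list can be passed as the `messages` argument of `client.chat.completions.create` in place of the text-only messages.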
Requirements: NVIDIA Blackwell GPU (SM120), vLLM >= 0.19.
Evaluation
Benchmarks are in progress. This page will be updated when results across the full suite are verified.
Model tree for srswti/axe-strada-28b-nvfp4a16
Base model
Qwen/Qwen3.6-27B