Qwable-5-27B-Coder-NVFP4
Qwable-5-27B-Coder-NVFP4 is the compact NVIDIA ModelOpt release of Qwable: a Qwen3.6-based coder-agent tune trained first on Claude Fable 5 traces, then continued on Kimi 2.7 Coder traces.
This is the serving-oriented checkpoint: NVFP4 safetensors for runtimes that understand ModelOpt quantization. Use the GGUF repo for llama.cpp and Ollama.
Serving shape
| Item | Value |
|---|---|
| Format | ModelOpt NVFP4 safetensors |
| Producer | nvidia-modelopt 0.44.0 |
| Quant algorithm | NVFP4 |
| Group size | 16 |
| Weight files | 2 safetensors shards |
| Approx. size | 19.7 GB total |
| Runtime target | vLLM / TensorRT-LLM / NVIDIA ModelOpt-compatible serving |
| Source checkpoint | DJLougen/Qwable-5-27B-Coder |
BF16 source checkpoint
-> NVIDIA ModelOpt export
-> NVFP4 Linear targets, group size 16
-> lm_head / SSM conv1d / visual modules excluded
-> compact safetensors serving repo
What carries over from Qwable
The NVFP4 checkpoint is a quantized release of the same coder-agent model. The target workload remains repository navigation, patch planning, terminal-output analysis, verifier recovery, tool-shaped answers, and long-context coding prompts.
Early maintainer runs show the source Qwable checkpoint outperforming the base model on a private coder benchmark. Public benchmark details are pending, so treat that as early maintainer signal rather than a reproducible leaderboard claim.
Quantization details
From hf_quant_config.json:
| Field | Value |
|---|---|
| Producer | modelopt |
| Producer version | 0.44.0 |
| Quant algorithm | NVFP4 |
| Group size | 16 |
| KV cache quantization | Not set in this checkpoint |
| Higher-precision exclusions | lm_head, SSM conv1d modules, and model.visual* |
The config records 4-bit floating-point weights and input activations for Linear targets. The excluded modules are intentional: they preserve sensitive output, SSM convolution, and vision-tower components outside the main NVFP4 target set.
Quickstart
Install a runtime with support for ModelOpt NVFP4 checkpoints. Exact package versions change quickly; prefer the current vLLM or TensorRT-LLM documentation for your CUDA stack.
Download locally:
hf download DJLougen/Qwable-5-27B-Coder-NVFP4 --local-dir Qwable-5-27B-Coder-NVFP4
Example vLLM-style serving command, adjusted for your installed version and GPU topology:
vllm serve DJLougen/Qwable-5-27B-Coder-NVFP4 \
--tensor-parallel-size 1 \
--max-model-len 32768
If your runtime does not recognize the ModelOpt quantization config, use the BF16 source checkpoint or the GGUF release instead.
Recommended use
- Use this repo for NVIDIA-serving experiments where NVFP4 is supported directly.
- Use the BF16 repo for conversion, further training, or quality-ceiling evaluation.
- Use the GGUF repo for llama.cpp and Ollama workflows.
- Keep benchmark comparisons identical across model variants: same prompts, context, sampling, max tokens, and tool schema exposure.
Related releases
- Source BF16 Transformers checkpoint:
DJLougen/Qwable-5-27B-Coder - llama.cpp GGUF release:
DJLougen/Qwable-5-27B-Coder-GGUF
Limitations
- Public benchmark tables are pending.
- NVFP4 runtime compatibility depends on serving stack, CUDA version, GPU architecture, and package versions.
- This is not a GGUF repo; llama.cpp users should use the GGUF release.
- Quantization can change instruction following, code precision, and tool-call reliability. Validate on your own tasks.
- Vision components are excluded from the main NVFP4 target set, but this release is marketed for coding behavior, not vision improvement.
- Safety behavior is inherited from the base model and fine-tuning data; no separate safety alignment claim is made here.
License
Released under Apache-2.0, following the upstream base model license metadata.
- Downloads last month
- 51