Fallback to Marlin kernel gives weird/incorrect and sometimes garbled results
#14
by sapjunior - opened
Has anyone encountered weird/garbled output, mixed with Chinese characters and inconsistent code-switching, with this quantization using the following configuration on a 5090 (Driver Version 595.71.05)?
services:
  vllm:
    image: vllm/vllm-openai:v0.24.0-cu129
    container_name: qwen3.6-27b-nvfp4
    restart: unless-stopped
    network_mode: host
    ipc: host
    devices:
      - "nvidia.com/gpu=all"
    volumes:
      - /home/thananop/.cache/huggingface:/root/.cache/huggingface
    environment:
      - HF_TOKEN=${HF_TOKEN}
      - VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1
    entrypoint: ["/bin/bash", "-c"]
    command:
      - >-
        vllm serve nvidia/Qwen3.6-27B-NVFP4
        --served-model-name "qwen3.6-27b-nvfp4"
        --host=0.0.0.0
        --port=8000
        --trust-remote-code
        --quantization modelopt
        --max-model-len=128000
        --kv-cache-dtype=fp8
        --mm-processor-kwargs '{"min_pixels": 16384, "max_pixels": 12000000}'
        --gpu-memory-utilization=0.87
        --max-num-seqs=2
        --max-num-batched-tokens=8192
        --reasoning-parser=qwen3
        --enable-auto-tool-choice
        --tool-call-parser=qwen3_coder
        --default-chat-template-kwargs '{"enable_thinking": false}'
        --enable-chunked-prefill
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 600s
One thing I noticed: compared to this model, https://huggingface.co/sudo-0x2a/Qwen3.6-27B-NVFP4-GPTQ (which may be W4A4), and other NVFP4 quants, the NVIDIA quant logs a message like the one below. I'm not sure whether this is a Marlin kernel bug in upstream vLLM or not.
[marlin.py:34] Your GPU does not have native support for FP4 computation but FP4 quantization is being used. Weight-only FP4 compression will be used leveraging the Marlin kernel. This may degrade performance for compute-heavy workloads
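To make the "mixed Chinese characters" symptom easier to compare between quants, a rough check like the following could be run over saved responses. This is only a sketch: `cjk_ratio` and `looks_code_switched` are hypothetical helper names, the 5% threshold is an arbitrary assumption, and only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) is counted.

```python
def cjk_ratio(text: str) -> float:
    """Fraction of non-whitespace characters in the basic CJK Unified Ideographs block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    cjk = sum(1 for c in chars if "\u4e00" <= c <= "\u9fff")
    return cjk / len(chars)


def looks_code_switched(text: str, threshold: float = 0.05) -> bool:
    """Flag a response as suspicious when its CJK share exceeds the threshold.

    The threshold is an assumed value; tune it for prompts where some
    Chinese text is legitimately expected.
    """
    return cjk_ratio(text) > threshold
```

Running the same English-only prompts against both quants and counting how many responses get flagged would give a number to attach to a bug report, rather than a few cherry-picked samples.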