Fallback to Marlin kernel gives weird/incorrect and sometimes garbled results

#14
by sapjunior - opened

Has anyone else encountered weird/garbled output (mixed-in Chinese characters, inconsistent code-switching) with this quantization? I'm using the following configuration on a 5090 (Driver Version 595.71.05):

services:
  vllm:
    image: vllm/vllm-openai:v0.24.0-cu129
    container_name: qwen3.6-27b-nvfp4
    restart: unless-stopped
    network_mode: host
    ipc: host
    devices:
      - "nvidia.com/gpu=all"
    volumes:
      - /home/thananop/.cache/huggingface:/root/.cache/huggingface
    environment:
      - HF_TOKEN=${HF_TOKEN}
      - VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1
    entrypoint: ["/bin/bash", "-c"]
    command:
      - >-
        vllm serve nvidia/Qwen3.6-27B-NVFP4
        --served-model-name "qwen3.6-27b-nvfp4"
        --host=0.0.0.0
        --port=8000
        --trust-remote-code
        --quantization modelopt
        --max-model-len=128000
        --kv-cache-dtype=fp8
        --mm-processor-kwargs '{"min_pixels": 16384, "max_pixels": 12000000}'
        --gpu-memory-utilization=0.87
        --max-num-seqs=2
        --max-num-batched-tokens=8192
        --reasoning-parser=qwen3
        --enable-auto-tool-choice
        --tool-call-parser=qwen3_coder
        --default-chat-template-kwargs '{"enable_thinking": false}'
        --enable-chunked-prefill
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 600s

One thing I noticed: compared with this model, https://huggingface.co/sudo-0x2a/Qwen3.6-27B-NVFP4-GPTQ (possibly W4A4), and other NVFP4 quants, the NVIDIA quant logs a message like the one below at startup. I'm not sure whether this is a Marlin kernel bug in upstream vLLM or not.

[marlin.py:34] Your GPU does not have native support for FP4 computation but FP4 quantization is being used. Weight-only FP4 compression will be used leveraging the Marlin kernel. This may degrade performance for compute-heavy workloads
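To make the symptom easier to compare across quants, here is a rough sketch of how the "mixed Chinese characters" problem could be measured on responses to English-only prompts. The helper names `cjk_ratio` and `looks_garbled` and the 5% threshold are my own assumptions for illustration, not part of vLLM or any other library. It only counts CJK ideographs, so it will not catch other kinds of garbling.

```python
import unicodedata


def cjk_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that are CJK ideographs."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    # unicodedata.name() yields e.g. "CJK UNIFIED IDEOGRAPH-4E2D" for such characters;
    # the empty-string default covers characters without a Unicode name.
    cjk = sum(1 for c in chars if "CJK UNIFIED IDEOGRAPH" in unicodedata.name(c, ""))
    return cjk / len(chars)


def looks_garbled(text: str, threshold: float = 0.05) -> bool:
    """Heuristic (assumed threshold): flag a reply to an English prompt
    when more than `threshold` of its characters are CJK ideographs."""
    return cjk_ratio(text) > threshold
```

Running the same set of English prompts against each quant and counting how many replies `looks_garbled` flags would give a number to put in a bug report, instead of an impression.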
