Rio 3.5 Open 397B NVFP4
NVFP4 quantization of prefeitura-rio/Rio-3.5-Open-397B (Qwen 3.5 397B post-trained with vision).
Run
huggingface-cli download mitomtuna/Rio-3.5-Open-397B-NVFP4 --local-dir ./Rio-3.5-Open-397B-NVFP4
docker pull ghcr.io/tunamitom/rio:latest
docker compose -f docker-compose.rio.yaml up -d
Requires 4Γ RTX 6000 Blackwell (384GB+ VRAM total) and NVIDIA Container Toolkit.
docker-compose.rio.yaml
services:
sglang:
image: ghcr.io/tunamitom/rio:latest
container_name: rio
entrypoint: ["/bin/bash"]
ipc: host
shm_size: "16g"
mem_limit: 200g
memswap_limit: 200g
restart: "no"
cap_add: [SYS_NICE]
ulimits:
memlock: -1
stack: 67108864
nofile: { soft: 1048576, hard: 1048576 }
ports:
- "8001:8001"
healthcheck:
test: ["CMD-SHELL", "curl -fs http://localhost:8001/health"]
interval: 30s
timeout: 5s
retries: 3
start_period: 900s
environment:
OMP_NUM_THREADS: "8"
SAFETENSORS_FAST_GPU: "1"
CUTE_DSL_ARCH: "sm_120a"
SGLANG_ENABLE_SPEC_V2: "1"
SGLANG_ENABLE_HEALTH_ENDPOINT_GENERATION: "false"
SGLANG_SKIP_SGL_KERNEL_VERSION_CHECK: "1"
SGLANG_ENABLE_TP_MEMORY_INBALANCE_CHECK: "false"
NCCL_DEBUG: WARN
SGLANG_SET_CPU_AFFINITY: "1"
NCCL_IB_DISABLE: "1"
NCCL_P2P_LEVEL: SYS
NCCL_ALLOC_P2P_NET_LL_BUFFERS: "1"
NCCL_MIN_NCHANNELS: "8"
NCCL_CUMEM_HOST_ENABLE: "0"
NCCL_NET_GDR_LEVEL: "SYS"
PYTORCH_CUDA_ALLOC_CONF: "expandable_segments:True"
SGLANG_ENABLE_JIT_DEEPGEMM: "0"
CUDA_VISIBLE_DEVICES: "0,1,2,3"
SGLANG_PREVENT_THOUGHT_LOOPS: "0"
B12X_ENABLE_DYNAMIC_DOWN_SCALE: "1"
SGLANG_PCIE_AUTOTUNE: "1"
TORCHINDUCTOR_CACHE_DIR: "/cache/torchinductor"
TRITON_CACHE_DIR: "/cache/triton"
CUTE_DSL_CACHE_DIR: "/cache/cute_dsl"
B12X_AUTOTUNE_CACHE_DIR: "/cache/b12x_autotune"
volumes:
- ./Rio-3.5-Open-397B-NVFP4:/models/Rio-3.5-Open-397B-NVFP4:ro
- rio-cache:/cache
deploy:
resources:
reservations:
devices:
- driver: nvidia
device_ids: ["0", "1", "2", "3"]
capabilities: [gpu]
command:
- -lc
- >-
set -euo pipefail;
exec python3 -m sglang.launch_server
--model-path /models/Rio-3.5-Open-397B-NVFP4
--tokenizer-path /models/Rio-3.5-Open-397B-NVFP4
--served-model-name rio
--tp-size 4
--host 0.0.0.0
--port 8001
--trust-remote-code
--quantization modelopt_fp4
--kv-cache-dtype fp8_e4m3
--mem-fraction-static 0.93
--chunked-prefill-size 16384
--cuda-graph-max-bs 64
--cuda-graph-bs 1 2 3 4 5 6 7 8 16 24 32 48 64
--max-running-requests 64
--reasoning-parser qwen3
--tool-call-parser qwen3_coder
--attention-backend flashinfer
--fp4-gemm-backend b12x
--moe-runner-backend b12x
--mamba-scheduler-strategy extra_buffer
--enable-pcie-oneshot-allreduce
--enable-metrics
--sleep-on-idle
volumes:
rio-cache:
driver: local
Performance
Tested on 4Γ RTX 6000 Blackwell (300W Max-Q, TP=4).
Decode tok/s (aggregate):
| ctx | C=1 | C=10 | C=20 | C=32 |
|---|---|---|---|---|
| 0 | 130 | 539 | 823 | 1159 |
| 16k | 128 | 511 | β | β |
| 32k | 124 | 495 | β | β |
| 64k | 120 | 475 | β | β |
| 128k | 113 | 435 | β | β |
Prefill tok/s:
| ctx | tok/s |
|---|---|
| 8k | 12,614 |
| 16k | 11,850 |
| 32k | 11,375 |
| 64k | 10,150 |
| 128k | 7,943 |
Note: Speculative decoding (NEXTN/MTP) is intentionally disabled β the shipped draft head was trained for base Qwen3.5 and doesn't work well with Rio's post-trained weights. See prefeitura-rio/Rio-3.5-Open-397B for the original model.
Details
- Quantization: NVFP4 via quant-toolkit (ModelOpt)
- KV cache: FP8 E4M3 (2.33M tokens)
- Attention: FlashInfer
- MoE: B12X
- Server: SGLang
- Downloads last month
- 202
Inference Providers NEW
This model isn't deployed by any Inference Provider. π Ask for provider support
Model tree for mitomtuna/Rio-3.5-Open-397B-NVFP4
Base model
Qwen/Qwen3.5-397B-A17B Finetuned
prefeitura-rio/Rio-3.5-Open-397B