NOTE: The parent model has been pulled offline. Consider these quants to be outdated/deprecated.
Rio-3.5-Open-397B GGUF Quants
- Parent Model Card: https://huggingface.co/prefeitura-rio/Rio-3.5-Open-397B
- Original Model Card: https://huggingface.co/Qwen/Qwen3.5-397B-A17B
This repository contains GGUF quantizations of prefeitura-rio/Rio-3.5-Open-397B.
Rio-3.5-Open-397B is based on Qwen3.5-397B-A17B. These GGUF files were converted with b9619 llama.cpp and quantized for llama.cpp testing.
See llama.cpp github for details on llama.cpp: https://github.com/ggml-org/llama.cpp
Files
| File | Quant | MTP | Notes |
|---|---|---|---|
| Rio-3.5-Open-397B-Q6_K-MTP.gguf | Q6_K | yes | High-quality quant, ~308 GiB |
| Rio-3.5-Open-397B-IQ4_XS-MTP.gguf | IQ4_XS | yes | iMatrix-assisted quant, ~200 GiB |
Quantization notes
The IQ4_XS quant was created using Unsloth's published iMatrix for Qwen3.5-397B-A17B-MTP:
unsloth/Qwen3.5-397B-A17B-MTP-GGUF/imatrix_unsloth.gguf_file- https://huggingface.co/unsloth/Qwen3.5-397B-A17B-MTP-GGUF
The MTP layer is retained:
- qwen35moe.block_count = 61
- qwen35moe.nextn_predict_layers = 1
Note: the published Unsloth iMatrix did not include weights for the final blk.60.* MTP tensors, so those tensors were quantized without iMatrix weighting. The main model layers used the iMatrix.
Example llama.cpp launch
llama-server \
--model Rio-3.5-Open-397B-IQ4_XS-MTP.gguf \
--ctx-size 262144 \
--parallel 1 \
--n-gpu-layers 999 \
--flash-attn on \
--cache-type-k bf16 \
--cache-type-v bf16 \
--spec-type draft-mtp \
--spec-draft-n-max 3 \
--spec-draft-type-k q8_0 \
--spec-draft-type-v q8_0 \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--min-p 0.0
Attribution
- Parent model: prefeitura-rio/Rio-3.5-Open-397B
- Base model family: Qwen3.5-397B-A17B
- iMatrix source for IQ4_XS: unsloth/Qwen3.5-397B-A17B-MTP-GGUF
- Quantization performed independently by Foxipanda.
Model tree for foxipanda/Rio-3.5-Open-397B-GGUF
Base model
Qwen/Qwen3.5-397B-A17B