Instructions to use prefeitura-rio/Rio-3.5-Open-397B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use prefeitura-rio/Rio-3.5-Open-397B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="prefeitura-rio/Rio-3.5-Open-397B")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
# Run the pipeline and print its output
print(pipe(text=messages))

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("prefeitura-rio/Rio-3.5-Open-397B")
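# device_map="auto" (needs the accelerate package) spreads the weights across the available devices;
# a 397B-parameter checkpoint is unlikely to fit on a single GPU.
model = AutoModelForMultimodalLM.from_pretrained("prefeitura-rio/Rio-3.5-Open-397B", device_map="auto")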
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use prefeitura-rio/Rio-3.5-Open-397B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "prefeitura-rio/Rio-3.5-Open-397B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "prefeitura-rio/Rio-3.5-Open-397B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
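
The same request can also be sent from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_request` and `send` are illustrative and not part of vLLM, and it assumes the server started above is listening on `localhost:8000`.

```python
import json
import urllib.request

MODEL_ID = "prefeitura-rio/Rio-3.5-Open-397B"


def build_chat_request(base_url, prompt, image_url):
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": MODEL_ID,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and return the text of the first choice."""
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


request = build_chat_request(
    "http://localhost:8000",
    "Describe this image in one sentence.",
    "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
)
# Uncomment once the server is running:
# print(send(request))
```

Pointing `base_url` at `http://localhost:30000` should work the same way for the SGLang server, since both expose the same OpenAI-compatible route.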

Use Docker

docker model run hf.co/prefeitura-rio/Rio-3.5-Open-397B

SGLang

How to use prefeitura-rio/Rio-3.5-Open-397B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "prefeitura-rio/Rio-3.5-Open-397B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "prefeitura-rio/Rio-3.5-Open-397B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "prefeitura-rio/Rio-3.5-Open-397B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "prefeitura-rio/Rio-3.5-Open-397B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use prefeitura-rio/Rio-3.5-Open-397B with Docker Model Runner:
```
docker model run hf.co/prefeitura-rio/Rio-3.5-Open-397B
```

About the SwiReasoning (requirement?)

by abc-nix - opened about 22 hours ago

Discussion

abc-nix

about 22 hours ago

First, sorry for my ignorance. I am interested in this finetune and I have some questions.

Rio 3.5 Open 397B features SwiReasoning, a training-free inference framework based on Shi et al. (2025) that dynamically switches between explicit chain-of-thought and latent-space reasoning, guided by entropy-based confidence signals. This enables both higher accuracy and dramatically improved token efficiency. This model was explicitly trained to maximize the efficiency gained via latent reasoning.

1. Are the benchmark numbers you report achieved with SwiReasoning active during inference, or do they reflect the model's baseline (non-SwiReasoning) performance?
2. Since SwiReasoning requires feeding continuous "soft" embeddings (probability-weighted mixtures of the full embedding matrix) rather than discrete token IDs, inference engines like llama.cpp that only support discrete token generation cannot currently implement it. How would this model perform on such engines? Is there still a meaningful improvement over the base Qwen 3.5 397B even without the SwiReasoning inference procedure?

Thanks for your time and sorry for any inconveniences.
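
For readers unfamiliar with the terms in the question, here is a toy sketch of what a "soft" embedding is and how an entropy signal could choose between explicit and latent steps. It is not the SwiReasoning implementation: the fixed `entropy_threshold`, the direction of the switch (high entropy selects the latent step), and every function name are assumptions made for illustration, and the tiny embedding matrix is made up.

```python
import math


def softmax(logits):
    """Turn raw logits into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher means the model is less confident."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def soft_embedding(probs, embedding_matrix):
    """Probability-weighted mixture of all embedding rows."""
    dim = len(embedding_matrix[0])
    return [
        sum(p * row[d] for p, row in zip(probs, embedding_matrix))
        for d in range(dim)
    ]


def next_input(logits, embedding_matrix, entropy_threshold):
    """Pick the vector fed back to the model at the next step.

    Assumption: a fixed threshold decides the mode. The real switching
    rule is not described in this thread.
    """
    probs = softmax(logits)
    if entropy(probs) > entropy_threshold:
        # Latent step: feed a continuous mixture, no token is committed.
        return "latent", soft_embedding(probs, embedding_matrix)
    # Explicit step: commit to the most likely token and feed its embedding.
    token_id = max(range(len(probs)), key=probs.__getitem__)
    return "explicit", list(embedding_matrix[token_id])


# Hypothetical 3-token vocabulary with 2-dimensional embeddings.
E = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
uncertain_mode, uncertain_vec = next_input([0.0, 0.0, 0.0], E, entropy_threshold=0.5)
confident_mode, confident_vec = next_input([10.0, 0.0, 0.0], E, entropy_threshold=0.5)
print(uncertain_mode, uncertain_vec)
print(confident_mode, confident_vec)
```

The latent branch is the part an engine limited to discrete token IDs cannot express: its output is a vector that generally does not match any single row of the embedding matrix.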

Sangu1nius

Prefeitura do Rio de Janeiro (City of Rio de Janeiro) org about 20 hours ago

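With! The reported benchmark numbers were measured with SwiReasoning active during inference.
Yes, the lack of soft-embedding support in engines such as llama.cpp is a major problem. As we release new models that use latent reasoning, we hope most inference frameworks will adapt in due time.
Yes! We also measured IMOAnswerBench, SWE-Bench Pro, and APEX without latent reasoning, to get a good baseline understanding of the model's capabilities: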

| Benchmark | Qwen 3.5 397B (base) | Training (no latent reasoning) | Latent Reasoning |
|---|---|---|---|
| IMOAnswerBench | 80.9 | 84.5 | 89.5 |
| SWE-Bench Pro | 50.9 | 54.8 | 58.1 |
| APEX | 9.4 | 22.9 | 29.2 |

Therefore, the improvement is meaningful, but Latent Reasoning approximately doubles the capability jump, so we highly recommend using it when possible, and we will push for inference engines to implement it!
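
The "approximately doubles" statement can be checked against the scores quoted above. This is a small arithmetic sketch, assuming the "Latent Reasoning" figures refer to the trained model with latent reasoning enabled, so that both gains are measured from the Qwen 3.5 397B baseline; the `gains` helper is illustrative.

```python
# (baseline, after training, with latent reasoning), as quoted in the reply above
scores = {
    "IMOAnswerBench": (80.9, 84.5, 89.5),
    "SWE-Bench Pro": (50.9, 54.8, 58.1),
    "APEX": (9.4, 22.9, 29.2),
}


def gains(base, trained, latent):
    """Return (gain from training alone, total gain, total / training ratio)."""
    training_gain = trained - base
    total_gain = latent - base
    return training_gain, total_gain, total_gain / training_gain


for name, triple in scores.items():
    training_gain, total_gain, ratio = gains(*triple)
    print(
        f"{name}: training +{training_gain:.1f}, "
        f"with latent reasoning +{total_gain:.1f}, ratio {ratio:.2f}x"
    )
```

Under that assumption the ratio comes out at roughly 2.4x for IMOAnswerBench, 1.8x for SWE-Bench Pro and 1.5x for APEX, so "approximately doubles" fits the first two benchmarks better than the third.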

abc-nix

about 20 hours ago

Thanks for the answer. I will be eagerly waiting for llama.cpp/ik_llama.cpp to add this new feature. It would also benefit other models, so I think more people will be in favor of its inclusion.

zenlkq

26 minutes ago

Hi! Still haven't figured out that latent reasoning mode, but wanted to share my results:

Without latent I got:

APEX -> 29/120 = ~24.2%, compared to the claimed 22.9

Looks good to me! Great job to the Rio team!
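
As a rough sanity check on how close 29/120 is to the claimed 22.9, the gap can be compared with the sampling error expected on 120 problems. This sketch treats each problem as an independent pass/fail trial and assumes the claimed score was measured on the same 120-problem set; both are simplifying assumptions, not something stated in the thread.

```python
import math


def binomial_standard_error(p, n):
    """Standard error of a pass rate p measured on n independent problems."""
    return math.sqrt(p * (1.0 - p) / n)


claimed = 0.229          # score reported by the Rio team, without latent reasoning
solved, total = 29, 120  # independent run reported above
observed = solved / total

se = binomial_standard_error(claimed, total)
z = (observed - claimed) / se
print(
    f"observed {observed:.1%}, claimed {claimed:.1%}, "
    f"standard error {se:.1%}, z = {z:.2f}"
)
```

The difference is well under one standard error, so under these assumptions the two results are statistically indistinguishable.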

Sangu1nius

Prefeitura do Rio de Janeiro (City of Rio de Janeiro) org 8 minutes ago

Thank you for your contribution, @zenlkq! We highly encourage members of the community to benchmark and stress-test our model.
