About the SwiReasoning (requirement?)
First, sorry for my ignorance. I am interested in this finetune and I have some questions.
Rio 3.5 Open 397B features SwiReasoning, a training-free inference framework based on Shi et al. (2025) that dynamically switches between explicit chain-of-thought and latent-space reasoning, guided by entropy-based confidence signals. This enables both higher accuracy and dramatically improved token efficiency. This model was explicitly trained to maximize the efficiency gained via latent reasoning.
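To make the "entropy-based confidence signal" in the quoted description concrete, here is a minimal sketch of how the entropy of a next-token distribution separates a confident step from an uncertain one. The `choose_mode` rule, its threshold, and the direction of the mapping are assumptions made for illustration only; this is not the switching criterion from Shi et al. (2025) or from the Rio team.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def choose_mode(probs, threshold=0.5):
    """Hypothetical toy rule: low entropy -> 'explicit', otherwise 'latent'.

    The threshold and the mapping are illustrative assumptions, not the
    published SwiReasoning criterion.
    """
    return "explicit" if entropy(probs) < threshold else "latent"

confident = [0.97, 0.01, 0.01, 0.01]  # one token dominates
uncertain = [0.25, 0.25, 0.25, 0.25]  # uniform over four tokens

print(round(entropy(confident), 3), choose_mode(confident))
print(round(entropy(uncertain), 3), choose_mode(uncertain))
```

A uniform distribution over four tokens has entropy ln 4, the maximum for that vocabulary size, while the peaked distribution sits close to zero.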
Are the benchmark numbers you report achieved with SwiReasoning active during inference, or do they reflect the model's baseline (non-SwiReasoning) performance?
Since SwiReasoning requires feeding continuous "soft" embeddings (probability-weighted mixtures of the full embedding matrix) rather than discrete token IDs, inference engines like llama.cpp that only support discrete token generation cannot currently implement it. How would this model perform on such engines?
Is there still a meaningful improvement over the base Qwen 3.5 397B even without the SwiReasoning inference procedure?
Thanks for your time and sorry for any inconvenience.
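The "soft" embedding mentioned in the question can be sketched in a few lines. This toy example, with a made-up three-token vocabulary and two-dimensional embeddings, contrasts the usual discrete step (look up the embedding of the single chosen token) with a probability-weighted mixture of all rows of the embedding matrix. The function name and the data are hypothetical; it only illustrates why an engine that accepts token IDs alone cannot express the second case.

```python
def soft_embedding(probs, embedding_matrix):
    """Probability-weighted mixture of embedding rows (i.e. probs @ E)."""
    dim = len(embedding_matrix[0])
    return [
        sum(p * row[j] for p, row in zip(probs, embedding_matrix))
        for j in range(dim)
    ]

# Hypothetical 3-token vocabulary with 2-dimensional embeddings.
E = [
    [1.0, 0.0],  # token 0
    [0.0, 1.0],  # token 1
    [1.0, 1.0],  # token 2
]
probs = [0.5, 0.25, 0.25]  # next-token distribution

# Discrete decoding: pick one token ID, feed back its embedding row.
token_id = max(range(len(probs)), key=lambda i: probs[i])
hard = E[token_id]

# Latent-style step: feed back the mixture, which matches no single row.
soft = soft_embedding(probs, E)

print(hard)  # [1.0, 0.0]
print(soft)  # [0.75, 0.5]
```

The mixture `[0.75, 0.5]` does not correspond to any token ID, so it has to be passed to the model as an embedding vector rather than as a token.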
With SwiReasoning active!
Yes, this is a major problem. As we release new models using latent reasoning, we hope most inference frameworks will adapt in due time.
Yes! We measured IMOAnswerBench and SWE-Bench Pro also without latent reasoning, just so we get a good baseline understanding of its capabilities:
| Configuration | IMOAnswerBench | SWE-Bench Pro | APEX |
|---|---|---|---|
| Qwen 3.5 397B | 80.9 | 50.9 | 9.4 |
| Training | 84.5 | 54.8 | 22.9 |
| Latent Reasoning | 89.5 | 58.1 | 29.2 |
Therefore, the improvement is meaningful, but Latent Reasoning approximately doubles the capability jump, so we highly recommend using it when possible, and we will push for inference engines to implement it!
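The "approximately doubles" claim can be checked against the reported scores with a little arithmetic. This sketch assumes the Training figures are the fine-tuned model without latent reasoning and the Latent Reasoning figures are the same model with it enabled, so both gains are measured from the Qwen 3.5 397B baseline.

```python
# Scores as reported in the reply: (base, training only, with latent reasoning).
scores = {
    "IMOAnswerBench": (80.9, 84.5, 89.5),
    "SWE-Bench Pro": (50.9, 54.8, 58.1),
    "APEX": (9.4, 22.9, 29.2),
}

ratios = {}
for name, (base, trained, latent) in scores.items():
    training_gain = trained - base  # gain from training alone
    total_gain = latent - base      # gain with latent reasoning enabled
    ratios[name] = total_gain / training_gain
    print(f"{name}: +{training_gain:.1f} -> +{total_gain:.1f} "
          f"({ratios[name]:.2f}x)")
```

Under that reading the ratio of total gain to training-only gain is about 2.4x on IMOAnswerBench, 1.8x on SWE-Bench Pro, and 1.5x on APEX, so "approximately doubles" fits the first two benchmarks better than the third.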
Thanks for the answer. I will be eagerly waiting for llama.cpp/ik_llama.cpp to add this new feature. It would also help other models, so I think more people will be in favor of its inclusion.
Hi! Still haven't figured out that latent reasoning mode, but wanted to share my results:
Without latent I got:
APEX -> 29/120 = ~24.2%, compared to the claimed 22.9
Looks good to me! Great job to the Rio team!
Thank you for your contribution @zenlkq ! We highly encourage members of the community to benchmark and stress test our model.