Confucius4
📑 Confucius4: Advancing Multimodal Reasoning via Joint Optimization of Iterative SFT-RL and Compact Chain-of-Thought
Based on Qwen3.5-27B | License: Apache 2.0
Figure 1: Confucius4 achieves SOTA among models of comparable scale across multiple visual math benchmarks.
Figure 2: Confucius4 significantly reduces output token counts across multiple visual math benchmarks.
1 Model Downloads
| Model | HuggingFace | ModelScope |
|---|---|---|
| Confucius4 | 🤗 HuggingFace | ModelScope |
2 Introduction
Confucius4 is an open-source multimodal LLM developed by the NetEase Youdao AI Team, built upon the Qwen3.5 architecture and designed for advanced mathematical reasoning. Key features:
✅ 1. Cost-Effective Multimodal Training Set with SFT-RL Iterative Optimization, Achieving SOTA among Comparable-Scale Models
- Image-gain filtering is used to eliminate low-value visual redundancy, building a cost-effective multimodal training set.
- An iterative SFT + RL paradigm continuously elevates performance across both text-only and multimodal scenarios.
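The card does not spell out how image gain is measured. As a rough illustration only, one plausible reading is to compare how often a reference model solves a sample with and without its image, and keep the samples where the image actually helps. The `Sample` fields, the `image_gain` definition, and the `min_gain` threshold below are all hypothetical and not taken from the Confucius4 pipeline.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    sample_id: str
    acc_with_image: float  # assumed: pass rate when the model sees text + image
    acc_text_only: float   # assumed: pass rate when the image is withheld


def image_gain(sample: Sample) -> float:
    """Hypothetical gain: how much the image improves the pass rate."""
    return sample.acc_with_image - sample.acc_text_only


def filter_by_image_gain(samples, min_gain=0.2):
    """Keep samples whose image contributes at least `min_gain` (assumed threshold)."""
    return [s for s in samples if image_gain(s) >= min_gain]


if __name__ == "__main__":
    pool = [
        Sample("geometry-01", acc_with_image=0.9, acc_text_only=0.1),
        Sample("algebra-07", acc_with_image=0.8, acc_text_only=0.8),
    ]
    print([s.sample_id for s in filter_by_image_gain(pool)])
```

Under this reading, a sample that is solved equally well without its image adds visual tokens but little training signal, which is one way to interpret "low-value visual redundancy".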
✅ 2. Pure-Text Reasoning Data Augmentation in SFT for Stronger Reasoning Foundation (+23.2 points on Math-Hard-500)
- Pure-text reasoning data injected during SFT reinforces the multimodal model's reasoning backbone.
- A hybrid "text reasoning + multimodal problem-solving" training paradigm achieves synergistic gains between reasoning capability and multimodal perception.
✅ 3. Refined CoT SFT + Length-Aware RL: Achieving Both Accuracy and Efficiency (CoT length reduced by 43.2%)
- SFT stage: chain-of-thought reasoning is reconstructed to eliminate redundant steps, producing concise yet complete high-quality chains.
- RL stage: a Length-Aware Advantage mechanism introduces an explore-exploit tradeoff into advantage computation, constraining reasoning length for non-challenging problems and effectively eliminating overthinking.
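The card gives no formula for the Length-Aware Advantage, so the following is only an illustrative sketch of how such a mechanism could look. It assumes a GRPO-style group of rollouts per problem, treats a problem as non-challenging when the group pass rate reaches a threshold, and then shifts the advantage of correct answers according to their relative length. The function name, `easy_threshold`, and `length_coef` are assumptions introduced for this example.

```python
from statistics import mean, pstdev


def length_aware_advantages(rewards, lengths, easy_threshold=0.75,
                            length_coef=0.5, eps=1e-6):
    """Group-normalized advantages with an assumed length term for easy problems.

    rewards: 0/1 correctness per rollout of the same problem.
    lengths: token count of each rollout's chain-of-thought.
    """
    if not rewards or len(rewards) != len(lengths):
        raise ValueError("rewards and lengths must be non-empty and aligned")

    pass_rate = mean(rewards)
    spread = pstdev(rewards)
    base = [(r - pass_rate) / (spread + eps) for r in rewards]

    correct_lengths = [n for r, n in zip(rewards, lengths) if r > 0]
    # Hard problem (low pass rate): leave advantages untouched so that long,
    # exploratory reasoning is not discouraged.
    if pass_rate < easy_threshold or not correct_lengths:
        return base

    # Easy problem: among correct rollouts, favor shorter chains and
    # penalize longer ones relative to the mean correct length.
    mean_len = mean(correct_lengths)
    adjusted = []
    for adv, r, n in zip(base, rewards, lengths):
        if r > 0:
            adv -= length_coef * (n - mean_len) / mean_len
        adjusted.append(adv)
    return adjusted
```

In this sketch the length term only applies when most rollouts already succeed, which mirrors the stated goal of constraining reasoning length on non-challenging problems while leaving hard problems free to explore.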
Additionally, through targeted optimization on Chinese-language data, Confucius4 produces outputs that are more aligned with the linguistic habits, cultural context, and expression preferences of Chinese-speaking users.
3 Performance
| Benchmark | Confucius3-Math | Qwen3.5-27B | Qwen3.5-35B-A3B | Qwen3.6-27B | Confucius4 |
|---|---|---|---|---|---|
| Math-Hard-500 | 0.626 | 0.582 | 0.614 | 0.756 | 0.814 |
| Math-Figure | - | 0.866 | 0.834 | 0.865 | 0.907 |
| MathVision (testmini) | - | 0.651 | 0.625 | 0.648 | 0.724 |
| LogicVista | - | 0.734 | 0.741 | 0.743 | 0.779 |
| MathVerse | - | 0.866 | 0.859 | 0.865 | 0.876 |
| MathVista (testmini) | - | 0.874 | 0.864 | 0.871 | 0.874 |
| DynaMath | - | 0.877 | 0.889 | 0.856 | 0.893 |
| We-Math | - | 0.913 | 0.906 | 0.907 | 0.912 |
Notes:
- Math-Figure is an internally collected, proprietary multimodal problem-solving dataset (664 samples).
- Math-Hard-500 is a proprietary dataset focusing on challenging middle and high school problems, including a subset of competition-level questions (500 samples).
- Confucius3-Math is a text-only model and is therefore evaluated only on Math-Hard-500.
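To make the table easier to read, the short script below computes the absolute difference between Confucius4 and its base model, Qwen3.5-27B, on each benchmark. The scores are copied from the table above; the helper name `absolute_gains` is just an illustrative choice.

```python
# Scores copied from the table above: (Qwen3.5-27B, Confucius4)
SCORES = {
    "Math-Hard-500": (0.582, 0.814),
    "Math-Figure": (0.866, 0.907),
    "MathVision (testmini)": (0.651, 0.724),
    "LogicVista": (0.734, 0.779),
    "MathVerse": (0.866, 0.876),
    "MathVista (testmini)": (0.874, 0.874),
    "DynaMath": (0.877, 0.893),
    "We-Math": (0.913, 0.912),
}


def absolute_gains(scores):
    """Return Confucius4 minus base score per benchmark, rounded to 3 decimals."""
    return {name: round(new - base, 3) for name, (base, new) in scores.items()}


if __name__ == "__main__":
    for name, gain in absolute_gains(SCORES).items():
        print(f"{name:<24}{gain:+.3f}")
```

For Math-Hard-500 this gives +0.232, i.e. 23.2 points, which matches the headline figure in the introduction. MathVista (testmini) comes out at 0.000 and We-Math at -0.001, so on those two benchmarks Confucius4 is on par with its base model.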
4 Inference
The environment requirements for running Confucius4 are the same as those for Qwen3.5, so you can load and run the model with Transformers or vLLM for inference and service deployment.
The only requirement is to use the predefined system and user message templates below when querying the model. Other templates may also work, but we have not tested them.
```python
SYSTEM_PROMPT_TEMPLATE = """You are a helpful assistant."""
```
Then you can create your messages as follows and use them to request model results. For multimodal input, encode the image as base64 and include it alongside the text question.
```python
import base64

from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_name = "netease-youdao/Confucius4"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(model_name)

# Load the image and encode it as base64 for the chat message
image_path = "path/to/your/image.jpg"
with open(image_path, "rb") as f:
    image_bytes = f.read()
image_base64 = base64.b64encode(image_bytes).decode("utf-8")
image = Image.open(image_path).convert("RGB")

messages = [
    {"role": "system", "content": SYSTEM_PROMPT_TEMPLATE},
    {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": f"data:image;base64,{image_base64}"}},
        # "Please look carefully at the problem in the image and answer the corresponding question."
        {"type": "text", "text": "请你仔细观察图像中的题目,并回答对应的问题。"},
    ]},
]

# Render the chat template into a prompt string
text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

# Pass the image pixels alongside the prompt; the rendered text only holds a placeholder
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **inputs,
    max_new_tokens=16384,
    temperature=0.6,
    top_p=0.95,
    top_k=20,
    do_sample=True,
)

# Strip the prompt tokens so that only the newly generated tokens are decoded
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(inputs.input_ids, generated_ids)
]
response = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```
Generation parameters: we suggest sampling with temperature=0.6, top_p=0.95, and top_k=20.
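When the model is served through vLLM instead, the same messages and sampling settings can be sent to its OpenAI-compatible chat endpoint. The following is a minimal sketch using only the standard library. The server address `http://localhost:8000/v1`, the `image/jpeg` MIME type, and the helper names `build_chat_payload` and `post_chat` are assumptions for illustration, so adjust them to your deployment.

```python
import base64
import json
import urllib.request

SYSTEM_PROMPT_TEMPLATE = "You are a helpful assistant."


def build_chat_payload(image_bytes, question,
                       model="netease-youdao/Confucius4",
                       mime_type="image/jpeg"):
    """Build a chat-completions request body with one base64-encoded image."""
    image_base64 = base64.b64encode(image_bytes).decode("utf-8")
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT_TEMPLATE},
            {"role": "user", "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:{mime_type};base64,{image_base64}"}},
                {"type": "text", "text": question},
            ]},
        ],
        # Suggested sampling settings from this card
        "max_tokens": 16384,
        "temperature": 0.6,
        "top_p": 0.95,
        "top_k": 20,
    }


def post_chat(payload, base_url="http://localhost:8000/v1"):
    """POST the payload to an assumed local server and return the parsed JSON."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

A call would then look like `post_chat(build_chat_payload(open("path/to/your/image.jpg", "rb").read(), "请你仔细观察图像中的题目,并回答对应的问题。"))`, which returns the response body as a dictionary.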
After obtaining the model results (e.g., via vLLM API), you can parse out the reasoning and answer as follows.
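```python
# `response` is the HTTP response returned by the chat completions endpoint
choices = response.json()["choices"]
for choice in choices:
    content = choice["message"]["content"]      # final answer (summary)
    reasoning = choice["message"]["reasoning"]  # chain-of-thought (thinking)
```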
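The parsing loop above assumes every choice carries both fields. A slightly more defensive variant, sketched below, works on the already-decoded JSON dictionary and tolerates missing keys. The helper name `extract_outputs` is hypothetical, and the fallback to a `reasoning_content` key is an assumption about servers that expose the thinking text under that name.

```python
def extract_outputs(payload):
    """Return a list of {"reasoning", "answer"} dicts from a chat-completions body."""
    results = []
    for choice in payload.get("choices", []):
        message = choice.get("message", {})
        # Assumed fallback: some servers may name the field `reasoning_content`
        reasoning = message.get("reasoning") or message.get("reasoning_content") or ""
        results.append({
            "reasoning": reasoning,
            "answer": message.get("content") or "",
        })
    return results


if __name__ == "__main__":
    example = {"choices": [{"message": {"content": "x = 2", "reasoning": "Solve 2x = 4."}}]}
    for item in extract_outputs(example):
        print(item["answer"])
```

Returning empty strings rather than raising keeps batch evaluation loops running when a single response lacks a reasoning field.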
Acknowledgements
Many thanks to the ms-swift team for providing an efficient and scalable fine-tuning framework that greatly facilitated the training of this model. We also thank the Qwen team for their foundation models.
Model Declaration
This project is fine-tuned and optimized based on Qwen3.5-27B.
License
This project is released under the Apache License 2.0.
- Free for commercial use, modification, and distribution
- Modified versions must indicate the changes made
- Derivative works must retain the original open-source notice
Citation
If you find our work helpful, please consider citing it.
```bibtex
@misc{confucius4,
  title        = {Confucius4: Advancing Multimodal Reasoning with Iterative SFT-RL Optimization and Compact Chain-of-Thought},
  author       = {NetEase Youdao AI Team},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/netease-youdao/Confucius4}}
}
```