Confucius4

📑 Confucius4: Advancing Multimodal Reasoning via Joint Optimization of Iterative SFT-RL and Compact Chain-of-Thought

Based on Qwen3.5-27B | License: Apache 2.0

Confucius4 Performance

Figure 1: Confucius4 achieves SOTA among models of comparable scale across multiple visual math benchmarks.


Token Comparison

Figure 2: Confucius4 significantly reduces output token counts across multiple visual math benchmarks.

1 Model Downloads

| Model | HuggingFace | ModelScope |
| --- | --- | --- |
| Confucius4 | 🤗 HuggingFace | ModelScope |

2 Introduction

Confucius4 is an open-source multimodal LLM developed by the NetEase Youdao AI Team, built upon the Qwen3.5 architecture and designed for advanced mathematical reasoning. Key features:

1. Cost-Effective Multimodal Training Set with SFT-RL Iterative Optimization, Achieving SOTA among Comparable-Scale Models

  • Image-gain filtering is used to eliminate low-value visual redundancy, building a cost-effective multimodal training set.
  • An iterative SFT + RL paradigm continuously elevates performance across both text-only and multimodal scenarios.
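The model card does not define how image gain is computed. One plausible reading, sketched below under that assumption, scores each sample by how much a reference model's accuracy improves when the image is provided rather than withheld, and keeps only samples above a threshold. The names `Sample`, `image_gain`, and `filter_by_image_gain`, and the 0.2 threshold, are hypothetical and not taken from the Confucius4 pipeline.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One training problem with accuracies measured by a reference model.

    Assumption: both accuracies are pass rates in [0, 1] over several sampled
    answers. The source does not say how they would be measured.
    """
    sample_id: str
    acc_with_image: float
    acc_text_only: float


def image_gain(sample: Sample) -> float:
    """Hypothetical gain: accuracy improvement attributable to the image."""
    return sample.acc_with_image - sample.acc_text_only


def filter_by_image_gain(samples, min_gain=0.2):
    """Keep samples whose image contributes at least `min_gain` accuracy."""
    return [s for s in samples if image_gain(s) >= min_gain]


samples = [
    Sample("geometry-01", acc_with_image=0.75, acc_text_only=0.25),
    Sample("algebra-07", acc_with_image=0.75, acc_text_only=0.75),
]
kept = filter_by_image_gain(samples)
print([s.sample_id for s in kept])
```

In this toy data the algebra sample is solved equally well without its image, so the image would count as redundant and the sample would be dropped from the multimodal set.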

2. Pure-Text Reasoning Data Augmentation in SFT for Stronger Reasoning Foundation (+23.2 percentage points on Math-Hard-500 over Qwen3.5-27B, 0.582 → 0.814)

  • Pure-text reasoning data injected during SFT reinforces the multimodal model's reasoning backbone.
  • A hybrid "text reasoning + multimodal problem-solving" training paradigm achieves synergistic gains between reasoning capability and multimodal perception.

3. Refined CoT SFT + Length-Aware RL: Achieving Both Accuracy and Efficiency (CoT length reduced by 43.2%)

  • SFT stage: chain-of-thought reasoning is reconstructed to eliminate redundant steps, producing concise yet complete high-quality chains.
  • RL stage: a Length-Aware Advantage mechanism introduces an explore-exploit tradeoff into advantage computation, constraining reasoning length for non-challenging problems and effectively eliminating overthinking.
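The card names the Length-Aware Advantage mechanism but does not give its formula. The sketch below shows one possible shape of such a mechanism, assuming a group-normalized (GRPO-style) advantage: when most sampled responses to a prompt are correct, the problem is treated as non-challenging and longer correct responses are penalized. Hard problems are left unpenalized so that long exploratory reasoning is still rewarded. The function name, the `easy_threshold` and `penalty` parameters, and the linear penalty are all assumptions, not the authors' method.

```python
from statistics import mean, pstdev


def length_aware_advantages(rewards, lengths, easy_threshold=0.5, penalty=0.5):
    """Group-normalized advantages with a length penalty on easy problems.

    rewards: 1.0 for a correct response, 0.0 otherwise (one prompt, many samples).
    lengths: token count of each response.
    Assumption: a problem is "non-challenging" when the group's pass rate is at
    least `easy_threshold`. Only then are correct responses that run longer
    than the shortest correct one penalized.
    """
    pass_rate = mean(rewards)
    correct_lengths = [n for r, n in zip(rewards, lengths) if r > 0]
    shaped = list(rewards)
    if correct_lengths and pass_rate >= easy_threshold:
        shortest, longest = min(correct_lengths), max(correct_lengths)
        span = max(longest - shortest, 1)
        shaped = [
            r - penalty * (n - shortest) / span if r > 0 else r
            for r, n in zip(rewards, lengths)
        ]
    mu, sigma = mean(shaped), pstdev(shaped)
    # Small epsilon avoids division by zero when all shaped rewards are equal.
    return [(s - mu) / (sigma + 1e-6) for s in shaped]


# Easy prompt: three of four samples are correct, so length is penalized.
easy = length_aware_advantages([1.0, 1.0, 1.0, 0.0], [200, 400, 800, 600])
# Hard prompt: one of four samples is correct, so length is ignored.
hard = length_aware_advantages([1.0, 0.0, 0.0, 0.0], [900, 300, 300, 300])
print(easy)
print(hard)
```

On the easy prompt the shortest correct response receives the largest advantage and the longest correct response the smallest among correct ones, while the incorrect response stays lowest. On the hard prompt the single correct response keeps a positive advantage regardless of its length.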

Additionally, through targeted optimization on Chinese-language data, Confucius4 produces outputs that are more aligned with the linguistic habits, cultural context, and expression preferences of Chinese-speaking users.

3 Performance

| Benchmark | Confucius3-Math | Qwen3.5-27B | Qwen3.5-35B-A3B | Qwen3.6-27B | Confucius4 |
| --- | --- | --- | --- | --- | --- |
| Math-Hard-500 | 0.626 | 0.582 | 0.614 | 0.756 | 0.814 |
| Math-Figure | - | 0.866 | 0.834 | 0.865 | 0.907 |
| MathVision (testmini) | - | 0.651 | 0.625 | 0.648 | 0.724 |
| LogicVista | - | 0.734 | 0.741 | 0.743 | 0.779 |
| MathVerse | - | 0.866 | 0.859 | 0.865 | 0.876 |
| MathVista (testmini) | - | 0.874 | 0.864 | 0.871 | 0.874 |
| DynaMath | - | 0.877 | 0.889 | 0.856 | 0.893 |
| We-Math | - | 0.913 | 0.906 | 0.907 | 0.912 |

Notes:

  • Math-Figure is an internally collected, proprietary multimodal problem-solving dataset (664 samples).
  • Math-Hard-500 is a proprietary dataset focusing on challenging middle and high school problems, including a subset of competition-level questions (500 samples).
  • Confucius3-Math is a text-only model and is therefore evaluated only on Math-Hard-500.
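To make the comparison with the base model easier to read, the short script below computes the change from Qwen3.5-27B to Confucius4 in percentage points. The scores are copied from the table above. The helper name `gain_in_points` is just for illustration.

```python
# Scores copied from the performance table: (Qwen3.5-27B, Confucius4).
scores = {
    "Math-Hard-500": (0.582, 0.814),
    "Math-Figure": (0.866, 0.907),
    "MathVision (testmini)": (0.651, 0.724),
    "LogicVista": (0.734, 0.779),
    "MathVerse": (0.866, 0.876),
    "MathVista (testmini)": (0.874, 0.874),
    "DynaMath": (0.877, 0.893),
    "We-Math": (0.913, 0.912),
}


def gain_in_points(base, tuned):
    """Absolute accuracy change in percentage points, rounded to one decimal."""
    return round((tuned - base) * 100, 1)


gains = {name: gain_in_points(base, tuned) for name, (base, tuned) in scores.items()}
for name, gain in gains.items():
    print(f"{name:<24}{gain:+.1f}")
```

The largest gain is on Math-Hard-500 (+23.2 points). MathVista (testmini) is unchanged, and We-Math is 0.1 points below the base model.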

4 Inference

Confucius4 has the same environment requirements as Qwen3.5, so you can load it with Transformers or vLLM for inference and service deployment.

Use the system message and user message templates below when querying the model. Other templates may also work, but they have not been tested.

SYSTEM_PROMPT_TEMPLATE = """You are a helpful assistant."""

Then you can create your messages as follows and use them to request model results. For multimodal input, encode the image as base64 and include it alongside the text question.

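import base64
import io

from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_name = "netease-youdao/Confucius4"

# Note: if AutoModelForCausalLM loads only the text backbone in your Transformers
# version, AutoModelForImageTextToText is the multimodal auto class to try instead.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_name)

# Load the image once: base64 for the message, PIL for the processor
image_path = "path/to/your/image.jpg"
with open(image_path, "rb") as f:
    image_bytes = f.read()
image_base64 = base64.b64encode(image_bytes).decode("utf-8")
image = Image.open(io.BytesIO(image_bytes)).convert("RGB")

messages = [
    {"role": "system", "content": SYSTEM_PROMPT_TEMPLATE},
    {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_base64}"}},
        # Prompt: "Please look carefully at the problem in the image and answer the question."
        {"type": "text", "text": "请你仔细观察图像中的题目,并回答对应的问题。"},
    ]},
]

# Render the chat template to text; it contains a placeholder for the image
text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
# Pass the image pixels together with the text; text alone would drop the image
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **inputs,
    max_new_tokens=16384,
    temperature=0.6,
    top_p=0.95,
    top_k=20,
    do_sample=True
)
# Strip the prompt tokens so that only the newly generated tokens are decoded
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, generated_ids)
]

response = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]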

Generation parameters: we recommend sampling with temperature=0.6, top_p=0.95, and top_k=20.
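When the model is served through vLLM's OpenAI-compatible API instead, the same messages and sampling parameters could be packaged into a request body as sketched below. The helper name `build_chat_payload`, the served model name, and the server address in the comment are assumptions that depend on how the server was launched.

```python
import base64

SYSTEM_PROMPT = "You are a helpful assistant."


def build_chat_payload(image_bytes, question, model="netease-youdao/Confucius4"):
    """Build a /v1/chat/completions request body for an OpenAI-compatible server.

    Assumption: `model` matches the name the server was started with.
    """
    image_b64 = base64.b64encode(image_bytes).decode("utf-8")
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
                {"type": "text", "text": question},
            ]},
        ],
        "max_tokens": 16384,
        "temperature": 0.6,
        "top_p": 0.95,
        "top_k": 20,  # vLLM extension; not part of the standard OpenAI schema
    }


# Three placeholder bytes stand in for real JPEG data in this sketch.
payload = build_chat_payload(b"\xff\xd8\xff", "Solve the problem shown in the image.")

# To send it (assumed local server address):
# import requests
# response = requests.post(
#     "http://localhost:8000/v1/chat/completions", json=payload, timeout=600
# )
```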

After obtaining the model results (e.g., via vLLM API), you can parse out the reasoning and answer as follows.

# Here `response` is the HTTP response returned by the API server
for choice in response.json()["choices"]:
    content = choice["message"]["content"]      # final answer (summary)
    reasoning = choice["message"]["reasoning"]  # reasoning trace (thinking)
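A slightly more defensive version of this parsing step is sketched below. It works on the decoded JSON dictionary and returns the pairs instead of overwriting them in a loop. The helper name `extract_outputs` is hypothetical, and the fallback to a `reasoning_content` field is an assumption about servers that use that older field name.

```python
def extract_outputs(response_json):
    """Return (reasoning, answer) pairs from a chat-completions response dict."""
    results = []
    for choice in response_json.get("choices", []):
        message = choice["message"]
        # Assumption: some server versions name this field "reasoning_content".
        reasoning = message.get("reasoning") or message.get("reasoning_content") or ""
        answer = message.get("content") or ""
        results.append((reasoning, answer))
    return results


# Mock response for illustration only; a real one comes from response.json().
mock = {
    "choices": [
        {"message": {"content": "x = 2", "reasoning": "2x = 4, so x = 2."}}
    ]
}
for reasoning, answer in extract_outputs(mock):
    print("thinking:", reasoning)
    print("answer:", answer)
```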

Acknowledgements

Many thanks to the ms-swift team for providing an efficient and scalable fine-tuning framework that greatly facilitated the training of this model. We also thank the Qwen team for their foundation models.

Model Declaration

This project is fine-tuned and optimized based on Qwen3.5-27B.

License

This project is released under the Apache License 2.0.

  • Free for commercial use, modification, and distribution
  • Modified versions must indicate the changes made
  • Derivative works must retain the original open-source notice

Citation

If you find our work helpful, please consider citing it.

@misc{confucius4,
  title        = {Confucius4: Advancing Multimodal Reasoning with Iterative SFT-RL Optimization and Compact Chain-of-Thought},
  author       = {NetEase Youdao AI Team},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/netease-youdao/Confucius4}}
}