InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
Abstract
We introduce InternVL3, a significant advancement in the InternVL series featuring a native multimodal pre-training paradigm. Rather than adapting a text-only large language model (LLM) into a multimodal large language model (MLLM) that supports visual inputs, InternVL3 jointly acquires multimodal and linguistic capabilities from both diverse multimodal data and pure-text corpora during a single pre-training stage. This unified training paradigm effectively addresses the complexities and alignment challenges commonly encountered in conventional post-hoc training pipelines for MLLMs. To further improve performance and scalability, InternVL3 incorporates variable visual position encoding (V2PE) to support extended multimodal contexts, employs advanced post-training techniques such as supervised fine-tuning (SFT) and mixed preference optimization (MPO), and adopts test-time scaling strategies alongside an optimized training infrastructure. Extensive empirical evaluations demonstrate that InternVL3 delivers superior performance across a wide range of multimodal tasks. In particular, InternVL3-78B achieves a score of 72.2 on the MMMU benchmark, setting a new state-of-the-art among open-source MLLMs. Its capabilities remain highly competitive with leading proprietary models, including ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro, while also maintaining strong pure-language proficiency. In pursuit of open-science principles, we will publicly release both the training data and model weights to foster further research and development in next-generation MLLMs.
Community
Also see these papers for more details about the training strategies of InternVL3:
- MPO: https://huggingface.co/papers/2411.10442
- V2PE: https://huggingface.co/papers/2412.09616
- VisualPRM: https://huggingface.co/papers/2503.10291
Wonderful
Nice work. How many languages does this model support?
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs (2025)
- M2-omni: Advancing Omni-MLLM for Comprehensive Modality Support with Competitive Performance (2025)
- Shakti-VLMs: Scalable Vision-Language Models for Enterprise AI (2025)
- LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning (2025)
- SmolVLM: Redefining small and efficient multimodal models (2025)
- VLMT: Vision-Language Multimodal Transformer for Multimodal Multi-hop Question Answering (2025)
- BREEN: Bridge Data-Efficient Encoder-Free Multimodal Learning with Learnable Queries (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space.
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot recommend
Nice work. In Table 12, shouldn't V2PE with δ = 1 be equivalent to the conventional positional encoding used in InternVL2.5, which serves as the baseline without V2PE? Why does δ = 1 result in a 0.5 gain (75.2 to 75.7)? Does the baseline without V2PE mean it does not use positional encoding at all?
Thanks for your question! Table 12 actually includes two separate experiments: one without V2PE (the first row) and one with V2PE (the following five rows).
In the baseline experiment (without V2PE), we adopt the standard positional encoding: each visual token is simply assigned a position index incremented by 1. In the V2PE setting, the model is trained with dynamic sampling of different δ values (1, 1/2, 1/4, ..., 1/128, 1/256), as described in Eq. 4. At inference time, however, we fix δ to a specific value to evaluate its impact.
So to clarify:
- The first row represents the baseline model trained without V2PE, using conventional positional encoding.
- The next five rows correspond to the same V2PE-trained model evaluated under different δ values.
Thus, the δ = 1 setting in the lower part of the table is not the same as the baseline: it is the V2PE-trained model evaluated with a fixed δ = 1 at inference time.
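To make the difference between the two settings concrete, here is a minimal, hypothetical Python sketch (not taken from the InternVL3 codebase) of how position indices could be assigned when visual tokens advance the index by a fractional stride δ; the function name `assign_positions` and the example δ values are illustrative only.

```python
# Minimal, hypothetical sketch (not the authors' code) of V2PE-style position
# assignment: text tokens advance the position index by 1, while visual tokens
# advance it by a smaller stride delta, so long image sequences consume less of
# the positional range. With delta = 1 the indices match conventional positional
# encoding, but the V2PE-trained model has additionally seen many delta values
# during training, which is why its delta = 1 result can differ from the
# non-V2PE baseline.

def assign_positions(token_types, delta=0.25, start=0.0):
    """token_types: iterable of 'text' or 'visual' labels, one per token."""
    positions, pos = [], start
    for t in token_types:
        positions.append(pos)
        pos += 1.0 if t == "text" else delta
    return positions

if __name__ == "__main__":
    seq = ["text", "text"] + ["visual"] * 4 + ["text"]
    print(assign_positions(seq, delta=1.0))   # conventional-style: 0, 1, 2, 3, 4, 5, 6
    print(assign_positions(seq, delta=0.25))  # V2PE-style: 0, 1, 2, 2.25, 2.5, 2.75, 3.0
```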