InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
Abstract
We introduce InternVL3, a significant advancement in the InternVL series featuring a native multimodal pre-training paradigm. Rather than adapting a text-only large language model (LLM) into a multimodal large language model (MLLM) that supports visual inputs, InternVL3 jointly acquires multimodal and linguistic capabilities from both diverse multimodal data and pure-text corpora during a single pre-training stage. This unified training paradigm effectively addresses the complexities and alignment challenges commonly encountered in conventional post-hoc training pipelines for MLLMs. To further improve performance and scalability, InternVL3 incorporates variable visual position encoding (V2PE) to support extended multimodal contexts, employs advanced post-training techniques such as supervised fine-tuning (SFT) and mixed preference optimization (MPO), and adopts test-time scaling strategies alongside an optimized training infrastructure. Extensive empirical evaluations demonstrate that InternVL3 delivers superior performance across a wide range of multimodal tasks. In particular, InternVL3-78B achieves a score of 72.2 on the MMMU benchmark, setting a new state-of-the-art among open-source MLLMs. Its capabilities remain highly competitive with leading proprietary models, including ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro, while also maintaining strong pure-language proficiency. In pursuit of open-science principles, we will publicly release both the training data and model weights to foster further research and development in next-generation MLLMs.
Community
Also see these papers for more details about the training strategies of InternVL3:
- MPO: https://huggingface.co/papers/2411.10442
- V2PE: https://huggingface.co/papers/2412.09616
- VisualPRM: https://huggingface.co/papers/2503.10291
Wonderful
Nice work. How many languages does this model support?
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs (2025)
- M2-omni: Advancing Omni-MLLM for Comprehensive Modality Support with Competitive Performance (2025)
- Shakti-VLMs: Scalable Vision-Language Models for Enterprise AI (2025)
- LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning (2025)
- SmolVLM: Redefining small and efficient multimodal models (2025)
- VLMT: Vision-Language Multimodal Transformer for Multimodal Multi-hop Question Answering (2025)
- BREEN: Bridge Data-Efficient Encoder-Free Multimodal Learning with Learnable Queries (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space.
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot recommend
Nice work. In Table 12, shouldn't V2PE with δ = 1 be equivalent to the conventional positional encoding used in InternVL2.5, which serves as the baseline without V2PE? Why does δ = 1 result in a 0.5 gain (75.2 to 75.7)? Does the baseline without V2PE mean it does not use positional encoding at all?
Thanks for your question! Table 12 actually includes two separate experiments: one without V2PE (the first row) and one with V2PE (the following five rows).
In the baseline experiment (without V2PE), we adopt the standard positional encoding: each visual token is simply assigned a position index incremented by 1. In the V2PE setting, the model is trained with dynamic sampling of different δ values (1, 1/2, 1/4, ..., 1/128, 1/256), as described in Eq. 4. At inference time, however, we fix δ to a specific value to evaluate its impact.
So to clarify:
- The first row represents the baseline model trained without V2PE, using conventional positional encoding.
- The next five rows correspond to the same V2PE-trained model evaluated under different δ values.
Thus, the δ = 1 setting in the lower part of the table is not the same as the baseline: it is the V2PE-trained model evaluated with a fixed δ = 1 at inference time.
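To make the difference between the two settings concrete, here is a minimal, hypothetical Python sketch (not taken from the InternVL3 codebase) of how position indices could be assigned when visual tokens advance the index by a fractional stride δ; the function name `assign_positions` and the example δ values are illustrative only.

```python
# Minimal, hypothetical sketch (not the authors' code) of V2PE-style position
# assignment: text tokens advance the position index by 1, while visual tokens
# advance it by a smaller stride delta, so long image sequences consume less of
# the positional range. With delta = 1 the indices match conventional positional
# encoding, but the V2PE-trained model has additionally seen many delta values
# during training, which is why its delta = 1 result can differ from the
# non-V2PE baseline.

def assign_positions(token_types, delta=0.25, start=0.0):
    """token_types: iterable of 'text' or 'visual' labels, one per token."""
    positions, pos = [], start
    for t in token_types:
        positions.append(pos)
        pos += 1.0 if t == "text" else delta
    return positions

if __name__ == "__main__":
    seq = ["text", "text"] + ["visual"] * 4 + ["text"]
    print(assign_positions(seq, delta=1.0))   # conventional-style: 0, 1, 2, 3, 4, 5, 6
    print(assign_positions(seq, delta=0.25))  # V2PE-style: 0, 1, 2, 2.25, 2.5, 2.75, 3.0
```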