Model Overview

The NVLM 1.0 models were originally trained with the legacy Megatron-LM codebase. In this repo, we reproduce the NVLM 1.0 results using the latest Megatron-Core training code and share the Megatron-Core model weights, training code, and evaluation scripts.

Description

This family of models performs vision-language and text-only tasks including optical character recognition, multimodal reasoning, localization, common sense reasoning, world knowledge utilization, and coding.

This model is ready for non-commercial use.

License/Terms of Use

Governing Terms: Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0).

Additional Information: the LICENSE file in the Qwen/Qwen2-72B-Instruct repository applies to Qwen2-72B-Instruct, and the MIT License applies to InternViT-6B-448px-V1-2.

Model Details

Today (September 17th, 2024), we introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks, rivaling the leading proprietary models (e.g., GPT-4o) and open-access models (e.g., Llama 3-V 405B and InternVL 2). Remarkably, NVLM 1.0 shows improved text-only performance over its LLM backbone after multimodal training.

In this repo, we are open-sourcing the weights of NVLM-1.0-D-72B-mcore, the decoder-only model, for the community. The model is trained with Megatron-Core.

Reference(s)

Paper, Inference Code (HF), Training Code, Website

Benchmark Results

We train our model with Megatron-Core and adapt the codebase to Hugging Face for model hosting, reproducibility, and inference. We observe numerical differences between the Megatron and Hugging Face codebases, which are within the expected range of variation. We provide the results from both the Hugging Face codebase and the Megatron codebase for reproducibility and comparison with other models.

Results (as of September 17th, 2024) in the multimodal benchmarks are as follows:

Vision-language Benchmarks

Model	MMMU (val / test)	MathVista	OCRBench	AI2D	ChartQA	DocVQA	TextVQA	RealWorldQA	VQAv2
NVLM-D 1.0 72B (Megatron-Core)	59.9 / 54.1	67.4	851	94.4	86.9	92.1	81.2	66.8	85.4
Llama 3.2 90B	60.3 / -	57.3	-	92.3	85.5	90.1	-	-	78.1
Llama 3-V 70B	60.6 / -	-	-	93.0	83.2	92.2	83.4	-	79.1
Llama 3-V 405B	64.5 / -	-	-	94.1	85.8	92.6	84.8	-	80.2
InternVL2-Llama3-76B	55.2 / -	65.5	839	94.8	88.4	94.1	84.4	72.2	-
GPT-4V	56.8 / 55.7	49.9	645	78.2	78.5	88.4	78.0	61.4	77.2
GPT-4o	69.1 / -	63.8	736	94.2	85.7	92.8	-	-	-
Claude 3.5 Sonnet	68.3 / -	67.7	788	94.7	90.8	95.2	-	-	-
Gemini 1.5 Pro (Aug 2024)	62.2 / -	63.9	754	94.4	87.2	93.1	78.7	70.4	80.2

Model Architectures

Network Architecture: Decoder-Only Transformer

Text-only LLM backbone: Qwen2-72B-Instruct

Vision encoder: InternViT-6B

Robustness

The model trained on this dataset cannot regenerate its training data:

The model has no image generation capability since its output is only text. Hence it cannot regenerate any image it would have seen during training.
The model cannot regenerate training text data: during training, the model takes text and images as inputs, and the model output (text) is conditioned on both inputs. During inference, without the training images as input, the model would not be able to reproduce any part of the training text data.

Input

Input Type(s): Text, Image
Input Format(s): String, Pillow Library-Supported Formats
Input Dimensions: One-Dimensional (1D), Two-Dimensional (2D)
Other Properties Related to Input: Maximum Token Length = 128K Tokens

Output

Output Type(s): Text
Output Format: String
Model Output: 1D
Other Properties Related to Output: None

How to use

For training code, please refer to Megatron-LM.

Prepare the environment

We provide a Dockerfile for building a Docker image that reproduces our environment.

Evaluation

Run the text generation script.

examples/multimodal/nvlm/run_text_generation_qwen20_72b_internvit_6b.sh --input-image-path /path/to/input/images --output-path /some/output/directory \
    --model-path /path/to/model.pt --gt-path /path/to/groundtruth/file --task generation-task-name --use-tiling

Here, the generation-task-name passed to --task is the name of the evaluation benchmark, such as captioning, MMMU, or TextVQA.

Then, run one of the evaluation scripts from examples/multimodal. For example:

python examples/multimodal/evaluate_mmmu.py --input-path /output/directory/from/generation
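The two steps above can be chained from a small driver. The following is a minimal sketch, not part of the repository: the helper names generation_command and evaluation_command are hypothetical, all paths are placeholders, and only the flags shown in the commands above are used. It builds the argument lists and prints them rather than running them, so the commands can be inspected first.

```python
import shlex

# Script path taken from the generation command above.
GEN_SCRIPT = "examples/multimodal/nvlm/run_text_generation_qwen20_72b_internvit_6b.sh"


def generation_command(task, image_dir, output_dir, model_path, gt_path, use_tiling=True):
    """Build the argument list for the text generation script (hypothetical helper)."""
    cmd = [
        GEN_SCRIPT,
        "--input-image-path", str(image_dir),
        "--output-path", str(output_dir),
        "--model-path", str(model_path),
        "--gt-path", str(gt_path),
        "--task", task,
    ]
    if use_tiling:
        cmd.append("--use-tiling")
    return cmd


def evaluation_command(eval_script, generation_output_dir):
    """Build the argument list for an evaluation script (hypothetical helper)."""
    return ["python", str(eval_script), "--input-path", str(generation_output_dir)]


if __name__ == "__main__":
    # Placeholder paths; replace them with real locations before running anything.
    gen = generation_command(
        task="MMMU",
        image_dir="/path/to/input/images",
        output_dir="/some/output/directory",
        model_path="/path/to/model.pt",
        gt_path="/path/to/groundtruth/file",
    )
    ev = evaluation_command("examples/multimodal/evaluate_mmmu.py", "/some/output/directory")
    # Print shell-quoted commands; pass the lists to subprocess.run to execute them.
    print(shlex.join(gen))
    print(shlex.join(ev))
```

Keeping the commands as lists avoids shell-quoting problems when paths contain spaces, and the same lists can be handed to subprocess.run(cmd, check=True) once the paths are real.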

Software Integration

Runtime Engine(s)

PyTorch

Supported Hardware Microarchitecture Compatibility:

NVIDIA Hopper

Supported Operating System(s):

Linux

Inference

Engine: PyTorch
Test Hardware:

H100

Model Version(s)

v1.0-D (NVLM-D)

Training, Testing, and Evaluation Datasets

Pre-Training Dataset

Link

See Table 4

Data Collection Method by dataset

Hybrid: Automated, Human, Synthetic, Unknown

Labeling Method by dataset

Hybrid: Automated, Human, Synthetic, Unknown

Properties

Trained on image captions, image-text pairs, natural images, charts, documents, scene descriptions, and mathematical reasoning.

Supervised Fine-Tuning Dataset

Link

See Table 6

Data Collection Method by dataset

Hybrid: Automated, Human, Synthetic, Unknown

Labeling Method by dataset

Hybrid: Automated, Human, Synthetic, Unknown

Properties

Trained on image captions; general knowledge; image-text pairs; natural images; charts; diagrams; documents; scene descriptions; science diagrams, lessons, textbook data, and question-answer pairs; visual instruction tuning; and mathematical reasoning.

Evaluation Dataset

Link

See Section 6.1, "Benchmark"

Data Collection Method by dataset

Human

Labeling Method by dataset

Human

Properties

Evaluated on general knowledge, visual question answering, chart understanding, table understanding, optical character recognition, and mathematical reasoning.

Correspondence to

Wenliang Dai* (wdai@nvidia.com), Nayeon Lee* (nayeonl@nvidia.com), Boxin Wang* (boxinw@nvidia.com), Zhuolin Yang* (zhuoliny@nvidia.com), Wei Ping* (wping@nvidia.com)

*Equal contribution

Citation

@article{nvlm2024,
  title={NVLM: Open Frontier-Class Multimodal LLMs},
  author={Dai, Wenliang and Lee, Nayeon and Wang, Boxin and Yang, Zhuolin and Liu, Zihan and Barker, Jon and Rintamaki, Tuomas and Shoeybi, Mohammad and Catanzaro, Bryan and Ping, Wei},
  journal={arXiv preprint},
  year={2024}}

Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report security vulnerabilities or NVIDIA AI Concerns here.
