Model Overview

Description:

The NVIDIA llama-nemotron-embed-vl-1b-v2-fp8 is the quantized version of the NVIDIA llama-nemotron-embed-vl-1b-v2, which is developed for multimodal question-answering retrieval. For more information, please check here. The NVIDIA llama-nemotron-embed-vl-1b-v2-fp8 model is quantized with TensorRT Model Optimizer.

This model is ready for commercial use.

License/Terms of use

Use of this model is governed by the NVIDIA Open Model License Agreement. Additional Information: Llama 3.2 Community Model License Agreement.

Deployment Geography:

Global

Use Case:

The llama-nemotron-embed-vl-1b-v2-fp8 is suitable for users who want to build a multimodal question-and-answer application over a large corpus, leveraging the latest dense retrieval technology.
The model takes text or a document image as input and outputs a fixed-size embedding vector. The embedding model is a bi-encoder that accepts context either as text (e.g., the query, or the OCR text of a page or document section) or as the image of a document page.
Typically, the embedding model is first used to embed (vectorize) the whole corpus (document images or text chunks), and the embeddings are stored in a vector database alongside their raw content (image or text). At inference time, the same model is used to embed the query. The embeddings of the query and of relevant context from the corpus should be close in the embedding space.
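
The retrieval flow described above can be sketched in a few lines. This is a minimal illustration, not part of the model's API: the function names, the in-memory dictionary standing in for a vector database, and the toy 3-dimensional vectors are all assumptions made for the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_documents(query_embedding, corpus, top_k=2):
    """Return the top_k (doc_id, score) pairs, most similar first.

    `corpus` maps a document id to its stored embedding. A real deployment
    would delegate this search to a vector database.
    """
    scored = [(doc_id, cosine_similarity(query_embedding, emb))
              for doc_id, emb in corpus.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy 3-dimensional vectors standing in for the 2048-dimensional embeddings.
corpus = {
    "page_robots": [0.9, 0.1, 0.0],
    "page_biology": [0.0, 0.2, 0.9],
    "page_chips": [0.5, 0.5, 0.1],
}
query_embedding = [1.0, 0.2, 0.0]

results = rank_documents(query_embedding, corpus)
for doc_id, score in results:
    print(f"{doc_id}: {score:.3f}")
```

In practice the corpus embeddings are computed once and stored, and only the query is embedded at request time.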

For retrieval deployments that already use document embeddings generated with the BF16 llama-nemotron-embed-vl-1b-v2 checkpoint, re-indexing the document corpus is generally not required when switching to this FP8 checkpoint. We recommend validating retrieval quality on a representative sample before updating a production deployment.
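
One possible way to run the recommended validation is to compare, for a sample of queries, the top-k results returned against the existing BF16-built index when the query is embedded by each checkpoint. The sketch below assumes the embeddings have already been computed; the function names, the overlap metric, and the toy vectors are illustrative choices, not a prescribed procedure.

```python
def top_k_ids(query, index, k):
    """Ids of the k documents with the highest dot-product score."""
    scores = {doc_id: sum(q * d for q, d in zip(query, emb))
              for doc_id, emb in index.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

def mean_overlap_at_k(queries_bf16, queries_fp8, index, k=2):
    """Average fraction of top-k results shared by the two query encoders."""
    overlaps = []
    for q_bf16, q_fp8 in zip(queries_bf16, queries_fp8):
        shared = set(top_k_ids(q_bf16, index, k)) & set(top_k_ids(q_fp8, index, k))
        overlaps.append(len(shared) / k)
    return sum(overlaps) / len(overlaps)

# Existing index built with the BF16 checkpoint (toy vectors).
bf16_index = {"a": [1.0, 0.0], "b": [0.7, 0.7], "c": [0.0, 1.0]}

# The same sample queries embedded by each checkpoint. The small differences
# are made-up values standing in for quantization error.
queries_bf16 = [[0.9, 0.1], [0.1, 0.9]]
queries_fp8 = [[0.88, 0.12], [0.12, 0.91]]

overlap = mean_overlap_at_k(queries_bf16, queries_fp8, bf16_index, k=2)
print(f"mean overlap@2: {overlap:.2f}")
```

An overlap close to 1.0 on a representative sample suggests the existing index can be kept; a task-level metric on labeled queries is a stronger check where labels are available.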

Release Date:

Hugging Face 06/01/2026 via https://huggingface.co/nvidia/llama-nemotron-embed-vl-1b-v2-fp8

Citation

@inproceedings{moreira2025_nvretriever,
    author = {Moreira, Gabriel de Souza P. and Osmulski, Radek and Xu, Mengyao and Ak, Ronay and Schifferer, Benedikt and Oldridge, Even},
    title = {Improving Text Embedding Models with Positive-aware Hard-negative Mining},
    year = {2025},
    isbn = {9798400720406},
    publisher = {Association for Computing Machinery},
    address = {New York, NY, USA},
    url = {https://doi.org/10.1145/3746252.3761254},
    doi = {10.1145/3746252.3761254},
    pages = {2169–2178},
    numpages = {10},
    keywords = {contrastive learning, distillation, embedding models, hard-negative mining, rag, text retrieval, transformers},
    location = {Seoul, Republic of Korea},
    series = {CIKM '25}
}

Model Architecture

Architecture Type: Transformer

Network Architecture: Eagle VLM architecture with Llama 3.2 1B language model and SigLip2 400M image encoder

The llama-nemotron-embed-vl-1b-v2 embedding model is a transformer encoder with approximately 1.7B parameters. It is a fine-tuned version of a model from the NVIDIA Eagle family, using the Llama 3.2 1B language model and the SigLip2 400M image encoder. The language model submodule has 16 layers with an embedding size of 2048 and is pre-trained on public datasets. Embedding models for retrieval are typically trained with a bi-encoder architecture, which encodes the query and the document independently. The model applies mean pooling over the output token embeddings from the language model, so that it outputs a single embedding with 2048 dimensions. Contrastive learning is used to train the embedding model to maximize the similarity between the query and the document page that contains the answer, while minimizing the similarity between the query and sampled negative pages that are not useful for answering the question.
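
To make the pooling step concrete, here is a minimal sketch of mean pooling over token embeddings. It assumes padding positions are excluded through an attention mask, which is common practice but is not stated in this card; the function name and the 4-dimensional toy vectors are illustrative only.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is ignored)."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    return [value / count for value in total]

# Three tokens with 4-dimensional hidden states; the last token is padding.
token_embeddings = [
    [1.0, 2.0, 3.0, 4.0],
    [3.0, 2.0, 1.0, 0.0],
    [9.0, 9.0, 9.0, 9.0],
]
attention_mask = [1, 1, 0]

pooled = mean_pool(token_embeddings, attention_mask)
print(pooled)  # [2.0, 2.0, 2.0, 2.0]
```

In the real model each token vector has 2048 dimensions, so the pooled output is the single 2048-dimensional embedding described above.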

The vision-language model encoder incorporates key innovations from NVIDIA, including Eagle 2 research and nemoretriever-parse, which use a tiling-based VLM architecture. This architecture, available on Hugging Face, significantly enhances multimodal understanding through its dynamic tiling and mixture of vision encoders design. It particularly improves performance on tasks that involve high-resolution images and complex visual content.

Number of model parameters:

  • Llama 3.2 1B language model: 1.23 B (Transformer parameters: 973 M, Token embedding parameters: 262 M)
  • SigLip 2 image encoder: 428.77 M

Input(s):

Input Type(s): Image, Text

Input Format(s):

  • Image: Red, Green, Blue (RGB)
  • Text: String

Input Parameters:

  • Image: Two-Dimensional (2D)
  • Text: One-Dimensional (1D)

Other Properties Related to Input:

  • The maximum context length we evaluated is 10240 tokens.
  • Each image tile consumes 256 tokens. We have tested this model extensively with the config.json settings max_input_tiles = 6 and use_thumbnails = True, so that every image is split into at most 6 tiles plus 1 thumbnail (the whole image at lower resolution), consuming up to 7 × 256 = 1792 visual tokens. If you embed both the page image and text (e.g., page OCR), the sum of the visual tokens and the text tokens should not exceed 10240 tokens.
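
The token budget above can be checked with simple arithmetic. The helper names below are made up for this sketch, and the calculation ignores the few tokens taken by the prefix and special tokens.

```python
TOKENS_PER_TILE = 256        # tokens consumed by each image tile
MAX_CONTEXT_TOKENS = 10240   # maximum evaluated context length

def visual_tokens(num_tiles, use_thumbnail=True):
    """Visual tokens for an image split into num_tiles tiles (+1 thumbnail)."""
    return (num_tiles + (1 if use_thumbnail else 0)) * TOKENS_PER_TILE

def text_token_budget(num_tiles, use_thumbnail=True):
    """Tokens left for text when an image and text are embedded together."""
    return MAX_CONTEXT_TOKENS - visual_tokens(num_tiles, use_thumbnail)

print(visual_tokens(6))      # 1792
print(text_token_budget(6))  # 8448
```

With the tested settings, roughly 8448 tokens remain for the page text in an image+text input.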

Output(s)

Output Type: Floats
Output Format: List of float arrays (embeddings)
Output: Model outputs embedding vectors of maximum dimension 2048 for each input.
Output Parameters:

  • Image/Text Embedding (2D) - embedding of 2048 dimensions

Other Properties Related to Output: Model outputs embedding vectors of maximum dimension 2048 for each input.

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (such as GPU cores) and software frameworks (such as CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

vLLM Usage

The model can be served with vLLM for high-throughput embedding. A chat template must be provided via --chat-template so that the query: / passage: prefix is applied according to the message role. Without it, the prefix is not applied and results will be incorrect.

See the vLLM documentation for full details.

Online Serving

Create the chat template file and start the server:

cat > nemotron-embed-vl.jinja << 'JINJA'
{%- if messages | length > 1 -%}
    {{ raise_exception('Embedding models should only embed one message at a time') }}
{%- endif -%}

{% set vars = namespace(prefix='', images=[], texts=[]) %}
{%- for message in messages -%}
    {%- if message['role'] == 'query' -%}
        {%- set vars.prefix = 'query: ' %}
    {%- elif message['role'] == 'document' -%}
        {%- set vars.prefix = 'passage: ' %}
    {%- endif -%}
    {%- for content in message['content'] -%}
        {%- if content['type'] == 'text' -%}
            {%- set vars.texts = vars.texts + [content['text']] %}
        {%- elif content['type'] == 'image' -%}
            {%- set vars.images = vars.images + ['<image> '] %}
        {%- endif -%}
    {%- endfor -%}
{%- endfor -%}
{{- bos_token }}{{ vars.prefix }}{{ (vars.images + vars.texts) | join('') }}
JINJA

vllm serve nvidia/llama-nemotron-embed-vl-1b-v2-fp8 \
  --trust-remote-code \
  --max-model-len 10240 \
  --chat-template nemotron-embed-vl.jinja

Note: Use --max-model-len 10240 to support all modalities, including image+text. A smaller value such as 2048 can be used when inputs are image-only or text-only.

The chat template uses the message role to apply the correct prefix: set role to "query" for queries (prepends query:) or "document" for passages (prepends passage:).

Send embedding requests:

import requests

url = "http://localhost:8000/v1/embeddings"

# Text query embedding
response = requests.post(url, json={
    "model": "nvidia/llama-nemotron-embed-vl-1b-v2-fp8",
    "messages": [
        {
            "role": "query",
            "content": [
                {"type": "text", "text": "How is AI improving the intelligence and capabilities of robots?"}
            ]
        }
    ],
})
query_embedding = response.json()["data"][0]["embedding"]

# Text document embedding
response = requests.post(url, json={
    "model": "nvidia/llama-nemotron-embed-vl-1b-v2-fp8",
    "messages": [
        {
            "role": "document",
            "content": [
                {"type": "text", "text": "AI enables robots to perceive, plan, and act autonomously."}
            ]
        }
    ],
})
doc_embedding = response.json()["data"][0]["embedding"]

# Image document embedding
response = requests.post(url, json={
    "model": "nvidia/llama-nemotron-embed-vl-1b-v2-fp8",
    "messages": [
        {
            "role": "document",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/page.png"}}
            ]
        }
    ],
})
image_embedding = response.json()["data"][0]["embedding"]

# Image + text document embedding
response = requests.post(url, json={
    "model": "nvidia/llama-nemotron-embed-vl-1b-v2-fp8",
    "messages": [
        {
            "role": "document",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/page.png"}},
                {"type": "text", "text": "AI enables robots to perceive, plan, and act autonomously."}
            ]
        }
    ],
})
image_text_embedding = response.json()["data"][0]["embedding"]
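
The four requests above differ only in the role and the content parts, so the payload construction can be factored into a small helper. The function below is a hypothetical convenience written for this sketch, not part of vLLM or of the model repository. It places the image part before the text part, as in the image + text request above.

```python
def build_embedding_payload(role, text=None, image_url=None,
                            model="nvidia/llama-nemotron-embed-vl-1b-v2-fp8"):
    """Build the JSON body for one /v1/embeddings request (one message only)."""
    if role not in ("query", "document"):
        raise ValueError("role must be 'query' or 'document'")
    if text is None and image_url is None:
        raise ValueError("provide text, image_url, or both")
    content = []
    if image_url is not None:
        content.append({"type": "image_url", "image_url": {"url": image_url}})
    if text is not None:
        content.append({"type": "text", "text": text})
    return {"model": model, "messages": [{"role": role, "content": content}]}

payload = build_embedding_payload(
    "document",
    text="AI enables robots to perceive, plan, and act autonomously.",
    image_url="https://example.com/page.png",
)
print(payload["messages"][0]["role"])                         # document
print([part["type"] for part in payload["messages"][0]["content"]])
```

The result can be sent with requests.post(url, json=payload), exactly as in the requests above.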

Offline / In-Process

For offline usage, format prompts directly with the query: or passage: prefix. Use the <image> placeholder and pass image data via multi_modal_data for image inputs.

from vllm import LLM
from vllm.multimodal.utils import fetch_image

llm = LLM(
    model="nvidia/llama-nemotron-embed-vl-1b-v2-fp8",
    max_model_len=10240,
    trust_remote_code=True,
)

query = "How is AI improving the intelligence and capabilities of robots?"
documents = [
    "AI enables robots to perceive, plan, and act autonomously.",
    "A biological foundation model designed to analyze DNA, RNA, and protein sequences.",
]

# Embed a text query
query_output = llm.embed("query: " + query)
print(f"Query embedding dim: {len(query_output[0].outputs.embedding)}")
# Query embedding dim: 2048

# Embed text documents
doc_outputs = llm.embed(["passage: " + doc for doc in documents])
for doc, output in zip(documents, doc_outputs):
    print(f"Embedding dim: {len(output.outputs.embedding)} | {doc}")
# Embedding dim: 2048 | AI enables robots to perceive, plan, and act autonomously.
# Embedding dim: 2048 | A biological foundation model designed to analyze DNA, RNA, and protein sequences.

# Embed an image document
image = fetch_image("https://developer.download.nvidia.com/images/isaac/nvidia-isaac-lab-1920x1080.jpg")
image_output = llm.embed({
    "prompt": "passage: <image> ",
    "multi_modal_data": {"image": image},
})
print(f"Image embedding dim: {len(image_output[0].outputs.embedding)}")
# Image embedding dim: 2048

# Embed an image + text document
multimodal_output = llm.embed({
    "prompt": "passage: <image> AI enables robots to perceive, plan, and act autonomously.",
    "multi_modal_data": {"image": image},
})
print(f"Multimodal embedding dim: {len(multimodal_output[0].outputs.embedding)}")
# Multimodal embedding dim: 2048

Software Integration:

Runtime Engine(s): vLLM
Supported Hardware Microarchitecture Compatibility: NVIDIA Blackwell, NVIDIA Hopper, NVIDIA Lovelace
Preferred/Supported Operating System(s): Linux

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

Model Version(s):

llama-nemotron-embed-vl-1b-v2-fp8
The model is quantized with nvidia-modelopt v0.42.0

Post Training Quantization:

This model was obtained by quantizing the weights and activations of nvidia/llama-nemotron-embed-vl-1b-v2 to the FP8 data type and is ready for inference with vLLM.

Training, Testing, and Evaluation Datasets:

Dataset Overview

This checkpoint is an FP8 post-training-quantized derivative of nvidia/llama-nemotron-embed-vl-1b-v2. No additional supervised training or fine-tuning data was used to create this FP8 checkpoint.

For this FP8 release, data was used for two purposes: post-training quantization calibration and evaluation. The cnn_dailymail dataset was used for FP8 quantization calibration.

Total Size: 512 samples from the train split of cnn_dailymail were used for calibration.

Time period for training data collection: Not applicable; no additional training data was collected or used for this FP8 checkpoint. An existing public dataset was used for quantization calibration.

Time period for testing data collection: Not newly collected for this release; existing benchmark datasets were used.

Time period for validation data collection: Not newly collected for this release; existing benchmark datasets were used.

The FP8 checkpoint was produced by applying post-training quantization to the BF16 parent model using NVIDIA TensorRT Model Optimizer. Calibration samples were converted into model inputs and used to estimate quantization parameters for weights and activations. This calibration process did not involve supervised training, fine-tuning, or modification of the model objective.
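
As a rough illustration of what estimating quantization parameters can mean for FP8, the sketch below derives a per-tensor scale from the largest absolute value seen in calibration data and then simulates rounding to the E4M3 format (3 mantissa bits, largest finite value 448). Per-tensor max calibration is an assumption made for this example; the card does not state which calibration algorithm or granularity was used, and this code does not call TensorRT Model Optimizer.

```python
import math

E4M3_MAX = 448.0  # largest finite value of the FP8 E4M3 format

def calibrate_scale(calibration_batches):
    """Per-tensor scale from the largest absolute value seen in calibration."""
    amax = max(abs(v) for batch in calibration_batches for v in batch)
    return amax / E4M3_MAX

def round_to_e4m3(x):
    """Round x to the nearest E4M3-representable value."""
    x = max(-E4M3_MAX, min(E4M3_MAX, x))
    if x == 0.0:
        return 0.0
    # -6 is the smallest normal exponent; subnormals share its spacing.
    exponent = max(math.floor(math.log2(abs(x))), -6)
    step = 2.0 ** (exponent - 3)  # spacing between neighbouring values
    return round(x / step) * step

def fake_quantize(values, scale):
    """Quantize to E4M3 and dequantize again, exposing the rounding error."""
    return [round_to_e4m3(v / scale) * scale for v in values]

# Made-up activation values standing in for calibration samples.
calibration_batches = [[0.5, -1.2, 3.5], [2.0, -7.0, 0.1]]
scale = calibrate_scale(calibration_batches)
print(scale)                                   # 0.015625
print(fake_quantize([1.0, -7.0, 0.3], scale))  # [1.0, -7.0, 0.3125]
```

The difference between 0.3 and 0.3125 in the last line is the kind of rounding error that the accuracy comparison against the BF16 baseline is meant to quantify.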

Training Dataset:

Data Collection Method by dataset

  • Not Applicable

Labeling Method by dataset

  • Not Applicable

Properties: Not Applicable

Evaluation Dataset:

Quantization Benchmark Scores:

In this section, we compare the performance of the quantized model llama-nemotron-embed-vl-1b-v2-fp8 with the baseline model llama-nemotron-embed-vl-1b-v2.

The ViDoRe V3, KoViDoRe, and ZhViDoRe (an internal Chinese visual document retrieval benchmark) benchmarks were used to evaluate retrieval quality relative to the BF16 baseline model. Evaluation inputs were formatted as text-only, image-only, and image-plus-text examples to compare retrieval accuracy between the FP8 checkpoint and the BF16 baseline.

The table below presents the FP8 quantized model's accuracy relative to the baseline model for different input modalities. Note: The image+text modality means that both the page image and its text (which may be extracted by an OCR library such as NV-Ingest) are fed as input to the embedding model for more accurate representation and retrieval. Accuracy numbers were measured with vLLM v0.19.0 on an H100 GPU.

FP8 model accuracy relative to baseline BF16 model, on Visual Document Retrieval benchmarks. Chinese: ZhViDoRe (internal); Korean: KoViDoRe; English/French: ViDoRe V3.

| Modality \ Dataset | All | Chinese/Korean | English/French |
|---|---|---|---|
| image+text | 99.32% | 98.42% | 99.55% |
| image | 99.07% | 98.21% | 99.20% |
| text | 99.61% | 101% | 99.25% |
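
The percentages above read most naturally as the FP8 score divided by the BF16 score on the same benchmark; that interpretation is an assumption, since the card does not give the formula or the underlying metric. Under it, a value above 100% simply means the FP8 checkpoint scored slightly higher than the baseline on that subset. The scores in the sketch are hypothetical and are not taken from the evaluation.

```python
def relative_accuracy(fp8_score, bf16_score):
    """FP8 score expressed as a percentage of the BF16 baseline score."""
    return 100.0 * fp8_score / bf16_score

# Hypothetical metric values, not taken from the evaluation above.
print(f"{relative_accuracy(0.596, 0.600):.2f}%")  # 99.33%
print(f"{relative_accuracy(0.505, 0.500):.2f}%")  # 101.00%
```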

Data Collection Method by dataset

  • Hybrid: Human, Automated, Synthetic

Labeling Method by dataset

  • Hybrid: Human, Automated, Synthetic

Properties: ZhViDoRe comprises 922 queries, KoViDoRe comprises 706 queries, and ViDoRe V3 comprises 14,514 queries

Inference:

Acceleration Engine: vLLM
Test Hardware:

  • H100 SXM

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. Developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. Please make sure you have proper rights and permissions for all input image content; if an image includes people, personal health information, or intellectual property, the image or video generated will not blur or maintain proportions of the image subjects included.

For more detailed information on ethical considerations for this model, see the Model Card++ tab for the Explainability, Bias, Safety & Security, and Privacy subcards.

Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.

Bias

Field Response
Participation considerations from adversely impacted groups (protected classes) in model design and testing: None
Measures taken to mitigate against unwanted bias: None

Explainability

Field Response
Intended Application & Domain: Document and query embedding for question and answer retrieval.
Model Type: Transformer encoder.
Intended User: Generative AI creators working with conversational AI models. Users who want to build a question and answer application over a large corpus, leveraging the latest dense retrieval technologies. The corpus can be images of PDFs, such as text, tables, charts or infographics, or extracted plain text.
Output: Array of float numbers (Dense Vector Representation for the input text).
Describe how the model works: Model transforms the input into a dense vector representation.
Technical Limitations: The model's max sequence length is 10240. Longer text inputs should be truncated.
Name the adversely impacted groups this has been tested to deliver comparable outcomes regardless of: N/A
Verified to have met prescribed NVIDIA quality standards: Yes
Performance Metrics: Accuracy, Throughput, and Latency.
Potential Known Risks: This model does not guarantee to always retrieve the correct passage(s) for a given query.
Licensing & Terms of Use: The use of this model is governed by the NVIDIA Open Model License Agreement, and the post-processing scripts are licensed under Apache 2.0. Additional Information: Llama 3.2 Community Model License Agreement. Built with Llama.

Privacy

Field Response
Generatable or reverse engineerable personal data? None
Personal data used to create this model? None Known
How often is dataset reviewed? Dataset is initially reviewed upon addition, and subsequent reviews are conducted as needed or upon request for changes.
Was data from user interactions with the AI model (e.g. user input and prompts) used to train the model? No
Is there provenance for all datasets used in training? Yes
Does data labeling (annotation, metadata) comply with privacy laws? Yes
Is data compliant with data subject requests for data correction or removal, if such a request was made? No, not possible with externally-sourced data.
Applicable Privacy Policy: https://www.nvidia.com/en-us/about-nvidia/privacy-policy/

Safety & Security

Field Response
Model Application(s): Document Embedding for Retrieval. User queries can be text and documents can be text, document page images, charts, tables, and infographics.
Describe the life critical impact (if present): Not applicable
Use Case Restrictions: The use of this model is governed by the NVIDIA Open Model License Agreement, and the post-processing scripts are licensed under Apache 2.0. Additional Information: Llama 3.2 Community Model License Agreement. Built with Llama.
Model and dataset restrictions: The Principle of Least Privilege (PoLP) is applied, limiting access for dataset generation and model development. Restrictions enforce dataset access during training, and dataset license constraints are adhered to.