Nemotron-TwoTower-30B-A3B-Base-BF16

Category-level comparison between the Nemotron-3-Nano-30B-A3B autoregressive baseline and Nemotron-TwoTower Diffusion.

Model Overview

Model Developer: NVIDIA Corporation

Model Dates: September 2025 – April 2026

Data Freshness: The pre-training data has a cutoff date of June 25, 2025.

Nemotron-TwoTower-30B-A3B-Base-BF16 is a block-wise autoregressive diffusion language model built on the NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16 backbone. It generates text by iteratively denoising blocks of tokens in parallel rather than one token at a time.

This model is ready for commercial use.

Description

Nemotron-TwoTower uses two towers:

- Context tower (AR/context): a frozen, causal autoregressive tower that processes the clean prompt and all previously committed tokens, producing the per-layer attention KV cache and the final Mamba-2 states.
- Denoiser tower (diffusion/denoiser): a trainable tower that generates one block of tokens at a time via mask diffusion, refining noisy blocks with bidirectional in-block attention, layer-aligned cross-attention to the context tower, and context-seeded Mamba states.

Both towers start as copies of the same 52-layer hybrid Mamba-2 / attention / MoE backbone; only the diffusion/denoiser tower is trained, and the AR/context tower stays frozen. The denoiser is trained on ~2.1T tokens, compared with ~25T tokens for backbone pre-training. At the default operating point, Nemotron-TwoTower retains 98.7% of the autoregressive baseline's aggregate benchmark quality and delivers 2.42× its wall-clock generation throughput.

What is Nemotron?

NVIDIA Nemotron™ is a family of open models with open weights, training data, and recipes, delivering leading efficiency and accuracy for building specialized AI agents.

Nemotron-TwoTower: Diffusion LLM with Autoregressive Context

      AR / Context Tower            Diffusion / Denoiser Tower

          clean tokens                   noisy token blocks
               │                                 │
       ┌───────▼───────┐                 ┌───────▼───────┐
       │   Embedding   │                 │   Embedding   │
       └───────┬───────┘                 └───────┬───────┘
               │                                 │
       ┌───────▼───────┐    KV + Mamba   ┌───────▼───────┐
       │ Mamba-2/Attn  │─────states─────▶│ Mamba-2/Attn  │
       └───────┬───────┘                 └───────┬───────┘
               │                                 │
       ┌───────▼───────┐                 ┌───────▼───────┐
       │      MoE      │                 │      MoE      │
       └───────┬───────┘                 └───────┬───────┘
               │                                 │
       ┌───────▼───────┐                 ┌───────▼───────┐
       │  Output Head  │                 │  Output Head  │
       └───────┬───────┘                 └───────┬───────┘
               │                                 │
         logits / loss                     logits / loss
          (optional)

The AR/context tower (left) runs causally over clean token blocks and exposes, at every layer, the attention KV cache and the Mamba-2 conv/SSM boundary states. The diffusion/denoiser tower (right) consumes a noisy block; at each layer it cross-attends to the corresponding layer of the context tower (KV) and seeds its Mamba-2 layer from the corresponding context Mamba state.

Two-Tower Generation Modes

| Mode | Description | Tokens / step | API |
|---|---|---|---|
| Mask Diffusion | Diffusion/denoiser mode: block-wise iterative denoising with confidence-based unmasking. | up to `block_size` | `generate_mask_diffusion()` |
| Mock-AR | Two-tower autoregressive: AR/context tower builds the cache, diffusion/denoiser predicts the next token. | 1 | `generate_mock_ar()` |
| AR | Standard autoregressive generation using the AR/context tower only (single GPU). | 1 | `generate_ar()` |

How mask diffusion works

Generation is block-wise autoregressive: the AR/context tower encodes the prompt once, then the diffusion/denoiser fills one block of block_size positions at a time. For each new block:

1. Initialize the block as all [MASK] tokens (`mask_token_id`).
2. For `steps_per_block` denoising iterations:
   - Compute the diffusion timestep t = current masked fraction of the block, and feed it to the time-conditioned denoiser (adaLN-single modulation: a global MLP maps t to per-layer scale/shift/gate, PixArt-α style).
   - Run the diffusion/denoiser over the whole block (bidirectional in-block self-attention + layer-aligned causal cross-attention to the AR/context cache; Mamba-2 chunk-scan seeded from the context state).
   - Constrain to p(x₀ | xₜ) (mask token forbidden; already-decoded positions fixed), then commit the high-confidence positions (all above `confidence_threshold`, with a floor that guarantees completion within `steps_per_block`) and re-mask the rest (confidence unmasking).
3. Commit the resolved block, advance the AR/context tower over it to update the KV + Mamba caches, and continue with the next block.

Each step predicts all masked positions in parallel and commits the confident subset, so multiple tokens may be committed per step.
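The per-block loop above can be sketched in plain Python. This is a toy illustration under stated assumptions, not the model's implementation: `denoise_block`, `toy_denoiser`, and `MASK_ID` are hypothetical names, the denoiser is a hand-written stand-in that returns fixed probabilities, and the completion floor used here (commit at least ceil(masked / steps left) positions per step) is one possible rule, since the card does not spell out the exact floor.

```python
import math

MASK_ID = 0  # hypothetical mask token id for this toy example


def toy_denoiser(block, t):
    """Stand-in for the denoiser: returns one probability row per position.

    Even positions are always confident; odd positions become confident only
    once their left neighbour has been committed. `t` (masked fraction) is
    accepted for interface parity but unused by this toy.
    """
    vocab_size = 4
    rows = []
    for i in range(len(block)):
        target = 1 + (i % 3)
        left_known = i > 0 and block[i - 1] != MASK_ID
        conf = 0.9 if i % 2 == 0 else (0.95 if left_known else 0.6)
        row = [0.8 * (1 - conf) / 2] * vocab_size
        row[MASK_ID] = 0.2          # mass on [MASK], removed by the constraint below
        row[target] = 0.8 * conf
        rows.append(row)
    return rows


def denoise_block(denoiser, block_size, steps_per_block, confidence_threshold):
    """Fill one block by confidence unmasking; returns (block, commits per step)."""
    block = [MASK_ID] * block_size
    commits_per_step = []
    for step in range(steps_per_block):
        masked = [i for i, tok in enumerate(block) if tok == MASK_ID]
        if not masked:
            break
        t = len(masked) / block_size          # timestep = current masked fraction
        probs = denoiser(block, t)

        best = {}
        for i in masked:
            row = probs[i]
            # Constrain to p(x0 | xt): forbid the mask token and renormalize.
            candidates = [v for v in range(len(row)) if v != MASK_ID]
            norm = sum(row[v] for v in candidates)
            token = max(candidates, key=lambda v: row[v])
            best[i] = (row[token] / norm, token)

        # Assumed floor: enough commits per step to finish within the budget.
        steps_left = steps_per_block - step
        floor = math.ceil(len(masked) / steps_left)
        ranked = sorted(masked, key=lambda i: best[i][0], reverse=True)
        commit = [i for i in ranked if best[i][0] >= confidence_threshold]
        if len(commit) < floor:
            commit = ranked[:floor]

        for i in commit:
            block[i] = best[i][1]             # everything else stays masked
        commits_per_step.append(len(commit))
    return block, commits_per_step


if __name__ == "__main__":
    tokens, commits = denoise_block(toy_denoiser, block_size=8,
                                    steps_per_block=8, confidence_threshold=0.8)
    print(tokens, commits)
```

In this toy run the block resolves in fewer steps than `steps_per_block` because several positions clear the threshold at once, which is the source of the speed-up over one-token-per-step decoding. With a uniformly unconfident denoiser the floor takes over and the block still completes on the final step.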

License/Terms of Use

GOVERNING TERMS: Use of this model is governed by the NVIDIA Nemotron Open Model License Agreement.

Benchmark Evaluations

Default operating point: confidence unmasking, threshold γ = 0.8, block size S = 16, BF16 on 2×H100 GPUs. At this point Nemotron-TwoTower retains 98.7% of the AR baseline's aggregate benchmark quality and reaches 2.42× the AR baseline's wall-clock generation throughput. Lowering the confidence threshold commits more tokens per step and raises throughput, at the cost of lower quality.

Per-task results below compare the autoregressive backbone (AR/context baseline) against Nemotron-TwoTower (diffusion/denoiser decoding).

| Task | NVIDIA-Nemotron-3-Nano-30B-A3B-Base (AR baseline) | Nemotron-TwoTower-30B-A3B-Base (diffusion) |
|---|---|---|
| General Knowledge | | |
| MMLU (5-shot, acc) | 78.56 | 78.24 |
| MMLU-Pro (5-shot, CoT EM) | 62.59 | 60.93 |
| Commonsense Understanding | | |
| ARC-Challenge (25-shot, acc_norm) | 91.72 | 92.66 |
| WinoGrande (5-shot, acc) | 76.09 | 76.09 |
| Reading Comprehension | | |
| RACE (0-shot, acc) | 88.90 | 88.90 |
| Code | | |
| HumanEval (0-shot) | 79.27 | 75.58 |
| MBPP-Sanitized (3-shot) | 74.71 | 74.28 |
| Math | | |
| GSM8K (8-shot, acc) | 92.49 | 90.14 |
| MATH-500 (4-shot) | 84.40 | 80.60 |
| Multilingual | | |
| MMLU Global Lite (5-shot, avg acc) | 73.97 | 73.94 |
| MGSM (8-shot, avg acc) | 80.80 | 80.40 |
| Aggregate | | |
| Quality retained (% of AR baseline) | 100% | 98.7% |
| Generation throughput (× AR baseline) | 1.0× | 2.42× |
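The card does not state how the "quality retained" aggregate is computed. As a sanity check only, the sketch below tries two plausible aggregations over the eleven per-task scores in the table: the mean of per-task ratios and the ratio of the mean scores. Both function names are hypothetical, and neither is claimed to be the method actually used.

```python
# (AR baseline, diffusion) scores copied from the per-task table.
SCORES = {
    "MMLU": (78.56, 78.24),
    "MMLU-Pro": (62.59, 60.93),
    "ARC-Challenge": (91.72, 92.66),
    "WinoGrande": (76.09, 76.09),
    "RACE": (88.90, 88.90),
    "HumanEval": (79.27, 75.58),
    "MBPP-Sanitized": (74.71, 74.28),
    "GSM8K": (92.49, 90.14),
    "MATH-500": (84.40, 80.60),
    "MMLU Global Lite": (73.97, 73.94),
    "MGSM": (80.80, 80.40),
}


def mean_of_ratios(scores):
    """Average of diffusion/AR ratios across tasks, in percent."""
    ratios = [diff / ar for ar, diff in scores.values()]
    return 100.0 * sum(ratios) / len(ratios)


def ratio_of_means(scores):
    """Mean diffusion score divided by mean AR score, in percent."""
    ar_total = sum(ar for ar, _ in scores.values())
    diff_total = sum(diff for _, diff in scores.values())
    return 100.0 * diff_total / ar_total


if __name__ == "__main__":
    print(f"mean of ratios: {mean_of_ratios(SCORES):.2f}%")
    print(f"ratio of means: {ratio_of_means(SCORES):.2f}%")
```

Both aggregations land within a few hundredths of a percentage point of each other and are consistent with the reported 98.7% after rounding, so the headline figure does not appear sensitive to this choice.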

Model Architecture

Architecture Type: Two-Tower Block-Diffusion over a Mamba2-Transformer Hybrid Mixture of Experts (MoE) backbone
Backbone: NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16
Layers per tower: 52 — 23 Mamba-2, 6 self-attention, 23 MoE
Number of model parameters: ~60B total (30B AR/context tower + 30B diffusion/denoiser tower); the released checkpoint ships both towers
Active parameters per token: ~3B per tower, 128 routable experts of which 6 are activated, with 2 shared experts.
Denoiser-only modifications vs. the backbone:
- Bidirectional in-block attention — noisy tokens attend bidirectionally within the current block, causally to past clean blocks (no added parameters).
- Layer-aligned cross-attention — denoiser layer i attends to context-tower layer i's KV.
- Context-seeded Mamba-2 — denoiser Mamba layers seed their initial conv/SSM state from the corresponding context Mamba state (causal; the bidirectional-Mamba variant is not used).
- adaLN-single time conditioning — the diffusion timestep t modulates every denoiser layer (≈1.5M added parameters; replicated per tensor-parallel rank).
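The attention pattern in the first two bullets can be made concrete with boolean masks. The sketch below takes a training-time view in which a sequence is split into fixed-size blocks, and builds one mask for noisy-to-noisy attention (bidirectional within a block) and one for noisy-to-clean cross-attention (only strictly earlier blocks). `denoiser_attention_masks` is a hypothetical helper; how the released code actually materializes these masks is not described in the card.

```python
def block_of(position, block_size):
    """Index of the block that a token position belongs to."""
    return position // block_size


def denoiser_attention_masks(seq_len, block_size):
    """Return (self_mask, cross_mask) as nested lists of booleans.

    self_mask[q][k]:  noisy query q may attend to noisy key k
                      (same block only, in both directions).
    cross_mask[q][k]: noisy query q may attend to clean context key k
                      (only keys from strictly earlier blocks).
    """
    self_mask = [
        [block_of(q, block_size) == block_of(k, block_size) for k in range(seq_len)]
        for q in range(seq_len)
    ]
    cross_mask = [
        [block_of(k, block_size) < block_of(q, block_size) for k in range(seq_len)]
        for q in range(seq_len)
    ]
    return self_mask, cross_mask


if __name__ == "__main__":
    self_mask, cross_mask = denoiser_attention_masks(seq_len=4, block_size=2)
    for name, mask in (("self", self_mask), ("cross", cross_mask)):
        print(name)
        for row in mask:
            print("".join("x" if allowed else "." for allowed in row))
```

At inference time only one noisy block exists at a time and every cached context position precedes it, so the cross mask degenerates to "attend to the whole cache" and the self mask to a full square over the block.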

Training Methodology

Nemotron-TwoTower is produced by adapting a pretrained autoregressive backbone into a block-wise diffusion generator — only the diffusion/denoiser tower is trained; the AR/context tower is optionally trainable, but kept frozen here.

Stage 1: Backbone pre-training (AR). The single-tower NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16 is pre-trained from scratch with next-token prediction (~25T tokens).
Stage 2: Two-tower denoiser training (diffusion). Both towers are initialized from the same backbone checkpoint. The AR/context tower is frozen; the diffusion/denoiser tower is trained under a masked-diffusion objective (mean negative log-likelihood over masked positions), conditioned on the context tower's per-layer KV cache and Mamba boundary states. Training follows the backbone's two-phase data curriculum (broad phase-1 blend → higher-quality phase-2 blend), over ~2.1T tokens total.

For the released checkpoint, Stage 2 itself runs as three consecutive sub-stages: phase-1 adaptation at block size S=32, phase-2 continuation at S=32, and a final phase-2 continuation at S=16 (the default sampling block size).
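The Stage 2 objective, mean negative log-likelihood over masked positions, can be illustrated with a few lines of plain Python. This is a minimal sketch with a hypothetical `masked_nll` helper operating on precomputed log-probabilities; it assumes an unweighted mean, since the card does not say whether any timestep-dependent weighting is applied.

```python
import math


def masked_nll(log_probs, targets, is_masked):
    """Mean negative log-likelihood of the clean tokens at masked positions.

    log_probs[i][v]: predicted log-probability of token v at position i
    targets[i]:      clean (ground-truth) token id at position i
    is_masked[i]:    True if position i was masked in the noisy input
    """
    losses = [
        -log_probs[i][targets[i]]
        for i in range(len(targets))
        if is_masked[i]
    ]
    if not losses:          # nothing masked: no positions contribute
        return 0.0
    return sum(losses) / len(losses)


if __name__ == "__main__":
    log_probs = [
        [math.log(0.5), math.log(0.5)],
        [math.log(0.25), math.log(0.75)],
        [math.log(0.9), math.log(0.1)],
    ]
    targets = [0, 1, 1]
    is_masked = [True, True, False]   # the third position is visible, so it is ignored
    print(masked_nll(log_probs, targets, is_masked))
```

Unmasked positions contribute nothing to the loss, which is why the poor prediction at the third position in the example does not affect the result.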

Precision: BF16. Optimizer: AdamW with a Warmup-Stable-Decay schedule (peak LR 1e-4, final LR 1e-6), reset at phase boundaries.
Software used for training: Megatron-LM
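A Warmup-Stable-Decay schedule with the stated peak (1e-4) and final (1e-6) learning rates might look like the sketch below. Only those two values come from the card; the `wsd_lr` function, the linear warmup and decay shapes, and the step counts in the example are illustrative assumptions. Because the schedule is reset at phase boundaries, each phase would call it with its own step counter.

```python
def wsd_lr(step, total_steps, warmup_steps, decay_steps,
           peak_lr=1e-4, final_lr=1e-6):
    """Learning rate at `step` for one phase of a Warmup-Stable-Decay schedule.

    Assumed shape: linear warmup from 0 to peak_lr, a constant plateau,
    then linear decay to final_lr over the last `decay_steps` steps.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    decay_start = total_steps - decay_steps
    if step < decay_start:
        return peak_lr
    frac = min(1.0, (step - decay_start) / decay_steps)
    return peak_lr * (1.0 - frac) + final_lr * frac


if __name__ == "__main__":
    # Hypothetical phase of 1000 steps: 100 warmup, 200 decay.
    for s in (0, 50, 100, 500, 800, 900, 1000):
        print(s, wsd_lr(s, total_steps=1000, warmup_steps=100, decay_steps=200))
```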

Input

Input Type(s): Text
Input Format(s): String
Input Parameters: One-Dimensional (1D): Sequences
Maximum input size: 128K tokens

Output

Output Type(s): Text
Output Format: String
Output Parameters: One-Dimensional (1D): Sequences
Maximum output size: 128K tokens

Our AI models are designed and optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

Software Integration

Runtime Engine(s): HuggingFace Transformers (with trust_remote_code=True)
Supported Hardware Microarchitecture Compatibility: NVIDIA H100-80GB, NVIDIA A100 (full two-tower diffusion inference uses 2 GPUs, ~59GB per GPU for BF16 weights)
Operating System(s): Linux
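The per-GPU weight footprint can be roughly checked from parameter counts alone. The sketch below assumes 2 bytes per BF16 parameter and ignores activations, KV/Mamba caches, and any F32 tensors. `weight_footprint_gib` is a hypothetical helper, and the 31.5B per-tower figure is an assumption obtained by halving the 63B safetensors model size shown on the model page.

```python
BYTES_PER_BF16_PARAM = 2


def weight_footprint_gib(num_params):
    """Approximate BF16 weight memory in GiB for a given parameter count."""
    return num_params * BYTES_PER_BF16_PARAM / 2**30


if __name__ == "__main__":
    nominal_tower = 30e9        # "~30B" per tower, as stated in the card
    halved_checkpoint = 63e9 / 2  # assumption: 63B checkpoint split evenly
    print(f"30B tower:   {weight_footprint_gib(nominal_tower):.1f} GiB")
    print(f"31.5B tower: {weight_footprint_gib(halved_checkpoint):.1f} GiB")
```

Under these assumptions the second estimate comes out close to the quoted ~59GB per GPU, which suggests that figure covers the weights of one tower and leaves the rest of an 80GB device for caches and activations.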

Use it with Transformers

Full two-tower diffusion inference places the AR/context tower and the diffusion/denoiser tower on separate GPUs.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "nvidia/Nemotron-TwoTower-30B-A3B-Base-BF16"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

# AR/context tower -> GPU 0, diffusion/denoiser tower -> GPU 1
model.place_towers_on_devices("cuda:0", "cuda:1")
model.eval()

prompt = "France is a country "
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")

# Mask-diffusion generation (two-tower)
outputs = model.generate_mask_diffusion(
    inputs["input_ids"],
    max_new_tokens=128,
    block_size=16,            # tokens generated per block
    steps_per_block=16,       # denoising iterations per block
    mask_token_id=3,          # [MASK]
    temperature=0.1,
    confidence_threshold=0.8, # commit positions above this confidence each step
    eos_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

Mock-AR (two-tower autoregressive, one token per step):

outputs = model.generate_mock_ar(
    inputs["input_ids"], max_new_tokens=128, temperature=0.0,
    eos_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

AR-only (single GPU, AR/context tower only; move the model to one device with .cuda() instead of calling place_towers_on_devices):

outputs = model.generate_ar(
    inputs["input_ids"], max_new_tokens=128, temperature=0.0,
    eos_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Model Version(s)

v1.1 — Block-wise mask-diffusion generation enabled (time-conditioned diffusion/denoiser, bidirectional in-block attention, context-seeded chunk-scan Mamba-2); AR and mock-AR also supported.
v1.0 — Two-tower AR (mock-AR) checkpoint.

Training, Testing, and Evaluation Datasets

The diffusion/denoiser tower is trained on the same data sources as the NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16 backbone (a ~2.1T-token subset of the backbone's two-phase blend). See the base model card for the full dataset listing.

Data Modality: Text
Data Collection Method by dataset: Hybrid: Automated, Human, Synthetic
Labeling Method by dataset: Not applicable (self-supervised mask-diffusion objective)

Inference

Engine(s): HuggingFace Transformers (with trust_remote_code=True)
Test Hardware: 2× NVIDIA A100 80GB or 2× NVIDIA H100 80GB (two-tower diffusion); 1× 80GB GPU sufficient for AR-only mode

Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our Trustworthy AI terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

For more detailed information on ethical considerations for this model, please see the Model Card++ Bias, Explainability, and Privacy Subcards.

Please report security vulnerabilities or NVIDIA AI Concerns here.

Citation

@misc{nvidia_nemotron_twotower_2026,
  title  = {{Nemotron-TwoTower}: Diffusion Language Modeling with Pretrained Autoregressive Context},
  author = {{NVIDIA}},
  year   = {2026},
  url    = {https://huggingface.co/collections/nvidia/nemotron-twotower},
  note   = {Technical report}
}

Safetensors

Model size: 63B params
Tensor type: BF16, F32
