Model Overview

Description:

Nemotron-CLIMB Proxy Base Models (62M and 350M) are two small decoder-only transformer language models pre-trained from scratch by NVIDIA on 10 trillion tokens using the Megatron-LM codebase. They are designed as proxy models for scaling law research — enabling practitioners to forecast the behavior of much larger models prior to committing full-scale compute resources. Both models use a WSD (Warmup-Stable-Decay) learning rate schedule and share the same 32-layer architecture, differing only in hidden dimension. These models are ready for commercial/non-commercial use.

License/Terms of Use:

Released under the NVIDIA Open Model License.

Deployment Geography:

Global

Use Case:

These proxy models are intended for ML researchers and engineers working on:

Scaling law experiments — predicting loss, downstream accuracy, or emergent behavior of larger models from small-model trends.
Recipe transfer — validating hyperparameter choices (learning rate, batch size, data mix) at low cost before scaling up.
Proxy-tuning research — studying how fine-tuning dynamics (SFT, RLHF, DPO) transfer across model scales.
Reward model proxy training — training lightweight reward models for alignment research.
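
As a rough illustration of the scaling-law use case above, the sketch below fits a power law L(N) = a * N^(-alpha) to (parameter count, loss) pairs in log-log space and extrapolates to a larger model. The loss values are placeholders invented for the example, not measurements from these checkpoints, and `fit_power_law` and `predict_loss` are hypothetical helper names. With only two model sizes the fit is exactly determined, so a real study would likely add more proxy sizes or an irreducible-loss term.

```python
import math

def fit_power_law(params, losses):
    """Least-squares fit of loss = a * params**(-alpha) in log-log space."""
    xs = [math.log(n) for n in params]
    ys = [math.log(l) for l in losses]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, alpha)

def predict_loss(a, alpha, params):
    """Evaluate the fitted power law at a new parameter count."""
    return a * float(params) ** (-alpha)

# Placeholder losses (NOT measured values) for the 62M and 350M variants.
proxy_params = [62e6, 350e6]
proxy_losses = [3.60, 3.10]

a, alpha = fit_power_law(proxy_params, proxy_losses)
print(f"alpha = {alpha:.3f}")
print(f"extrapolated loss at 8B parameters: {predict_loss(a, alpha, 8e9):.3f}")
```

The same two functions apply unchanged to downstream metrics that follow a power-law trend, provided the metric is positive so that the logarithm is defined.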

Reference(s):

Model Architecture:

Architecture Type: Transformer (decoder-only)

Network Architecture: Decoder-only transformer with RMSNorm, SwiGLU activation, and Rotary Position Embeddings (RoPE).

Number of model parameters:

Variant	Parameters	Layers	Checkpoint Size
62M	62 million	32	~735 MB
350M	350 million	32	~4.5 GB

Note: Checkpoint sizes include optimizer state and RNG state, suitable for continued pre-training.
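
To make the note above concrete, the following sketch divides each approximate checkpoint size by the nominal parameter count. It assumes decimal units (1 MB = 10^6 bytes, 1 GB = 10^9 bytes) and treats the variant names as exact parameter counts, both of which are simplifications. The result is roughly 12 to 13 bytes per parameter, well above what weights alone would need at 16-bit or 32-bit precision, which is consistent with optimizer state being stored alongside the weights.

```python
# Nominal parameter counts and approximate checkpoint sizes from the table above.
# Assumption: MB and GB are decimal (10**6 and 10**9 bytes).
checkpoints = {
    "62M": {"params": 62e6, "bytes": 735e6},
    "350M": {"params": 350e6, "bytes": 4.5e9},
}

def bytes_per_param(entry):
    """Average number of checkpoint bytes stored per model parameter."""
    return entry["bytes"] / entry["params"]

for name, entry in checkpoints.items():
    print(f"{name}: ~{bytes_per_param(entry):.1f} bytes per parameter")
```

A weights-only export for inference would therefore be considerably smaller than the sizes listed in the table.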

Design Choices: Both models were trained from scratch using the Megatron-LM distributed training framework with the following key design decisions:

Deep-and-narrow architecture. Both variants use 32 transformer layers — unusually deep for their parameter count — to better approximate the layer-wise dynamics of billion-scale models, improving proxy fidelity for scaling law extrapolation.
WSD learning rate schedule. A Warmup-Stable-Decay schedule was used for stable long-horizon training over 10T tokens.
Single tensor-parallel rank. Both models were trained with TP=1 to simplify checkpoint distribution and downstream usage.
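
The WSD schedule named above can be sketched as a piecewise function of the training step. This is a minimal illustration, not the configuration used for these models: the card does not state the peak learning rate, warmup length, decay length, or decay shape, so the values below and the linear decay are assumptions, and `wsd_lr` is a hypothetical helper name.

```python
def wsd_lr(step, total_steps, peak_lr, warmup_steps, decay_steps, min_lr=0.0):
    """Warmup-Stable-Decay: linear warmup, constant plateau, linear decay (assumed shape)."""
    if warmup_steps <= 0 or decay_steps <= 0:
        raise ValueError("warmup_steps and decay_steps must be positive")
    if warmup_steps + decay_steps > total_steps:
        raise ValueError("warmup and decay phases exceed total_steps")
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        # Warmup: ramp linearly up to the peak learning rate.
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        # Stable: hold the peak learning rate.
        return peak_lr
    # Decay: interpolate linearly from peak_lr down to min_lr.
    progress = min(1.0, (step - decay_start) / decay_steps)
    return peak_lr + (min_lr - peak_lr) * progress

# Hypothetical settings; only the total iteration count (62M run) comes from the card.
total = 2_500_000
for step in (0, 1_000, 1_000_000, 2_400_000, 2_500_000):
    lr = wsd_lr(step, total, peak_lr=3e-4, warmup_steps=2_000, decay_steps=250_000)
    print(f"step {step:>9,}: lr = {lr:.2e}")
```

Because the stable phase holds a constant rate, a run can in principle be extended or branched from any plateau checkpoint and decayed later, which is one reason this family of schedules suits long training horizons.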

Input(s):

Input Type(s): Text

Input Format(s):

Text: Token IDs (integer sequences)

Input Parameters:

Text: One-Dimensional (1D) sequence of token IDs

Other Properties Related to Input: These are base (pre-trained) language models. Input is tokenized text. The models accept standard causal-LM input and are not instruction-tuned.

Output(s):

Output Type(s): Text

Output Format(s):

Text: Next-token logits over vocabulary at each position

Output Parameters:

Text: Two-Dimensional (2D) — sequence length x vocabulary size

Other Properties Related to Output: As base models, outputs are raw next-token probability distributions. The models are not aligned or instruction-tuned and may produce unfiltered text.
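
The input and output shapes described above can be illustrated without loading a model. The sketch below builds a toy logits matrix of shape (sequence length x vocabulary size) from random numbers, applies a softmax to the last row, and picks the most likely next token. The sizes are toy values (the real vocabulary is far larger), the logits are random rather than model outputs, and `softmax` and `greedy_next_token` are hypothetical helper names; plain Python lists are used so the example has no dependencies.

```python
import math
import random

def softmax(row):
    """Convert one row of logits into a probability distribution."""
    shift = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_next_token(logits):
    """Pick the most likely next token from the logits at the last position."""
    probs = softmax(logits[-1])
    token_id = max(range(len(probs)), key=probs.__getitem__)
    return token_id, probs[token_id]

# Toy stand-in for model output: seq_len rows, each with vocab_size logits.
seq_len, vocab_size = 4, 8
rng = random.Random(0)
logits = [[rng.gauss(0.0, 1.0) for _ in range(vocab_size)] for _ in range(seq_len)]

token_id, prob = greedy_next_token(logits)
print(f"next token id: {token_id} (p = {prob:.3f})")
```

Since the models are not instruction-tuned, any sampling strategy applied to these distributions simply continues the input text.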

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the models achieve faster training and inference times compared to CPU-only solutions.

Software Integration:

Runtime Engine:

Megatron-LM (native checkpoint format)
Can be converted to Hugging Face Transformers format for inference

Supported Hardware Microarchitecture Compatibility:

NVIDIA Ampere (A100)
NVIDIA Hopper (H100, H200)
NVIDIA Ada Lovelace (L40S)
CPU inference is feasible given the small model size

Supported Operating System(s):

Linux

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

This AI model can be embedded as an Application Programming Interface (API) call into the software environment described above.

Model Version(s):

Variant	Training Iterations	Training Tokens	Training Nodes	Checkpoint
62M	2,500,000	10T	8	`iter_2500000/mp_rank_00/model_optim_rng.pt`
350M	2,384,053	10T	16	`iter_2384053/mp_rank_00/model_optim_rng.pt`
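
The table above also implies an approximate global batch size. The sketch below divides the stated token budget by the iteration count for each variant. Because "10T" is a rounded figure, the results are estimates rather than the exact training configuration, and `tokens_per_iteration` is a hypothetical helper name.

```python
TOTAL_TOKENS = 10e12  # "10T" from the table; a rounded figure

# Training iterations per variant, from the table above.
iterations = {"62M": 2_500_000, "350M": 2_384_053}

def tokens_per_iteration(num_iterations, total_tokens=TOTAL_TOKENS):
    """Approximate number of tokens consumed per optimizer step."""
    return total_tokens / num_iterations

for name, iters in iterations.items():
    print(f"{name}: ~{tokens_per_iteration(iters):,.0f} tokens per iteration")
```

Both variants come out at roughly 4 million tokens per iteration, which suggests the two runs used similar global batch sizes.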

Both are v1.0 releases.

Training, Testing, and Evaluation Datasets:

Training Dataset:

Data Modality:

Text

Training Data Size:

Text Training Data Size: 10 trillion tokens

Data Collection Method by dataset:

Automated

Labeling Method by dataset:

Not Applicable

Properties: 10 trillion tokens. Content is English-language web text. The data may include publicly available web content of various types (articles, blogs, forums, etc.).

Testing Dataset:

Data Collection Method by dataset:

Automated

Labeling Method by dataset:

Not Applicable

Properties: 10 billion tokens. Same source distribution as training data.

Evaluation Dataset:

Data Collection Method by dataset:

Automated

Labeling Method by dataset:

Not Applicable

Properties: 10 billion tokens. Same source distribution as training data.

Inference:

Acceleration Engine: Megatron-LM or Hugging Face Transformers (after conversion)
Test Hardware:

NVIDIA A100 / H100 GPU (also runnable on CPU given small size)

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Bias

Field	Response
Participation considerations from adversely impacted groups (protected classes) in model design and testing:	Not Applicable.
Measures taken to mitigate against unwanted bias:	Not Applicable.

Explainability

Field	Response
Intended Task/Domain:	Scaling Law Research
Model Type:	Transformer
Intended Users:	ML researchers and engineers working on scaling law experiments, hyperparameter recipe transfer, proxy-tuning research, and reward model proxy training.
Output:	Text (Next-token logits over vocabulary at each position; raw probability distributions without instruction-tuning or alignment)
Describe how the model works:	Input text is tokenized into a sequence of token IDs and fed into a 32-layer decoder-only transformer with RMSNorm, SwiGLU activation, and Rotary Position Embeddings (RoPE). The model autoregressively predicts the next token at each position. Two variants (62M and 350M parameters) share the same depth but differ in hidden dimension, enabling scaling law extrapolation across model sizes.
Name the adversely impacted groups this has been tested to deliver comparable outcomes regardless of:	Not Applicable.
Technical Limitations & Mitigation:	As base (non-aligned) models, outputs are unfiltered next-token distributions and may produce harmful, biased, or inaccurate text. The models are trained on English web text from Common Crawl/DCLM and may not generalize well to non-English languages or specialized domains. Their primary value is as proxy models for predicting larger-model behavior; they are not intended for direct deployment in production systems.
Verified to have met prescribed NVIDIA quality standards:	Yes
Performance Metrics:	Validation Loss (perplexity), Downstream Task Accuracy (via scaling law extrapolation), Throughput (tokens/second), Training Stability
Potential Known Risks:	As unaligned base models, they may generate harmful, biased, or factually incorrect text if used directly for text generation. Scaling law predictions derived from these proxy models may not perfectly extrapolate to all architectures or data distributions.
Licensing:	NVIDIA Open Model License
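
The "Validation Loss (perplexity)" metric in the table above pairs two views of the same quantity: perplexity is the exponential of the mean per-token negative log-likelihood. A minimal sketch follows, assuming the loss is measured in nats (natural logarithm); the example values are placeholders, not results from these models, and `perplexity` is a hypothetical helper name.

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token, in nats)."""
    if not token_nlls:
        raise ValueError("need at least one token loss")
    return math.exp(sum(token_nlls) / len(token_nlls))

# Placeholder per-token losses (NOT measured values).
example_nlls = [2.9, 3.4, 3.1, 2.6]
print(f"perplexity: {perplexity(example_nlls):.2f}")
```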

Safety & Security

Field	Response
Model Application Field(s):	Scaling Law Research
Describe the life critical impact (if present).	Not Applicable.
Use Case Restrictions:	Abide by the NVIDIA Open Model License.
Model and dataset restrictions:	The Principle of Least Privilege (PoLP) is applied, limiting access for dataset generation and model development. Restrictions on dataset access are enforced during training, and dataset license constraints are adhered to.

Privacy

Privacy Subcard
Nemotron-Climb-Proxy-Models was trained on large-scale publicly available data that may contain images, audio-video, and text relating to people. NVIDIA collected and used this data in compliance with applicable data protection and privacy laws. This model was not designed to specifically derive insights or otherwise learn from any personal data contained in the datasets.
NVIDIA uses a combination of filters, data minimization techniques, and other guardrails to help prevent personal data from being recited by our models. We employ automated tools and data processing techniques during pre-training or training to identify and filter certain categories of personal data.
Please review NVIDIA's Applicable Privacy Policy for more information.

Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.
