NeMo
Chris-Alexiuk committed on
Commit
84ba9fc
1 Parent(s): 6c5fbba

Updated Model Card

Updated Model Card - 9:30PM EDT.

Files changed (1)
  1. README.md +72 -47
README.md CHANGED
@@ -7,23 +7,35 @@ license_link: LICENSE
7
 
8
  [![Model architecture](https://img.shields.io/badge/Model%20Arch-Transformer%20Decoder-green)](#model-architecture)[![Model size](https://img.shields.io/badge/Params-340B-green)](#model-architecture)[![Language](https://img.shields.io/badge/Language-Multilingual-green)](#datasets)
9
 
10
- ### License
11
-
12
- NVIDIA Open Model License
13
-
14
  ### Model Overview
15
 
16
- Nemotron-4-340B-Instruct is a large language model (LLM) which is a fine-tuned version of the Nemotron-4-340B-Base base model, optimized for English single and multi-turn chat use-cases. The base model was pre-trained on a corpus of 8 trillion tokens consisting of a diverse assortment of English based texts, 40+ coding languages, and 50+ natural languages.
17
 
18
  Subsequently the Nemotron-4-340B-Instruct model went through additional alignment steps including:
19
-
20
  - Supervised Fine-tuning (SFT)
21
- - Direct Policy Optimization (DPO)
22
- - Additional in-house alignment techniques (Publication work in progress)
 
 
 
 
 
 
 
 
 
23
 
24
- This results in a final model that is aligned for human chat preferences, improvements in mathematical reasoning, coding and instruction following.
25
 
26
- This model is ready for commercial use.
 
 
 
 
 
 
 
 
27
 
28
  **Model Developer:** NVIDIA
29
 
@@ -51,19 +63,21 @@ FP8 Inference:
51
 
52
  ### Model Architecture:
53
 
54
- The base model, Nemotron-4-340B, was trained with a global batch-size of 2304, a sequence length of 4096 tokens, uses Grouped-Query Attention (GQA), and RoPE positional embeddings.
55
 
56
  **Architecture Type:** Transformer Decoder (auto-regressive language model)
57
 
58
- ### Software Integration
59
-
60
- **Supported Hardware Architecture Compatibility:** NVIDIA H100, A100 80GB, A100 40GB
61
 
62
  ### Usage
63
 
64
  1. We will spin up an inference server and then call the inference server in a python script. Let’s first define the python script ``call_server.py``
65
 
66
  ```python
 
 
 
67
  headers = {"Content-Type": "application/json"}
68
 
69
  def text_generation(data, ip='localhost', port=None):
@@ -100,13 +114,16 @@ prompt = PROMPT_TEMPLATE.format(prompt=question)
100
  print(prompt)
101
 
102
  response = get_generation(prompt, greedy=True, add_BOS=False, token_to_gen=1024, min_tokens=1, temp=1.0, top_p=1.0, top_k=0, repetition=1.0, batch=False)
 
 
 
103
  print(response)
104
  ```
105
 
 
106
 
107
- 2. Given this python script, we will create a bash script, which spins up the inference server within the [NeMo container](https://github.com/NVIDIA/NeMo/blob/main/Dockerfile) and calls the python script ``call_server.py``. The bash script ``nemo_inference.sh`` is as follows,
108
-
109
-
110
  WEB_PORT=1424
111
 
112
  depends_on () {
@@ -122,15 +139,15 @@ depends_on () {
122
  echo "server ($HOST:$PORT) is up running"
123
  }
124
 
125
- echo "output filename: $OUTPUT_FILENAME"
126
 
127
  /usr/bin/python3 /opt/NeMo/examples/nlp/language_modeling/megatron_gpt_eval.py \
128
  gpt_model_file=$NEMO_FILE \
129
  pipeline_model_parallel_split_rank=0 \
130
  server=True tensor_model_parallel_size=8 \
131
- trainer.precision=bf16 pipeline_model_parallel_size=4 \
132
  trainer.devices=8 \
133
- trainer.num_nodes=4 \
134
  web_server=False \
135
  port=${WEB_PORT} &
136
  SERVER_PID=$!
@@ -144,47 +161,45 @@ echo "output filename: $OUTPUT_FILENAME"
144
 
145
  echo "SLURM_NODEID: $SLURM_NODEID"
146
  echo "local_rank: $local_rank"
147
- /usr/bin/python3 call_server.py
148
  echo "clean up daemons: $$"
149
  kill -9 $SERVER_PID
150
  pkill python
151
  fi
152
  wait
 
153
 
154
 
155
  3. We can launch ``nemo_inference.sh`` with a Slurm script like the one below, which starts a 4-node job for model inference.
156
 
157
-
158
  #!/bin/bash
159
  #SBATCH -A SLURM-ACCOUNT
160
  #SBATCH -p SLURM-PARTITION
161
- #SBATCH -N 4 # number of nodes
162
  #SBATCH -J generation
163
  #SBATCH --ntasks-per-node=8
164
  #SBATCH --gpus-per-node=8
165
  set -x
166
 
 
 
 
 
 
 
167
  read -r -d '' cmd <<EOF
168
- bash nemo_inference.sh
169
  EOF
170
 
171
  srun -o $OUTFILE -e $ERRFILE --container-image="$CONTAINER" $MOUNTS bash -c "${cmd}"
172
-
173
-
174
-
175
- ### Intended use
176
-
177
- Nemotron-4-340B-Instruct is a chat model intended for use in over 50+ natural and coding languages. For best performance on a given task, users are encouraged to customize the chat model using the NeMo Framework suite of customization tools including Parameter-Efficient Fine-Tuning (P-tuning, Adapters, LoRA), and SFT/Steer-LM/RLHF.
178
-
179
- ### Red Teaming:
180
-
181
- TO BE UPDATED BASED ON RED TEAMING PICs + LEGAL REVIEW
182
 
183
  ### Evaluation Results
184
 
185
  #### MT-Bench (GPT-4-Turbo)
186
 
187
- Evaluated using select datasets from the [Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena](https://arxiv.org/pdf/2306.05685v4)
188
 
189
  | total | writing | roleplay | extraction | stem | humanities | reasoning | math | coding | turn 1 | turn 2 |
190
  | ----- | ------- | -------- | ---------- | ---- | ---------- | --------- | ---- | ------ | ------ | ------ |
@@ -208,7 +223,7 @@ Evaluated using the Multi-task Language Understanding benchmarks as introduced i
208
 
209
  #### GSM8K
210
 
211
- Evaluated using the Grade School Math 8K (GSM8K) bechmark as introduced in [Training Verifiers to Solve Math Word Problems](https://arxiv.org/pdf/2110.14168v2).
212
 
213
  | GSM8K 0-shot |
214
  | ----------------- |
@@ -223,6 +238,15 @@ Evaluated using the HumanEval benchmark as introduced in [Evaluating Large Langu
223
  | ----- |
224
  | 73.2 |
225
 
 
 
 
 
 
 
 
 
 
226
  #### Arena Hard
227
 
228
  Evaluated using the [Arena-Hard Pipeline](https://lmsys.org/blog/2024-04-19-arena-hard/) from the LMSys Org.
@@ -235,17 +259,10 @@ Evaluated using the [Arena-Hard Pipeline](https://lmsys.org/blog/2024-04-19-aren
235
 
236
  Evaluated using the AlpacaEval 2.0 LC (Length Controlled) as introduced in the paper: [Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators](https://arxiv.org/abs/2404.04475)
237
 
238
- | AlpacaEval |
239
  | ----------------- |
240
- | 54.2 |
241
-
242
- #### MBPP
243
 
244
- Evaluated using the MBPP Dataset as introduced in the [Program Synthesis with Large Language Models](https://arxiv.org/abs/2108.07732) paper.
245
-
246
- | MBPP |
247
- | ----------------- |
248
- | 75.4 |
249
 
250
  #### TFEval
251
 
@@ -256,6 +273,14 @@ Evaluated using the CantTalkAboutThis Dataset as introduced in the [CantTalkAbou
256
  | 81.7 | 97.7 |
257
 
258
 
259
- ### Limitations
 
 
 
 
 
 
 
 
260
 
261
- The base model was trained on data that contains toxic language and societal biases originally crawled from the internet. Therefore, the model may amplify those biases and return toxic responses especially when prompted with toxic prompts. The model may generate answers that may be inaccurate, omit key information, or include irrelevant or redundant text producing socially unacceptable or undesirable text, even if the prompt itself does not include anything
 
7
 
8
  [![Model architecture](https://img.shields.io/badge/Model%20Arch-Transformer%20Decoder-green)](#model-architecture)[![Model size](https://img.shields.io/badge/Params-340B-green)](#model-architecture)[![Language](https://img.shields.io/badge/Language-Multilingual-green)](#datasets)
9
 
 
 
 
 
10
  ### Model Overview
11
 
12
+ Nemotron-4-340B-Instruct is a large language model (LLM) that can be used as part of a synthetic data generation pipeline to create training data that helps researchers and developers build their own LLMs. It is a fine-tuned version of the Nemotron-4-340B-Base model, optimized for English single- and multi-turn chat use cases. The base model was pre-trained on a corpus of 9 trillion tokens consisting of a diverse assortment of English-based texts, 50+ natural languages, and 40+ coding languages.
13
 
14
  Subsequently the Nemotron-4-340B-Instruct model went through additional alignment steps including:
 
15
  - Supervised Fine-tuning (SFT)
16
+ - Direct Preference Optimization (DPO)
17
+ - Additional in-house alignment technique: Reward-aware Preference Optimization (RPO)
18
+
19
+ Throughout the alignment process, we relied on only approximately 20K human-annotated data while our data generation pipeline synthesized over 98% of the data used for supervised fine-tuning and preference fine-tuning (DPO & RPO). We provide comprehensive details about our synthetic data generation pipeline in the technical report.
20
+
21
+ This results in a model that is aligned for human chat preferences, shows improvements in mathematical reasoning, coding, and instruction following, and is capable of generating high-quality synthetic data for a variety of use cases.
22
+
23
+ Under the NVIDIA Open Model License, NVIDIA confirms:
24
+ - Models are commercially usable.
25
+ - You are free to create and distribute Derivative Models.
26
+ - NVIDIA does not claim ownership to any outputs generated using the Models or Derivative Models.
27
 
28
+ ### License:
29
 
30
+ [NVIDIA Open Model License](https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf)
31
+
32
+ ### Intended use
33
+
34
+ Nemotron-4-340B-Instruct is a chat model intended for use in the English language.
35
+
36
+ Nemotron-4-340B-Instruct is designed for synthetic data generation, enabling developers and enterprises to build and customize their own large language models and LLM applications.
37
+
38
+ The instruct model itself can be further customized using the [NeMo Framework](https://docs.nvidia.com/nemo-framework/index.html) suite of customization tools including Parameter-Efficient Fine-Tuning (P-tuning, Adapters, LoRA), and SFT/Steer-LM/RLHF using [NeMo-Aligner](https://github.com/NVIDIA/NeMo-Aligner).
39
 
40
  **Model Developer:** NVIDIA
41
 
 
63
 
64
  ### Model Architecture:
65
 
66
+ Nemotron-4-340B-Base is a standard decoder-only Transformer trained with a sequence length of 4096 tokens. It uses Grouped-Query Attention (GQA) and Rotary Position Embeddings (RoPE).
67
 
68
  **Architecture Type:** Transformer Decoder (auto-regressive language model)
69
 
70
+ **Network Architecture:**
71
+ Nemotron-4
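As a rough illustration of the Grouped-Query Attention mentioned above, the sketch below shows only the head-grouping idea: several query heads share one key/value head. The function name and the head counts are hypothetical, chosen for illustration, and are not taken from the model's configuration.

```python
def kv_head_for_query_head(query_head: int, num_query_heads: int, num_kv_heads: int) -> int:
    """Return the index of the key/value head shared by a given query head under GQA."""
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("num_query_heads must be divisible by num_kv_heads")
    if not 0 <= query_head < num_query_heads:
        raise ValueError("query_head is out of range")
    # Consecutive query heads form a group that reads the same K/V head.
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size


# Hypothetical head counts (assumption, for illustration only).
NUM_QUERY_HEADS = 8
NUM_KV_HEADS = 2

mapping = [
    kv_head_for_query_head(h, NUM_QUERY_HEADS, NUM_KV_HEADS)
    for h in range(NUM_QUERY_HEADS)
]
print(mapping)  # [0, 0, 0, 0, 1, 1, 1, 1]
```

With as many key/value heads as query heads this reduces to standard multi-head attention, and with a single key/value head it reduces to multi-query attention.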
 
72
 
73
  ### Usage
74
 
75
  1. We will spin up an inference server and then call the inference server in a python script. Let’s first define the python script ``call_server.py``
76
 
77
  ```python
78
+ import json
79
+ import requests
80
+
81
  headers = {"Content-Type": "application/json"}
82
 
83
  def text_generation(data, ip='localhost', port=None):
 
114
  print(prompt)
115
 
116
  response = get_generation(prompt, greedy=True, add_BOS=False, token_to_gen=1024, min_tokens=1, temp=1.0, top_p=1.0, top_k=0, repetition=1.0, batch=False)
117
+ response = response[len(prompt):]
118
+ if response.endswith("<extra_id_1>"):
119
+     response = response[:-len("<extra_id_1>")]
120
  print(response)
121
  ```
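The post-processing added at the end of the script can be isolated into a small helper, sketched here under a few assumptions. The helper name `strip_prompt_and_end_marker` is hypothetical, the example strings are made up, and the `startswith` guard is a defensive addition: the script above slices off the prompt unconditionally.

```python
END_MARKER = "<extra_id_1>"  # trailing marker stripped by the script above


def strip_prompt_and_end_marker(response: str, prompt: str, end_marker: str = END_MARKER) -> str:
    """Remove the echoed prompt from the front and the end marker from the back."""
    # Defensive variant (assumption): only drop the prefix if it is really the prompt.
    if response.startswith(prompt):
        response = response[len(prompt):]
    if response.endswith(end_marker):
        response = response[: -len(end_marker)]
    return response


# Made-up strings for illustration; the real prompt comes from PROMPT_TEMPLATE.
example_prompt = "EXAMPLE PROMPT\n"
raw = example_prompt + "Hello there!\n" + END_MARKER
print(repr(strip_prompt_and_end_marker(raw, example_prompt)))  # 'Hello there!\n'
```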
122
 
123
+ 2. Given this Python script, we will create a bash script that spins up the inference server within the NeMo container (``docker pull nvcr.io/nvidia/nemo:24.01.framework``) and calls the Python script ``call_server.py``. The bash script ``nemo_inference.sh`` is as follows:
124
 
125
+ ```bash
126
+ NEMO_FILE=$1
 
127
  WEB_PORT=1424
128
 
129
  depends_on () {
 
139
  echo "server ($HOST:$PORT) is up running"
140
  }
141
 
142
+
143
 
144
  /usr/bin/python3 /opt/NeMo/examples/nlp/language_modeling/megatron_gpt_eval.py \
145
  gpt_model_file=$NEMO_FILE \
146
  pipeline_model_parallel_split_rank=0 \
147
  server=True tensor_model_parallel_size=8 \
148
+ trainer.precision=bf16 pipeline_model_parallel_size=2 \
149
  trainer.devices=8 \
150
+ trainer.num_nodes=2 \
151
  web_server=False \
152
  port=${WEB_PORT} &
153
  SERVER_PID=$!
 
161
 
162
  echo "SLURM_NODEID: $SLURM_NODEID"
163
  echo "local_rank: $local_rank"
164
+ /usr/bin/python3 /scripts/call_server.py
165
  echo "clean up daemons: $$"
166
  kill -9 $SERVER_PID
167
  pkill python
168
  fi
169
  wait
170
+ ```
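The body of `depends_on` is not shown in this diff. As a hedged sketch of the same idea, waiting until the inference server accepts connections before calling it, the hypothetical Python helper below polls a TCP port. It is an assumption about what such a check could look like, not the script's actual implementation.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 60.0, interval_s: float = 1.0) -> bool:
    """Poll host:port until a TCP connection succeeds or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Not up yet (refused or timed out); wait and retry.
            time.sleep(interval_s)
    return False


# Possible usage (assumption): wait for the server on WEB_PORT before sending requests.
# if wait_for_port("localhost", 1424, timeout_s=600.0):
#     ...  # run the generation request
```

A successful TCP connection only shows that the port is open; it does not guarantee that the model has finished loading.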
171
 
172
 
173
  3. We can launch ``nemo_inference.sh`` with a Slurm script like the one below, which starts a 2-node job for model inference.
174
 
175
+ ```bash
176
  #!/bin/bash
177
  #SBATCH -A SLURM-ACCOUNT
178
  #SBATCH -p SLURM-PARTITION
179
+ #SBATCH -N 2 # number of nodes
180
  #SBATCH -J generation
181
  #SBATCH --ntasks-per-node=8
182
  #SBATCH --gpus-per-node=8
183
  set -x
184
 
185
+ RESULTS=<PATH_TO_YOUR_SCRIPTS_FOLDER>
186
+ OUTFILE="${RESULTS}/slurm-%j-%n.out"
187
+ ERRFILE="${RESULTS}/error-%j-%n.out"
188
+ MODEL=<PATH_TO>/Nemotron-4-340B-Instruct
189
+
190
+ MOUNTS="--container-mounts=<PATH_TO_YOUR_SCRIPTS_FOLDER>:/scripts,${MODEL}:/model"
191
  read -r -d '' cmd <<EOF
192
+ bash /scripts/nemo_inference.sh /model
193
  EOF
194
 
195
  srun -o $OUTFILE -e $ERRFILE --container-image="$CONTAINER" $MOUNTS bash -c "${cmd}"
196
+ ```
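The node count in the Slurm script and the parallelism settings in ``nemo_inference.sh`` have to agree. The sketch below checks that arithmetic, assuming the number of GPUs should equal tensor-parallel size times pipeline-parallel size (no data-parallel replicas). The function names are hypothetical; the numbers are the ones that appear in the old and new versions of the scripts.

```python
def required_gpus(tensor_parallel: int, pipeline_parallel: int) -> int:
    """GPUs needed for one model replica split by tensor and pipeline parallelism."""
    return tensor_parallel * pipeline_parallel


def layout_is_consistent(tensor_parallel: int, pipeline_parallel: int,
                         num_nodes: int, devices_per_node: int) -> bool:
    """True if the requested nodes and devices provide exactly the GPUs the layout needs."""
    return required_gpus(tensor_parallel, pipeline_parallel) == num_nodes * devices_per_node


# Updated settings: TP=8, PP=2 on 2 nodes with 8 GPUs each (16 GPUs).
print(layout_is_consistent(8, 2, num_nodes=2, devices_per_node=8))  # True
# Previous settings: TP=8, PP=4 on 4 nodes with 8 GPUs each (32 GPUs).
print(layout_is_consistent(8, 4, num_nodes=4, devices_per_node=8))  # True
# Mixing the two (PP=2 but 4 nodes) would not line up.
print(layout_is_consistent(8, 2, num_nodes=4, devices_per_node=8))  # False
```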
 
 
 
 
 
 
 
 
 
197
 
198
  ### Evaluation Results
199
 
200
  #### MT-Bench (GPT-4-Turbo)
201
 
202
+ Evaluated using MT-Bench, judged by GPT-4-Turbo, as described in the [HelpSteer2 Dataset Paper](https://arxiv.org/abs/2406.08673)
203
 
204
  | total | writing | roleplay | extraction | stem | humanities | reasoning | math | coding | turn 1 | turn 2 |
205
  | ----- | ------- | -------- | ---------- | ---- | ---------- | --------- | ---- | ------ | ------ | ------ |
 
223
 
224
  #### GSM8K
225
 
226
+ Evaluated using the Grade School Math 8K (GSM8K) benchmark as introduced in [Training Verifiers to Solve Math Word Problems](https://arxiv.org/pdf/2110.14168v2).
227
 
228
  | GSM8K 0-shot |
229
  | ----------------- |
 
238
  | ----- |
239
  | 73.2 |
240
 
241
+ #### MBPP
242
+
243
+ Evaluated using the MBPP Dataset as introduced in the [Program Synthesis with Large Language Models](https://arxiv.org/abs/2108.07732) paper.
244
+
245
+ | MBPP 0-shot |
246
+ | ----------------- |
247
+ | 75.4 |
248
+
249
+
250
  #### Arena Hard
251
 
252
  Evaluated using the [Arena-Hard Pipeline](https://lmsys.org/blog/2024-04-19-arena-hard/) from the LMSys Org.
 
259
 
260
  Evaluated using the AlpacaEval 2.0 LC (Length Controlled) as introduced in the paper: [Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators](https://arxiv.org/abs/2404.04475)
261
 
262
+ | AlpacaEval 2.0 LC |
263
  | ----------------- |
264
+ | 41.5 |
 
 
265
 
 
 
 
 
 
266
 
267
  #### TFEval
268
 
 
273
  | 81.7 | 97.7 |
274
 
275
 
276
+ ### Adversarial Testing and Red Teaming Efforts
277
+
278
+ The Nemotron-4-340B-Instruct model underwent extensive safety evaluation, including adversarial testing via three distinct methods:
279
+ - [Garak](https://docs.garak.ai/garak) is an automated LLM vulnerability scanner that probes for common weaknesses, including prompt injection and data leakage.
280
+ - [AEGIS](https://arxiv.org/pdf/2404.05993) is a content safety evaluation dataset and LLM-based content safety classifier model that adheres to a broad taxonomy of 13 categories of critical risks in human-LLM interactions.
281
+ - Human Content Red Teaming leveraging human interaction and evaluation of the models' responses.
282
+
283
+
284
+ ### Ethical Considerations
285
 
286
+ NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards [Insert Link to Model Card++ here]. Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).