---
license: cc-by-nc-4.0
language:
- en
pipeline_tag: text-generation
inference: false
fine-tuning: true
tags:
- nvidia
- conversational
- llama2
datasets:
- Anthropic/hh-rlhf
---

# NV-Llama2-70B-RLHF-Chat

## Description
NV-Llama2-70B-RLHF-Chat is a 70-billion-parameter generative language model instruction-tuned from the [Llama2-70B](https://huggingface.co/meta-llama/Llama-2-70b) base model. It supports a context length of up to 4,096 tokens. The model was fine-tuned for instruction following using Supervised Fine-Tuning (SFT) on a combination of public and proprietary data, followed by Reinforcement Learning from Human Feedback (RLHF) on the [HH-RLHF dataset](https://huggingface.co/datasets/Anthropic/hh-rlhf), achieving 7.59 on MT-Bench and demonstrating strong performance on academic benchmarks.

NV-Llama2-70B-RLHF-Chat was trained with NVIDIA [NeMo-Aligner](https://github.com/NVIDIA/NeMo-Aligner), a scalable toolkit for performant and efficient model alignment. NeMo-Aligner is built on the [NeMo Framework](https://github.com/NVIDIA/NeMo), which allows training to scale to thousands of GPUs using tensor, data, and pipeline parallelism across all components of alignment. All of our checkpoints are cross-compatible with the NeMo ecosystem, allowing for inference deployment and further customization.

Try this model instantly for free at the [NVIDIA AI Playground](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-foundation/models/nv-llama2-70b-rlhf), hosted by NVIDIA. You can use it in the provided UI or through a limited-access API (up to 10,000 requests within 30 days). If you need more requests, we demonstrate below how you can set up an inference server.

<img src="https://huggingface.co/nvidia/NV-Llama2-70B-RLHF-Chat/resolve/main/mtbench_categories.png" alt="MT Bench Categories" width="800" style="margin-left:'auto' margin-right:'auto' display:'block'"/>


## Model Architecture
- **Architecture Type:** Transformer
- **Network Architecture:** Llama 2

## Prompt Format
| Single-Turn | Single-Turn with Context | Multi-Turn |
|:-----------:|:------------------------:|:----------:|
| <extra_id_0>System<br><br><extra_id_1>User<br>{prompt}<br><extra_id_1>Assistant | <extra_id_0>System<br><br><extra_id_1>User<br>{context}<br>{prompt}<br><extra_id_1>Assistant | <extra_id_0>System<br><br><extra_id_1>User<br>{prompt 1}<br><extra_id_1>Assistant<br>{response 1}<br><extra_id_1>User<br>{prompt 2}<br><extra_id_1>Assistant |

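As a reference for the templates above, here is a minimal, illustrative Python helper that assembles them into a prompt string. `build_prompt` is our own sketch, not part of NeMo or any released API:

```python
def build_prompt(turns, context=None, system=""):
    """Assemble the chat prompt format from the table above.

    turns: list of (user, assistant) pairs; pass None for the final
    assistant slot to leave the turn open for generation.
    context: optional context string prepended to the first user turn.
    """
    parts = [f"<extra_id_0>System\n{system}\n"]
    for i, (user, assistant) in enumerate(turns):
        user_text = f"{context}\n{user}" if (context is not None and i == 0) else user
        parts.append(f"<extra_id_1>User\n{user_text}\n<extra_id_1>Assistant\n")
        if assistant is not None:
            parts.append(f"{assistant}\n")
    return "".join(parts)

# Single-turn example: ends with an open Assistant turn.
print(build_prompt([("Hello!", None)]))
```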

## Software Integration for Inference
- **Runtime Engine(s):** NVIDIA AI Enterprise
- **Toolkit:** NeMo Framework
- **Supported Hardware Architecture Compatibility:** H100, A100 80GB, A100 40GB

### Steps to Run Inference
We demonstrate inference using the [NVIDIA NeMo Framework](https://github.com/NVIDIA/NeMo), which allows hassle-free model deployment based on [NVIDIA TRT-LLM](https://github.com/NVIDIA/TensorRT-LLM), a highly optimized inference solution focusing on high throughput and low latency.

Prerequisite: a machine with at least 4x 40GB or 2x 80GB NVIDIA GPUs, and 300GB of free disk space.

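As a rough sanity check of this requirement (our estimate, assuming fp16 weights and ignoring activation and KV-cache overhead):

```python
# Back-of-envelope GPU memory estimate for the model weights alone.
params_billion = 70     # 70B parameters
bytes_per_param = 2     # fp16
weights_gb = params_billion * bytes_per_param  # ~140 GB of weights
print(weights_gb)       # 140 -> needs 2 x 80GB or 4 x 40GB GPUs at minimum
```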
1. Sign up to get free and immediate access to the [NVIDIA NeMo Framework container](https://developer.nvidia.com/nemo-framework). If you don't have an NVIDIA NGC account, you will be prompted to sign up for one before proceeding.

2. If you don't already have an NVIDIA NGC API key, sign into [NVIDIA NGC](https://ngc.nvidia.com/setup), select `organization/team: ea-bignlp/ga-participants`, and click Generate API key. Save this key for the next step.

3. On your machine, log in to `nvcr.io` with Docker.
```bash
docker login nvcr.io
Username: $oauthtoken
Password: <Your Saved NGC Key>
```
4. Download the required container.
```bash
docker pull nvcr.io/ea-bignlp/ga-participants/nemofw-inference:23.10
```
5. Download the checkpoint.
```bash
git lfs install
git clone https://huggingface.co/nvidia/NV-Llama2-70B-RLHF-Chat
cd NV-Llama2-70B-RLHF-Chat
git lfs pull
```
6. Convert the checkpoint into NeMo format.
```bash
# Run from inside the NV-Llama2-70B-RLHF-Chat directory (where step 5 left off).
tar -cvf NV-Llama2-70B-RLHF-Chat.nemo .
mv NV-Llama2-70B-RLHF-Chat.nemo ../
cd ..
rm -r NV-Llama2-70B-RLHF-Chat
```
7. Run the Docker container.
```bash
docker run --gpus all -it --rm --shm-size=300g -p 8000:8000 -v ${PWD}/NV-Llama2-70B-RLHF-Chat.nemo:/opt/checkpoints/NV-Llama2-70B-RLHF-Chat.nemo -w /opt/NeMo nvcr.io/ea-bignlp/ga-participants/nemofw-inference:23.10
```
8. Within the container, start the server in the background. This step converts the NeMo checkpoint to TRT-LLM format and then deploys it with TRT-LLM. For an explanation of each argument and advanced usage, please refer to the [NeMo FW Deployment Guide](https://docs.nvidia.com/nemo-framework/user-guide/latest/deployingthenemoframeworkmodel.html).
```bash
python scripts/deploy/deploy_triton.py --nemo_checkpoint /opt/checkpoints/NV-Llama2-70B-RLHF-Chat.nemo --model_type="llama" --triton_model_name NV-Llama2-70B-RLHF-Chat --triton_http_address 0.0.0.0 --triton_port 8000 --num_gpus 2 --max_input_len 3072 --max_output_len 1024 --max_batch_size 1 &
```
9. Once the server is ready (i.e., when you see the messages below), you are ready to launch your client code.
```
Started HTTPService at 0.0.0.0:8000
Started GRPCInferenceService at 0.0.0.0:8001
Started Metrics Service at 0.0.0.0:8002
```
An example for single-turn closed QA with context:
```python
from nemo.deploy import NemoQuery

PROMPT_TEMPLATE = """<extra_id_0>System

<extra_id_1>User
This is a chat between a user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions based on the context. The assistant should also indicate when the answer cannot be found in the context.
Context: {context}
Please give a full and complete answer for the question. {prompt}
<extra_id_1>Assistant
"""

context = "Climate change refers to long-term shifts in temperatures and weather patterns. Such shifts can be natural, due to changes in the sun’s activity or large volcanic eruptions. But since the 1800s, human activities have been the main driver of climate change, primarily due to the burning of fossil fuels like coal, oil and gas."
question = "What did Michael Jackson achieve?"
prompt = PROMPT_TEMPLATE.format(context=context, prompt=question)
print(prompt)

nq = NemoQuery(url="localhost:8000", model_name="NV-Llama2-70B-RLHF-Chat")
output = nq.query_llm(prompts=[prompt], max_output_token=256, top_k=1, top_p=0.0, temperature=1.0)

# This container currently does not support stop words, but you can use
# something like this as a workaround: truncate at the next turn marker.
output = output[0][0].split("\n<extra_id_1>")[0]
print(output)
```
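The stop-word workaround above can be factored into a small, server-independent helper; `truncate_at_turn` is an illustrative name of our own, not a NeMo API:

```python
def truncate_at_turn(text, marker="\n<extra_id_1>"):
    """Emulate a stop word by cutting generated text at the next turn marker.

    If the marker never appears, the text is returned unchanged.
    """
    return text.split(marker)[0]

# The model may keep generating past its own turn; this trims the overflow.
raw = "NVIDIA is a technology company.\n<extra_id_1>User\nnext question"
print(truncate_at_turn(raw))
```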

An example for multi-turn conversation:
```python
from nemo.deploy import NemoQuery

PROMPT_TEMPLATE1 = """<extra_id_0>System

<extra_id_1>User
{prompt1}
<extra_id_1>Assistant
"""
PROMPT_TEMPLATE2 = """<extra_id_0>System

<extra_id_1>User
{prompt1}
<extra_id_1>Assistant
{response1}
<extra_id_1>User
{prompt2}
<extra_id_1>Assistant
"""

nq = NemoQuery(url="localhost:8000", model_name="NV-Llama2-70B-RLHF-Chat")
# Turn 1
question1 = "Write an introduction about NVIDIA."
prompt = PROMPT_TEMPLATE1.format(prompt1=question1)
print(prompt)

output = nq.query_llm(prompts=[prompt], max_output_token=256, top_k=1, top_p=0.0, temperature=1.0)

# This container currently does not support stop words, but you can use
# something like this as a workaround: truncate at the next turn marker.
response1 = output[0][0].split("\n<extra_id_1>")[0]
print(response1)

# Turn 2
question2 = "Can you write it in a poem in the style of Shakespeare?"
prompt = PROMPT_TEMPLATE2.format(prompt1=question1, response1=response1, prompt2=question2)
print(prompt)

output = nq.query_llm(prompts=[prompt], max_output_token=256, top_k=1, top_p=0.0, temperature=1.0)

response2 = output[0][0].split("\n<extra_id_1>")[0]
print(response2)
```


## Evaluation

### MT-Bench

| Total | Writing | Roleplay | Extraction | STEM | Humanities | Reasoning | Math | Coding |
|:-----:|:-------:|:--------:|:----------:|:----:|:----------:|:---------:|:----:|:------:|
| 7.59  | 9.15    | 8.90     | 8.80       | 8.60 | 9.65       | 6.25      | 4.65 | 4.70   |

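For reference, the Total column is consistent with the mean of the eight category scores (our arithmetic, not an official definition of the metric):

```python
# Sanity check: the reported Total matches the mean of the category scores.
scores = {
    "Writing": 9.15, "Roleplay": 8.90, "Extraction": 8.80, "STEM": 8.60,
    "Humanities": 9.65, "Reasoning": 6.25, "Math": 4.65, "Coding": 4.70,
}
total = sum(scores.values()) / len(scores)
print(f"{total:.2f}")  # 7.59
```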
### Academic Benchmarks

| MMLU<br>(5-shot) | HellaSwag<br>(0-shot) | ARC easy<br>(0-shot) | WinoGrande<br>(0-shot) | TruthfulQA MC2<br>(0-shot) | TriviaQA<br>(5-shot) |
|:----------------:|:---------------------:|:--------------------:|:----------------------:|:--------------------------:|:--------------------:|
| 68.04 | 84.04 | 83.67 | 79.40 | 58.16 | 80.86 |

## Intended Use
- The NV-Llama2-70B-RLHF-Chat model is best suited for chat use cases, including question answering, search, summarization, and instruction following.
- Ethical use: Technology can have a profound impact on people and the world, and NVIDIA is committed to enabling trust and transparency in AI development. NVIDIA encourages users to adopt principles of AI ethics and trustworthiness to guide their decisions, following the guidelines in the [cc-by-nc-4.0](https://creativecommons.org/licenses/by-nc/4.0/legalcode.en) license.

## Limitations
- The model was trained on data that contains toxic language and societal biases originally crawled from the Internet. Therefore, the model may amplify those biases and return toxic responses, especially when given toxic prompts.
- The model may generate answers that are inaccurate, omit key information, or include irrelevant or redundant text, and may produce socially unacceptable or undesirable output even if the prompt does not include anything explicitly offensive.
- We recommend deploying the model with [NeMo Guardrails](https://github.com/NVIDIA/NeMo-Guardrails) to mitigate these potential issues.