	---
	license: cc-by-nc-4.0
	language:
	- en
	pipeline_tag: text-generation
	inference: false
	fine-tuning: true
	tags:
	- nvidia
	- conversational
	- llama2
	- rlhf
	datasets:
	- Anthropic/hh-rlhf
	---

	# NV-Llama2-70B-RLHF-Chat

	## Description
	NV-Llama2-70B-RLHF-Chat is a 70-billion-parameter generative language model instruction-tuned from the [Llama2-70B](https://huggingface.co/meta-llama/Llama-2-70b) model. It accepts inputs with a context length of up to 4,096 tokens. The model was fine-tuned for instruction following using Supervised Fine-Tuning (SFT) on a combination of public and proprietary data and Reinforcement Learning from Human Feedback (RLHF) on the [HH-RLHF dataset](https://huggingface.co/datasets/Anthropic/hh-rlhf). It achieves 7.59 on MT-Bench and demonstrates strong performance on academic benchmarks.

	NV-Llama2-70B-RLHF-Chat is trained with NVIDIA [NeMo-Aligner](https://github.com/NVIDIA/NeMo-Aligner), a scalable toolkit for performant and efficient model alignment. NeMo-Aligner is built on the [NeMo Framework](https://github.com/NVIDIA/NeMo), which allows training to scale to thousands of GPUs using tensor, data, and pipeline parallelism for all components of alignment. All of our checkpoints are cross-compatible with the NeMo ecosystem, allowing for inference deployment and further customization.

	Try this model for free, hosted by us at [NVIDIA AI Playground](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-foundation/models/nv-llama2-70b-rlhf). You can use it through the provided UI or through a limited-access API (up to 10,000 requests within 30 days). If you need more requests, the steps below show how to set up your own inference server.

	<img src="https://huggingface.co/nvidia/NV-Llama2-70B-RLHF-Chat/resolve/main/mtbench_categories.png" alt="MT Bench Categories" width="800" style="margin-left:auto; margin-right:auto; display:block"/>


	## Model Architecture
	- Architecture Type: Transformer
	- Network Architecture: Llama 2

	## Prompt Format
	| Single-Turn | Single-Turn with Context | Multi-Turn |
	|---|---|---|
	| \<extra_id_0>System<br><br>\<extra_id_1>User<br>{prompt}<br>\<extra_id_1>Assistant | \<extra_id_0>System<br><br>\<extra_id_1>User<br>{context}<br>{prompt}<br>\<extra_id_1>Assistant | \<extra_id_0>System<br><br>\<extra_id_1>User<br>{prompt 1}<br>\<extra_id_1>Assistant<br>{response 1}<br>\<extra_id_1>User<br>{prompt 2}<br>\<extra_id_1>Assistant |
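
The three layouts in the table differ only in whether a context precedes the prompt and in how many earlier User/Assistant turns come before the final, open Assistant turn. A minimal sketch of a helper that assembles all three is shown here. `build_prompt` and its parameters are hypothetical names introduced for illustration and are not part of NeMo.

```python
from typing import List, Optional, Tuple

SYSTEM_HEADER = "<extra_id_0>System\n\n"
TURN_MARKER = "<extra_id_1>"


def build_prompt(
    prompt: str,
    context: Optional[str] = None,
    history: Optional[List[Tuple[str, str]]] = None,
) -> str:
    """Assemble a prompt in the layout shown in the table above.

    history holds earlier (user prompt, assistant response) pairs.
    """
    parts = [SYSTEM_HEADER]
    for past_prompt, past_response in history or []:
        parts.append(f"{TURN_MARKER}User\n{past_prompt}\n")
        parts.append(f"{TURN_MARKER}Assistant\n{past_response}\n")
    # In the single-turn-with-context layout, the context precedes the prompt.
    user_text = f"{context}\n{prompt}" if context else prompt
    parts.append(f"{TURN_MARKER}User\n{user_text}\n")
    # The prompt ends with an open Assistant turn for the model to complete.
    parts.append(f"{TURN_MARKER}Assistant\n")
    return "".join(parts)


single_turn = build_prompt("Write an introduction about NVIDIA.")
print(single_turn)
```

Passing `context="..."` yields the middle column of the table, and passing `history=[(prompt_1, response_1)]` yields the multi-turn column.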


	## Software Integration for Inference
	- Runtime Engine(s): NVIDIA AI Enterprise
	- Toolkit: NeMo Framework
	- Supported Hardware Architecture Compatibility: H100, A100 80GB, A100 40GB

	### Steps to Run Inference
	We demonstrate inference using the [NVIDIA NeMo Framework](https://github.com/NVIDIA/NeMo), which simplifies model deployment through [NVIDIA TRT-LLM](https://github.com/NVIDIA/TensorRT-LLM), a highly optimized inference solution focused on high throughput and low latency.

	Prerequisite: You need a machine with at least four 40GB or two 80GB NVIDIA GPUs, and 300GB of free disk space.
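
As a rough sanity check on the GPU requirement, the sketch below estimates the memory taken by the weights alone. It assumes 16-bit weights (2 bytes per parameter), which this model card does not state, and it ignores activations, the KV cache, and runtime overhead. Treat the result as a lower bound, not a sizing guide.

```python
PARAMS = 70e9            # 70 billion parameters (from the model description)
BYTES_PER_PARAM = 2      # assumption: 16-bit weights; not stated in this card

weights_gb = PARAMS * BYTES_PER_PARAM / 1e9

# The two GPU configurations named in the prerequisite.
configs = {"4 x 40GB": 4 * 40, "2 x 80GB": 2 * 80}

for name, total_gb in configs.items():
    headroom_gb = total_gb - weights_gb
    print(f"{name}: {total_gb} GB total, {weights_gb:.0f} GB for weights, "
          f"{headroom_gb:.0f} GB left for everything else")
```

Under that assumption, both configurations offer the same aggregate memory, which would explain why they are listed as alternatives.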

	1. Sign up to get free and immediate access to the [NVIDIA NeMo Framework container](https://developer.nvidia.com/nemo-framework). If you don’t have an NVIDIA NGC account, you will be prompted to create one before proceeding.

	2. If you don’t already have an NVIDIA NGC key, sign in to [NVIDIA NGC](https://ngc.nvidia.com/setup), select `organization/team: ea-bignlp/ga-participants`, and click Generate API key. Save this key for the next step.

	3. On your machine, log in to `nvcr.io` with Docker. Enter `$oauthtoken` literally as the username and your saved NGC key as the password.
	```bash
	docker login nvcr.io
	Username: $oauthtoken
	Password: <Your Saved NGC Key>
	```
	4. Download the required container.
	```bash
	docker pull nvcr.io/ea-bignlp/ga-participants/nemofw-inference:23.10
	```
	5. Download the checkpoint.
	```bash
	git lfs install
	git clone https://huggingface.co/nvidia/NV-Llama2-70B-RLHF-Chat
	cd NV-Llama2-70B-RLHF-Chat
	git lfs pull
	```
	6. Convert checkpoint into NeMo format.
	```bash
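	# If you are continuing in the same shell as step 5, you are already inside
	# NV-Llama2-70B-RLHF-Chat; otherwise, cd into it first.
	# Write the archive to the parent directory so tar does not try to archive itself.
	tar -cvf ../NV-Llama2-70B-RLHF-Chat.nemo .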
	cd ..
	rm -r NV-Llama2-70B-RLHF-Chat
	```
	7. Run Docker container.
	```bash
	docker run --gpus all -it --rm --shm-size=300g -p 8000:8000 -v ${PWD}/NV-Llama2-70B-RLHF-Chat.nemo:/opt/checkpoints/NV-Llama2-70B-RLHF-Chat.nemo -w /opt/NeMo nvcr.io/ea-bignlp/ga-participants/nemofw-inference:23.10
	```
	8. Within the container, start the server in the background. This step both converts the NeMo checkpoint to TRT-LLM format and deploys it with TRT-LLM. For an explanation of each argument and advanced usage, refer to the [NeMo FW Deployment Guide](https://docs.nvidia.com/nemo-framework/user-guide/latest/deployingthenemoframeworkmodel.html).
	```bash
	python scripts/deploy/deploy_triton.py --nemo_checkpoint /opt/checkpoints/NV-Llama2-70B-RLHF-Chat.nemo --model_type="llama" --triton_model_name NV-Llama2-70B-RLHF-Chat --triton_http_address 0.0.0.0 --triton_port 8000 --num_gpus 2 --max_input_len 3072 --max_output_len 1024 --max_batch_size 1 &
	```
	9. Once the server is ready (that is, when you see the messages below), you can launch your client code.
	```
	Started HTTPService at 0.0.0.0:8000
	Started GRPCInferenceService at 0.0.0.0:8001
	Started Metrics Service at 0.0.0.0:8002
	```
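
Instead of watching the log, a client can poll the server until it reports ready. The sketch below assumes the deployment exposes Triton's standard HTTP health endpoint, `/v2/health/ready`, on the port passed to `--triton_port`. The model card does not confirm this for this container. `is_server_ready` and `wait_until_ready` are hypothetical helper names.

```python
import http.client
import time
import urllib.request


def is_server_ready(host: str = "localhost", port: int = 8000, timeout: float = 2.0) -> bool:
    """Return True if the assumed health endpoint answers with HTTP 200."""
    url = f"http://{host}:{port}/v2/health/ready"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, or an HTTP error status: not ready yet.
        return False


def wait_until_ready(host: str = "localhost", port: int = 8000,
                     attempts: int = 60, delay_s: float = 10.0) -> bool:
    """Poll the health endpoint up to `attempts` times, sleeping between tries."""
    for _ in range(attempts):
        if is_server_ready(host, port):
            return True
        time.sleep(delay_s)
    return False
```

Calling `wait_until_ready()` before creating the `NemoQuery` client lets a script block until the conversion and deployment in step 8 have finished.
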
	An example for single-turn closed QA with context:
	```python
	from nemo.deploy import NemoQuery

	PROMPT_TEMPLATE = """<extra_id_0>System

	<extra_id_1>User
	This is a chat between a user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions based on the context. The assistant should also indicate when the answer cannot be found in the context.
	Context: {context}
	Please give a full and complete answer for the question. {prompt}
	<extra_id_1>Assistant
	"""

	context = "Climate change refers to long-term shifts in temperatures and weather patterns. Such shifts can be natural, due to changes in the sun’s activity or large volcanic eruptions. But since the 1800s, human activities have been the main driver of climate change, primarily due to the burning of fossil fuels like coal, oil and gas."
	question = "What did Michael Jackson achieve?"
	prompt = PROMPT_TEMPLATE.format(context=context, prompt=question)
	print(prompt)

	nq = NemoQuery(url="localhost:8000", model_name="NV-Llama2-70B-RLHF-Chat")
	output = nq.query_llm(prompts=[prompt], max_output_token=256, top_k=1, top_p=0.0, temperature=1.0)

	# This container does not currently support stop words. As a workaround,
	# truncate the output at the start of the next turn marker.
	output = output[0][0].split("\n<extra_id_1>")[0]
	print(output)
	```

	An example for multi-turn conversation:
	```python
	from nemo.deploy import NemoQuery

	PROMPT_TEMPLATE1 = """<extra_id_0>System

	<extra_id_1>User
	{prompt1}
	<extra_id_1>Assistant
	"""
	PROMPT_TEMPLATE2 = """<extra_id_0>System

	<extra_id_1>User
	{prompt1}
	<extra_id_1>Assistant
	{response1}
	<extra_id_1>User
	{prompt2}
	<extra_id_1>Assistant
	"""

	nq = NemoQuery(url="localhost:8000", model_name="NV-Llama2-70B-RLHF-Chat")
	# Turn 1
	question1 = "Write an introduction about NVIDIA."
	prompt = PROMPT_TEMPLATE1.format(prompt1=question1)
	print(prompt)

	output = nq.query_llm(prompts=[prompt], max_output_token=256, top_k=1, top_p=0.0, temperature=1.0)

	# This container does not currently support stop words. As a workaround,
	# truncate the output at the start of the next turn marker.
	response1 = output[0][0].split("\n<extra_id_1>")[0]
	print(response1)

	# Turn 2
	question2 = "Can you write it in a poem in the style of Shakespeare?"
	prompt = PROMPT_TEMPLATE2.format(prompt1=question1, response1=response1, prompt2=question2)
	print(prompt)

	output = nq.query_llm(prompts=[prompt], max_output_token=256, top_k=1, top_p=0.0, temperature=1.0)

	response2 = output[0][0].split("\n<extra_id_1>")[0]
	print(response2)
	```
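
The two hard-coded templates above generalize to any number of turns if the conversation is kept as a list of (prompt, response) pairs. The sketch below shows one way to do that. `render`, `chat_turn`, and the injectable `query_fn` are hypothetical names for illustration. In a live setup, `query_fn` would wrap the `nq.query_llm(...)` call from the examples above and return `output[0][0]`; here a stub stands in so the snippet runs without a server.

```python
from typing import Callable, List, Tuple

TURN_MARKER = "<extra_id_1>"
History = List[Tuple[str, str]]


def render(history: History, new_prompt: str) -> str:
    """Render earlier turns plus the new user prompt in the model's chat layout."""
    text = "<extra_id_0>System\n\n"
    for past_prompt, past_response in history:
        text += f"{TURN_MARKER}User\n{past_prompt}\n{TURN_MARKER}Assistant\n{past_response}\n"
    return text + f"{TURN_MARKER}User\n{new_prompt}\n{TURN_MARKER}Assistant\n"


def chat_turn(history: History, new_prompt: str, query_fn: Callable[[str], str]) -> str:
    """Send one turn, truncate at the next turn marker, and record the exchange."""
    raw = query_fn(render(history, new_prompt))
    # Same stop-word workaround as in the examples above.
    response = raw.split("\n" + TURN_MARKER)[0]
    history.append((new_prompt, response))
    return response


def stub_query(prompt: str) -> str:
    # Stand-in for the deployed model: it answers, then runs on into a new turn.
    return "stub reply\n<extra_id_1>User\nrunaway text"


history: History = []
chat_turn(history, "Write an introduction about NVIDIA.", stub_query)
chat_turn(history, "Can you write it in a poem in the style of Shakespeare?", stub_query)
print(render(history, "next question"))
```

Keeping the history outside the templates means a third or fourth turn needs no new template, only another call to `chat_turn`.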


	## Evaluation

	### MT-bench

	| Total | Writing | Roleplay | Extraction | STEM | Humanities | Reasoning | Math | Coding |
	|:-------:|:-------:|:--------:|:----------:|:----:|:----------:|:-----------:|:------:|:--------:|
	| 7.59 | 9.15 | 8.90 | 8.80 | 8.60 | 9.65 | 6.25 | 4.65 | 4.70 |
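
The Total column is consistent with an unweighted mean of the eight category scores. The card does not state how the total was computed, so the check below reflects an assumption about the aggregation, not a description of the evaluation code.

```python
from decimal import Decimal

# Category scores copied from the table above.
scores = {
    "Writing": "9.15",
    "Roleplay": "8.90",
    "Extraction": "8.80",
    "STEM": "8.60",
    "Humanities": "9.65",
    "Reasoning": "6.25",
    "Math": "4.65",
    "Coding": "4.70",
}

# Decimal keeps the arithmetic exact, avoiding binary floating-point noise.
mean = sum(Decimal(value) for value in scores.values()) / len(scores)
rounded = mean.quantize(Decimal("0.01"))
print(f"mean of categories = {mean}, rounded = {rounded}")
```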

	### Academic Benchmarks

	| MMLU<br>(5-shot) | HellaSwag<br>(0-shot) | ARC easy<br>(0-shot) | WinoGrande<br>(0-shot) | TruthfulQA MC2<br>(0-shot) | TriviaQA<br>(5-shot) |
	|:----------------:|:---------------------:|:--------------------:|:----------------------:|:--------------------------:|:--------------------:|
	| 68.04 | 84.04 | 83.67 | 79.40 | 58.16 | 80.86 |

	## Intended use
	- The NV-Llama2-70B-RLHF-Chat model is best suited to chat use cases, including question answering, search, summarization, and instruction following.
	- Ethical use: Technology can have a profound impact on people and the world, and NVIDIA is committed to enabling trust and transparency in AI development. NVIDIA encourages users to adopt principles of AI ethics and trustworthiness to guide their decisions by following the guidelines in the [cc-by-nc-4.0](https://creativecommons.org/licenses/by-nc/4.0/legalcode.en) license.

	## Limitations
	- The model was trained on data that contains toxic language and societal biases originally crawled from the Internet. It may therefore amplify those biases and return toxic responses, especially when given toxic prompts.
	- The model may generate answers that are inaccurate, omit key information, or include irrelevant or redundant text. It may also produce socially unacceptable or undesirable text, even if the prompt itself does not include anything explicitly offensive.
	- We recommend deploying the model with [NeMo Guardrails](https://github.com/NVIDIA/NeMo-Guardrails) to mitigate these potential issues.