Text Generation · NeMo · English · nvidia · steerlm · llama2

Commit 9c8c94a by zhilinw (1 parent: 6c49012)

Update README.md

Files changed (1):
  1. README.md (+9, -8)
README.md CHANGED

@@ -80,12 +80,12 @@ Pre-requisite: you would need at least a machine with 4 40GB or 2 80GB NVIDIA GP
 docker pull nvcr.io/ea-bignlp/ga-participants/nemofw-inference:23.10
 ```
 
-6. Download the checkpoint
+5. Download the checkpoint
 ```
 git lfs install
 git clone https://huggingface.co/nvidia/Llama2-70B-SteerLM-Chat
 ```
-7. Convert checkpoint into nemo format
+6. Convert checkpoint into nemo format
 ```
 cd Llama2-70B-SteerLM-Chat/Llama2-70B-SteerLM-Chat
 tar -cvf Llama2-70B-SteerLM-Chat.nemo .
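A note on the conversion step just renumbered: per the `tar -cvf` line above, a `.nemo` checkpoint is simply a tar archive of the checkpoint directory, so the result can be sanity-checked before the container mounts it. A minimal sketch, assuming the archive has been moved up to the directory that the `docker run` step below mounts from (those intervening lines fall outside this hunk):

```
import tarfile

# a .nemo file is a plain tar archive of the checkpoint directory;
# listing a few members confirms the tar step above packed it correctly
# (the path assumes the archive sits where docker run is later invoked)
with tarfile.open("Llama2-70B-SteerLM-Chat.nemo") as archive:
    for name in archive.getnames()[:10]:
        print(name)
```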
@@ -94,17 +94,17 @@ Pre-requisite: you would need at least a machine with 4 40GB or 2 80GB NVIDIA GP
 rm -r Llama2-70B-SteerLM-Chat
 ```
 
-8. Run Docker container
+7. Run Docker container
 ```
 docker run --gpus all -it --rm --shm-size=300g -p 8000:8000 -v ${PWD}/Llama2-70B-SteerLM-Chat.nemo:/opt/checkpoints/Llama2-70B-SteerLM-Chat.nemo -w /opt/NeMo nvcr.io/ea-bignlp/ga-participants/nemofw-inference:23.10
 ```
-9. Within the container, start the server in the background. This step both converts the nemo checkpoint to TRT-LLM and then deploys it on Triton using TRT-LLM. For an explanation of each argument and advanced usage, please refer to the [NeMo FW Deployment Guide](https://docs.nvidia.com/nemo-framework/user-guide/latest/deployingthenemoframeworkmodel.html)
+8. Within the container, start the server in the background. This step both converts the nemo checkpoint to TRT-LLM and then deploys it on Triton using TRT-LLM. For an explanation of each argument and advanced usage, please refer to the [NeMo FW Deployment Guide](https://docs.nvidia.com/nemo-framework/user-guide/latest/deployingthenemoframeworkmodel.html)
 
 ```
 python scripts/deploy/deploy_triton.py --nemo_checkpoint /opt/checkpoints/Llama2-70B-SteerLM-Chat.nemo --model_type="llama" --triton_model_name Llama2-70B-SteerLM-Chat --triton_http_address 0.0.0.0 --triton_port 8000 --num_gpus 2 --max_input_len 3072 --max_output_len 1024 --max_batch_size 1 &
 ```
 
-10. Once the server is ready in 20-45 mins depending on your machine (i.e. when you see the message below), you are ready to launch your client code
+9. Once the server is ready in 20-45 mins depending on your machine (i.e. when you see the message below), you are ready to launch your client code
 
 ```
 Started HTTPService at 0.0.0.0:8000
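Because the deploy script in step 8 is backgrounded and step 9 has you watch the logs for the line above, readiness can also be probed programmatically. A minimal sketch, assuming the server exposes Triton's standard `/v2/health/ready` HTTP route on the mapped port (an assumption; the linked deployment guide documents the exact endpoints):

```
import time
import requests

# poll the stock Triton readiness route until the backgrounded deploy
# script finishes engine building and starts serving; the route is an
# assumption based on standard Triton HTTP behavior
while True:
    try:
        if requests.get("http://localhost:8000/v2/health/ready", timeout=5).ok:
            break
    except requests.exceptions.ConnectionError:
        pass
    time.sleep(30)
print("server is ready")
```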
@@ -134,9 +134,10 @@ Pre-requisite: you would need at least a machine with 4 40GB or 2 80GB NVIDIA GP
 output = output[0][0].split("\n<extra_id_1>")[0]
 print(output)
 ```
-11. Prompt formatting for single and multi turn conversations
+10. If you would like to support multi-turn conversations or adjust attribute values at inference time, here is some guidance:
+
 
-Single Turn
+Default template for Single Turn
 ```
 <extra_id_0>System
 A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
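The hunk above shows only the tail of the client snippet; the surrounding lines sit outside the diff context. For orientation, a fuller sketch of the call that tail belongs to, assuming the inference container's `NemoQuery` client from `nemo.deploy` (the class name, the `query_llm` arguments, and the example question are assumptions, not the README's exact code):

```
from nemo.deploy import NemoQuery  # assumed client API of the inference container

# a single-turn prompt following the default template shown in this diff
prompt = (
    "<extra_id_0>System\nA chat between a curious user and an artificial "
    "intelligence assistant. The assistant gives helpful, detailed, and "
    "polite answers to the user's questions.\n"
    "<extra_id_1>User\nWrite a short greeting.\n<extra_id_1>Assistant\n"
    "<extra_id_2>quality:4,toxicity:0,humor:0,creativity:0,helpfulness:4,"
    "correctness:4,coherence:4,complexity:4,verbosity:4\n"
)

nq = NemoQuery(url="localhost:8000", model_name="Llama2-70B-SteerLM-Chat")
output = nq.query_llm(prompts=[prompt], max_output_token=1024, top_k=1, top_p=0.0, temperature=1.0)

# the server does not skip special tokens, so trim at the next turn marker
# (these are the two lines visible in the hunk above)
output = output[0][0].split("\n<extra_id_1>")[0]
print(output)
```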
@@ -146,7 +147,7 @@ Pre-requisite: you would need at least a machine with 4 40GB or 2 80GB NVIDIA GP
 <extra_id_2>quality:4,toxicity:0,humor:0,creativity:0,helpfulness:4,correctness:4,coherence:4,complexity:4,verbosity:4
 ```
 
-Multi-Turn
+Default template for Multi-Turn
 ```
 <extra_id_0>System
 A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
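Both template hunks end inside their code blocks, so the multi-turn layout is only partly visible in this diff. As a rough sketch of how such prompts compose, extrapolated from the single-turn template and the attribute line above (the exact multi-turn layout is an assumption; the authoritative templates are in the README itself):

```
# system prompt and attribute string taken from the templates above;
# the attribute values (0-4) are what "adjust attribute values at
# inference time" refers to in step 10
SYSTEM_PROMPT = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions."
)
ATTRIBUTES = (
    "quality:4,toxicity:0,humor:0,creativity:0,helpfulness:4,"
    "correctness:4,coherence:4,complexity:4,verbosity:4"
)

def build_prompt(turns):
    """Compose a SteerLM prompt from [(user_msg, assistant_msg_or_None), ...];
    leave the final assistant message as None so the model completes it.
    The multi-turn chaining here is an assumption based on the templates above."""
    parts = [f"<extra_id_0>System\n{SYSTEM_PROMPT}\n"]
    for user_msg, assistant_msg in turns:
        parts.append(f"<extra_id_1>User\n{user_msg}\n<extra_id_1>Assistant\n<extra_id_2>{ATTRIBUTES}\n")
        if assistant_msg is not None:
            parts.append(f"{assistant_msg}\n")
    return "".join(parts)

# second turn of a conversation: the first exchange is replayed verbatim
print(build_prompt([
    ("What does SteerLM let me control?", "It lets you steer attributes such as verbosity and creativity."),
    ("How do I make answers shorter?", None),
]))
```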