zhilinw committed on
Commit cf5171e
1 Parent(s): e6d27ee

Update README.md

Files changed (1)
  1. README.md +14 -4
README.md CHANGED
@@ -83,17 +83,27 @@ Pre-requisite: you would need at least a machine with 4 40GB or 2 80GB NVIDIA GP
 git lfs install
 git clone https://huggingface.co/nvidia/Llama2-70B-SteerLM-Chat
 ```
- 7. Run Docker container
+ 7. Convert the checkpoint into .nemo format
+ ```
+ cd Llama2-70B-SteerLM-Chat/Llama2-70B-SteerLM-Chat
+ tar -cvf Llama2-70B-SteerLM-Chat.nemo .
+ mv Llama2-70B-SteerLM-Chat.nemo ../
+ cd ..
+ # removing the extracted folder saves around 130 GB of space, but is optional if you have plenty of space
+ rm -r Llama2-70B-SteerLM-Chat
+ ```
+
+ 8. Run the Docker container
 ```
 docker run --gpus all -it --rm --shm-size=300g -p 8000:8000 -v ${PWD}/Llama2-70B-SteerLM-Chat.nemo:/opt/checkpoints/Llama2-70B-SteerLM-Chat.nemo -w /opt/NeMo nvcr.io/ea-bignlp/ga-participants/nemofw-inference:23.10
 ```
- 8. Within the container, start the server in the background. This step does both conversion of the nemo checkpoint to TRT-LLM and then deployment using TRTLLM. For an explanation of each argument and advanced usage, please refer to [NeMo FW Deployment Guide](https://docs.nvidia.com/nemo-framework/user-guide/latest/deployingthenemoframeworkmodel.html)
+ 9. Within the container, start the server in the background. This step first converts the NeMo checkpoint to TRT-LLM and then deploys it with TRT-LLM. For an explanation of each argument and advanced usage, please refer to the [NeMo FW Deployment Guide](https://docs.nvidia.com/nemo-framework/user-guide/latest/deployingthenemoframeworkmodel.html)

 ```
 python scripts/deploy/deploy_triton.py --nemo_checkpoint /opt/checkpoints/Llama2-70B-SteerLM-Chat.nemo --model_type="llama" --triton_model_name Llama2-70B-SteerLM-Chat --triton_http_address 0.0.0.0 --triton_port 8000 --num_gpus 2 --max_input_len 3072 --max_output_len 1024 --max_batch_size 1 &
 ```

- 9. Once the server is ready in 20-45 mins depending on your computer (i.e. when you see this messages below), you are ready to launch your client code
+ 10. Once the server is ready, which takes 20-45 minutes depending on your machine (i.e. when you see the message below), you are ready to launch your client code

 ```
 Started HTTPService at 0.0.0.0:8000
@@ -122,7 +132,7 @@ Pre-requisite: you would need at least a machine with 4 40GB or 2 80GB NVIDIA GP
 output = output[0][0].split("\n<extra_id_1>")[0]
 print(output)
 ```
- 10. Prompt formatting for single and multi turn conversations
+ 11. Prompt formatting for single- and multi-turn conversations

 Single Turn
 ```
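
For reference, below is a minimal sketch of the client code that step 10 refers to, using the `NemoQuery` helper shipped with the NeMo FW inference container and the same post-processing shown in the diff context above. The readiness poll against Triton's standard `/v2/health/ready` endpoint, the use of `requests`, and the placeholder question are assumptions added here for illustration; the actual prompt string must be built with the template from the prompt-formatting step of the README.

```python
import time

import requests  # assumed available; used only to poll Triton's readiness endpoint
from nemo.deploy import NemoQuery

TRITON_URL = "localhost:8000"
MODEL_NAME = "Llama2-70B-SteerLM-Chat"  # must match --triton_model_name in the deploy command

# Wait for the server started by deploy_triton.py; Triton serves the standard
# KServe v2 readiness probe at /v2/health/ready once it is up.
while True:
    try:
        if requests.get(f"http://{TRITON_URL}/v2/health/ready", timeout=5).ok:
            break
    except requests.exceptions.ConnectionError:
        pass
    time.sleep(30)

# Placeholder question: wrap it in the SteerLM single- or multi-turn template
# from the "Prompt formatting" step of this README before sending it.
prompt = "Write a haiku about GPUs"

nq = NemoQuery(url=TRITON_URL, model_name=MODEL_NAME)
output = nq.query_llm(prompts=[prompt], max_output_token=1024, top_k=1, top_p=0.0, temperature=1.0)

# The container does not stop on stop strings, so trim at the next turn marker
# manually, as in the client snippet above.
output = output[0][0].split("\n<extra_id_1>")[0]
print(output)
```

With `top_k=1` decoding is greedy, so repeated runs of the same formatted prompt return the same answer.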