zhilinw committed
Commit 7e43680
Parent: 564fa2c

Update README.md

Files changed (1): README.md +4 -4
README.md CHANGED
@@ -92,17 +92,17 @@ Pre-requisite: you would need at least a machine with 4 40GB or 2 80GB NVIDIA GP
 rm -r Llama2-70B-SteerLM-Chat
 ```
 
-9. Run the Docker container
+8. Run the Docker container
 ```
 docker run --gpus all -it --rm --shm-size=300g -p 8000:8000 -v ${PWD}/Llama2-70B-SteerLM-Chat.nemo:/opt/checkpoints/Llama2-70B-SteerLM-Chat.nemo -w /opt/NeMo nvcr.io/ea-bignlp/ga-participants/nemofw-inference:23.10
 ```
-10. Within the container, start the server in the background. This step both converts the .nemo checkpoint to TRT-LLM and deploys it using TRT-LLM. For an explanation of each argument and advanced usage, please refer to the [NeMo FW Deployment Guide](https://docs.nvidia.com/nemo-framework/user-guide/latest/deployingthenemoframeworkmodel.html)
+9. Within the container, start the server in the background. This step both converts the .nemo checkpoint to TRT-LLM and deploys it using TRT-LLM. For an explanation of each argument and advanced usage, please refer to the [NeMo FW Deployment Guide](https://docs.nvidia.com/nemo-framework/user-guide/latest/deployingthenemoframeworkmodel.html)
 
 ```
 python scripts/deploy/deploy_triton.py --nemo_checkpoint /opt/checkpoints/Llama2-70B-SteerLM-Chat.nemo --model_type="llama" --triton_model_name Llama2-70B-SteerLM-Chat --triton_http_address 0.0.0.0 --triton_port 8000 --num_gpus 2 --max_input_len 3072 --max_output_len 1024 --max_batch_size 1 &
 ```
 
-11. Once the server is ready in 20-45 mins, depending on your computer (i.e. when you see the message below), you are ready to launch your client code
+10. Once the server is ready in 20-45 mins, depending on your computer (i.e. when you see the message below), you are ready to launch your client code
 
 ```
 Started HTTPService at 0.0.0.0:8000
@@ -131,7 +131,7 @@ Pre-requisite: you would need at least a machine with 4 40GB or 2 80GB NVIDIA GP
 output = output[0][0].split("\n<extra_id_1>")[0]
 print(output)
 ```
-12. Prompt formatting for single and multi-turn conversations
+11. Prompt formatting for single and multi-turn conversations
 
 Single Turn
 ```
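
The second hunk stops at the opening fence of the single-turn template, so the template itself is not part of this diff. For orientation, a single-turn SteerLM prompt consistent with the client snippet's `split("\n<extra_id_1>")` would look roughly like the sketch below; the exact system text and attribute values are assumptions, not taken from this diff.

```
<extra_id_0>System

<extra_id_1>User
{prompt}
<extra_id_1>Assistant
<extra_id_2>quality:4,toxicity:0,humor:0,creativity:0,helpfulness:4,correctness:4,coherence:4,complexity:4,verbosity:4
```

For multi-turn conversations, earlier `User`/`Assistant` turns would be concatenated ahead of the final `<extra_id_1>Assistant` marker.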
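
Likewise, the `output = output[0][0]...` context lines in the second hunk come from the README's client code. A minimal sketch of such a client, assuming the `NemoQuery` helper shipped in the nemofw-inference container (the import path and `query_llm()` parameter names such as `max_output_token` are assumptions):

```python
# Sketch of a client for the Triton server started above.
# Assumes the NemoQuery helper from the nemofw-inference container;
# the import path and query_llm() parameters are assumptions.
from nemo.deploy import NemoQuery

nq = NemoQuery(url="localhost:8000", model_name="Llama2-70B-SteerLM-Chat")

# Single-turn SteerLM-formatted prompt (see the template sketch above).
prompt = (
    "<extra_id_0>System\n"
    "\n"
    "<extra_id_1>User\n"
    "Write a poem about GPUs.\n"
    "<extra_id_1>Assistant\n"
    "<extra_id_2>quality:4,toxicity:0,humor:0,creativity:0,"
    "helpfulness:4,correctness:4,coherence:4,complexity:4,verbosity:4\n"
)

output = nq.query_llm(prompts=[prompt], max_output_token=1024, top_k=1, top_p=0.0, temperature=1.0)
# Keep only the first assistant turn, as in the README snippet.
output = output[0][0].split("\n<extra_id_1>")[0]
print(output)
```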