Text Generation
NeMo
English
nvidia
steerlm
llama2
zhilinw commited on
Commit
0de8691
1 Parent(s): 4d1b4ff

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +2 -2
README.md CHANGED
@@ -65,7 +65,7 @@ H100, A100 80GB, A100 40GB
65
 
66
  We demonstrate inference using NVIDIA NeMo Framework, which allows hassle-free model deployment based on [NVIDIA TRT-LLM](https://github.com/NVIDIA/TensorRT-LLM), a highly optimized inference solution focussing on high throughput and low latency.
67
 
68
- Pre-requisite: you would need at least a machine with 4 40GB or 2 80GB NVIDIA GPUs, and 300GB of free disk space.
69
 
70
  1. Please sign up to get **free and immediate** access to [NVIDIA NeMo Framework container](https://developer.nvidia.com/nemo-framework). If you don’t have an NVIDIA NGC account, you will be prompted to sign up for an account before proceeding.
71
  2. If you don’t have an NVIDIA NGC API key, sign into [NVIDIA NGC](https://ngc.nvidia.com/setup), selecting organization/team: ea-bignlp/ga-participants and click Generate API key. Save this key for the next step. Else, skip this step.
@@ -104,7 +104,7 @@ Pre-requisite: you would need at least a machine with 4 40GB or 2 80GB NVIDIA GP
104
  python scripts/deploy/deploy_triton.py --nemo_checkpoint /opt/checkpoints/Llama2-70B-SteerLM-Chat.nemo --model_type="llama" --triton_model_name Llama2-70B-SteerLM-Chat --triton_http_address 0.0.0.0 --triton_port 8000 --num_gpus 2 --max_input_len 3072 --max_output_len 1024 --max_batch_size 1 &
105
  ```
106
 
107
- 9. Once the server is ready in 20-45 mins depending on your computer (i.e. when you see this messages below), you are ready to launch your client code
108
 
109
  ```
110
  Started HTTPService at 0.0.0.0:8000
 
65
 
66
  We demonstrate inference using NVIDIA NeMo Framework, which allows hassle-free model deployment based on [NVIDIA TRT-LLM](https://github.com/NVIDIA/TensorRT-LLM), a highly optimized inference solution focussing on high throughput and low latency.
67
 
68
+ Pre-requisite: You would need at least a machine with 4 40GB or 2 80GB NVIDIA GPUs, and 300GB of free disk space.
69
 
70
  1. Please sign up to get **free and immediate** access to [NVIDIA NeMo Framework container](https://developer.nvidia.com/nemo-framework). If you don’t have an NVIDIA NGC account, you will be prompted to sign up for an account before proceeding.
71
  2. If you don’t have an NVIDIA NGC API key, sign into [NVIDIA NGC](https://ngc.nvidia.com/setup), selecting organization/team: ea-bignlp/ga-participants and click Generate API key. Save this key for the next step. Else, skip this step.
 
104
  python scripts/deploy/deploy_triton.py --nemo_checkpoint /opt/checkpoints/Llama2-70B-SteerLM-Chat.nemo --model_type="llama" --triton_model_name Llama2-70B-SteerLM-Chat --triton_http_address 0.0.0.0 --triton_port 8000 --num_gpus 2 --max_input_len 3072 --max_output_len 1024 --max_batch_size 1 &
105
  ```
106
 
107
+ 9. Once the server is ready (i.e. when you see this messages below), you are ready to launch your client code
108
 
109
  ```
110
  Started HTTPService at 0.0.0.0:8000