jayr014 committed on
Commit e96382c
1 Parent(s): f115b64

changing order of int8 and bf16

Files changed (1)
1. README.md +4 -4
README.md CHANGED
@@ -127,14 +127,14 @@ def main() -> None:
  modified_input_text = f"<human>: {input_text}\n<bot>:"
  ```

- Running command for int8 (sub optimal performance, but fast inference time):
- ```
- python -m inference_server.cli --model_name sambanovasystems/BLOOMChat-176B-v1 --model_class AutoModelForCausalLM --dtype int8 --deployment_framework hf_accelerate --generate_kwargs '{"do_sample": false, "temperature": 0.8, "repetition_penalty": 1.2, "top_p": 0.9, "max_new_tokens": 512}'
- ```
  Running command for bf16
  ```
  python -m inference_server.cli --model_name sambanovasystems/BLOOMChat-176B-v1 --model_class AutoModelForCausalLM --dtype bf16 --deployment_framework hf_accelerate --generate_kwargs '{"do_sample": false, "temperature": 0.8, "repetition_penalty": 1.2, "top_p": 0.9, "max_new_tokens": 512}'
  ```
+ Running command for int8 (sub optimal performance, but fast inference time):
+ ```
+ python -m inference_server.cli --model_name sambanovasystems/BLOOMChat-176B-v1 --model_class AutoModelForCausalLM --dtype int8 --deployment_framework hf_accelerate --generate_kwargs '{"do_sample": false, "temperature": 0.8, "repetition_penalty": 1.2, "top_p": 0.9, "max_new_tokens": 512}'
+ ```
  **DISCLAIMER:** When using int8, the results will be subpar compared to bf16 as the model is being [quantized](https://huggingface.co/blog/hf-bitsandbytes-integration#introduction-to-model-quantization).

  ### Suggested Inference Parameters
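
For context on the int8 path referenced in this change: the `--dtype int8` flag relies on 8-bit quantization via `transformers` and `bitsandbytes`, as described in the linked blog post. Below is a minimal sketch, not part of this commit or the README, of loading and querying the model in 8-bit directly; the model name and generation parameters mirror the commands above, but the exact code path used by `inference_server` may differ.

```
# Sketch only: 8-bit loading with transformers + bitsandbytes (assumption, not the
# inference_server implementation). Requires `pip install transformers accelerate bitsandbytes`.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "sambanovasystems/BLOOMChat-176B-v1"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",   # spread layers across the available GPUs
    load_in_8bit=True,   # int8 quantization via bitsandbytes
)

# Same prompt format as in the README's main() snippet.
prompt = "<human>: Give me a recipe for vegetable soup.\n<bot>:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generation kwargs taken from the commands above (greedy decoding, so
# temperature/top_p are omitted here).
outputs = model.generate(
    **inputs,
    do_sample=False,
    repetition_penalty=1.2,
    max_new_tokens=512,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```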