TheBloke committed
Commit ed78723
1 Parent(s): 9ad39a2

Update README.md

Files changed (1): README.md (+2 −2)
README.md CHANGED
@@ -841,7 +841,7 @@ print(pipe(prompt_template)[0]['generated_text'])
 
  This will work with AutoGPTQ. It is untested with GPTQ-for-LLaMa. It will *not* work with ExLlama.
 
- It was created with group_size none (-1) to reduce VRAM usage, and with --act-order (desc_act) to increase inference speed.
+ It was created with group_size none (-1) to reduce VRAM usage, and with --act-order (desc_act) to improve accuracy of responses.
 
  * `gptq_model-4bit-128g.safetensors`
    * Works with AutoGPTQ in CUDA or Triton modes.
@@ -856,7 +856,7 @@ It was created with group_size none (-1) to reduce VRAM usage, and with --act-or
 
  This will work with AutoGPTQ. It is untested with GPTQ-for-LLaMa. It will *not* work with ExLlama.
 
- It was created with both group_size 128g and --act-order (desc_act) for increased inference quality.
+ It was created with both group_size 128g and --act-order (desc_act) for even higher inference accuracy, at the cost of increased VRAM usage. Because we already need 2 x 80GB or 3 x 48GB GPUs, I don't expect the increased VRAM usage to change the GPU requirements.
 
  **Note** Using group_size + desc_act together can significantly lower performance in AutoGPTQ CUDA. You might want to try AutoGPTQ Triton mode instead (Linux only.)
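
For reference, here is a minimal sketch (not part of this commit) of how the group_size 128 + desc_act file described above might be loaded with AutoGPTQ, using Triton mode as the note suggests. The repo ID below is a placeholder, and `device_map="auto"` is assumed to shard the model across the GPUs it requires.

```python
# Minimal sketch, not from the README diff itself.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_id = "TheBloke/<this-repo>-GPTQ"    # placeholder: substitute the actual repo ID
model_basename = "gptq_model-4bit-128g"   # the group_size 128 + desc_act file

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)

model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    model_basename=model_basename,
    use_safetensors=True,
    use_triton=True,      # Triton mode (Linux only) avoids the CUDA slow-down with group_size + desc_act
    device_map="auto",    # a model this size must be sharded across multiple GPUs
)

prompt = "Tell me about AI"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0]))
```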