Update README.md
README.md
CHANGED
@@ -53,6 +53,15 @@ datasets:
- Built with Meta Llama 3
- Quantized by [Astronomer](https://astronomer.io)

+# Important Note About Serving with vLLM & oobabooga/text-generation-webui
+- For loading this model onto vLLM, make sure all requests have `"stop_token_ids":[128001, 128009]` to temporarily address the non-stop generation issue (a request sketch follows the diff).
+  - vLLM does not yet respect `generation_config.json`.
+  - The vLLM team is working on a fix for this: https://github.com/vllm-project/vllm/issues/4180
+- For oobabooga/text-generation-webui:
+  - Load the model via AutoGPTQ with `no_inject_fused_attention` enabled. This works around a bug in the AutoGPTQ library.
+  - Under `Parameters` -> `Generation` -> `Skip special tokens`: turn this off (deselect).
+  - Under `Parameters` -> `Generation` -> `Custom stopping strings`: add `"<|end_of_text|>","<|eot_id|>"` to the field.
+
<!-- description start -->
## Description
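
For reference, a minimal sketch of the vLLM workaround described in the diff above, sent to vLLM's OpenAI-compatible completions endpoint. The server URL, model path, and prompt are placeholders, not part of the original commit:

```python
import requests

# Placeholder: adjust the URL and model path for your deployment.
VLLM_URL = "http://localhost:8000/v1/completions"

response = requests.post(
    VLLM_URL,
    json={
        "model": "<path-to-this-quantized-model>",
        "prompt": "What is a black hole?",
        "max_tokens": 256,
        # Workaround from the README: stop on Llama 3's end-of-text
        # (128001) and end-of-turn (128009) token ids, since vLLM does
        # not yet read generation_config.json.
        "stop_token_ids": [128001, 128009],
    },
)
print(response.json()["choices"][0]["text"])
```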
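
Similarly, a hedged sketch of what the AutoGPTQ loading step from the oobabooga instructions looks like when calling the library directly: the webui's `no_inject_fused_attention` checkbox corresponds to passing `inject_fused_attention=False`. The model path is a placeholder, and `from_quantized` keyword arguments vary somewhat across auto-gptq versions:

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_path = "<path-to-this-quantized-model>"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(model_path)
# inject_fused_attention=False mirrors the webui's
# `no_inject_fused_attention` option and sidesteps the AutoGPTQ bug
# mentioned in the diff above.
model = AutoGPTQForCausalLM.from_quantized(
    model_path,
    device="cuda:0",
    use_safetensors=True,
    inject_fused_attention=False,
)
```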