TheBloke committed
Commit eb57837
1 Parent(s): 21da534

Update README.md

Files changed (1):
  1. README.md +40 -1
README.md CHANGED
@@ -49,7 +49,9 @@ AWQ is an efficient, accurate and blazing-fast low-bit weight quantization method
 
 These are experimental first AWQs for the brand-new model format, Mistral.
 
-They will not work from vLLM or TGI. They can only be used from AutoAWQ, and they require installing both AutoAWQ and Transformers from Github. More details are below.
+As of September 29th 2023, they are supported by AutoAWQ and vLLM (version 0.2).
+
+Using them from AutoAWQ requires installing both AutoAWQ and Transformers from GitHub. More details are below.
 
 <!-- description end -->
 <!-- repositories-available start -->
@@ -84,6 +86,43 @@ Models are released as sharded safetensors files.
 
 <!-- README_AWQ.md-provided-files end -->
 
+<!-- README_AWQ.md-use-from-vllm start -->
+## Serving this model from vLLM
+
+Make sure you are using vLLM version 0.2.
+
+Documentation on installing and using vLLM [can be found here](https://vllm.readthedocs.io/en/latest/).
+
+When using vLLM as a server, pass the `--quantization awq` parameter, for example:
+
+```shell
+python3 -m vllm.entrypoints.api_server --model TheBloke/Mistral-7B-v0.1-AWQ --quantization awq --dtype float16
+```
+
+When using vLLM from Python code, pass the `quantization="awq"` parameter, for example:
+
+```python
+from vllm import LLM, SamplingParams
+
+prompts = [
+    "Hello, my name is",
+    "The president of the United States is",
+    "The capital of France is",
+    "The future of AI is",
+]
+sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
+
+llm = LLM(model="TheBloke/Mistral-7B-v0.1-AWQ", quantization="awq", dtype="float16")
+
+outputs = llm.generate(prompts, sampling_params)
+
+# Print the outputs.
+for output in outputs:
+    prompt = output.prompt
+    generated_text = output.outputs[0].text
+    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
+```
+<!-- README_AWQ.md-use-from-vllm end -->
 
 <!-- README_AWQ.md-use-from-python start -->
 ## How to use this AWQ model from Python code
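
The updated description above says that using these AWQs from AutoAWQ requires installing both AutoAWQ and Transformers from GitHub, with the full steps further down the README. As a minimal sketch only (the exact repository URLs and branches are assumed here, not taken from this commit), the installs would look something like:

```shell
# Assumed repository URLs: install AutoAWQ and Transformers from their GitHub main branches
pip install git+https://github.com/casper-hansen/AutoAWQ.git
pip install git+https://github.com/huggingface/transformers.git
```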