chienweichang committed
Commit ea22264
Parent(s): e00e655

set max-model-len for colab T4
README.md CHANGED
@@ -47,7 +47,11 @@ Documentation on installing and using vLLM [can be found here](https://vllm.read
 For example:
 
 ```shell
-python3 -m vllm.entrypoints.api_server
+python3 -m vllm.entrypoints.api_server \
+    --model chienweichang/Breeze-7B-Instruct-64k-v0_1-AWQ \
+    --quantization awq \
+    --max-model-len 2048 \
+    --dtype auto
 ```
 
 - When using vLLM from Python code, again set `quantization=awq`.
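Once the server from the first hunk is running, it can be queried over HTTP. The sketch below is not part of the README; it assumes vLLM's demo `/generate` endpoint on the default port 8000, and the example prompt is a placeholder.

```python
# Minimal sketch (assumption, not from the README): query the running
# vllm.entrypoints.api_server instance. The /generate endpoint and its
# JSON fields follow vLLM's demo API server and may differ between versions.
import requests

payload = {
    "prompt": "[INST] Tell me about AI [/INST]",  # placeholder prompt
    "max_tokens": 256,
    "temperature": 0.8,
    "top_p": 0.95,
}
response = requests.post("http://localhost:8000/generate", json=payload)
print(response.json())  # expected shape: {"text": [...]}
```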
@@ -65,7 +69,7 @@ prompt_template='''[INST] {prompt} [/INST]
 '''
 prompts = [prompt_template.format(prompt=prompt) for prompt in prompts]
 sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
-llm = LLM(model="chienweichang/Breeze-7B-Instruct-64k-v0_1-AWQ", quantization="awq", dtype="half", max_model_len=
+llm = LLM(model="chienweichang/Breeze-7B-Instruct-64k-v0_1-AWQ", quantization="awq", dtype="half", max_model_len=2048)
 outputs = llm.generate(prompts, sampling_params)
 # Print the outputs.
 for output in outputs:
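For reference, the Python snippet touched by the second hunk, written out as a self-contained script. The imports and the `LLM(...)` line match what the README shows; the example prompts and the body of the final loop are assumptions added here for completeness.

```python
from vllm import LLM, SamplingParams

# Placeholder prompts; the README's actual prompt list is not shown in this diff.
prompts = ["Tell me about AI", "Write a story about llamas"]

prompt_template = '''[INST] {prompt} [/INST]
'''

prompts = [prompt_template.format(prompt=prompt) for prompt in prompts]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# max_model_len=2048 is what this commit sets so the model fits a Colab T4 GPU.
llm = LLM(model="chienweichang/Breeze-7B-Instruct-64k-v0_1-AWQ",
          quantization="awq", dtype="half", max_model_len=2048)

outputs = llm.generate(prompts, sampling_params)

# Print the outputs (loop body assumed; the diff only shows the loop header).
for output in outputs:
    print(f"Prompt: {output.prompt!r}")
    print(f"Generated: {output.outputs[0].text!r}")
```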