Update README.md

README.md (changed)
outputs = model.generate(inputs, max_length=512, num_return_sequences=1)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### vLLM

1. Install vLLM (https://github.com/vllm-project/vllm), e.g. `pip install vllm`.

2. Run the API server (`num_gpus` is a placeholder for the number of GPUs used for tensor parallelism):
```bash
python -m vllm.entrypoints.api_server --model /path/to/model --tensor-parallel-size num_gpus
```

3. Run inference (cURL example):
```bash
curl --request POST \
--url http://localhost:8000/generate \
--header 'Content-Type: application/json' \
--data '{"prompt": "<s>[INST] <<SYS>>\nYou are a question answering assistant. Answer the question as truthful and helpful as possible คุณคือผู้ช่วยตอบคำถาม จงตอบคำถามอย่างถูกต้องและมีประโยชน์ที่สุด\n<</SYS>>\n\nอยากลดความอ้วนต้องทำอย่างไร [/INST]", "use_beam_search": false, "temperature": 0.1, "max_tokens": 512, "top_p": 0.75, "top_k": 40, "frequency_penalty": 0.3, "stop": "</s>"}'
```
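
The same `/generate` endpoint can also be called from Python. Below is a minimal sketch, not part of the original instructions, that builds the Llama-2 chat prompt and posts the same payload as the cURL example; it assumes the `requests` package is installed and that vLLM's demo `api_server` returns its outputs as a list under the `"text"` key:

```python
# Minimal sketch: call the vLLM demo api_server from Python.
# Assumes `pip install requests`; parameters mirror the cURL example above.
import requests

system = ("You are a question answering assistant. "
          "Answer the question as truthful and helpful as possible "
          "คุณคือผู้ช่วยตอบคำถาม จงตอบคำถามอย่างถูกต้องและมีประโยชน์ที่สุด")
question = "อยากลดความอ้วนต้องทำอย่างไร"  # "How do I lose weight?"

# Llama-2 chat format: <s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]
prompt = f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{question} [/INST]"

payload = {
    "prompt": prompt,
    "use_beam_search": False,
    "temperature": 0.1,
    "max_tokens": 512,
    "top_p": 0.75,
    "top_k": 40,
    "frequency_penalty": 0.3,
    "stop": "</s>",
}
resp = requests.post("http://localhost:8000/generate", json=payload)
resp.raise_for_status()
print(resp.json()["text"][0])  # demo api_server returns a list of texts under "text"
```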
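If a standing server is not needed, vLLM also exposes an offline Python API. The following is a minimal sketch under the same model path and sampling settings; `LLM` and `SamplingParams` are vLLM's documented entry points, and the prompt is abbreviated here:

```python
# Minimal sketch: offline (serverless) generation with vLLM's Python API.
from vllm import LLM, SamplingParams

llm = LLM(model="/path/to/model", tensor_parallel_size=1)  # set to your GPU count
params = SamplingParams(
    temperature=0.1, top_p=0.75, top_k=40,
    max_tokens=512, frequency_penalty=0.3, stop=["</s>"],
)
# Same Llama-2 chat-formatted prompt as in the server examples (abbreviated).
outputs = llm.generate(["<s>[INST] <<SYS>>\n...\n<</SYS>>\n\nอยากลดความอ้วนต้องทำอย่างไร [/INST]"], params)
print(outputs[0].outputs[0].text)
```
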
### LlamaCPP (for GGUF)

1. Build and install llama.cpp (`LLAMA_CUBLAS=1` enables CUDA GPU inference):
```bash
git clone https://github.com/ggerganov/llama.cpp.git \
    && cd llama.cpp \
    && make -j LLAMA_CUBLAS=1 CUDA_DOCKER_ARCH=all
```

2. Run the server (`-c` sets the context length, `-ngl` the number of layers to offload to the GPU, and `-ts` the tensor split across GPUs; `--port 8000` matches the cURL example below):
```bash
./server -m /path/to/ggml-model-f16.gguf -c 3072 -ngl 81 -ts 1,1 --host 0.0.0.0 --port 8000
```

3. Run inference (cURL example):
```bash
curl --location 'http://localhost:8000/completion' \
--header 'Content-Type: application/json' \
--data '{
    "prompt": "<s>[INST] <<SYS>>\nYou are a question answering assistant. Answer the question as truthful and helpful as possible คุณคือผู้ช่วยตอบคำถาม จงตอบคำถามอย่างถูกต้องและมีประโยชน์ที่สุด\n<</SYS>>\n\nอยากลดความอ้วนต้องทำอย่างไร [/INST]",
    "n_predict": 512,
    "stop": ["</s>"]
}'
```
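
As with vLLM, the endpoint can be scripted from Python. Note that llama.cpp's `/completion` endpoint names the token limit `n_predict`, takes `stop` as a list of strings, and returns the generated text under the `"content"` key. A minimal sketch, again assuming the `requests` package:

```python
# Minimal sketch: call the llama.cpp server's /completion endpoint from Python.
import requests

payload = {
    # Same Llama-2 chat-formatted prompt as in the cURL example above.
    "prompt": ("<s>[INST] <<SYS>>\n"
               "You are a question answering assistant. "
               "Answer the question as truthful and helpful as possible "
               "คุณคือผู้ช่วยตอบคำถาม จงตอบคำถามอย่างถูกต้องและมีประโยชน์ที่สุด\n"
               "<</SYS>>\n\n"
               "อยากลดความอ้วนต้องทำอย่างไร [/INST]"),
    "n_predict": 512,   # llama.cpp's name for the max-new-tokens limit
    "stop": ["</s>"],   # list of stop strings
}
resp = requests.post("http://localhost:8000/completion", json=payload)
resp.raise_for_status()
print(resp.json()["content"])  # generated text is returned under "content"
```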
### Authors