Update README.md

README.md (changed)
outputs = model.generate(inputs, max_length=512, num_return_sequences=1)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### vLLM

1. Install vLLM (https://github.com/vllm-project/vllm), e.g. `pip install vllm`.

2. Run the API server (`num_gpus` is a placeholder for the number of GPUs used for tensor parallelism):
```bash
python -m vllm.entrypoints.api_server --model /path/to/model --tensor-parallel-size num_gpus
```

3. Run inference (cURL example):
```bash
curl --request POST \
--url http://localhost:8000/generate \
--header 'Content-Type: application/json' \
--data '{"prompt": "<s>[INST] <<SYS>>\nYou are a question answering assistant. Answer the question as truthful and helpful as possible คุณคือผู้ช่วยตอบคำถาม จงตอบคำถามอย่างถูกต้องและมีประโยชน์ที่สุด\n<</SYS>>\n\nอยากลดความอ้วนต้องทำอย่างไร [/INST]", "use_beam_search": false, "temperature": 0.1, "max_tokens": 512, "top_p": 0.75, "top_k": 40, "frequency_penalty": 0.3, "stop": "</s>"}'
```
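
The same `/generate` endpoint can also be called from Python. Below is a minimal sketch, not part of the original instructions, that builds the Llama-2 chat prompt and posts the same payload as the cURL example; it assumes the `requests` package is installed and that vLLM's demo `api_server` returns its outputs as a list under the `"text"` key:

```python
# Minimal sketch: call the vLLM demo api_server from Python.
# Assumes `pip install requests`; parameters mirror the cURL example above.
import requests

system = ("You are a question answering assistant. "
          "Answer the question as truthful and helpful as possible "
          "คุณคือผู้ช่วยตอบคำถาม จงตอบคำถามอย่างถูกต้องและมีประโยชน์ที่สุด")
question = "อยากลดความอ้วนต้องทำอย่างไร"  # "How do I lose weight?"

# Llama-2 chat format: <s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]
prompt = f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{question} [/INST]"

payload = {
    "prompt": prompt,
    "use_beam_search": False,
    "temperature": 0.1,
    "max_tokens": 512,
    "top_p": 0.75,
    "top_k": 40,
    "frequency_penalty": 0.3,
    "stop": "</s>",
}
resp = requests.post("http://localhost:8000/generate", json=payload)
resp.raise_for_status()
print(resp.json()["text"][0])  # demo api_server returns a list of texts under "text"
```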
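If a standing server is not needed, vLLM also exposes an offline Python API. The following is a minimal sketch under the same model path and sampling settings; `LLM` and `SamplingParams` are vLLM's documented entry points, and the prompt is abbreviated here:

```python
# Minimal sketch: offline (serverless) generation with vLLM's Python API.
from vllm import LLM, SamplingParams

llm = LLM(model="/path/to/model", tensor_parallel_size=1)  # set to your GPU count
params = SamplingParams(
    temperature=0.1, top_p=0.75, top_k=40,
    max_tokens=512, frequency_penalty=0.3, stop=["</s>"],
)
# Same Llama-2 chat-formatted prompt as in the server examples (abbreviated).
outputs = llm.generate(["<s>[INST] <<SYS>>\n...\n<</SYS>>\n\nอยากลดความอ้วนต้องทำอย่างไร [/INST]"], params)
print(outputs[0].outputs[0].text)
```
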
### LlamaCPP (for GGUF)

1. Build and install llama.cpp (`LLAMA_CUBLAS=1` enables CUDA GPU inference):
```bash
git clone https://github.com/ggerganov/llama.cpp.git \
    && cd llama.cpp \
    && make -j LLAMA_CUBLAS=1 CUDA_DOCKER_ARCH=all
```

2. Run the server (`-c` sets the context length, `-ngl` the number of layers to offload to the GPU, and `-ts` the tensor split across GPUs; `--port 8000` matches the cURL example below):
```bash
./server -m /path/to/ggml-model-f16.gguf -c 3072 -ngl 81 -ts 1,1 --host 0.0.0.0 --port 8000
```

3. Run inference (cURL example):
```bash
curl --location 'http://localhost:8000/completion' \
--header 'Content-Type: application/json' \
--data '{
    "prompt": "<s>[INST] <<SYS>>\nYou are a question answering assistant. Answer the question as truthful and helpful as possible คุณคือผู้ช่วยตอบคำถาม จงตอบคำถามอย่างถูกต้องและมีประโยชน์ที่สุด\n<</SYS>>\n\nอยากลดความอ้วนต้องทำอย่างไร [/INST]",
    "n_predict": 512,
    "stop": ["</s>"]
}'
```
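
As with vLLM, the endpoint can be scripted from Python. Note that llama.cpp's `/completion` endpoint names the token limit `n_predict`, takes `stop` as a list of strings, and returns the generated text under the `"content"` key. A minimal sketch, again assuming the `requests` package:

```python
# Minimal sketch: call the llama.cpp server's /completion endpoint from Python.
import requests

payload = {
    # Same Llama-2 chat-formatted prompt as in the cURL example above.
    "prompt": ("<s>[INST] <<SYS>>\n"
               "You are a question answering assistant. "
               "Answer the question as truthful and helpful as possible "
               "คุณคือผู้ช่วยตอบคำถาม จงตอบคำถามอย่างถูกต้องและมีประโยชน์ที่สุด\n"
               "<</SYS>>\n\n"
               "อยากลดความอ้วนต้องทำอย่างไร [/INST]"),
    "n_predict": 512,   # llama.cpp's name for the max-new-tokens limit
    "stop": ["</s>"],   # list of stop strings
}
resp = requests.post("http://localhost:8000/completion", json=payload)
resp.raise_for_status()
print(resp.json()["content"])  # generated text is returned under "content"
```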
### Authors