dreamerdeo committed • Commit 6329c42 • Parent(s): 13e9069

Update README.md

README.md CHANGED
tags:
- sft
- chat
- instruction
- gguf
license: apache-2.0
base_model: sail/Sailor-4B
---
The pre-training corpus heavily leverages publicly available corpora, including
[SlimPajama](https://huggingface.co/datasets/cerebras/SlimPajama-627B),
[SkyPile](https://huggingface.co/datasets/Skywork/SkyPile-150B),
[CC100](https://huggingface.co/datasets/cc100) and [MADLAD-400](https://huggingface.co/datasets/allenai/MADLAD-400).
The instruction tuning corpus is likewise entirely public, including
[aya_collection](https://huggingface.co/datasets/CohereForAI/aya_collection),
[aya_dataset](https://huggingface.co/datasets/CohereForAI/aya_dataset),
[OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca).
Through systematic experiments to determine the weights of different languages, this approach boosts performance on SEA languages while maintaining proficiency in English and Chinese without significant compromise.
Finally, we continually pre-train the Qwen1.5-0.5B model on 400 billion tokens, and the other models on 200 billion tokens, to obtain the Sailor models.
### GGUF model list

| Name | Quant method | Bits | Size | Use case |
| ---- | ------------ | ---- | ---- | -------- |
| [ggml-model-Q2_K.gguf](https://huggingface.co/sail/Sailor-4B-Chat-gguf/blob/main/ggml-model-Q2_K.gguf) | Q2_K | 2 | 1.62 GB | smallest, significant quality loss ❗️ not recommended for most purposes |
| [ggml-model-Q3_K_L.gguf](https://huggingface.co/sail/Sailor-4B-Chat-gguf/blob/main/ggml-model-Q3_K_L.gguf) | Q3_K_L | 3 | 2.17 GB | small, substantial quality loss |
| [ggml-model-Q3_K_M.gguf](https://huggingface.co/sail/Sailor-4B-Chat-gguf/blob/main/ggml-model-Q3_K_M.gguf) | Q3_K_M | 3 | 2.03 GB | very small, balanced quality |
| [ggml-model-Q3_K_S.gguf](https://huggingface.co/sail/Sailor-4B-Chat-gguf/blob/main/ggml-model-Q3_K_S.gguf) | Q3_K_S | 3 | 1.86 GB | very small, high quality loss |
| [ggml-model-Q4_K_M.gguf](https://huggingface.co/sail/Sailor-4B-Chat-gguf/blob/main/ggml-model-Q4_K_M.gguf) | Q4_K_M | 4 | 2.46 GB | medium, balanced quality |
| [ggml-model-Q4_K_S.gguf](https://huggingface.co/sail/Sailor-4B-Chat-gguf/blob/main/ggml-model-Q4_K_S.gguf) | Q4_K_S | 4 | 2.34 GB | small, greater quality loss |
| [ggml-model-Q5_K_M.gguf](https://huggingface.co/sail/Sailor-4B-Chat-gguf/blob/main/ggml-model-Q5_K_M.gguf) | Q5_K_M | 5 | 2.84 GB | large, balanced quality |
| [ggml-model-Q5_K_S.gguf](https://huggingface.co/sail/Sailor-4B-Chat-gguf/blob/main/ggml-model-Q5_K_S.gguf) | Q5_K_S | 5 | 2.78 GB | medium, very low quality loss |
| [ggml-model-Q6_K.gguf](https://huggingface.co/sail/Sailor-4B-Chat-gguf/blob/main/ggml-model-Q6_K.gguf) | Q6_K | 6 | 3.25 GB | very large, extremely low quality loss |
| [ggml-model-Q8_0.gguf](https://huggingface.co/sail/Sailor-4B-Chat-gguf/blob/main/ggml-model-Q8_0.gguf) | Q8_0 | 8 | 4.2 GB | very large, extremely low quality loss |
| [ggml-model-f16.gguf](https://huggingface.co/sail/Sailor-4B-Chat-gguf/blob/main/ggml-model-f16.gguf) | f16 | 16 | 7.91 GB | original size, no quality loss |
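Each file above can be fetched individually instead of cloning the whole repository. A minimal sketch using the `huggingface_hub` client (the Q4_K_M file is chosen here only as an example):

```python
# Sketch: download a single GGUF file (requires `pip install huggingface_hub`)
from huggingface_hub import hf_hub_download

local_path = hf_hub_download(
    repo_id="sail/Sailor-4B-Chat-gguf",
    filename="ggml-model-Q4_K_M.gguf",  # swap in any quant from the table above
)
print(local_path)  # local path to pass to llama.cpp / llama-cpp-python
```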
### How to run with `llama.cpp`

```shell
# install llama.cpp
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
make
pip install -r requirements.txt

# generate with llama.cpp ("Cara memanggang ikan?" is Indonesian for "How do you grill fish?")
./main -ngl 40 -m ggml-model-Q4_K_M.gguf -p "<|im_start|>question\nCara memanggang ikan?\n<|im_start|>answer\n" --temp 0.7 --repeat_penalty 1.1 -n 400 -e
```

> Change `-ngl 40` to the number of layers to offload to GPU. Remove it if you don't have GPU acceleration.
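For an interactive endpoint rather than one-shot generation, llama.cpp also ships an HTTP server. A minimal sketch (binary name and flags vary across llama.cpp versions, so check `--help` for your build):

```shell
# Sketch: serve the model over HTTP with llama.cpp's bundled server
./server -m ggml-model-Q4_K_M.gguf -ngl 40 -c 2048 --host 127.0.0.1 --port 8080
```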
### How to run with `llama-cpp-python`

```shell
pip install llama-cpp-python
```
```python
import llama_cpp
import llama_cpp.llama_tokenizer

# load the GGUF weights, attaching the original HF tokenizer for correct tokenization
llama = llama_cpp.Llama.from_pretrained(
    repo_id="sail/Sailor-4B-Chat-gguf",
    filename="ggml-model-Q4_K_M.gguf",
    tokenizer=llama_cpp.llama_tokenizer.LlamaHFTokenizer.from_pretrained("sail/Sailor-4B-Chat"),
    n_gpu_layers=40,  # layers to offload to GPU; remove without GPU acceleration
    n_threads=8,
    verbose=False,
)

# Sailor's chat template uses 'question'/'answer' in place of 'user'/'assistant'
system_role = 'system'
user_role = 'question'
assistant_role = 'answer'

system_prompt = \
    'You are an AI assistant named Sailor created by Sea AI Lab. ' \
    'Your answer should be friendly, unbiased, faithful, informative and detailed.'
system_prompt = f"<|im_start|>{system_role}\n{system_prompt}<|im_end|>"

# inference example ("Cara memanggang ikan?" is Indonesian for "How do you grill fish?")
output = llama(
    system_prompt + '\n' + f"<|im_start|>{user_role}\nCara memanggang ikan?\n<|im_start|>{assistant_role}\n",
    max_tokens=256,
    temperature=0.7,
    top_p=0.75,
    top_k=60,
    stop=["<|im_end|>", "<|endoftext|>"],
)

print(output['choices'][0]['text'])
```
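Since an HF tokenizer is attached above, the same `Llama` object can also be driven through llama-cpp-python's higher-level chat API, which applies the tokenizer's chat template. A minimal sketch (verify that the template maps onto Sailor's `question`/`answer` roles for your tokenizer version):

```python
# Sketch: the same model via the chat-completion API (reuses `llama` from above)
response = llama.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are an AI assistant named Sailor created by Sea AI Lab."},
        {"role": "user", "content": "Cara memanggang ikan?"},
    ],
    max_tokens=256,
    temperature=0.7,
)
print(response["choices"][0]["message"]["content"])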
### How to build demo

Install `llama-cpp-python` and `gradio`, then run the [demo script](https://github.com/sail-sg/sailor-llm/blob/main/demo/llamacpp_demo.py).
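As a rough sketch of the same idea (hypothetical code, not the contents of the linked script), a chat UI can be wired up in a few lines, reusing the `llama` object loaded above:

```python
# Sketch: minimal gradio chat demo (requires `pip install gradio`); not the linked script
import gradio as gr

def chat_fn(message, history):
    # Sailor's chat format: 'question' / 'answer' roles with <|im_start|> markers
    prompt = f"<|im_start|>question\n{message}\n<|im_start|>answer\n"
    out = llama(prompt, max_tokens=256, stop=["<|im_end|>", "<|endoftext|>"])
    return out["choices"][0]["text"]

gr.ChatInterface(chat_fn).launch()
```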
# License