dreamerdeo committed on
Commit 6329c42
1 Parent(s): 13e9069

Update README.md

Files changed (1)
  1. README.md +68 -24
README.md CHANGED
@@ -22,6 +22,7 @@ tags:
  - sft
  - chat
  - instruction
+ - gguf
  license: apache-2.0
  base_model: sail/Sailor-4B
  ---
@@ -51,7 +52,7 @@ The pre-training corpus heavily leverages the publicly available corpus, including
  [SlimPajama](https://huggingface.co/datasets/cerebras/SlimPajama-627B),
  [SkyPile](https://huggingface.co/datasets/Skywork/SkyPile-150B),
  [CC100](https://huggingface.co/datasets/cc100) and [MADLAD-400](https://huggingface.co/datasets/allenai/MADLAD-400).
- The instruction tuning corpus are all public available including
+ The instruction tuning corpora are all publicly available, including
  [aya_collection](https://huggingface.co/datasets/CohereForAI/aya_collection),
  [aya_dataset](https://huggingface.co/datasets/CohereForAI/aya_dataset),
  [OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca).
@@ -61,37 +62,80 @@ Through systematic experiments to determine the weights of different languages,
  The approach boosts their performance on SEA languages while maintaining proficiency in English and Chinese without significant compromise.
  Finally, we continually pre-train the Qwen1.5-0.5B model with 400 Billion tokens, and other models with 200 Billion tokens to obtain the Sailor models.

- ## Requirements
- The code of Sailor has been in the latest Hugging face transformers and we advise you to install `transformers>=4.37.0`.
-
- ## Quickstart
-
- Here provides a code snippet to show you how to load the tokenizer and model and how to generate contents.
-
- ```python
- from transformers import AutoModelForCausalLM, AutoTokenizer
- device = "cuda" # the device to load the model
-
- model = AutoModelForCausalLM.from_pretrained("sail/Sailor-7B", device_map="auto")
- tokenizer = AutoTokenizer.from_pretrained("sail/Sailor-7B")
-
- input_message = "Model bahasa adalah model probabilistik"
- ### The given Indonesian input translates to 'A language model is a probabilistic model of.'
-
- model_inputs = tokenizer([input_message], return_tensors="pt").to(device)
-
- generated_ids = model.generate(
-     model_inputs.input_ids,
-     max_new_tokens=64
- )
-
- generated_ids = [
-     output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
- ]
-
- response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
- print(response)
- ```
 
+ ### GGUF model list
+ | Name | Quant method | Bits | Size | Use case |
+ | ---- | ------------ | ---- | ------- | -------- |
+ | [ggml-model-Q2_K.gguf](https://huggingface.co/sail/Sailor-4B-Chat-gguf/blob/main/ggml-model-Q2_K.gguf) | Q2_K | 2 | 1.62 GB | smallest, significant quality loss ❗️ not recommended for most purposes |
+ | [ggml-model-Q3_K_L.gguf](https://huggingface.co/sail/Sailor-4B-Chat-gguf/blob/main/ggml-model-Q3_K_L.gguf) | Q3_K_L | 3 | 2.17 GB | small, substantial quality loss |
+ | [ggml-model-Q3_K_M.gguf](https://huggingface.co/sail/Sailor-4B-Chat-gguf/blob/main/ggml-model-Q3_K_M.gguf) | Q3_K_M | 3 | 2.03 GB | very small, balanced quality |
+ | [ggml-model-Q3_K_S.gguf](https://huggingface.co/sail/Sailor-4B-Chat-gguf/blob/main/ggml-model-Q3_K_S.gguf) | Q3_K_S | 3 | 1.86 GB | very small, high quality loss |
+ | [ggml-model-Q4_K_M.gguf](https://huggingface.co/sail/Sailor-4B-Chat-gguf/blob/main/ggml-model-Q4_K_M.gguf) | Q4_K_M | 4 | 2.46 GB | medium, balanced quality |
+ | [ggml-model-Q4_K_S.gguf](https://huggingface.co/sail/Sailor-4B-Chat-gguf/blob/main/ggml-model-Q4_K_S.gguf) | Q4_K_S | 4 | 2.34 GB | small, greater quality loss |
+ | [ggml-model-Q5_K_M.gguf](https://huggingface.co/sail/Sailor-4B-Chat-gguf/blob/main/ggml-model-Q5_K_M.gguf) | Q5_K_M | 5 | 2.84 GB | large, balanced quality |
+ | [ggml-model-Q5_K_S.gguf](https://huggingface.co/sail/Sailor-4B-Chat-gguf/blob/main/ggml-model-Q5_K_S.gguf) | Q5_K_S | 5 | 2.78 GB | medium, very low quality loss |
+ | [ggml-model-Q6_K.gguf](https://huggingface.co/sail/Sailor-4B-Chat-gguf/blob/main/ggml-model-Q6_K.gguf) | Q6_K | 6 | 3.25 GB | very large, extremely low quality loss |
+ | [ggml-model-Q8_0.gguf](https://huggingface.co/sail/Sailor-4B-Chat-gguf/blob/main/ggml-model-Q8_0.gguf) | Q8_0 | 8 | 4.2 GB | very large, extremely low quality loss |
+ | [ggml-model-f16.gguf](https://huggingface.co/sail/Sailor-4B-Chat-gguf/blob/main/ggml-model-f16.gguf) | f16 | 16 | 7.91 GB | original size, no quality loss |
+
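If you only need one of the files above, it can be fetched programmatically; the snippet below is a minimal sketch using the `huggingface_hub` library (the Q4_K_M file is just an example pick, not a recommendation from the model card):

```python
# Minimal sketch: download a single quantized GGUF file from the repo.
# Requires `pip install huggingface_hub`; pick any filename from the table above.
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="sail/Sailor-4B-Chat-gguf",
    filename="ggml-model-Q4_K_M.gguf",
)
print(model_path)  # local path of the cached .gguf file, usable with llama.cpp below
```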
+ ### How to run with `llama.cpp`
+
+ ```shell
+ # install llama.cpp
+ git clone https://github.com/ggerganov/llama.cpp.git
+ cd llama.cpp
+ make
+ pip install -r requirements.txt
+
+ # generate with llama.cpp (the prompt is Indonesian for "How to grill fish?")
+ ./main -ngl 40 -m ggml-model-Q4_K_M.gguf -p "<|im_start|>question\nCara memanggang ikan?\n<|im_start|>answer\n" --temp 0.7 --repeat_penalty 1.1 -n 400 -e
+ ```

+ > Change `-ngl 40` to the number of layers to offload to GPU. Remove it if you don't have GPU acceleration.
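The string passed to `-p` follows Sailor's chat format: turns are tagged with `<|im_start|>` and the role names `question` / `answer` (plus `system` for an optional system message, as in the Python example below). A small helper like this sketch (illustrative, not part of the model card) assembles such a prompt:

```python
# Sketch: build a prompt in the same chat format used by the examples in this card.
def build_sailor_prompt(question, system_prompt=None):
    parts = []
    if system_prompt is not None:
        # system turn is closed with <|im_end|>, matching the Python example below
        parts.append(f"<|im_start|>system\n{system_prompt}<|im_end|>")
    # user turn, followed by an opened assistant turn for the model to complete
    parts.append(f"<|im_start|>question\n{question}\n<|im_start|>answer\n")
    return "\n".join(parts)

# Reproduces the prompt used in the ./main command above:
print(build_sailor_prompt("Cara memanggang ikan?"))
```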
 
+ ### How to run with `llama-cpp-python`
+
+ ```shell
+ pip install llama-cpp-python
+ ```
+
+ ```python
+ import llama_cpp
+ import llama_cpp.llama_tokenizer
+
+ # load model
+ llama = llama_cpp.Llama.from_pretrained(
+     repo_id="sail/Sailor-4B-Chat-gguf",
+     filename="ggml-model-Q4_K_M.gguf",
+     tokenizer=llama_cpp.llama_tokenizer.LlamaHFTokenizer.from_pretrained("sail/Sailor-4B-Chat"),
+     n_gpu_layers=40,
+     n_threads=8,
+     verbose=False,
+ )
+
+ system_role = 'system'
+ user_role = 'question'
+ assistant_role = 'answer'
+
+ system_prompt = \
+ 'You are an AI assistant named Sailor created by Sea AI Lab. \
+ Your answer should be friendly, unbiased, faithful, informative and detailed.'
+ system_prompt = f"<|im_start|>{system_role}\n{system_prompt}<|im_end|>"
+
+ # inference example
+ output = llama(
+     system_prompt + '\n' + f"<|im_start|>{user_role}\nCara memanggang ikan?\n<|im_start|>{assistant_role}\n",
+     max_tokens=256,
+     temperature=0.7,
+     top_p=0.75,
+     top_k=60,
+     stop=["<|im_end|>", "<|endoftext|>"]
+ )
+
+ print(output['choices'][0]['text'])
+ ```
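For interactive use, the same call can stream tokens as they are generated; a minimal sketch, reusing the `llama` object, prompt pieces, and sampling settings from the example above (only `stream=True` is new):

```python
# Sketch: stream the completion chunk by chunk instead of waiting for the full text.
stream = llama(
    system_prompt + '\n' + f"<|im_start|>{user_role}\nCara memanggang ikan?\n<|im_start|>{assistant_role}\n",
    max_tokens=256,
    temperature=0.7,
    top_p=0.75,
    top_k=60,
    stop=["<|im_end|>", "<|endoftext|>"],
    stream=True,  # yields partial completion chunks
)
for chunk in stream:
    print(chunk['choices'][0]['text'], end='', flush=True)
print()
```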
+ ### How to build demo
+
+ Install `llama-cpp-python` and `gradio`, then run the demo [script](https://github.com/sail-sg/sailor-llm/blob/main/demo/llamacpp_demo.py).
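As a rough idea of what such a demo can look like, here is a minimal sketch built on `gradio.ChatInterface` and the `llama` object loaded above; the settings are illustrative and the linked script may differ:

```python
# Minimal Gradio chat demo sketch around the llama-cpp-python model loaded above.
import gradio as gr

def chat_fn(message, history):
    # history is ignored in this simple single-turn sketch
    prompt = f"<|im_start|>question\n{message}\n<|im_start|>answer\n"
    output = llama(
        prompt,
        max_tokens=256,
        temperature=0.7,
        stop=["<|im_end|>", "<|endoftext|>"],
    )
    return output['choices'][0]['text']

gr.ChatInterface(chat_fn, title="Sailor-4B-Chat (GGUF)").launch()
```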

  # License