dreamerdeo committed on
Commit
d490b4d
1 Parent(s): 287691b

Create README.md

Files changed (1)
  1. README.md +121 -0
README.md ADDED
@@ -0,0 +1,121 @@
+ ---
+ language:
+ - en
+ - zh
+ - id
+ - th
+ - vi
+ - ms
+ - lo
+ - my
+ - jv
+ - km
+ - su
+ - tl
+ tags:
+ - multilingual
+ - sea
+ - sailor
+ - sft
+ - chat
+ - instruction
+ widget:
+ - text: 如何制作烤鱼?
+   example_title: Chinese
+ - text: How to bake fish?
+   example_title: English
+ - text: Bagaimana cara memanggang ikan?
+   example_title: Malay
+ - text: วิธีย่างปลา?
+   example_title: Thai
+ - text: Bagaimana membuat bakaran ikan?
+   example_title: Indonesian
+ - text: Làm thế nào để nướng cá?
+   example_title: Vietnamese
+ license: apache-2.0
+ base_model:
+ - Qwen/Qwen2.5-7B
+ ---
+
+ <div align="center">
+   <img src="sailor2_banner.jpg" width="700"/>
+ </div>
+
+ > The logo was generated by MidJourney
+
+ Sailor2 is a community-driven initiative that brings cutting-edge multilingual language models to South-East Asia (SEA).
+ Our research highlights a strong demand for models in the **8B and 20B parameter** range for production use, alongside **1B models** for specialized applications
+ such as speculative decoding and research.
+ These models, released under the **Apache 2.0 license**, make advanced language technologies more accessible across the region.
+
+
+ Sailor2 builds upon the foundation of the awesome multilingual model [Qwen 2.5](https://huggingface.co/collections/Qwen/qwen25-66e81a666513e518adb90d9e) and
+ is continually pre-trained on **500B tokens** to better support **15 languages** with a single unified model.
+ These languages are English, Chinese, Burmese, Cebuano, Ilocano, Indonesian, Javanese, Khmer, Lao, Malay, Sundanese, Tagalog, Thai, Vietnamese, and Waray.
+ By addressing the growing demand for diverse, robust, and accessible language models,
+ Sailor2 seeks to serve underserved communities in SEA with open, inclusive, and accessible multilingual LLMs.
+
+ Refer to the [Sailor2 Website](https://sailorllm.github.io/) for more training details.
+
+ ## Model Summary
+ - **Model Collections:** [Base Model & Chat Model](https://huggingface.co/collections/sail/sailor2-language-models-674d7c9e6b4dbbd9a869906b)
+ - **Project Website:** [sailorllm.github.io](https://sailorllm.github.io/)
+ - **Codebase:** [github.com/sail-sg/sailor2](https://github.com/sail-sg/sailor2)
+ - **Technical Report:** Coming Soon
+
+
+ ## Training details
+
+
+ ## Requirements
+ Sailor2 is supported by recent releases of Hugging Face Transformers; we advise installing `transformers==4.46.3`.
+
+ ### Quickstart
+
+ The following snippet shows how to load the tokenizer and model and how to generate text.
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ # device_map="auto" places the model weights on the available device(s)
+ model = AutoModelForCausalLM.from_pretrained("sail/Sailor2-8B", device_map="auto")
+ tokenizer = AutoTokenizer.from_pretrained("sail/Sailor2-8B")
+
+ # Indonesian prompt; in English: "A language model is a probabilistic model"
+ input_message = "Model bahasa adalah model probabilistik"
+
+ # Move the inputs to the same device as the model instead of hard-coding "cuda"
+ model_inputs = tokenizer([input_message], return_tensors="pt").to(model.device)
+
+ generated_ids = model.generate(
+     model_inputs.input_ids,
+     max_new_tokens=64
+ )
+
+ # Drop the prompt tokens so that only the generated continuation is decoded
+ generated_ids = [
+     output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
+ ]
+
+ response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
+ print(response)
+ ```
+
+ ## License
+
+ Sailor2 is distributed under the terms of the Apache License 2.0.
+ There are no restrictions on research or commercial use.
+
+ ## Citation
+
+ If you find Sailor2 useful, please cite our work as follows:
+
+ ```
+ @misc{sailor2report,
+   title={Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLM},
+   author={Sailor2 Team},
+   year={2024}
+ }
+ ```
+
+ ## Contact Us
+
+ If you have any questions, please raise an issue or contact us at [doulx@sea.com](mailto:doulx@sea.com) or [liuqian.sea@gmail.com](mailto:liuqian.sea@gmail.com).
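
The Requirements section of the README advises a specific `transformers` release. A minimal sketch of how a user might check their environment against that pin before running the Quickstart is given below. It is not part of this commit: the helper names `parse_version`, `installed_version`, and `describe` are hypothetical, and the comparison only looks at the leading major, minor, and patch numbers, so it is a rough check rather than full version-specifier handling.

```python
from importlib import metadata
from itertools import takewhile

# Version advised in the README's Requirements section.
RECOMMENDED = "4.46.3"


def parse_version(version: str) -> tuple:
    """Turn '4.46.3' or '4.47.0.dev0' into a comparable (major, minor, patch) tuple."""
    parts = []
    for piece in version.split(".")[:3]:
        # Keep only the leading digits of each component, e.g. '3rc1' -> 3.
        digits = "".join(takewhile(str.isdigit, piece))
        parts.append(int(digits) if digits else 0)
    # Pad short versions such as '4.46' to three components.
    return tuple(parts + [0] * (3 - len(parts)))


def installed_version(package: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def describe(package: str = "transformers", recommended: str = RECOMMENDED) -> str:
    """Summarise how the installed package compares with the recommended pin."""
    found = installed_version(package)
    if found is None:
        return f"{package} is not installed; try: pip install {package}=={recommended}"
    if parse_version(found) == parse_version(recommended):
        return f"{package} {found} matches the recommended version"
    return f"{package} {found} differs from the recommended {recommended}"


if __name__ == "__main__":
    print(describe())
```

Only the standard library is used, so the check can run before `transformers` itself is imported. A mismatch is reported rather than treated as an error, since the README advises the version but does not state that other releases fail.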