---
language:
- en
- zh
- id
- th
- vi
- ms
- lo
- my
- jv
- km
- su
- tl
tags:
- multilingual
- sea
- sailor
- sft
- chat
- instruction
widget:
- text: 如何制作烤鱼?
  example_title: Chinese
- text: How to bake fish?
  example_title: English
- text: Bagaimana cara memanggang ikan?
  example_title: Malay
- text: วิธีย่างปลา?
  example_title: Thai
- text: Bagaimana membuat bakaran ikan?
  example_title: Indonesian
- text: Làm thế nào để nướng cá?
  example_title: Vietnamese
license: apache-2.0
base_model:
- sail/Sailor2-20B
---

<div align="center">
  <img src="sailor2_banner.jpg" width="700"/>
</div>

> The logo was generated by MidJourney

Sailor2 is a community-driven initiative that brings cutting-edge multilingual language models to South-East Asia (SEA).
Our research highlights a strong demand for models in the **8B and 20B parameter** range for production use, alongside **1B models** for specialized applications such as speculative decoding and research purposes.
These models, released under the **Apache 2.0 license**, provide enhanced accessibility to advanced language technologies across the region.

Sailor2 builds upon the foundation of the awesome multilingual model [Qwen 2.5](https://huggingface.co/collections/Qwen/qwen25-66e81a666513e518adb90d9e) and is continually pre-trained on **500B tokens** to better support **15 languages** with a unified model.
These languages are English, Chinese, Burmese, Cebuano, Ilocano, Indonesian, Javanese, Khmer, Lao, Malay, Sundanese, Tagalog, Thai, Vietnamese, and Waray.
By addressing the growing demand for diverse, robust, and accessible language models, Sailor2 seeks to serve underserved communities in SEA with open, inclusive, and accessible multilingual LLMs.
The Sailor2 models come in three sizes, 1B, 8B, and 20B, which are **expanded from the Qwen2.5 base models** of 0.5B, 7B, and 14B, respectively.

## Model Summary

- **Model Collections:** [Base Model & Chat Model](https://huggingface.co/collections/sail/sailor2-language-models-674d7c9e6b4dbbd9a869906b)
- **Project Website:** [sea-sailor.github.io/blog/sailor2/](https://sea-sailor.github.io/blog/sailor2/)
- **Codebase:** [github.com/sail-sg/sailor2](https://github.com/sail-sg/sailor2)
- **Technical Report:** Coming Soon

## Training details

During development, we employ a range of advanced techniques to ensure top-tier performance and efficiency:

1. model expansion
2. optimized data mixing strategies
3. multi-stage pre-training protocols
4. advanced multilingual post-training

Please refer to the [Sailor2 Blog](https://sea-sailor.github.io/blog/sailor2/) for more training details.

## Requirements

The code of Sailor2 is included in the latest Hugging Face `transformers`, and we advise you to install `transformers==4.46.3`.
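
As a quick sanity check before running the Quickstart, the sketch below compares the installed `transformers` version against the advised one. The helper names (`parse_version`, `meets_advised_version`, `installed_transformers_version`) are hypothetical, the parser only handles plain numeric versions such as `4.46.3`, and treating the advised version as a minimum is an assumption, since this card pins an exact version.

```python
from importlib import metadata

ADVISED_VERSION = "4.46.3"  # version advised in this model card


def parse_version(version: str) -> tuple:
    """Turn '4.46.3' into (4, 46, 3); stops at the first non-numeric piece such as 'dev0'."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_advised_version(installed: str, advised: str = ADVISED_VERSION) -> bool:
    """Return True if `installed` is at least `advised` (assumed here to be a minimum)."""
    return parse_version(installed) >= parse_version(advised)


def installed_transformers_version():
    """Return the installed transformers version string, or None if it is not installed."""
    try:
        return metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_transformers_version()
    if version is None:
        print(f"transformers is not installed; try: pip install transformers=={ADVISED_VERSION}")
    elif meets_advised_version(version):
        print(f"transformers {version} is installed")
    else:
        print(f"transformers {version} is older than the advised {ADVISED_VERSION}")
```

If the check reports an older version, `pip install transformers==4.46.3` installs the advised release.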

## Quickstart

The following code snippet shows how to load the tokenizer and model and how to generate content.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "sail/Sailor2-20B-Chat"

# Load the model in bfloat16; device_map="auto" places the weights on the available device(s).
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

system_prompt = (
    "You are an AI assistant named Sailor2, created by Sea AI Lab. "
    "As an AI assistant, you can answer questions in English, Chinese, and Southeast Asian languages "
    "such as Burmese, Cebuano, Ilocano, Indonesian, Javanese, Khmer, Lao, Malay, Sundanese, Tagalog, Thai, Vietnamese, and Waray. "
    "Your responses should be friendly, unbiased, informative, detailed, and faithful."
)

prompt = "Beri saya pengenalan singkat tentang model bahasa besar."
# prompt = "Hãy cho tôi một giới thiệu ngắn gọn về mô hình ngôn ngữ lớn."
# prompt = "ให้ฉันแนะนำสั้น ๆ เกี่ยวกับโมเดลภาษาขนาดใหญ่"

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": prompt},
]

# Render the conversation with the model's chat template and append the assistant turn marker.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

# Move the tokenized inputs to the device that holds the model's first layer.
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
)

# Drop the prompt tokens so that only the newly generated tokens are decoded.
generated_ids = [
    output_ids[len(prompt_ids):]
    for prompt_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
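
To make the `apply_chat_template` step less opaque, here is a minimal sketch of the kind of string it produces. It assumes Sailor2 inherits the ChatML-style layout used by Qwen2.5 (`<|im_start|>` / `<|im_end|>` markers); this is an assumption, not something read from the Sailor2 tokenizer, and `build_chatml_prompt` is a hypothetical helper. The template shipped with the tokenizer remains authoritative.

```python
def build_chatml_prompt(messages, add_generation_prompt=True):
    """Render chat messages in a ChatML-style layout.

    Assumption: the layout mirrors Qwen2.5's chat template. For real inference,
    use tokenizer.apply_chat_template instead of this helper.
    """
    parts = []
    for message in messages:
        # Each turn is wrapped as: <|im_start|>{role}\n{content}<|im_end|>\n
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


example_messages = [
    {"role": "system", "content": "You are an AI assistant named Sailor2."},
    {"role": "user", "content": "How to bake fish?"},
]
print(build_chatml_prompt(example_messages))
```

Comparing this output with the `text` variable from the snippet above is a quick way to check whether the assumed layout matches the actual template.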

## License

Sailor2 is distributed under the terms of the Apache License 2.0.
There is no restriction on research or commercial use.

## Citation

If you find Sailor2 useful, please cite our work as follows:

```
@misc{sailor2report,
  title={Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLM},
  author={{Sailor2 Team}},
  year={2024}
}
```

## Contact Us

If you have any questions, please raise an issue or contact us at [doulx@sea.com](mailto:doulx@sea.com) or [liuqian.sea@gmail.com](mailto:liuqian.sea@gmail.com).