---
title: README
emoji: 💻
colorFrom: pink
colorTo: blue
sdk: static
pinned: false
---

# Sailor2

The Sailor2 community builds open large language models optimized for multiple South-East Asian languages, such as Cebuano, Indonesian, Khmer, Lao, Minangkabau, Malay, Burmese, Sundanese, Javanese, Thai, and Vietnamese. The models are continually pre-trained from a base model proficient in both Chinese and English, and their performance is expected to be comparable to the most advanced commercial models for the above South-East Asian languages.

Read more details about Sailor2 at https://sea-sailor.github.io/blog/sailor2/.

You can run the Sailor2 chat models locally with the following commands:

```
ollama run sailor2:1b
ollama run sailor2
ollama run sailor2:20b
```

---
## 🔱 Sailor2 Models

- [GitHub](https://github.com/sail-sg/sailor2): All you need to know about using or fine-tuning Sailor2.
- [Sailor2-1B](https://huggingface.co/sail/Sailor2-1B): 1B base model continually pre-trained on 500B tokens from Qwen2.5-0.5B with model expansion.
- [Sailor2-8B](https://huggingface.co/sail/Sailor2-8B): 8B base model continually pre-trained on 500B tokens from Qwen2.5-7B with model expansion.
- [Sailor2-20B](https://huggingface.co/sail/Sailor2-20B): 20B base model continually pre-trained on 500B tokens from Qwen2.5-14B with model expansion.
- [Sailor2-1B-Chat](https://huggingface.co/sail/Sailor2-1B-Chat): 1B chat model post-trained from the 1B base model.
- [Sailor2-8B-Chat](https://huggingface.co/sail/Sailor2-8B-Chat): 8B chat model post-trained from the 8B base model.
- [Sailor2-20B-Chat](https://huggingface.co/sail/Sailor2-20B-Chat): 20B chat model post-trained from the 20B base model.
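Beyond ollama, the chat models above can be run with Hugging Face `transformers`. A minimal sketch using the 1B chat model is below; the generation settings and the example prompt are illustrative assumptions, not official recommendations.

```python
# Sketch: chat with Sailor2-1B-Chat via transformers.
# Assumptions: the model follows the standard chat-template convention
# inherited from Qwen2.5; max_new_tokens=128 is an arbitrary choice.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "sail/Sailor2-1B-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Build the prompt from a chat-style message list.
messages = [
    {"role": "user", "content": "Halo, apa kabar?"}  # Indonesian: "Hello, how are you?"
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

Swap `model_id` for `sail/Sailor2-8B-Chat` or `sail/Sailor2-20B-Chat` to use the larger variants.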
---
## 📚 Sailor2 Pre-training Datasets

- [sailor2-pretrain-data-stage1](https://huggingface.co/datasets/sailor2/sailor2-pretrain-data-stage1): A comprehensive dataset of 450B tokens of high-quality data for continual pre-training, including English (from [ProX](https://huggingface.co/datasets/gair-prox/FineWeb-pro)), Chinese (from [Chinese-Fineweb-Edu](https://huggingface.co/datasets/opencsg/chinese-fineweb-edu)), Vietnamese, Indonesian, Thai, Malay, Burmese, Tagalog, and Khmer, organized by **chunks**.
- [sailor2-pretrain-data-stage2](https://huggingface.co/datasets/sailor2/sailor2-pretrain-data-stage2): An additional 60B tokens of exceptionally high-quality data for model annealing, covering the above languages plus Cebuano, Lao, Javanese, Waray, Sundanese, and Ilocano, organized by **chunks**.
- [community-dataset](https://huggingface.co/datasets/sailor2/community-dataset): Clean South-East Asian datasets contributed by community members, including Indonesian, Thai, and Vietnamese content in fields such as news, finance, law, books, poetry, social media, and TED Talks, organized by **source**.
- [sea-commoncrawl](https://huggingface.co/datasets/sailor2/sea-commoncrawl): Clean South-East Asia-related web corpora from 89 CommonCrawl snapshots, organized by **language**.
- [sea-internet](https://huggingface.co/datasets/sailor2/sea-internet): Clean multilingual data from the Internet Archive, cleaned and deduplicated from the dataset released with [A New Massive Multilingual Dataset for High-Performance Language Technologies](https://arxiv.org/abs/2403.14009), organized by **language**.
- [sea-pdf-text](https://huggingface.co/datasets/sailor2/sea-pdf-text): Clean PDF text; the PDF links are sourced from partner information, organized by **language**.
- [sea-synthetic](https://huggingface.co/datasets/sailor2/sea-synthetic): Translations of Cosmopedia into multiple languages, used to retrieve the high-quality tokens for stage 2, organized by **language**.
- [sea-commoncrawl-high-quality](https://huggingface.co/datasets/sailor2/sea-commoncrawl-high-quality): The high-quality CommonCrawl subset used in stage 2 of Sailor2 pre-training, organized by **language**.
---
## 📑 Sailor2 Post-training Datasets

- [sailor2-sft-stage1](https://huggingface.co/datasets/sailor2/sailor2-sft-stage1): 4M medium-quality instruction-tuning examples, supporting English, Chinese, and 16 SEA languages.
- [sailor2-sft-stage2](https://huggingface.co/datasets/sailor2/sailor2-sft-stage2): 400K high-quality instruction-tuning examples, supporting English, Chinese, and 16 SEA languages.
- [sea-ultrafeedback](https://huggingface.co/datasets/sailor2/sea-ultrafeedback): Preference optimization dataset, supporting English, Chinese, and 17 SEA languages.
---
## 🧐 Sailor2 Evaluation Benchmark

- [sea-wildbench](https://huggingface.co/datasets/sailor2/sea-wildbench): Chat model evaluation, supporting 8 SEA languages.
---
## 💻 Sailor2 Codebase

- [SailCraft](https://github.com/sail-sg/sailcraft): Data cleaning
- [RegMix](https://github.com/sail-sg/regmix): Data mixture
- [SailCompass](https://github.com/sail-sg/sailcompass): Few-shot evaluation
- [Megatron](https://github.com/sail-sg/Megatron-Sailor2): Pre-training (coming soon)
- [OAT](https://github.com/sail-sg/oat): Post-training