---
title: README
emoji: 💻
colorFrom: pink
colorTo: blue
sdk: static
pinned: false
---
# Sailor2
The Sailor2 community builds open large language models optimized for multiple South-East Asian languages, such as Cebuano, Indonesian, Khmer, Lao, Minangkabau, Malay, Burmese, Sundanese, Javanese, Thai, and Vietnamese. The models are continually pre-trained from base models proficient in both Chinese and English, and their performance is expected to be comparable to that of the most advanced commercial models for the above South-East Asian languages.
Read more details about Sailor2 at https://sea-sailor.github.io/blog/sailor2/.
You can run the Sailor2 chat models locally with the following commands:
```shell
ollama run sailor2:1b
ollama run sailor2
ollama run sailor2:20b
```
---
<details open>
<summary>
<b><font size="+1">🌱 Sailor2 Models</font></b>
</summary>
- [GitHub](https://github.com/sail-sg/sailor2): All you need to know about using or fine-tuning Sailor2.
- [Sailor2-1B](https://huggingface.co/sail/Sailor2-1B): 1B base model continually pre-trained on 500B tokens from Qwen2.5-0.5B with model expansion.
- [Sailor2-8B](https://huggingface.co/sail/Sailor2-8B): 8B base model continually pre-trained on 500B tokens from Qwen2.5-7B with model expansion.
- [Sailor2-20B](https://huggingface.co/sail/Sailor2-20B): 20B base model continually pre-trained on 500B tokens from Qwen2.5-14B with model expansion.
- [Sailor2-1B-Chat](https://huggingface.co/sail/Sailor2-1B-Chat): 1B chat model after post-training on the 1B base model.
- [Sailor2-8B-Chat](https://huggingface.co/sail/Sailor2-8B-Chat): 8B chat model after post-training on the 8B base model.
- [Sailor2-20B-Chat](https://huggingface.co/sail/Sailor2-20B-Chat): 20B chat model after post-training on the 20B base model.
</details>
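The chat models above can also be queried directly with Hugging Face `transformers`. Below is a minimal sketch assuming the usual `AutoModelForCausalLM` / chat-template workflow; the prompt and generation settings are illustrative assumptions, not official recommendations:

```python
# Minimal sketch: querying Sailor2-1B-Chat via Hugging Face transformers.
# The prompt and generation settings are illustrative, not official defaults;
# device_map="auto" additionally requires the accelerate package.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "sail/Sailor2-1B-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Apa ibu kota Indonesia?"}]  # Indonesian
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```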
---
<details>
<summary>
<b><font size="+1">📚 Sailor2 Pre-training Dataset </font></b>
</summary>
- [sailor2-pretrain-data-stage1](https://huggingface.co/datasets/sailor2/sailor2-pretrain-data-stage1): A comprehensive dataset comprising 450B tokens of high-quality data for continual pre-training, including English (from [ProX](https://huggingface.co/datasets/gair-prox/FineWeb-pro)), Chinese (from [Chinese-Fineweb-Edu](https://huggingface.co/datasets/opencsg/chinese-fineweb-edu)), Vietnamese, Indonesian, Thai, Malay, Burmese, Tagalog, and Khmer, organized by **chunks**
- [sailor2-pretrain-data-stage2](https://huggingface.co/datasets/sailor2/sailor2-pretrain-data-stage2): An additional 60B tokens of exceptionally high-quality data for model annealing, including the above languages and Cebuano, Lao, Javanese, Waray, Sundanese, and Ilocano, organized by **chunks**
- [community-dataset](https://huggingface.co/datasets/sailor2/community-dataset): Clean South-East Asian datasets contributed by community members, including Indonesian, Thai, and Vietnamese content in fields like news, finance, law, books, poetry, social media, and TED Talks, organized by **source**
- [sea-commoncrawl](https://huggingface.co/datasets/sailor2/sea-commoncrawl): Clean South-East Asia-related web corpora from 89 CommonCrawl snapshots, organized by **languages**
- [sea-internet](https://huggingface.co/datasets/sailor2/sea-internet): Clean multilingual data from the Internet Archive, cleaned and deduplicated from the dataset released with [A New Massive Multilingual Dataset for High-Performance Language Technologies](https://arxiv.org/abs/2403.14009), organized by **languages**
- [sea-pdf-text](https://huggingface.co/datasets/sailor2/sea-pdf-text): Clean text extracted from PDFs, with the PDF links sourced from partner information, organized by **languages**
- [sea-synthetic](https://huggingface.co/datasets/sailor2/sea-synthetic): Translations of Cosmopedia into multiple SEA languages, used to retrieve the high-quality tokens for stage 2, organized by **languages**
- [sea-commoncrawl-high-quality](https://huggingface.co/datasets/sailor2/sea-commoncrawl-high-quality): The high-quality CommonCrawl subset used in stage 2 of Sailor2 pre-training, organized by **languages**
</details>
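These corpora are large, so streaming is usually the practical way to inspect them. Here is a hedged sketch using the `datasets` library; the `"train"` split name is an assumption, so check each dataset card for the actual configs and fields:

```python
# Hedged sketch: stream a few records from the stage-1 pre-training corpus
# instead of downloading all 450B tokens. The "train" split name is an
# assumption; consult the dataset card for the real layout.
from itertools import islice

from datasets import load_dataset

ds = load_dataset(
    "sailor2/sailor2-pretrain-data-stage1", split="train", streaming=True
)
for record in islice(ds, 3):
    print(sorted(record.keys()))  # inspect which fields each chunk carries
```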
---
<details>
<summary>
<b><font size="+1">🎓 Sailor2 Post-training Dataset </font></b>
</summary>
- [sailor2-sft-stage1](https://huggingface.co/datasets/sailor2/sailor2-sft-stage1): 4M-example medium-quality instruction-tuning dataset, supporting English, Chinese, and 16 SEA languages.
- [sailor2-sft-stage2](https://huggingface.co/datasets/sailor2/sailor2-sft-stage2): 400K-example high-quality instruction-tuning dataset, supporting English, Chinese, and 16 SEA languages.
- [sea-ultrafeedback](https://huggingface.co/datasets/sailor2/sea-ultrafeedback): Preference-optimization dataset, supporting English, Chinese, and 17 SEA languages.
</details>
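Before wiring these into a trainer, it helps to check how the records are structured. Since the field names are not documented here, the sketch below only inspects the schema rather than assuming an UltraFeedback-style `prompt`/`chosen`/`rejected` layout; the `"train"` split names are likewise assumptions:

```python
# Hedged sketch: inspect the schema of the post-training datasets.
# Split names ("train") are assumptions; see each dataset card for the truth.
from datasets import load_dataset

for repo in ("sailor2/sailor2-sft-stage2", "sailor2/sea-ultrafeedback"):
    ds = load_dataset(repo, split="train", streaming=True)
    first = next(iter(ds))  # pull a single record without a full download
    print(repo, "->", sorted(first.keys()))
```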
---
<details>
<summary>
<b><font size="+1">🧪 Sailor2 Evaluation Benchmark </font></b>
</summary>
- [sea-wildbench](https://huggingface.co/datasets/sailor2/sea-wildbench): Chat-model evaluation suite, covering 8 SEA languages.
</details>
---
<details>
<summary>
<b><font size="+1">💻 Sailor2 Codebase </font></b>
</summary>
- [SailCraft Code](https://github.com/sail-sg/sailcraft): Data cleaning
- [Regmix Code](https://github.com/sail-sg/regmix): Data mixture
- [SailCompass Code](https://github.com/sail-sg/sailcompass): Few-shot evaluation
- [Megatron Code](https://github.com/sail-sg/Megatron-Sailor2): Pre-training (Coming Soon)
- [OAT Code](https://github.com/sail-sg/oat): Post-training
</details>