---
title: README
emoji: 💻
colorFrom: pink
colorTo: blue
sdk: static
pinned: false
---
# Sailor2
The Sailor2 community builds open large language models optimized for multiple South-East Asian languages, such as Cebuano, Indonesian, Khmer, Lao, Minangkabau, Malay, Burmese, Sundanese, Javanese, Thai, and Vietnamese. The models are continually pre-trained from base models proficient in both Chinese and English, and their performance is expected to be comparable to that of the most advanced commercial models for the above South-East Asian languages.
Read more details about Sailor2 at https://sea-sailor.github.io/blog/sailor2/.
You can run the Sailor2 chat models locally with the following commands:
```shell
ollama run sailor2:1b
ollama run sailor2
ollama run sailor2:20b
```
---
<details open>
<summary>
<b><font size="+1">🌱 Sailor2 Models</font></b>
</summary>
- [GitHub](https://github.com/sail-sg/sailor2): All you need to know about using or fine-tuning Sailor2.
- [Sailor2-1B](https://huggingface.co/sail/Sailor2-1B): 1B base model continually pre-trained on 500B tokens from Qwen2.5-0.5B with model expansion.
- [Sailor2-8B](https://huggingface.co/sail/Sailor2-8B): 8B base model continually pre-trained on 500B tokens from Qwen2.5-7B with model expansion.
- [Sailor2-20B](https://huggingface.co/sail/Sailor2-20B): 20B base model continually pre-trained on 500B tokens from Qwen2.5-14B with model expansion.
- [Sailor2-1B-Chat](https://huggingface.co/sail/Sailor2-1B-Chat): 1B chat model after post-training on the 1B base model.
- [Sailor2-8B-Chat](https://huggingface.co/sail/Sailor2-8B-Chat): 8B chat model after post-training on the 8B base model.
- [Sailor2-20B-Chat](https://huggingface.co/sail/Sailor2-20B-Chat): 20B chat model after post-training on the 20B base model.
</details>
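The chat models above can also be queried directly with Hugging Face `transformers`. Below is a minimal sketch assuming the usual `AutoModelForCausalLM` / chat-template workflow; the prompt and generation settings are illustrative assumptions, not official recommendations:

```python
# Minimal sketch: querying Sailor2-1B-Chat via Hugging Face transformers.
# The prompt and generation settings are illustrative, not official defaults;
# device_map="auto" additionally requires the accelerate package.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "sail/Sailor2-1B-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Apa ibu kota Indonesia?"}]  # Indonesian
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```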
---
<details>
<summary>
<b><font size="+1">📚 Sailor2 Pre-training Dataset </font></b>
</summary>
- [sailor2-pretrain-data-stage1](https://huggingface.co/datasets/sailor2/sailor2-pretrain-data-stage1): A comprehensive dataset comprising 450B tokens of high-quality data for continual pre-training, including English (from [ProX](https://huggingface.co/datasets/gair-prox/FineWeb-pro)), Chinese (from [Chinese-Fineweb-Edu](https://huggingface.co/datasets/opencsg/chinese-fineweb-edu)), Vietnamese, Indonesian, Thai, Malay, Burmese, Tagalog, and Khmer, organized by **chunks**
- [sailor2-pretrain-data-stage2](https://huggingface.co/datasets/sailor2/sailor2-pretrain-data-stage2): An additional 60B tokens of exceptionally high-quality data for model annealing, including the above languages and Cebuano, Lao, Javanese, Waray, Sundanese, and Ilocano, organized by **chunks**
- [community-dataset](https://huggingface.co/datasets/sailor2/community-dataset): Clean South-East Asian datasets contributed by community members, including Indonesian, Thai, and Vietnamese content in fields like news, finance, law, books, poetry, social media, and TED Talks, organized by **source**
- [sea-commoncrawl](https://huggingface.co/datasets/sailor2/sea-commoncrawl): Clean South-East Asia-related web corpora from 89 CommonCrawl snapshots, organized by **languages**
- [sea-internet](https://huggingface.co/datasets/sailor2/sea-internet): Clean multilingual data from the Internet Archive, cleaned and deduplicated from the dataset released with [A New Massive Multilingual Dataset for High-Performance Language Technologies](https://arxiv.org/abs/2403.14009), organized by **languages**
- [sea-pdf-text](https://huggingface.co/datasets/sailor2/sea-pdf-text): Clean text extracted from PDFs, with the PDF links sourced from partner information, organized by **languages**
- [sea-synthetic](https://huggingface.co/datasets/sailor2/sea-synthetic): Translations of Cosmopedia into multiple SEA languages, used to retrieve the high-quality tokens for stage 2, organized by **languages**
- [sea-commoncrawl-high-quality](https://huggingface.co/datasets/sailor2/sea-commoncrawl-high-quality): The high-quality CommonCrawl subset used in stage 2 of Sailor2 pre-training, organized by **languages**
</details>
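These corpora are large, so streaming is usually the practical way to inspect them. Here is a hedged sketch using the `datasets` library; the `"train"` split name is an assumption, so check each dataset card for the actual configs and fields:

```python
# Hedged sketch: stream a few records from the stage-1 pre-training corpus
# instead of downloading all 450B tokens. The "train" split name is an
# assumption; consult the dataset card for the real layout.
from itertools import islice

from datasets import load_dataset

ds = load_dataset(
    "sailor2/sailor2-pretrain-data-stage1", split="train", streaming=True
)
for record in islice(ds, 3):
    print(sorted(record.keys()))  # inspect which fields each chunk carries
```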
---
<details>
<summary>
<b><font size="+1">🎓 Sailor2 Post-training Dataset </font></b>
</summary>
- [sailor2-sft-stage1](https://huggingface.co/datasets/sailor2/sailor2-sft-stage1): 4M-example medium-quality instruction-tuning dataset, supporting English, Chinese, and 16 SEA languages.
- [sailor2-sft-stage2](https://huggingface.co/datasets/sailor2/sailor2-sft-stage2): 400K-example high-quality instruction-tuning dataset, supporting English, Chinese, and 16 SEA languages.
- [sea-ultrafeedback](https://huggingface.co/datasets/sailor2/sea-ultrafeedback): Preference-optimization dataset, supporting English, Chinese, and 17 SEA languages.
</details>
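Before wiring these into a trainer, it helps to check how the records are structured. Since the field names are not documented here, the sketch below only inspects the schema rather than assuming an UltraFeedback-style `prompt`/`chosen`/`rejected` layout; the `"train"` split names are likewise assumptions:

```python
# Hedged sketch: inspect the schema of the post-training datasets.
# Split names ("train") are assumptions; see each dataset card for the truth.
from datasets import load_dataset

for repo in ("sailor2/sailor2-sft-stage2", "sailor2/sea-ultrafeedback"):
    ds = load_dataset(repo, split="train", streaming=True)
    first = next(iter(ds))  # pull a single record without a full download
    print(repo, "->", sorted(first.keys()))
```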
---
<details>
<summary>
<b><font size="+1">🧪 Sailor2 Evaluation Benchmark </font></b>
</summary>
- [sea-wildbench](https://huggingface.co/datasets/sailor2/sea-wildbench): Chat-model evaluation suite, covering 8 SEA languages.
</details>
---
<details>
<summary>
<b><font size="+1">💻 Sailor2 Codebase </font></b>
</summary>
- [SailCraft Code](https://github.com/sail-sg/sailcraft): Data cleaning
- [Regmix Code](https://github.com/sail-sg/regmix): Data mixture
- [SailCompass Code](https://github.com/sail-sg/sailcompass): Few-shot evaluation
- [Megatron Code](https://github.com/sail-sg/Megatron-Sailor2): Pre-training (Coming Soon)
- [OAT Code](https://github.com/sail-sg/oat): Post-training
</details>