---
title: README
emoji: 💻
colorFrom: pink
colorTo: blue
sdk: static
pinned: false
---
# Sailor2
The Sailor2 community aims to build open large language models optimized for multiple South-East Asian languages, such as Cebuano, Indonesian, Khmer, Lao, Minangkabau, Malay, Burmese, Sundanese, Javanese, Thai, and Vietnamese. The models are continually pre-trained from base models proficient in both Chinese and English, and their performance is expected to be comparable to that of the most advanced commercial models for the above South-East Asian languages.
Read more details about Sailor2 at https://sea-sailor.github.io/blog/sailor2/.
You can now use the following commands to run the Sailor2 chat models locally with Ollama:
```shell
ollama run sailor2:1b
ollama run sailor2
ollama run sailor2:20b
```
## 🌱 Sailor2 Models
- GitHub: All you need to know about using or fine-tuning Sailor2.
- Sailor2-1B: 1B base model continually pre-trained on 500B tokens from Qwen2.5-0.5B with model expansion.
- Sailor2-8B: 8B base model continually pre-trained on 500B tokens from Qwen2.5-7B with model expansion.
- Sailor2-20B: 20B base model continually pre-trained on 500B tokens from Qwen2.5-14B with model expansion.
- Sailor2-1B-Chat: 1B chat model after post-training on the 1B base model.
- Sailor2-8B-Chat: 8B chat model after post-training on the 8B base model.
- Sailor2-20B-Chat: 20B chat model after post-training on the 20B base model (see the loading sketch below).
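
As a minimal sketch, the chat models can also be run with Hugging Face transformers. The repo id `sail/Sailor2-1B-Chat` and the Indonesian prompt are assumptions for illustration; check the model cards for the exact ids and recommended generation settings.

```python
# Minimal sketch: run a Sailor2 chat model with transformers.
# The repo id "sail/Sailor2-1B-Chat" is an assumption; see the model cards.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "sail/Sailor2-1B-Chat"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Build the prompt with the model's own chat template.
messages = [{"role": "user", "content": "Halo, apa kabar?"}]  # Indonesian: "Hello, how are you?"
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=256)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```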
## 📚 Sailor2 Pre-training Dataset
- sailor2-pretrain-data-stage1: A comprehensive dataset comprising 450B tokens of high-quality data for continual pre-training, including English (from ProX), Chinese (from Chinese-Fineweb-Edu), Vietnamese, Indonesian, Thai, Malay, Burmese, Tagalog, and Khmer; organized by chunks.
- sailor2-pretrain-data-stage2: An additional 60B tokens of exceptionally high-quality data for model annealing, covering the above languages plus Cebuano, Lao, Javanese, Waray, Sundanese, and Ilocano; organized by chunks.
- community-dataset: Clean South-East Asian datasets contributed by community members, including Indonesian, Thai, and Vietnamese content in fields such as news, finance, law, books, poetry, social media, and TED Talks; organized by source.
- sea-commoncrawl: Clean South-East Asia-related web corpora from 89 CommonCrawl snapshots; organized by language.
- sea-internet: Clean multilingual data from the Internet Archive, cleaned and deduplicated from the dataset released with A New Massive Multilingual Dataset for High-Performance Language Technologies; organized by language.
- sea-pdf-text: Clean PDF text; the PDF links are sourced from partner information; organized by language.
- sea-synthetic: Translations of Cosmopedia into multiple languages, used to retrieve high-quality tokens for stage 2; organized by language.
- sea-commoncrawl-high-quality: The high-quality CommonCrawl subset used in stage 2 of Sailor2 pre-training; organized by language (see the loading sketch after this list).
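
As a minimal sketch, any of these corpora can be streamed with the datasets library. The repo id `sail/sailor2-pretrain-data-stage1` is an assumption, and a chunked layout may require an extra config or `data_dir` argument; check the dataset cards for exact ids and fields.

```python
# Minimal sketch: stream a pre-training corpus instead of downloading it
# (stage 1 alone is ~450B tokens). The repo id is an assumption; chunked
# repos may additionally need a config name or data_dir.
from itertools import islice

from datasets import load_dataset

ds = load_dataset("sail/sailor2-pretrain-data-stage1", split="train", streaming=True)

# Peek at a few records without materializing the whole dataset.
for example in islice(ds, 3):
    print(example)
```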
## 📚 Sailor2 Post-training Dataset
- sailor2-sft-stage1: 4M medium-quality instruction-tuning examples; supports English, Chinese, and 16 SEA languages.
- sailor2-sft-stage2: 400K high-quality instruction-tuning examples; supports English, Chinese, and 16 SEA languages.
- sea-ultrafeedback: Preference optimization dataset; supports English, Chinese, and 17 SEA languages (see the inspection sketch below).
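
For a quick look at what a preference record contains, the hedged sketch below assumes the usual UltraFeedback-style prompt/chosen/rejected layout and the repo id `sail/sea-ultrafeedback`; neither is confirmed by this README, so verify both against the dataset card.

```python
# Hedged sketch: inspect one preference pair. The repo id and the field
# names (prompt/chosen/rejected, the common UltraFeedback layout) are
# assumptions, not confirmed by this README.
from datasets import load_dataset

prefs = load_dataset("sail/sea-ultrafeedback", split="train", streaming=True)
row = next(iter(prefs))
for field in ("prompt", "chosen", "rejected"):
    print(field, "->", str(row.get(field))[:100])
```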
## 🔧 Sailor2 Evaluation Benchmark
- sea-wildbench: Chat model evaluation; supports 8 SEA languages.
## 💻 Sailor2 Codebase
- SailCraft Code: Data cleaning
- Regmix Code: Data mixture
- SailCompass Code: Few-shot evaluation
- Megatron Code: Pre-training (Coming Soon)
- OAT Code: Post-training