---
title: README
emoji: 💻
colorFrom: pink
colorTo: blue
sdk: static
pinned: false
---
# Sailor2
The Sailor2 community aims to build open large language models optimized for multiple South-East Asian languages, such as Cebuano, Indonesian, Khmer, Lao, Minangkabau, Malay, Burmese, Sundanese, Javanese, Thai, and Vietnamese. The models are continually pre-trained from base models proficient in both Chinese and English, and their performance is expected to be comparable to that of the most advanced commercial models for the above South-East Asian languages.
Read more details about Sailor2 at https://sea-sailor.github.io/blog/sailor2/.
You can now use the following commands to run the Sailor2 chat models locally with Ollama:
```shell
ollama run sailor2:1b
ollama run sailor2
ollama run sailor2:20b
```
## 🌱 Sailor2 Models
- GitHub: All you need to know about using or fine-tuning Sailor2.
- Sailor2-1B: 1B base model continually pre-trained on 500B tokens from Qwen2.5-0.5B with model expansion.
- Sailor2-8B: 8B base model continually pre-trained on 500B tokens from Qwen2.5-7B with model expansion.
- Sailor2-20B: 20B base model continually pre-trained on 500B tokens from Qwen2.5-14B with model expansion.
- Sailor2-1B-Chat: 1B chat model after post-training on the 1B base model.
- Sailor2-8B-Chat: 8B chat model after post-training on the 8B base model.
- Sailor2-20B-Chat: 20B chat model after post-training on the 20B base model (see the loading sketch below).
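
As a minimal sketch, the chat models can also be run with Hugging Face transformers. The repo id `sail/Sailor2-1B-Chat` and the Indonesian prompt are assumptions for illustration; check the model cards for the exact ids and recommended generation settings.

```python
# Minimal sketch: run a Sailor2 chat model with transformers.
# The repo id "sail/Sailor2-1B-Chat" is an assumption; see the model cards.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "sail/Sailor2-1B-Chat"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Build the prompt with the model's own chat template.
messages = [{"role": "user", "content": "Halo, apa kabar?"}]  # Indonesian: "Hello, how are you?"
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=256)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```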
## 📚 Sailor2 Pre-training Dataset
- sailor2-pretrain-data-stage1: A comprehensive dataset comprising 450B tokens of high-quality data for continual pre-training, including English (from ProX), Chinese (from Chinese-Fineweb-Edu), Vietnamese, Indonesian, Thai, Malay, Burmese, Tagalog, and Khmer; organized by chunks.
- sailor2-pretrain-data-stage2: An additional 60B tokens of exceptionally high-quality data for model annealing, covering the above languages plus Cebuano, Lao, Javanese, Waray, Sundanese, and Ilocano; organized by chunks.
- community-dataset: Clean South-East Asian datasets contributed by community members, including Indonesian, Thai, and Vietnamese content in fields such as news, finance, law, books, poetry, social media, and TED Talks; organized by source.
- sea-commoncrawl: Clean South-East Asia-related web corpora from 89 CommonCrawl snapshots; organized by language.
- sea-internet: Clean multilingual data from the Internet Archive, cleaned and deduplicated from the dataset released with A New Massive Multilingual Dataset for High-Performance Language Technologies; organized by language.
- sea-pdf-text: Clean PDF text; the PDF links are sourced from partner information; organized by language.
- sea-synthetic: Translations of Cosmopedia into multiple languages, used to retrieve high-quality tokens for stage 2; organized by language.
- sea-commoncrawl-high-quality: The high-quality CommonCrawl subset used in stage 2 of Sailor2 pre-training; organized by language (see the loading sketch after this list).
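
As a minimal sketch, any of these corpora can be streamed with the datasets library. The repo id `sail/sailor2-pretrain-data-stage1` is an assumption, and a chunked layout may require an extra config or `data_dir` argument; check the dataset cards for exact ids and fields.

```python
# Minimal sketch: stream a pre-training corpus instead of downloading it
# (stage 1 alone is ~450B tokens). The repo id is an assumption; chunked
# repos may additionally need a config name or data_dir.
from itertools import islice

from datasets import load_dataset

ds = load_dataset("sail/sailor2-pretrain-data-stage1", split="train", streaming=True)

# Peek at a few records without materializing the whole dataset.
for example in islice(ds, 3):
    print(example)
```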
## 📚 Sailor2 Post-training Dataset
- sailor2-sft-stage1: 4M medium-quality instruction-tuning examples; supports English, Chinese, and 16 SEA languages.
- sailor2-sft-stage2: 400K high-quality instruction-tuning examples; supports English, Chinese, and 16 SEA languages.
- sea-ultrafeedback: Preference optimization dataset; supports English, Chinese, and 17 SEA languages (see the inspection sketch below).
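
For a quick look at what a preference record contains, the hedged sketch below assumes the usual UltraFeedback-style prompt/chosen/rejected layout and the repo id `sail/sea-ultrafeedback`; neither is confirmed by this README, so verify both against the dataset card.

```python
# Hedged sketch: inspect one preference pair. The repo id and the field
# names (prompt/chosen/rejected, the common UltraFeedback layout) are
# assumptions, not confirmed by this README.
from datasets import load_dataset

prefs = load_dataset("sail/sea-ultrafeedback", split="train", streaming=True)
row = next(iter(prefs))
for field in ("prompt", "chosen", "rejected"):
    print(field, "->", str(row.get(field))[:100])
```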
## 🔧 Sailor2 Evaluation Benchmark
- sea-wildbench: Chat model evaluation; supports 8 SEA languages.
## 💻 Sailor2 Codebase
- SailCraft Code: Data cleaning
- Regmix Code: Data mixture
- SailCompass Code: Few-shot evaluation
- Megatron Code: Pre-training (Coming Soon)
- OAT Code: Post-training