included CPT details
README.md CHANGED

@@ -1,10 +1,17 @@
 ---
 language:
 - en
+- zh
+- vi
 - id
-- ta
 - th
-- vi
+- tl
+- ta
+- ms
+- km
+- lo
+- my
+
 license: gemma
 ---
 # Gemma2 9B CPT SEA-LIONv3
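The `language:` entries above are the Hub's ISO 639-1 metadata tags, which power model discovery and filtering. As an illustration outside this commit, filtering AI Singapore models by one of the newly added tags might look like the sketch below, assuming a recent `huggingface_hub` client where `list_models()` accepts `author`, `language` and `limit` filters:

```python
# Hypothetical discovery sketch: list models carrying one of the language
# tags added in this commit. Assumes list_models() supports the `author`,
# `language` and `limit` arguments (recent huggingface_hub releases).
from huggingface_hub import HfApi

api = HfApi()
for model in api.list_models(author="aisingapore", language="km", limit=5):
    print(model.id)  # models tagged for Khmer under the aisingapore org
```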
@@ -24,7 +31,7 @@ The continued pre-training data for Gemma2 9B CPT SEA-LIONv3 base model encompas
 - **Developed by:** Products Pillar, AI Singapore
 - **Funded by:** Singapore NRF
 - **Model type:** Decoder
-- **Languages:** English, Indonesian, Thai, Vietnamese, Tamil
+- **Languages:** English, Chinese, Vietnamese, Indonesian, Thai, Tagalog, Tamil, Malay, Khmer, Lao, Burmese
 - **License:** [Gemma Community License](https://ai.google.dev/gemma/terms)
 
 For tokenization, the model employs the default tokenizer used in Gemma-2-9B.
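Since the card states the model keeps Gemma-2-9B's stock tokenizer, a minimal loading sketch follows; the repository ID is an assumption based on AI Singapore's naming scheme, not something this commit confirms:

```python
# Minimal sketch: load the tokenizer and count tokens the way the data
# tables below do. The repo ID is assumed, not taken from this diff.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "aisingapore/gemma2-9b-cpt-sea-lionv3-base"  # assumed repository ID
)

# CPT reuses the Gemma-2-9B tokenizer, so counts computed here match
# counts under google/gemma-2-9b's tokenizer.
text = "Selamat pagi, apa khabar?"  # Malay, one of the listed languages
ids = tokenizer(text)["input_ids"]
print(len(ids), tokenizer.convert_ids_to_tokens(ids))
```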
@@ -44,12 +51,12 @@ For more details on Gemma2 9B CPT SEA-LIONv3 base benchmark performance, please
 
 ### Data
 
-Gemma2 9B CPT SEA-LIONv3 base model was continued pre-trained on 48B tokens of the following data:
+Gemma2 9B CPT SEA-LIONv3 base model was continued pre-trained on 200B tokens of the following data:
 
 | Data Source               | Unique Tokens (B) | Multiplier | Total Tokens (B) | Percentage (%) |
 |---------------------------|:-----------------:|:----------:|:----------------:|:--------------:|
+| FineWebEdu                | 7.650             | 1          | 7.650            | 15.90          |
+| Stackv2                   | 1.160             | 1          | 1.16             | 9.21           |
 | Dolma Reddit - English    | 1.339             | 1          | 1.339            | 2.42           |
 | Dolma Semantic Scholar    | 0.959             | 1          | 0.959            | 2.79           |
 | Dolma arXiv               | 0.469             | 1          | 0.469            | 1.99           |
@@ -64,9 +71,48 @@ Gemma2 9B CPT SEA-LIONv3 base model was continued pre-trained on 48B tokens of t
 | SEA-LION Pile - Vietnamese| 6.76              | 1          | 6.76             | 14.08          |
 | Wiki* - Vietnamese        | 0.31              | 4          | 1.24             | 2.58           |
 
+| Data Source                          | Unique Tokens (B) | Multiplier | Total Tokens (B) | Percentage (%) |
+|--------------------------------------|:-----------------:|:----------:|:----------------:|:--------------:|
+| StackV2                              | 40.0              | 1          | 40.0             | 20.00          |
+| Wiki* + News* - English              | 5.0               | 1          | 5.0              | 2.50           |
+| Fineweb-Edu                          | 7.5               | 1          | 7.5              | 3.75           |
+| Dolma Project Gutenberg              | 5.0               | 1          | 5.0              | 2.50           |
+| Dolma arXiv                          | 1.7               | 1          | 1.7              | 0.83           |
+| Dolma StackExchange                  | 1.7               | 1          | 1.7              | 0.83           |
+| Dolma Semantic Scholar               | 1.7               | 1          | 1.7              | 0.83           |
+| Dolma OpenWebMath                    | 2.5               | 1          | 2.5              | 1.25           |
+| Dolma Algebraic Stack                | 2.5               | 1          | 2.5              | 1.25           |
+| Dolma Flan                           | 5.0               | 1          | 5.0              | 2.50           |
+| Dolma Reddit                         | 5.0               | 1          | 5.0              | 2.50           |
+| Dolma Megawika                       | 5.0               | 1          | 5.0              | 2.50           |
+| Dolma CC News                        | 7.5               | 1          | 7.5              | 3.75           |
+| Wiki* + News* - Chinese              | 3.5               | 4          | 14.0             | 7.00           |
+| SEA-LION Pile - Chinese              | 12.0              | 1          | 12.0             | 6.00           |
+| Wiki* + News* - Vietnamese           | 2.4               | 4          | 9.4              | 4.70           |
+| VinBigData - Vietnamese              | 2.1               | 4          | 8.2              | 4.10           |
+| SEA-LION Pile - Vietnamese           | 8.4               | 1          | 8.4              | 4.20           |
+| Wiki* + News* - Indonesian           | 1.3               | 4          | 5.2              | 2.60           |
+| SEA-LION Pile - Indonesian           | 20.8              | 1          | 20.8             | 10.40          |
+| Wiki* + News* + WangChanBERTa - Thai | 1.3               | 4          | 5.2              | 2.60           |
+| SEA-LION Pile - Thai                 | 14.8              | 1          | 14.8             | 7.40           |
+| Wiki* + News* - Tagalog              | 0.2               | 4          | 0.9              | 0.43           |
+| SEA-LION Pile - Tagalog              | 2.1               | 1          | 2.1              | 1.07           |
+| Wiki* + News* - Tamil                | 0.1               | 4          | 0.3              | 0.14           |
+| SEA-LION Pile - Tamil                | 0.7               | 1          | 0.7              | 0.36           |
+| Wiki* + News* - Malay                | 0.1               | 4          | 0.6              | 0.29           |
+| SEA-LION Pile - Malay                | 1.4               | 1          | 1.4              | 0.71           |
+| Wiki* + News* - Khmer                | 0.1               | 4          | 0.3              | 0.17           |
+| SEA-LION Pile - Khmer                | 2.3               | 1          | 2.3              | 1.13           |
+| Wiki* + News* - Lao                  | 0.0               | 4          | 0.1              | 0.03           |
+| SEA-LION Pile - Lao                  | 0.3               | 1          | 0.3              | 0.17           |
+| Wiki* + News* - Burmese              | 0.1               | 4          | 0.4              | 0.20           |
+| SEA-LION Pile - Burmese              | 2.6               | 1          | 2.6              | 1.30           |
+
+
 Note:
 - All token counts are counted using the Gemma2 tokenizer
+- Wiki* sources include Wikipedia, Wiki Books, Wiki Source, Wiki Voyage and Fandom Wiki
+- News* sources include VOA, Global Voices, MediaCorp and VinBigData-News
 - Tamil news is sourced with permission from [Seithi](https://seithi.mediacorp.sg/)
 
 ### Infrastructure
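In these tables, the Multiplier column is the number of epochs a source is repeated for, so Total Tokens = Unique Tokens × Multiplier. For the newly added table, the Percentage column is consistent with the 200B-token overall budget stated in the updated Data section; the sketch below verifies that arithmetic for two rows:

```python
# Sanity-check the table arithmetic: total = unique * multiplier and
# percentage = total / budget. Row values are copied from the table above.
rows = [
    # (data source, unique tokens in billions, multiplier)
    ("Wiki* + News* - Chinese", 3.5, 4),   # upsampled for 4 epochs
    ("SEA-LION Pile - Chinese", 12.0, 1),  # seen once
]
budget_b = 200.0  # total CPT tokens per the updated README

for name, unique_b, mult in rows:
    total_b = unique_b * mult
    print(f"{name}: {total_b:.1f}B tokens ({100 * total_b / budget_b:.2f}%)")
# Wiki* + News* - Chinese: 14.0B tokens (7.00%)
# SEA-LION Pile - Chinese: 12.0B tokens (6.00%)
```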