yuli-aisg committed
Commit 0c5276c
1 Parent(s): 26a8a26

included CPT details

Files changed (1)
  1. README.md +53 -7
README.md CHANGED
@@ -1,10 +1,17 @@
  ---
  language:
  - en
+ - zh
+ - vi
  - id
- - ta
  - th
- - vi
+ - tl
+ - ta
+ - ms
+ - km
+ - lo
+ - my
+
  license: gemma
  ---
  # Gemma2 9B CPT SEA-LIONv3
@@ -24,7 +31,7 @@ The continued pre-training data for Gemma2 9B CPT SEA-LIONv3 base model encompas
  - **Developed by:** Products Pillar, AI Singapore
  - **Funded by:** Singapore NRF
  - **Model type:** Decoder
- - **Languages:** English, Indonesian, Thai, Vietnamese, Tamil
+ - **Languages:** English, Chinese, Vietnamese, Indonesian, Thai, Tagalog, Tamil, Malay, Khmer, Lao, Burmese
  - **License:** [Gemma Community License](https://ai.google.dev/gemma/terms)

  For tokenization, the model employs the default tokenizer used in Gemma-2-9B.
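
The card keeps the default Gemma-2-9B tokenizer, and the token counts in the tables below are computed with it. As a rough illustration, here is a minimal sketch of counting tokens with that tokenizer via Hugging Face `transformers`; the gated `google/gemma-2-9b` checkpoint ID is an assumption, not something stated in this commit:

```python
# Sketch: count tokens the way the card's tables do, i.e. with the
# default Gemma-2-9B tokenizer. Assumes `transformers` is installed
# and the gated google/gemma-2-9b repo is accessible; the checkpoint
# ID is an assumption, not something this commit states.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b")

sample = "Gemma2 9B CPT SEA-LIONv3 is continued pre-trained on Southeast Asian text."
token_ids = tokenizer(sample, add_special_tokens=False)["input_ids"]
print(f"{len(token_ids)} tokens")
```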
@@ -44,12 +51,12 @@ For more details on Gemma2 9B CPT SEA-LIONv3 base benchmark performance, please

  ### Data

- Gemma2 9B CPT SEA-LIONv3 base model was continued pre-trained on 48B tokens of the following data:
+ Gemma2 9B CPT SEA-LIONv3 base model was continued pre-trained on 200B tokens of the following data:

  | Data Source               | Unique Tokens (B) | Multiplier | Total Tokens (B) | Percentage (%) |
  |---------------------------|:-----------------:|:----------:|:----------------:|:--------------:|
- | Dolma RefinedWeb - English| 7.650             | 1          | 7.650            | 15.90          |
- | Dolma C4 - English        | 1.160             | 1          | 1.16             | 9.21           |
+ | FineWebEdu                | 7.650             | 1          | 7.650            | 15.90          |
+ | Stackv2                   | 1.160             | 1          | 1.16             | 9.21           |
  | Dolma Reddit - English    | 1.339             | 1          | 1.339            | 2.42           |
  | Dolma Semantic Scholar    | 0.959             | 1          | 0.959            | 2.79           |
  | Dolma arXiv               | 0.469             | 1          | 0.469            | 1.99           |
@@ -64,9 +71,48 @@ Gemma2 9B CPT SEA-LIONv3 base model was continued pre-trained on 48B tokens of t
  | SEA-LION Pile - Vietnamese| 6.76              | 1          | 6.76             | 14.08          |
  | Wiki* - Vietnamese        | 0.31              | 4          | 1.24             | 2.58           |

+ | Data Source                            | Unique Tokens (B) | Multiplier | Total Tokens (B) | Percentage (%)|
+ |----------------------------------------|:-----------------:|:----------:|:----------------:|:-------------:|
+ | StackV2                                | 40.0              | 1          | 40.0             | 20.00         |
+ | Wiki* + News* - English                | 5.0               | 1          | 5.0              | 2.50          |
+ | Fineweb-Edu                            | 7.5               | 1          | 7.5              | 3.75          |
+ | Dolma Project Gutenberg                | 5.0               | 1          | 5.0              | 2.50          |
+ | Dolma arXiv                            | 1.7               | 1          | 1.7              | 0.83          |
+ | Dolma StackExchange                    | 1.7               | 1          | 1.7              | 0.83          |
+ | Dolma Semantic Scholar                 | 1.7               | 1          | 1.7              | 0.83          |
+ | Dolma OpenWebMath                      | 2.5               | 1          | 2.5              | 1.25          |
+ | Dolma Algebraic Stack                  | 2.5               | 1          | 2.5              | 1.25          |
+ | Dolma Flan                             | 5.0               | 1          | 5.0              | 2.50          |
+ | Dolma Reddit                           | 5.0               | 1          | 5.0              | 2.50          |
+ | Dolma Megawika                         | 5.0               | 1          | 5.0              | 2.50          |
+ | Dolma CC News                          | 7.5               | 1          | 7.5              | 3.75          |
+ | Wiki* + News* - Chinese                | 3.5               | 4          | 14.0             | 7.00          |
+ | SEA-LION Pile - Chinese                | 12.0              | 1          | 12.0             | 6.00          |
+ | Wiki* + News* - Vietnamese             | 2.4               | 4          | 9.4              | 4.70          |
+ | VinBigData - Vietnamese                | 2.1               | 4          | 8.2              | 4.10          |
+ | SEA-LION Pile - Vietnamese             | 8.4               | 1          | 8.4              | 4.20          |
+ | Wiki* + News* - Indonesian             | 1.3               | 4          | 5.2              | 2.60          |
+ | SEA-LION Pile - Indonesian             | 20.8              | 1          | 20.8             | 10.40         |
+ | Wiki* + News* + WangChanBERTa - Thai   | 1.3               | 4          | 5.2              | 2.60          |
+ | SEA-LION Pile - Thai                   | 14.8              | 1          | 14.8             | 7.40          |
+ | Wiki* + News - Tagalog                 | 0.2               | 4          | 0.9              | 0.43          |
+ | SEA-LION Pile - Tagalog                | 2.1               | 1          | 2.1              | 1.07          |
+ | Wiki* + News - Tamil                   | 0.1               | 4          | 0.3              | 0.14          |
+ | SEA-LION Pile - Tamil                  | 0.7               | 1          | 0.7              | 0.36          |
+ | Wiki* + News - Malay                   | 0.1               | 4          | 0.6              | 0.29          |
+ | SEA-LION Pile - Malay                  | 1.4               | 1          | 1.4              | 0.71          |
+ | Wiki* + News - Khmer                   | 0.1               | 4          | 0.3              | 0.17          |
+ | SEA-LION Pile - Khmer                  | 2.3               | 1          | 2.3              | 1.13          |
+ | Wiki* + News - Lao                     | 0.0               | 4          | 0.1              | 0.03          |
+ | SEA-LION Pile - Lao                    | 0.3               | 1          | 0.3              | 0.17          |
+ | Wiki* + News - Burmese                 | 0.1               | 4          | 0.4              | 0.20          |
+ | SEA-LION Pile - Burmese                | 2.6               | 1          | 2.6              | 1.30          |
+
+
  Note:
  - All token counts are counted using Gemma2 tokenizer
- - wiki* sources includes Wikipedia, Wiki Books, Wiki Source and Wiki Voyage
+ - Wiki* sources includes Wikipedia, Wiki Books, Wiki Source, Wiki Voyage and Fandom Wiki
+ - News* sources includes VOA, Global Voices, MediaCorp, VinBigData-News
  - Tamil news is sourced with permission from [Seithi](https://seithi.mediacorp.sg/)

  ### Infrastructure
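
In both mixture tables, Total Tokens (B) is Unique Tokens (B) multiplied by the Multiplier (a multiplier of 4 oversamples the smaller Wiki* and news sources), and Percentage (%) is each source's share of the overall token budget. A minimal sketch of that arithmetic over a three-row excerpt of the new table; since it is an excerpt, the printed shares are relative to the excerpt rather than to the full 200B:

```python
# Sketch of the mixture arithmetic behind the tables above:
# total = unique tokens x multiplier; percentage = total / grand total.
# The rows are a three-line excerpt from the new 200B-token table,
# so the percentages printed here cover the excerpt only.
rows = [
    # (data source, unique tokens in B, multiplier)
    ("StackV2", 40.0, 1),
    ("Wiki* + News* - Chinese", 3.5, 4),
    ("SEA-LION Pile - Indonesian", 20.8, 1),
]

totals = {source: unique * mult for source, unique, mult in rows}
grand_total = sum(totals.values())

for source, total in totals.items():
    share = 100 * total / grand_total
    print(f"{source}: {total:.1f}B tokens ({share:.2f}% of excerpt)")
```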
 