beomi and Taekyoon committed on
Commit b663d6a
1 parent: 382f21a

Update README.md (#3)


- Update README.md (28373e42b738a3df105f750ca10597d038993f37)


Co-authored-by: Taekyoon Ted Choi <Taekyoon@users.noreply.huggingface.co>

Files changed (1)

README.md CHANGED (+15, -1)
@@ -20,7 +20,7 @@ tags:
 **Original Gemma Model Page**: [Gemma](https://ai.google.dev/gemma/docs)
 
 This model card corresponds to the 7B base version of the **Gemma-Mling** model,
-continual pretrained on Korean/English/Chinese/Japanese corpus.
+continually pretrained mainly on a Korean/English/Chinese/Japanese corpus plus a 500-language multilingual corpus.
 
 **Resources and Technical Documentation**:
 
@@ -96,6 +96,20 @@ Details about the model internals.
 
 Training was done using [beomi/Gemma-EasyLM](https://github.com/Beomi/Gemma-EasyLM).
 
+### Dataset
+
+We trained on a mixture of multilingual datasets until reaching 100B tokens.
+The released model is the best-performing checkpoint according to the Evaluation section below.
+
+For Korean and English, we used a sampled version of the llama2ko training dataset, combining the two languages at a 1:1 ratio.
+
+| Dataset                  | JSONL size (GB) | Sampled |
+|--------------------------|-----------------|---------|
+| range3/cc100-ja          | 96.39           | No      |
+| Skywork/SkyPile-150B     | 100.57          | Yes     |
+| llama2ko dataset (ko/en) | 108.5           | Yes     |
+| cis-lmu/Glot500          | 181.24          | No      |
+| Total                    | 486.7           |         |
 
 ## Evaluation
 
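The new Dataset section reports raw corpus sizes but not the sampling ratios used for the mixture. As a point of reference, here is a minimal sketch of deriving size-proportional mixture weights from the table's JSONL sizes; the proportional scheme and the variable names are illustrative assumptions, not the recipe this commit documents.

```python
# Minimal sketch: size-proportional mixture weights from the JSONL sizes
# in the table above. Proportional weighting is an assumption made for
# illustration; the commit does not state the actual sampling ratios.

sizes_gb = {
    "range3/cc100-ja": 96.39,
    "Skywork/SkyPile-150B": 100.57,
    "llama2ko dataset (ko/en)": 108.5,
    "cis-lmu/Glot500": 181.24,
}

total_gb = sum(sizes_gb.values())  # 486.7 GB, matching the Total row
weights = {name: size / total_gb for name, size in sizes_gb.items()}

for name, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{name:<26} {weight:.1%}")
# cis-lmu/Glot500 alone accounts for roughly 37% of the mixture by size.
```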
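Since the file this commit updates is a model card, a short usage sketch may help readers try the checkpoint. The repo id `beomi/gemma-mling-7b` is assumed from the card's description of the 7B base Gemma-Mling model and may differ from the actual repository name; the `transformers` calls are standard.

```python
# Usage sketch with Hugging Face transformers. The repo id is an
# assumption inferred from the model card; adjust it if the actual
# repository name differs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "beomi/gemma-mling-7b"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # halves memory vs fp32 for a 7B model
    device_map="auto",           # place weights on available GPU(s)
)

# Korean prompt, since Korean is a primary continual-pretraining language.
inputs = tokenizer("머신러닝이란", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```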