Taekyoon committed on
Commit
28373e4
1 Parent(s): 382f21a

Update README.md

Files changed (1)
  1. README.md +15 -1
README.md CHANGED
@@ -20,7 +20,7 @@ tags:
  **Original Gemma Model Page**: [Gemma](https://ai.google.dev/gemma/docs)

  This model card corresponds to the 7B base version of the **Gemma-Mling** model,
- continual pretrained on Korean/English/Chinese/Japanese corpus.
+ continually pretrained mainly on a Korean/English/Chinese/Japanese corpus plus a 500-language multilingual corpus.

  **Resources and Technical Documentation**:

@@ -96,6 +96,20 @@ Details about the model internals.

  Training was done using [beomi/Gemma-EasyLM](https://github.com/Beomi/Gemma-EasyLM).

+ ### Dataset
+
+ We trained on a mixture of datasets across multiple languages, up to a total of 100B tokens.
+ The released model is the checkpoint that performed best on the evaluation below.
+
+ For the Korean and English data, we used a sampled llama2ko training dataset, combining the two languages at a 1:1 ratio.
+
+ | Dataset                  | JSONL (GB) | Sampled |
+ |--------------------------|------------|---------|
+ | range3/cc100-ja          | 96.39      | No      |
+ | Skywork/SkyPile-150B     | 100.57     | Yes     |
+ | llama2ko dataset (ko/en) | 108.5      | Yes     |
+ | cis-lmu/Glot500          | 181.24     | No      |
+ | Total                    | 486.7      | -       |

  ## Evaluation
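As an illustration of the 1:1 Korean/English mixing described in the added dataset section, here is a minimal sketch using the Hugging Face `datasets` library. It is not part of the commit, and the JSONL file names are hypothetical placeholders for the sampled llama2ko shards.

```python
# Minimal sketch: interleave Korean and English JSONL shards at a 1:1 ratio.
# The file names below are hypothetical placeholders, not the actual dataset paths.
from datasets import load_dataset, interleave_datasets

ko = load_dataset("json", data_files="llama2ko_ko.jsonl", split="train", streaming=True)
en = load_dataset("json", data_files="llama2ko_en.jsonl", split="train", streaming=True)

# probabilities=[0.5, 0.5] yields the 1:1 ko/en mix; "all_exhausted" keeps
# drawing (oversampling the smaller source) until both are fully consumed.
mixed = interleave_datasets(
    [ko, en],
    probabilities=[0.5, 0.5],
    seed=42,
    stopping_strategy="all_exhausted",
)

for example in mixed.take(3):  # peek at the first few mixed records
    print(example)
```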