christian-dynamofl committed
Commit 211be6b
1 Parent(s): 982a25f

added languages

Files changed (1)
  1. README.md +10 -1
README.md CHANGED
@@ -1,3 +1,12 @@
+ ---
+ language:
+ - en
+ - de
+ - es
+ - ko
+ - tr
+ - it
+ ---
  Dynamo 8B builds on the Mistral-7B architecture for multilingual language modeling. It includes an extended tokenizer designed to better leverage tokens in different languages. The tokenizer was extended by training a SentencePiece BPE tokenizer on the selected languages (200M tokens per language) and then merging in the merges and vocabulary entries not already present in the Mistral tokenizer. After the tokenizers were merged, the model was pretrained on an additional 210B tokens of multilingual data, including German, Spanish, Korean, Italian, and Turkish text. The pretraining dataset also incorporated English tokens to mitigate catastrophic forgetting.

  Dynamo 8B has not been instruction fine-tuned and has not undergone alignment with techniques like reinforcement learning from human feedback. It is intended to give the research community a model for exploring the multilingual capabilities that enable widespread use of LLMs globally.
@@ -14,4 +23,4 @@ Model Specifications:
  *Pretraining on the multilingual dataset was done with a sequence length of 4096 tokens


- Dynamo 8B is a pretrained model that can be adapted and fine-tuned for a variety of tasks. However, it is a new technology that carries risks. In some scenarios, it may generate inaccurate, unverified, or biased output despite our efforts to maximize model safety. As with all LLMs, we recommend that users exercise critical thinking, validate outputs, and perform the requisite safety evaluations for specific downstream applications of the Dynamo model.
+ Dynamo 8B is a pretrained model that can be adapted and fine-tuned for a variety of tasks. However, it is a new technology that carries risks. In some scenarios, it may generate inaccurate, unverified, or biased output despite our efforts to maximize model safety. As with all LLMs, we recommend that users exercise critical thinking, validate outputs, and perform the requisite safety evaluations for specific downstream applications of the Dynamo model.
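
For readers curious what the tokenizer-extension recipe in the README above looks like in practice, here is a minimal sketch. It is not Dynamo's actual training code: the corpus path, vocabulary size, and `mistralai/Mistral-7B-v0.1` base checkpoint are illustrative assumptions, and `add_tokens` only approximates a true merge of BPE vocab/merge rules by appending whole pieces to the vocabulary.

```python
# Hypothetical sketch of the tokenizer-extension recipe (not the authors' code).
import sentencepiece as spm
from transformers import AutoTokenizer, AutoModelForCausalLM

# 1. Train a SentencePiece BPE model on the selected-language corpus.
spm.SentencePieceTrainer.train(
    input="multilingual_corpus.txt",  # placeholder corpus path
    model_prefix="multilingual_bpe",
    model_type="bpe",
    vocab_size=32000,                 # illustrative; not stated in the card
)

# 2. Read back the trained pieces.
sp = spm.SentencePieceProcessor(model_file="multilingual_bpe.model")
new_pieces = [sp.id_to_piece(i) for i in range(sp.get_piece_size())]

# 3. Keep only pieces the Mistral tokenizer does not already know, then add
#    them (an approximation of merging the vocab/merges directly).
base_tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
novel = [p for p in new_pieces if p not in base_tok.get_vocab()]
print(f"adding {base_tok.add_tokens(novel)} new tokens")

# 4. Resize the embedding matrix so the model accepts the new token IDs
#    before continued pretraining on the multilingual mix.
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
model.resize_token_embeddings(len(base_tok))
```

The new embedding rows start out randomly initialized, which is one reason the continued pretraining on multilingual plus English data described in the card is needed.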
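Since the card positions Dynamo 8B as a base model for adaptation, a minimal loading-and-generation sketch follows; the repo ID used here is a placeholder, not a confirmed Hub path.

```python
# Minimal usage sketch; "dynamofl/dynamo-8b" below is a placeholder repo ID.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "dynamofl/dynamo-8b"  # placeholder, not a confirmed checkpoint path
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, torch_dtype=torch.bfloat16)

# The model is not instruction-tuned, so prompt it as plain text continuation
# rather than with a chat template. (German: "The capital of Germany is")
inputs = tokenizer("Die Hauptstadt von Deutschland ist", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```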