LVouk committed
Commit 2d6bec7
1 Parent(s): 01b039d

Update README.md

Files changed (1): README.md (+1 −1)
README.md CHANGED

@@ -16,7 +16,7 @@ Meltemi is built on top of [Mistral-7B](https://huggingface.co/mistralai/Mistral
 # Model Information
 
 - Vocabulary extension of the Mistral-7B tokenizer with Greek tokens
-- Trained with 8k context length
+- 8192 context length
 - We extend the pretraining of Mistral-7B with added proficiency for the Greek language, by utilizing a large corpus consisting of approximately **40 billion tokens**.
 * This corpus includes 28.5 billion monolingual Greek tokens, constructed from publicly available resources. Additionaly, to mitigate catastrophic forgetting and ensure that the model has bilingual capabilities, we use additional sub-corpora with monolingual English texts (10.5 billion tokens) and Greek-English parallel data (600 million tokens).
 * This corpus has been processed, filtered, and deduplicated to ensure data quality (a detailed description of our data processing pipeline will be published in our upcoming paper) and is outlined below:
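The sub-corpus sizes quoted in the diff can be checked against the stated ~40 billion token total with a quick sanity calculation (a minimal sketch; the figures are taken directly from the README bullets above):

```python
# Sanity-check the corpus arithmetic stated in the README:
# 28.5B monolingual Greek + 10.5B monolingual English + 0.6B parallel
greek_tokens = 28.5e9
english_tokens = 10.5e9
parallel_tokens = 0.6e9

total = greek_tokens + english_tokens + parallel_tokens
print(f"{total / 1e9:.1f}B tokens")  # 39.6B, consistent with "approximately 40 billion"
```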