eryk-mazus commited on
Commit
aecbcac
·
verified ·
1 Parent(s): 87123a2

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +1 -1
README.md CHANGED
@@ -19,7 +19,7 @@ widget:
19
 
20
  ![image/png](https://cdn-uploads.huggingface.co/production/uploads/61bf0e11c88f3fd22f654059/EMSrPEzAFkjY9nvbaJoC3.png)
21
 
22
- # Polka-1.1b
23
 
24
 
25
  `polka-1.1b` takes the [TinyLlama-1.1B](https://huggingface.co/TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T) model and enhances it by continuing pretraining on an additional **5.7 billion Polish tokens**, primarily sourced from the [MADLAD-400](https://arxiv.org/abs/2309.04662) dataset. The tokens were sampled in a 10:1 ratio between Polish and English shards using [DSIR](https://github.com/p-lambda/dsir). Furthermore, Polka extends the TinyLlama tokenizer's vocabulary to 43,882 tokens, improving its efficiency for generating Polish text.
 
19
 
20
  ![image/png](https://cdn-uploads.huggingface.co/production/uploads/61bf0e11c88f3fd22f654059/EMSrPEzAFkjY9nvbaJoC3.png)
21
 
22
+ # polka-1.1b
23
 
24
 
25
  `polka-1.1b` takes the [TinyLlama-1.1B](https://huggingface.co/TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T) model and enhances it by continuing pretraining on an additional **5.7 billion Polish tokens**, primarily sourced from the [MADLAD-400](https://arxiv.org/abs/2309.04662) dataset. The tokens were sampled in a 10:1 ratio between Polish and English shards using [DSIR](https://github.com/p-lambda/dsir). Furthermore, Polka extends the TinyLlama tokenizer's vocabulary to 43,882 tokens, improving its efficiency for generating Polish text.