javier-ab-bsc committed ed68404 (parent: 1117366): Update README.md

README.md CHANGED
@@ -149,7 +149,7 @@ to be adapted before continuing its pre-training with data in the target languag
 1) We trained our own BPE tokenizer for Catalan, Spanish, and English, and replaced the original BLOOM tokenizer and vocabulary with it. This procedure implied a downsizing of BLOOM's original embedding layer and, therefore, a model compression from 7.1B parameters to 6.3B.
 2) The embeddings corresponding to tokens that are present in both the original and the target vocabulary (matching tokens) were used for initialization.
 3) The embeddings of tokens not present in BLOOM's original vocabulary were initialized as the average of all embeddings.
-4) The model was initialized with the weights from
+4) The model was initialized with the weights from BLOOM-7.1B, and with our adapted tokenizer (step 1) and embeddings (steps 2-3).
 5) The model was then trained on a corpus that contains a mixture of Catalan, Spanish, and English data.
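
The diff above completes step 4 of the vocabulary-adaptation procedure. Below is a minimal sketch, not the authors' released code, of how steps 1-4 could be reproduced with the Hugging Face `transformers` API: matching tokens keep their original BLOOM embeddings, unseen tokens get the mean embedding, and the embedding layer is shrunk to the new vocabulary. The tokenizer path `./cat-es-en-bpe` is a placeholder for the retrained BPE tokenizer from step 1.

```python
# Sketch of the vocabulary swap described in steps 1-4 (assumptions noted inline).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

old_tok = AutoTokenizer.from_pretrained("bigscience/bloom-7b1")
new_tok = AutoTokenizer.from_pretrained("./cat-es-en-bpe")  # placeholder path (step 1)

model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-7b1")
old_emb = model.get_input_embeddings().weight.detach().clone()

# Step 3: initialize every new token to the mean of all original embeddings.
mean_emb = old_emb.mean(dim=0)
new_emb = mean_emb.repeat(len(new_tok), 1).clone()

# Step 2: tokens present in both vocabularies keep their original embedding.
old_vocab = old_tok.get_vocab()  # token string -> old id
for token, new_id in new_tok.get_vocab().items():
    old_id = old_vocab.get(token)
    if old_id is not None:
        new_emb[new_id] = old_emb[old_id]

# Step 4: shrink the embedding layer to the new vocabulary and load the
# re-initialized matrix on top of the BLOOM-7.1B weights.
model.resize_token_embeddings(len(new_tok))
model.get_input_embeddings().weight.data.copy_(new_emb)
```

Because BLOOM ties its input and output embeddings, resizing the input embeddings also shrinks the LM head, which is presumably where the reported 7.1B to 6.3B parameter reduction comes from.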

### Training data