Add a note about padded vocab size during training
#43 by mathemakitten - opened

README.md CHANGED
@@ -193,6 +193,8 @@ The BLOOM tokenizer ([link](https://huggingface.co/bigscience/tokenizer)) is a l

 - A vocabulary size of 250,680

+The vocabulary size was padded to 250,880 for practical purposes during training, but the effective model vocabulary size is 250,680.
+
 It was trained on a subset of a preliminary version of the corpus using alpha-weighting per language.

 </details>
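For readers of this discussion, a minimal sketch (not part of the README change itself) of how the two numbers can be checked in practice, assuming the `bigscience/bloom` checkpoint on the Hub and the `transformers` `AutoConfig`/`AutoTokenizer` APIs:

```python
from transformers import AutoConfig, AutoTokenizer

# Assumed checkpoint id; the same tokenizer is also published at bigscience/tokenizer.
config = AutoConfig.from_pretrained("bigscience/bloom")
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")

# Size of the embedding matrix, padded for practical purposes during training.
print("padded vocab size:", config.vocab_size)        # expected: 250880
# Effective tokenizer vocabulary, per the note added in this PR.
print("effective vocab size:", tokenizer.vocab_size)  # expected: 250680
```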