tathi committed on
Commit
09210b8
1 Parent(s): 820cd7d

update vocabulary size

Files changed (1)
  1. README.md +2 -1
README.md CHANGED
@@ -100,7 +100,8 @@ Please refer to [README.md](https://github.com/llm-jp/llm-jp-tokenizer) of `llm-jp-tokenizer`.
  - **Model:** Hugging Face Fast Tokenizer using Unigram byte-fallback model, which requires `tokenizers>=0.14.0`
  - **Training algorithm:** Merging Code/English/Japanese vocabularies constructed with SentencePiece Unigram byte-fallback and re-estimating scores with the EM algorithm.
  - **Training data:** A subset of the datasets for model pre-training
- - **Vocabulary size:** 97,024 (mixed vocabulary of Japanese, English, and source code)
+ - **Vocabulary size:** 96,867 (mixed vocabulary of Japanese, English, and source code)
+ - The actual size of the vocabulary in the pretrained model is 97,024 due to rounding up to a multiple of 256.
 
 
  ## Datasets
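
The two figures in this change are consistent: rounding 96,867 up to the next multiple of 256 gives exactly 97,024 (379 × 256), a common padding choice for the model's embedding table. Below is a minimal sketch of that arithmetic, with an optional check against the published tokenizer; the repo id in the commented-out lines is a hypothetical placeholder, not taken from this commit.

```python
import math

# Figures from the README diff above.
TOKENIZER_VOCAB = 96_867  # merged Japanese/English/code vocabulary
MODEL_VOCAB = 97_024      # vocabulary size in the pretrained model

# Round the tokenizer vocabulary up to the next multiple of 256,
# matching the round-up described in the added README line.
padded = math.ceil(TOKENIZER_VOCAB / 256) * 256
assert padded == MODEL_VOCAB == 379 * 256

# Optional sanity check against the actual tokenizer.
# from transformers import AutoTokenizer
# tok = AutoTokenizer.from_pretrained("llm-jp/<model-repo>")  # hypothetical id
# assert len(tok) == TOKENIZER_VOCAB
```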