Update README.md
README.md
CHANGED
```diff
@@ -99,7 +99,7 @@ Please refer to [README.md](https://github.com/llm-jp/llm-jp-tokenizer) of `llm-
 - **Model:** Hugging Face Fast Tokenizer using a Unigram byte-fallback model, which requires `tokenizers>=0.14.0`
 - **Training algorithm:** Merging Code/English/Japanese vocabularies constructed with SentencePiece Unigram byte-fallback and re-estimating scores with the EM algorithm.
 - **Training data:** A subset of the datasets for model pre-training
-- **Vocabulary size:**
+- **Vocabulary size:** 97,024 (mixed vocabulary of Japanese, English, and source code)
 
 
 ## Datasets
```
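The byte-fallback behavior mentioned in the tokenizer bullets can be sketched in plain Python: a piece that is missing from the vocabulary is decomposed into `<0xNN>` byte tokens instead of being mapped to a single `<unk>`, so no input is ever unrepresentable. This is a minimal illustration only; the toy `VOCAB` below is hypothetical, not the real 97,024-entry vocabulary, and the real tokenizer additionally scores segmentations with a Unigram language model.

```python
# Sketch of byte fallback in a Unigram tokenizer: a piece that is not in the
# vocabulary is emitted as UTF-8 byte tokens <0xNN> rather than <unk>.
# VOCAB is a hypothetical toy vocabulary for illustration.
VOCAB = {"▁Hello", "▁", "world"}

def encode_with_byte_fallback(pieces):
    tokens = []
    for piece in pieces:
        if piece in VOCAB:
            tokens.append(piece)
        else:
            # Fall back to one token per UTF-8 byte of the unknown piece.
            tokens.extend(f"<0x{b:02X}>" for b in piece.encode("utf-8"))
    return tokens

print(encode_with_byte_fallback(["▁Hello", "猫"]))
# → ['▁Hello', '<0xE7>', '<0x8C>', '<0xAB>']
```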