Update README.md
README.md
CHANGED
```diff
@@ -99,7 +99,7 @@ Please refer to [README.md](https://github.com/llm-jp/llm-jp-tokenizer) of `llm-
 - **Model:** Hugging Face Fast Tokenizer using a Unigram byte-fallback model, which requires `tokenizers>=0.14.0`
 - **Training algorithm:** Merging Code/English/Japanese vocabularies constructed with SentencePiece Unigram byte-fallback and re-estimating scores with the EM algorithm.
 - **Training data:** A subset of the datasets for model pre-training
-- **Vocabulary size:**
+- **Vocabulary size:** 97,024 (mixed vocabulary of Japanese, English, and source code)
 
 
 ## Datasets
```
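The byte-fallback behavior mentioned in the tokenizer bullets can be sketched in plain Python: a piece that is missing from the vocabulary is decomposed into `<0xNN>` byte tokens instead of being mapped to a single `<unk>`, so no input is ever unrepresentable. This is a minimal illustration only; the toy `VOCAB` below is hypothetical, not the real 97,024-entry vocabulary, and the real tokenizer additionally scores segmentations with a Unigram language model.

```python
# Sketch of byte fallback in a Unigram tokenizer: a piece that is not in the
# vocabulary is emitted as UTF-8 byte tokens <0xNN> rather than <unk>.
# VOCAB is a hypothetical toy vocabulary for illustration.
VOCAB = {"▁Hello", "▁", "world"}

def encode_with_byte_fallback(pieces):
    tokens = []
    for piece in pieces:
        if piece in VOCAB:
            tokens.append(piece)
        else:
            # Fall back to one token per UTF-8 byte of the unknown piece.
            tokens.extend(f"<0x{b:02X}>" for b in piece.encode("utf-8"))
    return tokens

print(encode_with_byte_fallback(["▁Hello", "猫"]))
# → ['▁Hello', '<0xE7>', '<0x8C>', '<0xAB>']
```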