llm-jp/llm-jp-13b-v2.0


tathi committed on Apr 24

Commit 71d0e87 • 1 Parent(s): f9652f3

add tokenizer info

Files changed (1)

README.md +5 -4

README.md CHANGED

@@ -96,12 +96,13 @@ print(tokenizer.decode(output))
 ## Tokenizer (To be updated)
 The tokenizer of this model is based on [huggingface/tokenizers](https://github.com/huggingface/tokenizers) Unigram byte-fallback model.
-The vocabulary entries were converted from [`llm-jp-tokenizer v2.1 (50k)`](https://github.com/llm-jp/llm-jp-tokenizer/releases/tag/v2.1).
-Please refer to [README.md](https://github.com/llm-jp/llm-jp-tokenizer) of `llm-jp-tokenizer` for details on the vocabulary construction procedure.
 - **Model:** Hugging Face Fast Tokenizer using Unigram byte-fallback model which requires `tokenizers>=0.14.0`
-- **Training algorithm:** SentencePiece Unigram byte-fallback
 - **Training data:** A subset of the datasets for model pre-training
-- **Vocabulary size:** 50,570 (mixed vocabulary of Japanese, English, and source code)
 ## Datasets (To be updated)

 ## Tokenizer (To be updated)
 The tokenizer of this model is based on [huggingface/tokenizers](https://github.com/huggingface/tokenizers) Unigram byte-fallback model.
+The vocabulary entries were converted from [`llm-jp-tokenizer v2.2 (50k)`](https://github.com/llm-jp/llm-jp-tokenizer/releases/tag/v2.2).
+Please refer to [README.md](https://github.com/llm-jp/llm-jp-tokenizer) of `llm-jp-tokenizer` for details on the vocabulary construction procedure (plain SentencePiece training alone does not reproduce our vocabulary).
 - **Model:** Hugging Face Fast Tokenizer using Unigram byte-fallback model which requires `tokenizers>=0.14.0`
+- **Training algorithm:** Merging code/English/Japanese vocabularies, each constructed with SentencePiece Unigram byte-fallback, and re-estimating the scores with the EM algorithm.
 - **Training data:** A subset of the datasets for model pre-training
+- **Vocabulary size:** 48,588 (mixed vocabulary of Japanese, English, and source code)
 ## Datasets (To be updated)
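
The updated "Training algorithm" entry describes merging per-domain Unigram vocabularies and re-estimating piece scores with EM. The following is only a minimal sketch of that general idea, not the llm-jp procedure (see the `llm-jp-tokenizer` README for that). The toy vocabularies, the toy corpus, the uniform initialisation, the smoothing constant `eps`, and the helper names `merge_vocabs`, `expected_counts`, and `em_reestimate` are all assumptions made for illustration, and byte-fallback is omitted.

```python
import math
from collections import defaultdict


def merge_vocabs(*vocabs):
    """Union the piece sets and start from a uniform distribution (assumption)."""
    pieces = set().union(*vocabs)
    return {piece: 1.0 / len(pieces) for piece in pieces}


def expected_counts(text, probs, max_len):
    """E-step for one string: expected piece counts over all segmentations."""
    n = len(text)
    # alpha[i]: total probability of all segmentations of text[:i]
    alpha = [0.0] * (n + 1)
    alpha[0] = 1.0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            piece = text[j:i]
            if piece in probs:
                alpha[i] += alpha[j] * probs[piece]
    # beta[j]: total probability of all segmentations of text[j:]
    beta = [0.0] * (n + 1)
    beta[n] = 1.0
    for j in range(n - 1, -1, -1):
        for i in range(j + 1, min(n, j + max_len) + 1):
            piece = text[j:i]
            if piece in probs:
                beta[j] += probs[piece] * beta[i]
    counts = defaultdict(float)
    if alpha[n] == 0.0:
        return counts  # string cannot be segmented with this vocabulary
    for j in range(n):
        for i in range(j + 1, min(n, j + max_len) + 1):
            piece = text[j:i]
            if piece in probs:
                # posterior probability that this piece spans text[j:i]
                counts[piece] += alpha[j] * probs[piece] * beta[i] / alpha[n]
    return counts


def em_reestimate(corpus, probs, iterations=5, eps=1e-6):
    """Alternate E-steps and M-steps; eps keeps unseen pieces in the vocabulary."""
    max_len = max(len(piece) for piece in probs)
    for _ in range(iterations):
        total = defaultdict(float)
        for text in corpus:
            for piece, count in expected_counts(text, probs, max_len).items():
                total[piece] += count
        norm = sum(total.values()) + eps * len(probs)
        probs = {piece: (total[piece] + eps) / norm for piece in probs}
    return probs


# Toy stand-ins for the code / English / Japanese vocabularies (hypothetical).
code_vocab = {"def", "(", ")", "d", "e", "f"}
en_vocab = {"the", "t", "h", "e"}
ja_vocab = {"日本", "語", "日", "本"}
corpus = ["def", "the", "日本語", "the"]

probs = em_reestimate(corpus, merge_vocabs(code_vocab, en_vocab, ja_vocab))
scores = {piece: math.log(p) for piece, p in probs.items()}  # Unigram log-scores
```

The sketch works on raw probabilities, so it is only suitable for short strings; a real implementation would work in log space and on a much larger corpus.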