hiroshi-matsuda-rit committed on
Commit 1fa7d94
1 Parent(s): 49fa852

Update README.md

Files changed (1)
  1. README.md +2 -2
README.md CHANGED
@@ -55,7 +55,7 @@ Checkpoints format: Hugging Face Transformers
  import torch
  from transformers import AutoTokenizer, AutoModelForCausalLM
  tokenizer = AutoTokenizer.from_pretrained("llm-jp/llm-jp-13b-v2.0")
- model = AutoModelForCausalLM.from_pretrained("llm-jp/llm-jp-13b-v2.0", device_map="auto", torch_dtype=torch.float16)
+ model = AutoModelForCausalLM.from_pretrained("llm-jp/llm-jp-13b-v2.0", device_map="auto", torch_dtype=torch.bfloat16)
  text = "自然言語処理とは何か"
  tokenized_input = tokenizer.encode(text, add_special_tokens=False, return_tensors="pt").to(model.device)
  with torch.no_grad():
@@ -97,7 +97,7 @@ The tokenizer of this model is based on [huggingface/tokenizers](https://github.
  The vocabulary entries were converted from [`llm-jp-tokenizer v2.2 (100k: code20K_en40K_ja60K.ver2.2)`](https://github.com/llm-jp/llm-jp-tokenizer/releases/tag/v2.2).
  Please refer to [README.md](https://github.com/llm-jp/llm-jp-tokenizer) of `llm-jp-tokenizer` for details on the vocabulary construction procedure (the pure SentencePiece training does not reproduce our vocabulary).

- - **Model:** Hugging Face Fast Tokenizer using Unigram byte-fallback model which requires `tokenizers>=0.14.0`
+ - **Model:** Hugging Face Fast Tokenizer using Unigram byte-fallback model
  - **Training algorithm:** Merging the Code/English/Japanese vocabularies constructed with SentencePiece Unigram byte-fallback and re-estimating the scores with the EM algorithm.
  - **Training data:** A subset of the datasets for model pre-training
  - **Vocabulary size:** 96,867 (mixed vocabulary of Japanese, English, and source code)
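For reference, the snippet touched by the first hunk can be completed into a runnable example. The following is a minimal sketch assembled from the lines visible in the diff; the `generate()` call and its sampling parameters are illustrative assumptions, not taken from the README.

```python
# Load llm-jp/llm-jp-13b-v2.0 in bfloat16 (the dtype this commit switches to)
# and run a short generation. Requires enough GPU or offload memory for a
# 13B-parameter model in 16-bit precision.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("llm-jp/llm-jp-13b-v2.0")
model = AutoModelForCausalLM.from_pretrained(
    "llm-jp/llm-jp-13b-v2.0",
    device_map="auto",
    torch_dtype=torch.bfloat16,  # changed from torch.float16 in this commit
)

text = "自然言語処理とは何か"  # "What is natural language processing?"
tokenized_input = tokenizer.encode(
    text, add_special_tokens=False, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    # Sampling settings below are placeholders, not the README's values.
    output = model.generate(
        tokenized_input,
        max_new_tokens=100,
        do_sample=True,
        top_p=0.95,
        temperature=0.7,
    )

print(tokenizer.decode(output[0]))
```

`torch.bfloat16` has the same 16-bit memory footprint as `torch.float16` but a wider exponent range, which makes it less prone to overflow; that dtype swap is the only change to the snippet.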
 
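The second hunk only drops the `tokenizers>=0.14.0` note from the tokenizer description; since this commit touches README.md alone, the fast Unigram byte-fallback tokenizer itself is unchanged. A quick way to inspect it is sketched below; the inline expectations are inferred from the README, not verified output.

```python
# Inspect the Hugging Face fast tokenizer (Unigram with byte fallback)
# that the README describes.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("llm-jp/llm-jp-13b-v2.0")

print(tokenizer.is_fast)  # expected True: backed by huggingface/tokenizers
print(len(tokenizer))     # expected ~96,867, plus any added special tokens

# Byte fallback decomposes characters outside the learned vocabulary into
# byte-level tokens instead of mapping them to <unk>, so text round-trips
# through encode/decode without loss.
ids = tokenizer.encode("自然言語処理とは何か", add_special_tokens=False)
print(tokenizer.decode(ids))
```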