hiroshi-matsuda-rit committed on
Commit 1fa7d94
1 Parent(s): 49fa852

Update README.md

Files changed (1)
  1. README.md +2 -2
README.md CHANGED
@@ -55,7 +55,7 @@ Checkpoints format: Hugging Face Transformers
  import torch
  from transformers import AutoTokenizer, AutoModelForCausalLM
  tokenizer = AutoTokenizer.from_pretrained("llm-jp/llm-jp-13b-v2.0")
- model = AutoModelForCausalLM.from_pretrained("llm-jp/llm-jp-13b-v2.0", device_map="auto", torch_dtype=torch.float16)
+ model = AutoModelForCausalLM.from_pretrained("llm-jp/llm-jp-13b-v2.0", device_map="auto", torch_dtype=torch.bfloat16)
  text = "自然言語処理とは何か"
  tokenized_input = tokenizer.encode(text, add_special_tokens=False, return_tensors="pt").to(model.device)
  with torch.no_grad():
@@ -97,7 +97,7 @@ The tokenizer of this model is based on [huggingface/tokenizers](https://github.
  The vocabulary entries were converted from [`llm-jp-tokenizer v2.2 (100k: code20K_en40K_ja60K.ver2.2)`](https://github.com/llm-jp/llm-jp-tokenizer/releases/tag/v2.2).
  Please refer to [README.md](https://github.com/llm-jp/llm-jp-tokenizer) of `llm-jp-tokenizer` for details on the vocabulary construction procedure (the pure SentencePiece training does not reproduce our vocabulary).

- - **Model:** Hugging Face Fast Tokenizer using Unigram byte-fallback model which requires `tokenizers>=0.14.0`
+ - **Model:** Hugging Face Fast Tokenizer using Unigram byte-fallback model
  - **Training algorithm:** Merging the Code/English/Japanese vocabularies constructed with SentencePiece Unigram byte-fallback and re-estimating the scores with the EM algorithm.
  - **Training data:** A subset of the datasets for model pre-training
  - **Vocabulary size:** 96,867 (mixed vocabulary of Japanese, English, and source code)
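For reference, the snippet touched by the first hunk can be completed into a runnable example. The following is a minimal sketch assembled from the lines visible in the diff; the `generate()` call and its sampling parameters are illustrative assumptions, not taken from the README.

```python
# Load llm-jp/llm-jp-13b-v2.0 in bfloat16 (the dtype this commit switches to)
# and run a short generation. Requires enough GPU or offload memory for a
# 13B-parameter model in 16-bit precision.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("llm-jp/llm-jp-13b-v2.0")
model = AutoModelForCausalLM.from_pretrained(
    "llm-jp/llm-jp-13b-v2.0",
    device_map="auto",
    torch_dtype=torch.bfloat16,  # changed from torch.float16 in this commit
)

text = "自然言語処理とは何か"  # "What is natural language processing?"
tokenized_input = tokenizer.encode(
    text, add_special_tokens=False, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    # Sampling settings below are placeholders, not the README's values.
    output = model.generate(
        tokenized_input,
        max_new_tokens=100,
        do_sample=True,
        top_p=0.95,
        temperature=0.7,
    )

print(tokenizer.decode(output[0]))
```

`torch.bfloat16` has the same 16-bit memory footprint as `torch.float16` but a wider exponent range, which makes it less prone to overflow; that dtype swap is the only change to the snippet.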
 
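The second hunk only drops the `tokenizers>=0.14.0` note from the tokenizer description; since this commit touches README.md alone, the fast Unigram byte-fallback tokenizer itself is unchanged. A quick way to inspect it is sketched below; the inline expectations are inferred from the README, not verified output.

```python
# Inspect the Hugging Face fast tokenizer (Unigram with byte fallback)
# that the README describes.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("llm-jp/llm-jp-13b-v2.0")

print(tokenizer.is_fast)  # expected True: backed by huggingface/tokenizers
print(len(tokenizer))     # expected ~96,867, plus any added special tokens

# Byte fallback decomposes characters outside the learned vocabulary into
# byte-level tokens instead of mapping them to <unk>, so text round-trips
# through encode/decode without loss.
ids = tokenizer.encode("自然言語処理とは何か", add_special_tokens=False)
print(tokenizer.decode(ids))
```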