hkiyomaru committed
Commit eeb438a
1 Parent(s): 47f4d4c

Update README.md

Files changed (1)
  1. README.md +20 -18
README.md CHANGED
@@ -21,13 +21,13 @@ library_name: transformers
 pipeline_tag: text-generation
 inference: false
 ---
- # llm-jp-13b-v1.0
+ # llm-jp-13b-v2.0
 
 This repository provides large language models developed by [LLM-jp](https://llm-jp.nii.ac.jp/), a collaborative project launched in Japan.
 
 | Model Variant |
 | :--- |
- |**Instruction models**|
+ |**Instruction models (To be updated)**|
 | [llm-jp-13b-instruct-full-jaster-v1.0](https://huggingface.co/llm-jp/llm-jp-13b-instruct-full-jaster-v1.0) |
 | [llm-jp-13b-instruct-full-jaster-dolly-oasst-v1.0](https://huggingface.co/llm-jp/llm-jp-13b-instruct-full-jaster-dolly-oasst-v1.0) |
 | [llm-jp-13b-instruct-full-dolly-oasst-v1.0](https://huggingface.co/llm-jp/llm-jp-13b-instruct-full-dolly-oasst-v1.0) |
@@ -39,25 +39,25 @@ This repository provides large language models developed by [LLM-jp](https://llm
 | |
 | :--- |
 |**Pre-trained models**|
- | [llm-jp-13b-v1.0](https://huggingface.co/llm-jp/llm-jp-13b-v1.0) |
- | [llm-jp-1.3b-v1.0](https://huggingface.co/llm-jp/llm-jp-1.3b-v1.0) |
- Checkpoints format: Hugging Face Transformers (Megatron-DeepSpeed format models are available [here](https://huggingface.co/llm-jp/llm-jp-13b-v1.0-mdsfmt))
+ | [llm-jp-13b-v2.0](https://huggingface.co/llm-jp/llm-jp-13b-v2.0) |
 
+ Checkpoints format: Hugging Face Transformers
 
- ## Required Libraries and Their Versions
+ 
+ ## Required Libraries and Their Versions (To be updated)
 
 - torch>=2.0.0
 - transformers>=4.34.0
 - tokenizers>=0.14.0
 - accelerate==0.23.0
 
- ## Usage
+ ## Usage (To be updated)
 
 ```python
 import torch
 from transformers import AutoTokenizer, AutoModelForCausalLM
- tokenizer = AutoTokenizer.from_pretrained("llm-jp/llm-jp-13b-v1.0")
- model = AutoModelForCausalLM.from_pretrained("llm-jp/llm-jp-13b-v1.0", device_map="auto", torch_dtype=torch.float16)
+ tokenizer = AutoTokenizer.from_pretrained("llm-jp/llm-jp-13b-v2.0")
+ model = AutoModelForCausalLM.from_pretrained("llm-jp/llm-jp-13b-v2.0", device_map="auto", torch_dtype=torch.float16)
 text = "自然言語処理とは何か"
 tokenized_input = tokenizer.encode(text, add_special_tokens=False, return_tensors="pt").to(model.device)
 with torch.no_grad():
@@ -72,7 +72,7 @@ print(tokenizer.decode(output))
 ```
 
 
- ## Model Details
+ ## Model Details (To be updated)
 
 - **Model type:** Transformer-based Language Model
 - **Total seen tokens:** 300B
@@ -80,10 +80,9 @@ print(tokenizer.decode(output))
 |Model|Params|Layers|Hidden size|Heads|Context length|
 |:---:|:---:|:---:|:---:|:---:|:---:|
 |13b model|13b|40|5120|40|2048|
- |1.3b model|1.3b|24|2048|16|2048|
 
 
- ## Training
+ ## Training (To be updated)
 
 - **Pre-training:**
   - **Hardware:** 96 A100 40GB GPUs ([mdx cluster](https://mdx.jp/en/))
@@ -93,7 +92,8 @@ print(tokenizer.decode(output))
   - **Hardware:** 8 A100 40GB GPUs ([mdx cluster](https://mdx.jp/en/))
   - **Software:** [TRL](https://github.com/huggingface/trl), [PEFT](https://github.com/huggingface/peft), and [DeepSpeed](https://github.com/microsoft/DeepSpeed)
 
- ## Tokenizer
+ ## Tokenizer (To be updated)
+ 
 The tokenizer of this model is based on [huggingface/tokenizers](https://github.com/huggingface/tokenizers) Unigram byte-fallback model.
 The vocabulary entries were converted from [`llm-jp-tokenizer v2.1 (50k)`](https://github.com/llm-jp/llm-jp-tokenizer/releases/tag/v2.1).
 Please refer to [README.md](https://github.com/llm-jp/llm-jp-tokenizer) of `llm-ja-tokenizer` for details on the vocabulary construction procedure.
@@ -103,7 +103,7 @@ Please refer to [README.md](https://github.com/llm-jp/llm-jp-tokenizer) of `llm-
 - **Vocabulary size:** 50,570 (mixed vocabulary of Japanese, English, and source code)
 
 
- ## Datasets
+ ## Datasets (To be updated)
 
 ### Pre-training
 
@@ -120,7 +120,7 @@ The models have been pre-trained using a blend of the following datasets.
 The pre-training was continuously conducted using a total of 10 folds of non-overlapping data, each consisting of approximately 27-28B tokens.
 We finalized the pre-training with additional (potentially) high-quality 27B tokens data obtained from the identical source datasets listed above used for the 10-fold data.
 
- ### Instruction tuning
+ ### Instruction tuning (To be updated)
 
 The models have been fine-tuned on the following datasets.
 
@@ -131,7 +131,8 @@ The models have been fine-tuned on the following datasets.
 ||[OpenAssistant Conversations Dataset](https://huggingface.co/datasets/OpenAssistant/oasst1)| A translated one by DeepL in LLM-jp |
 
 
- ## Evaluation
+ ## Evaluation (To be updated)
+ 
 You can view the evaluation results of several LLMs on this [leaderboard](http://wandb.me/llm-jp-leaderboard). We used [llm-jp-eval](https://github.com/llm-jp/llm-jp-eval) for the evaluation.
 
 ## Risks and Limitations
@@ -149,7 +150,8 @@ llm-jp(at)nii.ac.jp
 [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
 
 
- ## Model Card Authors
+ ## Model Card Authors (To be updated)
+ 
 *The names are listed in alphabetical order.*
 
- Hirokazu Kiyomaru, Hiroshi Matsuda, Jun Suzuki, Namgi Han, Saku Sugawara, Shota Sasaki, Shuhei Kurita, Taishi Nakamura, Takumi Okamoto.
+ Hirokazu Kiyomaru.
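
The Usage snippet in the diff above is cut off at the hunk boundary, so the generation call itself does not appear. The sketch below shows one way to run the llm-jp-13b-v2.0 checkpoint named in this commit end to end with the libraries listed in the card; the `max_new_tokens` value and the sampling settings are illustrative assumptions, not values taken from the README.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("llm-jp/llm-jp-13b-v2.0")
model = AutoModelForCausalLM.from_pretrained(
    "llm-jp/llm-jp-13b-v2.0", device_map="auto", torch_dtype=torch.float16
)

text = "自然言語処理とは何か"  # "What is natural language processing?"
tokenized_input = tokenizer.encode(
    text, add_special_tokens=False, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    # Decoding settings are placeholders, not the model card's values.
    output = model.generate(
        tokenized_input,
        max_new_tokens=100,
        do_sample=True,
        top_p=0.95,
        temperature=0.7,
    )

print(tokenizer.decode(output[0]))
```

As in the card's own snippet, `device_map="auto"` (via `accelerate`) places the 13B weights across the available devices and `torch_dtype=torch.float16` halves their memory footprint.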
 
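The Tokenizer section (marked "To be updated") describes a [huggingface/tokenizers](https://github.com/huggingface/tokenizers) Unigram model with byte fallback and a 50,570-entry vocabulary converted from `llm-jp-tokenizer v2.1 (50k)`. A minimal way to inspect those properties, assuming the v2.0 repository ships its tokenizer in the standard Transformers layout and that the v1.0 figures still apply:

```python
from transformers import AutoTokenizer

# Assumes the tokenizer is published alongside the model weights.
tokenizer = AutoTokenizer.from_pretrained("llm-jp/llm-jp-13b-v2.0")

# Vocabulary size; the v1.0 card lists 50,570 entries.
print(len(tokenizer))

# Unigram segmentation of Japanese text. Byte fallback means characters
# outside the vocabulary decompose into byte-level tokens instead of <unk>.
ids = tokenizer.encode("自然言語処理とは何か", add_special_tokens=False)
print(ids)
print(tokenizer.convert_ids_to_tokens(ids))
```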
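
The Model Details table lists 40 layers, a hidden size of 5120, 40 heads, and a 2048-token context for the 13b configuration. A rough consistency check of the parameter count follows; it ignores biases and layer norms and assumes a standard GPT-style block with a 4x MLP expansion and untied embeddings, none of which the card states explicitly.

```python
# Back-of-the-envelope parameter count for the 13b configuration.
layers, hidden, vocab = 40, 5120, 50_570

attention = 4 * hidden * hidden  # Q, K, V and output projections
mlp = 8 * hidden * hidden        # two projections with an assumed 4x expansion
embeddings = 2 * vocab * hidden  # input and output embeddings, assumed untied

total = layers * (attention + mlp) + embeddings
print(f"{total / 1e9:.1f}B parameters")  # ~13.1B, consistent with the "13b" label
```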