hkiyomaru committed
Commit eeb438a
1 Parent(s): 47f4d4c

Update README.md

Files changed (1)
  1. README.md +20 -18
README.md CHANGED
@@ -21,13 +21,13 @@ library_name: transformers
 pipeline_tag: text-generation
 inference: false
 ---
- # llm-jp-13b-v1.0
+ # llm-jp-13b-v2.0
 
 This repository provides large language models developed by [LLM-jp](https://llm-jp.nii.ac.jp/), a collaborative project launched in Japan.
 
 | Model Variant |
 | :--- |
- |**Instruction models**|
+ |**Instruction models (To be updated)**|
 | [llm-jp-13b-instruct-full-jaster-v1.0](https://huggingface.co/llm-jp/llm-jp-13b-instruct-full-jaster-v1.0) |
 | [llm-jp-13b-instruct-full-jaster-dolly-oasst-v1.0](https://huggingface.co/llm-jp/llm-jp-13b-instruct-full-jaster-dolly-oasst-v1.0) |
 | [llm-jp-13b-instruct-full-dolly-oasst-v1.0](https://huggingface.co/llm-jp/llm-jp-13b-instruct-full-dolly-oasst-v1.0) |
@@ -39,25 +39,25 @@ This repository provides large language models developed by [LLM-jp](https://llm
 | |
 | :--- |
 |**Pre-trained models**|
- | [llm-jp-13b-v1.0](https://huggingface.co/llm-jp/llm-jp-13b-v1.0) |
- | [llm-jp-1.3b-v1.0](https://huggingface.co/llm-jp/llm-jp-1.3b-v1.0) |
- Checkpoints format: Hugging Face Transformers (Megatron-DeepSpeed format models are available [here](https://huggingface.co/llm-jp/llm-jp-13b-v1.0-mdsfmt))
+ | [llm-jp-13b-v2.0](https://huggingface.co/llm-jp/llm-jp-13b-v2.0) |
 
+ Checkpoints format: Hugging Face Transformers
 
- ## Required Libraries and Their Versions
+ 
+ ## Required Libraries and Their Versions (To be updated)
 
 - torch>=2.0.0
 - transformers>=4.34.0
 - tokenizers>=0.14.0
 - accelerate==0.23.0
 
- ## Usage
+ ## Usage (To be updated)
 
 ```python
 import torch
 from transformers import AutoTokenizer, AutoModelForCausalLM
- tokenizer = AutoTokenizer.from_pretrained("llm-jp/llm-jp-13b-v1.0")
- model = AutoModelForCausalLM.from_pretrained("llm-jp/llm-jp-13b-v1.0", device_map="auto", torch_dtype=torch.float16)
+ tokenizer = AutoTokenizer.from_pretrained("llm-jp/llm-jp-13b-v2.0")
+ model = AutoModelForCausalLM.from_pretrained("llm-jp/llm-jp-13b-v2.0", device_map="auto", torch_dtype=torch.float16)
 text = "自然言語処理とは何か"
 tokenized_input = tokenizer.encode(text, add_special_tokens=False, return_tensors="pt").to(model.device)
 with torch.no_grad():
@@ -72,7 +72,7 @@ print(tokenizer.decode(output))
 ```
 
 
- ## Model Details
+ ## Model Details (To be updated)
 
 - **Model type:** Transformer-based Language Model
 - **Total seen tokens:** 300B
@@ -80,10 +80,9 @@ print(tokenizer.decode(output))
 |Model|Params|Layers|Hidden size|Heads|Context length|
 |:---:|:---:|:---:|:---:|:---:|:---:|
 |13b model|13b|40|5120|40|2048|
- |1.3b model|1.3b|24|2048|16|2048|
 
 
- ## Training
+ ## Training (To be updated)
 
 - **Pre-training:**
   - **Hardware:** 96 A100 40GB GPUs ([mdx cluster](https://mdx.jp/en/))
@@ -93,7 +92,8 @@ print(tokenizer.decode(output))
   - **Hardware:** 8 A100 40GB GPUs ([mdx cluster](https://mdx.jp/en/))
   - **Software:** [TRL](https://github.com/huggingface/trl), [PEFT](https://github.com/huggingface/peft), and [DeepSpeed](https://github.com/microsoft/DeepSpeed)
 
- ## Tokenizer
+ ## Tokenizer (To be updated)
+ 
 The tokenizer of this model is based on [huggingface/tokenizers](https://github.com/huggingface/tokenizers) Unigram byte-fallback model.
 The vocabulary entries were converted from [`llm-jp-tokenizer v2.1 (50k)`](https://github.com/llm-jp/llm-jp-tokenizer/releases/tag/v2.1).
 Please refer to [README.md](https://github.com/llm-jp/llm-jp-tokenizer) of `llm-ja-tokenizer` for details on the vocabulary construction procedure.
@@ -103,7 +103,7 @@ Please refer to [README.md](https://github.com/llm-jp/llm-jp-tokenizer) of `llm-
 - **Vocabulary size:** 50,570 (mixed vocabulary of Japanese, English, and source code)
 
 
- ## Datasets
+ ## Datasets (To be updated)
 
 ### Pre-training
 
@@ -120,7 +120,7 @@ The models have been pre-trained using a blend of the following datasets.
 The pre-training was continuously conducted using a total of 10 folds of non-overlapping data, each consisting of approximately 27-28B tokens.
 We finalized the pre-training with additional (potentially) high-quality 27B tokens data obtained from the identical source datasets listed above used for the 10-fold data.
 
- ### Instruction tuning
+ ### Instruction tuning (To be updated)
 
 The models have been fine-tuned on the following datasets.
 
@@ -131,7 +131,8 @@ The models have been fine-tuned on the following datasets.
 ||[OpenAssistant Conversations Dataset](https://huggingface.co/datasets/OpenAssistant/oasst1)| A translated one by DeepL in LLM-jp |
 
 
- ## Evaluation
+ ## Evaluation (To be updated)
+ 
 You can view the evaluation results of several LLMs on this [leaderboard](http://wandb.me/llm-jp-leaderboard). We used [llm-jp-eval](https://github.com/llm-jp/llm-jp-eval) for the evaluation.
 
 ## Risks and Limitations
@@ -149,7 +150,8 @@ llm-jp(at)nii.ac.jp
 [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
 
 
- ## Model Card Authors
+ ## Model Card Authors (To be updated)
+ 
 *The names are listed in alphabetical order.*
 
- Hirokazu Kiyomaru, Hiroshi Matsuda, Jun Suzuki, Namgi Han, Saku Sugawara, Shota Sasaki, Shuhei Kurita, Taishi Nakamura, Takumi Okamoto.
+ Hirokazu Kiyomaru.
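
The Usage snippet in the diff above is cut off at the hunk boundary, so the generation call itself does not appear. The sketch below shows one way to run the llm-jp-13b-v2.0 checkpoint named in this commit end to end with the libraries listed in the card; the `max_new_tokens` value and the sampling settings are illustrative assumptions, not values taken from the README.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("llm-jp/llm-jp-13b-v2.0")
model = AutoModelForCausalLM.from_pretrained(
    "llm-jp/llm-jp-13b-v2.0", device_map="auto", torch_dtype=torch.float16
)

text = "自然言語処理とは何か"  # "What is natural language processing?"
tokenized_input = tokenizer.encode(
    text, add_special_tokens=False, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    # Decoding settings are placeholders, not the model card's values.
    output = model.generate(
        tokenized_input,
        max_new_tokens=100,
        do_sample=True,
        top_p=0.95,
        temperature=0.7,
    )

print(tokenizer.decode(output[0]))
```

As in the card's own snippet, `device_map="auto"` (via `accelerate`) places the 13B weights across the available devices and `torch_dtype=torch.float16` halves their memory footprint.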
 
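The Tokenizer section (marked "To be updated") describes a [huggingface/tokenizers](https://github.com/huggingface/tokenizers) Unigram model with byte fallback and a 50,570-entry vocabulary converted from `llm-jp-tokenizer v2.1 (50k)`. A minimal way to inspect those properties, assuming the v2.0 repository ships its tokenizer in the standard Transformers layout and that the v1.0 figures still apply:

```python
from transformers import AutoTokenizer

# Assumes the tokenizer is published alongside the model weights.
tokenizer = AutoTokenizer.from_pretrained("llm-jp/llm-jp-13b-v2.0")

# Vocabulary size; the v1.0 card lists 50,570 entries.
print(len(tokenizer))

# Unigram segmentation of Japanese text. Byte fallback means characters
# outside the vocabulary decompose into byte-level tokens instead of <unk>.
ids = tokenizer.encode("自然言語処理とは何か", add_special_tokens=False)
print(ids)
print(tokenizer.convert_ids_to_tokens(ids))
```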
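
The Model Details table lists 40 layers, a hidden size of 5120, 40 heads, and a 2048-token context for the 13b configuration. A rough consistency check of the parameter count follows; it ignores biases and layer norms and assumes a standard GPT-style block with a 4x MLP expansion and untied embeddings, none of which the card states explicitly.

```python
# Back-of-the-envelope parameter count for the 13b configuration.
layers, hidden, vocab = 40, 5120, 50_570

attention = 4 * hidden * hidden  # Q, K, V and output projections
mlp = 8 * hidden * hidden        # two projections with an assumed 4x expansion
embeddings = 2 * vocab * hidden  # input and output embeddings, assumed untied

total = layers * (attention + mlp) + embeddings
print(f"{total / 1e9:.1f}B parameters")  # ~13.1B, consistent with the "13b" label
```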