---
license: apache-2.0
language:
  - en
  - ja
programming_language:
  - C
  - C++
  - C#
  - Go
  - Java
  - JavaScript
  - Lua
  - PHP
  - Python
  - Ruby
  - Rust
  - Scala
  - TypeScript
library_name: transformers
pipeline_tag: text-generation
inference: false
---
# llm-jp-13b-v2.0

This repository provides large language models developed by [LLM-jp](https://llm-jp.nii.ac.jp/), a collaborative project launched in Japan.

| Model Variant |
| :--- |
|**Instruction models**|
| [llm-jp-13b-instruct-full-dolly-ichikara_004_001_single-oasst-oasst2-v2.0](https://huggingface.co/llm-jp/llm-jp-13b-instruct-full-dolly-ichikara_004_001_single-oasst-oasst2-v2.0) |
| [llm-jp-13b-instruct-full-ac_001-dolly-ichikara_004_001_single-oasst-oasst2-v2.0](https://huggingface.co/llm-jp/llm-jp-13b-instruct-full-ac_001-dolly-ichikara_004_001_single-oasst-oasst2-v2.0) |
| [llm-jp-13b-instruct-full-ac_001_16x-dolly-ichikara_004_001_single-oasst-oasst2-v2.0](https://huggingface.co/llm-jp/llm-jp-13b-instruct-full-ac_001_16x-dolly-ichikara_004_001_single-oasst-oasst2-v2.0) |
|**Pre-trained models**|
| [llm-jp-13b-v2.0](https://huggingface.co/llm-jp/llm-jp-13b-v2.0) |

Checkpoints format: Hugging Face Transformers


## Required Libraries and Their Versions

- torch>=2.2.2
- transformers>=4.39.3
- tokenizers>=0.15.2
- accelerate>=0.27.2
- flash-attn>=2.5.6
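
If you want to confirm your environment meets these requirements, a minimal sketch (not part of the original card) using the standard library:

```python
# Check installed package versions against the requirements listed above.
from importlib.metadata import version, PackageNotFoundError

requirements = {
    "torch": "2.2.2",
    "transformers": "4.39.3",
    "tokenizers": "0.15.2",
    "accelerate": "0.27.2",
    "flash-attn": "2.5.6",
}
for pkg, minimum in requirements.items():
    try:
        print(f"{pkg}: installed {version(pkg)} (required >= {minimum})")
    except PackageNotFoundError:
        print(f"{pkg}: not installed (required >= {minimum})")
```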

## Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model; device_map="auto" spreads the weights across
# available GPUs, and float16 halves the memory footprint.
tokenizer = AutoTokenizer.from_pretrained("llm-jp/llm-jp-13b-v2.0")
model = AutoModelForCausalLM.from_pretrained("llm-jp/llm-jp-13b-v2.0", device_map="auto", torch_dtype=torch.float16)

text = "自然言語処理とは何か"  # "What is natural language processing?"
tokenized_input = tokenizer.encode(text, add_special_tokens=False, return_tensors="pt").to(model.device)

# Sample a continuation of up to 100 new tokens with nucleus sampling.
with torch.no_grad():
    output = model.generate(
        tokenized_input,
        max_new_tokens=100,
        do_sample=True,
        top_p=0.95,
        temperature=0.7,
        repetition_penalty=1.05,
    )[0]
print(tokenizer.decode(output))
```
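
Since `flash-attn` is listed among the required libraries, you can optionally enable FlashAttention-2 at load time. This variant is a sketch, not part of the original card; it relies on the `attn_implementation` argument supported by recent `transformers` releases:

```python
# Optional: load with FlashAttention-2 for faster attention (requires flash-attn
# to be installed and a half-precision dtype such as float16 or bfloat16).
model = AutoModelForCausalLM.from_pretrained(
    "llm-jp/llm-jp-13b-v2.0",
    device_map="auto",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
)
```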


## Model Details

- **Model type:** Transformer-based Language Model
- **Total seen tokens:** 256B

|Model|Params|Layers|Hidden size|Heads|Context length|
|:---:|:---:|:---:|:---:|:---:|:---:|
|13b model|13b|40|5120|40|4096|
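
The numbers in the table can be read back from the model configuration. A hedged sketch, assuming the LLaMA-style config attribute names used by this model:

```python
# Inspect the architecture hyperparameters without downloading the weights.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("llm-jp/llm-jp-13b-v2.0")
print(config.num_hidden_layers)        # 40 layers
print(config.hidden_size)              # 5120
print(config.num_attention_heads)      # 40 heads
print(config.max_position_embeddings)  # 4096 context length
```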


## Training

- **Pre-training:**
  - **Hardware:** 128 A100 40GB GPUs ([mdx cluster](https://mdx.jp/en/))
  - **Software:** Megatron-LM

- **Instruction tuning:**
  - **Hardware:** 8 A100 40GB GPUs ([mdx cluster](https://mdx.jp/en/))
  - **Software:** [TRL](https://github.com/huggingface/trl), [PEFT](https://github.com/huggingface/peft), and [DeepSpeed](https://github.com/microsoft/DeepSpeed)

## Tokenizer

The tokenizer of this model is based on the [huggingface/tokenizers](https://github.com/huggingface/tokenizers) Unigram byte-fallback model.
The vocabulary entries were converted from [`llm-jp-tokenizer v2.2 (50k)`](https://github.com/llm-jp/llm-jp-tokenizer/releases/tag/v2.2).
Please refer to the [README.md](https://github.com/llm-jp/llm-jp-tokenizer) of `llm-jp-tokenizer` for details on the vocabulary construction procedure (pure SentencePiece training does not reproduce our vocabulary).

- **Model:** Hugging Face Fast Tokenizer using a Unigram byte-fallback model, which requires `tokenizers>=0.14.0`
- **Training algorithm:** Merging Code/English/Japanese vocabularies constructed with SentencePiece Unigram byte-fallback and re-estimating scores with the EM algorithm
- **Training data:** A subset of the datasets for model pre-training
- **Vocabulary size:** 96,867 (mixed vocabulary of Japanese, English, and source code)
  - The actual vocabulary size in the pre-trained model is 97,024 because the embedding matrix is rounded up to a multiple of 256. A quick way to see both figures is sketched below.
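
A minimal sketch (assuming network access to the Hub) that surfaces the tokenizer's vocabulary size and its byte-fallback behavior:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("llm-jp/llm-jp-13b-v2.0")

# The tokenizer's vocabulary (96,867 entries) is smaller than the model's
# embedding matrix (97,024 rows, padded to a multiple of 256).
print(len(tokenizer))  # expected: 96867

# Byte fallback: text outside the learned vocabulary decomposes into raw byte
# tokens instead of collapsing to an <unk> token.
print(tokenizer.tokenize("自然言語処理"))  # "natural language processing"
```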


## Datasets

### Pre-training

The models have been pre-trained using a blend of the following datasets.

| Language | Dataset | Tokens |
|:---:|:---:|:---:|
|Japanese|[Wikipedia](https://huggingface.co/datasets/wikipedia)|1.4B|
||[Common Crawl](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-corpus)|130.7B|
|English|[Wikipedia](https://huggingface.co/datasets/wikipedia)|4.7B|
||[The Pile](https://huggingface.co/datasets/EleutherAI/pile)|110.3B|
|Code|[The Stack](https://huggingface.co/datasets/bigcode/the-stack)|8.7B|
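
As a quick sanity check (a sketch, not from the original card), the per-dataset token counts above sum to roughly the 256B total seen tokens reported under Model Details:

```python
# Sum of the per-dataset token counts, in billions.
tokens_billions = {
    "ja_wikipedia": 1.4,
    "ja_common_crawl": 130.7,
    "en_wikipedia": 4.7,
    "en_pile": 110.3,
    "code_the_stack": 8.7,
}
print(sum(tokens_billions.values()))  # 255.8, i.e. ~256B
```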

### Instruction tuning (To be updated)

The models have been fine-tuned on the following datasets.
 
| Language | Dataset | Description |
|:---|:---:|:---:|
|Japanese|[jaster](https://github.com/llm-jp/llm-jp-eval)| Data automatically transformed from existing Japanese NLP datasets |
||[databricks-dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k)| Translated into Japanese by LLM-jp using DeepL |
||[OpenAssistant Conversations Dataset](https://huggingface.co/datasets/OpenAssistant/oasst1)| Translated into Japanese by LLM-jp using DeepL |


## Evaluation

You can view the evaluation results of several LLMs on this [leaderboard](http://wandb.me/llm-jp-leaderboard). We used [llm-jp-eval](https://github.com/llm-jp/llm-jp-eval) (v1.3.0) for the evaluation.

## Risks and Limitations

The models released here are still in the early stages of our research and development and have not been tuned to ensure outputs align with human intent and safety considerations.


## Send Questions to

llm-jp(at)nii.ac.jp


## License

[Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)


## Model Card Authors (To be updated)

*The names are listed in alphabetical order.*

Namgi Han, Tatsuya Hiraoka, Hirokazu Kiyomaru, Takashi Kodama, and Hiroshi Matsuda.