+---
+license: apache-2.0
+language:
+  - en
+  - ja
+programming_language:
+  - C
+  - C++
+  - C#
+  - Go
+  - Java
+  - JavaScript
+  - Lua
+  - PHP
+  - Python
+  - Ruby
+  - Rust
+  - Scala
+  - TypeScript
+library_name: transformers
+pipeline_tag: text-generation
+inference: false
+---
+# llm-jp-13b-v2.0
+This repository provides large language models developed by [LLM-jp](https://llm-jp.nii.ac.jp/), a collaborative project launched in Japan.
+| Model Variant |
+| :--- |
+|**Instruction models**|
+| [llm-jp-13b-instruct-full-dolly-ichikara_004_001_single-oasst-oasst2-v2.0](https://huggingface.co/llm-jp/llm-jp-13b-instruct-full-dolly-ichikara_004_001_single-oasst-oasst2-v2.0) |
+| [llm-jp-13b-instruct-full-ac_001-dolly-ichikara_004_001_single-oasst-oasst2-v2.0](https://huggingface.co/llm-jp/llm-jp-13b-instruct-full-ac_001-dolly-ichikara_004_001_single-oasst-oasst2-v2.0) |
+| [llm-jp-13b-instruct-full-ac_001_16x-dolly-ichikara_004_001_single-oasst-oasst2-v2.0](https://huggingface.co/llm-jp/llm-jp-13b-instruct-full-ac_001_16x-dolly-ichikara_004_001_single-oasst-oasst2-v2.0) |
+|**Pre-trained models**|
+| [llm-jp-13b-v2.0](https://huggingface.co/llm-jp/llm-jp-13b-v2.0) |
+Checkpoint format: Hugging Face Transformers
+## Required Libraries and Their Versions
+- torch>=2.2.2
+- transformers>=4.39.3
+- tokenizers>=0.15.2
+- accelerate>=0.27.2
+- flash-attn>=2.5.6
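A quick way to confirm that an environment satisfies the minimum versions listed above is sketched below. It uses only the standard library. The names `MINIMUM_VERSIONS`, `release_tuple`, `meets_minimum`, and `check_environment` are hypothetical helpers introduced for illustration, and the comparison is a simplification that looks only at the numeric release components (pre-release and local suffixes such as `rc1` or `+cu121` are ignored).

```python
from importlib.metadata import PackageNotFoundError, version

# Minimum versions copied from the list above.
MINIMUM_VERSIONS = {
    "torch": "2.2.2",
    "transformers": "4.39.3",
    "tokenizers": "0.15.2",
    "accelerate": "0.27.2",
    "flash-attn": "2.5.6",
}


def release_tuple(version_string):
    """Return the leading numeric components, e.g. "2.2.2+cu121" -> (2, 2, 2)."""
    parts = []
    for piece in version_string.split("+")[0].split("."):
        digits = ""
        for char in piece:
            if not char.isdigit():
                break
            digits += char
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare two version strings by their numeric release components."""
    return release_tuple(installed) >= release_tuple(minimum)


def check_environment(minimums=MINIMUM_VERSIONS):
    """Map each package name to "ok", "too old (<version>)", or "missing"."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            report[name] = "missing"
            continue
        if meets_minimum(installed, minimum):
            report[name] = "ok"
        else:
            report[name] = f"too old ({installed})"
    return report


if __name__ == "__main__":
    for name, status in check_environment().items():
        print(f"{name}: {status}")
```

For stricter comparisons (for example, treating release candidates as older than the final release), a dedicated version parser would be a better fit than this sketch.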
+## Usage
+```python
+import torch
+from transformers import AutoTokenizer, AutoModelForCausalLM
+# Load the tokenizer and the pre-trained model in half precision.
+tokenizer = AutoTokenizer.from_pretrained("llm-jp/llm-jp-13b-v2.0")
+model = AutoModelForCausalLM.from_pretrained("llm-jp/llm-jp-13b-v2.0", device_map="auto", torch_dtype=torch.float16)
+# Encode the prompt without special tokens and move it to the model's device.
+text = "自然言語処理とは何か"
+tokenized_input = tokenizer.encode(text, add_special_tokens=False, return_tensors="pt").to(model.device)
+# Sample up to 100 new tokens; [0] selects the single generated sequence.
+with torch.no_grad():
+    output = model.generate(
+        tokenized_input,
+        max_new_tokens=100,
+        do_sample=True,
+        top_p=0.95,
+        temperature=0.7,
+        repetition_penalty=1.05,
+    )[0]
+# The decoded text contains the prompt followed by the generated continuation.
+print(tokenizer.decode(output))
+```
+## Model Details
+- **Model type:** Transformer-based Language Model
+- **Total seen tokens:** 256B
+|Model|Params|Layers|Hidden size|Heads|Context length|
+|:---:|:---:|:---:|:---:|:---:|:---:|
+|13b model|13b|40|5120|40|4096|
+## Training
+- **Pre-training:**
+  - **Hardware:** 128 A100 40GB GPUs ([mdx cluster](https://mdx.jp/en/))
+  - **Software:** Megatron-LM
+- **Instruction tuning:**
+  - **Hardware:** 8 A100 40GB GPUs ([mdx cluster](https://mdx.jp/en/))
+  - **Software:** [TRL](https://github.com/huggingface/trl), [PEFT](https://github.com/huggingface/peft), and [DeepSpeed](https://github.com/microsoft/DeepSpeed)
+## Tokenizer
+The tokenizer of this model is based on the [huggingface/tokenizers](https://github.com/huggingface/tokenizers) Unigram byte-fallback model.
+The vocabulary entries were converted from [`llm-jp-tokenizer v2.2 (100k: code20K_en40K_ja60K.ver2.2)`](https://github.com/llm-jp/llm-jp-tokenizer/releases/tag/v2.2).
+Please refer to the [README.md](https://github.com/llm-jp/llm-jp-tokenizer) of `llm-jp-tokenizer` for details on the vocabulary construction procedure (pure SentencePiece training does not reproduce our vocabulary).
+- **Model:** Hugging Face Fast Tokenizer using the Unigram byte-fallback model, which requires `tokenizers>=0.14.0`
+- **Training algorithm:** Merging Code/English/Japanese vocabularies constructed with SentencePiece Unigram byte-fallback and re-estimating scores with the EM algorithm.
+- **Training data:** A subset of the datasets for model pre-training
+- **Vocabulary size:** 96,867 (mixed vocabulary of Japanese, English, and source code)
+  - The actual vocabulary size of the pre-trained model is 97,024 because it is rounded up to a multiple of 256.
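The round-up from the tokenizer vocabulary size to the model vocabulary size can be checked with a few lines of arithmetic. The function name `round_up_to_multiple` is a hypothetical helper used only for this illustration; the two sizes and the multiple of 256 come from the list above.

```python
def round_up_to_multiple(value, multiple):
    """Return the smallest multiple of `multiple` that is >= `value`."""
    # Ceiling division written with integer arithmetic only.
    return -(-value // multiple) * multiple


tokenizer_vocab_size = 96_867
padded_vocab_size = round_up_to_multiple(tokenizer_vocab_size, 256)
extra_rows = padded_vocab_size - tokenizer_vocab_size

print(padded_vocab_size)  # 97024
print(extra_rows)         # 157
```

The difference of 157 is the number of extra embedding rows introduced by the padding.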
+## Datasets
+### Pre-training
+The models have been pre-trained using a blend of the following datasets.
+| Language | Dataset | Tokens |
+|:---|:---|:---|
+|Japanese|[Wikipedia](https://huggingface.co/datasets/wikipedia)|1.4B|
+||[Common Crawl](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-corpus)|130.7B|
+|English|[Wikipedia](https://huggingface.co/datasets/wikipedia)|4.7B|
+||[The Pile](https://huggingface.co/datasets/EleutherAI/pile)|110.3B|
+|Code|[The Stack](https://huggingface.co/datasets/bigcode/the-stack)|8.7B|
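To see the language balance of the blend at a glance, the token counts in the table can be summed per language. This is only arithmetic on the figures above; the names `PRETRAINING_TOKENS_B` and `tokens_by_language` are hypothetical and introduced for illustration.

```python
# Token counts in billions, copied from the table above.
PRETRAINING_TOKENS_B = {
    ("Japanese", "Wikipedia"): 1.4,
    ("Japanese", "Common Crawl"): 130.7,
    ("English", "Wikipedia"): 4.7,
    ("English", "The Pile"): 110.3,
    ("Code", "The Stack"): 8.7,
}


def tokens_by_language(table):
    """Sum the per-dataset token counts for each language."""
    totals = {}
    for (language, _dataset), tokens in table.items():
        totals[language] = totals.get(language, 0.0) + tokens
    return totals


totals = tokens_by_language(PRETRAINING_TOKENS_B)
grand_total = sum(totals.values())

for language, tokens in totals.items():
    print(f"{language}: {tokens:.1f}B ({tokens / grand_total:.1%})")
print(f"Total: {grand_total:.1f}B")
```

The listed datasets sum to 255.8B tokens, which is close to the 256B total seen tokens reported under Model Details. How the blend was sampled during training is not stated here, so the shares above describe the corpus sizes and not necessarily the sampling ratios.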
+### Instruction tuning
+The models have been fine-tuned on the following datasets.
+| Language | Dataset | Description |
+|:---|:---|:---|
+|Japanese|[ichikara-instruction-004-001](https://liat-aip.sakura.ne.jp/wp/llm%e3%81%ae%e3%81%9f%e3%82%81%e3%81%ae%e6%97%a5%e6%9c%ac%e8%aa%9e%e3%82%a4%e3%83%b3%e3%82%b9%e3%83%88%e3%83%a9%e3%82%af%e3%82%b7%e3%83%a7%e3%83%b3%e3%83%87%e3%83%bc%e3%82%bf%e4%bd%9c%e6%88%90/llm%e3%81%ae%e3%81%9f%e3%82%81%e3%81%ae%e6%97%a5%e6%9c%ac%e8%aa%9e%e3%82%a4%e3%83%b3%e3%82%b9%e3%83%88%e3%83%a9%e3%82%af%e3%82%b7%e3%83%a7%e3%83%b3%e3%83%87%e3%83%bc%e3%82%bf-%e5%85%ac%e9%96%8b/)| A manually constructed Japanese instruction dataset |
+|        |answer-carefully-001| A manually constructed Japanese instruction dataset focusing on LLM safety |
+|        |[databricks-dolly-15k-ja](https://huggingface.co/datasets/llm-jp/databricks-dolly-15k-ja)| [databricks-dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k) translated into Japanese using DeepL  |
+|        |[oasst1-21k-ja](https://huggingface.co/datasets/llm-jp/oasst1-21k-ja)| A subset of [oasst1](https://huggingface.co/datasets/OpenAssistant/oasst1) translated into Japanese using DeepL |
+|        |[oasst2-33k-ja](https://huggingface.co/datasets/llm-jp/oasst2-33k-ja)| A subset of [oasst2](https://huggingface.co/datasets/OpenAssistant/oasst2) translated into Japanese using DeepL |
+|English |[oasst1-21k-en](https://huggingface.co/datasets/llm-jp/oasst1-21k-en)| A subset of [oasst1](https://huggingface.co/datasets/OpenAssistant/oasst1) |
+|        |[oasst2-33k-en](https://huggingface.co/datasets/llm-jp/oasst2-33k-en)| A subset of [oasst2](https://huggingface.co/datasets/OpenAssistant/oasst2) |
+## Evaluation
+You can view the evaluation results of several LLMs on this [leaderboard](http://wandb.me/llm-jp-leaderboard). We used [llm-jp-eval](https://github.com/llm-jp/llm-jp-eval) (v1.3.0) for the evaluation.
+## Risks and Limitations
+The models released here are still in the early stages of our research and development and have not been tuned to ensure outputs align with human intent and safety considerations.
+## Send Questions to
+llm-jp(at)nii.ac.jp
+## License
+[Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
+## Model Card Authors
+*The names are listed in alphabetical order.*
+Namgi Han, Tatsuya Hiraoka, Hirokazu Kiyomaru, Takashi Kodama, and Hiroshi Matsuda.