---
license: other
datasets:
- mc4
- cc100
- oscar
- togethercomputer/RedPajama-Data-1T
language:
- ja
- en
library_name: transformers
base_model: meta-llama/Llama-2-70b-hf
pipeline_tag: text-generation
tags:
- llama
- llama-2
---
# KARAKURI LM
![KARAKURI LM](./thumbnail.png)
KARAKURI LM is a pretrained language model that builds upon Llama 2.
Our model enhances Llama 2's capabilities by incorporating additional Japanese vocabulary and further pretraining on a mixture of Japanese and multilingual corpora.
KARAKURI LM Chat is a fine-tuned version of KARAKURI LM, which was trained on a mixture of publicly available and closed datasets using the [SteerLM](https://aclanthology.org/2023.findings-emnlp.754/) technique.
During fine-tuning, our model employed a continual learning approach.
Unlike the common practice of relying solely on structured conversational datasets, we also incorporated unstructured corpora, similar to what was used during its pretraining phase.
Although the conversational datasets contain only 2.5% Japanese tokens, the model performs well on Japanese benchmarks.
At the time of release, it achieves the highest performance among Japanese open models on [MT-Bench-jp](https://api.wandb.ai/links/wandb-japan/6ff86bp3).
Furthermore, it achieves performance comparable to Llama 2 70B Chat on the original English [MT-Bench](https://huggingface.co/spaces/lmsys/mt-bench).
You can find more details in our blog post ([en](https://medium.com/karakuri/introducing-karakuri-lm-34c79a3bf341), [ja](https://medium.com/karakuri/karakuri-lm%E3%81%AE%E8%A7%A3%E8%AA%AC-4b6cf9c3d40f)).
If you are curious about our model, give our [demo](https://lm.karakuri.cc/) a try.
## Model Details
- **Developed by**: [KARAKURI Inc.](https://about.karakuri.ai/)
- **Model type**: Causal decoder-only transformer language model
- **Languages**: English and Japanese
- **Finetuned from**: [meta-llama/Llama-2-70b-hf](https://huggingface.co/meta-llama/Llama-2-70b-hf)
- **Contact**: For questions and comments about the model, please email `karakuri-rd@karakuri.ai`
## Performance
At the time of release, KARAKURI LM 70B Chat v0.1 achieves the highest performance among Japanese open models on [MT-Bench-jp](https://api.wandb.ai/links/wandb-japan/6ff86bp3):
| Model | Size | Alignment | MT-Bench-jp |
| :---------------------------------- | :-----: | :---------: | ----------: |
| GPT-4 | - | RLHF | 8.78 |
| GPT-3.5-Turbo | - | RLHF | 8.24 |
| Claude 2.1 | - | RLHF | 8.18 |
| Gemini Pro | - | RLHF | 7.17 |
| **KARAKURI LM 70B Chat v0.1** | **70B** | **SteerLM** | **6.43** |
| Qarasu-14B-Chat-Plus-Unleashed | 14B | SFT | 6.26 |
| Llama 2 70B Chat | 70B | RLHF | 5.23 |
| ELYZA-Japanese-Llama-2-13B | 13B | SFT | 5.05 |
| Japanese-StableLM-Instruct-Beta-70B | 70B | SFT | 5.03 |
| Swallow-70B-Instruct | 70B | SFT | 4.39 |
It also achieves performance comparable to Llama 2 70B Chat on the original English [MT-Bench](https://huggingface.co/spaces/lmsys/mt-bench):
| Model | Average | MT-Bench | MT-Bench-jp |
| :---------------------------- | -------: | -------: | ----------: |
| **KARAKURI LM 70B Chat v0.1** | **6.52** | **6.61** | **6.43** |
| Llama 2 70B Chat | 6.04 | 6.86 | 5.23 |
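
The Average column appears to be the arithmetic mean of the MT-Bench and MT-Bench-jp scores. The card does not state this definition, so treat it as an assumption. A minimal check under that assumption, using only the scores from the table above:

```python
# Scores copied from the table above, as (MT-Bench, MT-Bench-jp).
scores = {
    "KARAKURI LM 70B Chat v0.1": (6.61, 6.43),
    "Llama 2 70B Chat": (6.86, 5.23),
}


def average(pair):
    """Arithmetic mean of the two scores (assumed definition of 'Average')."""
    return sum(pair) / len(pair)


averages = {name: average(pair) for name, pair in scores.items()}

for name, avg in averages.items():
    print(f"{name}: {avg:.3f}")
```

Under this assumption both computed means agree with the Average column to within rounding. The Llama 2 70B Chat mean falls on the boundary between 6.04 and 6.05.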
## Use in 🤗 Transformers
You can run the model using the `pipeline()` function from 🤗 Transformers:
```python
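from transformers import pipeline

# device_map="auto" (requires the accelerate package) places the weights across the available devices.
generator = pipeline("text-generation", model="karakuri-ai/karakuri-lm-70b-v0.1", device_map="auto", torch_dtype="auto")

# Japanese dialogue prompt: a greeting exchange, then a request for a short Tokyo day-trip plan.
prompt = """以下は人間とAIアシスタントとの会話です。
Human: こんにちは。
AI: こんにちは、私はAIアシスタントです。何かお手伝いできることはありますか?
Human: 週末に日帰りで東京に遊びに行こうと思っています。日帰りなので、短時間で回れるおすすめの観光プランを教えてください。
AI: """

# return_full_text=False returns only the generated continuation, not the prompt.
outputs = generator(prompt, return_full_text=False, max_new_tokens=512)
print(outputs[0]["generated_text"])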
```
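
The prompt in the example above is a plain-text transcript: an introductory line, alternating `Human:` and `AI:` turns, and a trailing `AI: ` for the model to continue. The sketch below assembles such a prompt from a list of turns. `build_prompt` is a hypothetical helper, not part of 🤗 Transformers or of this model's release, and whether the model requires exactly this layout is an assumption drawn only from the example.

```python
from typing import List, Tuple

# Introductory line taken verbatim from the example prompt above.
SYSTEM_LINE = "以下は人間とAIアシスタントとの会話です。"


def build_prompt(turns: List[Tuple[str, str]], user_message: str) -> str:
    """Format earlier (human, ai) turns plus a new human message.

    Hypothetical helper: it only mirrors the transcript layout of the example.
    """
    lines = [SYSTEM_LINE]
    for human, ai in turns:
        lines.append(f"Human: {human}")
        lines.append(f"AI: {ai}")
    lines.append(f"Human: {user_message}")
    lines.append("AI: ")  # trailing "AI: " leaves the next turn for the model
    return "\n".join(lines)


history = [
    ("こんにちは。", "こんにちは、私はAIアシスタントです。何かお手伝いできることはありますか?"),
]
prompt = build_prompt(history, "おすすめの観光プランを教えてください。")
print(prompt)
```

The resulting string can be passed to `generator(...)` in place of the hand-written prompt.
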
## Training
### Training Datasets
- [mC4](https://huggingface.co/datasets/mc4)
- [CC100](https://huggingface.co/datasets/cc100)
- [OSCAR](https://huggingface.co/datasets/oscar)
- [RedPajama](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T)
- Our internal Japanese corpora
### Training Infrastructure
- **Hardware**: KARAKURI LM 70B was trained on 32 nodes, each an Amazon EC2 trn1.32xlarge instance.
- **Software**: We use code based on [neuronx-nemo-megatron](https://github.com/aws-neuron/neuronx-nemo-megatron).
## Acknowledgements
We gratefully acknowledge the support from AWS Japan through the [AWS LLM Development Support Program](https://aws.amazon.com/jp/local/llm-development-support-program/).
## License
Llama 2 is licensed under the LLAMA 2 Community License, Copyright © Meta Platforms, Inc. All Rights Reserved.
Subject to the license above, and except for commercial purposes, you are free to share and adapt KARAKURI LM, provided that you must, in a recognizable and appropriate manner, (i) state that you are using KARAKURI LM developed by KARAKURI Inc., when you publish or make available to third parties KARAKURI LM, its derivative works or modification, or any output or results of KARAKURI LM or its derivative works or modification, and (ii) indicate your contributions, if you modified any material of KARAKURI LM.
If you plan to use KARAKURI LM for commercial purposes, please contact us beforehand. You are not authorized to use KARAKURI LM for commercial purposes unless we expressly grant you such rights.
If you have any questions regarding the interpretation of the above terms, please also feel free to contact us.