metadata
license: cc-by-sa-4.0
datasets:
  - mc4
  - cc100
  - oscar
  - togethercomputer/RedPajama-Data-1T
language:
  - ja
  - en
library_name: transformers
base_model: meta-llama/Llama-2-70b-hf
pipeline_tag: text-generation
tags:
  - llama
  - llama-2

KARAKURI LM

KARAKURI LM is a pretrained language model that builds upon Llama 2. Our model enhances Llama 2's capabilities by incorporating additional Japanese vocabulary and further pretraining on a mixture of Japanese and multilingual corpora.

KARAKURI LM Chat is a version of KARAKURI LM fine-tuned on a mixture of publicly available and closed datasets using the SteerLM technique. Fine-tuning followed a continual learning approach: rather than relying solely on structured conversational datasets, as is common practice, we also included unstructured corpora similar to those used during pretraining.

Although only 2.5% of the tokens in the conversational datasets are Japanese, the model performs strongly in Japanese: at the time of release, it achieved the highest score among Japanese open models on MT-Bench-jp. It also performs comparably to Llama 2 70B Chat on the original English MT-Bench.

You can find more details in our blog post (en, ja). If you are curious about our model, give our demo a try.

Model Details

  • Developed by: KARAKURI Inc.
  • Model type: Causal decoder-only transformer language model
  • Languages: English and Japanese
  • Finetuned from: meta-llama/Llama-2-70b-hf
  • Contact: For questions and comments about the model, please email karakuri-rd@karakuri.ai

Performance

At the time of release, KARAKURI LM 70B Chat v0.1 achieved the highest score among Japanese open models on MT-Bench-jp:

| Model | Size | Alignment | MT-Bench-jp |
|---|---|---|---|
| GPT-4 | - | RLHF | 8.78 |
| GPT-3.5-Turbo | - | RLHF | 8.24 |
| Claude 2.1 | - | RLHF | 8.18 |
| Gemini Pro | - | RLHF | 7.17 |
| KARAKURI LM 70B Chat v0.1 | 70B | SteerLM | 6.43 |
| Qarasu-14B-Chat-Plus-Unleashed | 14B | SFT | 6.26 |
| Llama 2 70B Chat | 70B | RLHF | 5.23 |
| ELYZA-Japanese-Llama-2-13B | 13B | SFT | 5.05 |
| Japanese-StableLM-Instruct-Beta-70B | 70B | SFT | 5.03 |
| Swallow-70B-Instruct | 70B | SFT | 4.39 |

It also achieves performance comparable to Llama 2 70B Chat on the original English MT-Bench:

| Model | Average | MT-Bench | MT-Bench-jp |
|---|---|---|---|
| KARAKURI LM 70B Chat v0.1 | 6.52 | 6.61 | 6.43 |
| Llama 2 70B Chat | 6.04 | 6.86 | 5.23 |
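
The Average column appears to be the plain mean of the MT-Bench and MT-Bench-jp scores. That reading is an assumption, since the card does not state how the average was computed. The short check below recomputes the mean from the two published scores and compares it with the reported value; `mean_of_two` and the `scores` dictionary are illustrative names, not part of any KARAKURI tooling.

```python
from decimal import Decimal

# (MT-Bench, MT-Bench-jp, reported Average) copied from the table above.
scores = {
    "KARAKURI LM 70B Chat v0.1": (Decimal("6.61"), Decimal("6.43"), Decimal("6.52")),
    "Llama 2 70B Chat": (Decimal("6.86"), Decimal("5.23"), Decimal("6.04")),
}


def mean_of_two(a: Decimal, b: Decimal) -> Decimal:
    """Unweighted mean of two scores (assumed definition of 'Average')."""
    return (a + b) / 2


for name, (english, japanese, reported) in scores.items():
    mean = mean_of_two(english, japanese)
    gap = abs(mean - reported)
    print(f"{name}: mean={mean} reported={reported} gap={gap}")
```

Using `Decimal` avoids binary floating-point noise at the second decimal place. Under this assumption the Llama 2 70B Chat mean is 6.045, which the table reports as 6.04, so the published figures agree with a plain mean to within rounding.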

Use in 🤗 Transformers

You can run the model using the pipeline() function from 🤗 Transformers:

from transformers import pipeline

generator = pipeline("text-generation", model="karakuri-ai/karakuri-lm-70b-v0.1", device_map="auto", torch_dtype="auto")

prompt = """以下は人間とAIアシスタントとの会話です。

Human: こんにちは。
AI: こんにちは、私はAIアシスタントです。何かお手伝いできることはありますか?
Human: 週末に日帰りで東京に遊びに行こうと思っています。日帰りなので、短時間で回れるおすすめの観光プランを教えてください。
AI: """
outputs = generator(prompt, return_full_text=False, max_new_tokens=512)
# Print the continuation; a bare expression only displays in a notebook or REPL.
print(outputs[0]["generated_text"])
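
The prompt above is a plain transcript: an introductory line, a blank line, alternating `Human:` and `AI:` turns, and a trailing open `AI: ` slot for the model to complete. A minimal sketch of building that layout programmatically is shown below. The helpers `build_prompt` and `first_reply` are hypothetical names introduced here, not part of 🤗 Transformers or the KARAKURI release, and the truncation step assumes the model may keep writing further `Human:` turns after its answer.

```python
from typing import List, Tuple

# Introductory line copied from the example prompt above.
SYSTEM_LINE = "以下は人間とAIアシスタントとの会話です。"


def build_prompt(turns: List[Tuple[str, str]], system_line: str = SYSTEM_LINE) -> str:
    """Render (speaker, text) turns as a transcript ending in an open 'AI: ' slot."""
    lines = [system_line, ""]
    for speaker, text in turns:
        if speaker not in ("Human", "AI"):
            raise ValueError(f"unknown speaker: {speaker!r}")
        lines.append(f"{speaker}: {text}")
    lines.append("AI: ")  # the model continues from here
    return "\n".join(lines)


def first_reply(generated: str) -> str:
    """Keep only the text before the model starts a new 'Human:' turn (assumed behaviour)."""
    return generated.split("\nHuman:", 1)[0].strip()


if __name__ == "__main__":
    prompt = build_prompt([
        ("Human", "こんにちは。"),
        ("AI", "こんにちは、私はAIアシスタントです。何かお手伝いできることはありますか?"),
        ("Human", "週末に日帰りで東京に遊びに行こうと思っています。"),
    ])
    print(prompt)
```

The resulting string can be passed to `generator(...)` in place of the hand-written prompt, and `first_reply` can be applied to `outputs[0]["generated_text"]` if the continuation runs past the assistant's turn.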

Training

Training Datasets

  • mC4
  • CC100
  • OSCAR
  • RedPajama (togethercomputer/RedPajama-Data-1T)

Training Infrastructure

  • Hardware: KARAKURI LM 70B was trained on 32 nodes of Amazon EC2 trn1.32xlarge instances.
  • Software: We used code based on neuronx-nemo-megatron.

Acknowledgements

We gratefully acknowledge the support from AWS Japan through the AWS LLM Development Support Program.

License

Llama 2 is licensed under the LLAMA 2 Community License, Copyright © Meta Platforms, Inc. All Rights Reserved.

KARAKURI LM is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0). Under this license, you are free to share and adapt this model, even for commercial purposes, as long as you provide appropriate credit and distribute your contributions under the same license.

However, if you wish to use KARAKURI LM for commercial purposes, we require that you contact us directly, regardless of the terms of the CC BY-SA 4.0 license. If you have any questions regarding the interpretation of its terms, please also feel free to contact us.