からまる Llama-3-Karamaru-v1

Karamaru is a conversational AI model developed by Sakana AI that responds in the style of Edo-period Japanese. While the base language model was originally trained on modern text, we applied continual pretraining using a custom Edo-period dataset consisting of over 25 million characters. This dataset includes approximately 13 million characters of human-transcribed text and 12 million characters transcribed using AI-based kuzushiji OCR from historical Japanese books.

With Karamaru, users can ask questions in modern Japanese and receive answers written in the classical Japanese style of the Edo period, reflecting the worldview and cultural context of that era. Karamaru offers a unique way to explore and engage with Japan’s historical language and thought.

Karamaru is intended as a tool for research, education, and cultural exploration—bridging time and language to bring the past closer to the present.

For further information, please refer to our blog post.

Developed by: Sakana AI
License: Llama3 Community License
Finetuned from model: Llama-3-ELYZA-JP-8B
Demo: Karamaru v1 Demo

Usage

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "SakanaAI/Llama-3-Karamaru-v1"

# device_map="auto" places the weights on the available device(s).
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto", torch_dtype=torch.bfloat16)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# A question in modern Japanese: "What is important for AI?"
text = "AIにとって大事なものはなんですか。"
message = {"role": "user", "content": text}
conversation = [message]

# Format the conversation with the model's chat template and tokenize it.
input_ids = tokenizer.apply_chat_template(conversation, add_generation_prompt=True, return_tensors="pt")
input_ids = input_ids.to(model.device)
attention_mask = torch.ones_like(input_ids)

with torch.no_grad():
    output_ids = model.generate(
        input_ids,
        attention_mask=attention_mask,
        max_new_tokens=500,
        do_sample=True,  # sampling must be on for temperature, top_p, and top_k to apply
        temperature=0.6,
        top_p=0.9,
        top_k=50,
        repetition_penalty=1.1,
    )

# Keep only the newly generated tokens (drop the prompt) and decode them.
output_ids = output_ids[0][input_ids.shape[1]:]
output = tokenizer.decode(output_ids, skip_special_tokens=True)
print(output)
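
The example above sends a single user message. For a follow-up question, the same list-of-dicts format can carry earlier turns as well. The following is a minimal sketch, assuming the model's chat template accepts the usual "user" and "assistant" roles; the helper name build_conversation is hypothetical and not part of this repository, and this card does not state whether multi-turn use was evaluated.

```python
from typing import Dict, List, Tuple

Message = Dict[str, str]


def build_conversation(
    history: List[Tuple[str, str]],
    next_user_text: str,
) -> List[Message]:
    """Turn past (user, assistant) pairs plus a new question into chat messages.

    Hypothetical helper for illustration; it only builds the message list
    and does not call the model.
    """
    conversation: List[Message] = []
    for user_text, assistant_text in history:
        conversation.append({"role": "user", "content": user_text})
        conversation.append({"role": "assistant", "content": assistant_text})
    # The final message comes from the user so the model answers it.
    conversation.append({"role": "user", "content": next_user_text})
    return conversation


# Placeholder reply text; in practice this would be the decoded model output.
history = [("AIにとって大事なものはなんですか。", "<previous model reply>")]
conversation = build_conversation(history, "もう少し詳しく教えてください。")
for message in conversation:
    print(message["role"], ":", message["content"])
```

The resulting list can be passed to tokenizer.apply_chat_template(conversation, add_generation_prompt=True, return_tensors="pt") in the same way as the single-message conversation.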

Training Data

Karamaru was trained using a custom Edo-period text dataset totaling approximately 25 million characters.

Minna de Honkoku: 12 million characters.
Kuzushiji Dataset: 1 million characters.
Pre-Modern Japanese Text Dataset: 12 million characters, transcribed with the AI kuzushiji OCR model RURI and refined with Sakana AI's LLM-based classical Japanese OCR refiner.
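
As a quick check on how these figures add up, the breakdown can be tallied in a few lines. This is only an illustrative sketch: the counts are the approximate values listed in this card, and labeling the first two sources as human-transcribed is inferred from the overview (13 million human-transcribed, 12 million OCR), not stated per dataset.

```python
# Approximate character counts as listed in this card, in millions.
# The "ocr" flag is an inference from the overview, not a per-dataset statement.
sources = {
    "Minna de Honkoku": {"chars_m": 12, "ocr": False},
    "Kuzushiji Dataset": {"chars_m": 1, "ocr": False},
    "Pre-Modern Japanese Text Dataset": {"chars_m": 12, "ocr": True},
}

total_m = sum(s["chars_m"] for s in sources.values())
ocr_m = sum(s["chars_m"] for s in sources.values() if s["ocr"])
human_m = total_m - ocr_m

print(f"total: ~{total_m}M characters")
print(f"human-transcribed: ~{human_m}M ({human_m / total_m:.0%})")
print(f"OCR-transcribed: ~{ocr_m}M ({ocr_m / total_m:.0%})")
```

Roughly half of the training text therefore comes from OCR output, which is why the refinement step applied to that portion matters for overall data quality.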

Limitations

Karamaru was trained on historical texts from the Edo period, which may reflect the social norms, values, and biases of that time. As a result, the model may generate responses that are considered inappropriate, outdated, or offensive by modern standards. Users should be mindful when using the model for research, educational or public-facing purposes.

Glossary

Edo period
A historical era in Japan spanning from 1603 to 1868, characterized by the rule of the Tokugawa shogunate, a strict social hierarchy, and flourishing traditional arts and culture.
Kuzushiji (くずし字)
A cursive style of Japanese writing used in historical texts, particularly those written before 1900 (Meiji 33). Kuzushiji includes both kanji and hentaigana, and it can be difficult to read without specialized training.

Developers

Tarin Clanuwat (Sakana AI)
Tianyu Zhao (Sakana AI)
Yuki Imajuku (Sakana AI)
Makoto Shing (Sakana AI)
Asanobu Kitamoto (National Institute of Informatics, ROIS-DS Center for Open Data in the Humanities, Sakana AI)

Collaborators

Kazuaki Yamamoto (National Institute of Japanese Literature)
Yuta Hashimoto (National Museum of Japanese History)

Citation

BibTeX:

@misc{karamaruv1,
    url    = {https://huggingface.co/SakanaAI/Llama-3-Karamaru-v1},
    title  = {Llama-3-Karamaru-v1},
    author = {Clanuwat, Tarin and Zhao, Tianyu and Imajuku, Yuki and Shing, Makoto and Kitamoto, Asanobu}
}
