japanese-large-lm-1.7b

This repository provides a 1.7B parameters Japanese language model, trained by LINE Corporation.

Tech Blog explains details.

How to use

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline, set_seed
 
model = AutoModelForCausalLM.from_pretrained("line-corporation/japanese-large-lm-1.7b", torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained("line-corporation/japanese-large-lm-1.7b", use_fast=False)
generator = pipeline("text-generation", model=model, tokenizer=tokenizer, device=0)
set_seed(101)
 
text = generator(
    "おはようございます、今日の天気は",
    max_length=30,
    do_sample=True,
    pad_token_id=tokenizer.pad_token_id,
    num_return_sequences=5,
)

for t in text:
  print(t)

# [{'generated_text': 'おはようございます、今日の天気は雨模様ですね。梅雨のこの時期の ジメジメ、ムシムシはたまらないですねえ~。 皆さんもお'},
#  {'generated_text': 'おはようございます、今日の天気は快晴。 そして、朝8時15分には、 8月9日現在の、 月島・勝どき・'},
#  {'generated_text': 'おはようございます、今日の天気は曇りです。 朝起きたら雪がチラついていました。 日中も雪が舞い散るような天気です。 朝から寒いですね。'},
#  {'generated_text': 'おはようございます、今日の天気は雨です。昨日、天気が悪く洗濯物を干しにベランダに出た時に雨に降られ、風邪が悪化しそうです。今日洗濯'},
#  {'generated_text': 'おはようございます、今日の天気は晴天ですが涼しい1日です、気温は午後になり 若干下がる予報です。 6月も10日を'}]

Model architecture

Model	Vocab size	Architecture	Position type	Layers	Hidden dim	Attention heads
1.7B	51200	GPT2	Absolute	24	2304	24
3.6B	51200	GPTNeoX	RoPE	30	3072	32

Training Corpus

Our training corpus consists of the Japanese portions of publicly available corpus such as C4, CC-100, and Oscar. We also incorporated the Web texts crawled by in-house system. The total size of our training corpus is about 650 GB. The trained model achieves 8.57 perplexity on the internal validation sets of Japanese C4.

Tokenization

We use a sentencepiece tokenizer with a unigram language model and byte-fallback. We do not apply pre-tokenization with Japanese tokenizer. Thus, a user may directly feed raw sentences into the tokenizer.

License

Apache License, Version 2.0

line-corporation
/

japanese-large-lm-1.7b