	---
	license: apache-2.0
	datasets:
	- wikipedia
	- mc4
	- cc100
	- oscar
	language:
	- ja
	---

	# japanese-large-lm-1.7b

	This repository provides a 1.7B-parameter Japanese language model trained by [LINE Corporation](https://linecorp.com/ja/).
	See the [tech blog post](https://engineering.linecorp.com/ja/blog/3.6-billion-parameter-japanese-language-model) for details.

	## How to use

	```
	import torch
	from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline, set_seed

	model = AutoModelForCausalLM.from_pretrained("line-corporation/japanese-large-lm-1.7b", torch_dtype=torch.float16)
	tokenizer = AutoTokenizer.from_pretrained("line-corporation/japanese-large-lm-1.7b", use_fast=False)
	# device=0 places the pipeline on the first CUDA GPU.
	generator = pipeline("text-generation", model=model, tokenizer=tokenizer, device=0)
	set_seed(101)

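	# Prompt: "Good morning, today's weather is"
	text = generator(
	    "おはようございます、今日の天気は",
	    max_length=30,
	    do_sample=True,
	    pad_token_id=tokenizer.pad_token_id,
	    num_return_sequences=5,
	)

	# Sampling is stochastic, so outputs may differ from the examples below.
	for t in text:
	    print(t)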

	# [{'generated_text': 'おはようございます、今日の天気は雨模様ですね。梅雨のこの時期のジメジメ、ムシムシはたまらないですねえ~。皆さんもお'},
	# {'generated_text': 'おはようございます、今日の天気は快晴。そして、朝8時15分には、 8月9日現在の、月島・勝どき・'},
	# {'generated_text': 'おはようございます、今日の天気は曇りです。朝起きたら雪がチラついていました。日中も雪が舞い散るような天気です。朝から寒いですね。'},
	# {'generated_text': 'おはようございます、今日の天気は雨です。昨日、天気が悪く洗濯物を干しにベランダに出た時に雨に降られ、風邪が悪化しそうです。今日洗濯'},
	# {'generated_text': 'おはようございます、今日の天気は晴天ですが涼しい1日です、気温は午後になり若干下がる予報です。 6月も10日を'}]
	```
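
The example above loads the weights with `torch_dtype=torch.float16`. As a rough illustration of why that matters, the sketch below estimates the memory needed for the weights alone at two precisions. It assumes the nominal 1.7B figure from the model name rather than an exact parameter count, and it ignores activations, the attention cache, and framework overhead; `weight_memory_gb` and `NOMINAL_PARAMS` are names introduced here for illustration.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Assumption: nominal size taken from the model name, not an exact count.
NOMINAL_PARAMS = 1.7e9

for dtype_name, width in [("float16", 2), ("float32", 4)]:
    print(f"{dtype_name}: ~{weight_memory_gb(NOMINAL_PARAMS, width):.1f} GB")
```

Under these assumptions, half precision needs roughly 3.4 GB for the weights, about half of what single precision would require.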

	## Model architecture
	| Model | Vocab size | Architecture | Position type | Layers | Hidden dim | Attention heads |
	| :---: | :--------: | :----------- | :-----------: | :----: | :--------: | :-------------: |
	| 1.7B | 51200 | GPT2 | Absolute | 24 | 2304 | 24 |
	| 3.6B | 51200 | GPTNeoX | RoPE | 30 | 3072 | 32 |
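
As a sanity check on the table above, the following sketch derives an approximate parameter count and the per-head dimension from the listed values. It is a back-of-the-envelope estimate, not the exact count: it assumes a standard decoder block with a 4x feed-forward expansion (about `12 * hidden_dim^2` weights per layer) and a single token-embedding matrix, and it ignores biases, layer norms, position embeddings, and any untied output projection. `approx_params` is a helper name introduced here for illustration.

```python
def approx_params(vocab_size, layers, hidden_dim):
    """Rough decoder-only Transformer parameter count.

    Assumes 4*d^2 attention weights and 8*d^2 feed-forward weights per layer
    (a 4x MLP expansion), plus one token-embedding matrix. Biases, layer norms,
    position embeddings, and any untied output projection are ignored.
    """
    per_layer = 12 * hidden_dim ** 2
    embeddings = vocab_size * hidden_dim
    return layers * per_layer + embeddings


# Values copied from the table above.
configs = {
    "1.7B": {"vocab_size": 51200, "layers": 24, "hidden_dim": 2304, "heads": 24},
    "3.6B": {"vocab_size": 51200, "layers": 30, "hidden_dim": 3072, "heads": 32},
}

for name, cfg in configs.items():
    total = approx_params(cfg["vocab_size"], cfg["layers"], cfg["hidden_dim"])
    head_dim = cfg["hidden_dim"] // cfg["heads"]
    print(f"{name}: ~{total / 1e9:.2f}B parameters, head dim {head_dim}")
```

Under these assumptions the estimates come out at about 1.65B and 3.55B parameters, close to the nominal model sizes, with a head dimension of 96 in both configurations.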

	## Training Corpus
	Our training corpus consists of the Japanese portions of publicly available corpora such as C4, CC-100, and OSCAR.
	We also incorporated web text crawled by our in-house system.
	The total size of the training corpus is about 650 GB.
	The trained model achieves a perplexity of 8.57 on the internal validation sets of Japanese C4.
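
For readers less familiar with the metric, perplexity is commonly defined as the exponential of the mean negative log-likelihood per token. The sketch below illustrates that definition. It assumes natural-log, per-token averaging, which the text above does not spell out, and the log-probabilities are made-up values for illustration only; `perplexity` is a helper name introduced here.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood per token), natural log."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Hypothetical per-token log-probabilities, for illustration only.
example_log_probs = [math.log(0.25), math.log(0.125), math.log(0.0625), math.log(0.125)]
print(perplexity(example_log_probs))

# Under this definition, a perplexity of 8.57 corresponds to this mean loss
# in nats per token.
print(math.log(8.57))
```

Because the average is taken per token, perplexity values are only directly comparable between models that share the same tokenizer and evaluation data.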

	## Tokenization
	We use a SentencePiece tokenizer with a unigram language model and byte fallback.
	We do not apply pre-tokenization with a Japanese tokenizer,
	so users can feed raw sentences directly into the tokenizer.
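
To make the idea concrete, here is a toy sketch of unigram segmentation with byte fallback. It is not the actual tokenizer: the vocabulary, its probabilities, and the fallback penalty are hypothetical, and SentencePiece details such as the whitespace marker and normalization are omitted. It only illustrates how raw text is split into the most probable pieces, and how a character missing from the vocabulary is emitted as UTF-8 byte pieces in the `<0xNN>` style instead of an unknown token.

```python
import math

# Hypothetical toy vocabulary: piece -> log probability (not the real vocab).
TOY_VOCAB = {
    "今日": math.log(0.2),
    "の": math.log(0.3),
    "天気": math.log(0.2),
    "今": math.log(0.05),
    "日": math.log(0.05),
    "天": math.log(0.05),
    "気": math.log(0.05),
}
BYTE_PENALTY = math.log(1e-6)  # assumed score for each fallback byte piece


def byte_pieces(ch):
    """Encode one character as byte pieces such as <0xE3>."""
    return [f"<0x{b:02X}>" for b in ch.encode("utf-8")]


def segment(text, vocab=TOY_VOCAB, max_piece_len=4):
    """Viterbi search for the highest-scoring segmentation of raw text."""
    n = len(text)
    # best[i] = (score, start index of last piece, pieces emitted for it)
    best = [(-math.inf, None, None)] * (n + 1)
    best[0] = (0.0, None, [])
    for end in range(1, n + 1):
        for start in range(max(0, end - max_piece_len), end):
            piece = text[start:end]
            if piece in vocab:
                pieces, cand = [piece], vocab[piece]
            elif end - start == 1:
                # Byte fallback: a single character missing from the vocabulary.
                pieces = byte_pieces(piece)
                cand = BYTE_PENALTY * len(pieces)
            else:
                continue
            score = best[start][0] + cand
            if score > best[end][0]:
                best[end] = (score, start, pieces)
    # Walk back from the end to recover the chosen pieces in order.
    out, pos = [], n
    while pos > 0:
        _, start, pieces = best[pos]
        out = pieces + out
        pos = start
    return out


print(segment("今日の天気"))  # all characters covered by the toy vocabulary
print(segment("今日は"))      # "は" is not in the toy vocabulary
```

In the second call, the character missing from the toy vocabulary is emitted as three byte pieces, so no input is ever mapped to an unknown token.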


	## License
	[Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)