watashiha/watashiha-gpt-6b

	---
	license: apache-2.0
	language:
	- ja
	---

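	## Model Overview
	This is an ogiri (大喜利) language model developed on AWS trn1 instances. Ogiri is a Japanese comedy format in which participants give witty answers to a prompt.
	After pretraining, the model was fine-tuned on ogiri data.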

	* Architecture: GPT2
	* Vocab size: 44880
	* Model size: 6B params
	* License: [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0)
	* Library: [aws-neuron-reference-for-megatron-lm](https://github.com/aws-neuron/aws-neuron-reference-for-megatron-lm)
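As a rough guide to hardware requirements, the memory needed for the weights alone can be estimated from the parameter count and the dtype. The sketch below assumes exactly 6 billion parameters (the card only gives the rounded "6B") and ignores activations, the KV cache, and framework overhead, so actual usage will be higher.

```python
# Back-of-the-envelope weight-memory estimate.
# Assumption: 6_000_000_000 parameters, i.e. the rounded "6B" from the model card.
PARAMS = 6_000_000_000
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_memory_gib(num_params: int, dtype: str) -> float:
    """Return the memory needed to hold the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gib(PARAMS, dtype):.1f} GiB")
```

Under these assumptions, `torch.bfloat16`, the dtype used in the usage example in this card, needs about half the weight memory of `float32`.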

	## Training Data
	The model was pretrained on the following corpora, 47.7 billion tokens in total.
	* Japanese data from [C4](https://huggingface.co/datasets/mc4)
	* Japanese data from [CC-100](https://huggingface.co/datasets/cc100)
	* Japanese data from [OSCAR](https://huggingface.co/datasets/oscar)
	* Japanese dump data from [Wikipedia](https://ja.wikipedia.org/wiki/%E3%83%A1%E3%82%A4%E3%83%B3%E3%83%9A%E3%83%BC%E3%82%B8)
	* In-house data

	Fine-tuning used 6.93 million ogiri examples.

	## Usage
	```python
	import torch
	from transformers import AutoModelForCausalLM, AutoTokenizer

	model_name = "watashiha/watashiha-gpt-6b"
	tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
	model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

	if torch.cuda.is_available():
	    model = model.to("cuda")

	# Prompt format: "お題:" (topic) + the ogiri topic + "<SEP>" + "回答:" (answer)
	text = "お題:ホラー映画の「○○○から逃げろ！」<SEP>回答:"
	token_ids = tokenizer.encode(text, add_special_tokens=False, return_tensors="pt").to(model.device)

	output_ids = model.generate(
	    token_ids,
	    do_sample=True,
	    max_new_tokens=32,
	    top_p=0.9,
	    top_k=50,
	    pad_token_id=tokenizer.pad_token_id,
	    eos_token_id=tokenizer.eos_token_id,
	)
	output = tokenizer.decode(output_ids.tolist()[0], skip_special_tokens=True)
	print(output)
	# Example output (sampling is enabled, so results vary between runs):
	# お題:ホラー映画の「○○○から逃げろ！」<SEP>回答:怖いもの知らずの大学生
	```
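The prompt in the example above follows a fixed pattern, and the decoded output repeats the prompt before the generated answer. A minimal sketch of two helpers for building the prompt and pulling out only the answer is shown here. The function names `build_prompt` and `extract_answer` are hypothetical and not part of the model or of `transformers`. The sketch also assumes the decoded text keeps the `<SEP>` string, as it does in the example output.

```python
SEP = "<SEP>"        # separator string, taken from the example prompt
TOPIC_TAG = "お題:"   # "topic" marker from the example prompt
ANSWER_TAG = "回答:"  # "answer" marker from the example prompt


def build_prompt(topic: str) -> str:
    """Wrap an ogiri topic in the prompt format used in the usage example."""
    return f"{TOPIC_TAG}{topic}{SEP}{ANSWER_TAG}"


def extract_answer(generated: str) -> str:
    """Return the text after the first '<SEP>回答:' marker.

    Falls back to the whole string if the marker is missing.
    """
    _, found, answer = generated.partition(SEP + ANSWER_TAG)
    return answer.strip() if found else generated.strip()


prompt = build_prompt("ホラー映画の「○○○から逃げろ！」")
sample_output = prompt + "怖いもの知らずの大学生"  # example output from the model card
print(extract_answer(sample_output))
```

With the real model, `build_prompt(...)` would replace the hard-coded `text` string, and `extract_answer(output)` would be applied to the decoded `output`.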

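	## Performance Comparison
	Each model below was fine-tuned under the same conditions, and the jokes (ボケ, boke) it generated were rated on a four-level scale by Keitai Ogiri Legends (ケータイ大喜利レジェンド).

	* 圏外 (out of range): the topic is not understood as Japanese
	* 1本 (1 bar): the topic is understood, but the answer does not work as a joke (it is not funny)
	* 2本 (2 bars): the answer works as a joke (it is somewhat funny)
	* 3本 (3 bars): the answer is funny (it reaches a certain level of funniness)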

	| Model | 圏外 (out of range) | 1本 (1 bar) | 2本 (2 bars) | 3本 (3 bars) |
	|--------------|------|-----|-----|-----|
	| watashiha-gpt-6b | 77 | 204 | 175 | 44 |
	| [rinna/japanese-gpt-neox-3.6b](https://huggingface.co/rinna/japanese-gpt-neox-3.6b) | 88 | 194 | 185 | 30 |
	| [stabilityai/japanese-stablelm-base-alpha-7b](https://huggingface.co/stabilityai/japanese-stablelm-base-alpha-7b) | 96 | 164 | 196 | 43 |
	| [elyza/ELYZA-japanese-Llama-2-7b-fast](https://huggingface.co/elyza/ELYZA-japanese-Llama-2-7b-fast) | 75 | 197 | 198 | 25 |
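The row totals in the table are not identical (500, 497, 499, and 495), so converting the counts to percentages makes the rows easier to compare. The sketch below does that using only the counts from the table. The choice of summary columns (share of the top rating 3本, and share of 2本 or better) is an illustrative assumption, not a metric defined by the authors.

```python
# Counts copied from the table: (圏外, 1本, 2本, 3本) per model.
ratings = {
    "watashiha-gpt-6b": (77, 204, 175, 44),
    "rinna/japanese-gpt-neox-3.6b": (88, 194, 185, 30),
    "stabilityai/japanese-stablelm-base-alpha-7b": (96, 164, 196, 43),
    "elyza/ELYZA-japanese-Llama-2-7b-fast": (75, 197, 198, 25),
}


def shares(counts):
    """Convert raw rating counts to percentages of that model's own total."""
    total = sum(counts)
    return [100 * c / total for c in counts]


for model, counts in ratings.items():
    pct = shares(counts)
    print(
        f"{model}: n={sum(counts)}, "
        f"top rating={pct[3]:.1f}%, two bars or better={pct[2] + pct[3]:.1f}%"
    )
```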

	## Developers
	- 内田達弥 (UCHIDA, Tatsuya)
	- 小橋洋平 (KOBASHI, Yohei)
	- 黒木修弥 (KUROKI, Shuya)
	- 久保田光 (KUBOTA, Hikaru)
	- 竹之内大輔 (TAKENOUCHI, Daisuke)