	---
	license: mit
	language: ja
	library_name: transformers
	tags:
	- pytorch
	- deberta
	- deberta-v2
	- commonsenseqa
	- commonsense_qa
	- commonsense-qa
	- CommonsenseQA
	datasets:
	- wikipedia
	- cc100
	- oscar
	metrics:
	- accuracy

	---

	# deberta-v2-base-japanese fine-tuned for CommonsenseQA (multiple-choice questions)
	This model is deberta-v2-base-japanese fine-tuned on JCommonsenseQA from Yahoo Japan's JGLUE ( https://github.com/yahoojapan/JGLUE ).

	You can use this model for CommonsenseQA tasks.

	This version uses Juman for morphological analysis, so you need to install Juman before using the model.

	For installation instructions, see https://qiita.com/Helmet/items/b76ae8abc47186e24401 .

	# How to use
	Install transformers, pytorch, knp, pyknp, Juman, and textspan.

	Then run the following code to have the model solve a CommonsenseQA task.
	```python
	import torch
	from transformers import AutoModelForMultipleChoice, DebertaV2TokenizerFast
	from transformers.models.bert_japanese.tokenization_bert_japanese import JumanppTokenizer

	MODEL_ID = "Mizuiro-sakura/deberta-v2-base-juman-finetuned-commonsenseqa"
	# Local directory for the saved tokenizer copy. The name is an arbitrary choice;
	# it must differ from MODEL_ID, because a local directory with that name would
	# shadow the Hub repository on later runs.
	TOKENIZER_DIR = "juman-tokenizer"

	# The first call downloads the model; later calls reuse the local cache
	# (by default C:\Users\[username]\.cache\huggingface\hub on Windows).
	model = AutoModelForMultipleChoice.from_pretrained(MODEL_ID)

	# Load the tokenizer, register it under a Juman++-aware class name, and save a local copy.
	tkz = DebertaV2TokenizerFast.from_pretrained(MODEL_ID)
	tkz.__class__.__name__ = "JumanppDebertaV2TokenizerFast"
	tkz.init_kwargs["auto_map"] = {"AutoTokenizer": [None, "tokenizer.JumanppDebertaV2TokenizerFast"]}
	tkz.save_pretrained(TOKENIZER_DIR)


	# Juman++ pre-tokenizer classes, following KoichiYasuoka's diary (see Acknowledgments).
	# Note that `tkz` above is a plain DebertaV2TokenizerFast; these classes take effect
	# only when JumanppDebertaV2TokenizerFast itself is instantiated.
	class JumanppPreTokenizer(JumanppTokenizer):
	    def jumanpp_split(self, i, normalized_string):
	        import textspan
	        t = str(normalized_string)
	        k = self.tokenize(t)
	        # Map each Juman++ token back to its character span in the original string
	        return [normalized_string[s:e] for c in textspan.get_original_spans(k, t) for s, e in c]

	    def pre_tokenize(self, pretok):
	        pretok.split(self.jumanpp_split)


	class JumanppDebertaV2TokenizerFast(DebertaV2TokenizerFast):
	    def __init__(self, **kwargs):
	        from tokenizers.pre_tokenizers import PreTokenizer, Metaspace, Sequence
	        super().__init__(**kwargs)
	        self._tokenizer.pre_tokenizer = Sequence([PreTokenizer.custom(JumanppPreTokenizer()), Metaspace()])

	    def save_pretrained(self, save_directory, **kwargs):
	        import os
	        import shutil
	        from tokenizers.pre_tokenizers import PreTokenizer, Metaspace, Sequence
	        self._auto_map = {"AutoTokenizer": [None, "tokenizer.JumanppDebertaV2TokenizerFast"]}
	        # Custom pre-tokenizers cannot be serialized, so swap in Metaspace while saving
	        self._tokenizer.pre_tokenizer = Metaspace()
	        super().save_pretrained(save_directory, **kwargs)
	        self._tokenizer.pre_tokenizer = Sequence([PreTokenizer.custom(JumanppPreTokenizer()), Metaspace()])
	        shutil.copy(os.path.abspath(__file__), os.path.join(save_directory, "tokenizer.py"))


	# "Which of these is mainly for children and contains an illustrated story?"
	question = "主に子ども向けのもので、イラストのついた物語が書かれているものはどれ？"
	choice1 = "世界"    # world
	choice2 = "写真集"  # photo book
	choice3 = "絵本"    # picture book
	choice4 = "論文"    # academic paper
	choice5 = "図鑑"    # illustrated reference book

	# Pair the question with each of the five choices
	x1 = tkz([question] * 5, [choice1, choice2, choice3, choice4, choice5],
	         max_length=64, truncation=True, padding=True)

	# Preprocessing for the model: add a batch dimension, giving shape (1, 5, sequence_length)
	input_ids = torch.tensor([x1["input_ids"]], dtype=torch.int64)
	attention_mask = torch.tensor([x1["attention_mask"]], dtype=torch.int64)

	# Feed the tokens to the model and print the number (1-5) of the highest-scoring choice
	with torch.no_grad():
	    results = model(input_ids, attention_mask)
	print(torch.argmax(results.logits).item() + 1)
	```
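
The code above prints only the number of the top-scoring choice. As a minimal sketch, the logits can also be turned into per-choice probabilities with a softmax and paired with the choice texts. The helper `rank_choices` and the logit values below are hypothetical, made up for illustration; they are not output from this model.

```python
import math


def rank_choices(logits, choices):
    """Return (choice, probability) pairs sorted from most to least likely.

    `logits` is a plain list of floats, one per choice. This helper is
    hypothetical and not part of transformers.
    """
    if len(logits) != len(choices):
        raise ValueError("logits and choices must have the same length")
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return sorted(zip(choices, probs), key=lambda pair: pair[1], reverse=True)


choices = ["世界", "写真集", "絵本", "論文", "図鑑"]
# Made-up logits for illustration only, not real model output.
example_logits = [0.1, 0.5, 3.2, -1.0, 0.7]

for choice, prob in rank_choices(example_logits, choices):
    print(f"{choice}: {prob:.3f}")
```

With the real model, `results.logits[0].tolist()` should give a list of five floats that can be passed as `logits`, assuming the batch holds a single question as in the code above.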
	# Accuracy of the model
	eval_accuracy = 86.51 (the highest accuracy among Japanese base models)

	eval_loss = 0.5917

	(For reference: BERT: 72.0, XLM RoBERTa base: 68.7, LUKE: 80.0)

	# What is deberta-v2-base-japanese?
	It is a model trained on Japanese Wikipedia (3.2GB), cc100 (85GB), and oscar (54GB).
	It was released by the Kurohashi Lab at Kyoto University.

	# Model description
	This is a Japanese DeBERTa V2 base model pre-trained on Japanese Wikipedia, the Japanese portion of CC-100, and the Japanese portion of OSCAR.

	# Acknowledgments
	I would like to thank the Kurohashi Lab at Kyoto University for releasing the model.

	In writing the code, I also referred to KoichiYasuoka's diary ( https://srad.jp/~yasuoka/journal/659881/ ), for which I am deeply grateful.