metadata

license: cc-by-sa-4.0
datasets:
  - globis-university/aozorabunko-clean
  - oscar-corpus/OSCAR-2301
  - Wikipedia
  - WikiBooks
  - CC-100
  - mC4
language:
  - ja

What’s this?

This is a DeBERTa V3 model pre-trained on Japanese resources.

The model has the following features:

Based on the well-established DeBERTa V3 architecture
Specialized for Japanese
Does not require a morphological analyzer at inference time
Respects word boundaries to some extent (does not produce tokens that span multiple words, such as の都合上 or の判定負けを喫し)
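
The word-boundary property can be checked mechanically. The following is a minimal sketch, not the authors' evaluation code: it assumes you already have a word segmentation and a token segmentation of the same string as plain surface strings (no special prefix characters), and the function names are hypothetical.

```python
from itertools import accumulate


def boundary_offsets(pieces):
    """Return the set of character offsets at which a piece ends."""
    return set(accumulate(len(p) for p in pieces))


def tokens_crossing_boundaries(words, tokens):
    """List the tokens that span a word boundary.

    `words` and `tokens` are two segmentations of the same string.
    A token crosses a boundary if some word ends strictly inside it.
    """
    if "".join(words) != "".join(tokens):
        raise ValueError("segmentations must cover the same text")
    word_ends = boundary_offsets(words)
    crossing = []
    start = 0
    for tok in tokens:
        end = start + len(tok)
        # A word boundary strictly between start and end splits the token.
        if any(start < b < end for b in word_ends):
            crossing.append(tok)
        start = end
    return crossing


# Illustrative word segmentation (assumed here, not produced by a real analyzer).
words = ["の", "都合", "上"]
print(tokens_crossing_boundaries(words, ["の都合上"]))
print(tokens_crossing_boundaries(words, ["の", "都合", "上"]))
```

A token that is shorter than a word is fine under this check; only tokens that swallow a boundary are reported.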

How to use

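from transformers import AutoTokenizer, AutoModelForTokenClassification

# Load the tokenizer and the pre-trained weights from the Hugging Face Hub.
model_name = 'globis-university/deberta-v3-japanese-base'
tokenizer = AutoTokenizer.from_pretrained(model_name)
# The token-classification head is not part of the pre-trained checkpoint,
# so fine-tune the model on a labeled task before relying on its outputs.
model = AutoModelForTokenClassification.from_pretrained(model_name)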

Tokenizer

The tokenizer was trained using the method introduced by Kudo.

It was designed with the following goals:

No morphological analyzer is needed at inference time
Tokens do not cross word boundaries (as defined by unidic-cwj-202302)
Easy to use with Hugging Face
A vocabulary that is not too large

The original DeBERTa V3 is notable for its large vocabulary, but a large vocabulary makes the embedding layer's parameter count very large. This model therefore adopts a smaller vocabulary.
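
To make the vocabulary trade-off concrete, here is a rough sketch of how the token embedding table scales with vocabulary size. The hidden size of 768 and the smaller vocabulary of 32,000 are illustrative assumptions, not values stated in this document; 128,000 approximates the vocabulary of the original DeBERTa V3.

```python
def embedding_parameters(vocab_size: int, hidden_size: int) -> int:
    """Parameters in a token embedding table: one vector per vocabulary entry."""
    return vocab_size * hidden_size


HIDDEN_SIZE = 768      # assumption: a typical base-sized hidden dimension
LARGE_VOCAB = 128_000  # approximate vocabulary size of the original DeBERTa V3
SMALL_VOCAB = 32_000   # hypothetical smaller vocabulary, for illustration only

large = embedding_parameters(LARGE_VOCAB, HIDDEN_SIZE)
small = embedding_parameters(SMALL_VOCAB, HIDDEN_SIZE)
print(f"large vocabulary: {large:,} embedding parameters")
print(f"small vocabulary: {small:,} embedding parameters")
print(f"reduction: {large - small:,} parameters")
```

Because the table grows linearly with vocabulary size, cutting the vocabulary is one of the cheapest ways to shrink a small model.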

Data

Dataset Name	Notes	File Size (with metadata)	Factor
Wikipedia	2023/07; WikiExtractor	3.5GB	x2
Wikipedia	2023/07; cl-tohoku's method	4.8GB	x2
WikiBooks	2023/07; cl-tohoku's method	43MB	x2
Aozora Bunko	2023/07; globis-university/aozorabunko-clean	496MB	x4
CC-100	ja	90GB	x1
mC4	ja; extracted 10% of Wikipedia-like data using DSIR	91GB	x1
OSCAR 2023	ja; extracted 20% of Wikipedia-like data using DSIR	26GB	x1
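
As a rough illustration of how the Factor column changes the mixture, the sketch below multiplies each listed file size by its factor. It assumes that the factor is a repetition (upsampling) multiplier and that 1 GB equals 1000 MB; neither is stated explicitly here, and file size is only a proxy for the number of training tokens.

```python
# (name, size in GB as listed in the table, factor)
CORPORA = [
    ("Wikipedia (WikiExtractor)", 3.5, 2),
    ("Wikipedia (cl-tohoku's method)", 4.8, 2),
    ("WikiBooks", 0.043, 2),
    ("Aozora Bunko", 0.496, 4),
    ("CC-100", 90.0, 1),
    ("mC4", 91.0, 1),
    ("OSCAR 2023", 26.0, 1),
]


def weighted_size_gb(corpora):
    """Total size after repeating each corpus `factor` times."""
    return sum(size * factor for _, size, factor in corpora)


def mixture_shares(corpora):
    """Fraction of the weighted mixture contributed by each corpus."""
    total = weighted_size_gb(corpora)
    return {name: size * factor / total for name, size, factor in corpora}


print(f"weighted total: {weighted_size_gb(CORPORA):.2f} GB")
for name, share in mixture_shares(CORPORA).items():
    print(f"{name}: {share:.1%}")
```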

Training parameters

Number of devices: 8
Batch size: 24 x 8
Learning rate: 1.92e-4
Maximum sequence length: 512
Optimizer: AdamW
Learning rate scheduler: Linear schedule with warmup
Training steps: 1,000,000
Warmup steps: 100,000
Precision: Mixed (fp16)
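
The schedule implied by these settings can be sketched in plain Python. This is an illustration under stated assumptions rather than the training code itself: it reads "24 x 8" as 24 examples per device across 8 devices with no gradient accumulation, and it assumes the learning rate rises linearly from 0 to the peak during warmup and then decays linearly to 0 at the final step.

```python
PEAK_LR = 1.92e-4
WARMUP_STEPS = 100_000
TOTAL_STEPS = 1_000_000
PER_DEVICE_BATCH = 24  # assumed reading of "24 x 8"
NUM_DEVICES = 8


def effective_batch_size(per_device=PER_DEVICE_BATCH, devices=NUM_DEVICES):
    """Examples consumed per optimizer step, assuming no gradient accumulation."""
    return per_device * devices


def learning_rate(step, peak=PEAK_LR, warmup=WARMUP_STEPS, total=TOTAL_STEPS):
    """Linear warmup to `peak`, then linear decay to zero at `total` steps."""
    if step < warmup:
        return peak * step / warmup
    return peak * max(0.0, (total - step) / (total - warmup))


print("effective batch size:", effective_batch_size())
for step in (0, 50_000, 100_000, 550_000, 1_000_000):
    print(step, learning_rate(step))
```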

Evaluation

Model	JSTS (Pearson/Spearman)	JNLI (acc)	JSQuAD (EM/F1)	JCQA (acc)
≤ small
izumi-lab/deberta-v2-small-japanese	0.890/0.846	0.880	-	0.737
globis-university/deberta-v3-japanese-xsmall	0.916/0.880	0.913	0.869/0.938	0.821
base
cl-tohoku/bert-base-japanese-v3	0.919/0.881	0.907	0.880/0.946	0.848
nlp-waseda/roberta-base-japanese	0.913/0.873	0.895	0.864/0.927	0.840
izumi-lab/deberta-v2-base-japanese	0.919/0.882	0.912	-	0.859
ku-nlp/deberta-v2-base-japanese	0.922/0.886	0.922	0.899/0.951	-
ku-nlp/deberta-v3-base-japanese	0.927/0.891	0.927	0.896/-	-
globis-university/deberta-v3-japanese-base	0.925/0.895	0.921	0.890/0.950	0.886
large
cl-tohoku/bert-large-japanese-v2	0.926/0.893	0.929	0.893/0.956	0.893
nlp-waseda/roberta-large-japanese	0.930/0.896	0.924	0.884/0.940	0.907
nlp-waseda/roberta-large-japanese-seq512	0.926/0.892	0.926	0.918/0.963	0.891
ku-nlp/deberta-v2-large-japanese	0.925/0.892	0.924	0.912/0.959	-
globis-university/deberta-v3-japanese-large	0.928/0.896	0.924	0.896/0.956	0.900

License

CC BY-SA 4.0

Acknowledgement

We used ABCI for computing resources. Thank you.