language: ja
license: cc-by-sa-4.0
library_name: transformers
datasets:
- cc100
- mc4
- oscar
- wikipedia
- izumi-lab/cc100-ja
- izumi-lab/mc4-ja-filter-ja-normal
- izumi-lab/oscar2301-ja-filter-ja-normal
- izumi-lab/wikipedia-ja-20230720
- izumi-lab/wikinews-ja-20230728
DeBERTa V2 base Japanese
This is a DeBERTaV2 model pretrained on Japanese text. The pretraining code is available at retarfi/language-pretraining.
How to use
You can use this model for masked language modeling as follows:
from transformers import AutoTokenizer, AutoModelForMaskedLM
# use_fast=False loads the slow (Python) tokenizer implementation
tokenizer = AutoTokenizer.from_pretrained("izumi-lab/deberta-v2-base-japanese", use_fast=False)
model = AutoModelForMaskedLM.from_pretrained("izumi-lab/deberta-v2-base-japanese")
...
Tokenization
The model uses a sentencepiece-based tokenizer whose vocabulary was trained on Japanese Wikipedia.
Training Data
We used the following corpora for pre-training:
- Japanese portion of CC-100
- Japanese portion of mC4
- Japanese portion of OSCAR2301
- Japanese Wikipedia as of July 20, 2023
- Japanese Wikinews as of July 28, 2023
Training Parameters
The learning_rate value in parentheses is the learning rate used for additional pre-training on the financial corpus.
- learning_rate: 2.4e-4 (6e-5)
- total_train_batch_size: 2,016
- max_seq_length: 512
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-06
- lr_scheduler_type: linear schedule with warmup
- training_steps: 1,000,000
- warmup_steps: 100,000
- precision: FP16
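The learning-rate schedule implied by the parameters above can be sketched in a few lines. This is a minimal illustration, assuming the common definition of a linear schedule with warmup (a linear ramp from 0 to the peak over the warmup steps, then a linear decay to 0 at the final step), as used for example by `get_linear_schedule_with_warmup` in transformers. The helper name `linear_warmup_lr` is hypothetical, and the actual pretraining code may differ in detail.

```python
def linear_warmup_lr(step, peak_lr=2.4e-4, warmup_steps=100_000, total_steps=1_000_000):
    """Learning rate at `step` for a linear schedule with warmup (hypothetical helper)."""
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    # Decay: fall linearly from the peak to 0 at the final step.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


if __name__ == "__main__":
    for step in (0, 50_000, 100_000, 550_000, 1_000_000):
        print(f"step {step:>9,}: lr = {linear_warmup_lr(step):.2e}")
```

Under these assumptions the rate reaches its peak of 2.4e-4 at step 100,000 and is half the peak at steps 50,000 and 550,000.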
Fine-tuning on General NLU tasks
We report our model's scores as the average over five seeds.
Scores for the other models are taken from the JGLUE repository.
Model | JSTS (Pearson/Spearman) | JNLI (acc) | JCommonsenseQA (acc) |
---|---|---|---|
DeBERTaV2 base | 0.919/0.882 | 0.912 | 0.859 |
Waseda RoBERTa base | 0.913/0.873 | 0.895 | 0.840 |
Tohoku BERT base | 0.909/0.868 | 0.899 | 0.808 |
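JSTS is scored with the Pearson and Spearman correlations between predicted and gold similarity scores. The sketch below shows how the two coefficients can be computed in pure Python. The helper names and the toy scores are illustrative only; they are not taken from JGLUE or from the evaluation code behind the table.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: the Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


if __name__ == "__main__":
    gold = [0.0, 1.5, 2.0, 3.5, 5.0]  # toy gold similarity scores
    pred = [0.2, 1.0, 2.6, 3.1, 4.8]  # toy model predictions
    print(f"Pearson:  {pearson(gold, pred):.3f}")
    print(f"Spearman: {spearman(gold, pred):.3f}")
```

Pearson measures linear agreement between the raw scores, while Spearman depends only on their ordering, which is why the two are reported together.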
Citation
The citation will be updated. Please check for the latest version before citing.
@article{Suzuki-etal-2023-ipm,
title = {Constructing and analyzing domain-specific language model for financial text mining},
author = {Masahiro Suzuki and Hiroki Sakaji and Masanori Hirano and Kiyoshi Izumi},
journal = {Information Processing \& Management},
volume = {60},
number = {2},
pages = {103194},
year = {2023},
doi = {10.1016/j.ipm.2022.103194}
}
@article{Suzuki-2024-findebertav2,
jtitle = {{FinDeBERTaV2: 単語分割フリーな金融事前学習言語モデル}},
title = {{FinDeBERTaV2: Word-Segmentation-Free Pre-trained Language Model for Finance}},
jauthor = {鈴木, 雅弘 and 坂地, 泰紀 and 平野, 正徳 and 和泉, 潔},
author = {Masahiro Suzuki and Hiroki Sakaji and Masanori Hirano and Kiyoshi Izumi},
jjournal = {人工知能学会論文誌},
journal = {Transactions of the Japanese Society for Artificial Intelligence},
volume = {39},
number = {4},
pages = {FIN23-G_1-14},
year = {2024},
doi = {10.1527/tjsai.39-4_FIN23-G},
}
Licenses
The pretrained models are distributed under the terms of the Creative Commons Attribution-ShareAlike 4.0 license.
Acknowledgments
This work was supported in part by JSPS KAKENHI Grant Number JP21K12010, and the JST-Mirai Program Grant Number JPMJMI20B1, Japan.