---
language: ja
license: cc-by-sa-4.0
library_name: transformers
datasets:
- cc100
- mc4
- oscar
- wikipedia
- izumi-lab/cc100-ja
- izumi-lab/mc4-ja-filter-ja-normal
- izumi-lab/oscar2301-ja-filter-ja-normal
- izumi-lab/wikipedia-ja-20230720
- izumi-lab/wikinews-ja-20230728
---

# DeBERTa V2 base Japanese

This is a [DeBERTaV2](https://github.com/microsoft/DeBERTa) model pretrained on Japanese texts.
The code for pretraining is available at [retarfi/language-pretraining](https://github.com/retarfi/language-pretraining/releases/tag/v2.2.1).

## How to use

You can use this model for masked language modeling as follows:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("izumi-lab/deberta-v2-base-japanese", use_fast=False)
model = AutoModelForMaskedLM.from_pretrained("izumi-lab/deberta-v2-base-japanese")
...
```

A complete fill-mask example is sketched in the appendix at the end of this card.

## Tokenization

The model uses a sentencepiece-based tokenizer; the vocabulary was trained on Japanese Wikipedia with [sentencepiece](https://github.com/google/sentencepiece).

## Training Data

We used the following corpora for pre-training:

- [Japanese portion of CC-100](https://huggingface.co/datasets/izumi-lab/cc100-ja)
- [Japanese portion of mC4](https://huggingface.co/datasets/izumi-lab/mc4-ja-filter-ja-normal)
- [Japanese portion of OSCAR2301](https://huggingface.co/datasets/izumi-lab/oscar2301-ja-filter-ja-normal)
- [Japanese Wikipedia as of July 20, 2023](https://huggingface.co/datasets/izumi-lab/wikipedia-ja-20230720)
- [Japanese Wikinews as of July 28, 2023](https://huggingface.co/datasets/izumi-lab/wikinews-ja-20230728)

## Training Parameters

The learning rate in parentheses is the one used for additional pre-training with the financial corpus.

- learning_rate: 2.4e-4 (6e-5)
- total_train_batch_size: 2,016
- max_seq_length: 512
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-06
- lr_scheduler_type: linear schedule with warmup
- training_steps: 1,000,000
- warmup_steps: 100,000
- precision: FP16

## Fine-tuning on General NLU tasks

We evaluate our model by averaging the results over five seeds. The results of the other models are taken from the [JGLUE repository](https://github.com/yahoojapan/JGLUE).

| Model               | JSTS             | JNLI      | JCommonsenseQA |
|---------------------|------------------|-----------|----------------|
|                     | Pearson/Spearman | acc       | acc            |
| **DeBERTaV2 base**  | **0.919/0.882**  | **0.912** | **0.859**      |
| Waseda RoBERTa base | 0.913/0.873      | 0.895     | 0.840          |
| Tohoku BERT base    | 0.909/0.868      | 0.899     | 0.808          |

## Citation

The citation information will be updated; please check for the latest version before citing.

```
@article{Suzuki-etal-2023-ipm,
  title = {Constructing and analyzing domain-specific language model for financial text mining},
  author = {Masahiro Suzuki and Hiroki Sakaji and Masanori Hirano and Kiyoshi Izumi},
  journal = {Information Processing \& Management},
  volume = {60},
  number = {2},
  pages = {103194},
  year = {2023},
  doi = {10.1016/j.ipm.2022.103194}
}
```

## Licenses

The pretrained models are distributed under the terms of the [Creative Commons Attribution-ShareAlike 4.0](https://creativecommons.org/licenses/by-sa/4.0/) license.

## Acknowledgments

This work was supported in part by JSPS KAKENHI Grant Number JP21K12010 and JST-Mirai Program Grant Number JPMJMI20B1, Japan.
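
## Appendix: Fill-Mask Example

The snippet below extends the code in "How to use" into a minimal, self-contained fill-mask sketch. The example sentence and the top-5 cutoff are arbitrary choices for illustration, and the snippet assumes the tokenizer's default mask token is used.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("izumi-lab/deberta-v2-base-japanese", use_fast=False)
model = AutoModelForMaskedLM.from_pretrained("izumi-lab/deberta-v2-base-japanese")

# Build a Japanese sentence with one masked token:
# "Tokyo is the [MASK] of Japan."
text = f"東京は日本の{tokenizer.mask_token}です。"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the mask position and take the top-5 predicted tokens for it.
mask_positions = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
top_ids = logits[0, mask_positions[0]].topk(5).indices
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))
```

The `fill-mask` pipeline in transformers can be used for the same purpose; the manual version above makes the mask handling explicit.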