---
language: ja
license: cc-by-nc-sa-4.0
tags:
- roberta
- medical
inference: false
---

# alabnii/jmedroberta-base-manbyo-wordpiece-vocab50000

## Model description

This is a Japanese RoBERTa base model pre-trained on academic articles in the medical sciences collected by the Japan Science and Technology Agency (JST).

This model is released under the [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-nc-sa/4.0/deed) (CC BY-NC-SA 4.0).

#### Reference

Ja:

```
@InProceedings{sugimoto_nlp2023_jmedroberta,
    author =    "杉本海人 and 壹岐太一 and 知田悠生 and 金沢輝一 and 相澤彰子",
    title =     "J{M}ed{R}o{BERT}a: 日本語の医学論文にもとづいた事前学習済み言語モデルの構築と評価",
    booktitle = "言語処理学会第29回年次大会",
    year =      "2023",
    url =       "https://www.anlp.jp/proceedings/annual_meeting/2023/pdf_dir/P3-1.pdf"
}
```

En:

```
@InProceedings{sugimoto_nlp2023_jmedroberta,
    author =    "Sugimoto, Kaito and Iki, Taichi and Chida, Yuki and Kanazawa, Teruhito and Aizawa, Akiko",
    title =     "J{M}ed{R}o{BERT}a: a Japanese Pre-trained Language Model on Academic Articles in Medical Sciences (in Japanese)",
    booktitle = "Proceedings of the 29th Annual Meeting of the Association for Natural Language Processing",
    year =      "2023",
    url =       "https://www.anlp.jp/proceedings/annual_meeting/2023/pdf_dir/P3-1.pdf"
}
```

## Datasets used for pre-training

- abstracts (train: 1.6GB (10M sentences), validation: 0.2GB (1.3M sentences))
- abstracts & body texts (train: 0.2GB (1.4M sentences))

## How to use

**Before using the model, make sure that the [Manbyo Dictionary](https://sociocom.naist.jp/manbyou-dic/) has been downloaded to `/usr/local/lib/mecab/dic/userdic`.**

```bash
# download Manbyo-Dictionary

mkdir -p /usr/local/lib/mecab/dic/userdic
wget https://sociocom.jp/~data/2018-manbyo/data/MANBYO_201907_Dic-utf8.dic
mv MANBYO_201907_Dic-utf8.dic /usr/local/lib/mecab/dic/userdic
```

---

**Note: If you don't have root privileges and cannot place the Manbyo Dictionary under `/usr/local/lib/mecab/dic/userdic`, you can still load our model by overriding the tokenizer settings as follows:**

```bash
# download Manbyo-Dictionary wherever you like

wget https://sociocom.jp/~data/2018-manbyo/data/MANBYO_201907_Dic-utf8.dic
mv MANBYO_201907_Dic-utf8.dic /anywhere/you/like
```

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

model = AutoModelForMaskedLM.from_pretrained("alabnii/jmedroberta-base-manbyo-wordpiece-vocab50000")
tokenizer = AutoTokenizer.from_pretrained("alabnii/jmedroberta-base-manbyo-wordpiece-vocab50000", **{
    "mecab_kwargs": {
        "mecab_option": "-u /anywhere/you/like/MANBYO_201907_Dic-utf8.dic"
    }
})
```
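
You can then check that the user dictionary is actually picked up by inspecting how a disease name is tokenized (an optional sanity check; the example term is arbitrary):

```python
# With the user dictionary loaded, registered disease names are segmented at the
# word level by MeCab before WordPiece is applied; inspect the output to confirm.
print(tokenizer.tokenize("全身性エリテマトーデス"))
```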

---

**Input text must be converted to full-width characters (全角) in advance.**
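
A minimal conversion sketch (a hypothetical helper, not part of this repository; libraries such as `mojimoji` or `jaconv` can also be used):

```python
def to_zenkaku(text: str) -> str:
    # Map printable half-width ASCII (U+0021–U+007E) onto the full-width block
    # (U+FF01–U+FF5E) and replace the half-width space with an ideographic space.
    table = {code: code + 0xFEE0 for code in range(0x21, 0x7F)}
    table[0x20] = 0x3000
    return text.translate(table)

print(to_zenkaku("BMI 25.4"))  # ＢＭＩ　２５．４
```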

You can use this model for masked language modeling as follows:
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

model = AutoModelForMaskedLM.from_pretrained("alabnii/jmedroberta-base-manbyo-wordpiece-vocab50000")
model.eval()
tokenizer = AutoTokenizer.from_pretrained("alabnii/jmedroberta-base-manbyo-wordpiece-vocab50000")

texts = ['この患者は[MASK]と診断された。']
inputs = tokenizer.batch_encode_plus(texts, return_tensors='pt')
outputs = model(**inputs)
tokenizer.convert_ids_to_tokens(outputs.logits[0][1:-1].argmax(axis=-1))
# ['この', '患者', 'は', 'SLE', 'と', '診断', 'さ', 'れ', 'た', '。']
```

Alternatively, you can use the [fill-mask pipeline](https://huggingface.co/tasks/fill-mask).

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="alabnii/jmedroberta-base-manbyo-wordpiece-vocab50000", top_k=10)
fill("この患者は[MASK]と診断された。")
#[{'score': 0.035826072096824646,
#  'token': 10840,
#  'token_str': 'SLE',
#  'sequence': 'この 患者 は SLE と 診断 さ れ た 。'},
# {'score': 0.020926717668771744,
#  'token': 10777,
#  'token_str': '統合失調症',
#  'sequence': 'この 患者 は 統合失調症 と 診断 さ れ た 。'},
# {'score': 0.02092057280242443,
#  'token': 8338,
#  'token_str': '糖尿病',
#  'sequence': 'この 患者 は 糖尿病 と 診断 さ れ た 。'},
# ...
```

You can fine-tune this model on downstream tasks.
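
For example, a task-specific head can be attached in the standard `transformers` way (a minimal sketch; the sequence-classification task and the two-label setup are only placeholders):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("alabnii/jmedroberta-base-manbyo-wordpiece-vocab50000")
# A randomly initialized classification head is added on top of the pre-trained
# encoder; it is then trained on your labeled data (e.g. with the Trainer API).
model = AutoModelForSequenceClassification.from_pretrained(
    "alabnii/jmedroberta-base-manbyo-wordpiece-vocab50000", num_labels=2
)
```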

**See also a sample Colab notebook:** https://colab.research.google.com/drive/1p2770dXs0lge1IkuSHYLO-G-KJ4gZtou?usp=sharing

## Tokenization

For pre-training, texts were first segmented into words with MeCab (using IPAdic and the [Manbyo Dictionary](https://sociocom.naist.jp/manbyou-dic/)), and each word was then split into subword tokens with [WordPiece](https://huggingface.co/course/chapter6/6).
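
For instance, you can inspect how a sentence is segmented (the example sentence is arbitrary, and the exact split depends on the dictionaries):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("alabnii/jmedroberta-base-manbyo-wordpiece-vocab50000")
# MeCab first segments the sentence into words; WordPiece then splits
# out-of-vocabulary words into subwords prefixed with "##".
print(tokenizer.tokenize("この患者は関節リウマチと診断された。"))
```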

## Vocabulary

The vocabulary consists of 50,000 tokens, including words (from IPAdic and the [Manbyo Dictionary](https://sociocom.naist.jp/manbyou-dic/)) and subwords induced by [WordPiece](https://huggingface.co/course/chapter6/6).

## Training procedure

The following hyperparameters were used during pre-training:

- learning_rate: 0.0001
- train_batch_size: 32
- eval_batch_size: 32
- seed: 42
- distributed_type: multi-GPU
- num_devices: 8
- total_train_batch_size: 256
- total_eval_batch_size: 256
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 20000
- training_steps: 2000000
- mixed_precision_training: Native AMP
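
If you want to set up a comparable run with the `transformers` Trainer, the values above roughly map to the following `TrainingArguments` (an illustrative sketch, not the original pre-training script; the output directory is a placeholder):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="jmedroberta-pretraining",  # placeholder
    learning_rate=1e-4,
    per_device_train_batch_size=32,        # 8 GPUs -> total train batch size 256
    per_device_eval_batch_size=32,
    seed=42,
    lr_scheduler_type="linear",
    warmup_steps=20_000,
    max_steps=2_000_000,
    fp16=True,                             # Native AMP
)
```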

## Note: Why do we call our model RoBERTa, not BERT?

As the config file suggests, our model is based on HuggingFace's `BertForMaskedLM` class. However, we consider our model to be **RoBERTa** for the following reasons:

- We trained only on sequences of the maximum length (512 tokens) throughout pre-training.
- We removed the next sentence prediction (NSP) training objective.
- We introduced dynamic masking (changing the masking pattern in each training iteration), as sketched below.
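
A minimal way to reproduce the dynamic-masking behavior with the `transformers` library (an illustrative sketch, not the original pre-training code):

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("alabnii/jmedroberta-base-manbyo-wordpiece-vocab50000")
# The collator samples a fresh set of [MASK] positions every time a batch is built,
# so the masking pattern changes from one training iteration to the next.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
```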

## Acknowledgements

This work was supported by the Japan Science and Technology Agency (JST) AIP Trilateral AI Research (Grant Number: JPMJCR20G9) and the Joint Usage/Research Center for Interdisciplinary Large-scale Information Infrastructures (JHPCN) (Project ID: jh221004) in Japan.  
In this research, we used "[mdx: a platform for the data-driven future](https://mdx.jp/)".