---
language: ja
license: cc-by-nc-sa-4.0
tags:
- roberta
- medical
inference: false
---
# alabnii/jmedroberta-base-manbyo-wordpiece-vocab50000
## Model description
This is a Japanese RoBERTa base model pre-trained on academic articles in the medical sciences collected by the Japan Science and Technology Agency (JST).
This model is released under the [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-nc-sa/4.0/deed) (CC BY-NC-SA 4.0).
## Datasets used for pre-training
- abstracts (train: 1.6GB (10M sentences), validation: 0.2GB (1.3M sentences))
- abstracts & body texts (train: 0.2GB (1.4M sentences))
## How to use
**Before using the model, make sure that [Manbyo Dictionary](https://sociocom.naist.jp/manbyou-dic/) has been downloaded under `/usr/local/lib/mecab/dic/userdic`.**
```bash
# download Manbyo-Dictionary
mkdir -p /usr/local/lib/mecab/dic/userdic
wget https://sociocom.jp/~data/2018-manbyo/data/MANBYO_201907_Dic-utf8.dic
mv MANBYO_201907_Dic-utf8.dic /usr/local/lib/mecab/dic/userdic
```
---
**Note: If you don't have root privileges and cannot place the Manbyo Dictionary under `/usr/local/lib/mecab/dic/userdic`, you can still load our model by overriding the tokenizer settings as follows:**
```bash
# download Manbyo-Dictionary wherever you like
wget https://sociocom.jp/~data/2018-manbyo/data/MANBYO_201907_Dic-utf8.dic
mv MANBYO_201907_Dic-utf8.dic /anywhere/you/like
```
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

model = AutoModelForMaskedLM.from_pretrained("alabnii/jmedroberta-base-manbyo-wordpiece-vocab50000")
# Point MeCab at the user dictionary in the location you chose above
tokenizer = AutoTokenizer.from_pretrained(
    "alabnii/jmedroberta-base-manbyo-wordpiece-vocab50000",
    mecab_kwargs={
        "mecab_option": "-u /anywhere/you/like/MANBYO_201907_Dic-utf8.dic"
    },
)
```
---
**Input text must be converted to full-width characters (全角) in advance.**
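The conversion can be done with a few lines of standard-library Python. The following is a minimal sketch, not the preprocessing used by the authors: the helper name `to_full_width` is hypothetical, it only maps printable ASCII characters (`!` to `~`) to their full-width forms, and it leaves spaces and half-width katakana untouched. A dedicated library may be a better fit if your text contains those.

```python
# Full-width forms of ASCII '!' (0x21) .. '~' (0x7E) live at U+FF01 .. U+FF5E,
# i.e. at a fixed offset of 0xFEE0 from the ASCII code point.
_HALF_TO_FULL = {code: code + 0xFEE0 for code in range(0x21, 0x7F)}


def to_full_width(text: str) -> str:
    """Convert printable ASCII characters to full-width; leave the rest as-is.

    Assumption: spaces and half-width katakana are NOT converted here.
    """
    return text.translate(_HALF_TO_FULL)


print(to_full_width("COVID-19の患者"))
```

Apply the helper to raw text before passing it to the tokenizer. Special tokens such as `[MASK]` should be inserted after the conversion so that they are not converted as well.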
You can use this model for masked language modeling as follows:
```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "alabnii/jmedroberta-base-manbyo-wordpiece-vocab50000"
model = AutoModelForMaskedLM.from_pretrained(model_name)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_name)

texts = ['この患者は[MASK]と診断された。']
inputs = tokenizer(texts, return_tensors='pt')
with torch.no_grad():  # inference only, so no gradients are needed
    outputs = model(**inputs)
# Take the most likely token at each position, skipping [CLS] and [SEP]
predicted_ids = outputs.logits[0][1:-1].argmax(dim=-1)
tokenizer.convert_ids_to_tokens(predicted_ids.tolist())
# ['この', '患者', 'は', 'SLE', 'と', '診断', 'さ', 'れ', 'た', '。']
```
Alternatively, you can use the [fill-mask pipeline](https://huggingface.co/tasks/fill-mask).
```python
from transformers import pipeline
fill = pipeline("fill-mask", model="alabnii/jmedroberta-base-manbyo-wordpiece-vocab50000", top_k=10)
fill("この患者は[MASK]と診断された。")
#[{'score': 0.035826072096824646,
# 'token': 10840,
# 'token_str': 'SLE',
# 'sequence': 'この 患者 は SLE と 診断 さ れ た 。'},
# {'score': 0.020926717668771744,
# 'token': 10777,
# 'token_str': '統合失調症',
# 'sequence': 'この 患者 は 統合失調症 と 診断 さ れ た 。'},
# {'score': 0.02092057280242443,
# 'token': 8338,
# 'token_str': '糖尿病',
# 'sequence': 'この 患者 は 糖尿病 と 診断 さ れ た 。'},
# ...
```
You can fine-tune this model on downstream tasks.
**See also sample Colab notebooks:** https://colab.research.google.com/drive/1p2770dXs0lge1IkuSHYLO-G-KJ4gZtou?usp=sharing
## Tokenization
MeCab (with IPAdic and the [Manbyo Dictionary](https://sociocom.naist.jp/manbyou-dic/)) was used for word segmentation during pre-training. Each word is then split into subword tokens by [WordPiece](https://huggingface.co/course/chapter6/6).
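To illustrate the second stage, here is a minimal sketch of greedy longest-match-first WordPiece splitting applied to a single word that has already been segmented. The function name `wordpiece_tokenize` and the toy vocabulary are hypothetical and are not taken from the model's actual vocabulary file. The `##` continuation prefix follows the usual BERT convention and is an assumption here.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedily split one word into the longest matching vocabulary pieces."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a non-initial piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches at this position: the whole word is unknown
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens


# Toy vocabulary for illustration only (hypothetical)
toy_vocab = {"糖尿病", "糖尿", "##病", "##性", "##腎", "##症"}

print(wordpiece_tokenize("糖尿病", toy_vocab))        # ['糖尿病']
print(wordpiece_tokenize("糖尿病性腎症", toy_vocab))  # ['糖尿病', '##性', '##腎', '##症']
```

A word that is in the vocabulary as a whole stays a single token, which is why adding domain terms from a medical dictionary to the vocabulary can keep disease names intact.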
## Vocabulary
The vocabulary consists of 50,000 tokens, including words (from IPAdic and the [Manbyo Dictionary](https://sociocom.naist.jp/manbyou-dic/)) and subwords induced by [WordPiece](https://huggingface.co/course/chapter6/6).
## Training procedure
The following hyperparameters were used during pre-training:
- learning_rate: 0.0001
- train_batch_size: 32
- eval_batch_size: 32
- seed: 42
- distributed_type: multi-GPU
- num_devices: 8
- total_train_batch_size: 256
- total_eval_batch_size: 256
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 20000
- training_steps: 2000000
- mixed_precision_training: Native AMP
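The relationship between some of these values can be checked with a short calculation. The sketch below assumes that `linear` refers to the usual linear schedule with warmup (a linear ramp from 0 to the peak learning rate, followed by a linear decay to 0 at the final step) and that no gradient accumulation was used. Neither assumption is stated explicitly above, and the function name `linear_lr` is hypothetical.

```python
LEARNING_RATE = 1e-4
WARMUP_STEPS = 20_000
TRAINING_STEPS = 2_000_000
PER_DEVICE_BATCH = 32
NUM_DEVICES = 8


def linear_lr(step: int) -> float:
    """Learning rate at a given step under linear warmup and linear decay."""
    if step < WARMUP_STEPS:
        # Warmup: ramp linearly from 0 up to the peak learning rate
        return LEARNING_RATE * step / WARMUP_STEPS
    # Decay: fall linearly from the peak to 0 at the last training step
    remaining = max(0, TRAINING_STEPS - step)
    return LEARNING_RATE * remaining / (TRAINING_STEPS - WARMUP_STEPS)


# Effective batch size across all devices (assuming no gradient accumulation)
total_batch = PER_DEVICE_BATCH * NUM_DEVICES
print(total_batch)  # 256

for step in (0, 10_000, 20_000, 1_010_000, 2_000_000):
    print(step, linear_lr(step))
```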
## Note: Why do we call our model RoBERTa, not BERT?
As the config file suggests, our model is based on HuggingFace's `BertForMaskedLM` class. However, we consider our model to be **RoBERTa** for the following reasons:
- We trained only on sequences of the maximum length (512 tokens).
- We removed the next sentence prediction (NSP) training objective.
- We introduced dynamic masking (changing the masking pattern in each training iteration).
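The dynamic masking mentioned in the last point can be sketched as follows. This is an illustrative simplification rather than the training code: the function name `dynamic_mask` is hypothetical, the masking probability of 0.15 is the common BERT default and is not stated in this card, and every selected token is replaced by `[MASK]` (the usual random-token and keep-original variants are omitted).

```python
import random


def dynamic_mask(tokens, mask_token="[MASK]", mask_prob=0.15, rng=random):
    """Return a freshly masked copy of `tokens` and the labels to predict.

    Called anew for every training iteration, so the same sentence gets a
    different masking pattern each time (in contrast to static masking,
    where the pattern is fixed once during preprocessing).
    """
    masked = list(tokens)
    labels = [None] * len(tokens)  # None marks positions that are not predicted
    for i, token in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = token
            masked[i] = mask_token
    return masked, labels


tokens = ['この', '患者', 'は', 'SLE', 'と', '診断', 'さ', 'れ', 'た', '。']
rng = random.Random(0)  # seeded only to make the demonstration repeatable
for iteration in range(3):
    masked, _ = dynamic_mask(tokens, rng=rng)
    print(iteration, masked)
```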
## Acknowledgements
This work was supported by the Japan Science and Technology Agency (JST) AIP Trilateral AI Research (Grant Number: JPMJCR20G9) and the Joint Usage/Research Center for Interdisciplinary Large-scale Information Infrastructures (JHPCN) (Project ID: jh221004) in Japan.
In this research work, we used the "[mdx: a platform for the data-driven future](https://mdx.jp/)".