kskshr committed
Commit 2d55de6
1 Parent(s): 8682efc

Update README.md

Files changed (1): README.md +80 -0

---
language: ja
license: cc-by-sa-4.0
library_name: transformers
tags:
- bert
- fill-mask
datasets:
- wikipedia
mask_token: "[MASK]"
widget:
- text: "[MASK] 大学 で 自然 言語 処理 を 専攻 する 。"
---

# ku-accms/bert-base-japanese-ssuw
## Model description
This is a Japanese BERT base model pre-trained on a Japanese Wikipedia dump with super short unit words (SSUW).

## Pre-processing
The input text should be converted to full-width (zenkaku) characters and segmented into super short unit words in advance, e.g., with KyTea (see the preprocessing snippet under "How to use" below).

## How to use
You can use this model directly with a pipeline for masked language modeling:

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='ku-accms/bert-base-japanese-ssuw')
>>> unmasker("[MASK] 大学 で 自然 言語 処理 を 専攻 する 。")
[{'sequence': 'スタンフォード 大学 で 自然 言語 処理 を 専攻 する 。',
  'score': 0.13041487336158752,
  'token': 26978,
  'token_str': 'スタンフォード'},
 {'sequence': '早稲田 大学 で 自然 言語 処理 を 専攻 する 。',
  'score': 0.05302431806921959,
  'token': 17048,
  'token_str': '早稲田'},
 {'sequence': 'ハーバード 大学 で 自然 言語 処理 を 専攻 する 。',
  'score': 0.048841025680303574,
  'token': 21731,
  'token_str': 'ハーバード'},
 {'sequence': '筑波 大学 で 自然 言語 処理 を 専攻 する 。',
  'score': 0.04634753614664078,
  'token': 20287,
  'token_str': '筑波'},
 {'sequence': '東京 大学 で 自然 言語 処理 を 専攻 する 。',
  'score': 0.030050478875637054,
  'token': 13949,
  'token_str': '東京'}]
```

Here is how to use this model to get the features of a given text in PyTorch:

```python
import zenhan
import Mykytea

# Convert the input to full-width characters and segment it into super short unit words with KyTea
kytea_model_path = "somewhere"  # path to a trained KyTea model
kytea = Mykytea.Mykytea("-model {} -notags".format(kytea_model_path))
def preprocess(text):
    return " ".join(kytea.getWS(zenhan.h2z(text)))

from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('ku-accms/bert-base-japanese-ssuw')
model = BertModel.from_pretrained("ku-accms/bert-base-japanese-ssuw")
text = "京都大学で自然言語処理を専攻する。"
encoded_input = tokenizer(preprocess(text), return_tensors='pt')
output = model(**encoded_input)
```
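
The snippet above stops at `output`; as a minimal follow-up sketch (the variable names are illustrative, the fields are the standard `transformers` BERT outputs), the features can be read out like this:

```python
# BertModel returns last_hidden_state of shape (batch, seq_len, hidden)
# and pooler_output of shape (batch, hidden).
token_embeddings = output.last_hidden_state       # contextual vector for every subword token
cls_embedding = output.last_hidden_state[:, 0]    # [CLS] vector, a common sentence-level feature
print(token_embeddings.shape, cls_embedding.shape)
```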

## Training data
We used a Japanese Wikipedia dump (as of 20230101, 3.7 GB).

## Training procedure
We first segmented the texts into words with KyTea and then tokenized the words into subwords using WordPiece with a vocabulary size of 32,000. We pre-trained the BERT model using the [transformers](https://github.com/huggingface/transformers) library. The training took about 8 days on 4 NVIDIA A100-SXM4-80GB GPUs.
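
The tokenizer-training code is not included in this card; the sketch below is only an assumption of how a 32,000-entry WordPiece vocabulary over KyTea-segmented text could be built with the Hugging Face `tokenizers` library (the corpus path is hypothetical):

```python
from tokenizers import BertWordPieceTokenizer

# Assumed setup: the corpus has already been converted to zenkaku and segmented into
# super short unit words (whitespace-separated) with KyTea.
tokenizer = BertWordPieceTokenizer(
    lowercase=False,
    handle_chinese_chars=False,  # keep the KyTea segmentation instead of splitting CJK per character
)
tokenizer.train(
    files=["wikipedia_ja_ssuw.txt"],  # hypothetical path to the pre-segmented dump
    vocab_size=32000,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.save_model("ssuw-wordpiece")  # writes vocab.txt, loadable by BertTokenizer
```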

The following hyperparameters were used for the pre-training:

- learning_rate: 2e-4
- weight_decay: 1e-2
- per_device_train_batch_size: 80
- num_devices: 4
- gradient_accumulation_steps: 3
- total_train_batch_size: 960
- max_seq_length: 512
- optimizer: AdamW with betas=(0.9, 0.999) and epsilon=1e-06
- lr_scheduler_type: linear schedule with warmup
- training_steps: 500,000
- warmup_steps: 10,000
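
The training script itself is not part of this repository; as a rough sketch only, the hyperparameters above map onto `transformers` `TrainingArguments` roughly as follows (the output directory is a placeholder, and the model/dataset/Trainer wiring is omitted):

```python
from transformers import TrainingArguments

# Illustrative mapping of the listed hyperparameters; max_seq_length=512 is applied
# when tokenizing the corpus, not here.
training_args = TrainingArguments(
    output_dir="bert-base-japanese-ssuw-pretrain",  # placeholder
    learning_rate=2e-4,
    weight_decay=1e-2,
    per_device_train_batch_size=80,   # x 4 devices x 3 accumulation steps = 960 total
    gradient_accumulation_steps=3,
    max_steps=500_000,
    warmup_steps=10_000,
    lr_scheduler_type="linear",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-6,
)
```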