hope04302 committed on
Commit
5fca91b
•
1 Parent(s): c59317c

Upload 7 files

Files changed (7)
  1. .gitattributes +6 -32
  2. README.md +164 -1
  3. config.json +15 -0
  4. flax_model.msgpack +3 -0
  5. pytorch_model.bin +3 -0
  6. tokenizer_config.json +3 -0
  7. vocab.txt +0 -0
.gitattributes CHANGED
@@ -1,35 +1,9 @@
-*.7z filter=lfs diff=lfs merge=lfs -text
-*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin.* filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
 *.bin filter=lfs diff=lfs merge=lfs -text
-*.bz2 filter=lfs diff=lfs merge=lfs -text
-*.ckpt filter=lfs diff=lfs merge=lfs -text
-*.ftz filter=lfs diff=lfs merge=lfs -text
-*.gz filter=lfs diff=lfs merge=lfs -text
 *.h5 filter=lfs diff=lfs merge=lfs -text
-*.joblib filter=lfs diff=lfs merge=lfs -text
-*.lfs.* filter=lfs diff=lfs merge=lfs -text
-*.mlmodel filter=lfs diff=lfs merge=lfs -text
-*.model filter=lfs diff=lfs merge=lfs -text
-*.msgpack filter=lfs diff=lfs merge=lfs -text
-*.npy filter=lfs diff=lfs merge=lfs -text
-*.npz filter=lfs diff=lfs merge=lfs -text
-*.onnx filter=lfs diff=lfs merge=lfs -text
-*.ot filter=lfs diff=lfs merge=lfs -text
-*.parquet filter=lfs diff=lfs merge=lfs -text
-*.pb filter=lfs diff=lfs merge=lfs -text
-*.pickle filter=lfs diff=lfs merge=lfs -text
-*.pkl filter=lfs diff=lfs merge=lfs -text
-*.pt filter=lfs diff=lfs merge=lfs -text
-*.pth filter=lfs diff=lfs merge=lfs -text
-*.rar filter=lfs diff=lfs merge=lfs -text
-*.safetensors filter=lfs diff=lfs merge=lfs -text
-saved_model/**/* filter=lfs diff=lfs merge=lfs -text
-*.tar.* filter=lfs diff=lfs merge=lfs -text
-*.tar filter=lfs diff=lfs merge=lfs -text
 *.tflite filter=lfs diff=lfs merge=lfs -text
-*.tgz filter=lfs diff=lfs merge=lfs -text
-*.wasm filter=lfs diff=lfs merge=lfs -text
-*.xz filter=lfs diff=lfs merge=lfs -text
-*.zip filter=lfs diff=lfs merge=lfs -text
-*.zst filter=lfs diff=lfs merge=lfs -text
-*tfevents* filter=lfs diff=lfs merge=lfs -text
+*.tar.gz filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
 
README.md CHANGED
@@ -1,3 +1,166 @@
 ---
-license: unknown
+language:
+- ko
+
 ---

## KoRean based Bert pre-trained (KR-BERT)

This is a release of Korean-specific, small-scale BERT models with comparable or better performance, developed by the Computational Linguistics Lab at Seoul National University and described in [KR-BERT: A Small-Scale Korean-Specific Language Model](https://arxiv.org/abs/2008.03979).

<br>

### Vocab, Parameters and Data

| | Multilingual BERT<br>(Google) | KorBERT<br>(ETRI) | KoBERT<br>(SKT) | KR-BERT character | KR-BERT sub-character |
| -------------: | ---------------------------------------------: | ---------------------: | ----------------------------------: | -------------------------------------: | -------------------------------------: |
| vocab size | 119,547 | 30,797 | 8,002 | 16,424 | 12,367 |
| parameter size | 167,356,416 | 109,973,391 | 92,186,880 | 99,265,066 | 96,145,233 |
| data size | -<br>(The Wikipedia data<br>for 104 languages) | 23GB<br>4.7B morphemes | -<br>(25M sentences,<br>233M words) | 2.47GB<br>20M sentences,<br>233M words | 2.47GB<br>20M sentences,<br>233M words |
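
The character and sub-character models differ only in vocabulary size, and their parameter counts differ accordingly. As a sanity check on the table (assuming the standard BERT-base parameterization, where each vocabulary entry adds one hidden-size embedding row plus one bias term in the masked-LM output layer), the gap works out exactly:

```python
# Parameter-count difference between the character and sub-character
# KR-BERT models, explained entirely by the vocabulary-size difference.
d_vocab = 16424 - 12367              # 4,057 extra vocabulary entries
hidden = 768                         # BERT-base hidden size
extra = d_vocab * (hidden + 1)       # embedding row + MLM-head bias per entry
assert extra == 99265066 - 96145233  # both sides equal 3,119,833
```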

| Model | Masked LM Accuracy |
| -------------------------------------------- | ------------------ |
| KoBERT | 0.750 |
| KR-BERT character BidirectionalWordPiece | **0.779** |
| KR-BERT sub-character BidirectionalWordPiece | 0.769 |

<br>

### Sub-character

Korean text is basically represented with Hangul syllable characters, which can be decomposed into sub-characters, or graphemes. To accommodate this characteristic, we trained a new vocabulary and BERT model on two different representations of a corpus: syllable characters and sub-characters.

If you use our sub-character model, you should preprocess your data with the code below.

```python
from unicodedata import normalize

from transformers import BertTokenizer

tokenizer_krbert = BertTokenizer.from_pretrained('/path/to/vocab_file.txt', do_lower_case=False)

# Convert a string into sub-characters by decomposing each Hangul
# syllable into its graphemes (NFKD normalization).
def to_subchar(string):
    return normalize('NFKD', string)

sentence = '토크나이저 예시입니다.'
print(tokenizer_krbert.tokenize(to_subchar(sentence)))
```
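
NFKD normalization is what performs the decomposition: it splits each precomposed Hangul syllable into its conjoining jamo. A quick illustration of this standard Unicode behavior, independent of KR-BERT:

```python
from unicodedata import normalize

s = '한'                          # one precomposed Hangul syllable, U+D55C
d = normalize('NFKD', s)          # decomposed into three conjoining jamo
print(len(s), len(d))             # 1 3
print([hex(ord(c)) for c in d])   # ['0x1112', '0x1161', '0x11ab']
```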

### Tokenization

#### BidirectionalWordPiece Tokenizer

We use the BidirectionalWordPiece model to reduce search costs while maintaining the possibility of choice. This model applies BPE in both forward and backward directions to obtain two candidates and chooses the one with the higher frequency. The table below compares segmentations across models; a sketch of the selection step follows the table.

| | Multilingual BERT | KorBERT<br>character | KoBERT | KR-BERT<br>character<br>WordPiece | KR-BERT<br>character<br>BidirectionalWordPiece | KR-BERT<br>sub-character<br>WordPiece | KR-BERT<br>sub-character<br>BidirectionalWordPiece |
| :-------------------------------------: | :-----------------------: | :-----------------------: | :-----------------------: | :------------------------------: | :-------------------------------------------: | :----------------------------------: | :-----------------------------------------------: |
| 냉장고<br>nayngcangko<br>"refrigerator" | 냉#장#고<br>nayng#cang#ko | 냉#장#고<br>nayng#cang#ko | 냉#장#고<br>nayng#cang#ko | 냉장고<br>nayngcangko | 냉장고<br>nayngcangko | 냉장고<br>nayngcangko | 냉장고<br>nayngcangko |
| 춥다<br>chwupta<br>"cold" | [UNK] | 춥#다<br>chwup#ta | 춥#다<br>chwup#ta | 춥#다<br>chwup#ta | 춥#다<br>chwup#ta | 추#ㅂ다<br>chwu#pta | 추#ㅂ다<br>chwu#pta |
| 뱃사람<br>paytsalam<br>"seaman" | [UNK] | 뱃#사람<br>payt#salam | 뱃#사람<br>payt#salam | 뱃#사람<br>payt#salam | 뱃#사람<br>payt#salam | 배#ㅅ#사람<br>pay#t#salam | 배#ㅅ#사람<br>pay#t#salam |
| 마이크<br>maikhu<br>"microphone" | 마#이#크<br>ma#i#khu | 마이#크<br>mai#khu | 마#이#크<br>ma#i#khu | 마이크<br>maikhu | 마이크<br>maikhu | 마이크<br>maikhu | 마이크<br>maikhu |

<br>

### Models

| | TensorFlow | | PyTorch | |
|:---:|:-------------------------------:|:----------------------------:|:----------------------------:|:----------------------------:|
| | character | sub-character | character | sub-character |
| WordPiece <br> tokenizer | [WP char](https://drive.google.com/open?id=1SG5m-3R395VjEEnt0wxWM7SE1j6ndVsX) | [WP subchar](https://drive.google.com/open?id=13oguhQvYD9wsyLwKgU-uLCacQVWA4oHg) | [WP char](https://drive.google.com/file/d/18lsZzx_wonnOezzB5QxqSliA2KL5BF0x/view?usp=sharing) | [WP subchar](https://drive.google.com/open?id=1c1en4AMlCv2k7QapIzqjefnYzNOoh5KZ) |
| Bidirectional <br> WordPiece <br> tokenizer | [BiWP char](https://drive.google.com/open?id=1YhFobehwzdbIxsHHvyFU5okp-HRowRKS) | [BiWP subchar](https://drive.google.com/open?id=12izU0NZXNz9I6IsnknUbencgr7gWHDeM) | [BiWP char](https://drive.google.com/open?id=1C87CCHD9lOQhdgWPkMw_6ZD5M2km7f1p) | [BiWP subchar](https://drive.google.com/file/d/1JvNYFQyb20SWgOiDxZn6h1-n_fjTU25S/view?usp=sharing) |

<!--
#### tensorflow

* BERT tokenizer, character model ([download](https://drive.google.com/open?id=1SG5m-3R395VjEEnt0wxWM7SE1j6ndVsX))
* BidirectionalWordPiece tokenizer, character model ([download](https://drive.google.com/open?id=1YhFobehwzdbIxsHHvyFU5okp-HRowRKS))
* BERT tokenizer, sub-character model ([download](https://drive.google.com/open?id=13oguhQvYD9wsyLwKgU-uLCacQVWA4oHg))
* BidirectionalWordPiece tokenizer, sub-character model ([download](https://drive.google.com/open?id=12izU0NZXNz9I6IsnknUbencgr7gWHDeM))

#### pytorch

* BERT tokenizer, character model ([download](https://drive.google.com/file/d/18lsZzx_wonnOezzB5QxqSliA2KL5BF0x/view?usp=sharing))
* BidirectionalWordPiece tokenizer, character model ([download](https://drive.google.com/open?id=1C87CCHD9lOQhdgWPkMw_6ZD5M2km7f1p))
* BERT tokenizer, sub-character model ([download](https://drive.google.com/open?id=1c1en4AMlCv2k7QapIzqjefnYzNOoh5KZ))
* BidirectionalWordPiece tokenizer, sub-character model ([download](https://drive.google.com/file/d/1JvNYFQyb20SWgOiDxZn6h1-n_fjTU25S/view?usp=sharing))
-->

<br>

### Requirements

- transformers == 2.1.1
- tensorflow < 2.0

<br>

## Downstream tasks

### Naver Sentiment Movie Corpus (NSMC)

* If you want to use the sub-character version of our models, set the `subchar` argument to `True`.
* You can use the original BERT WordPiece tokenizer by passing `bert` to the `tokenizer` argument, or our BidirectionalWordPiece tokenizer by passing `ranked`.
* tensorflow: after downloading our pretrained models, put them in a `models` directory in the `krbert_tensorflow` directory.
* pytorch: after downloading our pretrained models, put them in a `pretrained` directory in the `krbert_pytorch` directory.

```sh
# pytorch
python3 train.py --subchar {True, False} --tokenizer {bert, ranked}

# tensorflow
python3 run_classifier.py \
  --task_name=NSMC \
  --subchar={True, False} \
  --tokenizer={bert, ranked} \
  --do_train=true \
  --do_eval=true \
  --do_predict=true \
  --do_lower_case=False \
  --max_seq_length=128 \
  --train_batch_size=128 \
  --learning_rate=5e-05 \
  --num_train_epochs=5.0 \
  --output_dir={output_dir}
```

The PyTorch code structure is based on that of https://github.com/aisolab/nlp_implementation.

<br>

### NSMC Acc.

| | multilingual BERT | KorBERT | KoBERT | KR-BERT character WordPiece | KR-BERT<br>character Bidirectional WordPiece | KR-BERT sub-character WordPiece | KR-BERT<br>sub-character Bidirectional WordPiece |
|:-----:|-------------------:|----------------:|--------:|----------------------------:|-----------------------------------------:|--------------------------------:|---------------------------------------------:|
| pytorch | - | **89.84** | 89.01 | 89.34 | **89.38** | 89.20 | 89.34 |
| tensorflow | 87.08 | 85.94 | n/a | 89.86 | **90.10** | 89.76 | 89.86 |

<br>

## Citation

If you use these models, please cite the following paper:

```
@article{lee2020krbert,
  title={KR-BERT: A Small-Scale Korean-Specific Language Model},
  author={Sangah Lee and Hansol Jang and Yunmee Baik and Suzi Park and Hyopil Shin},
  year={2020},
  journal={ArXiv},
  volume={abs/2008.03979}
}
```

<br>

## Contacts

nlp.snu@gmail.com
config.json ADDED
@@ -0,0 +1,15 @@
{
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "type_vocab_size": 2,
  "vocab_size": 16424,
  "layer_norm_eps": 1e-12
}
flax_model.msgpack ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:1d0a9b0351c8e59a227ff018f7a3340e62841ddd784df13cd0b7125a62dc481d
size 394627053
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:5d61162104eef931660089a3d1494590de25f78d6b18f04146c81dd2e1b15c8c
size 397106539
tokenizer_config.json ADDED
@@ -0,0 +1,3 @@
{
  "do_lower_case": false
}
vocab.txt ADDED
The diff for this file is too large to render. See raw diff
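
With config.json, pytorch_model.bin, flax_model.msgpack, tokenizer_config.json, and vocab.txt uploaded, the checkpoint can be loaded directly with a recent version of transformers. A minimal sketch; the repository id below is hypothetical, since this commit view does not show the repo name:

```python
from transformers import BertModel, BertTokenizer

repo_id = "hope04302/kr-bert"  # hypothetical; substitute the actual repo path

tokenizer = BertTokenizer.from_pretrained(repo_id)  # reads vocab.txt + tokenizer_config.json
model = BertModel.from_pretrained(repo_id)          # reads config.json + pytorch_model.bin
```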