uer committed on
Commit 443af82
1 parent: 589b700

Update README.md

Files changed (1): README.md (+79, -41)

README.md

---

# Chinese word-based RoBERTa Miniatures

## Model description

This is the set of 5 Chinese word-based RoBERTa models pre-trained by [UER-py](https://arxiv.org/abs/1909.05658).

Most Chinese pre-trained weights are based on Chinese characters. Compared with character-based models, word-based models are faster (because of their shorter sequence length) and perform better according to our experimental results. To this end, we released the 5 Chinese word-based RoBERTa models of different sizes. To make the results easy to reproduce, we used a publicly available corpus and word segmentation tool, and we provide all the training details.

Note that the Hosted inference API (on the right) does not display the output properly: when the predicted word contains multiple characters, only the word itself is shown instead of the whole sentence. Click **JSON Output** to see the full results.

You can download the 5 Chinese RoBERTa miniatures either from the [UER-py Github page](https://github.com/dbiir/UER-py/), or via HuggingFace from the links below:

|                               |              Link               |
| ----------------------------- | :-----------------------------: |
| **word-based RoBERTa-Tiny**   | [**L=2/H=128 (Tiny)**][2_128]   |
| **word-based RoBERTa-Mini**   | [**L=4/H=256 (Mini)**][4_256]   |
| **word-based RoBERTa-Small**  | [**L=4/H=512 (Small)**][4_512]  |
| **word-based RoBERTa-Medium** | [**L=8/H=512 (Medium)**][8_512] |
| **word-based RoBERTa-Base**   | [**L=12/H=768 (Base)**][12_768] |

Compared with [char-based models](https://huggingface.co/uer/chinese_roberta_L-2_H-128), word-based models achieve better results in most cases. Here are the scores on the development sets of six Chinese tasks:

| Model                    |     Score      | douban   | chnsenticorp | lcqmc    | tnews(CLUE) | iflytek(CLUE) | ocnli(CLUE) |
| ------------------------ | :------------: | :------: | :----------: | :------: | :---------: | :-----------: | :---------: |
| RoBERTa-Tiny(char)       |      72.3      | 83.0     | 91.4         | 81.8     | 62.0        | 55.0          | 60.3        |
| **RoBERTa-Tiny(word)**   | **74.3(+2.0)** | **86.4** | **93.2**     | **82.0** | **66.4**    | **58.2**      | **59.6**    |
| RoBERTa-Mini(char)       |      75.7      | 84.8     | 93.7         | 86.1     | 63.9        | 58.3          | 67.4        |
| **RoBERTa-Mini(word)**   | **76.7(+1.0)** | **87.6** | **94.1**     | **85.4** | **66.9**    | **59.2**      | **67.3**    |
| RoBERTa-Small(char)      |      76.8      | 86.5     | 93.4         | 86.5     | 65.1        | 59.4          | 69.7        |
| **RoBERTa-Small(word)**  | **78.1(+1.3)** | **88.5** | **94.7**     | **87.4** | **67.6**    | **60.9**      | **69.8**    |
| RoBERTa-Medium(char)     |      77.8      | 87.6     | 94.8         | 88.1     | 65.6        | 59.5          | 71.2        |
| **RoBERTa-Medium(word)** | **78.9(+1.1)** | **89.2** | **95.1**     | **88.0** | **67.8**    | **60.6**      | **73.0**    |
| RoBERTa-Base(char)       |      79.5      | 89.1     | 95.2         | 89.2     | 67.0        | 60.9          | 75.5        |
| **RoBERTa-Base(word)**   | **80.2(+0.7)** | **90.3** | **95.7**     | **89.4** | **68.0**    | **61.5**      | **76.8**    |

For each task, we selected the best fine-tuning hyperparameters from the grid below and trained with a sequence length of 128 (a minimal fine-tuning sketch follows the list):

- epochs: 3, 5, 8
- batch sizes: 32, 64
- learning rates: 3e-5, 1e-4, 3e-4
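
The card does not include the fine-tuning script itself. As a rough illustration only, the sketch below fine-tunes the Medium model with one configuration from the grid above (3 epochs, batch size 32, learning rate 3e-5), using a tiny made-up binary sentiment dataset as a stand-in for a real task such as chnsenticorp; the example sentences, labels, and output directory are placeholders.

```python
# Minimal fine-tuning sketch (illustrative only): 3 epochs, batch size 32, lr 3e-5.
# The two example sentences and their labels are placeholders, not a real dataset.
import torch
from transformers import (AlbertTokenizer, BertForSequenceClassification,
                          Trainer, TrainingArguments)

model_name = "uer/roberta-medium-word-chinese-cluecorpussmall"
tokenizer = AlbertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)

texts = ["这家餐厅的菜很好吃。", "物流太慢,体验很差。"]  # placeholder examples
labels = [1, 0]                                          # 1 = positive, 0 = negative
encodings = tokenizer(texts, truncation=True, padding="max_length", max_length=128)

class TinyDataset(torch.utils.data.Dataset):
    """Wraps the tokenized examples so they can be fed to Trainer."""
    def __init__(self, encodings, labels):
        self.encodings, self.labels = encodings, labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

args = TrainingArguments(output_dir="finetuned-word-roberta",  # placeholder path
                         num_train_epochs=3,
                         per_device_train_batch_size=32,
                         learning_rate=3e-5)
trainer = Trainer(model=model, args=args, train_dataset=TinyDataset(encodings, labels))
trainer.train()
```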

## How to use

You can use this model directly with a pipeline for masked language modeling (taking word-based RoBERTa-Medium as an example):

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='uer/roberta-medium-word-chinese-cluecorpussmall')
>>> unmasker("[MASK]的首都是北京。")
[
    {'sequence': '中国 的首都是北京。',
    ...
]
```

Here is how to use this model to get the features of a given text in PyTorch:

```python
from transformers import AlbertTokenizer, BertModel
tokenizer = AlbertTokenizer.from_pretrained('uer/roberta-medium-word-chinese-cluecorpussmall')
model = BertModel.from_pretrained("uer/roberta-medium-word-chinese-cluecorpussmall")
text = "用你喜欢的任何文本替换我。"
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```

and in TensorFlow:

```python
from transformers import AlbertTokenizer, TFBertModel
tokenizer = AlbertTokenizer.from_pretrained('uer/roberta-medium-word-chinese-cluecorpussmall')
model = TFBertModel.from_pretrained("uer/roberta-medium-word-chinese-cluecorpussmall")
text = "用你喜欢的任何文本替换我。"
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```

Since BertTokenizer does not support sentencepiece, AlbertTokenizer is used here.
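
Because the tokenizer is sentencepiece-based, text is segmented into words rather than single characters. A quick way to inspect this (the exact tokens depend on the released sentencepiece model, so the output described in the comment is only indicative):

```python
# Inspect the word-level segmentation produced by the sentencepiece-based tokenizer.
from transformers import AlbertTokenizer

tokenizer = AlbertTokenizer.from_pretrained('uer/roberta-medium-word-chinese-cluecorpussmall')
print(tokenizer.tokenize("中国的首都是北京。"))
# Expect word-level pieces (multi-character tokens) rather than one token per character.
```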

## Training data

[CLUECorpusSmall](https://github.com/CLUEbenchmark/CLUECorpus2020/) is used as training data. Google's [sentencepiece](https://github.com/google/sentencepiece) is used for word segmentation. The sentencepiece model is trained on the CLUECorpusSmall corpus:
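(The exact command used by the authors is not shown in this excerpt. For reference, a sentencepiece model can be trained on a plain-text corpus roughly as follows; the vocabulary size and file names here are placeholders rather than the authors' settings.)

```python
# Rough sketch of training a sentencepiece model on a raw-text corpus.
# vocab_size and the file names are placeholders, not the authors' settings.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpora/cluecorpussmall.txt",  # one sentence per line
    model_prefix="cluecorpussmall_spm",   # writes cluecorpussmall_spm.model / .vocab
    vocab_size=100000,                    # placeholder value
    character_coverage=0.9995,            # common choice for Chinese text
)
```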
 

## Training procedure

Models are pre-trained by [UER-py](https://github.com/dbiir/UER-py/) on [Tencent Cloud](https://cloud.tencent.com/). We pre-train for 1,000,000 steps with a sequence length of 128 and then for 250,000 additional steps with a sequence length of 512. We use the same hyper-parameters across the different model sizes.

Taking word-based RoBERTa-Medium as an example:

Stage 1:

```
python3 preprocess.py --corpus_path corpora/cluecorpussmall.txt \
                      --spm_model_path models/cluecorpussmall_spm.model \
                      --dataset_path cluecorpussmall_word_seq128_dataset.pt \
                      --processes_num 32 --seq_length 128 \
                      --dynamic_masking --target mlm
```

```
python3 pretrain.py --dataset_path cluecorpussmall_word_seq128_dataset.pt \
                    --spm_model_path models/cluecorpussmall_spm.model \
                    --config_path models/bert/medium_config.json \
                    --output_model_path models/cluecorpussmall_word_roberta_medium_seq128_model.bin \
                    --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
                    --total_steps 1000000 --save_checkpoint_steps 100000 --report_steps 50000 \
                    --learning_rate 1e-4 --batch_size 64 \
                    --embedding word_pos_seg --encoder transformer --mask fully_visible --target mlm --tie_weights
```

Stage 2:

```
python3 preprocess.py --corpus_path corpora/cluecorpussmall.txt \
                      --spm_model_path models/cluecorpussmall_spm.model \
                      --dataset_path cluecorpussmall_word_seq512_dataset.pt \
                      --processes_num 32 --seq_length 512 \
                      --dynamic_masking --target mlm
```

```
python3 pretrain.py --dataset_path cluecorpussmall_word_seq512_dataset.pt \
                    --pretrained_model_path models/cluecorpussmall_word_roberta_medium_seq128_model.bin-1000000 \
                    --spm_model_path models/cluecorpussmall_spm.model \
                    --config_path models/bert/medium_config.json \
                    --output_model_path models/cluecorpussmall_word_roberta_medium_seq512_model.bin \
                    --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
                    --total_steps 250000 --save_checkpoint_steps 50000 --report_steps 10000 \
                    --learning_rate 5e-5 --batch_size 16 \
                    --embedding word_pos_seg --encoder transformer --mask fully_visible --target mlm --tie_weights
```

Finally, we convert the pre-trained model into Huggingface's format:

```
python3 scripts/convert_bert_from_uer_to_huggingface.py --input_model_path models/cluecorpussmall_word_roberta_medium_seq512_model.bin-250000 \
                                                        --output_model_path pytorch_model.bin \
                                                        --layers_num 8 --target mlm
```
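
As a quick check that the converted checkpoint loads, the sketch below runs a forward pass. It assumes the converted files (pytorch_model.bin, the matching config.json, and the sentencepiece files) have been gathered in one local directory; the directory path is a placeholder.

```python
# Sanity-check the converted checkpoint. "path/to/converted_model" is a placeholder
# directory assumed to contain pytorch_model.bin, config.json and the sentencepiece files.
from transformers import AlbertTokenizer, BertModel

model_dir = "path/to/converted_model"
tokenizer = AlbertTokenizer.from_pretrained(model_dir)
model = BertModel.from_pretrained(model_dir)

inputs = tokenizer("北京是中国的首都。", return_tensors="pt")
print(model(**inputs).last_hidden_state.shape)  # e.g. torch.Size([1, seq_len, 512]) for Medium
```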

### BibTeX entry and citation info

```
@article{devlin2018bert,
  title={BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding},
  author={Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina},
  journal={arXiv preprint arXiv:1810.04805},
  year={2018}
}

@article{turc2019,
  title={Well-Read Students Learn Better: On the Importance of Pre-training Compact Models},
  author={Turc, Iulia and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina},
  journal={arXiv preprint arXiv:1908.08962v2},
  year={2019}
}

@article{zhao2019uer,
  title={UER: An Open-Source Toolkit for Pre-training Models},
  author={Zhao, Zhe and Chen, Hui and Zhang, Jinbin and Zhao, Xin and Liu, Tao and Lu, Wei and Chen, Xi and Deng, Haotang and Ju, Qi and Du, Xiaoyong},
  journal={arXiv preprint arXiv:1909.05658},
  year={2019}
}
```