uer committed on
Commit 7ac1aae
1 Parent(s): e5f51b9

Update README.md

Files changed (1)
  1. README.md +36 -36
README.md CHANGED
 
@@ -30,16 +30,16 @@ Compared with [char-based models](https://huggingface.co/uer/chinese_roberta_L-2
 
 | Model | Score | douban | chnsenticorp | lcqmc | tnews(CLUE) | iflytek(CLUE) | ocnli(CLUE) |
 | -------------- | :---: | :----: | :----------: | :---: | :---------: | :-----------: | :---------: |
-| RoBERTa-Tiny (char) | 72.3 | 83.0 | 91.4 | 81.8 | 62.0 | 55.0 | 60.3 |
-| **RoBERTa-Tiny (word)** | **74.3 (+2.0)** | **86.4** | **93.2** | **82.0** | **66.4** | **58.2** | **59.6** |
-| RoBERTa-Mini (char) | 75.7 | 84.8 | 93.7 | 86.1 | 63.9 | 58.3 | 67.4 |
-| **RoBERTa-Mini (word)** | **76.7 (+1.0)** | **87.6** | **94.1** | **85.4** | **66.9** | **59.2** | **67.3** |
-| RoBERTa-Small (char) | 76.8 | 86.5 | 93.4 | 86.5 | 65.1 | 59.4 | 69.7 |
-| **RoBERTa-Small (word)** | **78.1 (+1.3)** | **88.5** | **94.7** | **87.4** | **67.6** | **60.9** | **69.8** |
-| RoBERTa-Medium (char) | 77.8 | 87.6 | 94.8 | 88.1 | 65.6 | 59.5 | 71.2 |
-| **RoBERTa-Medium (word)** | **78.9 (+1.1)** | **89.2** | **95.1** | **88.0** | **67.8** | **60.6** | **73.0** |
-| RoBERTa-Base (char) | 79.5 | 89.1 | 95.2 | 89.2 | 67.0 | 60.9 | 75.5 |
-| **RoBERTa-Base (word)** | **80.2 (+0.7)** | **90.3** | **95.7** | **89.4** | **68.0** | **61.5** | **76.8** |
+| RoBERTa-Tiny(char) | 72.3 | 83.0 | 91.4 | 81.8 | 62.0 | 55.0 | 60.3 |
+| **RoBERTa-Tiny(word)** | **74.3(+2.0)** | **86.4** | **93.2** | **82.0** | **66.4** | **58.2** | **59.6** |
+| RoBERTa-Mini(char) | 75.7 | 84.8 | 93.7 | 86.1 | 63.9 | 58.3 | 67.4 |
+| **RoBERTa-Mini(word)** | **76.7(+1.0)** | **87.6** | **94.1** | **85.4** | **66.9** | **59.2** | **67.3** |
+| RoBERTa-Small(char) | 76.8 | 86.5 | 93.4 | 86.5 | 65.1 | 59.4 | 69.7 |
+| **RoBERTa-Small(word)** | **78.1(+1.3)** | **88.5** | **94.7** | **87.4** | **67.6** | **60.9** | **69.8** |
+| RoBERTa-Medium(char) | 77.8 | 87.6 | 94.8 | 88.1 | 65.6 | 59.5 | 71.2 |
+| **RoBERTa-Medium(word)** | **78.9(+1.1)** | **89.2** | **95.1** | **88.0** | **67.8** | **60.6** | **73.0** |
+| RoBERTa-Base(char) | 79.5 | 89.1 | 95.2 | 89.2 | 67.0 | 60.9 | 75.5 |
+| **RoBERTa-Base(word)** | **80.2(+0.7)** | **90.3** | **95.7** | **89.4** | **68.0** | **61.5** | **76.8** |
 
 For each task, we selected the best fine-tuning hyperparameters from the lists below, and trained with the sequence length of 128:
 
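The hunk above says the best fine-tuning hyperparameters were selected per task from search lists that appear further down in the README, outside this diff. As a rough, non-authoritative illustration of that fine-tuning step, here is a minimal sketch that runs a word-based checkpoint on CLUE tnews with the Hugging Face Trainer; the repository id, the hyperparameter values, and the choice of task are assumptions made for the example, not the authors' exact recipe.

```
# Hedged sketch: fine-tune a word-based RoBERTa on CLUE tnews.
# The repo id and hyperparameters are assumptions, not the card's official recipe.
import numpy as np
from datasets import load_dataset
from transformers import (AlbertTokenizer, BertForSequenceClassification,
                          Trainer, TrainingArguments)

repo = "uer/roberta-medium-word-chinese-cluecorpussmall"  # assumed repo id
tokenizer = AlbertTokenizer.from_pretrained(repo)  # sentencepiece-based tokenizer

dataset = load_dataset("clue", "tnews")  # short-text news classification
num_labels = dataset["train"].features["label"].num_classes

def encode(batch):
    # tnews examples expose a "sentence" string and an integer "label"
    return tokenizer(batch["sentence"], truncation=True, max_length=128)

dataset = dataset.map(encode, batched=True)

model = BertForSequenceClassification.from_pretrained(repo, num_labels=num_labels)

def accuracy(eval_pred):
    logits, labels = eval_pred
    return {"accuracy": float((np.argmax(logits, axis=-1) == labels).mean())}

args = TrainingArguments(
    output_dir="tnews-word-roberta",
    learning_rate=3e-5,               # one point from an assumed search grid
    per_device_train_batch_size=32,
    num_train_epochs=3,
)

trainer = Trainer(model=model, args=args,
                  train_dataset=dataset["train"], eval_dataset=dataset["validation"],
                  tokenizer=tokenizer, compute_metrics=accuracy)
trainer.train()
print(trainer.evaluate())
```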
 
@@ -131,58 +131,58 @@ Since BertTokenizer does not support sentencepiece, AlbertTokenizer is used here
 
 ## Training procedure
 
-Models are pre-trained by [UER-py](https://github.com/dbiir/UER-py/) on [Tencent Cloud](https://cloud.tencent.com/). We pre-train 1,000,000 steps with a sequence length of 128 and then pre-train 250,000 additional steps with a sequence length of 512. We use the same hyper-parameters on different model sizes.
+Models are pre-trained by [UER-py](https://github.com/dbiir/UER-py/) on [Tencent Cloud TI-ONE](https://cloud.tencent.com/product/tione/). We pre-train 1,000,000 steps with a sequence length of 128 and then pre-train 250,000 additional steps with a sequence length of 512. We use the same hyper-parameters on different model sizes.
 
 Taking the case of word-based RoBERTa-Medium
 
 Stage1:
 
 ```
 python3 preprocess.py --corpus_path corpora/cluecorpussmall.txt \
 --spm_model_path models/cluecorpussmall_spm.model \
 --dataset_path cluecorpussmall_word_seq128_dataset.pt \
 --processes_num 32 --seq_length 128 \
 --dynamic_masking --target mlm
 ```
 
 ```
 python3 pretrain.py --dataset_path cluecorpussmall_word_seq128_dataset.pt \
 --spm_model_path models/cluecorpussmall_spm.model \
 --config_path models/bert/medium_config.json \
 --output_model_path models/cluecorpussmall_word_roberta_medium_seq128_model.bin \
 --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
 --total_steps 1000000 --save_checkpoint_steps 100000 --report_steps 50000 \
 --learning_rate 1e-4 --batch_size 64 \
 --embedding word_pos_seg --encoder transformer --mask fully_visible --target mlm --tie_weights
 ```
 
 Stage2:
 
 ```
 python3 preprocess.py --corpus_path corpora/cluecorpussmall.txt \
 --spm_model_path models/cluecorpussmall_spm.model \
 --dataset_path cluecorpussmall_word_seq512_dataset.pt \
 --processes_num 32 --seq_length 512 \
 --dynamic_masking --target mlm
 ```
 
 ```
 python3 pretrain.py --dataset_path cluecorpussmall_word_seq512_dataset.pt \
 --pretrained_model_path models/cluecorpussmall_word_roberta_medium_seq128_model.bin-1000000 \
 --spm_model_path models/cluecorpussmall_spm.model \
 --config_path models/bert/medium_config.json \
 --output_model_path models/cluecorpussmall_word_roberta_medium_seq512_model.bin \
 --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
 --total_steps 250000 --save_checkpoint_steps 50000 --report_steps 10000 \
 --learning_rate 5e-5 --batch_size 16 \
 --embedding word_pos_seg --encoder transformer --mask fully_visible --target mlm --tie_weights
 ```
 
 Finally, we convert the pre-trained model into Huggingface's format:
 
 ```
 python3 scripts/convert_bert_from_uer_to_huggingface.py --input_model_path models/cluecorpussmall_word_roberta_medium_seq512_model.bin-250000 \
 --output_model_path pytorch_model.bin \
 --layers_num 12 --target mlm
 ```
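Once converted, the checkpoint can be loaded with the transformers library. As the hunk header above notes, BertTokenizer does not support sentencepiece, so AlbertTokenizer is used instead. A minimal usage sketch follows; the repository id is an assumption for illustration.

```
# Hedged sketch: load the converted checkpoint and run masked-word prediction.
from transformers import AlbertTokenizer, BertForMaskedLM, pipeline

repo = "uer/roberta-medium-word-chinese-cluecorpussmall"  # assumed repo id
tokenizer = AlbertTokenizer.from_pretrained(repo)  # sentencepiece vocabulary
model = BertForMaskedLM.from_pretrained(repo)

unmasker = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(unmasker("[MASK]的首都是北京。"))  # predicts the masked word
```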
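The preprocessing and pretraining commands above all pass --spm_model_path models/cluecorpussmall_spm.model, a sentencepiece model trained on the same corpus. How that model was produced is not part of this diff, so the sketch below is only a generic illustration with the sentencepiece Python package; the vocabulary size, model type, and character coverage are assumptions, not the settings used for the released models.

```
# Hedged sketch: train a sentencepiece model like the one the commands above expect.
# All settings here are assumptions, not the released cluecorpussmall_spm.model config.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpora/cluecorpussmall.txt",   # corpus path used in the preprocess commands
    model_prefix="cluecorpussmall_spm",    # writes cluecorpussmall_spm.model / .vocab
    vocab_size=100000,                     # assumed vocabulary size
    model_type="unigram",                  # sentencepiece default; assumed
    character_coverage=0.9995,             # common choice for Chinese corpora
)

# Quick check: segment a sentence with the trained model.
sp = spm.SentencePieceProcessor(model_file="cluecorpussmall_spm.model")
print(sp.encode("中国的首都是北京。", out_type=str))
```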