uer committed
Commit 000b4d2
Parent: 8b668c3

Update README.md

Files changed (1)
  1. README.md +36 -36
README.md CHANGED
@@ -26,20 +26,20 @@ You can download the 5 Chinese RoBERTa miniatures either from the [UER-py Github
  | **word-based RoBERTa-Medium** | [**L=8/H=512 (Medium)**][8_512] |
  | **word-based RoBERTa-Base** | [**L=12/H=768 (Base)**][12_768] |

- Here are scores on the development sets of six Chinese tasks:
+ Compared with [char-based models](https://huggingface.co/uer/chinese_roberta_L-2_H-128), word-based models achieve better results in most cases. Here are scores on the development sets of six Chinese tasks:

  | Model | Score | douban | chnsenticorp | lcqmc | tnews(CLUE) | iflytek(CLUE) | ocnli(CLUE) |
  | -------------- | :---: | :----: | :----------: | :---: | :---------: | :-----------: | :---------: |
- | RoBERTa-Tiny(char) | 72.3 | 83.0 | 91.4 | 81.8 | 62.0 | 55.0 | 60.3 |
- | RoBERTa-Tiny(word) | 74.3(+2.0) | 86.4 | 93.2 | 82.0 | 66.4 | 58.2 | 59.6 |
- | RoBERTa-Mini(char) | 75.7 | 84.8 | 93.7 | 86.1 | 63.9 | 58.3 | 67.4 |
- | RoBERTa-Mini(word) | 76.7(+1.0) | 87.6 | 94.1 | 85.4 | 66.9 | 59.2 | 67.3 |
- | RoBERTa-Small(char) | 76.8 | 86.5 | 93.4 | 86.5 | 65.1 | 59.4 | 69.7 |
- | RoBERTa-Small(word) | 78.1(+1.3) | 88.5 | 94.7 | 87.4 | 67.6 | 60.9 | 69.8 |
- | RoBERTa-Medium(char) | 77.8 | 87.6 | 94.8 | 88.1 | 65.6 | 59.5 | 71.2 |
- | RoBERTa-Medium(word) | 78.9(+1.1) | 89.2 | 95.1 | 88.0 | 67.8 | 60.6 | 73.0 |
- | RoBERTa-Base(char) | 79.5 | 89.1 | 95.2 | 89.2 | 67.0 | 60.9 | 75.5 |
- | RoBERTa-Base(word) | 80.2(+0.7) | 90.3 | 95.7 | 89.4 | 68.0 | 61.5 | 76.8 |
+ | RoBERTa-Tiny(char) | 72.3 | 83.0 | 91.4 | 81.8 | 62.0 | 55.0 | 60.3 |
+ | **RoBERTa-Tiny(word)** | **74.3(+2.0)** | **86.4** | **93.2** | **82.0** | **66.4** | **58.2** | **59.6** |
+ | RoBERTa-Mini(char) | 75.7 | 84.8 | 93.7 | 86.1 | 63.9 | 58.3 | 67.4 |
+ | **RoBERTa-Mini(word)** | **76.7(+1.0)** | **87.6** | **94.1** | **85.4** | **66.9** | **59.2** | **67.3** |
+ | RoBERTa-Small(char) | 76.8 | 86.5 | 93.4 | 86.5 | 65.1 | 59.4 | 69.7 |
+ | **RoBERTa-Small(word)** | **78.1(+1.3)** | **88.5** | **94.7** | **87.4** | **67.6** | **60.9** | **69.8** |
+ | RoBERTa-Medium(char) | 77.8 | 87.6 | 94.8 | 88.1 | 65.6 | 59.5 | 71.2 |
+ | **RoBERTa-Medium(word)** | **78.9(+1.1)** | **89.2** | **95.1** | **88.0** | **67.8** | **60.6** | **73.0** |
+ | RoBERTa-Base(char) | 79.5 | 89.1 | 95.2 | 89.2 | 67.0 | 60.9 | 75.5 |
+ | **RoBERTa-Base(word)** | **80.2(+0.7)** | **90.3** | **95.7** | **89.4** | **68.0** | **61.5** | **76.8** |

  For each task, we selected the best fine-tuning hyperparameters from the lists below, and trained with a sequence length of 128:

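For a quick qualitative check of the word-based checkpoints scored above, a converted model can be queried through the Transformers fill-mask pipeline. The snippet below is a minimal sketch: the repository name `uer/roberta-medium-word-chinese-cluecorpussmall` and the example sentence are assumptions, so substitute the identifier listed on the model card.

```
# Minimal sketch: querying a word-based checkpoint with the fill-mask pipeline.
# The repository name is an assumption; use the ID listed on the model card.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="uer/roberta-medium-word-chinese-cluecorpussmall")

# Build the prompt from the tokenizer's own mask token, since the word-based
# models use a sentencepiece vocabulary and mask whole words rather than characters.
text = f"{unmasker.tokenizer.mask_token}的首都是北京。"
for candidate in unmasker(text):
    print(candidate["token_str"], candidate["score"])
```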
@@ -138,51 +138,51 @@ Taking the case of word-based RoBERTa-Medium
  Stage1:

  ```
  python3 preprocess.py --corpus_path corpora/cluecorpussmall.txt \
                        --spm_model_path models/cluecorpussmall_spm.model \
                        --dataset_path cluecorpussmall_word_seq128_dataset.pt \
                        --processes_num 32 --seq_length 128 \
                        --dynamic_masking --target mlm
  ```

  ```
  python3 pretrain.py --dataset_path cluecorpussmall_word_seq128_dataset.pt \
                      --spm_model_path models/cluecorpussmall_spm.model \
                      --config_path models/bert/medium_config.json \
                      --output_model_path models/cluecorpussmall_word_roberta_medium_seq128_model.bin \
                      --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
                      --total_steps 1000000 --save_checkpoint_steps 100000 --report_steps 50000 \
                      --learning_rate 1e-4 --batch_size 64 \
                      --embedding word_pos_seg --encoder transformer --mask fully_visible --target mlm --tie_weights
  ```

  Stage2:

  ```
  python3 preprocess.py --corpus_path corpora/cluecorpussmall.txt \
                        --spm_model_path models/cluecorpussmall_spm.model \
                        --dataset_path cluecorpussmall_word_seq512_dataset.pt \
                        --processes_num 32 --seq_length 512 \
                        --dynamic_masking --target mlm
  ```

  ```
  python3 pretrain.py --dataset_path cluecorpussmall_word_seq512_dataset.pt \
                      --pretrained_model_path models/cluecorpussmall_word_roberta_medium_seq128_model.bin-1000000 \
                      --spm_model_path models/cluecorpussmall_spm.model \
                      --config_path models/bert/medium_config.json \
                      --output_model_path models/cluecorpussmall_word_roberta_medium_seq512_model.bin \
                      --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
                      --total_steps 250000 --save_checkpoint_steps 50000 --report_steps 10000 \
                      --learning_rate 5e-5 --batch_size 16 \
                      --embedding word_pos_seg --encoder transformer --mask fully_visible --target mlm --tie_weights
  ```

  Finally, we convert the pre-trained model into the Hugging Face format:

  ```
  python3 scripts/convert_bert_from_uer_to_huggingface.py --input_model_path models/cluecorpussmall_word_roberta_medium_seq512_model.bin-250000 \
                                                          --output_model_path pytorch_model.bin \
                                                          --layers_num 8 --target mlm
  ```

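The preprocessing and pretraining commands above assume that the sentencepiece model `models/cluecorpussmall_spm.model` already exists. Below is a minimal sketch of how such a model could be trained on the CLUECorpusSmall corpus with the sentencepiece Python API; the vocabulary size, character coverage, and model type are illustrative assumptions, not the settings used for the released checkpoints.

```
# Minimal sketch: training the sentencepiece model consumed via --spm_model_path.
# vocab_size, character_coverage and model_type are illustrative assumptions.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpora/cluecorpussmall.txt",  # raw corpus, one sentence per line
    model_prefix="cluecorpussmall_spm",   # writes cluecorpussmall_spm.model / .vocab
    vocab_size=100000,                    # assumed; set to the desired word vocabulary size
    character_coverage=0.9995,            # common choice for Chinese corpora
    model_type="unigram",                 # sentencepiece default
)
```

The resulting `.model` file is the one passed to `preprocess.py` and `pretrain.py` through `--spm_model_path`.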
 
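Once the conversion script has produced `pytorch_model.bin`, a quick sanity check is to load the state dict with PyTorch and count the encoder layers. This is a hedged sketch that assumes the converted checkpoint follows the standard Hugging Face BERT parameter naming (`bert.encoder.layer.N...`).

```
# Minimal sketch: sanity-checking the converted checkpoint.
# Assumes standard Hugging Face BERT parameter names in pytorch_model.bin.
import torch

state_dict = torch.load("pytorch_model.bin", map_location="cpu")

# Collect the distinct layer indices, e.g. "bert.encoder.layer.0.attention.self.query.weight".
layer_ids = {
    int(name.split(".")[3])
    for name in state_dict
    if name.startswith("bert.encoder.layer.")
}
print(f"{len(state_dict)} tensors, {len(layer_ids)} encoder layers")  # the Medium model has 8 layers
```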