uer committed on
Commit 7ac1aae
1 Parent(s): e5f51b9

Update README.md

Files changed (1)
  1. README.md +36 -36
README.md CHANGED
 
@@ -30,16 +30,16 @@ Compared with [char-based models](https://huggingface.co/uer/chinese_roberta_L-2
 
 | Model | Score | douban | chnsenticorp | lcqmc | tnews(CLUE) | iflytek(CLUE) | ocnli(CLUE) |
 | -------------- | :---: | :----: | :----------: | :---: | :---------: | :-----------: | :---------: |
-| RoBERTa-Tiny (char) | 72.3 | 83.0 | 91.4 | 81.8 | 62.0 | 55.0 | 60.3 |
-| **RoBERTa-Tiny (word)** | **74.3 (+2.0)** | **86.4** | **93.2** | **82.0** | **66.4** | **58.2** | **59.6** |
-| RoBERTa-Mini (char) | 75.7 | 84.8 | 93.7 | 86.1 | 63.9 | 58.3 | 67.4 |
-| **RoBERTa-Mini (word)** | **76.7 (+1.0)** | **87.6** | **94.1** | **85.4** | **66.9** | **59.2** | **67.3** |
-| RoBERTa-Small (char) | 76.8 | 86.5 | 93.4 | 86.5 | 65.1 | 59.4 | 69.7 |
-| **RoBERTa-Small (word)** | **78.1 (+1.3)** | **88.5** | **94.7** | **87.4** | **67.6** | **60.9** | **69.8** |
-| RoBERTa-Medium (char) | 77.8 | 87.6 | 94.8 | 88.1 | 65.6 | 59.5 | 71.2 |
-| **RoBERTa-Medium (word)** | **78.9 (+1.1)** | **89.2** | **95.1** | **88.0** | **67.8** | **60.6** | **73.0** |
-| RoBERTa-Base (char) | 79.5 | 89.1 | 95.2 | 89.2 | 67.0 | 60.9 | 75.5 |
-| **RoBERTa-Base (word)** | **80.2 (+0.7)** | **90.3** | **95.7** | **89.4** | **68.0** | **61.5** | **76.8** |
+| RoBERTa-Tiny(char) | 72.3 | 83.0 | 91.4 | 81.8 | 62.0 | 55.0 | 60.3 |
+| **RoBERTa-Tiny(word)** | **74.3(+2.0)** | **86.4** | **93.2** | **82.0** | **66.4** | **58.2** | **59.6** |
+| RoBERTa-Mini(char) | 75.7 | 84.8 | 93.7 | 86.1 | 63.9 | 58.3 | 67.4 |
+| **RoBERTa-Mini(word)** | **76.7(+1.0)** | **87.6** | **94.1** | **85.4** | **66.9** | **59.2** | **67.3** |
+| RoBERTa-Small(char) | 76.8 | 86.5 | 93.4 | 86.5 | 65.1 | 59.4 | 69.7 |
+| **RoBERTa-Small(word)** | **78.1(+1.3)** | **88.5** | **94.7** | **87.4** | **67.6** | **60.9** | **69.8** |
+| RoBERTa-Medium(char) | 77.8 | 87.6 | 94.8 | 88.1 | 65.6 | 59.5 | 71.2 |
+| **RoBERTa-Medium(word)** | **78.9(+1.1)** | **89.2** | **95.1** | **88.0** | **67.8** | **60.6** | **73.0** |
+| RoBERTa-Base(char) | 79.5 | 89.1 | 95.2 | 89.2 | 67.0 | 60.9 | 75.5 |
+| **RoBERTa-Base(word)** | **80.2(+0.7)** | **90.3** | **95.7** | **89.4** | **68.0** | **61.5** | **76.8** |
 
 For each task, we selected the best fine-tuning hyperparameters from the lists below, and trained with the sequence length of 128:
 
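The hunk above says the best fine-tuning hyperparameters were selected per task from search lists that appear further down in the README, outside this diff. As a rough, non-authoritative illustration of that fine-tuning step, here is a minimal sketch that runs a word-based checkpoint on CLUE tnews with the Hugging Face Trainer; the repository id, the hyperparameter values, and the choice of task are assumptions made for the example, not the authors' exact recipe.

```
# Hedged sketch: fine-tune a word-based RoBERTa on CLUE tnews.
# The repo id and hyperparameters are assumptions, not the card's official recipe.
import numpy as np
from datasets import load_dataset
from transformers import (AlbertTokenizer, BertForSequenceClassification,
                          Trainer, TrainingArguments)

repo = "uer/roberta-medium-word-chinese-cluecorpussmall"  # assumed repo id
tokenizer = AlbertTokenizer.from_pretrained(repo)  # sentencepiece-based tokenizer

dataset = load_dataset("clue", "tnews")  # short-text news classification
num_labels = dataset["train"].features["label"].num_classes

def encode(batch):
    # tnews examples expose a "sentence" string and an integer "label"
    return tokenizer(batch["sentence"], truncation=True, max_length=128)

dataset = dataset.map(encode, batched=True)

model = BertForSequenceClassification.from_pretrained(repo, num_labels=num_labels)

def accuracy(eval_pred):
    logits, labels = eval_pred
    return {"accuracy": float((np.argmax(logits, axis=-1) == labels).mean())}

args = TrainingArguments(
    output_dir="tnews-word-roberta",
    learning_rate=3e-5,               # one point from an assumed search grid
    per_device_train_batch_size=32,
    num_train_epochs=3,
)

trainer = Trainer(model=model, args=args,
                  train_dataset=dataset["train"], eval_dataset=dataset["validation"],
                  tokenizer=tokenizer, compute_metrics=accuracy)
trainer.train()
print(trainer.evaluate())
```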
 
@@ -131,58 +131,58 @@ Since BertTokenizer does not support sentencepiece, AlbertTokenizer is used here
 
 ## Training procedure
 
-Models are pre-trained by [UER-py](https://github.com/dbiir/UER-py/) on [Tencent Cloud](https://cloud.tencent.com/). We pre-train 1,000,000 steps with a sequence length of 128 and then pre-train 250,000 additional steps with a sequence length of 512. We use the same hyper-parameters on different model sizes.
+Models are pre-trained by [UER-py](https://github.com/dbiir/UER-py/) on [Tencent Cloud TI-ONE](https://cloud.tencent.com/product/tione/). We pre-train 1,000,000 steps with a sequence length of 128 and then pre-train 250,000 additional steps with a sequence length of 512. We use the same hyper-parameters on different model sizes.
 
 Taking the case of word-based RoBERTa-Medium
 
 Stage1:
 
 ```
 python3 preprocess.py --corpus_path corpora/cluecorpussmall.txt \
 --spm_model_path models/cluecorpussmall_spm.model \
 --dataset_path cluecorpussmall_word_seq128_dataset.pt \
 --processes_num 32 --seq_length 128 \
 --dynamic_masking --target mlm
 ```
 
 ```
 python3 pretrain.py --dataset_path cluecorpussmall_word_seq128_dataset.pt \
 --spm_model_path models/cluecorpussmall_spm.model \
 --config_path models/bert/medium_config.json \
 --output_model_path models/cluecorpussmall_word_roberta_medium_seq128_model.bin \
 --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
 --total_steps 1000000 --save_checkpoint_steps 100000 --report_steps 50000 \
 --learning_rate 1e-4 --batch_size 64 \
 --embedding word_pos_seg --encoder transformer --mask fully_visible --target mlm --tie_weights
 ```
 
 Stage2:
 
 ```
 python3 preprocess.py --corpus_path corpora/cluecorpussmall.txt \
 --spm_model_path models/cluecorpussmall_spm.model \
 --dataset_path cluecorpussmall_word_seq512_dataset.pt \
 --processes_num 32 --seq_length 512 \
 --dynamic_masking --target mlm
 ```
 
 ```
 python3 pretrain.py --dataset_path cluecorpussmall_word_seq512_dataset.pt \
 --pretrained_model_path models/cluecorpussmall_word_roberta_medium_seq128_model.bin-1000000 \
 --spm_model_path models/cluecorpussmall_spm.model \
 --config_path models/bert/medium_config.json \
 --output_model_path models/cluecorpussmall_word_roberta_medium_seq512_model.bin \
 --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
 --total_steps 250000 --save_checkpoint_steps 50000 --report_steps 10000 \
 --learning_rate 5e-5 --batch_size 16 \
 --embedding word_pos_seg --encoder transformer --mask fully_visible --target mlm --tie_weights
 ```
 
 Finally, we convert the pre-trained model into Huggingface's format:
 
 ```
 python3 scripts/convert_bert_from_uer_to_huggingface.py --input_model_path models/cluecorpussmall_word_roberta_medium_seq512_model.bin-250000 \
 --output_model_path pytorch_model.bin \
 --layers_num 12 --target mlm
 ```
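Once converted, the checkpoint can be loaded with the transformers library. As the hunk header above notes, BertTokenizer does not support sentencepiece, so AlbertTokenizer is used instead. A minimal usage sketch follows; the repository id is an assumption for illustration.

```
# Hedged sketch: load the converted checkpoint and run masked-word prediction.
from transformers import AlbertTokenizer, BertForMaskedLM, pipeline

repo = "uer/roberta-medium-word-chinese-cluecorpussmall"  # assumed repo id
tokenizer = AlbertTokenizer.from_pretrained(repo)  # sentencepiece vocabulary
model = BertForMaskedLM.from_pretrained(repo)

unmasker = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(unmasker("[MASK]的首都是北京。"))  # predicts the masked word
```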
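The preprocessing and pretraining commands above all pass --spm_model_path models/cluecorpussmall_spm.model, a sentencepiece model trained on the same corpus. How that model was produced is not part of this diff, so the sketch below is only a generic illustration with the sentencepiece Python package; the vocabulary size, model type, and character coverage are assumptions, not the settings used for the released models.

```
# Hedged sketch: train a sentencepiece model like the one the commands above expect.
# All settings here are assumptions, not the released cluecorpussmall_spm.model config.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpora/cluecorpussmall.txt",   # corpus path used in the preprocess commands
    model_prefix="cluecorpussmall_spm",    # writes cluecorpussmall_spm.model / .vocab
    vocab_size=100000,                     # assumed vocabulary size
    model_type="unigram",                  # sentencepiece default; assumed
    character_coverage=0.9995,             # common choice for Chinese corpora
)

# Quick check: segment a sentence with the trained model.
sp = spm.SentencePieceProcessor(model_file="cluecorpussmall_spm.model")
print(sp.encode("中国的首都是北京。", out_type=str))
```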