uer committed on
Commit 3a33988
1 Parent(s): 6e8e3b2

Update README.md

Files changed (1)
  1. README.md +53 -38
README.md CHANGED
@@ -3,6 +3,8 @@ language: Chinese
datasets: CLUECorpusSmall
widget:
- text: "北京是[MASK]国的首都。"
---

@@ -16,26 +18,27 @@ This is the set of 24 Chinese RoBERTa models pre-trained by [UER-py](https://www
You can download the 24 Chinese RoBERTa miniatures either from the [UER-py Github page](https://github.com/dbiir/UER-py/), or via HuggingFace from the links below:

- | |H=128|H=256|H=512|H=768|
- |---|:---:|:---:|:---:|:---:|
- | **L=2** |[**2/128 (Tiny)**][2_128]|[2/256]|[2/512]|[2/768]|
- | **L=4** |[4/128]|[**4/256 (Mini)**][4_256]|[**4/512 (Small)**][4_512]|[4/768]|
- | **L=6** |[6/128]|[6/256]|[6/512]|[6/768]|
- | **L=8** |[8/128]|[8/256]|[**8/512 (Medium)**][8_512]|[8/768]|
- | **L=10** |[10/128]|[10/256]|[10/512]|[10/768]|
- | **L=12** |[12/128]|[12/256]|[12/512]|[**12/768 (Base)**][12_768]|

Here are scores on the development sets of six Chinese tasks:

- |Model|Score|douban|chnsenticorp|lcqmc|tnews(CLUE)|iflytek(CLUE)|ocnli(CLUE)|
- |---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
- |RoBERTa-Tiny|72.3|83.0|91.4|81.8|62.0|55.0|60.3|
- |RoBERTa-Mini|75.7|84.8|93.7|86.1|63.9|58.3|67.4|
- |RoBERTa-Small|76.8|86.5|93.4|86.5|65.1|59.4|69.7|
- |RoBERTa-Medium|77.8|87.6|94.8|88.1|65.6|59.5|71.2|
- |RoBERTa-Base|79.5|89.1|95.2|89.2|67.0|60.9|75.5|

- For each task, we selected the best fine-tuning hyperparameters from the lists below:
  - epochs: 3, 5, 8
  - batch sizes: 32, 64
  - learning rates: 3e-5, 1e-4, 3e-4
@@ -96,7 +99,7 @@ output = model(encoded_input)
## Training data

- CLUECorpusSmall is used as training data. We found that models pre-trained on CLUECorpusSmall outperform those pre-trained on CLUECorpus2020, although CLUECorpus2020 is much larger than CLUECorpusSmall.

## Training procedure

@@ -105,41 +108,54 @@ Models are pre-trained by [UER-py](https://github.com/dbiir/UER-py/) on [Tencent
Taking RoBERTa-Medium as an example:

Stage 1:
```
python3 preprocess.py --corpus_path corpora/cluecorpussmall.txt \
                      --vocab_path models/google_zh_vocab.txt \
-                     --dataset_path cluecorpussmall_seq128_dataset.pt \
-                     --processes_num 32 --seq_length 128 \
-                     --dynamic_masking --target mlm
```
```
python3 pretrain.py --dataset_path cluecorpussmall_seq128_dataset.pt \
                    --vocab_path models/google_zh_vocab.txt \
-                   --config_path models/bert_medium_config.json \
-                   --output_model_path models/cluecorpussmall_roberta_medium_seq128_model.bin \
-                   --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
-                   --total_steps 1000000 --save_checkpoint_steps 100000 --report_steps 50000 \
-                   --learning_rate 1e-4 --batch_size 64 \
-                   --tie_weights --embedding word_pos_seg --encoder transformer --mask fully_visible --target mlm
```
Stage 2:
```
python3 preprocess.py --corpus_path corpora/cluecorpussmall.txt \
                      --vocab_path models/google_zh_vocab.txt \
-                     --dataset_path cluecorpussmall_seq512_dataset.pt \
-                     --processes_num 32 --seq_length 512 \
-                     --dynamic_masking --target mlm
```
```
python3 pretrain.py --dataset_path cluecorpussmall_seq512_dataset.pt \
                    --pretrained_model_path models/cluecorpussmall_roberta_medium_seq128_model.bin-1000000 \
-                   --vocab_path models/google_zh_vocab.txt \
-                   --config_path models/bert_medium_config.json \
-                   --output_model_path models/cluecorpussmall_roberta_medium_seq512_model.bin \
-                   --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
-                   --total_steps 250000 --save_checkpoint_steps 50000 --report_steps 10000 \
-                   --learning_rate 5e-5 --batch_size 16 \
-                   --tie_weights --embedding word_pos_seg --encoder transformer --mask fully_visible --target mlm
```

  ### BibTeX entry and citation info
@@ -158,5 +174,4 @@ python3 pretrain.py --dataset_path cluecorpussmall_seq512_dataset.pt \
[4_256]: https://huggingface.co/uer/chinese_roberta_L-4_H-256
[4_512]: https://huggingface.co/uer/chinese_roberta_L-4_H-512
[8_512]: https://huggingface.co/uer/chinese_roberta_L-8_H-512
- [12_768]: https://huggingface.co/uer/chinese_roberta_L-12_H-768
-
datasets: CLUECorpusSmall
widget:
- text: "北京是[MASK]国的首都。"
+
+
---

 
You can download the 24 Chinese RoBERTa miniatures either from the [UER-py Github page](https://github.com/dbiir/UER-py/), or via HuggingFace from the links below:

+ | | H=128 | H=256 | H=512 | H=768 |
+ | -------- | :-----------------------: | :-----------------------: | :-------------------------: | :-------------------------: |
+ | **L=2**  | [**2/128 (Tiny)**][2_128] | [2/256]                   | [2/512]                     | [2/768]                     |
+ | **L=4**  | [4/128]                   | [**4/256 (Mini)**][4_256] | [**4/512 (Small)**][4_512]  | [4/768]                     |
+ | **L=6**  | [6/128]                   | [6/256]                   | [6/512]                     | [6/768]                     |
+ | **L=8**  | [8/128]                   | [8/256]                   | [**8/512 (Medium)**][8_512] | [8/768]                     |
+ | **L=10** | [10/128]                  | [10/256]                  | [10/512]                    | [10/768]                    |
+ | **L=12** | [12/128]                  | [12/256]                  | [12/512]                    | [**12/768 (Base)**][12_768] |
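
Each cell above links to a standalone model repository on the Hugging Face Hub, so a checkpoint can also be pulled directly with git. A minimal sketch, using the 8/512 (Medium) repository linked in the table and assuming git-lfs is installed:

```
# Pull the 8/512 (Medium) checkpoint from the Hub; the weight files are stored via Git LFS.
git lfs install
git clone https://huggingface.co/uer/chinese_roberta_L-8_H-512
```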

Here are scores on the development sets of six Chinese tasks:

+ | Model          | Score | douban | chnsenticorp | lcqmc | tnews(CLUE) | iflytek(CLUE) | ocnli(CLUE) |
+ | -------------- | :---: | :----: | :----------: | :---: | :---------: | :-----------: | :---------: |
+ | RoBERTa-Tiny   | 72.3  | 83.0   | 91.4         | 81.8  | 62.0        | 55.0          | 60.3        |
+ | RoBERTa-Mini   | 75.7  | 84.8   | 93.7         | 86.1  | 63.9        | 58.3          | 67.4        |
+ | RoBERTa-Small  | 76.8  | 86.5   | 93.4         | 86.5  | 65.1        | 59.4          | 69.7        |
+ | RoBERTa-Medium | 77.8  | 87.6   | 94.8         | 88.1  | 65.6        | 59.5          | 71.2        |
+ | RoBERTa-Base   | 79.5  | 89.1   | 95.2         | 89.2  | 67.0        | 60.9          | 75.5        |
+
+ For each task, we selected the best fine-tuning hyperparameters from the lists below (a command sketch follows the list), and trained with a sequence length of 128:

  - epochs: 3, 5, 8
  - batch sizes: 32, 64
  - learning rates: 3e-5, 1e-4, 3e-4
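
For reference, one cell of that grid might look like the UER-py fine-tuning command sketched below. This is an illustration rather than the exact command behind the reported scores: the run_classifier.py flags are patterned on the pretrain commands later in this card and on UER-py's fine-tuning examples, and the douban_book_review paths stand in for whichever downstream dataset is being fine-tuned.

```
# Sketch of one assumed grid point: 3 epochs, batch size 32, learning rate 3e-5, sequence length 128.
python3 run_classifier.py --pretrained_model_path models/cluecorpussmall_roberta_medium_seq512_model.bin \
                          --vocab_path models/google_zh_vocab.txt \
                          --config_path models/bert_medium_config.json \
                          --train_path datasets/douban_book_review/train.tsv \
                          --dev_path datasets/douban_book_review/dev.tsv \
                          --test_path datasets/douban_book_review/test.tsv \
                          --seq_length 128 --epochs_num 3 --batch_size 32 --learning_rate 3e-5 \
                          --embedding word_pos_seg --encoder transformer --mask fully_visible
```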
 

## Training data

+ [CLUECorpusSmall](https://github.com/CLUEbenchmark/CLUECorpus2020/) is used as training data. We found that models pre-trained on CLUECorpusSmall outperform those pre-trained on CLUECorpus2020, although CLUECorpus2020 is much larger than CLUECorpusSmall.

## Training procedure

 
Taking RoBERTa-Medium as an example:

Stage 1:

```
python3 preprocess.py --corpus_path corpora/cluecorpussmall.txt \
                      --vocab_path models/google_zh_vocab.txt \
+                     --dataset_path cluecorpussmall_seq128_dataset.pt \
+                     --processes_num 32 --seq_length 128 \
+                     --dynamic_masking --target mlm
```

```
python3 pretrain.py --dataset_path cluecorpussmall_seq128_dataset.pt \
                    --vocab_path models/google_zh_vocab.txt \
+                   --config_path models/bert_medium_config.json \
+                   --output_model_path models/cluecorpussmall_roberta_medium_seq128_model.bin \
+                   --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
+                   --total_steps 1000000 --save_checkpoint_steps 100000 --report_steps 50000 \
+                   --learning_rate 1e-4 --batch_size 64 \
+                   --tie_weights --embedding word_pos_seg --encoder transformer --mask fully_visible --target mlm
```

Stage 2:

```
python3 preprocess.py --corpus_path corpora/cluecorpussmall.txt \
                      --vocab_path models/google_zh_vocab.txt \
+                     --dataset_path cluecorpussmall_seq512_dataset.pt \
+                     --processes_num 32 --seq_length 512 \
+                     --dynamic_masking --target mlm
```

```
python3 pretrain.py --dataset_path cluecorpussmall_seq512_dataset.pt \
                    --pretrained_model_path models/cluecorpussmall_roberta_medium_seq128_model.bin-1000000 \
+                   --vocab_path models/google_zh_vocab.txt \
+                   --config_path models/bert_medium_config.json \
+                   --output_model_path models/cluecorpussmall_roberta_medium_seq512_model.bin \
+                   --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
+                   --total_steps 250000 --save_checkpoint_steps 50000 --report_steps 10000 \
+                   --learning_rate 5e-5 --batch_size 16 \
+                   --tie_weights --embedding word_pos_seg --encoder transformer --mask fully_visible --target mlm
```

+ Finally, we convert the pre-trained model into Hugging Face's format:

+ ```
+ python3 scripts/convert_bert_from_uer_to_huggingface.py --input_model_path models/cluecorpussmall_roberta_medium_seq512_model.bin-250000 \
+                                                         --output_model_path pytorch_model.bin \
+                                                         --layers_num 8 --target mlm
```

  ### BibTeX entry and citation info
 
[4_256]: https://huggingface.co/uer/chinese_roberta_L-4_H-256
[4_512]: https://huggingface.co/uer/chinese_roberta_L-4_H-512
[8_512]: https://huggingface.co/uer/chinese_roberta_L-8_H-512
+ [12_768]: https://huggingface.co/uer/chinese_roberta_L-12_H-768