1 ---
2 language: Chinese
3 datasets: CLUECorpusSmall
4 widget:
5 - text: "北京是[MASK]国的首都。"
6
7
8 ---
9
10
11 # Chinese RoBERTa Miniatures
12
13 ## Model description
14
15 This is the set of 24 Chinese RoBERTa models pre-trained by [UER-py](https://github.com/dbiir/UER-py/), which is introduced in [this paper](https://arxiv.org/abs/1909.05658).
16
[Turc et al.](https://arxiv.org/abs/1908.08962) have shown that the standard BERT recipe is effective on a wide range of model sizes. Following their paper, we released the 24 Chinese RoBERTa models. To help users reproduce the results, we used a publicly available corpus and provide all training details.
18
You can download the 24 Chinese RoBERTa miniatures either from the [UER-py Modelzoo page](https://github.com/dbiir/UER-py/wiki/Modelzoo) or via Hugging Face from the links below:
20
21 | | H=128 | H=256 | H=512 | H=768 |
22 | -------- | :-----------------------: | :-----------------------: | :-------------------------: | :-------------------------: |
23 | **L=2** | [**2/128 (Tiny)**][2_128] | [2/256][2_256] | [2/512][2_512] | [2/768][2_768] |
24 | **L=4** | [4/128][4_128] | [**4/256 (Mini)**][4_256] | [**4/512 (Small)**][4_512] | [4/768][4_768] |
25 | **L=6** | [6/128][6_128] | [6/256][6_256] | [6/512][6_512] | [6/768][6_768] |
26 | **L=8** | [8/128][8_128] | [8/256][8_256] | [**8/512 (Medium)**][8_512] | [8/768][8_768] |
27 | **L=10** | [10/128][10_128] | [10/256][10_256] | [10/512][10_512] | [10/768][10_768] |
28 | **L=12** | [12/128][12_128] | [12/256][12_256] | [12/512][12_512] | [**12/768 (Base)**][12_768] |
29
Here are the scores on the development sets of six Chinese tasks:
31
32 | Model | Score | douban | chnsenticorp | lcqmc | tnews(CLUE) | iflytek(CLUE) | ocnli(CLUE) |
33 | -------------- | :---: | :----: | :----------: | :---: | :---------: | :-----------: | :---------: |
34 | RoBERTa-Tiny | 72.3 | 83.0 | 91.4 | 81.8 | 62.0 | 55.0 | 60.3 |
35 | RoBERTa-Mini | 75.7 | 84.8 | 93.7 | 86.1 | 63.9 | 58.3 | 67.4 |
36 | RoBERTa-Small | 76.8 | 86.5 | 93.4 | 86.5 | 65.1 | 59.4 | 69.7 |
37 | RoBERTa-Medium | 77.8 | 87.6 | 94.8 | 88.1 | 65.6 | 59.5 | 71.2 |
38 | RoBERTa-Base | 79.5 | 89.1 | 95.2 | 89.2 | 67.0 | 60.9 | 75.5 |
39
For each task, we selected the best fine-tuning hyperparameters from the lists below (a sketch of the sweep follows the list) and trained with a sequence length of 128:
41
42 - epochs: 3, 5, 8
43 - batch sizes: 32, 64
44 - learning rates: 3e-5, 1e-4, 3e-4
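
The sweep itself is a plain grid search over the three lists above. Below is a minimal sketch of how it could be organized; `finetune_and_evaluate` is a hypothetical placeholder for a single fine-tuning run that returns the development-set score (the actual experiments were run with UER-py), and only the hyperparameter values come from the lists above.

```python
# Hypothetical sketch of the fine-tuning grid search described above.
# finetune_and_evaluate stands in for one fine-tuning run that returns
# the development-set score; the actual experiments used UER-py.
from itertools import product

epochs_list = [3, 5, 8]
batch_sizes = [32, 64]
learning_rates = [3e-5, 1e-4, 3e-4]

best_score, best_config = float('-inf'), None
for epochs, batch_size, lr in product(epochs_list, batch_sizes, learning_rates):
    score = finetune_and_evaluate(
        model_name='uer/chinese_roberta_L-8_H-512',
        seq_length=128,
        epochs=epochs,
        batch_size=batch_size,
        learning_rate=lr,
    )
    if score > best_score:
        best_score, best_config = score, (epochs, batch_size, lr)

print(best_score, best_config)
```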
45
46 ## How to use
47
You can use this model directly with a pipeline for masked language modeling (taking RoBERTa-Medium as an example):
49
50 ```python
51 >>> from transformers import pipeline
52 >>> unmasker = pipeline('fill-mask', model='uer/chinese_roberta_L-8_H-512')
53 >>> unmasker("中国的首都是[MASK]京。")
54 [
55 {'sequence': '[CLS] 中 国 的 首 都 是 北 京 。 [SEP]',
56 'score': 0.8701988458633423,
57 'token': 1266,
58 'token_str': '北'},
59 {'sequence': '[CLS] 中 国 的 首 都 是 南 京 。 [SEP]',
60 'score': 0.1194809079170227,
61 'token': 1298,
62 'token_str': '南'},
63 {'sequence': '[CLS] 中 国 的 首 都 是 东 京 。 [SEP]',
64 'score': 0.0037803512532263994,
65 'token': 691,
66 'token_str': '东'},
67 {'sequence': '[CLS] 中 国 的 首 都 是 普 京 。 [SEP]',
68 'score': 0.0017127094324678183,
69 'token': 3249,
70 'token_str': '普'},
71 {'sequence': '[CLS] 中 国 的 首 都 是 望 京 。 [SEP]',
72 'score': 0.001687526935711503,
73 'token': 3307,
74 'token_str': '望'}
75 ]
76 ```
77
78 Here is how to use this model to get the features of a given text in PyTorch:
79
80 ```python
81 from transformers import BertTokenizer, BertModel
82 tokenizer = BertTokenizer.from_pretrained('uer/chinese_roberta_L-8_H-512')
83 model = BertModel.from_pretrained("uer/chinese_roberta_L-8_H-512")
84 text = "用你喜欢的任何文本替换我。"
85 encoded_input = tokenizer(text, return_tensors='pt')
86 output = model(**encoded_input)
87 ```
88
89 and in TensorFlow:
90
91 ```python
92 from transformers import BertTokenizer, TFBertModel
93 tokenizer = BertTokenizer.from_pretrained('uer/chinese_roberta_L-8_H-512')
94 model = TFBertModel.from_pretrained("uer/chinese_roberta_L-8_H-512")
95 text = "用你喜欢的任何文本替换我。"
96 encoded_input = tokenizer(text, return_tensors='tf')
97 output = model(encoded_input)
98 ```
99
100 ## Training data
101
102 [CLUECorpusSmall](https://github.com/CLUEbenchmark/CLUECorpus2020/) is used as training data. We found that models pre-trained on CLUECorpusSmall outperform those pre-trained on CLUECorpus2020, although CLUECorpus2020 is much larger than CLUECorpusSmall.
103
104 ## Training procedure
105
Models are pre-trained by [UER-py](https://github.com/dbiir/UER-py/) on [Tencent Cloud](https://cloud.tencent.com/). We pre-train for 1,000,000 steps with a sequence length of 128, and then for an additional 250,000 steps with a sequence length of 512. We use the same hyperparameters across model sizes.
107
Taking RoBERTa-Medium as an example:
109
Stage 1:
111
112 ```
113 python3 preprocess.py --corpus_path corpora/cluecorpussmall.txt \
114 --vocab_path models/google_zh_vocab.txt \
115 --dataset_path cluecorpussmall_seq128_dataset.pt \
116 --processes_num 32 --seq_length 128 \
117 --dynamic_masking --target mlm
118 ```
119
120 ```
121 python3 pretrain.py --dataset_path cluecorpussmall_seq128_dataset.pt \
122 --vocab_path models/google_zh_vocab.txt \
123 --config_path models/bert/medium_config.json \
124 --output_model_path models/cluecorpussmall_roberta_medium_seq128_model.bin \
125 --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
126 --total_steps 1000000 --save_checkpoint_steps 100000 --report_steps 50000 \
127 --learning_rate 1e-4 --batch_size 64 \
128 --embedding word_pos_seg --encoder transformer --mask fully_visible --target mlm --tie_weights
129 ```
130
Stage 2:
132
133 ```
134 python3 preprocess.py --corpus_path corpora/cluecorpussmall.txt \
135 --vocab_path models/google_zh_vocab.txt \
136 --dataset_path cluecorpussmall_seq512_dataset.pt \
137 --processes_num 32 --seq_length 512 \
138 --dynamic_masking --target mlm
139 ```
140
141 ```
142 python3 pretrain.py --dataset_path cluecorpussmall_seq512_dataset.pt \
143 --pretrained_model_path models/cluecorpussmall_roberta_medium_seq128_model.bin-1000000 \
144 --vocab_path models/google_zh_vocab.txt \
145 --config_path models/bert/medium_config.json \
146 --output_model_path models/cluecorpussmall_roberta_medium_seq512_model.bin \
147 --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
148 --total_steps 250000 --save_checkpoint_steps 50000 --report_steps 10000 \
149 --learning_rate 5e-5 --batch_size 16 \
150 --embedding word_pos_seg --encoder transformer --mask fully_visible --target mlm --tie_weights
151 ```
152
Finally, we convert the pre-trained model into Hugging Face's format:
154
155 ```
156 python3 scripts/convert_bert_from_uer_to_huggingface.py --input_model_path models/cluecorpussmall_roberta_medium_seq512_model.bin-250000 \
157 --output_model_path pytorch_model.bin \
158 --layers_num 8 --target mlm
159 ```
160
161 ### BibTeX entry and citation info
162
163 ```
164 @article{devlin2018bert,
165 title={Bert: Pre-training of deep bidirectional transformers for language understanding},
166 author={Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina},
167 journal={arXiv preprint arXiv:1810.04805},
168 year={2018}
169 }
170
171 @article{liu2019roberta,
172 title={Roberta: A robustly optimized bert pretraining approach},
173 author={Liu, Yinhan and Ott, Myle and Goyal, Naman and Du, Jingfei and Joshi, Mandar and Chen, Danqi and Levy, Omer and Lewis, Mike and Zettlemoyer, Luke and Stoyanov, Veselin},
174 journal={arXiv preprint arXiv:1907.11692},
175 year={2019}
176 }
177
178 @article{turc2019,
179 title={Well-Read Students Learn Better: On the Importance of Pre-training Compact Models},
180 author={Turc, Iulia and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina},
  journal={arXiv preprint arXiv:1908.08962v2},
182 year={2019}
183 }
184
185 @article{zhao2019uer,
186 title={UER: An Open-Source Toolkit for Pre-training Models},
187 author={Zhao, Zhe and Chen, Hui and Zhang, Jinbin and Zhao, Xin and Liu, Tao and Lu, Wei and Chen, Xi and Deng, Haotang and Ju, Qi and Du, Xiaoyong},
188 journal={EMNLP-IJCNLP 2019},
189 pages={241},
190 year={2019}
191 }
192 ```
193
194 [2_128]:https://huggingface.co/uer/chinese_roberta_L-2_H-128
195 [2_256]:https://huggingface.co/uer/chinese_roberta_L-2_H-256
196 [2_512]:https://huggingface.co/uer/chinese_roberta_L-2_H-512
197 [2_768]:https://huggingface.co/uer/chinese_roberta_L-2_H-768
198 [4_128]:https://huggingface.co/uer/chinese_roberta_L-4_H-128
199 [4_256]:https://huggingface.co/uer/chinese_roberta_L-4_H-256
200 [4_512]:https://huggingface.co/uer/chinese_roberta_L-4_H-512
201 [4_768]:https://huggingface.co/uer/chinese_roberta_L-4_H-768
202 [6_128]:https://huggingface.co/uer/chinese_roberta_L-6_H-128
203 [6_256]:https://huggingface.co/uer/chinese_roberta_L-6_H-256
204 [6_512]:https://huggingface.co/uer/chinese_roberta_L-6_H-512
205 [6_768]:https://huggingface.co/uer/chinese_roberta_L-6_H-768
206 [8_128]:https://huggingface.co/uer/chinese_roberta_L-8_H-128
207 [8_256]:https://huggingface.co/uer/chinese_roberta_L-8_H-256
208 [8_512]:https://huggingface.co/uer/chinese_roberta_L-8_H-512
209 [8_768]:https://huggingface.co/uer/chinese_roberta_L-8_H-768
210 [10_128]:https://huggingface.co/uer/chinese_roberta_L-10_H-128
211 [10_256]:https://huggingface.co/uer/chinese_roberta_L-10_H-256
212 [10_512]:https://huggingface.co/uer/chinese_roberta_L-10_H-512
213 [10_768]:https://huggingface.co/uer/chinese_roberta_L-10_H-768
214 [12_128]:https://huggingface.co/uer/chinese_roberta_L-12_H-128
215 [12_256]:https://huggingface.co/uer/chinese_roberta_L-12_H-256
216 [12_512]:https://huggingface.co/uer/chinese_roberta_L-12_H-512
217 [12_768]:https://huggingface.co/uer/chinese_roberta_L-12_H-768