uer committed on
Commit
64bd4f9
1 Parent(s): 7c87595

Update README.md

  1. README.md +108 -16
README.md CHANGED
@@ -2,29 +2,41 @@
2
  language: zh
3
  datasets: CLUECorpusSmall
4
  widget:
5
- - text: "这是很久之前的事情了"
6
 
7
 
8
  ---
9
 
10
 
11
- # Chinese GPT2 Model
12
 
13
  ## Model description
14
 
15
- The model is used to generate Chinese texts. You can download the model either from the [GPT2-Chinese Github page](https://github.com/Morizeyao/GPT2-Chinese), or via HuggingFace from the link [gpt2-chinese-cluecorpussmall](https://huggingface.co/uer/gpt2-chinese-cluecorpussmall).
16
 
17
  ## How to use
18
 
19
- You can use the model directly with a pipeline for text generation:
20
 
21
  ```python
22
  >>> from transformers import BertTokenizer, GPT2LMHeadModel, TextGenerationPipeline
23
- >>> tokenizer = BertTokenizer.from_pretrained("uer/gpt2-chinese-cluecorpussmall")
24
- >>> model = GPT2LMHeadModel.from_pretrained("uer/gpt2-chinese-cluecorpussmall")
25
  >>> text_generator = TextGenerationPipeline(model, tokenizer)
26
  >>> text_generator("这是很久之前的事情了", max_length=100, do_sample=True)
27
- [{'generated_text': '这是很久之前的事情了 他 们 他 们 到 现 在 , 我 们 的 。'}]
28
  ```
29
 
30
  ## Training data
@@ -33,7 +45,9 @@ You can use the model directly with a pipeline for text generation:
33
 
34
  ## Training procedure
35
 
36
- The model is pre-trained by [UER-py](https://github.com/dbiir/UER-py/) on [Tencent Cloud](https://cloud.tencent.com/). We pre-train 1,000,000 steps with a sequence length of 128 and then pre-train 250,000 additional steps with a sequence length of 1024.
37
 
38
  Stage1:
39
 
@@ -47,8 +61,8 @@ python3 preprocess.py --corpus_path corpora/cluecorpussmall.txt \
47
  ```
48
  python3 pretrain.py --dataset_path cluecorpussmall_lm_seq128_dataset.pt \
49
  --vocab_path models/google_zh_vocab.txt \
50
- --config_path models/gpt2/config.json \
51
- --output_model_path models/cluecorpussmall_gpt2_seq128_model.bin \
52
  --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
53
  --total_steps 1000000 --save_checkpoint_steps 100000 --report_steps 50000 \
54
  --learning_rate 1e-4 --batch_size 64
@@ -66,9 +80,9 @@ python3 preprocess.py --corpus_path corpora/cluecorpussmall.txt \
66
  ```
67
  python3 pretrain.py --dataset_path cluecorpussmall_lm_seq1024_dataset.pt \
68
  --vocab_path models/google_zh_vocab.txt \
69
- --pretrained_model_path models/cluecorpussmall_gpt2_seq128_model.bin-1000000 \
70
- --config_path models/gpt2/config.json \
71
- --output_model_path models/cluecorpussmall_gpt2_seq1024_model.bin \
72
  --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
73
  --total_steps 250000 --save_checkpoint_steps 50000 --report_steps 10000 \
74
  --learning_rate 5e-5 --batch_size 16
@@ -77,9 +91,74 @@ python3 pretrain.py --dataset_path cluecorpussmall_lm_seq1024_dataset.pt \
77
  Finally, we convert the pre-trained model into Huggingface's format:
78
 
79
  ```
80
- python3 scripts/convert_gpt2_from_uer_to_huggingface.py --input_model_path cluecorpussmall_gpt2_seq1024_model.bin-250000 \
81
  --output_model_path pytorch_model.bin \
82
- --layers_num 12
83
  ```
84
 
85
  ### BibTeX entry and citation info
@@ -98,4 +177,17 @@ python3 scripts/convert_gpt2_from_uer_to_huggingface.py --input_model_path cluec
98
  pages={241},
99
  year={2019}
100
  }
101
- ```
2
  language: zh
3
  datasets: CLUECorpusSmall
4
  widget:
5
+ - text: "米饭是一种用稻米与水煮成的食物"
6
 
7
 
8
  ---
9
 
10
 
11
+ # Chinese GPT2-distil Model
12
 
13
  ## Model description
14
 
15
+ All models in this set except GPT2-xlarge are pre-trained by [UER-py](https://github.com/dbiir/UER-py/), which is introduced in [this paper](https://arxiv.org/abs/1909.05658). The GPT2-xlarge model is pre-trained by [TencentPretrain](https://github.com/Tencent/TencentPretrain), introduced in [this paper](https://arxiv.org/abs/2212.06385), which inherits UER-py to support models with more than one billion parameters and extends it to a multimodal pre-training framework. The other models can also be pre-trained by TencentPretrain.
16
+
17
+ These models are used to generate Chinese text. You can download the set of Chinese GPT2 models either from the [UER-py Modelzoo page](https://github.com/dbiir/UER-py/wiki/Modelzoo), or via HuggingFace from the links below:
18
+
19
+ | | Link |
20
+ | ----------------- | :----------------------------: |
21
+ | **GPT2-distil** | [**L=6/H=768**][distil] |
22
+ | **GPT2** | [**L=12/H=768**][base] |
23
+ | **GPT2-medium** | [**L=24/H=1024**][medium] |
24
+ | **GPT2-large** | [**L=36/H=1280**][large] |
25
+ | **GPT2-xlarge** | [**L=48/H=1600**][xlarge] |
26
+
27
+ Note that the 6-layer model is called GPT2-distil because it follows the configuration of [distilgpt2](https://huggingface.co/distilgpt2); its pre-training does not involve supervision from larger models.
28
 
29
  ## How to use
30
 
31
+ You can use the model directly with a pipeline for text generation (using GPT2-distil as an example):
32
 
33
  ```python
34
  >>> from transformers import BertTokenizer, GPT2LMHeadModel, TextGenerationPipeline
35
+ >>> tokenizer = BertTokenizer.from_pretrained("uer/gpt2-distil-chinese-cluecorpussmall")
36
+ >>> model = GPT2LMHeadModel.from_pretrained("uer/gpt2-distil-chinese-cluecorpussmall")
37
  >>> text_generator = TextGenerationPipeline(model, tokenizer)
38
  >>> text_generator("这是很久之前的事情了", max_length=100, do_sample=True)
39
+ [{'generated_text': '这是很久之前的事情了 , 我 们 的 生 个 信 用 体 系 的 我 不 知'}]
40
  ```
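The sample output above shows a space between every generated token, because the pipeline decodes with `BertTokenizer`. A minimal post-processing sketch, assuming you want contiguous Chinese text: the helper name `join_cjk_tokens` and the regular expression are illustrative assumptions, not part of the model card or of `transformers`.

```python
import re

# Hypothetical helper, not part of transformers or UER-py.
# Matches whitespace that touches a non-ASCII character on either side,
# so spaces between ASCII words (e.g. English) are left alone.
_CJK_GAP = re.compile(r"(?<=[^\x00-\x7f])\s+|\s+(?=[^\x00-\x7f])")


def join_cjk_tokens(text: str) -> str:
    """Remove the token-separating spaces around non-ASCII characters."""
    return _CJK_GAP.sub("", text)


# Generated text copied from the example output in this README.
sample = "这是很久之前的事情了 , 我 们 的 生 个 信 用 体 系 的 我 不 知"
print(join_cjk_tokens(sample))
```

In practice the helper would be applied to the `generated_text` field of each dictionary returned by `text_generator`.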
41
 
42
  ## Training data
 
45
 
46
  ## Training procedure
47
 
48
+ The GPT2-xlarge model is pre-trained by [TencentPretrain](https://github.com/Tencent/TencentPretrain), and the others are pre-trained by [UER-py](https://github.com/dbiir/UER-py/) on [Tencent Cloud](https://cloud.tencent.com/). We pre-train 1,000,000 steps with a sequence length of 128 and then pre-train 250,000 additional steps with a sequence length of 1024.
49
+
50
+ For the models pre-trained by UER-py, we take GPT2-distil as an example:
51
 
52
  Stage1:
53
 
 
61
  ```
62
  python3 pretrain.py --dataset_path cluecorpussmall_lm_seq128_dataset.pt \
63
  --vocab_path models/google_zh_vocab.txt \
64
+ --config_path models/gpt2/distil_config.json \
65
+ --output_model_path models/cluecorpussmall_gpt2_distil_seq128_model.bin \
66
  --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
67
  --total_steps 1000000 --save_checkpoint_steps 100000 --report_steps 50000 \
68
  --learning_rate 1e-4 --batch_size 64
 
80
  ```
81
  python3 pretrain.py --dataset_path cluecorpussmall_lm_seq1024_dataset.pt \
82
  --vocab_path models/google_zh_vocab.txt \
83
+ --pretrained_model_path models/cluecorpussmall_gpt2_distil_seq128_model.bin-1000000 \
84
+ --config_path models/gpt2/distil_config.json \
85
+ --output_model_path models/cluecorpussmall_gpt2_distil_seq1024_model.bin \
86
  --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
87
  --total_steps 250000 --save_checkpoint_steps 50000 --report_steps 10000 \
88
  --learning_rate 5e-5 --batch_size 16
 
91
  Finally, we convert the pre-trained model into Huggingface's format:
92
 
93
  ```
94
+ python3 scripts/convert_gpt2_from_uer_to_huggingface.py --input_model_path cluecorpussmall_gpt2_distil_seq1024_model.bin-250000 \
95
  --output_model_path pytorch_model.bin \
96
+ --layers_num 6
97
+ ```
98
+
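The two-stage schedule above can be checked with a little arithmetic. The sketch below only multiplies the `--total_steps`, `--batch_size` and sequence-length values shown in the commands; the `Stage` class is a hypothetical helper for illustration. Whether `--batch_size` is per GPU or global is not stated here, so the totals are per batch unit; if it is per process, multiply by the `--world_size` of 8. The counts are token positions and include any padding.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One pre-training stage, using the flag values from the commands above."""
    total_steps: int
    batch_size: int
    seq_length: int

    @property
    def tokens_per_step(self) -> int:
        # Token positions consumed by one optimizer step (per batch unit).
        return self.batch_size * self.seq_length

    @property
    def total_tokens(self) -> int:
        return self.total_steps * self.tokens_per_step


stage1 = Stage(total_steps=1_000_000, batch_size=64, seq_length=128)
stage2 = Stage(total_steps=250_000, batch_size=16, seq_length=1024)

for name, stage in (("stage1", stage1), ("stage2", stage2)):
    print(name, stage.tokens_per_step, stage.total_tokens)
```

Under these assumptions, stage 2 processes twice as many token positions per step as stage 1, but half as many in total because it runs for a quarter of the steps.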
99
+ For the GPT2-xlarge model, we use TencentPretrain.
100
+
101
+ Stage1:
102
+
103
+ ```
104
+ python3 preprocess.py --corpus_path corpora/cluecorpussmall.txt \
105
+ --vocab_path models/google_zh_vocab.txt \
106
+ --dataset_path cluecorpussmall_lm_seq128_dataset.pt \
107
+ --seq_length 128 --processes_num 32 --data_processor lm
108
+ ```
109
+
110
+ ```
111
+ deepspeed pretrain.py --deepspeed --deepspeed_config models/deepspeed_config.json \
112
+ --dataset_path corpora/cluecorpussmall_lm_seq128_dataset.pt \
113
+ --vocab_path models/google_zh_vocab.txt \
114
+ --config_path models/gpt2/xlarge_config.json \
115
+ --output_model_path models/cluecorpussmall_gpt2_xlarge_seq128 \
116
+ --world_size 8 --batch_size 64 \
117
+ --total_steps 1000000 --save_checkpoint_steps 100000 --report_steps 50000 \
118
+ --deepspeed_checkpoint_activations --deepspeed_checkpoint_layers_num 24
119
+ ```
120
+
121
+ Before stage2, we extract fp32 consolidated weights from the DeepSpeed ZeRO stage 2 or 3 checkpoint:
122
+
123
+ ```
124
+ python3 models/cluecorpussmall_gpt2_xlarge_seq128/zero_to_fp32.py models/cluecorpussmall_gpt2_xlarge_seq128/ \
125
+ models/cluecorpussmall_gpt2_xlarge_seq128.bin
126
+ ```
127
+
128
+ Stage2:
129
+
130
+ ```
131
+ python3 preprocess.py --corpus_path corpora/cluecorpussmall.txt \
132
+ --vocab_path models/google_zh_vocab.txt \
133
+ --dataset_path cluecorpussmall_lm_seq1024_dataset.pt \
134
+ --seq_length 1024 --processes_num 32 --data_processor lm
135
+ ```
136
+
137
+ ```
138
+ deepspeed pretrain.py --deepspeed --deepspeed_config models/deepspeed_config.json \
139
+ --dataset_path corpora/cluecorpussmall_lm_seq1024_dataset.pt \
140
+ --vocab_path models/google_zh_vocab.txt \
141
+ --config_path models/gpt2/xlarge_config.json \
142
+ --pretrained_model_path models/cluecorpussmall_gpt2_xlarge_seq128.bin \
143
+ --output_model_path models/cluecorpussmall_gpt2_xlarge_seq1024_stage2 \
144
+ --world_size 8 --batch_size 16 --learning_rate 5e-5 \
145
+ --total_steps 250000 --save_checkpoint_steps 50000 --report_steps 10000 \
146
+ --deepspeed_checkpoint_activations --deepspeed_checkpoint_layers_num 6
147
+ ```
148
+
149
+ Then, we extract fp32 consolidated weights from the DeepSpeed ZeRO stage 2 or 3 checkpoint:
150
+
151
+ ```
152
+ python3 models/cluecorpussmall_gpt2_xlarge_seq1024_stage2/zero_to_fp32.py models/cluecorpussmall_gpt2_xlarge_seq1024_stage2/ \
153
+ models/cluecorpussmall_gpt2_xlarge_seq1024_stage2.bin
154
+ ```
155
+
156
+ Finally, we convert the pre-trained model into Huggingface's format:
157
+
158
+ ```
159
+ python3 scripts/convert_gpt2_from_tencentpretrain_to_huggingface.py --input_model_path models/cluecorpussmall_gpt2_xlarge_seq1024_stage2.bin \
160
+ --output_model_path pytorch_model.bin \
161
+ --layers_num 48
162
  ```
163
 
164
  ### BibTeX entry and citation info
 
177
  pages={241},
178
  year={2019}
179
  }
180
+
181
+ @article{zhao2023tencentpretrain,
182
+ title={TencentPretrain: A Scalable and Flexible Toolkit for Pre-training Models of Different Modalities},
183
+ author={Zhao, Zhe and Li, Yudong and Hou, Cheng and Zhao, Jing and others},
184
+ journal={ACL 2023},
185
+ pages={217},
186
+ year={2023}}
187
+ ```
188
+
189
+ [distil]:https://huggingface.co/uer/gpt2-distil-chinese-cluecorpussmall
190
+ [base]:https://huggingface.co/uer/gpt2-chinese-cluecorpussmall
191
+ [medium]:https://huggingface.co/uer/gpt2-medium-chinese-cluecorpussmall
192
+ [large]:https://huggingface.co/uer/gpt2-large-chinese-cluecorpussmall
193
+ [xlarge]:https://huggingface.co/uer/gpt2-xlarge-chinese-cluecorpussmall
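The model table and link definitions above can be captured in a small lookup, which may be convenient when switching between sizes in the usage example. This is a sketch only: the `Gpt2Variant` type, the `VARIANTS` dictionary and the `hub_url` helper are hypothetical names, while the layer counts, hidden sizes and repository ids are copied from the table and links in this README.

```python
from typing import NamedTuple


class Gpt2Variant(NamedTuple):
    """Hypothetical record for one row of the model table."""
    layers: int        # L in the table
    hidden_size: int   # H in the table
    repo_id: str       # Hugging Face repository id


VARIANTS = {
    "GPT2-distil": Gpt2Variant(6, 768, "uer/gpt2-distil-chinese-cluecorpussmall"),
    "GPT2": Gpt2Variant(12, 768, "uer/gpt2-chinese-cluecorpussmall"),
    "GPT2-medium": Gpt2Variant(24, 1024, "uer/gpt2-medium-chinese-cluecorpussmall"),
    "GPT2-large": Gpt2Variant(36, 1280, "uer/gpt2-large-chinese-cluecorpussmall"),
    "GPT2-xlarge": Gpt2Variant(48, 1600, "uer/gpt2-xlarge-chinese-cluecorpussmall"),
}


def hub_url(name: str) -> str:
    """Build the model page URL for a variant name from the table."""
    return f"https://huggingface.co/{VARIANTS[name].repo_id}"


print(hub_url("GPT2-distil"))
```

The `repo_id` values are the strings passed to `BertTokenizer.from_pretrained` and `GPT2LMHeadModel.from_pretrained` in the usage example, and `layers` matches the `--layers_num` values used in the conversion commands for GPT2-distil (6) and GPT2-xlarge (48).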