hhou435 committed
Commit
978152d
1 Parent(s): 536051d
README.md ADDED
@@ -0,0 +1,94 @@
+ ---
+ language: Chinese
+ datasets: CLUECorpusSmall
+ widget:
+ - text: "内容丰富、版式设计考究、图片华丽、印制精美。[MASK]纸箱内还放了充气袋用于保护。"
+ ---
+
+ # Chinese Pegasus
+
+ ## Model description
+
+ This model is pre-trained with [UER-py](https://github.com/dbiir/UER-py/), a toolkit introduced in [this paper](https://arxiv.org/abs/1909.05658).
+
+ You can download the set of Chinese PEGASUS models either from the [UER-py Modelzoo page](https://github.com/dbiir/UER-py/wiki/Modelzoo) or via Hugging Face from the links below:
+
+ |                   |               Link               |
+ | ----------------- | :------------------------------: |
+ | **PEGASUS-Base**  |  [**L=12/H=768 (Base)**][base]   |
+ | **PEGASUS-Large** | [**L=16/H=1024 (Large)**][large] |
+
+ ## How to use
+
+ You can use this model directly with a pipeline for text2text generation (taking PEGASUS-Base as an example):
+
+ ```python
+ >>> from transformers import BertTokenizer, PegasusForConditionalGeneration, Text2TextGenerationPipeline
+ >>> tokenizer = BertTokenizer.from_pretrained("uer/pegasus-base-chinese-cluecorpussmall")
+ >>> model = PegasusForConditionalGeneration.from_pretrained("uer/pegasus-base-chinese-cluecorpussmall")
+ >>> text2text_generator = Text2TextGenerationPipeline(model, tokenizer)
+ >>> text2text_generator("内容丰富、版式设计考究、图片华丽、印制精美。[MASK]纸箱内还放了充气袋用于保护。", max_length=50, do_sample=False)
+ [{'generated_text': '书 的 质 量 很 好 。'}]
+ ```
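+
+ Alternatively, you can call `generate()` directly instead of going through the pipeline. A minimal sketch, assuming the same checkpoint as above (the beam settings are illustrative, not the pipeline defaults):
+
+ ```python
+ from transformers import BertTokenizer, PegasusForConditionalGeneration
+
+ tokenizer = BertTokenizer.from_pretrained("uer/pegasus-base-chinese-cluecorpussmall")
+ model = PegasusForConditionalGeneration.from_pretrained("uer/pegasus-base-chinese-cluecorpussmall")
+
+ text = "内容丰富、版式设计考究、图片华丽、印制精美。[MASK]纸箱内还放了充气袋用于保护。"
+ inputs = tokenizer(text, return_tensors="pt")
+ # Beam search decoding; the model fills the gap sentence marked by [MASK].
+ output_ids = model.generate(**inputs, max_length=50, num_beams=4)
+ print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
+ ```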
+
+ ## Training data
+
+ [CLUECorpusSmall](https://github.com/CLUEbenchmark/CLUECorpus2020/) is used as training data.
+
+ ## Training procedure
+
+ The model is pre-trained by [UER-py](https://github.com/dbiir/UER-py/) on [Tencent Cloud](https://cloud.tencent.com/). We pre-train for 1,000,000 steps with a sequence length of 512.
+ Taking PEGASUS-Base as an example:
+
+ ```
+ python3 preprocess.py --corpus_path corpora/cluecorpussmall.txt \
+                       --vocab_path models/google_zh_vocab.txt \
+                       --dataset_path cluecorpussmall_pegasus_seq512_dataset.pt \
+                       --processes_num 32 --seq_length 512 \
+                       --data_processor gsg --sentence_selection_strategy random
+ ```
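+
+ The `gsg` data processor implements PEGASUS's gap-sentence generation objective: whole sentences are removed from the input and the model is trained to generate them, and `--sentence_selection_strategy random` picks the gap sentences at random. A minimal sketch of the idea (illustrative only; the actual UER-py implementation differs in detail):
+
+ ```python
+ import random
+
+ def make_gsg_example(sentences, gap_ratio=0.25):
+     """Mask a random subset of sentences: the source keeps [MASK]
+     placeholders, the target is the removed sentences joined together."""
+     num_gaps = max(1, int(len(sentences) * gap_ratio))
+     gap_idx = set(random.sample(range(len(sentences)), num_gaps))
+     source = "".join("[MASK]" if i in gap_idx else s for i, s in enumerate(sentences))
+     target = "".join(s for i, s in enumerate(sentences) if i in gap_idx)
+     return source, target
+
+ sents = ["内容丰富、版式设计考究、图片华丽、印制精美。", "书的质量很好。", "纸箱内还放了充气袋用于保护。"]
+ src, tgt = make_gsg_example(sents)
+ print(src)  # e.g. 内容丰富、版式设计考究、图片华丽、印制精美。[MASK]纸箱内还放了充气袋用于保护。
+ print(tgt)  # e.g. 书的质量很好。
+ ```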
+
+ ```
+ python3 pretrain.py --dataset_path cluecorpussmall_pegasus_seq512_dataset.pt \
+                     --vocab_path models/google_zh_vocab.txt \
+                     --config_path models/pegasus/base_config.json \
+                     --output_model_path models/cluecorpussmall_pegasus_base_seq512_model.bin \
+                     --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
+                     --total_steps 1000000 --save_checkpoint_steps 100000 --report_steps 50000 \
+                     --learning_rate 1e-4 --batch_size 8
+ ```
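+
+ With `--world_size 8` and `--batch_size 8`, each training step processes an effective batch of 64 sequences (8 GPUs × 8 per GPU). Checkpoints are saved every 100,000 steps, with the step count appended to the output path. Before converting, a checkpoint can be sanity-checked with a short sketch like the following (the file name is illustrative, and we assume the saved file is a plain PyTorch state dict):
+
+ ```python
+ import torch
+
+ # Load on CPU; assumes the saved file is an ordinary state_dict of tensors.
+ state = torch.load("models/cluecorpussmall_pegasus_base_seq512_model.bin-1000000", map_location="cpu")
+ num_params = sum(t.numel() for t in state.values())
+ print(f"{len(state)} tensors, {num_params / 1e6:.1f}M parameters")
+ ```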
+
+ Finally, we convert the pre-trained model into Hugging Face's format:
+
+ ```
+ python3 scripts/convert_pegasus_from_uer_to_huggingface.py --input_model_path cluecorpussmall_pegasus_base_seq512_model.bin-1000000 \
+                                                            --output_model_path pytorch_model.bin \
+                                                            --layers_num 12
+ ```
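+
+ After conversion, it is worth checking that the weights load back cleanly with Transformers; a minimal sketch, assuming `pytorch_model.bin` and a matching `config.json` sit in the current directory:
+
+ ```python
+ from transformers import PegasusForConditionalGeneration
+
+ # from_pretrained reads config.json and pytorch_model.bin from the directory.
+ model = PegasusForConditionalGeneration.from_pretrained(".")
+ print(model.config.encoder_layers, model.config.d_model)
+ ```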
+
+ ### BibTeX entry and citation info
+
+ ```
+ @inproceedings{zhang2020pegasus,
+   title={Pegasus: Pre-training with extracted gap-sentences for abstractive summarization},
+   author={Zhang, Jingqing and Zhao, Yao and Saleh, Mohammad and Liu, Peter},
+   booktitle={International Conference on Machine Learning},
+   pages={11328--11339},
+   year={2020},
+   organization={PMLR}
+ }
+
+ @article{zhao2019uer,
+   title={UER: An Open-Source Toolkit for Pre-training Models},
+   author={Zhao, Zhe and Chen, Hui and Zhang, Jinbin and Zhao, Xin and Liu, Tao and Lu, Wei and Chen, Xi and Deng, Haotang and Ju, Qi and Du, Xiaoyong},
+   journal={EMNLP-IJCNLP 2019},
+   pages={241},
+   year={2019}
+ }
+ ```
+
+ [base]: https://huggingface.co/uer/pegasus-base-chinese-cluecorpussmall
+ [large]: https://huggingface.co/uer/pegasus-large-chinese-cluecorpussmall
config.json ADDED
@@ -0,0 +1,54 @@
+ {
+   "_name_or_path": "pegasus",
+   "activation_dropout": 0.1,
+   "activation_function": "relu",
+   "add_bias_logits": false,
+   "add_final_layer_norm": true,
+   "architectures": [
+     "PegasusForConditionalGeneration"
+   ],
+   "attention_dropout": 0.1,
+   "bos_token_id": 101,
+   "classif_dropout": 0.0,
+   "classifier_dropout": 0.0,
+   "d_model": 1024,
+   "decoder_attention_heads": 16,
+   "decoder_ffn_dim": 4096,
+   "decoder_layerdrop": 0.0,
+   "decoder_layers": 16,
+   "decoder_start_token_id": 101,
+   "dropout": 0.1,
+   "encoder_attention_heads": 16,
+   "encoder_ffn_dim": 4096,
+   "encoder_layerdrop": 0.0,
+   "encoder_layers": 16,
+   "eos_token_id": 1,
+   "extra_pos_embeddings": 1,
+   "force_bos_token_to_be_generated": false,
+   "forced_eos_token_id": 102,
+   "gradient_checkpointing": false,
+   "id2label": {
+     "0": "LABEL_0",
+     "1": "LABEL_1",
+     "2": "LABEL_2"
+   },
+   "init_std": 0.02,
+   "is_encoder_decoder": true,
+   "label2id": {
+     "LABEL_0": 0,
+     "LABEL_1": 1,
+     "LABEL_2": 2
+   },
+   "max_length": 256,
+   "max_position_embeddings": 1024,
+   "model_type": "pegasus",
+   "normalize_before": true,
+   "normalize_embedding": false,
+   "num_hidden_layers": 16,
+   "pad_token_id": 0,
+   "scale_embedding": true,
+   "static_position_embeddings": true,
+   "transformers_version": "4.13.0.dev0",
+   "use_cache": true,
+   "vocab_size": 21128
+ }
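The special-token ids in this config follow the BERT-style Chinese vocabulary used by the tokenizer: `bos_token_id` and `decoder_start_token_id` 101 correspond to `[CLS]`, `forced_eos_token_id` 102 to `[SEP]`, and `pad_token_id` 0 to `[PAD]`. A quick check (a sketch; the expected ids assume `google_zh_vocab.txt`):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("uer/pegasus-base-chinese-cluecorpussmall")
# Expected with the BERT Chinese vocab: 101 102 0
print(tokenizer.cls_token_id, tokenizer.sep_token_id, tokenizer.pad_token_id)
```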
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:bca70d7816a42a3751d9ab17cc7f86ae606ba020c313775092a2a1c08d7dcf06
+ size 1976418801
special_tokens_map.json ADDED
@@ -0,0 +1 @@
+ {"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
tf_model.h5 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e9d4f6cda0f75f2ef3562ad7a88a1d2ca911a35c000f56ca336f6a09bf3563f5
+ size 1976809520
tokenizer_config.json ADDED
@@ -0,0 +1 @@
+ {"do_lower_case": true, "do_basic_tokenize": true, "never_split": null, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "special_tokens_map_file": null, "tokenizer_file": null, "name_or_path": "pegasus", "tokenizer_class": "BertTokenizer"}
vocab.txt ADDED
The diff for this file is too large to render. See raw diff