hhou435 committed
Commit
978152d
1 Parent(s): 536051d
README.md ADDED
@@ -0,0 +1,94 @@
+ ---
+ language: Chinese
+ datasets: CLUECorpusSmall
+ widget:
+ - text: "内容丰富、版式设计考究、图片华丽、印制精美。[MASK]纸箱内还放了充气袋用于保护。"
+ ---
+
+ # Chinese Pegasus
+
+ ## Model description
+
+ This model is pre-trained with [UER-py](https://github.com/dbiir/UER-py/), a toolkit introduced in [this paper](https://arxiv.org/abs/1909.05658).
+
+ You can download the set of Chinese PEGASUS models either from the [UER-py Modelzoo page](https://github.com/dbiir/UER-py/wiki/Modelzoo) or via Hugging Face from the links below:
+
+ |                   |               Link               |
+ | ----------------- | :------------------------------: |
+ | **PEGASUS-Base**  |  [**L=12/H=768 (Base)**][base]   |
+ | **PEGASUS-Large** | [**L=16/H=1024 (Large)**][large] |
+
+ ## How to use
+
+ You can use this model directly with a pipeline for text2text generation (taking PEGASUS-Base as an example):
+
+ ```python
+ >>> from transformers import BertTokenizer, PegasusForConditionalGeneration, Text2TextGenerationPipeline
+ >>> tokenizer = BertTokenizer.from_pretrained("uer/pegasus-base-chinese-cluecorpussmall")
+ >>> model = PegasusForConditionalGeneration.from_pretrained("uer/pegasus-base-chinese-cluecorpussmall")
+ >>> text2text_generator = Text2TextGenerationPipeline(model, tokenizer)
+ >>> text2text_generator("内容丰富、版式设计考究、图片华丽、印制精美。[MASK]纸箱内还放了充气袋用于保护。", max_length=50, do_sample=False)
+ [{'generated_text': '书 的 质 量 很 好 。'}]
+ ```
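+
+ Alternatively, you can call `generate()` directly instead of going through the pipeline. A minimal sketch, assuming the same checkpoint as above (the beam settings are illustrative, not the pipeline defaults):
+
+ ```python
+ from transformers import BertTokenizer, PegasusForConditionalGeneration
+
+ tokenizer = BertTokenizer.from_pretrained("uer/pegasus-base-chinese-cluecorpussmall")
+ model = PegasusForConditionalGeneration.from_pretrained("uer/pegasus-base-chinese-cluecorpussmall")
+
+ text = "内容丰富、版式设计考究、图片华丽、印制精美。[MASK]纸箱内还放了充气袋用于保护。"
+ inputs = tokenizer(text, return_tensors="pt")
+ # Beam search decoding; the model fills the gap sentence marked by [MASK].
+ output_ids = model.generate(**inputs, max_length=50, num_beams=4)
+ print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
+ ```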
+
+ ## Training data
+
+ [CLUECorpusSmall](https://github.com/CLUEbenchmark/CLUECorpus2020/) is used as training data.
+
+ ## Training procedure
+
+ The model is pre-trained by [UER-py](https://github.com/dbiir/UER-py/) on [Tencent Cloud](https://cloud.tencent.com/). We pre-train for 1,000,000 steps with a sequence length of 512.
+ Taking PEGASUS-Base as an example:
+
+ ```
+ python3 preprocess.py --corpus_path corpora/cluecorpussmall.txt \
+                       --vocab_path models/google_zh_vocab.txt \
+                       --dataset_path cluecorpussmall_pegasus_seq512_dataset.pt \
+                       --processes_num 32 --seq_length 512 \
+                       --data_processor gsg --sentence_selection_strategy random
+ ```
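+
+ The `gsg` data processor implements PEGASUS's gap-sentence generation objective: whole sentences are removed from the input and the model is trained to generate them, and `--sentence_selection_strategy random` picks the gap sentences at random. A minimal sketch of the idea (illustrative only; the actual UER-py implementation differs in detail):
+
+ ```python
+ import random
+
+ def make_gsg_example(sentences, gap_ratio=0.25):
+     """Mask a random subset of sentences: the source keeps [MASK]
+     placeholders, the target is the removed sentences joined together."""
+     num_gaps = max(1, int(len(sentences) * gap_ratio))
+     gap_idx = set(random.sample(range(len(sentences)), num_gaps))
+     source = "".join("[MASK]" if i in gap_idx else s for i, s in enumerate(sentences))
+     target = "".join(s for i, s in enumerate(sentences) if i in gap_idx)
+     return source, target
+
+ sents = ["内容丰富、版式设计考究、图片华丽、印制精美。", "书的质量很好。", "纸箱内还放了充气袋用于保护。"]
+ src, tgt = make_gsg_example(sents)
+ print(src)  # e.g. 内容丰富、版式设计考究、图片华丽、印制精美。[MASK]纸箱内还放了充气袋用于保护。
+ print(tgt)  # e.g. 书的质量很好。
+ ```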
+
+ ```
+ python3 pretrain.py --dataset_path cluecorpussmall_pegasus_seq512_dataset.pt \
+                     --vocab_path models/google_zh_vocab.txt \
+                     --config_path models/pegasus/base_config.json \
+                     --output_model_path models/cluecorpussmall_pegasus_base_seq512_model.bin \
+                     --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
+                     --total_steps 1000000 --save_checkpoint_steps 100000 --report_steps 50000 \
+                     --learning_rate 1e-4 --batch_size 8
+ ```
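+
+ With `--world_size 8` and `--batch_size 8`, each training step processes an effective batch of 64 sequences (8 GPUs × 8 per GPU). Checkpoints are saved every 100,000 steps, with the step count appended to the output path. Before converting, a checkpoint can be sanity-checked with a short sketch like the following (the file name is illustrative, and we assume the saved file is a plain PyTorch state dict):
+
+ ```python
+ import torch
+
+ # Load on CPU; assumes the saved file is an ordinary state_dict of tensors.
+ state = torch.load("models/cluecorpussmall_pegasus_base_seq512_model.bin-1000000", map_location="cpu")
+ num_params = sum(t.numel() for t in state.values())
+ print(f"{len(state)} tensors, {num_params / 1e6:.1f}M parameters")
+ ```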
+
+ Finally, we convert the pre-trained model into Hugging Face's format:
+
+ ```
+ python3 scripts/convert_pegasus_from_uer_to_huggingface.py --input_model_path cluecorpussmall_pegasus_base_seq512_model.bin-1000000 \
+                                                            --output_model_path pytorch_model.bin \
+                                                            --layers_num 12
+ ```
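+
+ After conversion, it is worth checking that the weights load back cleanly with Transformers; a minimal sketch, assuming `pytorch_model.bin` and a matching `config.json` sit in the current directory:
+
+ ```python
+ from transformers import PegasusForConditionalGeneration
+
+ # from_pretrained reads config.json and pytorch_model.bin from the directory.
+ model = PegasusForConditionalGeneration.from_pretrained(".")
+ print(model.config.encoder_layers, model.config.d_model)
+ ```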
+
+ ### BibTeX entry and citation info
+
+ ```
+ @inproceedings{zhang2020pegasus,
+   title={Pegasus: Pre-training with extracted gap-sentences for abstractive summarization},
+   author={Zhang, Jingqing and Zhao, Yao and Saleh, Mohammad and Liu, Peter},
+   booktitle={International Conference on Machine Learning},
+   pages={11328--11339},
+   year={2020},
+   organization={PMLR}
+ }
+
+ @article{zhao2019uer,
+   title={UER: An Open-Source Toolkit for Pre-training Models},
+   author={Zhao, Zhe and Chen, Hui and Zhang, Jinbin and Zhao, Xin and Liu, Tao and Lu, Wei and Chen, Xi and Deng, Haotang and Ju, Qi and Du, Xiaoyong},
+   journal={EMNLP-IJCNLP 2019},
+   pages={241},
+   year={2019}
+ }
+ ```
+
+ [base]: https://huggingface.co/uer/pegasus-base-chinese-cluecorpussmall
+ [large]: https://huggingface.co/uer/pegasus-large-chinese-cluecorpussmall
config.json ADDED
@@ -0,0 +1,54 @@
+ {
+   "_name_or_path": "pegasus",
+   "activation_dropout": 0.1,
+   "activation_function": "relu",
+   "add_bias_logits": false,
+   "add_final_layer_norm": true,
+   "architectures": [
+     "PegasusForConditionalGeneration"
+   ],
+   "attention_dropout": 0.1,
+   "bos_token_id": 101,
+   "classif_dropout": 0.0,
+   "classifier_dropout": 0.0,
+   "d_model": 1024,
+   "decoder_attention_heads": 16,
+   "decoder_ffn_dim": 4096,
+   "decoder_layerdrop": 0.0,
+   "decoder_layers": 16,
+   "decoder_start_token_id": 101,
+   "dropout": 0.1,
+   "encoder_attention_heads": 16,
+   "encoder_ffn_dim": 4096,
+   "encoder_layerdrop": 0.0,
+   "encoder_layers": 16,
+   "eos_token_id": 1,
+   "extra_pos_embeddings": 1,
+   "force_bos_token_to_be_generated": false,
+   "forced_eos_token_id": 102,
+   "gradient_checkpointing": false,
+   "id2label": {
+     "0": "LABEL_0",
+     "1": "LABEL_1",
+     "2": "LABEL_2"
+   },
+   "init_std": 0.02,
+   "is_encoder_decoder": true,
+   "label2id": {
+     "LABEL_0": 0,
+     "LABEL_1": 1,
+     "LABEL_2": 2
+   },
+   "max_length": 256,
+   "max_position_embeddings": 1024,
+   "model_type": "pegasus",
+   "normalize_before": true,
+   "normalize_embedding": false,
+   "num_hidden_layers": 16,
+   "pad_token_id": 0,
+   "scale_embedding": true,
+   "static_position_embeddings": true,
+   "transformers_version": "4.13.0.dev0",
+   "use_cache": true,
+   "vocab_size": 21128
+ }
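The special-token ids in this config follow the BERT-style Chinese vocabulary used by the tokenizer: `bos_token_id` and `decoder_start_token_id` 101 correspond to `[CLS]`, `forced_eos_token_id` 102 to `[SEP]`, and `pad_token_id` 0 to `[PAD]`. A quick check (a sketch; the expected ids assume `google_zh_vocab.txt`):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("uer/pegasus-base-chinese-cluecorpussmall")
# Expected with the BERT Chinese vocab: 101 102 0
print(tokenizer.cls_token_id, tokenizer.sep_token_id, tokenizer.pad_token_id)
```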
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:bca70d7816a42a3751d9ab17cc7f86ae606ba020c313775092a2a1c08d7dcf06
+ size 1976418801
special_tokens_map.json ADDED
@@ -0,0 +1 @@
+ {"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
tf_model.h5 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e9d4f6cda0f75f2ef3562ad7a88a1d2ca911a35c000f56ca336f6a09bf3563f5
+ size 1976809520
tokenizer_config.json ADDED
@@ -0,0 +1 @@
+ {"do_lower_case": true, "do_basic_tokenize": true, "never_split": null, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "special_tokens_map_file": null, "tokenizer_file": null, "name_or_path": "pegasus", "tokenizer_class": "BertTokenizer"}
vocab.txt ADDED
The diff for this file is too large to render. See raw diff