hhou435 committed
Commit fe53e7e
1 Parent(s): 9c01a8e
README.md ADDED
@@ -0,0 +1,94 @@
---
language: Chinese
datasets: CLUECorpusSmall
widget:
- text: "作为电子[MASK]的平台,京东绝对是领先者。如今的刘强[MASK]已经是身价过[MASK]的老板。"

---

# Chinese BART

## Model description

This model is pre-trained by [UER-py](https://github.com/dbiir/UER-py/), which is introduced in [this paper](https://arxiv.org/abs/1909.05658).

You can download the set of Chinese BART models either from the [UER-py Modelzoo page](https://github.com/dbiir/UER-py/wiki/Modelzoo) or via Hugging Face from the links below:

|                | Link                             |
| -------------- | :------------------------------: |
| **BART-Base**  | [**L=6/H=768 (Base)**][base]     |
| **BART-Large** | [**L=12/H=1024 (Large)**][large] |

## How to use

You can use this model directly with a pipeline for text2text generation (taking BART-Base as an example):

```python
>>> from transformers import BertTokenizer, BartForConditionalGeneration, Text2TextGenerationPipeline
>>> tokenizer = BertTokenizer.from_pretrained("uer/bart-base-chinese-cluecorpussmall")
>>> model = BartForConditionalGeneration.from_pretrained("uer/bart-base-chinese-cluecorpussmall")
>>> text2text_generator = Text2TextGenerationPipeline(model, tokenizer)
>>> text2text_generator("中国的首都是[MASK]京", max_length=50, do_sample=False)
[{'generated_text': '中 国 的 首 都 是 北 京'}]
```
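The pipeline is a thin wrapper around `generate()`; if you need more control over decoding you can call the model directly. The following is a minimal sketch (same checkpoint and prompt as above; the explicit `num_beams=1` mirrors the greedy, non-sampling pipeline call):

```python
import torch
from transformers import BertTokenizer, BartForConditionalGeneration

tokenizer = BertTokenizer.from_pretrained("uer/bart-base-chinese-cluecorpussmall")
model = BartForConditionalGeneration.from_pretrained("uer/bart-base-chinese-cluecorpussmall")

inputs = tokenizer("中国的首都是[MASK]京", return_tensors="pt")
with torch.no_grad():
    # Greedy decoding, matching do_sample=False in the pipeline example above.
    output_ids = model.generate(inputs["input_ids"], max_length=50, num_beams=1, do_sample=False)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```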

## Training data

[CLUECorpusSmall](https://github.com/CLUEbenchmark/CLUECorpus2020/) is used as training data.

## Training procedure

The model is pre-trained by [UER-py](https://github.com/dbiir/UER-py/) on [Tencent Cloud](https://cloud.tencent.com/). We pre-train for 1,000,000 steps with a sequence length of 512. Taking BART-Base as an example:

```
python3 preprocess.py --corpus_path corpora/cluecorpussmall.txt \
                      --vocab_path models/google_zh_vocab.txt \
                      --dataset_path cluecorpussmall_bart_seq512_dataset.pt \
                      --processes_num 32 --seq_length 512 \
                      --data_processor bart
```

```
python3 pretrain.py --dataset_path cluecorpussmall_bart_seq512_dataset.pt \
                    --vocab_path models/google_zh_vocab.txt \
                    --config_path models/bart/base_config.json \
                    --output_model_path models/cluecorpussmall_bart_base_seq512_model.bin \
                    --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
                    --total_steps 1000000 --save_checkpoint_steps 100000 --report_steps 50000 \
                    --learning_rate 5e-5 --batch_size 8 \
                    --span_masking --span_max_length 3
```
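The `--span_masking --span_max_length 3` flags select BART-style text infilling, in which short spans of the input are replaced by a single `[MASK]` token and the decoder learns to reconstruct the original sequence. The sketch below is a toy illustration of that corruption only; it is not UER-py's actual data processor, and the sampling details are assumptions:

```python
import random

def text_infill(tokens, mask_ratio=0.15, span_max_length=3, mask_token="[MASK]"):
    """Toy text infilling: replace random spans of 1..span_max_length tokens
    with a single mask token until roughly mask_ratio of tokens are corrupted."""
    corrupted = list(tokens)
    budget = max(1, int(len(corrupted) * mask_ratio))
    while budget > 0:
        span = random.randint(1, min(span_max_length, budget))
        start = random.randrange(0, max(1, len(corrupted) - span + 1))
        corrupted = corrupted[:start] + [mask_token] + corrupted[start + span:]
        budget -= span
    return corrupted

random.seed(0)
print(text_infill(list("作为电子商务的平台京东绝对是领先者")))
```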

Finally, we convert the pre-trained model into Hugging Face's format:

```
python3 scripts/convert_bart_from_uer_to_huggingface.py --input_model_path cluecorpussmall_bart_base_seq512_model.bin-1000000 \
                                                        --output_model_path pytorch_model.bin \
                                                        --layers_num 6
```
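A quick sanity check on the converted checkpoint is to load it back with `transformers`. This is a sketch under the assumption that the converted `pytorch_model.bin` sits in the current directory together with the `config.json` and `vocab.txt` published in this repository:

```python
from transformers import BartForConditionalGeneration, BertTokenizer

# Load the converted weights and the BERT-style Chinese vocabulary from ./
model = BartForConditionalGeneration.from_pretrained("./")
tokenizer = BertTokenizer.from_pretrained("./")

print(model.config.encoder_layers, model.config.d_model)  # layer count / hidden size
print(f"{model.num_parameters():,} parameters")
```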

### BibTeX entry and citation info

```
@article{lewis2019bart,
  title={BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension},
  author={Lewis, Mike and Liu, Yinhan and Goyal, Naman and Ghazvininejad, Marjan and Mohamed, Abdelrahman and Levy, Omer and Stoyanov, Ves and Zettlemoyer, Luke},
  journal={arXiv preprint arXiv:1910.13461},
  year={2019}
}

@article{zhao2019uer,
  title={UER: An Open-Source Toolkit for Pre-training Models},
  author={Zhao, Zhe and Chen, Hui and Zhang, Jinbin and Zhao, Xin and Liu, Tao and Lu, Wei and Chen, Xi and Deng, Haotang and Ju, Qi and Du, Xiaoyong},
  journal={EMNLP-IJCNLP 2019},
  pages={241},
  year={2019}
}
```

[base]:https://huggingface.co/uer/bart-base-chinese-cluecorpussmall
[large]:https://huggingface.co/uer/bart-large-chinese-cluecorpussmall
config.json ADDED
@@ -0,0 +1,49 @@
{
  "_name_or_path": "bart",
  "activation_dropout": 0.1,
  "activation_function": "gelu",
  "architectures": [
    "BartForConditionalGeneration"
  ],
  "attention_dropout": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": 0.0,
  "d_model": 1024,
  "decoder_attention_heads": 16,
  "decoder_ffn_dim": 4096,
  "decoder_layerdrop": 0.1,
  "decoder_layers": 12,
  "decoder_start_token_id": 101,
  "dropout": 0.1,
  "early_stopping": true,
  "encoder_attention_heads": 16,
  "encoder_ffn_dim": 4096,
  "encoder_layerdrop": 0.1,
  "encoder_layers": 12,
  "eos_token_id": 0,
  "forced_eos_token_id": 0,
  "gradient_checkpointing": false,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "init_std": 0.02,
  "is_encoder_decoder": true,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2
  },
  "max_length": 256,
  "max_position_embeddings": 1024,
  "model_type": "bart",
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "scale_embedding": false,
  "tokenizer_class": "BertTokenizer",
  "torch_dtype": "float32",
  "transformers_version": "4.13.0.dev0",
  "use_cache": true,
  "vocab_size": 21128
}
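For reference, the fields above can be read back through `transformers`; a small sketch, assuming the file is saved locally as `config.json`. With `encoder_layers`/`decoder_layers` of 12 and `d_model` of 1024, this configuration corresponds to the BART-Large entry in the README table:

```python
from transformers import BartConfig

config = BartConfig.from_json_file("config.json")
print(config.encoder_layers, config.decoder_layers, config.d_model)  # 12 12 1024
print(config.vocab_size, config.tokenizer_class)                     # 21128 BertTokenizer
```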
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:133d3270d6048d49b5cc0de7e97f68a596a83170dfeb9f4022c55d9a2fd118d7
size 1506088449
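This is a Git LFS pointer, not the weights themselves; the `oid` line is the SHA-256 of the real `pytorch_model.bin` (about 1.5 GB). A minimal sketch for verifying a downloaded copy against that digest (the local file path is an assumption):

```python
import hashlib

# Expected digest, copied from the LFS pointer above.
EXPECTED = "133d3270d6048d49b5cc0de7e97f68a596a83170dfeb9f4022c55d9a2fd118d7"

def sha256_of(path, chunk_size=1 << 20):
    """Hash the file in 1 MiB chunks so the whole checkpoint never sits in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

assert sha256_of("pytorch_model.bin") == EXPECTED, "checksum mismatch"
```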
special_tokens_map.json ADDED
@@ -0,0 +1 @@
{"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
tf_model.h5 ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:1ea941e6f9e6062c153aa42c55036b52b5441d79363e4fa2aadec88cbbc9be3f
size 1506377248
tokenizer_config.json ADDED
@@ -0,0 +1 @@
{"do_lower_case": true, "do_basic_tokenize": true, "never_split": null, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "special_tokens_map_file": null, "tokenizer_file": null, "name_or_path": "bart", "tokenizer_class": "BertTokenizer"}
vocab.txt ADDED
The diff for this file is too large to render. See raw diff