hhou435 committed
Commit fe53e7e
1 Parent(s): 9c01a8e
README.md ADDED
@@ -0,0 +1,94 @@
---
language: Chinese
datasets: CLUECorpusSmall
widget:
- text: "作为电子[MASK]的平台,京东绝对是领先者。如今的刘强[MASK]已经是身价过[MASK]的老板。"

---

# Chinese BART

## Model description

This model is pre-trained by [UER-py](https://github.com/dbiir/UER-py/), which is introduced in [this paper](https://arxiv.org/abs/1909.05658).

You can download the set of Chinese BART models either from the [UER-py Modelzoo page](https://github.com/dbiir/UER-py/wiki/Modelzoo) or via Hugging Face from the links below:

|                | Link                             |
| -------------- | :------------------------------: |
| **BART-Base**  | [**L=6/H=768 (Base)**][base]     |
| **BART-Large** | [**L=12/H=1024 (Large)**][large] |

## How to use

You can use this model directly with a pipeline for text2text generation (taking BART-Base as an example):

```python
>>> from transformers import BertTokenizer, BartForConditionalGeneration, Text2TextGenerationPipeline
>>> tokenizer = BertTokenizer.from_pretrained("uer/bart-base-chinese-cluecorpussmall")
>>> model = BartForConditionalGeneration.from_pretrained("uer/bart-base-chinese-cluecorpussmall")
>>> text2text_generator = Text2TextGenerationPipeline(model, tokenizer)
>>> text2text_generator("中国的首都是[MASK]京", max_length=50, do_sample=False)
[{'generated_text': '中 国 的 首 都 是 北 京'}]
```
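The pipeline is a thin wrapper around `generate()`; if you need more control over decoding you can call the model directly. The following is a minimal sketch (same checkpoint and prompt as above; the explicit `num_beams=1` mirrors the greedy, non-sampling pipeline call):

```python
import torch
from transformers import BertTokenizer, BartForConditionalGeneration

tokenizer = BertTokenizer.from_pretrained("uer/bart-base-chinese-cluecorpussmall")
model = BartForConditionalGeneration.from_pretrained("uer/bart-base-chinese-cluecorpussmall")

inputs = tokenizer("中国的首都是[MASK]京", return_tensors="pt")
with torch.no_grad():
    # Greedy decoding, matching do_sample=False in the pipeline example above.
    output_ids = model.generate(inputs["input_ids"], max_length=50, num_beams=1, do_sample=False)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```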

## Training data

[CLUECorpusSmall](https://github.com/CLUEbenchmark/CLUECorpus2020/) is used as training data.

## Training procedure

The model is pre-trained by [UER-py](https://github.com/dbiir/UER-py/) on [Tencent Cloud](https://cloud.tencent.com/). We pre-train for 1,000,000 steps with a sequence length of 512. Taking BART-Base as an example:

```
python3 preprocess.py --corpus_path corpora/cluecorpussmall.txt \
                      --vocab_path models/google_zh_vocab.txt \
                      --dataset_path cluecorpussmall_bart_seq512_dataset.pt \
                      --processes_num 32 --seq_length 512 \
                      --data_processor bart
```

```
python3 pretrain.py --dataset_path cluecorpussmall_bart_seq512_dataset.pt \
                    --vocab_path models/google_zh_vocab.txt \
                    --config_path models/bart/base_config.json \
                    --output_model_path models/cluecorpussmall_bart_base_seq512_model.bin \
                    --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
                    --total_steps 1000000 --save_checkpoint_steps 100000 --report_steps 50000 \
                    --learning_rate 5e-5 --batch_size 8 \
                    --span_masking --span_max_length 3
```
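The `--span_masking --span_max_length 3` flags select BART-style text infilling, in which short spans of the input are replaced by a single `[MASK]` token and the decoder learns to reconstruct the original sequence. The sketch below is a toy illustration of that corruption only; it is not UER-py's actual data processor, and the sampling details are assumptions:

```python
import random

def text_infill(tokens, mask_ratio=0.15, span_max_length=3, mask_token="[MASK]"):
    """Toy text infilling: replace random spans of 1..span_max_length tokens
    with a single mask token until roughly mask_ratio of tokens are corrupted."""
    corrupted = list(tokens)
    budget = max(1, int(len(corrupted) * mask_ratio))
    while budget > 0:
        span = random.randint(1, min(span_max_length, budget))
        start = random.randrange(0, max(1, len(corrupted) - span + 1))
        corrupted = corrupted[:start] + [mask_token] + corrupted[start + span:]
        budget -= span
    return corrupted

random.seed(0)
print(text_infill(list("作为电子商务的平台京东绝对是领先者")))
```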

Finally, we convert the pre-trained model into Hugging Face's format:

```
python3 scripts/convert_bart_from_uer_to_huggingface.py --input_model_path cluecorpussmall_bart_base_seq512_model.bin-1000000 \
                                                        --output_model_path pytorch_model.bin \
                                                        --layers_num 6
```
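A quick sanity check on the converted checkpoint is to load it back with `transformers`. This is a sketch under the assumption that the converted `pytorch_model.bin` sits in the current directory together with the `config.json` and `vocab.txt` published in this repository:

```python
from transformers import BartForConditionalGeneration, BertTokenizer

# Load the converted weights and the BERT-style Chinese vocabulary from ./
model = BartForConditionalGeneration.from_pretrained("./")
tokenizer = BertTokenizer.from_pretrained("./")

print(model.config.encoder_layers, model.config.d_model)  # layer count / hidden size
print(f"{model.num_parameters():,} parameters")
```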

### BibTeX entry and citation info

```
@article{lewis2019bart,
  title={BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension},
  author={Lewis, Mike and Liu, Yinhan and Goyal, Naman and Ghazvininejad, Marjan and Mohamed, Abdelrahman and Levy, Omer and Stoyanov, Ves and Zettlemoyer, Luke},
  journal={arXiv preprint arXiv:1910.13461},
  year={2019}
}

@article{zhao2019uer,
  title={UER: An Open-Source Toolkit for Pre-training Models},
  author={Zhao, Zhe and Chen, Hui and Zhang, Jinbin and Zhao, Xin and Liu, Tao and Lu, Wei and Chen, Xi and Deng, Haotang and Ju, Qi and Du, Xiaoyong},
  journal={EMNLP-IJCNLP 2019},
  pages={241},
  year={2019}
}
```

[base]:https://huggingface.co/uer/bart-base-chinese-cluecorpussmall
[large]:https://huggingface.co/uer/bart-large-chinese-cluecorpussmall
config.json ADDED
@@ -0,0 +1,49 @@
{
  "_name_or_path": "bart",
  "activation_dropout": 0.1,
  "activation_function": "gelu",
  "architectures": [
    "BartForConditionalGeneration"
  ],
  "attention_dropout": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": 0.0,
  "d_model": 1024,
  "decoder_attention_heads": 16,
  "decoder_ffn_dim": 4096,
  "decoder_layerdrop": 0.1,
  "decoder_layers": 12,
  "decoder_start_token_id": 101,
  "dropout": 0.1,
  "early_stopping": true,
  "encoder_attention_heads": 16,
  "encoder_ffn_dim": 4096,
  "encoder_layerdrop": 0.1,
  "encoder_layers": 12,
  "eos_token_id": 0,
  "forced_eos_token_id": 0,
  "gradient_checkpointing": false,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "init_std": 0.02,
  "is_encoder_decoder": true,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2
  },
  "max_length": 256,
  "max_position_embeddings": 1024,
  "model_type": "bart",
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "scale_embedding": false,
  "tokenizer_class": "BertTokenizer",
  "torch_dtype": "float32",
  "transformers_version": "4.13.0.dev0",
  "use_cache": true,
  "vocab_size": 21128
}
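For reference, the fields above can be read back through `transformers`; a small sketch, assuming the file is saved locally as `config.json`. With `encoder_layers`/`decoder_layers` of 12 and `d_model` of 1024, this configuration corresponds to the BART-Large entry in the README table:

```python
from transformers import BartConfig

config = BartConfig.from_json_file("config.json")
print(config.encoder_layers, config.decoder_layers, config.d_model)  # 12 12 1024
print(config.vocab_size, config.tokenizer_class)                     # 21128 BertTokenizer
```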
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:133d3270d6048d49b5cc0de7e97f68a596a83170dfeb9f4022c55d9a2fd118d7
size 1506088449
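This is a Git LFS pointer, not the weights themselves; the `oid` line is the SHA-256 of the real `pytorch_model.bin` (about 1.5 GB). A minimal sketch for verifying a downloaded copy against that digest (the local file path is an assumption):

```python
import hashlib

# Expected digest, copied from the LFS pointer above.
EXPECTED = "133d3270d6048d49b5cc0de7e97f68a596a83170dfeb9f4022c55d9a2fd118d7"

def sha256_of(path, chunk_size=1 << 20):
    """Hash the file in 1 MiB chunks so the whole checkpoint never sits in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

assert sha256_of("pytorch_model.bin") == EXPECTED, "checksum mismatch"
```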
special_tokens_map.json ADDED
@@ -0,0 +1 @@
{"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
tf_model.h5 ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:1ea941e6f9e6062c153aa42c55036b52b5441d79363e4fa2aadec88cbbc9be3f
size 1506377248
tokenizer_config.json ADDED
@@ -0,0 +1 @@
{"do_lower_case": true, "do_basic_tokenize": true, "never_split": null, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "special_tokens_map_file": null, "tokenizer_file": null, "name_or_path": "bart", "tokenizer_class": "BertTokenizer"}
vocab.txt ADDED
The diff for this file is too large to render. See raw diff