yunfan committed on
Commit
4e93f21
1 Parent(s): cac8228

update model to version 2.0

README.md CHANGED
@@ -6,9 +6,38 @@ tags:
6
  - BART
7
  language: zh
8
  ---
9
-
10
  # Chinese BART-Base
11
 
12
  ## Model description
13
 
14
  This is an implementation of Chinese BART-Base.
@@ -19,7 +48,6 @@ Yunfan Shao, Zhichao Geng, Yitao Liu, Junqi Dai, Fei Yang, Li Zhe, Hujun Bao, Xi
19
 
20
  **Github Link:** https://github.com/fastnlp/CPT
21
 
22
-
23
  ## Usage
24
 
25
  ```python
@@ -42,4 +70,4 @@ Yunfan Shao, Zhichao Geng, Yitao Liu, Junqi Dai, Fei Yang, Li Zhe, Hujun Bao, Xi
42
  journal={arXiv preprint arXiv:2109.05729},
43
  year={2021}
44
  }
45
- ```
6
  - BART
7
  language: zh
8
  ---
9
  # Chinese BART-Base
10
 
11
+ ### News
12
+
13
+ **12/30/2022**
14
+
15
+ Updated versions of CPT and Chinese BART have been released. The new versions change the following:
16
+
17
+ - **Vocabulary** We replaced the old BERT vocabulary with a larger one of size 51271 built from the training data. The new vocabulary 1) adds more than 6800 previously missing Chinese characters (most of them traditional Chinese characters); 2) removes redundant tokens (e.g. Chinese character tokens with the `##` prefix); 3) adds some English tokens to reduce out-of-vocabulary (OOV) cases.
18
+ - **Position Embeddings** We extended `max_position_embeddings` from 512 to 1024.
19
+
20
+ We initialize the new models from the old checkpoints using vocabulary alignment: token embeddings found in the old checkpoints are copied, and all newly added parameters are randomly initialized. We then train the new CPT and Chinese BART models for a further 50K steps with batch size 2048, max-seq-length 1024, peak learning rate 2e-5, and warmup ratio 0.1.
21
+
22
+ The results, compared with the previous checkpoints, are as follows:
23
+
24
+ | | AFQMC | IFLYTEK | CSL-sum | LCSTS | AVG |
25
+ | :--------- | :---: | :-----: | :-----: | :---: | :---: |
26
+ | Previous | | | | | |
27
+ | bart-base | 73.0 | 60 | 62.1 | 37.8 | 58.23 |
28
+ | cpt-base | 75.1 | 60.5 | 63.0 | 38.2 | 59.20 |
29
+ | bart-large | 75.7 | 62.1 | 64.2 | 40.6 | 60.65 |
30
+ | cpt-large | 75.9 | 61.8 | 63.7 | 42.0 | 60.85 |
31
+ | Updated | | | | | |
32
+ | bart-base | 73.03 | 61.25 | 61.51 | 38.78 | 58.64 |
33
+ | cpt-base | 74.40 | 61.23 | 62.09 | 38.81 | 59.13 |
34
+ | bart-large | 75.81 | 61.52 | 64.62 | 40.90 | 60.71 |
35
+ | cpt-large | 75.97 | 61.63 | 63.83 | 42.08 | 60.88 |
36
+
37
+ The results show that the updated models perform comparably to the previous checkpoints. In some cases the updated model is still slightly worse than the previous one, for two reasons: 1) the few additional training steps did not lead to a significant performance improvement; 2) some downstream tasks are not affected by the newly added tokens or the longer encoding sequences, but are sensitive to the fine-tuning hyperparameters.
38
+
39
+ - Note that to use the updated models, you need to update `modeling_cpt.py` (the new version is available [here](https://github.com/fastnlp/CPT/blob/master/finetune/modeling_cpt.py)) and the vocabulary (refresh the cache).
40
+
41
  ## Model description
42
 
43
  This is an implementation of Chinese BART-Base.
48
 
49
  **Github Link:** https://github.com/fastnlp/CPT
50
 
51
  ## Usage
52
 
53
  ```python
70
  journal={arXiv preprint arXiv:2109.05729},
71
  year={2021}
72
  }
73
+ ```
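The README change above says the new checkpoints are initialized by vocabulary alignment: embeddings for tokens found in the old vocabulary are copied, and the rest are randomly initialized. The following is a minimal sketch of that idea in plain Python, not the authors' code. The function name `align_embeddings`, the toy vocabularies, and the Gaussian initialization with `std=0.02` are all assumptions made for illustration.

```python
import random


def align_embeddings(old_vocab, old_emb, new_vocab, dim, seed=0, std=0.02):
    """Build an embedding table for new_vocab from an old table.

    Tokens that also exist in old_vocab reuse their old vector; every other
    token gets a random Gaussian vector. std=0.02 is an assumed value and is
    not taken from the commit.
    """
    rng = random.Random(seed)
    old_index = {tok: i for i, tok in enumerate(old_vocab)}
    new_emb, copied = [], 0
    for tok in new_vocab:
        if tok in old_index:
            # Token survived the vocabulary change: copy its old vector.
            new_emb.append(list(old_emb[old_index[tok]]))
            copied += 1
        else:
            # Newly added token: random initialization.
            new_emb.append([rng.gauss(0.0, std) for _ in range(dim)])
    return new_emb, copied


# Toy data (hypothetical): "##中" is dropped, "龘" and "[EOS]" are new.
old_vocab = ["[PAD]", "[UNK]", "中", "##中"]
old_emb = [[0.0, 0.0], [0.1, 0.1], [0.2, 0.2], [0.3, 0.3]]
new_vocab = ["[PAD]", "[UNK]", "中", "龘", "[EOS]"]

new_emb, copied = align_embeddings(old_vocab, old_emb, new_vocab, dim=2)
print(f"copied {copied} of {len(new_vocab)} embeddings")
```

In a real model the same lookup would be applied to the embedding weight tensor (and to any tied output layer). The extended position embeddings would need a similar copy-then-initialize step, which this sketch does not cover.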
config.json CHANGED
@@ -4,7 +4,7 @@
4
  "add_bias_logits": false,
5
  "add_final_layer_norm": false,
6
  "architectures": [
7
- "BartModel"
8
  ],
9
  "attention_dropout": 0.1,
10
  "bos_token_id": 101,
@@ -37,7 +37,7 @@
37
  "LABEL_1": 1,
38
  "LABEL_2": 2
39
  },
40
- "max_position_embeddings": 512,
41
  "model_type": "bart",
42
  "no_repeat_ngram_size": 3,
43
  "normalize_before": false,
@@ -68,5 +68,6 @@
68
  },
69
  "transformers_version": "4.4.1",
70
  "use_cache": true,
71
- "vocab_size": 21128
72
  }
4
  "add_bias_logits": false,
5
  "add_final_layer_norm": false,
6
  "architectures": [
7
+ "BartForConditionalGeneration"
8
  ],
9
  "attention_dropout": 0.1,
10
  "bos_token_id": 101,
37
  "LABEL_1": 1,
38
  "LABEL_2": 2
39
  },
40
+ "max_position_embeddings": 1024,
41
  "model_type": "bart",
42
  "no_repeat_ngram_size": 3,
43
  "normalize_before": false,
68
  },
69
  "transformers_version": "4.4.1",
70
  "use_cache": true,
71
+ "tokenizer_class": "BertTokenizer",
72
+ "vocab_size": 51271
73
  }
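The `config.json` hunks above can be summarized by comparing the old and new values key by key. The sketch below does this for only the fields visible in the diff. The helper name `diff_config` is hypothetical, and the two dictionaries are partial configs, not the full files.

```python
def diff_config(old, new):
    """Return {key: (old_value, new_value)} for every key whose value differs.

    A key missing on one side is reported with None on that side, so a key
    explicitly set to None is indistinguishable from a missing key here.
    """
    changes = {}
    for key in sorted(set(old) | set(new)):
        before, after = old.get(key), new.get(key)
        if before != after:
            changes[key] = (before, after)
    return changes


# Only the fields shown in the diff; everything else is omitted.
old_cfg = {
    "architectures": ["BartModel"],
    "max_position_embeddings": 512,
    "vocab_size": 21128,
}
new_cfg = {
    "architectures": ["BartForConditionalGeneration"],
    "max_position_embeddings": 1024,
    "tokenizer_class": "BertTokenizer",
    "vocab_size": 51271,
}

changes = diff_config(old_cfg, new_cfg)
for key, (before, after) in changes.items():
    print(f"{key}: {before!r} -> {after!r}")
```

A check like this could be run against a cached copy of the config to see whether it still reflects the old version.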
pytorch_model.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:f47b2af68988f741f290fa2a290167733a9cee98467574b1a471fbe572f0b212
3
- size 598070940
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a1664f0f2e60f75fdd9153db3cba24ed7cf3a38ce79bd884a3f3217f30c0e0f8
3
+ size 561076499
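The `pytorch_model.bin` entry is a Git LFS pointer, so the diff shows only the SHA-256 digest and the byte size of the weights. A downloaded file can be checked against the pointer as sketched below. The helper names `parse_lfs_pointer` and `matches_pointer` are hypothetical, and the local path in the final comment is an assumption.

```python
import hashlib


def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer file into a dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path, pointer, chunk_size=1 << 20):
    """Return True if the file at path has the size and digest in pointer."""
    hasher = hashlib.new(pointer["algo"])
    size = 0
    with open(path, "rb") as f:
        # Read in chunks so large checkpoints are not loaded into memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
            size += len(chunk)
    return size == pointer["size"] and hasher.hexdigest() == pointer["digest"]


# The new pointer, copied from the diff above.
NEW_POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:a1664f0f2e60f75fdd9153db3cba24ed7cf3a38ce79bd884a3f3217f30c0e0f8
size 561076499
"""

pointer = parse_lfs_pointer(NEW_POINTER)
print(pointer["algo"], pointer["size"])
# Usage with a local download (path is an assumption):
# matches_pointer("pytorch_model.bin", pointer)
```

A mismatch on either field suggests that a cached copy still holds the previous checkpoint, whose pointer lists a size of 598070940 bytes.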
special_tokens_map.json CHANGED
@@ -1 +1 @@
1
- {"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
1
+ {"bos_token": "[CLS]", "eos_token": "[EOS]", "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
tokenizer_config.json CHANGED
@@ -1 +1 @@
1
- {"do_lower_case": true, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "special_tokens_map_file": "/remote-home/yfshao/.cache/huggingface/transformers/d521373fc7ac35f63d56cf303de74a202403dcf1aaa792cd01f653694be59563.dd8bd9bfd3664b530ea4e645105f557769387b3da9f79bdb55ed556bdd80611d", "name_or_path": "hfl/chinese-roberta-wwm-ext"}
1
+ {"do_lower_case": false, "do_basic_tokenize": true, "never_split": null, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "bos_token": "[CLS]", "eos_token": "[EOS]", "name_or_path": "/remote-home/yfshao/workdir/code-base/Megatron-LM/init_models_ckpt/bart_zh/base", "special_tokens_map_file": "vocab/cpt_v3_vocab/special_tokens_map.json", "tokenizer_file": null}
vocab.txt CHANGED
The diff for this file is too large to render. See raw diff
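Because the `vocab.txt` diff is not rendered, one way to confirm that a local copy is the new vocabulary is to check it against the other files in this commit: every token in `special_tokens_map.json` should appear in the vocabulary, and the entry count should equal `vocab_size` in `config.json`. The sketch below assumes one token per line. The helper name `check_vocab` and the toy vocabulary are hypothetical.

```python
def check_vocab(vocab, special_tokens_map, expected_size=None):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    present = set(vocab)
    for role, token in special_tokens_map.items():
        if token not in present:
            problems.append(f"{role} {token!r} missing from vocabulary")
    if expected_size is not None and len(vocab) != expected_size:
        problems.append(
            f"vocabulary has {len(vocab)} entries, config expects {expected_size}"
        )
    return problems


# Special tokens copied from the new special_tokens_map.json in this commit.
special_tokens = {
    "bos_token": "[CLS]", "eos_token": "[EOS]", "unk_token": "[UNK]",
    "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]",
    "mask_token": "[MASK]",
}

# Toy vocabulary (hypothetical) that lacks "[EOS]", as an old-style file might.
toy_vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "中"]
for problem in check_vocab(toy_vocab, special_tokens):
    print(problem)

# With a real file, assuming it sits in the working directory:
# with open("vocab.txt", encoding="utf-8") as f:
#     vocab = [line.rstrip("\n") for line in f]
# check_vocab(vocab, special_tokens, expected_size=51271)
```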