hhou435 committed
Commit b16ef7c
1 Parent(s): c6bb938

First version of the chinese_roberta_L-8_H-512 model and tokenizer.

Files changed (7)
  1. README.md +144 -0
  2. config.json +20 -0
  3. pytorch_model.bin +3 -0
  4. special_tokens_map.json +1 -0
  5. tf_model.h5 +3 -0
  6. tokenizer_config.json +1 -0
  7. vocab.txt +0 -0
README.md ADDED
@@ -0,0 +1,144 @@
+ ---
+ language: zh
+ datasets: CLUECorpus
+ widget:
+ - text: "北京是[MASK]国的首都。"
+ ---
+
+
+ # Chinese RoBERTa Miniatures
+
+ ## Model description
+
+ This is one of the 24 Chinese RoBERTa miniatures pre-trained by [UER-py](https://www.aclweb.org/anthology/D19-3041.pdf); this repository hosts the 8-layer, 512-hidden (Medium) model.
+
+ You can download the 24 Chinese RoBERTa miniatures either from the [UER-py Github page](https://github.com/dbiir/UER-py/) or via HuggingFace from the links below (a short loading sketch follows the table):
+
+ | |H=128|H=256|H=512|H=768|
+ |---|:---:|:---:|:---:|:---:|
+ | **L=2** |[**2/128 (Tiny)**][2_128]|2/256|2/512|2/768|
+ | **L=4** |4/128|[**4/256 (Mini)**][4_256]|**4/512 (Small)**|4/768|
+ | **L=6** |6/128|6/256|6/512|6/768|
+ | **L=8** |8/128|8/256|[**8/512 (Medium)**][8_512]|8/768|
+ | **L=10** |10/128|10/256|10/512|10/768|
+ | **L=12** |12/128|12/256|12/512|**12/768 (Base)**|
+
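+ All 24 checkpoints appear to share the repository naming pattern used by the three linked entries, `uer/chinese_roberta_L-{layers}_H-{hidden}`. The sketch below builds this card's model id from that pattern; the ids for sizes without an explicit link are an assumption based on the pattern, not confirmed by this card.
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForMaskedLM
+
+ # Repository naming pattern observed in the link definitions at the bottom of this card.
+ # Assumed (not confirmed here) to hold for all 24 sizes.
+ layers, hidden = 8, 512  # the Medium model this card describes
+ model_id = f"uer/chinese_roberta_L-{layers}_H-{hidden}"
+
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForMaskedLM.from_pretrained(model_id)
+ ```
+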
+ ## How to use
+
+ You can use this model directly with a pipeline for masked language modeling:
+
+ ```python
+ >>> from transformers import pipeline
+ >>> unmasker = pipeline('fill-mask', model='uer/chinese_roberta_L-8_H-512')
+ >>> unmasker("中国的首都是[MASK]京。")
+ [
+     {'sequence': '[CLS] 中 国 的 首 都 是 北 京 。 [SEP]',
+      'score': 0.9338967204093933,
+      'token': 1266,
+      'token_str': '北'},
+     {'sequence': '[CLS] 中 国 的 首 都 是 南 京 。 [SEP]',
+      'score': 0.039428312331438065,
+      'token': 1298,
+      'token_str': '南'},
+     {'sequence': '[CLS] 中 国 的 首 都 是 东 京 。 [SEP]',
+      'score': 0.01681734062731266,
+      'token': 691,
+      'token_str': '东'},
+     {'sequence': '[CLS] 中 国 的 首 都 是 普 京 。 [SEP]',
+      'score': 0.004590896889567375,
+      'token': 3249,
+      'token_str': '普'},
+     {'sequence': '[CLS] 中 国 的 首 都 是 燕 京 。 [SEP]',
+      'score': 0.0007656012894585729,
+      'token': 4242,
+      'token_str': '燕'}
+ ]
+ ```
+
+ Here is how to use this model to get the features of a given text in PyTorch:
+
+ ```python
+ from transformers import BertTokenizer, BertModel
+ tokenizer = BertTokenizer.from_pretrained('uer/chinese_roberta_L-8_H-512')
+ model = BertModel.from_pretrained("uer/chinese_roberta_L-8_H-512")
+ text = "用你喜欢的任何文本替换我。"  # "Replace me with any text you like."
+ encoded_input = tokenizer(text, return_tensors='pt')
+ output = model(**encoded_input)
+ ```
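+
+ The token-level features live in `output.last_hidden_state`, with shape `(batch_size, sequence_length, 512)` for this model. The follow-up sketch below pools them into a single sentence vector; mean pooling is only an illustrative choice, not something this card prescribes, and it reuses `encoded_input` and `output` from the block above.
+
+ ```python
+ # output.last_hidden_state: (batch_size, sequence_length, hidden_size) == (1, seq_len, 512)
+ token_embeddings = output.last_hidden_state
+ # Mask-aware mean pooling over real (non-padding) tokens -- one illustrative way
+ # to turn per-token features into a single sentence embedding.
+ mask = encoded_input['attention_mask'].unsqueeze(-1)      # (1, seq_len, 1)
+ sentence_embedding = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
+ print(sentence_embedding.shape)                           # torch.Size([1, 512])
+ ```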
+
+ And in TensorFlow:
+
+ ```python
+ from transformers import BertTokenizer, TFBertModel
+ tokenizer = BertTokenizer.from_pretrained('uer/chinese_roberta_L-8_H-512')
+ model = TFBertModel.from_pretrained("uer/chinese_roberta_L-8_H-512")
+ text = "用你喜欢的任何文本替换我。"  # "Replace me with any text you like."
+ encoded_input = tokenizer(text, return_tensors='tf')
+ output = model(encoded_input)
+ ```
+
+ ## Training data
+
+ CLUECorpus2020 and CLUECorpusSmall are used as training data.
+
+ ## Training procedure
+
+ Models are pre-trained by [UER-py](https://github.com/dbiir/UER-py/) on [Tencent Cloud TI-ONE](https://cloud.tencent.com/product/tione/). We pre-train for 1,000,000 steps with a sequence length of 128 and then for an additional 250,000 steps with a sequence length of 512.
+
+ Stage 1:
+ ```
+ python3 preprocess.py --corpus_path corpora/cluecorpus.txt \
+                       --vocab_path models/google_zh_vocab.txt \
+                       --dataset_path cluecorpus_seq128_dataset.pt \
+                       --processes_num 32 --seq_length 128 \
+                       --dynamic_masking --target mlm
+ ```
+ ```
+ python3 pretrain.py --dataset_path cluecorpus_seq128_dataset.pt \
+                     --vocab_path models/google_zh_vocab.txt \
+                     --config_path models/bert_medium_config.json \
+                     --output_model_path models/cluecorpus_roberta_medium_seq128_model.bin \
+                     --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
+                     --total_steps 1000000 --save_checkpoint_steps 100000 --report_steps 50000 \
+                     --learning_rate 1e-4 --batch_size 64 \
+                     --tie_weights --encoder bert --target mlm
+ ```
+ Stage 2:
+ ```
+ python3 preprocess.py --corpus_path corpora/cluecorpus.txt \
+                       --vocab_path models/google_zh_vocab.txt \
+                       --dataset_path cluecorpus_seq512_dataset.pt \
+                       --processes_num 32 --seq_length 512 \
+                       --dynamic_masking --target mlm
+ ```
+ ```
+ python3 pretrain.py --dataset_path cluecorpus_seq512_dataset.pt \
+                     --pretrained_model_path models/cluecorpus_roberta_medium_seq128_model.bin-1000000 \
+                     --vocab_path models/google_zh_vocab.txt \
+                     --config_path models/bert_medium_config.json \
+                     --output_model_path models/cluecorpus_roberta_medium_seq512_model.bin \
+                     --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
+                     --total_steps 250000 --save_checkpoint_steps 50000 --report_steps 10000 \
+                     --learning_rate 5e-5 --batch_size 16 \
+                     --tie_weights --encoder bert --target mlm
+ ```
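+
+ The commands above leave the final checkpoint in UER-py's own format, while this commit adds Hugging Face-format weights (`pytorch_model.bin`, `tf_model.h5`), so a conversion step is implied. UER-py ships conversion scripts for this; the invocation below is a hedged sketch in the style of the commands above, and the exact input path and flags are assumptions rather than something stated in this card.
+
+ ```
+ python3 scripts/convert_bert_from_uer_to_huggingface.py --input_model_path models/cluecorpus_roberta_medium_seq512_model.bin-250000 \
+                                                         --output_model_path pytorch_model.bin \
+                                                         --layers_num 8 --target mlm
+ ```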
+
+ ### BibTeX entry and citation info
+
+ ```
+ @article{zhao2019uer,
+   title={UER: An Open-Source Toolkit for Pre-training Models},
+   author={Zhao, Zhe and Chen, Hui and Zhang, Jinbin and Zhao, Xin and Liu, Tao and Lu, Wei and Chen, Xi and Deng, Haotang and Ju, Qi and Du, Xiaoyong},
+   journal={EMNLP-IJCNLP 2019},
+   pages={241},
+   year={2019}
+ }
+ ```
+
+ [2_128]: https://huggingface.co/uer/chinese_roberta_L-2_H-128
+ [4_256]: https://huggingface.co/uer/chinese_roberta_L-4_H-256
+ [8_512]: https://huggingface.co/uer/chinese_roberta_L-8_H-512
config.json ADDED
@@ -0,0 +1,20 @@
+ {
+   "architectures": [
+     "BertForMaskedLM"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "gradient_checkpointing": false,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 512,
+   "initializer_range": 0.02,
+   "intermediate_size": 2048,
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 8,
+   "num_hidden_layers": 8,
+   "pad_token_id": 0,
+   "type_vocab_size": 2,
+   "vocab_size": 21128
+ }
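
The values in `config.json` are what the `L-8_H-512` name encodes. A quick way to confirm them is to load the config with transformers (a minimal sketch using the repo id from the README; not part of the original card):

```python
from transformers import BertConfig

config = BertConfig.from_pretrained("uer/chinese_roberta_L-8_H-512")
# L-8_H-512: 8 transformer layers and 512-dim hidden states (with 8 attention heads
# and a 2048-dim feed-forward layer, per the config above).
print(config.num_hidden_layers, config.hidden_size, config.num_attention_heads)  # 8 512 8
```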
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:08d4f174eca71a30050c061139d9224158cbc1f07c12a5e8d31413823d304539
+ size 146403143
special_tokens_map.json ADDED
@@ -0,0 +1 @@
+ {"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
tf_model.h5 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ac1fbad67249f2b0453069239129dc8e8e3a5ae24ecf23cab296d5e9438fb17f
+ size 191919800
tokenizer_config.json ADDED
@@ -0,0 +1 @@
+ {"do_lower_case": false, "do_basic_tokenize": true, "never_split": null, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "model_max_length": 512}
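
With `tokenize_chinese_chars` set to true and `do_lower_case` false, the tokenizer splits Chinese text into single characters, which is why the fill-mask predictions in the README come back as space-separated characters. A minimal sketch illustrating this (repo id as used in the README):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("uer/chinese_roberta_L-8_H-512")
# Each Chinese character becomes its own WordPiece token.
print(tokenizer.tokenize("北京是中国的首都。"))
# expected along the lines of: ['北', '京', '是', '中', '国', '的', '首', '都', '。']
```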
vocab.txt ADDED
The diff for this file is too large to render. See raw diff