hhou435 committed on
Commit
9fab674
1 Parent(s): 167f4f2

First version of the chinese_roberta_L-12_H-128 model and tokenizer.

Files changed (7)
  1. README.md +148 -0
  2. config.json +20 -0
  3. pytorch_model.bin +3 -0
  4. special_tokens_map.json +1 -0
  5. tf_model.h5 +3 -0
  6. tokenizer_config.json +1 -0
  7. vocab.txt +0 -0
README.md ADDED
@@ -0,0 +1,148 @@
+ ---
+ language: Chinese
+ datasets: CLUECorpus
+ widget:
+ - text: "北京是[MASK]国的首都。"
+ ---
+
+
+ # Chinese RoBERTa Miniatures
+
+ ## Model description
+
+ This is the set of 24 Chinese RoBERTa models pre-trained by [UER-py](https://www.aclweb.org/anthology/D19-3041.pdf).
+
+ You can download the 24 Chinese RoBERTa miniatures either from the [UER-py GitHub page](https://github.com/dbiir/UER-py/), or via Hugging Face from the links below:
+
+ | |H=128|H=256|H=512|H=768|
+ |---|:---:|:---:|:---:|:---:|
+ | **L=2** |[**2/128 (Tiny)**][2_128]|[2/256][2_256]|2/512|2/768|
+ | **L=4** |4/128|[**4/256 (Mini)**][4_256]|[**4/512 (Small)**][4_512]|4/768|
+ | **L=6** |6/128|6/256|6/512|6/768|
+ | **L=8** |8/128|8/256|[**8/512 (Medium)**][8_512]|8/768|
+ | **L=10** |10/128|10/256|10/512|10/768|
+ | **L=12** |[12/128][12_128]|12/256|12/512|**12/768 (Base)**|
+
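The repository names in the table follow the pattern `uer/chinese_roberta_L-{layers}_H-{hidden}`. As a minimal sketch (assuming the checkpoint you want is among those already published under that pattern), any miniature can be loaded by building the name programmatically:

```python
from transformers import BertTokenizer, BertForMaskedLM

def load_miniature(layers: int, hidden: int):
    """Load a Chinese RoBERTa miniature by its (layers, hidden) size."""
    name = f"uer/chinese_roberta_L-{layers}_H-{hidden}"
    tokenizer = BertTokenizer.from_pretrained(name)
    model = BertForMaskedLM.from_pretrained(name)
    return tokenizer, model

# For example, the L=12, H=128 model described in this card:
tokenizer, model = load_miniature(12, 128)
```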
+ ## How to use
+
+ You can use this model directly with a pipeline for masked language modeling:
+
+ ```python
+ >>> from transformers import pipeline
+ >>> unmasker = pipeline('fill-mask', model='uer/chinese_roberta_L-12_H-128')
+ >>> unmasker("中国的首都是[MASK]京。")
+ [
+     {'sequence': '[CLS] 中 国 的 首 都 是 北 京 。 [SEP]',
+      'score': 0.7025051116943359,
+      'token': 1266,
+      'token_str': '北'},
+     {'sequence': '[CLS] 中 国 的 首 都 是 东 京 。 [SEP]',
+      'score': 0.10765232890844345,
+      'token': 691,
+      'token_str': '东'},
+     {'sequence': '[CLS] 中 国 的 首 都 是 南 京 。 [SEP]',
+      'score': 0.10444136708974838,
+      'token': 1298,
+      'token_str': '南'},
+     {'sequence': '[CLS] 中 国 的 首 都 是 葡 京 。 [SEP]',
+      'score': 0.05845486372709274,
+      'token': 5868,
+      'token_str': '葡'},
+     {'sequence': '[CLS] 中 国 的 首 都 是 普 京 。 [SEP]',
+      'score': 0.004515049979090691,
+      'token': 3249,
+      'token_str': '普'}
+ ]
+ ```
+
+ Here is how to use this model to get the features of a given text in PyTorch:
+
+ ```python
+ from transformers import BertTokenizer, BertModel
+ tokenizer = BertTokenizer.from_pretrained('uer/chinese_roberta_L-12_H-128')
+ model = BertModel.from_pretrained("uer/chinese_roberta_L-12_H-128")
+ text = "用你喜欢的任何文本替换我。"
+ encoded_input = tokenizer(text, return_tensors='pt')
+ output = model(**encoded_input)
+ ```
+
+ and in TensorFlow:
+
+ ```python
+ from transformers import BertTokenizer, TFBertModel
+ tokenizer = BertTokenizer.from_pretrained('uer/chinese_roberta_L-12_H-128')
+ model = TFBertModel.from_pretrained("uer/chinese_roberta_L-12_H-128")
+ text = "用你喜欢的任何文本替换我。"
+ encoded_input = tokenizer(text, return_tensors='tf')
+ output = model(encoded_input)
+ ```
+
+ ## Training data
+
+ CLUECorpus2020 and CLUECorpusSmall are used as training data.
+
+ ## Training procedure
+
+ Models are pre-trained by [UER-py](https://github.com/dbiir/UER-py/) on [Tencent Cloud TI-ONE](https://cloud.tencent.com/product/tione/). We pre-train for 1,000,000 steps with a sequence length of 128, and then for an additional 250,000 steps with a sequence length of 512.
+
+ Stage 1:
+ ```
+ python3 preprocess.py --corpus_path corpora/cluecorpus.txt \
+                       --vocab_path models/google_zh_vocab.txt \
+                       --dataset_path cluecorpus_seq128_dataset.pt \
+                       --processes_num 32 --seq_length 128 \
+                       --dynamic_masking --target mlm
+ ```
+ ```
+ python3 pretrain.py --dataset_path cluecorpus_seq128_dataset.pt \
+                     --vocab_path models/google_zh_vocab.txt \
+                     --config_path models/bert_l12h128_config.json \
+                     --output_model_path models/cluecorpus_roberta_l12h128_seq128_model.bin \
+                     --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
+                     --total_steps 1000000 --save_checkpoint_steps 100000 --report_steps 50000 \
+                     --learning_rate 1e-4 --batch_size 64 \
+                     --tie_weights --encoder bert --target mlm
+ ```
+ Stage 2:
+ ```
+ python3 preprocess.py --corpus_path corpora/cluecorpus.txt \
+                       --vocab_path models/google_zh_vocab.txt \
+                       --dataset_path cluecorpus_seq512_dataset.pt \
+                       --processes_num 32 --seq_length 512 \
+                       --dynamic_masking --target mlm
+ ```
+ ```
+ python3 pretrain.py --dataset_path cluecorpus_seq512_dataset.pt \
+                     --pretrained_model_path models/cluecorpus_roberta_l12h128_seq128_model.bin-1000000 \
+                     --vocab_path models/google_zh_vocab.txt \
+                     --config_path models/bert_l12h128_config.json \
+                     --output_model_path models/cluecorpus_roberta_l12h128_seq512_model.bin \
+                     --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
+                     --total_steps 250000 --save_checkpoint_steps 50000 --report_steps 10000 \
+                     --learning_rate 5e-5 --batch_size 16 \
+                     --tie_weights --encoder bert --target mlm
+ ```
+
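The checkpoint produced by Stage 2 is in UER-py format, while this repository ships a Hugging Face `pytorch_model.bin`. Below is a minimal conversion sketch; it assumes the `scripts/convert_bert_from_uer_to_huggingface.py` script bundled with UER-py and assumes the final Stage 2 checkpoint carries the `-250000` step suffix.

```
python3 scripts/convert_bert_from_uer_to_huggingface.py --input_model_path models/cluecorpus_roberta_l12h128_seq512_model.bin-250000 \
                                                        --output_model_path pytorch_model.bin \
                                                        --layers_num 12 --target mlm
```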
+ ### BibTeX entry and citation info
+
+ ```
+ @article{zhao2019uer,
+   title={UER: An Open-Source Toolkit for Pre-training Models},
+   author={Zhao, Zhe and Chen, Hui and Zhang, Jinbin and Zhao, Xin and Liu, Tao and Lu, Wei and Chen, Xi and Deng, Haotang and Ju, Qi and Du, Xiaoyong},
+   journal={EMNLP-IJCNLP 2019},
+   pages={241},
+   year={2019}
+ }
+ ```
+
+ [2_128]: https://huggingface.co/uer/chinese_roberta_L-2_H-128
+ [2_256]: https://huggingface.co/uer/chinese_roberta_L-2_H-256
+ [4_256]: https://huggingface.co/uer/chinese_roberta_L-4_H-256
+ [4_512]: https://huggingface.co/uer/chinese_roberta_L-4_H-512
+ [8_512]: https://huggingface.co/uer/chinese_roberta_L-8_H-512
+ [12_128]: https://huggingface.co/uer/chinese_roberta_L-12_H-128
config.json ADDED
@@ -0,0 +1,20 @@
+ {
+   "architectures": [
+     "BertForMaskedLM"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "gradient_checkpointing": false,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 128,
+   "initializer_range": 0.02,
+   "intermediate_size": 512,
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 2,
+   "num_hidden_layers": 12,
+   "pad_token_id": 0,
+   "type_vocab_size": 2,
+   "vocab_size": 21128
+ }
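For orientation, here is a minimal sketch of instantiating this L-12, H-128 architecture from the values in `config.json` using the `transformers` `BertConfig` and `BertForMaskedLM` classes; the printed parameter count is an estimate derived from the config, not an official figure.

```python
from transformers import BertConfig, BertForMaskedLM

# Mirror the key fields of config.json above (remaining fields keep their defaults).
config = BertConfig(
    vocab_size=21128,
    hidden_size=128,
    num_hidden_layers=12,
    num_attention_heads=2,
    intermediate_size=512,
    max_position_embeddings=512,
    type_vocab_size=2,
)
model = BertForMaskedLM(config)  # randomly initialized; trained weights come from pytorch_model.bin below

# Roughly 5M parameters, which at float32 is consistent with the ~21 MB checkpoint size.
print(sum(p.numel() for p in model.parameters()))
```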
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8e40c58ffb53d537e062dcdefad08bd3337e7de8389d1cd39836661f67627b92
+ size 20836679
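The three lines above are a Git LFS pointer, not the weights themselves; the actual ~21 MB binary is stored via LFS. As a minimal sketch (assuming the `huggingface_hub` client is installed), the resolved file can be fetched directly:

```python
from huggingface_hub import hf_hub_download

# Downloads the real pytorch_model.bin that the LFS pointer above refers to.
path = hf_hub_download(
    repo_id="uer/chinese_roberta_L-12_H-128",
    filename="pytorch_model.bin",
)
print(path)
```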
special_tokens_map.json ADDED
@@ -0,0 +1 @@
+ {"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
tf_model.h5 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8e1169bc549d56bdcc6249565084471447994e4eb7c8b662d9899648c544718e
+ size 32177320
tokenizer_config.json ADDED
@@ -0,0 +1 @@
+ {"do_lower_case": false, "do_basic_tokenize": true, "never_split": null, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "model_max_length": 512}
vocab.txt ADDED
The diff for this file is too large to render. See raw diff