uer committed
Commit f55e2b0
1 Parent(s): c1ba339

Update README.md

Files changed (1): README.md (+147 -1)
README.md CHANGED
@@ -1 +1,147 @@
- The model is coming soon.
 
---
language: Chinese
datasets: CLUECorpusSmall
widget:
- text: "中国的首都是[MASK]"
---


# Chinese RoBERTa-base-word Model

## Model description

We use a sentencepiece model for word-level segmentation when pre-training this RoBERTa base model. You can download the model from the HuggingFace model hub: [roberta-base-word-chinese-cluecorpussmall](https://huggingface.co/uer/roberta-base-word-chinese-cluecorpussmall).

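As a quick illustration (not part of the original card), you can load the tokenizer and inspect how a sentence is segmented; the exact pieces depend on the released sentencepiece vocabulary:

```python
from transformers import AlbertTokenizer

# AlbertTokenizer wraps the sentencepiece vocabulary shipped with this checkpoint
tokenizer = AlbertTokenizer.from_pretrained('uer/roberta-base-word-chinese-cluecorpussmall')

# The sentence is split into sentencepiece word pieces (not necessarily single characters)
print(tokenizer.tokenize("中国的首都是北京。"))
```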
## How to use

You can use this model directly with a pipeline for masked language modeling:

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='uer/roberta-base-word-chinese-cluecorpussmall')
>>> unmasker("中国的首都是[MASK]。")
```

BertTokenizer does not support sentencepiece, so we use AlbertTokenizer here.

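If the pipeline does not pick up the sentencepiece tokenizer automatically, you can construct it explicitly. This is a minimal sketch; it assumes the checkpoint carries a standard BERT-style masked-LM head that `BertForMaskedLM` can load:

```python
from transformers import AlbertTokenizer, BertForMaskedLM, pipeline

# Load the sentencepiece-based tokenizer and the masked-LM model explicitly
tokenizer = AlbertTokenizer.from_pretrained('uer/roberta-base-word-chinese-cluecorpussmall')
model = BertForMaskedLM.from_pretrained('uer/roberta-base-word-chinese-cluecorpussmall')

unmasker = pipeline('fill-mask', model=model, tokenizer=tokenizer)
unmasker("中国的首都是[MASK]。")
```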
Here is how to use this model to get the features of a given text in PyTorch:

```python
from transformers import AlbertTokenizer, BertModel

tokenizer = AlbertTokenizer.from_pretrained('uer/roberta-base-word-chinese-cluecorpussmall')
model = BertModel.from_pretrained("uer/roberta-base-word-chinese-cluecorpussmall")

text = "用你喜欢的任何文本替换我。"  # "Replace me with any text you like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```

and in TensorFlow:

```python
from transformers import AlbertTokenizer, TFBertModel

tokenizer = AlbertTokenizer.from_pretrained('uer/roberta-base-word-chinese-cluecorpussmall')
model = TFBertModel.from_pretrained("uer/roberta-base-word-chinese-cluecorpussmall")

text = "用你喜欢的任何文本替换我。"  # "Replace me with any text you like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```
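If you need a single fixed-size vector per sentence rather than per-token features, one common recipe (a sketch, not part of the original card; it continues the PyTorch example above) is to mean-pool the last hidden states over the non-padding tokens:

```python
import torch

with torch.no_grad():
    output = model(**encoded_input)

# Average the token embeddings, ignoring padding positions
mask = encoded_input['attention_mask'].unsqueeze(-1).float()  # [batch, seq_len, 1]
sentence_embedding = (output.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embedding.shape)  # [batch, hidden_size]
```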
## Training data

[CLUECorpusSmall](https://github.com/CLUEbenchmark/CLUECorpus2020/) is used as training data.

## Training procedure

We use Google's **[sentencepiece](https://github.com/google/sentencepiece)** to train the sentencepiece model:

```python
>>> import sentencepiece as spm
>>> spm.SentencePieceTrainer.train(input='CLUEsmall_shuf.txt',
                                   model_prefix='clue_6',
                                   vocab_size=100000,
                                   max_sentence_length=1024,
                                   max_sentencepiece_length=6,
                                   user_defined_symbols=['[MASK]','[unused1]','[unused2]',
                                                         '[unused3]','[unused4]','[unused5]','[unused6]',
                                                         '[unused7]','[unused8]','[unused9]','[unused10]'],
                                   pad_id=0,
                                   pad_piece='[PAD]',
                                   unk_id=1,
                                   unk_piece='[UNK]',
                                   bos_id=2,
                                   bos_piece='[CLS]',
                                   eos_id=3,
                                   eos_piece='[SEP]',
                                   train_extremely_large_corpus=True)
```
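To sanity-check the trained vocabulary, you can load the resulting clue_6.model and encode a sentence. This is a sketch, not part of the original procedure, and assumes a recent sentencepiece release that accepts the model_file keyword:

```python
import sentencepiece as spm

# Load the sentencepiece model produced by the training call above
sp = spm.SentencePieceProcessor(model_file='clue_6.model')

print(sp.vocab_size())                              # 100000, as configured above
print(sp.encode('中国的首都是北京。', out_type=str))  # word-level pieces
print(sp.piece_to_id('[MASK]'))                     # id of the user-defined [MASK] symbol
```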
The model is pre-trained by [UER-py](https://github.com/dbiir/UER-py/) on [Tencent Cloud TI-ONE](https://cloud.tencent.com/product/tione/). We pre-train 1,000,000 steps with a sequence length of 128 and then pre-train 250,000 additional steps with a sequence length of 512.

Stage1:

```
python3 preprocess.py --corpus_path corpora/cluecorpussmall.txt \
                      --spm_model_path models/clue_6.model \
                      --dataset_path cluecorpussmall_seq128_dataset.pt \
                      --processes_num 32 --seq_length 128 \
                      --dynamic_masking --target mlm
```

```
python3 pretrain.py --dataset_path cluecorpussmall_seq128_dataset.pt \
                    --spm_model_path models/clue_6.model \
                    --config_path models/bert/base_config.json \
                    --output_model_path models/cluecorpussmall_word_roberta_base_128.bin \
                    --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
                    --total_steps 1000000 --save_checkpoint_steps 100000 --report_steps 50000 \
                    --learning_rate 1e-4 --batch_size 64 \
                    --embedding word_pos_seg --encoder transformer --mask fully_visible \
                    --target mlm --tie_weights
```

Stage2:

```
python3 preprocess.py --corpus_path corpora/cluecorpussmall.txt \
                      --spm_model_path models/clue_6.model \
                      --dataset_path cluecorpussmall_seq512_dataset.pt \
                      --processes_num 32 --seq_length 512 \
                      --dynamic_masking --target mlm
```

```
python3 pretrain.py --dataset_path cluecorpussmall_seq512_dataset.pt \
                    --pretrained_model_path models/cluecorpussmall_word_roberta_base_128.bin-1000000 \
                    --spm_model_path models/clue_6.model \
                    --config_path models/bert/base_config.json \
                    --output_model_path models/cluecorpussmall_word_roberta_base_512.bin \
                    --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
                    --total_steps 250000 --save_checkpoint_steps 50000 --report_steps 10000 \
                    --learning_rate 5e-5 --batch_size 16 \
                    --embedding word_pos_seg --encoder transformer --mask fully_visible \
                    --target mlm --tie_weights
```

Finally, we convert the pre-trained model into Hugging Face's format:

```
python3 scripts/convert_bert_from_uer_to_huggingface.py --input_model_path models/cluecorpussmall_word_roberta_base_512.bin-250000 \
                                                        --output_model_path pytorch_model.bin \
                                                        --layers_num 12 --target mlm
```
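As a quick check of the converted checkpoint, you can load it locally with transformers. This is a sketch with a hypothetical directory name; it assumes pytorch_model.bin is placed alongside a matching BERT base config.json and the sentencepiece tokenizer files:

```python
from transformers import AlbertTokenizer, BertForMaskedLM

model_dir = './cluecorpussmall_word_roberta_base_512'  # hypothetical local directory

tokenizer = AlbertTokenizer.from_pretrained(model_dir)
model = BertForMaskedLM.from_pretrained(model_dir)

inputs = tokenizer("中国的首都是[MASK]。", return_tensors='pt')
outputs = model(**inputs)
print(outputs.logits.shape)  # [1, sequence_length, vocab_size]
```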
### BibTeX entry and citation info

```
@article{zhao2019uer,
  title={UER: An Open-Source Toolkit for Pre-training Models},
  author={Zhao, Zhe and Chen, Hui and Zhang, Jinbin and Zhao, Xin and Liu, Tao and Lu, Wei and Chen, Xi and Deng, Haotang and Ju, Qi and Du, Xiaoyong},
  journal={EMNLP-IJCNLP 2019},
  pages={241},
  year={2019}
}
```