hhou435 committed
Commit b16ef7c
1 Parent(s): c6bb938

First version of the chinese_roberta_L-8_H-512 model and tokenizer.

Files changed (7)
  1. README.md +144 -0
  2. config.json +20 -0
  3. pytorch_model.bin +3 -0
  4. special_tokens_map.json +1 -0
  5. tf_model.h5 +3 -0
  6. tokenizer_config.json +1 -0
  7. vocab.txt +0 -0
README.md ADDED
@@ -0,0 +1,144 @@
+ ---
+ language: zh
+ datasets: CLUECorpus
+ widget:
+ - text: "北京是[MASK]国的首都。"
+ ---
+
+
+ # Chinese RoBERTa Miniatures
+
+ ## Model description
+
+ This is one of the 24 Chinese RoBERTa miniatures pre-trained by [UER-py](https://www.aclweb.org/anthology/D19-3041.pdf); this repository hosts the 8-layer, 512-hidden (Medium) model.
+
+ You can download the 24 Chinese RoBERTa miniatures either from the [UER-py Github page](https://github.com/dbiir/UER-py/) or via HuggingFace from the links below (a short loading sketch follows the table):
+
+ | |H=128|H=256|H=512|H=768|
+ |---|:---:|:---:|:---:|:---:|
+ | **L=2** |[**2/128 (Tiny)**][2_128]|2/256|2/512|2/768|
+ | **L=4** |4/128|[**4/256 (Mini)**][4_256]|**4/512 (Small)**|4/768|
+ | **L=6** |6/128|6/256|6/512|6/768|
+ | **L=8** |8/128|8/256|[**8/512 (Medium)**][8_512]|8/768|
+ | **L=10** |10/128|10/256|10/512|10/768|
+ | **L=12** |12/128|12/256|12/512|**12/768 (Base)**|
+
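+ All 24 checkpoints appear to share the repository naming pattern used by the three linked entries, `uer/chinese_roberta_L-{layers}_H-{hidden}`. The sketch below builds this card's model id from that pattern; the ids for sizes without an explicit link are an assumption based on the pattern, not confirmed by this card.
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForMaskedLM
+
+ # Repository naming pattern observed in the link definitions at the bottom of this card.
+ # Assumed (not confirmed here) to hold for all 24 sizes.
+ layers, hidden = 8, 512  # the Medium model this card describes
+ model_id = f"uer/chinese_roberta_L-{layers}_H-{hidden}"
+
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForMaskedLM.from_pretrained(model_id)
+ ```
+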
+ ## How to use
+
+ You can use this model directly with a pipeline for masked language modeling:
+
+ ```python
+ >>> from transformers import pipeline
+ >>> unmasker = pipeline('fill-mask', model='uer/chinese_roberta_L-8_H-512')
+ >>> unmasker("中国的首都是[MASK]京。")
+ [
+     {'sequence': '[CLS] 中 国 的 首 都 是 北 京 。 [SEP]',
+      'score': 0.9338967204093933,
+      'token': 1266,
+      'token_str': '北'},
+     {'sequence': '[CLS] 中 国 的 首 都 是 南 京 。 [SEP]',
+      'score': 0.039428312331438065,
+      'token': 1298,
+      'token_str': '南'},
+     {'sequence': '[CLS] 中 国 的 首 都 是 东 京 。 [SEP]',
+      'score': 0.01681734062731266,
+      'token': 691,
+      'token_str': '东'},
+     {'sequence': '[CLS] 中 国 的 首 都 是 普 京 。 [SEP]',
+      'score': 0.004590896889567375,
+      'token': 3249,
+      'token_str': '普'},
+     {'sequence': '[CLS] 中 国 的 首 都 是 燕 京 。 [SEP]',
+      'score': 0.0007656012894585729,
+      'token': 4242,
+      'token_str': '燕'}
+ ]
+ ```
+
+ Here is how to use this model to get the features of a given text in PyTorch:
+
+ ```python
+ from transformers import BertTokenizer, BertModel
+ tokenizer = BertTokenizer.from_pretrained('uer/chinese_roberta_L-8_H-512')
+ model = BertModel.from_pretrained("uer/chinese_roberta_L-8_H-512")
+ text = "用你喜欢的任何文本替换我。"  # "Replace me with any text you like."
+ encoded_input = tokenizer(text, return_tensors='pt')
+ output = model(**encoded_input)
+ ```
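+
+ The token-level features live in `output.last_hidden_state`, with shape `(batch_size, sequence_length, 512)` for this model. The follow-up sketch below pools them into a single sentence vector; mean pooling is only an illustrative choice, not something this card prescribes, and it reuses `encoded_input` and `output` from the block above.
+
+ ```python
+ # output.last_hidden_state: (batch_size, sequence_length, hidden_size) == (1, seq_len, 512)
+ token_embeddings = output.last_hidden_state
+ # Mask-aware mean pooling over real (non-padding) tokens -- one illustrative way
+ # to turn per-token features into a single sentence embedding.
+ mask = encoded_input['attention_mask'].unsqueeze(-1)      # (1, seq_len, 1)
+ sentence_embedding = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
+ print(sentence_embedding.shape)                           # torch.Size([1, 512])
+ ```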
+
+ And in TensorFlow:
+
+ ```python
+ from transformers import BertTokenizer, TFBertModel
+ tokenizer = BertTokenizer.from_pretrained('uer/chinese_roberta_L-8_H-512')
+ model = TFBertModel.from_pretrained("uer/chinese_roberta_L-8_H-512")
+ text = "用你喜欢的任何文本替换我。"  # "Replace me with any text you like."
+ encoded_input = tokenizer(text, return_tensors='tf')
+ output = model(encoded_input)
+ ```
+
+ ## Training data
+
+ CLUECorpus2020 and CLUECorpusSmall are used as training data.
+
+ ## Training procedure
+
+ Models are pre-trained by [UER-py](https://github.com/dbiir/UER-py/) on [Tencent Cloud TI-ONE](https://cloud.tencent.com/product/tione/). We pre-train for 1,000,000 steps with a sequence length of 128 and then for an additional 250,000 steps with a sequence length of 512.
+
+ Stage 1:
+ ```
+ python3 preprocess.py --corpus_path corpora/cluecorpus.txt \
+                       --vocab_path models/google_zh_vocab.txt \
+                       --dataset_path cluecorpus_seq128_dataset.pt \
+                       --processes_num 32 --seq_length 128 \
+                       --dynamic_masking --target mlm
+ ```
+ ```
+ python3 pretrain.py --dataset_path cluecorpus_seq128_dataset.pt \
+                     --vocab_path models/google_zh_vocab.txt \
+                     --config_path models/bert_medium_config.json \
+                     --output_model_path models/cluecorpus_roberta_medium_seq128_model.bin \
+                     --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
+                     --total_steps 1000000 --save_checkpoint_steps 100000 --report_steps 50000 \
+                     --learning_rate 1e-4 --batch_size 64 \
+                     --tie_weights --encoder bert --target mlm
+ ```
+ Stage 2:
+ ```
+ python3 preprocess.py --corpus_path corpora/cluecorpus.txt \
+                       --vocab_path models/google_zh_vocab.txt \
+                       --dataset_path cluecorpus_seq512_dataset.pt \
+                       --processes_num 32 --seq_length 512 \
+                       --dynamic_masking --target mlm
+ ```
+ ```
+ python3 pretrain.py --dataset_path cluecorpus_seq512_dataset.pt \
+                     --pretrained_model_path models/cluecorpus_roberta_medium_seq128_model.bin-1000000 \
+                     --vocab_path models/google_zh_vocab.txt \
+                     --config_path models/bert_medium_config.json \
+                     --output_model_path models/cluecorpus_roberta_medium_seq512_model.bin \
+                     --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
+                     --total_steps 250000 --save_checkpoint_steps 50000 --report_steps 10000 \
+                     --learning_rate 5e-5 --batch_size 16 \
+                     --tie_weights --encoder bert --target mlm
+ ```
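+
+ The commands above leave the final checkpoint in UER-py's own format, while this commit adds Hugging Face-format weights (`pytorch_model.bin`, `tf_model.h5`), so a conversion step is implied. UER-py ships conversion scripts for this; the invocation below is a hedged sketch in the style of the commands above, and the exact input path and flags are assumptions rather than something stated in this card.
+
+ ```
+ python3 scripts/convert_bert_from_uer_to_huggingface.py --input_model_path models/cluecorpus_roberta_medium_seq512_model.bin-250000 \
+                                                         --output_model_path pytorch_model.bin \
+                                                         --layers_num 8 --target mlm
+ ```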
+
+ ### BibTeX entry and citation info
+
+ ```
+ @article{zhao2019uer,
+   title={UER: An Open-Source Toolkit for Pre-training Models},
+   author={Zhao, Zhe and Chen, Hui and Zhang, Jinbin and Zhao, Xin and Liu, Tao and Lu, Wei and Chen, Xi and Deng, Haotang and Ju, Qi and Du, Xiaoyong},
+   journal={EMNLP-IJCNLP 2019},
+   pages={241},
+   year={2019}
+ }
+ ```
+
+ [2_128]: https://huggingface.co/uer/chinese_roberta_L-2_H-128
+ [4_256]: https://huggingface.co/uer/chinese_roberta_L-4_H-256
+ [8_512]: https://huggingface.co/uer/chinese_roberta_L-8_H-512
config.json ADDED
@@ -0,0 +1,20 @@
+ {
+   "architectures": [
+     "BertForMaskedLM"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "gradient_checkpointing": false,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 512,
+   "initializer_range": 0.02,
+   "intermediate_size": 2048,
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 8,
+   "num_hidden_layers": 8,
+   "pad_token_id": 0,
+   "type_vocab_size": 2,
+   "vocab_size": 21128
+ }
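
The values in `config.json` are what the `L-8_H-512` name encodes. A quick way to confirm them is to load the config with transformers (a minimal sketch using the repo id from the README; not part of the original card):

```python
from transformers import BertConfig

config = BertConfig.from_pretrained("uer/chinese_roberta_L-8_H-512")
# L-8_H-512: 8 transformer layers and 512-dim hidden states (with 8 attention heads
# and a 2048-dim feed-forward layer, per the config above).
print(config.num_hidden_layers, config.hidden_size, config.num_attention_heads)  # 8 512 8
```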
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:08d4f174eca71a30050c061139d9224158cbc1f07c12a5e8d31413823d304539
+ size 146403143
special_tokens_map.json ADDED
@@ -0,0 +1 @@
+ {"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
tf_model.h5 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ac1fbad67249f2b0453069239129dc8e8e3a5ae24ecf23cab296d5e9438fb17f
+ size 191919800
tokenizer_config.json ADDED
@@ -0,0 +1 @@
+ {"do_lower_case": false, "do_basic_tokenize": true, "never_split": null, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "model_max_length": 512}
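
With `tokenize_chinese_chars` set to true and `do_lower_case` false, the tokenizer splits Chinese text into single characters, which is why the fill-mask predictions in the README come back as space-separated characters. A minimal sketch illustrating this (repo id as used in the README):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("uer/chinese_roberta_L-8_H-512")
# Each Chinese character becomes its own WordPiece token.
print(tokenizer.tokenize("北京是中国的首都。"))
# expected along the lines of: ['北', '京', '是', '中', '国', '的', '首', '都', '。']
```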
vocab.txt ADDED
The diff for this file is too large to render. See raw diff