uer committed on
Commit f618567
1 Parent(s): 7af6d04

Update README.md

Files changed (1)
  1. README.md +10 -10
README.md CHANGED
@@ -14,7 +14,7 @@ widget:
 
  This is the set of 5 Chinese word-based RoBERTa models pre-trained by [UER-py](https://arxiv.org/abs/1909.05658).
 
- [Turc et al.](https://arxiv.org/abs/1908.08962) have shown that the standard BERT recipe is effective on a wide range of model sizes. Following their paper, we released the 5 Chinese word-based RoBERTa models. In order to facilitate users to reproduce the results, we used the publicly available corpus and word segmentation tool, and provided all training details.
+ Most Chinese pre-trained weights are based on Chinese characters. Compared with character-based models, word-based models are faster (because of shorter sequence lengths) and perform better according to our experimental results. To this end, we released the 5 Chinese word-based RoBERTa models of different sizes. To help users reproduce the results, we used a publicly available corpus and word segmentation tool, and provide all training details.
 
  You can download the 5 Chinese RoBERTa miniatures either from the [UER-py Github page](https://github.com/dbiir/UER-py/), or via HuggingFace from the links below:
 
@@ -28,11 +28,11 @@ You can download the 5 Chinese RoBERTa miniatures either from the [UER-py Github
 
  ## How to use
 
- You can use this model directly with a pipeline for masked language modeling:
+ You can use this model directly with a pipeline for masked language modeling (taking word-based RoBERTa-Medium as an example):
 
  ```python
  >>> from transformers import pipeline
- >>> unmasker = pipeline('fill-mask', model='uer/roberta-base-word-chinese-cluecorpussmall')
+ >>> unmasker = pipeline('fill-mask', model='uer/roberta-medium-word-chinese-cluecorpussmall')
  >>> unmasker("[MASK]的首都是北京。")
  [
      {'sequence': '中国 的首都是北京。',
@@ -58,14 +58,12 @@ You can use this model directly with a pipeline for masked language modeling:
  ]
  ```
 
- BertTokenizer does not support sentencepiece, so we use AlbertTokenizer here.
-
  Here is how to use this model to get the features of a given text in PyTorch:
 
  ```python
  from transformers import AlbertTokenizer, BertModel
- tokenizer = AlbertTokenizer.from_pretrained('uer/roberta-base-word-chinese-cluecorpussmall')
- model = BertModel.from_pretrained("uer/roberta-base-word-chinese-cluecorpussmall")
+ tokenizer = AlbertTokenizer.from_pretrained('uer/roberta-medium-word-chinese-cluecorpussmall')
+ model = BertModel.from_pretrained("uer/roberta-medium-word-chinese-cluecorpussmall")
  text = "用你喜欢的任何文本替换我。"
  encoded_input = tokenizer(text, return_tensors='pt')
  output = model(**encoded_input)
@@ -75,13 +73,15 @@ and in TensorFlow:
 
  ```python
  from transformers import AlbertTokenizer, TFBertModel
- tokenizer = AlbertTokenizer.from_pretrained('uer/roberta-base-word-chinese-cluecorpussmall')
- model = TFBertModel.from_pretrained("uer/roberta-base-word-chinese-cluecorpussmall")
+ tokenizer = AlbertTokenizer.from_pretrained('uer/roberta-medium-word-chinese-cluecorpussmall')
+ model = TFBertModel.from_pretrained("uer/roberta-medium-word-chinese-cluecorpussmall")
  text = "用你喜欢的任何文本替换我。"
  encoded_input = tokenizer(text, return_tensors='tf')
  output = model(encoded_input)
  ```
 
+ Since BertTokenizer does not support sentencepiece, AlbertTokenizer is used here.
+
  ## Training data
 
  [CLUECorpusSmall](https://github.com/CLUEbenchmark/CLUECorpus2020/) is used as training data. Google's [sentencepiece](https://github.com/google/sentencepiece) is used for word segmentation. The sentencepiece model is trained on CLUECorpusSmall corpus:
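The segmentation call that follows this sentence in the full README falls outside the hunk shown above. As a minimal sketch of how such a model can be trained with sentencepiece's Python API, where the input file name, model prefix, and vocabulary size are illustrative assumptions rather than the project's published settings:

```python
# Illustrative sketch only: train a sentencepiece model for word segmentation.
# File names and vocab_size are assumptions, not UER-py's published settings.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input='cluecorpussmall.txt',         # raw CLUECorpusSmall text, one sentence per line
    model_prefix='cluecorpussmall_spm',  # writes cluecorpussmall_spm.model / .vocab
    vocab_size=100000,
    model_type='unigram',
)

# Load the trained model and segment a sentence into words.
sp = spm.SentencePieceProcessor(model_file='cluecorpussmall_spm.model')
print(sp.encode('北京是中国的首都。', out_type=str))
```

A sentencepiece model of this kind is what the AlbertTokenizer used in the examples above loads for word-level tokenization.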
@@ -110,7 +110,7 @@ output = model(encoded_input)
 
  ## Training procedure
 
- Models are pre-trained by [UER-py](https://github.com/dbiir/UER-py/) on [Tencent Cloud TI-ONE](https://cloud.tencent.com/product/tione/). We pre-train 1,000,000 steps with a sequence length of 128 and then pre-train 250,000 additional steps with a sequence length of 512. We use the same hyper-parameters on different model sizes.
+ Models are pre-trained by [UER-py](https://github.com/dbiir/UER-py/) on [Tencent Cloud](https://cloud.tencent.com/). We pre-train for 1,000,000 steps with a sequence length of 128 and then for an additional 250,000 steps with a sequence length of 512. We use the same hyper-parameters for different model sizes.
 
  Taking the case of word-based RoBERTa-Medium
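The published training commands sit outside the hunk. Purely as a sketch of the two-stage recipe described above, the steps can be driven through UER-py's preprocess.py and pretrain.py scripts; those script names come from the UER-py repository, but every path, flag value, and output name below is an assumption for illustration, not the published configuration.

```python
# Rough sketch only: the two-stage recipe above, driven through UER-py's
# preprocess.py and pretrain.py. The script names exist in UER-py, but all
# paths, flag values, and model file names here are illustrative assumptions.
import subprocess

def run(args):
    """Run one UER-py script and fail loudly on a non-zero exit code."""
    subprocess.run(['python3'] + args, check=True)

# Stage 1: build a seq-length-128 dataset and pre-train for 1,000,000 steps.
run(['preprocess.py',
     '--corpus_path', 'corpora/cluecorpussmall.txt',
     '--spm_model_path', 'models/cluecorpussmall_spm.model',
     '--dataset_path', 'dataset_seq128.pt',
     '--seq_length', '128'])
run(['pretrain.py',
     '--dataset_path', 'dataset_seq128.pt',
     '--output_model_path', 'models/word_roberta_medium_seq128.bin',
     '--total_steps', '1000000'])

# Stage 2: rebuild the dataset at seq length 512 and continue from the
# stage-1 checkpoint for 250,000 additional steps.
run(['preprocess.py',
     '--corpus_path', 'corpora/cluecorpussmall.txt',
     '--spm_model_path', 'models/cluecorpussmall_spm.model',
     '--dataset_path', 'dataset_seq512.pt',
     '--seq_length', '512'])
run(['pretrain.py',
     '--dataset_path', 'dataset_seq512.pt',
     '--pretrained_model_path', 'models/word_roberta_medium_seq128.bin',
     '--output_model_path', 'models/word_roberta_medium_seq512.bin',
     '--total_steps', '250000'])
```

The second stage starts from the stage-1 checkpoint, so the longer 512-token sequences are only trained for the final 250,000 steps.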