Update README.md
This is the set of 5 Chinese word-based RoBERTa models pre-trained by [UER-py](https://arxiv.org/abs/1909.05658).

Most Chinese pre-trained weights are based on Chinese characters. Compared with character-based models, word-based models are faster (because of the shorter sequence length) and perform better according to our experimental results. To this end, we release the 5 Chinese word-based RoBERTa models in different sizes. To make the results easy to reproduce, we used a publicly available corpus and word segmentation tool, and provide all of the training details.
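
As a quick illustration of the sequence-length difference, you can compare the word-based tokenizer from this release with a character-based one. The `bert-base-chinese` checkpoint below is only an assumed character-based baseline for comparison, not part of this release:

```python
from transformers import AlbertTokenizer, BertTokenizer

# Word-based tokenizer from this release vs. an assumed character-based baseline.
word_tokenizer = AlbertTokenizer.from_pretrained('uer/roberta-medium-word-chinese-cluecorpussmall')
char_tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')

text = "中国的首都是北京。"  # "The capital of China is Beijing."
print(len(word_tokenizer.tokenize(text)))  # roughly one token per word
print(len(char_tokenizer.tokenize(text)))  # one token per Chinese character
```
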
You can download the 5 Chinese RoBERTa miniatures either from the [UER-py GitHub page](https://github.com/dbiir/UER-py/), or via HuggingFace from the links below:

## How to use

You can use this model directly with a pipeline for masked language modeling (taking word-based RoBERTa-Medium as an example):

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='uer/roberta-medium-word-chinese-cluecorpussmall')
>>> unmasker("[MASK]的首都是北京。")
[
    {'sequence': '中国 的首都是北京。',
     ...},
    ...
]
```
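
Note the space after '中国' in the predicted sequence: the tokenizer is word-based, so decoded text keeps a space between words.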
Here is how to use this model to get the features of a given text in PyTorch:
```python
from transformers import AlbertTokenizer, BertModel
tokenizer = AlbertTokenizer.from_pretrained('uer/roberta-medium-word-chinese-cluecorpussmall')
model = BertModel.from_pretrained("uer/roberta-medium-word-chinese-cluecorpussmall")
text = "用你喜欢的任何文本替换我。"  # "Replace me with any text you like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```

and in TensorFlow:

```python
from transformers import AlbertTokenizer, TFBertModel
tokenizer = AlbertTokenizer.from_pretrained('uer/roberta-medium-word-chinese-cluecorpussmall')
model = TFBertModel.from_pretrained("uer/roberta-medium-word-chinese-cluecorpussmall")
text = "用你喜欢的任何文本替换我。"  # "Replace me with any text you like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```

Since BertTokenizer does not support sentencepiece, AlbertTokenizer is used here.

## Training data
[CLUECorpusSmall](https://github.com/CLUEbenchmark/CLUECorpus2020/) is used as the training data. Google's [sentencepiece](https://github.com/google/sentencepiece) is used for word segmentation. The sentencepiece model is trained on the CLUECorpusSmall corpus:
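
As a rough sketch, a sentencepiece segmentation model can be trained with the Python API along the following lines; the input path, model prefix, and vocabulary size here are assumptions, not the released configuration:

```python
import sentencepiece as spm

# Minimal sketch: train a word segmentation model on the raw corpus.
# The path, prefix, and vocab_size are assumptions, not the released setup.
spm.SentencePieceTrainer.train(
    input='cluecorpussmall.txt',         # assumed path to the raw corpus
    model_prefix='cluecorpussmall_spm',  # writes cluecorpussmall_spm.model / .vocab
    vocab_size=100000,                   # assumed vocabulary size
)
```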
## Training procedure

Models are pre-trained by [UER-py](https://github.com/dbiir/UER-py/) on [Tencent Cloud](https://cloud.tencent.com/). We pre-train for 1,000,000 steps with a sequence length of 128 and then for 250,000 additional steps with a sequence length of 512. We use the same hyper-parameters across the different model sizes.

Taking the case of word-based RoBERTa-Medium:
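
A sketch of what the two-stage recipe looks like with UER-py's preprocess.py and pretrain.py; the paths, flag names, and batch size below follow common UER-py usage and are assumptions, not the exact released commands:

```
# Stage 1 (sketch, flags are assumptions): 1,000,000 steps at sequence length 128.
python3 preprocess.py --corpus_path corpora/cluecorpussmall.txt \
                      --spm_model_path models/cluecorpussmall_spm.model \
                      --dataset_path cluecorpussmall_seq128_dataset.pt \
                      --processes_num 32 --seq_length 128 --target mlm

python3 pretrain.py --dataset_path cluecorpussmall_seq128_dataset.pt \
                    --spm_model_path models/cluecorpussmall_spm.model \
                    --config_path models/bert/medium_config.json \
                    --output_model_path models/cluecorpussmall_word_roberta_medium_seq128.bin \
                    --total_steps 1000000 --save_checkpoint_steps 100000 \
                    --learning_rate 1e-4 --batch_size 64 --target mlm

# Stage 2 (sketch): repeat preprocessing with --seq_length 512, then continue
# pre-training for 250,000 steps from the stage-1 checkpoint.
```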