---
language: Chinese
datasets: CLUECorpus
---

# Chinese RoBERTa Miniatures

## Model description

This is the set of 24 Chinese RoBERTa models pre-trained with [UER-py](https://www.aclweb.org/anthology/D19-3041.pdf).

You can download the 24 Chinese RoBERTa miniatures either from the [UER-py Github page](https://github.com/dbiir/UER-py/), or via HuggingFace from the links below:

|   |H=128|H=256|H=512|H=768|
|---|:---:|:---:|:---:|:---:|
| **L=2**  |[**2/128 (Tiny)**][2_128]|[2/256]|[2/512]|[2/768]|
| **L=4**  |[4/128]|[**4/256 (Mini)**]|[**4/512 (Small)**]|[4/768]|
| **L=6**  |[6/128]|[6/256]|[6/512]|[6/768]|
| **L=8**  |[8/128]|[8/256]|[**8/512 (Medium)**]|[8/768]|
| **L=10** |[10/128]|[10/256]|[10/512]|[10/768]|
| **L=12** |[12/128]|[12/256]|[12/512]|[**12/768 (Base)**]|
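
The checkpoint linked from the table follows the naming pattern `chinese_roberta_L-{layers}_H-{hidden}`. As a small sketch (it assumes the other entries in the table share the same naming scheme as the Tiny checkpoint linked above, which this card does not state explicitly), you could load any of the miniatures by filling in the layer count and hidden size:

```python
from transformers import BertTokenizer, BertModel

# Hypothetical helper for illustration: builds a model id from the table's
# L (number of layers) and H (hidden size). The "uer/" prefix matches the
# Tiny checkpoint linked above; that the remaining 23 miniatures follow the
# same scheme is an assumption.
def load_miniature(layers: int, hidden: int):
    model_id = f"uer/chinese_roberta_L-{layers}_H-{hidden}"
    tokenizer = BertTokenizer.from_pretrained(model_id)
    model = BertModel.from_pretrained(model_id)
    return tokenizer, model

# Example: the 2-layer, 128-hidden "Tiny" model.
tokenizer, model = load_miniature(2, 128)
```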

## How to use

You can use this model directly with a pipeline for masked language modeling:

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='hhou435/chinese_roberta_L-2_H-128')
>>> unmasker("中国的首都是[MASK]京。")
[
    {'sequence': '[CLS] 中 国 的 首 都 是 北 京 。 [SEP]', 
     'score': 0.9427323937416077, 
     'token': 1266, 
     'token_str': '北'}, 
    {'sequence': '[CLS] 中 国 的 首 都 是 南 京 。 [SEP]',
     'score': 0.029202355071902275, 
     'token': 1298,
     'token_str': '南'}, 
    {'sequence': '[CLS] 中 国 的 首 都 是 东 京 。 [SEP]',
     'score': 0.00977553054690361,
     'token': 691, 
     'token_str': '东'}, 
    {'sequence': '[CLS] 中 国 的 首 都 是 葡 京 。 [SEP]',
     'score': 0.00489805219694972,
     'token': 5868, 
     'token_str': '葡'},
    {'sequence': '[CLS] 中 国 的 首 都 是 新 京 。 [SEP]',
     'score': 0.0027360401581972837, 
     'token': 3173, 
     'token_str': '新'}
]
```

Here is how to use this model to get the features of a given text in PyTorch:

```python
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('hhou435/chinese_roberta_L-2_H-128')
model = BertModel.from_pretrained("hhou435/chinese_roberta_L-2_H-128")
text = "用你喜欢的任何文本替换我。"  # "Replace me with any text you like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```
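
The `output` above is the usual `BertModel` output, and `output.last_hidden_state` holds the per-token features. As a small follow-up sketch continuing the PyTorch snippet (the mean-pooling step is an illustrative choice, not something this card prescribes), you can turn those features into a single sentence vector:

```python
# Per-token hidden states: shape (batch_size, sequence_length, hidden_size);
# hidden_size is 128 for this L-2/H-128 model.
token_features = output.last_hidden_state

# One simple sentence-level vector: mean-pool the token states, masking out padding.
mask = encoded_input['attention_mask'].unsqueeze(-1).float()
sentence_vector = (token_features * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_vector.shape)  # torch.Size([1, 128])
```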

and in TensorFlow:

```python
from transformers import BertTokenizer, TFBertModel
tokenizer = BertTokenizer.from_pretrained('hhou435/chinese_roberta_L-2_H-128')
model = TFBertModel.from_pretrained("hhou435/chinese_roberta_L-2_H-128")
text = "用你喜欢的任何文本替换我。"  # "Replace me with any text you like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```



## Training data

CLUECorpus2020 and CLUECorpusSmall are used as the training corpora.

## Training procedure

Training details can be found in the [UER-py](https://github.com/dbiir/UER-py/) repository.
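
The models are trained with the standard masked-language-modeling objective. As a rough illustration only (it uses Hugging Face Transformers rather than UER-py, and the corpus file name and hyperparameters below are placeholders, not the original recipe), continued MLM training of one of the miniatures on your own Chinese text could look like this:

```python
from transformers import (BertTokenizer, BertForMaskedLM,
                          DataCollatorForLanguageModeling,
                          LineByLineTextDataset, Trainer, TrainingArguments)

model_id = "hhou435/chinese_roberta_L-2_H-128"
tokenizer = BertTokenizer.from_pretrained(model_id)
model = BertForMaskedLM.from_pretrained(model_id)

# "my_corpus.txt" is a placeholder: plain UTF-8 Chinese text, one sentence per line.
dataset = LineByLineTextDataset(tokenizer=tokenizer,
                                file_path="my_corpus.txt",
                                block_size=128)

# Dynamic token masking at the usual 15% probability.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm=True,
                                           mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mlm_checkpoints",
                           num_train_epochs=1,
                           per_device_train_batch_size=32),
    data_collator=collator,
    train_dataset=dataset,
)
trainer.train()
```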

### BibTeX entry and citation info

```
@article{zhao2019uer,
  title={UER: An Open-Source Toolkit for Pre-training Models},
  author={Zhao, Zhe and Chen, Hui and Zhang, Jinbin and Zhao, Xin and Liu, Tao and Lu, Wei and Chen, Xi and Deng, Haotang and Ju, Qi and Du, Xiaoyong},
  journal={EMNLP-IJCNLP 2019},
  pages={241},
  year={2019}
}
```

[2_128]: https://huggingface.co/uer/chinese_roberta_L-2_H-128