---
language: zh
datasets: CLUECorpusSmall
widget:
- text: 中国的首都是[MASK]京
license: bigscience-openrail-m
metrics:
- code_eval
pipeline_tag: fill-mask
---


# Chinese ALBERT

## Model description

This is the set of Chinese ALBERT models pre-trained by UER-py. You can download the models either from the [UER-py GitHub page](https://github.com/dbiir/UER-py/), or via Hugging Face from the links below:

|          |           Link           |
| -------- | :-----------------------: |
| **ALBERT-Base**  | [**L=12/H=768 (Base)**][base] |
| **ALBERT-Large**  | [**L=24/H=1024 (Large)**][large] |
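
The checkpoints can also be fetched as raw files with the `huggingface_hub` client instead of going through `transformers`; a minimal sketch (only the repository names above come from this card, the rest is generic Hub usage):

```python
from huggingface_hub import snapshot_download

# Download every file of the chosen checkpoint into the local Hugging Face cache
# and return the path of the downloaded snapshot.
local_dir = snapshot_download(repo_id="uer/albert-base-chinese-cluecorpussmall")
# local_dir = snapshot_download(repo_id="uer/albert-large-chinese-cluecorpussmall")
print(local_dir)
```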

## How to use

You can use this model directly with a pipeline for masked language modeling (fill-mask):

```python
>>> from transformers import BertTokenizer, AlbertForMaskedLM, FillMaskPipeline
>>> tokenizer = BertTokenizer.from_pretrained("uer/albert-base-chinese-cluecorpussmall")
>>> model = AlbertForMaskedLM.from_pretrained("uer/albert-base-chinese-cluecorpussmall")
>>> unmasker = FillMaskPipeline(model, tokenizer)   
>>> unmasker("中国的首都是[MASK]京。")
    [
        {'sequence': '中 国 的 首 都 是 北 京 。',
         'score': 0.8528032898902893, 
         'token': 1266, 
         'token_str': '北'}, 
        {'sequence': '中 国 的 首 都 是 南 京 。',
         'score': 0.07667620480060577, 
         'token': 1298, 
         'token_str': '南'}, 
        {'sequence': '中 国 的 首 都 是 东 京 。', 
         'score': 0.020440367981791496, 
         'token': 691, 
         'token_str': '东'},
        {'sequence': '中 国 的 首 都 是 维 京 。', 
         'score': 0.010197942145168781,
         'token': 5335, 
         'token_str': '维'}, 
        {'sequence': '中 国 的 首 都 是 汴 京 。', 
         'score': 0.0075391442514956, 
         'token': 3745, 
         'token_str': '汴'}
    ]

```
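
Equivalently, the generic `pipeline` helper builds the same fill-mask pipeline from the model and tokenizer objects; a small sketch of that variant:

```python
from transformers import pipeline, BertTokenizer, AlbertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("uer/albert-base-chinese-cluecorpussmall")
model = AlbertForMaskedLM.from_pretrained("uer/albert-base-chinese-cluecorpussmall")

# pipeline("fill-mask", ...) wraps the model and tokenizer just like FillMaskPipeline above.
unmasker = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(unmasker("中国的首都是[MASK]京。", top_k=3))
```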

Here is how to use this model to get the features of a given text in PyTorch:

```python
from transformers import BertTokenizer, AlbertModel
tokenizer = BertTokenizer.from_pretrained("uer/albert-base-chinese-cluecorpussmall")
model = AlbertModel.from_pretrained("uer/albert-base-chinese-cluecorpussmall")
text = "用你喜欢的任何文本替换我。"
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```

and in TensorFlow:

```python
from transformers import BertTokenizer, TFAlbertModel
tokenizer = BertTokenizer.from_pretrained("uer/albert-base-chinese-cluecorpussmall")
model = TFAlbertModel.from_pretrained("uer/albert-base-chinese-cluecorpussmall")
text = "用你喜欢的任何文本替换我。"
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```
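
In both frameworks the returned `output` carries the token-level features in `last_hidden_state`; since ALBERT-Base uses H=768 (see the table above), a quick shape check in PyTorch looks like this:

```python
# Continuing the PyTorch example above: inspect the extracted features.
print(output.last_hidden_state.shape)  # torch.Size([1, sequence_length, 768]) for ALBERT-Base
print(output.pooler_output.shape)      # torch.Size([1, 768])
```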

## Training data

[CLUECorpusSmall](https://github.com/CLUEbenchmark/CLUECorpus2020/) is used as training data. 

## Training procedure

The model is pre-trained by [UER-py](https://github.com/dbiir/UER-py/) on [Tencent Cloud](https://cloud.tencent.com/). We pre-train 1,000,000 steps with a sequence length of 128 and then pre-train 250,000 additional steps with a sequence length of 512. We use the same hyper-parameters on different model sizes.

Taking the case of ALBERT-Base as an example:

Stage 1:

```
python3 preprocess.py --corpus_path corpora/cluecorpussmall.txt \
                      --vocab_path models/google_zh_vocab.txt \
                      --dataset_path cluecorpussmall_albert_seq128_dataset.pt \
                      --seq_length 128 --processes_num 32 --data_processor albert 
```

```
python3 pretrain.py --dataset_path cluecorpussmall_albert_seq128_dataset.pt \
                    --vocab_path models/google_zh_vocab.txt \
                    --config_path models/albert/base_config.json \
                    --output_model_path models/cluecorpussmall_albert_base_seq128_model.bin \
                    --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
                    --total_steps 1000000 --save_checkpoint_steps 100000 --report_steps 50000 \
                    --learning_rate 1e-4 --batch_size 64
```

Stage 2:

```
python3 preprocess.py --corpus_path corpora/cluecorpussmall.txt \
                      --vocab_path models/google_zh_vocab.txt \
                      --dataset_path cluecorpussmall_albert_seq512_dataset.pt \
                      --seq_length 512 --processes_num 32 --data_processor albert
```

```
python3 pretrain.py --dataset_path cluecorpussmall_albert_seq512_dataset.pt \
                    --vocab_path models/google_zh_vocab.txt \
                    --pretrained_model_path models/cluecorpussmall_albert_base_seq128_model.bin-1000000 \
                    --config_path models/albert/base_config.json \
                    --output_model_path models/cluecorpussmall_albert_base_seq512_model.bin \
                    --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
                    --total_steps 250000 --save_checkpoint_steps 100000 --report_steps 50000 \
                    --learning_rate 1e-4 --batch_size 64
```

Finally, we convert the pre-trained model into Hugging Face's format:

```
python3 scripts/convert_albert_from_uer_to_huggingface.py --input_model_path cluecorpussmall_albert_base_seq512_model.bin-250000 \
                                                          --output_model_path pytorch_model.bin
```
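
As a quick sanity check, the converted weights can be loaded back with `transformers`; a sketch, assuming an ALBERT-Base `config.json` sits next to the converted `pytorch_model.bin`:

```python
import torch
from transformers import AlbertConfig, AlbertForMaskedLM

# Assumed local layout: config.json (ALBERT-Base) next to the converted pytorch_model.bin.
config = AlbertConfig.from_json_file("config.json")
model = AlbertForMaskedLM(config)

state_dict = torch.load("pytorch_model.bin", map_location="cpu")
# strict=False tolerates heads that are not present in the converted checkpoint.
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print("missing:", missing, "unexpected:", unexpected)
```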

### BibTeX entry and citation info

```
@article{lan2019albert,
  title={ALBERT: A Lite BERT for Self-supervised Learning of Language Representations},
  author={Lan, Zhenzhong and Chen, Mingda and Goodman, Sebastian and Gimpel, Kevin and Sharma, Piyush and Soricut, Radu},
  journal={arXiv preprint arXiv:1909.11942},
  year={2019}
}

@article{zhao2019uer,
  title={UER: An Open-Source Toolkit for Pre-training Models},
  author={Zhao, Zhe and Chen, Hui and Zhang, Jinbin and Zhao, Xin and Liu, Tao and Lu, Wei and Chen, Xi and Deng, Haotang and Ju, Qi and Du, Xiaoyong},
  journal={EMNLP-IJCNLP 2019},
  pages={241},
  year={2019}
}
```

[base]:https://huggingface.co/uer/albert-base-chinese-cluecorpussmall
[large]:https://huggingface.co/uer/albert-large-chinese-cluecorpussmall