hhou435 committed
Commit 14a0c27
1 Parent(s): b6aac78
Files changed (7)
  1. README.md +0 -158
  2. config.json +0 -29
  3. pytorch_model.bin +0 -3
  4. special_tokens_map.json +0 -1
  5. tf_model.h5 +0 -3
  6. tokenizer_config.json +0 -1
  7. vocab.txt +0 -0
README.md DELETED
@@ -1,158 +0,0 @@
- ---
- language: Chinese
- datasets: CLUECorpusSmall
- widget:
- - text: "中国的首都是[MASK]京"
-
-
- ---
-
-
- # Chinese ALBERT
-
- ## Model description
-
- This is the set of Chinese ALBERT models pre-trained by UER-py. You can download the model either from the [UER-py GitHub page](https://github.com/dbiir/UER-py/) or via Hugging Face from the links below:
-
- | | Link |
- | -------- | :-----------------------: |
- | **ALBERT-Base** | [**L=12/H=768 (Base)**][base] |
- | **ALBERT-Large** | [**L=24/H=1024 (Large)**][large] |
-
- ## How to use
-
- You can use the model directly with a pipeline for masked language modeling:
-
- ```python
- >>> from transformers import BertTokenizer, AlbertForMaskedLM, FillMaskPipeline
- >>> tokenizer = BertTokenizer.from_pretrained("uer/albert-base-chinese-cluecorpussmall")
- >>> model = AlbertForMaskedLM.from_pretrained("uer/albert-base-chinese-cluecorpussmall")
- >>> unmasker = FillMaskPipeline(model, tokenizer)
- >>> unmasker("中国的首都是[MASK]京。")
- [
- {'sequence': '中 国 的 首 都 是 北 京 。',
- 'score': 0.8528032898902893,
- 'token': 1266,
- 'token_str': '北'},
- {'sequence': '中 国 的 首 都 是 南 京 。',
- 'score': 0.07667620480060577,
- 'token': 1298,
- 'token_str': '南'},
- {'sequence': '中 国 的 首 都 是 东 京 。',
- 'score': 0.020440367981791496,
- 'token': 691,
- 'token_str': '东'},
- {'sequence': '中 国 的 首 都 是 维 京 。',
- 'score': 0.010197942145168781,
- 'token': 5335,
- 'token_str': '维'},
- {'sequence': '中 国 的 首 都 是 汴 京 。',
- 'score': 0.0075391442514956,
- 'token': 3745,
- 'token_str': '汴'}
- ]
-
- ```
-
- Here is how to use this model to get the features of a given text in PyTorch:
-
- ```python
- from transformers import BertTokenizer, AlbertModel
- tokenizer = BertTokenizer.from_pretrained("uer/albert-base-chinese-cluecorpussmall")
- model = AlbertModel.from_pretrained("uer/albert-base-chinese-cluecorpussmall")
- text = "用你喜欢的任何文本替换我。"
- encoded_input = tokenizer(text, return_tensors='pt')
- output = model(**encoded_input)
- ```
-
- and in TensorFlow:
-
- ```python
- from transformers import BertTokenizer, TFAlbertModel
- tokenizer = BertTokenizer.from_pretrained("uer/albert-base-chinese-cluecorpussmall")
- model = TFAlbertModel.from_pretrained("uer/albert-base-chinese-cluecorpussmall")
- text = "用你喜欢的任何文本替换我。"
- encoded_input = tokenizer(text, return_tensors='tf')
- output = model(encoded_input)
- ```
-
- ## Training data
-
- [CLUECorpusSmall](https://github.com/CLUEbenchmark/CLUECorpus2020/) is used as training data.
-
- ## Training procedure
-
- The model is pre-trained by [UER-py](https://github.com/dbiir/UER-py/) on [Tencent Cloud](https://cloud.tencent.com/). We pre-train for 1,000,000 steps with a sequence length of 128 and then for an additional 250,000 steps with a sequence length of 512. The same hyper-parameters are used across the different model sizes.
-
- Taking ALBERT-Base as an example:
-
- Stage 1:
-
- ```
- python3 preprocess.py --corpus_path corpora/cluecorpussmall.txt \
- --vocab_path models/google_zh_vocab.txt \
- --dataset_path cluecorpussmall_albert_seq128_dataset.pt \
- --seq_length 128 --processes_num 32 --target albert
- ```
-
- ```
- python3 pretrain.py --dataset_path cluecorpussmall_albert_seq128_dataset.pt \
- --vocab_path models/google_zh_vocab.txt \
- --config_path models/albert/base_config.json \
- --output_model_path models/cluecorpussmall_albert_base_seq128_model.bin \
- --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
- --total_steps 1000000 --save_checkpoint_steps 100000 --report_steps 50000 \
- --learning_rate 1e-4 --batch_size 64 \
- --factorized_embedding_parameterization --parameter_sharing \
- --embedding word_pos_seg --encoder transformer --mask fully_visible --target albert
- ```
-
- Stage 2:
-
- ```
- python3 preprocess.py --corpus_path corpora/cluecorpussmall.txt \
- --vocab_path models/google_zh_vocab.txt \
- --dataset_path cluecorpussmall_albert_seq512_dataset.pt \
- --seq_length 512 --processes_num 32 --target albert
- ```
-
- ```
- python3 pretrain.py --dataset_path cluecorpussmall_albert_seq512_dataset.pt \
- --pretrained_model_path models/cluecorpussmall_albert_base_seq128_model.bin-1000000 \
- --vocab_path models/google_zh_vocab.txt \
- --config_path models/albert/base_config.json \
- --output_model_path models/cluecorpussmall_albert_base_seq512_model.bin \
- --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
- --total_steps 250000 --save_checkpoint_steps 100000 --report_steps 50000 \
- --learning_rate 1e-4 --batch_size 64 \
- --factorized_embedding_parameterization --parameter_sharing \
- --embedding word_pos_seg --encoder transformer --mask fully_visible --target albert
- ```
-
- Finally, we convert the pre-trained model into Hugging Face's format:
-
- ```
- python3 scripts/convert_albert_from_uer_to_huggingface.py --input_model_path cluecorpussmall_albert_base_seq512_model.bin-250000 \
- --output_model_path pytorch_model.bin
- ```
-
- ### BibTeX entry and citation info
-
- ```
- @article{lan2019albert,
- title={ALBERT: A Lite BERT for Self-supervised Learning of Language Representations},
- author={Lan, Zhenzhong and Chen, Mingda and Goodman, Sebastian and Gimpel, Kevin and Sharma, Piyush and Soricut, Radu},
- journal={arXiv preprint arXiv:1909.11942},
- year={2019}
- }
-
- @article{zhao2019uer,
- title={UER: An Open-Source Toolkit for Pre-training Models},
- author={Zhao, Zhe and Chen, Hui and Zhang, Jinbin and Zhao, Xin and Liu, Tao and Lu, Wei and Chen, Xi and Deng, Haotang and Ju, Qi and Du, Xiaoyong},
- journal={EMNLP-IJCNLP 2019},
- pages={241},
- year={2019}
- }
- ```
- [base]:https://huggingface.co/uer/albert-base-chinese-cluecorpussmall
- [large]:https://huggingface.co/uer/albert-large-chinese-cluecorpussmall
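The fill-mask example in the deleted README builds a FillMaskPipeline by hand from BertTokenizer and AlbertForMaskedLM. A minimal equivalent sketch using the high-level `pipeline` helper, assuming the checkpoint remains resolvable under the same Hub id, would be:

```python
from transformers import pipeline

# One-call equivalent of the README's manual FillMaskPipeline setup.
# The tokenizer resolves to BertTokenizer via the "tokenizer_class"
# field in the repository's config.json.
unmasker = pipeline("fill-mask", model="uer/albert-base-chinese-cluecorpussmall")

# Per the README output above, the top prediction restores 北 (Beijing).
print(unmasker("中国的首都是[MASK]京。")[0]["sequence"])
```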
 
config.json DELETED
@@ -1,29 +0,0 @@
- {
- "_name_or_path": "albert",
- "architectures": [
- "AlbertForMaskedLM"
- ],
- "attention_probs_dropout_prob": 0,
- "bos_token_id": 2,
- "classifier_dropout_prob": 0.1,
- "embedding_size": 128,
- "eos_token_id": 3,
- "hidden_act": "relu",
- "hidden_dropout_prob": 0,
- "hidden_size": 768,
- "initializer_range": 0.02,
- "inner_group_num": 1,
- "intermediate_size": 3072,
- "layer_norm_eps": 1e-12,
- "max_position_embeddings": 512,
- "model_type": "albert",
- "num_attention_heads": 12,
- "num_hidden_groups": 1,
- "num_hidden_layers": 12,
- "pad_token_id": 0,
- "position_embedding_type": "absolute",
- "tokenizer_class": "BertTokenizer",
- "transformers_version": "4.6.0",
- "type_vocab_size": 2,
- "vocab_size": 21128
- }
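The deleted config.json pins down the ALBERT-Base architecture: 12 layers sharing one parameter group, hidden size 768, factorized embeddings of size 128, and a 21,128-token BERT-style vocabulary. As an illustrative sketch (not part of the original card), the same architecture can be re-instantiated with random weights directly from these hyperparameters:

```python
from transformers import AlbertConfig, AlbertForMaskedLM

# Values copied from the deleted config.json; unspecified fields keep
# AlbertConfig defaults, which match the remaining entries above.
config = AlbertConfig(
    vocab_size=21128,
    embedding_size=128,
    hidden_size=768,
    num_hidden_layers=12,
    num_hidden_groups=1,
    num_attention_heads=12,
    intermediate_size=3072,
    inner_group_num=1,
    hidden_act="relu",
    hidden_dropout_prob=0.0,
    attention_probs_dropout_prob=0.0,
    max_position_embeddings=512,
    type_vocab_size=2,
    pad_token_id=0,
    bos_token_id=2,
    eos_token_id=3,
)
model = AlbertForMaskedLM(config)  # same shape as the released checkpoint, untrained weights

# Roughly 10M parameters, consistent with the ~40 MB fp32 pytorch_model.bin removed in this commit.
print(sum(p.numel() for p in model.parameters()))
```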
 
pytorch_model.bin DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:4e90c5f6b64fda667d9a10a8065878a4790515a0df171e361787354b25526141
- size 40325143
 
special_tokens_map.json DELETED
@@ -1 +0,0 @@
- {"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
 
 
tf_model.h5 DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:00b2f0b8fa2b513f5dde4fe14f25978c459e1381cb7ff0fd259fc98c4a6b4d61
- size 51528256
 
tokenizer_config.json DELETED
@@ -1 +0,0 @@
- {"do_lower_case": false, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "model_max_length": 512}
 
 
vocab.txt DELETED
The diff for this file is too large to render.