Update

Files changed (7)

README.md +0 -158
config.json +0 -29
pytorch_model.bin +0 -3
special_tokens_map.json +0 -1
tf_model.h5 +0 -3
tokenizer_config.json +0 -1
vocab.txt +0 -0

README.md DELETED

@@ -1,158 +0,0 @@
----
-language: Chinese
-datasets: CLUECorpusSmall
-widget:
-- text: "中国的首都是[MASK]京"
----
-# Chinese ALBERT
-## Model description
-This is the set of Chinese ALBERT models pre-trained by UER-py. You can download the models either from the [UER-py GitHub page](https://github.com/dbiir/UER-py/), or via HuggingFace from the links below:
-|          |           Link           |
-| -------- | :-----------------------: |
-| **ALBERT-Base**  | [**L=12/H=768 (Base)**][base] |
-| **ALBERT-Large**  | [**L=24/H=1024 (Large)**][large] |
-## How to use
-You can use the model directly with a pipeline for masked language modeling:
-```python
->>> from transformers import BertTokenizer, AlbertForMaskedLM, FillMaskPipeline
->>> tokenizer = BertTokenizer.from_pretrained("uer/albert-base-chinese-cluecorpussmall")
->>> model = AlbertForMaskedLM.from_pretrained("uer/albert-base-chinese-cluecorpussmall")
->>> unmasker = FillMaskPipeline(model, tokenizer)
->>> unmasker("中国的首都是[MASK]京。")
-    [
-        {'sequence': '中 国 的 首 都 是 北 京 。',
-         'score': 0.8528032898902893,
-         'token': 1266,
-         'token_str': '北'},
-        {'sequence': '中 国 的 首 都 是 南 京 。',
-         'score': 0.07667620480060577,
-         'token': 1298,
-         'token_str': '南'},
-        {'sequence': '中 国 的 首 都 是 东 京 。',
-         'score': 0.020440367981791496,
-         'token': 691,
-         'token_str': '东'},
-        {'sequence': '中 国 的 首 都 是 维 京 。',
-         'score': 0.010197942145168781,
-         'token': 5335,
-         'token_str': '维'},
-        {'sequence': '中 国 的 首 都 是 汴 京 。',
-         'score': 0.0075391442514956,
-         'token': 3745,
-         'token_str': '汴'}
-    ]
-```
-Here is how to use this model to get the features of a given text in PyTorch:
-```python
-from transformers import BertTokenizer, AlbertModel
-tokenizer = BertTokenizer.from_pretrained("uer/albert-base-chinese-cluecorpussmall")
-model = AlbertModel.from_pretrained("uer/albert-base-chinese-cluecorpussmall")
-text = "用你喜欢的任何文本替换我。"
-encoded_input = tokenizer(text, return_tensors='pt')
-output = model(**encoded_input)
-```
-and in TensorFlow:
-```python
-from transformers import BertTokenizer, TFAlbertModel
-tokenizer = BertTokenizer.from_pretrained("uer/albert-base-chinese-cluecorpussmall")
-model = TFAlbertModel.from_pretrained("uer/albert-base-chinese-cluecorpussmall")
-text = "用你喜欢的任何文本替换我。"
-encoded_input = tokenizer(text, return_tensors='tf')
-output = model(encoded_input)
-```
-## Training data
-[CLUECorpusSmall](https://github.com/CLUEbenchmark/CLUECorpus2020/) is used as training data.
-## Training procedure
-The model is pre-trained by [UER-py](https://github.com/dbiir/UER-py/) on [Tencent Cloud](https://cloud.tencent.com/). We pre-train for 1,000,000 steps with a sequence length of 128 and then for 250,000 additional steps with a sequence length of 512. We use the same hyper-parameters for all model sizes.
-Taking ALBERT-Base as an example:
-Stage1:
-```
-python3 preprocess.py --corpus_path corpora/cluecorpussmall.txt \
-                      --vocab_path models/google_zh_vocab.txt \
-                      --dataset_path cluecorpussmall_albert_seq128_dataset.pt \
-                      --seq_length 128 --processes_num 32 --target albert
-```
-```
-python3 pretrain.py --dataset_path cluecorpussmall_albert_seq128_dataset.pt \
-                    --vocab_path models/google_zh_vocab.txt \
-                    --config_path models/albert/base_config.json \
-                    --output_model_path models/cluecorpussmall_albert_base_seq128_model.bin \
-                    --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
-                    --total_steps 1000000 --save_checkpoint_steps 100000 --report_steps 50000 \
-                    --learning_rate 1e-4 --batch_size 64 \
-                    --factorized_embedding_parameterization --parameter_sharing \
-                    --embedding word_pos_seg --encoder transformer --mask fully_visible --target albert
-```
-Stage2:
-```
-python3 preprocess.py --corpus_path corpora/cluecorpussmall.txt \
-                      --vocab_path models/google_zh_vocab.txt \
-                      --dataset_path cluecorpussmall_albert_seq512_dataset.pt \
-                      --seq_length 512 --processes_num 32 --target albert
-```
-```
-python3 pretrain.py --dataset_path cluecorpussmall_albert_seq512_dataset.pt \
-                    --pretrained_model_path models/cluecorpussmall_albert_base_seq128_model.bin-1000000 \
-                    --vocab_path models/google_zh_vocab.txt \
-                    --config_path models/albert/base_config.json \
-                    --output_model_path models/cluecorpussmall_albert_base_seq512_model.bin \
-                    --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
-                    --total_steps 250000 --save_checkpoint_steps 50000 --report_steps 50000 \
-                    --learning_rate 1e-4 --batch_size 64 \
-                    --factorized_embedding_parameterization --parameter_sharing \
-                    --embedding word_pos_seg --encoder transformer --mask fully_visible --target albert
-```
-Finally, we convert the pre-trained model into Hugging Face's format:
-```
-python3 scripts/convert_albert_from_uer_to_huggingface.py --input_model_path models/cluecorpussmall_albert_base_seq512_model.bin-250000 \
-                                                          --output_model_path pytorch_model.bin
-```
-### BibTeX entry and citation info
-```
-@article{lan2019albert,
-  title={Albert: A lite bert for self-supervised learning of language representations},
-  author={Lan, Zhenzhong and Chen, Mingda and Goodman, Sebastian and Gimpel, Kevin and Sharma, Piyush and Soricut, Radu},
-  journal={arXiv preprint arXiv:1909.11942},
-  year={2019}
-}
-@article{zhao2019uer,
-  title={UER: An Open-Source Toolkit for Pre-training Models},
-  author={Zhao, Zhe and Chen, Hui and Zhang, Jinbin and Zhao, Xin and Liu, Tao and Lu, Wei and Chen, Xi and Deng, Haotang and Ju, Qi and Du, Xiaoyong},
-  journal={EMNLP-IJCNLP 2019},
-  pages={241},
-  year={2019}
-}
-```
-[base]:https://huggingface.co/uer/albert-base-chinese-cluecorpussmall
-[large]:https://huggingface.co/uer/albert-large-chinese-cluecorpussmall

config.json DELETED

@@ -1,29 +0,0 @@
-{
-  "_name_or_path": "albert",
-  "architectures": [
-    "AlbertForMaskedLM"
-  ],
-  "attention_probs_dropout_prob": 0,
-  "bos_token_id": 2,
-  "classifier_dropout_prob": 0.1,
-  "embedding_size": 128,
-  "eos_token_id": 3,
-  "hidden_act": "relu",
-  "hidden_dropout_prob": 0,
-  "hidden_size": 768,
-  "initializer_range": 0.02,
-  "inner_group_num": 1,
-  "intermediate_size": 3072,
-  "layer_norm_eps": 1e-12,
-  "max_position_embeddings": 512,
-  "model_type": "albert",
-  "num_attention_heads": 12,
-  "num_hidden_groups": 1,
-  "num_hidden_layers": 12,
-  "pad_token_id": 0,
-  "position_embedding_type": "absolute",
-  "tokenizer_class": "BertTokenizer",
-  "transformers_version": "4.6.0",
-  "type_vocab_size": 2,
-  "vocab_size": 21128
-}
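
Note on the deleted config: `embedding_size` (128) is smaller than `hidden_size` (768), which is the factorized embedding parameterization that the README's `--factorized_embedding_parameterization` flag refers to. The sketch below is only back-of-the-envelope arithmetic on the values in this file. It counts weight matrices and ignores biases, position embeddings, and token-type embeddings, so it is not an exact parameter count for the checkpoint.

```python
# Values copied from the deleted config.json.
vocab_size = 21128
embedding_size = 128
hidden_size = 768

# Factorized: a vocab_size x embedding_size lookup table plus an
# embedding_size x hidden_size projection (bias terms ignored here).
factorized = vocab_size * embedding_size + embedding_size * hidden_size

# Unfactorized (BERT-style): a single vocab_size x hidden_size table.
unfactorized = vocab_size * hidden_size

print(f"factorized:   {factorized:,}")
print(f"unfactorized: {unfactorized:,}")
print(f"ratio:        {unfactorized / factorized:.1f}x")
```

Under these assumptions the factorized word embeddings need roughly a sixth of the weights that a full-width table would.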

pytorch_model.bin DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:4e90c5f6b64fda667d9a10a8065878a4790515a0df171e361787354b25526141
-size 40325143
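
The three removed lines are a Git LFS pointer rather than the weights themselves: a spec version, a content hash, and the size of the real file in bytes. As a minimal sketch, the pointer can be read with a few lines of Python. The helper name `parse_lfs_pointer` is hypothetical and is not part of any Git LFS tooling.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer file."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid line has the form '<hash algorithm>:<hex digest>'.
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


# Pointer content copied from the deleted pytorch_model.bin entry.
pointer = """\
version https://git-lfs.github.com/spec/v1
oid sha256:4e90c5f6b64fda667d9a10a8065878a4790515a0df171e361787354b25526141
size 40325143
"""

info = parse_lfs_pointer(pointer)
print(info["hash_algo"], info["size_bytes"], f"{info['size_bytes'] / 1e6:.1f} MB")
```

The same helper applies to the `tf_model.h5` pointer, which uses the identical three-line layout.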

special_tokens_map.json DELETED

@@ -1 +0,0 @@
-{"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}

tf_model.h5 DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:00b2f0b8fa2b513f5dde4fe14f25978c459e1381cb7ff0fd259fc98c4a6b4d61
-size 51528256

tokenizer_config.json DELETED

@@ -1 +0,0 @@
-{"do_lower_case": false, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "model_max_length": 512}

vocab.txt DELETED

The diff for this file is too large to render.