---
language: zh
datasets: CLUECorpusSmall
widget:
- text: "作为电子extra0的平台,京东绝对是领先者。如今的刘强extra1已经是身价过extra2的老板。"
---
# Chinese T5 Version 1.1
## Model description
This is the set of Chinese T5 Version 1.1 models pre-trained by [UER-py](https://github.com/dbiir/UER-py/), which is introduced in [this paper](https://arxiv.org/abs/1909.05658).
**Version 1.1**
Chinese T5 Version 1.1 includes the following improvements compared to our Chinese T5 model:
- GEGLU activation in the feed-forward hidden layer, rather than ReLU
- Dropout turned off in pre-training
- No parameter sharing between the embedding and classifier layers
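
To make the first item concrete, here is a minimal pure-Python sketch of a GEGLU feed-forward block. It is an illustration only, not the UER-py implementation: the names `geglu_ffn`, `W_gate`, `W_up`, and `W_out` are hypothetical, and the exact GELU is used here, whereas implementations often use a tanh approximation instead.

```python
import math

def gelu(x: float) -> float:
    # Exact GELU: x * Phi(x), where Phi is the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def matvec(W, x):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def geglu_ffn(x, W_gate, W_up, W_out):
    # GEGLU: a GELU-activated projection gates a second linear projection
    # element-wise; the gated hidden vector is then projected back.
    gate = [gelu(v) for v in matvec(W_gate, x)]
    up = matvec(W_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(W_out, hidden)

# Toy example with identity weights (hypothetical values, for illustration).
identity = [[1.0, 0.0], [0.0, 1.0]]
out = geglu_ffn([1.0, -1.0], identity, identity, identity)
print(out)
```

Compared with a ReLU feed-forward layer, the block uses two input projections instead of one, which is why the hidden layer carries an extra weight matrix.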
You can download the set of Chinese T5 Version 1.1 models either from the [UER-py Modelzoo page](https://github.com/dbiir/UER-py/wiki/Modelzoo), or via HuggingFace from the links below:
| Model | Link |
| ----------------- | :----------------------------: |
| **T5-v1_1-Small** | [**L=8/H=512 (Small)**][small] |
| **T5-v1_1-Base** | [**L=12/H=768 (Base)**][base] |
In T5 Version 1.1, spans of the input sequence are masked by so-called sentinel tokens. Each sentinel token represents a unique mask token for the input sequence; the tokens are named `<extra_id_0>`, `<extra_id_1>`, … up to `<extra_id_99>`. However, Huggingface's Hosted inference API splits `<extra_id_xxx>` into multiple parts. We therefore replace `<extra_id_xxx>` with `extraxxx` in the vocabulary, so that BertTokenizer treats `extraxxx` as a single sentinel token.
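
If your inputs already use the standard T5 sentinel names, the renaming described above can be sketched with a small helper. This is a minimal illustration, not part of UER-py or transformers; the function name `to_bert_sentinels` is hypothetical.

```python
import re

# Matches standard T5 sentinels such as <extra_id_0> and captures the index.
_SENTINEL = re.compile(r"<extra_id_(\d+)>")

def to_bert_sentinels(text: str) -> str:
    # Hypothetical helper: rewrite <extra_id_N> as extraN, the form stored
    # in this model's vocabulary.
    return _SENTINEL.sub(lambda m: f"extra{m.group(1)}", text)

print(to_bert_sentinels("中国的首都是<extra_id_0>京"))  # 中国的首都是extra0京
```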
## How to use
You can use this model directly with a pipeline for text2text generation (using T5-v1_1-Small as an example):
```python
>>> from transformers import BertTokenizer, MT5ForConditionalGeneration, Text2TextGenerationPipeline
>>> tokenizer = BertTokenizer.from_pretrained("uer/t5-v1_1-small-chinese-cluecorpussmall")
>>> model = MT5ForConditionalGeneration.from_pretrained("uer/t5-v1_1-small-chinese-cluecorpussmall")
>>> text2text_generator = Text2TextGenerationPipeline(model, tokenizer)
>>> text2text_generator("中国的首都是extra0京", max_length=50, do_sample=False)
[{'generated_text': 'extra0 北 extra1 extra2 extra3 extra4 extra5'}]
```
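
The generated text interleaves sentinel tokens with the predicted spans, so the prediction for `extra0` is the text that follows it. A minimal sketch of how one might pull the spans out is shown below; `parse_sentinel_output` is a hypothetical helper, not a transformers API, and it assumes Chinese output, where removing the spaces between decoded tokens is safe.

```python
import re

def parse_sentinel_output(generated_text: str) -> dict:
    """Map each sentinel index to the text generated after it."""
    # Splitting on a capturing group keeps the indices in the result:
    # [prefix, idx0, text0, idx1, text1, ...]
    parts = re.split(r"extra(\d+)", generated_text)
    fills = {}
    for idx, text in zip(parts[1::2], parts[2::2]):
        # Assumption: output is Chinese, so inter-token spaces can be dropped.
        fills[int(idx)] = text.replace(" ", "")
    return fills

fills = parse_sentinel_output("extra0 北 extra1 extra2 extra3 extra4 extra5")
print(fills[0])  # 北
```

Here only `extra0` appears in the input, so only `fills[0]` is meaningful; the trailing sentinels in the output carry no text.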
## Training data
[CLUECorpusSmall](https://github.com/CLUEbenchmark/CLUECorpus2020/) is used as training data.
## Training procedure
The model is pre-trained by [UER-py](https://github.com/dbiir/UER-py/) on [Tencent Cloud](https://cloud.tencent.com/). We pre-train for 1,000,000 steps with a sequence length of 128 and then for 250,000 additional steps with a sequence length of 512. We use the same hyper-parameters for all model sizes.
Taking T5-v1_1-Small as an example:
Stage1:
```
python3 preprocess.py --corpus_path corpora/cluecorpussmall.txt \
--vocab_path models/google_zh_with_sentinel_vocab.txt \
--dataset_path cluecorpussmall_t5-v1_1_seq128_dataset.pt \
--processes_num 32 --seq_length 128 \
--dynamic_masking --data_processor t5
```
```
python3 pretrain.py --dataset_path cluecorpussmall_t5-v1_1_seq128_dataset.pt \
--vocab_path models/google_zh_with_sentinel_vocab.txt \
--config_path models/t5-v1_1/small_config.json \
--output_model_path models/cluecorpussmall_t5-v1_1_small_seq128_model.bin \
--world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
--total_steps 1000000 --save_checkpoint_steps 100000 --report_steps 50000 \
--learning_rate 1e-3 --batch_size 64 \
--span_masking --span_geo_prob 0.3 --span_max_length 5
```
Stage2:
```
python3 preprocess.py --corpus_path corpora/cluecorpussmall.txt \
--vocab_path models/google_zh_with_sentinel_vocab.txt \
--dataset_path cluecorpussmall_t5-v1_1_seq512_dataset.pt \
--processes_num 32 --seq_length 512 \
--dynamic_masking --data_processor t5
```
```
python3 pretrain.py --dataset_path cluecorpussmall_t5-v1_1_seq512_dataset.pt \
--pretrained_model_path models/cluecorpussmall_t5-v1_1_small_seq128_model.bin-1000000 \
--vocab_path models/google_zh_with_sentinel_vocab.txt \
--config_path models/t5-v1_1/small_config.json \
--output_model_path models/cluecorpussmall_t5-v1_1_small_seq512_model.bin \
--world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
--total_steps 250000 --save_checkpoint_steps 50000 --report_steps 10000 \
--learning_rate 5e-4 --batch_size 16 \
--span_masking --span_geo_prob 0.3 --span_max_length 5
```
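
For a rough sense of scale, the token budget implied by the two stages can be worked out from the flags above. This is a back-of-the-envelope sketch under two stated assumptions: that `--batch_size` is counted per process (per GPU), and that every sequence is filled to `--seq_length`, so the figures are upper bounds. The `Stage` class is hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Stage:
    steps: int        # --total_steps
    batch_size: int   # --batch_size (assumed to be per process)
    seq_length: int   # --seq_length used in preprocessing
    world_size: int = 8  # --world_size

    def tokens_per_process(self) -> int:
        # Upper bound: assumes every sequence is full length.
        return self.steps * self.batch_size * self.seq_length

    def tokens_total(self) -> int:
        # Only valid under the per-process batch size assumption.
        return self.tokens_per_process() * self.world_size

stage1 = Stage(steps=1_000_000, batch_size=64, seq_length=128)
stage2 = Stage(steps=250_000, batch_size=16, seq_length=512)

for name, stage in (("Stage1", stage1), ("Stage2", stage2)):
    print(f"{name}: {stage.tokens_per_process():,} tokens per process, "
          f"{stage.tokens_total():,} across {stage.world_size} processes")
```

Under these assumptions, Stage1 processes four times as many token positions as Stage2, even though Stage2 uses sequences four times as long.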
Finally, we convert the pre-trained model into Huggingface's format:
```
python3 scripts/convert_t5_from_uer_to_huggingface.py --input_model_path models/cluecorpussmall_t5-v1_1_small_seq512_model.bin-250000 \
--output_model_path pytorch_model.bin \
--layers_num 8 \
--type t5-v1_1
```
### BibTeX entry and citation info
```
@article{2020t5,
title = {Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer},
author = {Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu},
journal = {Journal of Machine Learning Research},
pages = {1-67},
year = {2020}
}
@article{zhao2019uer,
title={UER: An Open-Source Toolkit for Pre-training Models},
author={Zhao, Zhe and Chen, Hui and Zhang, Jinbin and Zhao, Xin and Liu, Tao and Lu, Wei and Chen, Xi and Deng, Haotang and Ju, Qi and Du, Xiaoyong},
journal={EMNLP-IJCNLP 2019},
pages={241},
year={2019}
}
```
[small]:https://huggingface.co/uer/t5-v1_1-small-chinese-cluecorpussmall
[base]:https://huggingface.co/uer/t5-v1_1-base-chinese-cluecorpussmall