---
language: Chinese
datasets: CLUECorpusSmall
widget: 
- text: "作为电子extra0的平台,京东绝对是领先者。如今的刘强extra1已经是身价过extra2的老板。"



---

# Chinese T5 Version 1.1

## Model description

This is the set of Chinese T5 Version 1.1 models pre-trained by [UER-py](https://arxiv.org/abs/1909.05658).

**Version 1.1**

Chinese T5 Version 1.1 includes the following improvements over our Chinese T5 model:

- GEGLU activation in the feed-forward hidden layer, rather than ReLU
- Dropout turned off during pre-training
- No parameter sharing between the embedding and classifier layers
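
To make the first point concrete, the following is a minimal pure-Python sketch of a GEGLU feed-forward block next to a ReLU one. It is illustrative only and is not taken from UER-py: the function and weight names (`geglu_ffn`, `w_gate`, `w_up`, `w_down`) are hypothetical, biases are omitted, and the tanh approximation of GELU is an assumption.

```python
import math


def gelu(x: float) -> float:
    """Tanh approximation of GELU (assumed here; an exact erf form also exists)."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def matvec(vec, mat):
    """Multiply a row vector of length d_in by a d_in x d_out matrix."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]


def relu_ffn(x, w_in, w_down):
    """ReLU feed-forward block: one input projection, then ReLU."""
    hidden = [max(0.0, h) for h in matvec(x, w_in)]
    return matvec(hidden, w_down)


def geglu_ffn(x, w_gate, w_up, w_down):
    """GEGLU feed-forward block: GELU(x W_gate) * (x W_up), then the output projection."""
    gate = [gelu(g) for g in matvec(x, w_gate)]
    up = matvec(x, w_up)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(hidden, w_down)


# Toy weights (hypothetical values): d_model = 2, d_ff = 3.
x = [1.0, -2.0]
w_gate = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
w_up = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
w_down = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(geglu_ffn(x, w_gate, w_up, w_down))
```

The gated variant uses two input projections instead of one, so the hidden layer carries an extra weight matrix compared with the ReLU block.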

|                   |              Link              |
| ----------------- | :----------------------------: |
| **T5-v1_1-Small** | [**L=8/H=512 (Small)**][small] |
| **T5-v1_1-Base**  | [**L=12/H=768 (Base)**][base]  |

In T5 Version 1.1, spans of the input sequence are masked by so-called sentinel tokens. Each sentinel token represents a unique mask token for the input sequence; sentinels are numbered from `<extra_id_0>`, `<extra_id_1>`, … up to `<extra_id_99>`. However, Huggingface's Hosted inference API splits `<extra_id_xxx>` into multiple parts. We therefore replace `<extra_id_xxx>` with `extraxxx` in the vocabulary, so that BertTokenizer treats `extraxxx` as a single sentinel token.
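
As a rough illustration of how sentinel tokens pair an input with its target, the sketch below replaces chosen spans with `extra0`, `extra1`, … and builds the matching target sequence. The helper `corrupt_spans` is hypothetical, and the closing sentinel at the end of the target follows the original T5 convention; whether UER-py emits it in exactly this way is an assumption.

```python
def corrupt_spans(tokens, spans):
    """Replace each (start, end) span with a sentinel and return (source, target).

    `spans` must be sorted and non-overlapping; `end` is exclusive.
    """
    source, target = [], []
    cursor = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"extra{i}"
        source.extend(tokens[cursor:start])  # keep the unmasked text
        source.append(sentinel)              # one sentinel per masked span
        target.append(sentinel)              # target repeats the sentinel ...
        target.extend(tokens[start:end])     # ... followed by the masked tokens
        cursor = end
    source.extend(tokens[cursor:])
    target.append(f"extra{len(spans)}")      # closing sentinel (assumed, as in original T5)
    return source, target


tokens = list("中国的首都是北京")
source, target = corrupt_spans(tokens, [(6, 7)])
print("".join(source))   # 中国的首都是extra0京
print(" ".join(target))  # extra0 北 extra1
```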

## How to use

You can use this model directly with a pipeline for text2text generation (taking T5-v1_1-Small as an example):

```python
>>> from transformers import BertTokenizer, MT5ForConditionalGeneration, Text2TextGenerationPipeline
>>> tokenizer = BertTokenizer.from_pretrained("uer/t5-v1_1-small-chinese-cluecorpussmall")
>>> model = MT5ForConditionalGeneration.from_pretrained("uer/t5-v1_1-small-chinese-cluecorpussmall")
>>> text2text_generator = Text2TextGenerationPipeline(model, tokenizer)  
>>> text2text_generator("中国的首都是extra0京", max_length=50, do_sample=False)
    [{'generated_text': 'extra0 北 extra1 extra2 extra3 extra4 extra5'}]
```
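
The generated text lists each sentinel followed by the tokens predicted for that span. A small sketch of substituting those predictions back into the masked input is given below. The helpers `parse_spans` and `fill` are hypothetical and not part of `transformers`, and joining the predicted pieces without spaces is an assumption that suits Chinese text.

```python
import re

SENTINEL = re.compile(r"extra(\d+)")


def parse_spans(generated_text):
    """Map each sentinel index to the text generated after it."""
    spans = {}
    current = None
    for piece in generated_text.split():
        match = SENTINEL.fullmatch(piece)
        if match:
            current = int(match.group(1))
            spans.setdefault(current, "")
        elif current is not None:
            spans[current] += piece  # join without spaces (assumed for Chinese)
    return spans


def fill(masked_text, generated_text):
    """Replace every sentinel in the input with its predicted span."""
    spans = parse_spans(generated_text)
    return SENTINEL.sub(lambda m: spans.get(int(m.group(1)), ""), masked_text)


print(fill("中国的首都是extra0京", "extra0 北 extra1 extra2 extra3 extra4 extra5"))
# 中国的首都是北京
```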

## Training data

[CLUECorpusSmall](https://github.com/CLUEbenchmark/CLUECorpus2020/) is used as training data. 

## Training procedure

The model is pre-trained by [UER-py](https://github.com/dbiir/UER-py/) on [Tencent Cloud](https://cloud.tencent.com/). We pre-train for 1,000,000 steps with a sequence length of 128 and then for an additional 250,000 steps with a sequence length of 512. We use the same hyper-parameters across model sizes.

Taking T5-v1_1-Small as an example:

Stage1:

```
python3 preprocess.py --corpus_path corpora/cluecorpussmall.txt \
                      --vocab_path models/google_zh_with_sentinel_vocab.txt \
                      --dataset_path cluecorpussmall_t5-v1_1_seq128_dataset.pt \
                      --processes_num 32 --seq_length 128 \
                      --dynamic_masking --target t5 
```

```
python3 pretrain.py --dataset_path cluecorpussmall_t5-v1_1_seq128_dataset.pt \
                    --vocab_path models/google_zh_with_sentinel_vocab.txt \
                    --config_path models/t5-v1_1/small_config.json \
                    --output_model_path models/cluecorpussmall_t5-v1_1_small_seq128_model.bin \
                    --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
                    --total_steps 1000000 --save_checkpoint_steps 100000 --report_steps 50000 \
                    --learning_rate 1e-3 --batch_size 64 \
                    --span_masking --span_geo_prob 0.3 --span_max_length 5 \
                    --embedding word --relative_position_embedding --remove_embedding_layernorm --tgt_embedding word \
                    --encoder transformer --mask fully_visible --layernorm_positioning pre \
                    --feed_forward gated --decoder transformer --target t5
```

Stage2:

```
python3 preprocess.py --corpus_path corpora/cluecorpussmall.txt \
                      --vocab_path models/google_zh_with_sentinel_vocab.txt \
                      --dataset_path cluecorpussmall_t5-v1_1_seq512_dataset.pt \
                      --processes_num 32 --seq_length 512 \
                      --dynamic_masking --target t5 
```

```
python3 pretrain.py --dataset_path cluecorpussmall_t5-v1_1_seq512_dataset.pt \
                    --pretrained_model_path models/cluecorpussmall_t5-v1_1_small_seq128_model.bin-1000000 \
                    --vocab_path models/google_zh_with_sentinel_vocab.txt \
                    --config_path models/t5-v1_1/small_config.json \
                    --output_model_path models/cluecorpussmall_t5-v1_1_small_seq512_model.bin \
                    --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
                    --total_steps 250000 --save_checkpoint_steps 50000 --report_steps 10000 \
                    --learning_rate 5e-4 --batch_size 16 \
                    --span_masking --span_geo_prob 0.3 --span_max_length 5 \
                    --embedding word --relative_position_embedding --remove_embedding_layernorm --tgt_embedding word \
                    --encoder transformer --mask fully_visible --layernorm_positioning pre \
                    --feed_forward gated --decoder transformer --target t5
```
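
As a quick arithmetic check on the two stages, the sketch below multiplies out the `--batch_size`, `--seq_length` and `--total_steps` values from the commands above. Both stages work out to the same number of token positions per batch. Whether `--batch_size` is counted per process or across all 8 processes is not stated here, so the `world_size` multiplier is an assumption and defaults to 1.

```python
# Values copied from the Stage1 and Stage2 commands above.
STAGES = {
    "stage1": {"steps": 1_000_000, "batch_size": 64, "seq_length": 128},
    "stage2": {"steps": 250_000, "batch_size": 16, "seq_length": 512},
}


def positions_per_batch(stage):
    """Upper bound on token positions in one batch (padding included)."""
    return stage["batch_size"] * stage["seq_length"]


def total_positions(stage, world_size=1):
    """Token positions over the whole stage.

    world_size > 1 only applies if batch_size is per process (an assumption).
    """
    return stage["steps"] * positions_per_batch(stage) * world_size


for name, stage in STAGES.items():
    print(name, positions_per_batch(stage), total_positions(stage))
# stage1 8192 8192000000
# stage2 8192 2048000000
```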

Finally, we convert the pre-trained model into Huggingface's format:

```
python3 scripts/convert_t5_from_uer_to_huggingface.py --input_model_path models/cluecorpussmall_t5-v1_1_small_seq512_model.bin-250000 \
                                                      --output_model_path pytorch_model.bin \
                                                      --layers_num 8 \
                                                      --type t5-v1_1
```


## BibTeX entry and citation info

```
@article{2020t5,
  title   = {Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer},
  author  = {Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu},
  journal = {Journal of Machine Learning Research},
  pages   = {1-67},
  year    = {2020}
}

@article{zhao2019uer,
  title={UER: An Open-Source Toolkit for Pre-training Models},
  author={Zhao, Zhe and Chen, Hui and Zhang, Jinbin and Zhao, Xin and Liu, Tao and Lu, Wei and Chen, Xi and Deng, Haotang and Ju, Qi and Du, Xiaoyong},
  journal={EMNLP-IJCNLP 2019},
  pages={241},
  year={2019}
}
```

[small]:https://huggingface.co/uer/t5-v1_1-small-chinese-cluecorpussmall
[base]:https://huggingface.co/uer/t5-v1_1-base-chinese-cluecorpussmall