---
language:
- zh
tags:
- t5
- pytorch
- zh
- Text2Text-Generation
license: "apache-2.0"
widget:
- text: "对联:丹枫江冷人初去"
---
# T5 for Chinese Couplet (t5-chinese-couplet) Model
A T5 model for Chinese couplet generation.
Results of `t5-chinese-couplet` on the couplet **test** data:
|prefix|input_text|target_text|pred|
|:-- |:--- |:--- |:-- |
|对联:|春回大地,对对黄莺鸣暖树|日照神州,群群紫燕衔新泥|福至人间,家家紫燕舞和风|
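On the Couplet test set, the generated lines satisfy the requirements of equal character count, part-of-speech alignment, word-for-word correspondence, and similarity of form, but they do not yet achieve strict semantic parallelism or follow the tonal (平仄, level and oblique) rules.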
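Of these criteria, equal character count is the simplest to check automatically. The following is a minimal sketch, not part of textgen: `clause_lengths` and `same_length` are hypothetical helpers, and the punctuation set used for splitting clauses is an assumption.

```python
import re
from typing import List


def clause_lengths(line: str) -> List[int]:
    """Return the character count of each clause in a couplet line."""
    # Assumption: clauses are separated by common full-width punctuation.
    clauses = re.split(r"[,。;、!?]", line)
    return [len(clause) for clause in clauses if clause]


def same_length(first: str, second: str) -> bool:
    """True if both lines have the same number of clauses and characters per clause."""
    return clause_lengths(first) == clause_lengths(second)


# Input and prediction taken from the test example in the table above.
first_line = "春回大地,对对黄莺鸣暖树"
prediction = "福至人间,家家紫燕舞和风"
print(clause_lengths(first_line))
print(same_length(first_line, prediction))
```

A check like this only covers the length criterion; part-of-speech alignment and tonal rules would need linguistic resources that this sketch does not attempt.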
T5 network architecture (original T5):
![arch](t5.png)
## Usage
This model is released as part of the open-source text generation project [textgen](https://github.com/shibing624/textgen), which supports T5 models. It can be called as follows:
Install package:
```shell
pip install -U textgen
```
```python
from textgen import T5Model
model = T5Model("t5", "shibing624/t5-chinese-couplet")
r = model.predict(["对联:丹枫江冷人初去"])
print(r) # ['白石矶寒客不归']
```
## Usage (HuggingFace Transformers)
Without [textgen](https://github.com/shibing624/textgen), you can use the model directly through Transformers: tokenize the input, pass it to the model's `generate` method, and decode the generated sequence.
Install package:
```shell
pip install transformers
```
```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("shibing624/t5-chinese-couplet")
model = T5ForConditionalGeneration.from_pretrained("shibing624/t5-chinese-couplet")


def batch_generate(input_texts, max_length=64):
    # padding=True lets inputs of different lengths share one batch tensor
    features = tokenizer(input_texts, padding=True, return_tensors='pt')
    outputs = model.generate(input_ids=features['input_ids'],
                             attention_mask=features['attention_mask'],
                             max_length=max_length)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)


r = batch_generate(["对联:丹枫江冷人初去"])
print(r)
```
output:
```shell
['白石矶寒客不归']
```
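Both usage examples prepend `对联:` to the first line of the couplet before calling the model. A minimal sketch of a prompt builder that applies this prefix consistently; `build_prompts` is a hypothetical helper, and the assumption that the prefix is always exactly `对联:` (with an ASCII colon) is taken from the examples above rather than from the training code.

```python
from typing import List

# Assumption: the same prefix string that appears in the usage examples.
PREFIX = "对联:"


def build_prompts(first_lines: List[str], prefix: str = PREFIX) -> List[str]:
    """Prepend the task prefix to each first line unless it is already present."""
    prompts = []
    for line in first_lines:
        line = line.strip()
        prompts.append(line if line.startswith(prefix) else prefix + line)
    return prompts


print(build_prompts(["丹枫江冷人初去", "对联:春回大地"]))
```

The returned list can be passed directly to `batch_generate` or to `model.predict`.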
Model files:
```
t5-chinese-couplet
├── config.json
├── model_args.json
├── pytorch_model.bin
├── special_tokens_map.json
├── tokenizer_config.json
├── spiece.model
└── vocab.txt
```
### Training dataset
#### Chinese couplet dataset
- Data: [couplet dataset on GitHub](https://github.com/wb14123/couplet-dataset), [cleaned couplet dataset on GitHub](https://github.com/v-zich/couplet-clean-dataset)
- Related resources
  - [Huggingface](https://huggingface.co/)
  - Langboat Chinese [MengZi T5 pretrained Model](https://huggingface.co/Langboat/mengzi-t5-base) and [paper](https://arxiv.org/pdf/2110.06696.pdf)
  - [textgen](https://github.com/shibing624/textgen)

Data format:
```text
head -n 1 couplet_files/couplet/train/in.txt
晚 风 摇 树 树 还 挺
head -n 1 couplet_files/couplet/train/out.txt
晨 露 润 花 花 更 红
```
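The dataset stores each line as space-separated characters, with first lines in `in.txt` and second lines in `out.txt`. A minimal sketch of reading them back into plain (input, target) string pairs, assuming the two files are line-aligned; `detokenize`, `pair_lines`, and `load_pairs` are hypothetical helpers, not textgen functions.

```python
from pathlib import Path
from typing import Iterable, List, Tuple


def detokenize(line: str) -> str:
    """Join space-separated characters back into one string."""
    return "".join(line.split())


def pair_lines(in_lines: Iterable[str], out_lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Pair each first line with its second line, skipping pairs with an empty side."""
    pairs = []
    for src, tgt in zip(in_lines, out_lines):
        src, tgt = detokenize(src), detokenize(tgt)
        if src and tgt:
            pairs.append((src, tgt))
    return pairs


def load_pairs(in_path: str, out_path: str) -> List[Tuple[str, str]]:
    """Read two line-aligned files (assumed UTF-8) and return the paired lines."""
    in_lines = Path(in_path).read_text(encoding="utf-8").splitlines()
    out_lines = Path(out_path).read_text(encoding="utf-8").splitlines()
    return pair_lines(in_lines, out_lines)


# The first training pair shown above.
print(pair_lines(["晚 风 摇 树 树 还 挺"], ["晨 露 润 花 花 更 红"]))
```

With the layout shown above, `load_pairs("couplet_files/couplet/train/in.txt", "couplet_files/couplet/train/out.txt")` would return the full list of training pairs.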
To train a T5 model, see the textgen [couplet generation model comparison](https://github.com/shibing624/textgen/blob/main/docs/%E5%AF%B9%E8%81%94%E7%94%9F%E6%88%90%E6%A8%A1%E5%9E%8B%E5%AF%B9%E6%AF%94.md).
## Citation
```latex
@software{textgen,
author = {Xu Ming},
title = {textgen: Implementation of Text Generation models},
year = {2022},
url = {https://github.com/shibing624/textgen},
}
```