---
language: 
- zh
tags:
- t5
- pytorch
- zh
- Text2Text-Generation
license: "apache-2.0"
widget:
- text: "对联:丹枫江冷人初去"

---

# T5 for Chinese Couplet (t5-chinese-couplet) Model
A T5 model for Chinese couplet generation.

Evaluation of `t5-chinese-couplet` on the couplet **test** set:

|prefix|input_text|target_text|pred|
|:-- |:--- |:--- |:-- |
|对联:|春回大地,对对黄莺鸣暖树|日照神州,群群紫燕衔新泥|福至人间,家家紫燕舞和风|

On the couplet test set, the generated second lines satisfy the requirements of matching character count, part-of-speech alignment, word-level correspondence, and surface-form parallelism; however, semantic parallelism and tonal-pattern (平仄) conformance are not yet achieved.

T5 network architecture (vanilla T5):

![arch](t5.png)

## Usage

This model is released as part of the open-source text generation project [textgen](https://github.com/shibing624/textgen), which supports T5 models. Call it as follows:

Install package:
```shell
pip install -U textgen
```

```python
from textgen import T5Model

# Load the couplet model (downloaded from the HuggingFace Hub on first use)
model = T5Model("t5", "shibing624/t5-chinese-couplet")
r = model.predict(["对联:丹枫江冷人初去"])
print(r)  # ['白石矶寒客不归']
```

## Usage (HuggingFace Transformers)
Without [textgen](https://github.com/shibing624/textgen), you can use the model like this: 

Pass your input through the model and decode the output to get the generated sentence.

Install package:
```
pip install transformers 
```

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("shibing624/t5-chinese-couplet")
model = T5ForConditionalGeneration.from_pretrained("shibing624/t5-chinese-couplet")


def batch_generate(input_texts, max_length=64):
    # padding=True is required so that inputs of different lengths
    # can be batched into a single tensor
    features = tokenizer(input_texts, return_tensors='pt', padding=True)
    outputs = model.generate(input_ids=features['input_ids'],
                             attention_mask=features['attention_mask'],
                             max_length=max_length)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)


r = batch_generate(["对联:丹枫江冷人初去"])
print(r)
```

output:
```shell
['白石矶寒客不归']
```

Model files:
```
t5-chinese-couplet
    ├── config.json
    ├── model_args.json
    ├── pytorch_model.bin
    ├── special_tokens_map.json
    ├── tokenizer_config.json
    ├── spiece.model
    └── vocab.txt
```


### Training Dataset
#### Chinese Couplet Dataset

- Data: [couplet dataset (GitHub)](https://github.com/wb14123/couplet-dataset), [cleaned couplet dataset (GitHub)](https://github.com/v-zich/couplet-clean-dataset)
- Related resources
  - [Huggingface](https://huggingface.co/)
  - Langboat's Chinese [Mengzi T5 pretrained model](https://huggingface.co/Langboat/mengzi-t5-base) and [paper](https://arxiv.org/pdf/2110.06696.pdf)
  - [textgen](https://github.com/shibing624/textgen)
  
  
Data format:

```text
head -n 1 couplet_files/couplet/train/in.txt
晚 风 摇 树 树 还 挺 

head -n 1 couplet_files/couplet/train/out.txt
晨 露 润 花 花 更 红 
```
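The raw dataset stores each half of a couplet as space-separated characters in parallel `in.txt`/`out.txt` files. A minimal sketch of turning these parallel files into the `prefix`/`input_text`/`target_text` rows shown in the evaluation table above (the helper name is illustrative; file paths and the `对联:` prefix follow the examples in this card):

```python
def load_couplet_pairs(in_path, out_path, prefix="对联:"):
    """Pair each line of in.txt with the corresponding line of out.txt,
    strip the per-character spaces, and attach the task prefix."""
    rows = []
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, encoding="utf-8") as fout:
        for first, second in zip(fin, fout):
            input_text = first.strip().replace(" ", "")
            target_text = second.strip().replace(" ", "")
            if input_text and target_text:
                rows.append({
                    "prefix": prefix,
                    "input_text": input_text,
                    "target_text": target_text,
                })
    return rows
```

For the sample lines above, this would yield one row with `input_text` `晚风摇树树还挺` and `target_text` `晨露润花花更红`.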


If you need to train a T5 model yourself, refer to [https://github.com/shibing624/textgen/blob/main/docs/%E5%AF%B9%E8%81%94%E7%94%9F%E6%88%90%E6%A8%A1%E5%9E%8B%E5%AF%B9%E6%AF%94.md](https://github.com/shibing624/textgen/blob/main/docs/%E5%AF%B9%E8%81%94%E7%94%9F%E6%88%90%E6%A8%A1%E5%9E%8B%E5%AF%B9%E6%AF%94.md)


## Citation

```latex
@software{textgen,
  author = {Xu Ming},
  title = {textgen: Implementation of Text Generation models},
  year = {2022},
  url = {https://github.com/shibing624/textgen},
}
```