---
language: 
- zh
tags:
- t5
- pytorch
- zh
- Text2Text-Generation
license: "apache-2.0"
widget:
- text: "对联:丹枫江冷人初去"

---

# T5 for Chinese Couplet (t5-chinese-couplet) Model
A T5 model for Chinese couplet generation.

Evaluation of `t5-chinese-couplet` on the couplet **test** set:

|prefix|input_text|target_text|pred|
|:-- |:--- |:--- |:-- |
|对联:|春回大地,对对黄莺鸣暖树|日照神州,群群紫燕衔新泥|福至人间,家家紫燕舞和风|

On the couplet test set, the generated second lines satisfy the requirements of matching character count, part-of-speech alignment, word-level correspondence, and surface-form parallelism; however, semantic parallelism and tonal-pattern (平仄) conformance are not yet achieved.

T5 network architecture (vanilla T5):

![arch](t5.png)

## Usage

This model is released as part of the open-source text generation project [textgen](https://github.com/shibing624/textgen), which supports T5 models. Call it as follows:

Install package:
```shell
pip install -U textgen
```

```python
from textgen import T5Model

# Load the couplet model (downloaded from the HuggingFace Hub on first use)
model = T5Model("t5", "shibing624/t5-chinese-couplet")
r = model.predict(["对联:丹枫江冷人初去"])
print(r)  # ['白石矶寒客不归']
```

## Usage (HuggingFace Transformers)
Without [textgen](https://github.com/shibing624/textgen), you can use the model like this: 

Pass your input through the model and decode the output to get the generated sentence.

Install package:
```
pip install transformers 
```

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("shibing624/t5-chinese-couplet")
model = T5ForConditionalGeneration.from_pretrained("shibing624/t5-chinese-couplet")


def batch_generate(input_texts, max_length=64):
    # padding=True is required so that inputs of different lengths
    # can be batched into a single tensor
    features = tokenizer(input_texts, return_tensors='pt', padding=True)
    outputs = model.generate(input_ids=features['input_ids'],
                             attention_mask=features['attention_mask'],
                             max_length=max_length)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)


r = batch_generate(["对联:丹枫江冷人初去"])
print(r)
```

output:
```shell
['白石矶寒客不归']
```

Model files:
```
t5-chinese-couplet
    ├── config.json
    ├── model_args.json
    ├── pytorch_model.bin
    ├── special_tokens_map.json
    ├── tokenizer_config.json
    ├── spiece.model
    └── vocab.txt
```


### Training Dataset
#### Chinese Couplet Dataset

- Data: [couplet dataset (GitHub)](https://github.com/wb14123/couplet-dataset), [cleaned couplet dataset (GitHub)](https://github.com/v-zich/couplet-clean-dataset)
- Related resources
  - [Huggingface](https://huggingface.co/)
  - Langboat's Chinese [Mengzi T5 pretrained model](https://huggingface.co/Langboat/mengzi-t5-base) and [paper](https://arxiv.org/pdf/2110.06696.pdf)
  - [textgen](https://github.com/shibing624/textgen)
  
  
Data format:

```text
head -n 1 couplet_files/couplet/train/in.txt
晚 风 摇 树 树 还 挺 

head -n 1 couplet_files/couplet/train/out.txt
晨 露 润 花 花 更 红 
```
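The raw dataset stores each half of a couplet as space-separated characters in parallel `in.txt`/`out.txt` files. A minimal sketch of turning these parallel files into the `prefix`/`input_text`/`target_text` rows shown in the evaluation table above (the helper name is illustrative; file paths and the `对联:` prefix follow the examples in this card):

```python
def load_couplet_pairs(in_path, out_path, prefix="对联:"):
    """Pair each line of in.txt with the corresponding line of out.txt,
    strip the per-character spaces, and attach the task prefix."""
    rows = []
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, encoding="utf-8") as fout:
        for first, second in zip(fin, fout):
            input_text = first.strip().replace(" ", "")
            target_text = second.strip().replace(" ", "")
            if input_text and target_text:
                rows.append({
                    "prefix": prefix,
                    "input_text": input_text,
                    "target_text": target_text,
                })
    return rows
```

For the sample lines above, this would yield one row with `input_text` `晚风摇树树还挺` and `target_text` `晨露润花花更红`.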


If you need to train a T5 model yourself, refer to [https://github.com/shibing624/textgen/blob/main/docs/%E5%AF%B9%E8%81%94%E7%94%9F%E6%88%90%E6%A8%A1%E5%9E%8B%E5%AF%B9%E6%AF%94.md](https://github.com/shibing624/textgen/blob/main/docs/%E5%AF%B9%E8%81%94%E7%94%9F%E6%88%90%E6%A8%A1%E5%9E%8B%E5%AF%B9%E6%AF%94.md)


## Citation

```latex
@software{textgen,
  author = {Xu Ming},
  title = {textgen: Implementation of Text Generation models},
  year = {2022},
  url = {https://github.com/shibing624/textgen},
}
```