---
language: 
  - zh
tags:
- bart-large-chinese
datasets:
- lccc
- kd_conv
---

# dialogue-bart-large-chinese
This is a seq2seq model fine-tuned from bart-large-chinese on several Chinese dialogue datasets.


# Spaces
You can try out our model on Hugging Face Spaces: [HIT-TMG/dialogue-bart-large-chinese](https://huggingface.co/spaces/HIT-TMG/dialogue-bart-large-chinese).


# Datasets
We use 4 Chinese dialogue datasets from [LUGE](https://www.luge.ai/#/).

| Dataset                      | Count      | Domain                |
| ----                         | ----       | ----                  |
| Chinese Persona Chat (CPC)   | 23,000     | Open                  |
| LCCC                         | 11,987,759 | Open                  |
| Emotional STC (ESTC)         | 899,207    | Open                  |
| KdConv                       | 3,000      | Movie, Music, Travel  |


# Data format
Input: `[CLS] 对话历史:<history> [SEP] 知识:<knowledge> [SEP]`

Output: `[CLS] <response> [SEP]`

The prefixes `对话历史:` ("dialogue history:") and `知识:` ("knowledge:") are literal parts of the input string.
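
As an illustration of this template, here is a minimal sketch of how the full input (history plus knowledge) could be assembled. The history and knowledge strings are invented for this example, and the exact way the knowledge field was formatted during training is not specified beyond the template above:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("HIT-TMG/dialogue-bart-large-chinese")

# Invented, whitespace-tokenized example data (not taken from the training sets).
history = ["你 喜欢 看 电影 吗 ?", "喜欢 , 我 最近 在 看 科幻片 。"]
knowledge = "《 流浪地球 》 是 一部 中国 科幻 电影 。"

# Build "对话历史:<history> [SEP] 知识:<knowledge>"; the tokenizer adds the
# surrounding [CLS] ... [SEP] from the template automatically.
input_str = ("对话历史:" + tokenizer.sep_token.join(history)
             + tokenizer.sep_token + "知识:" + knowledge)
input_ids = tokenizer(input_str, return_tensors="pt").input_ids
```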


# Example
```python
from transformers import BertTokenizer, BartForConditionalGeneration

# Note that the tokenizer is a BertTokenizer, not a BartTokenizer
tokenizer = BertTokenizer.from_pretrained("HIT-TMG/dialogue-bart-large-chinese")
model = BartForConditionalGeneration.from_pretrained("HIT-TMG/dialogue-bart-large-chinese")

# an example from CPC dev data
history = ["可以 认识 一下 吗 ?", "当然 可以 啦 , 你好 。", "嘿嘿 你好 , 请问 你 最近 在 忙 什么 呢 ?", "我 最近 养 了 一只 狗狗 , 我 在 训练 它 呢 。"]
history_str = "对话历史:" + tokenizer.sep_token.join(history)
input_ids = tokenizer(history_str, return_tensors='pt').input_ids
output_ids = model.generate(input_ids)[0]
print(tokenizer.decode(output_ids, skip_special_tokens=True))
```
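
`generate` above runs with the model's default generation settings. For more control you can pass standard Hugging Face generation arguments; the values below are illustrative, not tuned for this model:

```python
# Beam search with a cap on the response length (illustrative values).
output_ids = model.generate(input_ids, num_beams=4, max_new_tokens=64)[0]
print(tokenizer.decode(output_ids, skip_special_tokens=True))
```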
 
 
# Contact
If you encounter any issues, feel free to contact us via email: <u>yanshekwoo@foxmail.com</u>