---
language: ko
---

# Pretrained BART in Korean

This is a BART model pretrained on multiple Korean datasets.

I used multiple datasets so that the model generalizes to both colloquial and written text.

Training was supported by the [TPU Research Cloud](https://sites.research.google/trc/) program.

The script used to pretrain the model is available [here](https://github.com/cosmoquester/transformers-bart-pretrain).

When you use the inference API, you must wrap the sentence in `[BOS]` and `[EOS]`, as in the example below.

```
[BOS] 안녕하세요? 반가워요~~ [EOS]
```
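The wrapping step can be done with a small helper before sending text to the model. This is a minimal sketch: `wrap_sentence` is a hypothetical name that is not part of the model repository, and it assumes a single space separates each special token from the text, as in the example above.

```python
BOS_TOKEN = "[BOS]"
EOS_TOKEN = "[EOS]"


def wrap_sentence(sentence: str) -> str:
    """Wrap a raw sentence in [BOS] ... [EOS].

    Hypothetical helper for illustration; assumes one space between
    each special token and the surrounding text.
    """
    # Strip outer whitespace so the spacing around the tokens stays uniform.
    return f"{BOS_TOKEN} {sentence.strip()} {EOS_TOKEN}"


print(wrap_sentence("안녕하세요? 반가워요~~"))
```

The returned string is what you would paste into the widget or pass as the input text.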

You can also test mask-filling performance with the `[MASK]` token, like this.
```
[BOS] [MASK] 먹었어? [EOS]
```
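A mask-filling prompt like the one above can be built by replacing one whitespace-separated word with `[MASK]`. The sketch below is illustrative only: `mask_word` is a hypothetical helper, the input sentence is an assumed example, and splitting on whitespace is a simplification that ignores how the model's tokenizer actually segments text.

```python
MASK_TOKEN = "[MASK]"


def mask_word(sentence: str, index: int) -> str:
    """Replace the word at `index` with [MASK] and wrap the result.

    Hypothetical helper; words are split on whitespace, which is a
    simplification and not the model's own tokenization.
    """
    words = sentence.split()
    words[index] = MASK_TOKEN  # raises IndexError if index is out of range
    return f"[BOS] {' '.join(words)} [EOS]"


# Assumed example sentence: mask the first word.
print(mask_word("밥 먹었어?", 0))
```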

## Benchmark

<style>
table {
  border-collapse: collapse;
  border-style: hidden;
  width: 100%;
}

td, th {
  border: 1px solid #4d5562;
  padding: 8px;
}
</style>

<table>
 <tr>
  <th>Dataset</th>
  
  <td>KLUE NLI dev</td>
  <td>NSMC test</td>
  <td>QuestionPair test</td>
  <td colspan="2">KLUE TC dev</td>
  <td colspan="3">KLUE STS dev</td>
  <td colspan="3">KorSTS dev</td>
  <td colspan="2">HateSpeech dev</td>
 </tr>
 <tr>
  <th>Metric</th>
  
  <!-- KLUE NLI -->
  <td>Acc</td>
  
  <!-- NSMC -->
  <td>Acc</td>
  
  <!-- QuestionPair -->
  <td>Acc</td>
  
  <!-- KLUE TC -->
  <td>Acc</td>
  <td>F1</td>
  
  <!-- KLUE STS -->
  <td>F1</td>
  <td>Pearson</td>
  <td>Spearman</td>
  
  <!-- KorSTS -->
  <td>F1</td>
  <td>Pearson</td>
  <td>Spearman</td>
  
  <!-- HateSpeech -->
  <td>Bias Acc</td>
  <td>Hate Acc</td>
 </tr>
 
 <tr>
  <th>Score</th>
  
  <!-- KLUE NLI -->
  <td>0.639</td>
  
  <!-- NSMC -->
  <td>0.8721</td>
  
  <!-- QuestionPair -->
  <td>0.905</td>
  
  <!-- KLUE TC -->
  <td>0.8551</td>
  <td>0.8515</td>
  
  <!-- KLUE STS -->
  <td>0.7406</td>
  <td>0.7593</td>
  <td>0.7551</td>
  
  <!-- KorSTS -->
  <td>0.7897</td>
  <td>0.7269</td>
  <td>0.7037</td>
  
  <!-- HateSpeech -->
  <td>0.8068</td>
  <td>0.5966</td>
 </tr>
</table>

- Performance was measured on Colab using [the notebooks here](https://github.com/cosmoquester/transformers-bart-finetune).

## Used Datasets

### [๋ชจ๋‘์˜ ๋ง๋ญ‰์น˜](https://corpus.korean.go.kr/)
- ์ผ์ƒ ๋Œ€ํ™” ๋ง๋ญ‰์น˜ 2020
- ๊ตฌ์–ด ๋ง๋ญ‰์น˜
- ๋ฌธ์–ด ๋ง๋ญ‰์น˜
- ์‹ ๋ฌธ ๋ง๋ญ‰์น˜

### AIhub
- [개방데이터 전문분야말뭉치 (Open Data: Specialized Domain Corpus)](https://aihub.or.kr/aidata/30717)
- [개방데이터 한국어대화요약 (Open Data: Korean Dialogue Summarization)](https://aihub.or.kr/aidata/30714)
- [개방데이터 감성 대화 말뭉치 (Open Data: Emotional Dialogue Corpus)](https://aihub.or.kr/aidata/7978)
- [개방데이터 한국어 음성 (Open Data: Korean Speech)](https://aihub.or.kr/aidata/105)
- [개방데이터 한국어 SNS (Open Data: Korean SNS)](https://aihub.or.kr/aidata/30718)

### [세종 말뭉치 (Sejong Corpus)](https://ithub.korean.go.kr/)