---
language: ko
---

# Pretrained BART in Korean

This is a BART model pretrained on multiple Korean datasets.

Multiple datasets were used so that the model generalizes to both colloquial and written text.

Training was supported by the [TPU Research Cloud](https://sites.research.google/trc/) program.

The script used to pre-train the model is available [here](https://github.com/cosmoquester/transformers-bart-pretrain).

When you use the inference API, you must wrap the sentence with `[BOS]` and `[EOS]` tokens, as in the example below.

```
[BOS] ์•ˆ๋…•ํ•˜์„ธ์š”? ๋ฐ˜๊ฐ€์›Œ์š”~~ [EOS]
```

You can also test mask-filling performance with the `[MASK]` token, like this:
```
[BOS] [MASK] ๋จน์—ˆ์–ด? [EOS]
```
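Building these inputs reduces to simple string formatting. A minimal sketch (the helper names are hypothetical; only the special tokens come from this card):

```python
# Special tokens the model expects, per the examples above.
BOS, EOS, MASK = "[BOS]", "[EOS]", "[MASK]"

def wrap_sentence(sentence: str) -> str:
    """Wrap a raw sentence with the required [BOS]/[EOS] tokens."""
    return f"{BOS} {sentence} {EOS}"

def mask_input(*parts: str) -> str:
    """Build a mask-filling query; pass MASK for the blanked-out span."""
    return wrap_sentence(" ".join(parts))

print(wrap_sentence("์•ˆ๋…•ํ•˜์„ธ์š”? ๋ฐ˜๊ฐ€์›Œ์š”~~"))  # [BOS] ์•ˆ๋…•ํ•˜์„ธ์š”? ๋ฐ˜๊ฐ€์›Œ์š”~~ [EOS]
print(mask_input(MASK, "๋จน์—ˆ์–ด?"))             # [BOS] [MASK] ๋จน์—ˆ์–ด? [EOS]
```

The same strings can then be fed to the tokenizer of your `transformers` pipeline.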

## Benchmark

<table>
 <tr>
  <th>Dataset</th>
  
  <td>KLUE NLI dev</td>
  <td>NSMC test</td>
  <td>QuestionPair test</td>
  <td colspan="2">KLUE TC dev</td>
  <td colspan="3">KLUE STS dev</td>
  <td colspan="3">KorSTS dev</td>
  <td colspan="2">HateSpeech dev</td>
 </tr>
 <tr>
  <th>Metric</th>
  
  <!-- KLUE NLI -->
  <td>Acc</td>
  
  <!-- NSMC -->
  <td>Acc</td>
  
  <!-- QuestionPair -->
  <td>Acc</td>
  
  <!-- KLUE TC -->
  <td>Acc</td>
  <td>F1</td>
  
  <!-- KLUE STS -->
  <td>F1</td>
  <td>Pearson</td>
  <td>Spearman</td>
  
  <!-- KorSTS -->
  <td>F1</td>
  <td>Pearson</td>
  <td>Spearman</td>
  
  <!-- HateSpeech -->
  <td>Bias Acc</td>
  <td>Hate Acc</td>
 </tr>
 
 <tr>
  <th>Score</th>
  
  <!-- KLUE NLI -->
  <td>0.5253</td>
  
  <!-- NSMC -->
  <td>0.8425</td>
  
  <!-- QuestionPair -->
  <td>0.8945</td>
  
  <!-- KLUE TC -->
  <td>0.8047</td>
  <td>0.7988</td>
  
  <!-- KLUE STS -->
  <td>0.7411</td>
  <td>0.7471</td>
  <td>0.7399</td>
  
  <!-- KorSTS -->
  <td>0.7725</td>
  <td>0.6503</td>
  <td>0.6191</td>
  
  <!-- HateSpeech -->
  <td>0.7537</td>
  <td>0.5605</td>
 </tr>
</table>

- Performance was measured on Colab using [the notebooks here](https://github.com/cosmoquester/transformers-bart-finetune).

## Used Datasets

### [๋ชจ๋‘์˜ ๋ง๋ญ‰์น˜](https://corpus.korean.go.kr/)
- ์ผ์ƒ ๋Œ€ํ™” ๋ง๋ญ‰์น˜ 2020
- ๊ตฌ์–ด ๋ง๋ญ‰์น˜
- ๋ฌธ์–ด ๋ง๋ญ‰์น˜
- ์‹ ๋ฌธ ๋ง๋ญ‰์น˜

### AIhub
- [Open data: Specialized-Domain Corpus (์ „๋ฌธ๋ถ„์•ผ๋ง๋ญ‰์น˜)](https://aihub.or.kr/aidata/30717)
- [Open data: Korean Dialogue Summarization (ํ•œ๊ตญ์–ด๋Œ€ํ™”์š”์•ฝ)](https://aihub.or.kr/aidata/30714)
- [Open data: Emotional Dialogue Corpus (๊ฐ์„ฑ ๋Œ€ํ™” ๋ง๋ญ‰์น˜)](https://aihub.or.kr/aidata/7978)
- [Open data: Korean Speech (ํ•œ๊ตญ์–ด ์Œ์„ฑ)](https://aihub.or.kr/aidata/105)
- [Open data: Korean SNS (ํ•œ๊ตญ์–ด SNS)](https://aihub.or.kr/aidata/30718)

### [์„ธ์ข… ๋ง๋ญ‰์น˜](https://ithub.korean.go.kr/)