File size: 2,684 Bytes
2672894 4aebcc9 e5764c8 75e1d14 4aebcc9 a14facc b350d30 32c6ec5 4aebcc9 c8b7a69 4aebcc9 c8b7a69 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 |
---
language: ko
---
# Pretrained BART in Korean
This is pretrained BART model with multiple Korean Datasets.
I used multiple datasets for generalizing the model for both colloquial and written texts.
The training is supported by [TPU Research Cloud](https://sites.research.google/trc/) program.
The script which is used to pre-train model is [here](https://github.com/cosmoquester/transformers-bart-pretrain).
When you use the reference API, you must wrap the sentence with `[BOS]` and `[EOS]` like below example.
```
[BOS] ์๋
ํ์ธ์? ๋ฐ๊ฐ์์~~ [EOS]
```
You can also test mask filling performance using `[MASK]` token like this.
```
[BOS] [MASK] ๋จน์์ด? [EOS]
```
## Benchmark
<style>
table {
border-collapse: collapse;
border-style: hidden;
width: 100%;
}
td, th {
border: 1px solid #4d5562;
padding: 8px;
}
</style>
<table>
<tr>
<th>Dataset</th>
<td>KLUE NLI dev</th>
<td>NSMC test</td>
<td>QuestionPair test</td>
<td colspan="2">KLUE TC dev</td>
<td colspan="3">KLUE STS dev</td>
<td colspan="3">KorSTS dev</td>
<td colspan="2">HateSpeech dev</td>
</tr>
<tr>
<th>Metric</th>
<!-- KLUE NLI -->
<td>Acc</th>
<!-- NSMC -->
<td>Acc</td>
<!-- QuestionPair -->
<td>Acc</td>
<!-- KLUE TC -->
<td>Acc</td>
<td>F1</td>
<!-- KLUE STS -->
<td>F1</td>
<td>Pearson</td>
<td>Spearman</td>
<!-- KorSTS -->
<td>F1</td>
<td>Pearson</td>
<td>Spearman</td>
<!-- HateSpeech -->
<td>Bias Acc</td>
<td>Hate Acc</td>
</tr>
<tr>
<th>Score</th>
<!-- KLUE NLI -->
<td>0.639</th>
<!-- NSMC -->
<td>0.8721</td>
<!-- QuestionPair -->
<td>0.905</td>
<!-- KLUE TC -->
<td>0.8551</td>
<td>0.8515</td>
<!-- KLUE STS -->
<td>0.7406</td>
<td>0.7593</td>
<td>0.7551</td>
<!-- KorSTS -->
<td>0.7897</td>
<td>0.7269</td>
<td>0.7037</td>
<!-- HateSpeech -->
<td>0.8068</td>
<td>0.5966</td>
</tr>
</table>
- The performance was measured using [the notebooks here](https://github.com/cosmoquester/transformers-bart-finetune) with colab.
## Used Datasets
### [๋ชจ๋์ ๋ง๋ญ์น](https://corpus.korean.go.kr/)
- ์ผ์ ๋ํ ๋ง๋ญ์น 2020
- ๊ตฌ์ด ๋ง๋ญ์น
- ๋ฌธ์ด ๋ง๋ญ์น
- ์ ๋ฌธ ๋ง๋ญ์น
### AIhub
- [๊ฐ๋ฐฉ๋ฐ์ดํฐ ์ ๋ฌธ๋ถ์ผ๋ง๋ญ์น](https://aihub.or.kr/aidata/30717)
- [๊ฐ๋ฐฉ๋ฐ์ดํฐ ํ๊ตญ์ด๋ํ์์ฝ](https://aihub.or.kr/aidata/30714)
- [๊ฐ๋ฐฉ๋ฐ์ดํฐ ๊ฐ์ฑ ๋ํ ๋ง๋ญ์น](https://aihub.or.kr/aidata/7978)
- [๊ฐ๋ฐฉ๋ฐ์ดํฐ ํ๊ตญ์ด ์์ฑ](https://aihub.or.kr/aidata/105)
- [๊ฐ๋ฐฉ๋ฐ์ดํฐ ํ๊ตญ์ด SNS](https://aihub.or.kr/aidata/30718)
### [์ธ์ข
๋ง๋ญ์น](https://ithub.korean.go.kr/)
|