# KoChatBART
[**BART**](https://arxiv.org/pdf/1910.13461.pdf) (**B**idirectional and **A**uto-**R**egressive **T**ransformers) is trained as an `autoencoder`: noise is added to part of the input text, and the model learns to reconstruct the original text. Korean chat BART (hereafter **KoChatBART**) is a Korean `encoder-decoder` language model pre-trained on more than about **10GB** of Korean dialogue text, using the `Text Infilling` noise function from the paper. We release the resulting `KoChatBART-base`, which is robust for dialogue generation.
<img src=https://user-images.githubusercontent.com/55969260/205434343-b72641e9-d0f9-4b88-a334-9f904e0a35c5.png>
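The `Text Infilling` objective can be sketched as follows: contiguous spans of tokens are replaced with a single mask token, and the model must regenerate the original sequence. The mask token, span lengths, and masking ratio below are illustrative assumptions, not the exact pre-training settings.

```python
import random

def text_infilling(tokens, mask_token="<mask>", mask_ratio=0.3, seed=0):
    """BART-style Text Infilling sketch: replace random spans with ONE mask token each."""
    rng = random.Random(seed)
    budget = int(len(tokens) * mask_ratio)  # how many tokens we may mask in total
    out, i = [], 0
    while i < len(tokens):
        span = rng.randint(1, 3)  # the paper samples span lengths from Poisson(3)
        if budget >= span and rng.random() < 0.3:
            out.append(mask_token)  # the whole span collapses to a single mask token
            i += span
            budget -= span
        else:
            out.append(tokens[i])
            i += 1
    return out

corrupted = text_infilling(["나는", "오늘", "친구와", "같이", "밥을", "먹었다"])
print(corrupted)
```

The decoder is then trained to reproduce the uncorrupted sequence, which forces the model to predict both the content and the length of each masked span.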
## Quick tour
```python
from transformers import AutoTokenizer, BartForConditionalGeneration
tokenizer = AutoTokenizer.from_pretrained("BM-K/KoChatBART")
model = BartForConditionalGeneration.from_pretrained("BM-K/KoChatBART")
inputs = tokenizer("안녕 세상아!", return_tensors="pt")
outputs = model(**inputs)
```
## Pre-training Data Preprocessing

Datasets used:
- [Topic-based everyday conversation text data](https://aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=realm&dataSetSn=543)
- [Small-business customer order Q&A text](https://aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=realm&dataSetSn=102)
- [Korean SNS](https://aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=realm&dataSetSn=114)
- [Civil-complaint automation AI language data](https://aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=realm&dataSetSn=619)

To train KoChatBART, these Korean dialogue datasets were preprocessed and merged into one large Korean dialogue corpus.
1. To reduce duplication in the data, any character repeated more than twice (e.g. 'ㅋㅋㅋㅋㅋㅋ') was collapsed to two repetitions (e.g. 'ㅋㅋ').
2. Because overly short examples can hinder training, only examples whose total token length under the KoBART tokenizer exceeds 3 were kept.
3. Pseudonymized data were removed.
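The first two steps above can be sketched in Python. The exact rules used for the released corpus may differ; the regular expression and threshold below are illustrative.

```python
import re

def collapse_repeats(text: str, keep: int = 2) -> str:
    """Step 1: collapse any character repeated more than `keep` times (e.g. 'ㅋㅋㅋㅋㅋㅋ' -> 'ㅋㅋ')."""
    return re.sub(r"(.)\1{%d,}" % keep, lambda m: m.group(1) * keep, text)

def is_long_enough(tokens: list, min_tokens: int = 3) -> bool:
    """Step 2: keep only examples whose token count exceeds `min_tokens`."""
    return len(tokens) > min_tokens

print(collapse_repeats("재밌다ㅋㅋㅋㅋㅋㅋ"))  # -> 재밌다ㅋㅋ
```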
## Model
| Model | # of params | vocab size | Type | # of layers | # of heads | ffn_dim | hidden_dims |
| ------------- | :---------: | :-----: | :----------: | ---------: | ------: | ----------: | ----------: |
| `KoChatBART` | 139M | 50265 | Encoder | 6 | 16 | 3072 | 768 |
| | | | Decoder | 6 | 16 | 3072 | 768 |
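The ~139M parameter count is consistent with the dimensions in the table; a rough back-of-the-envelope estimate (weight matrices only, ignoring biases, LayerNorms, and positional embeddings) lands within about 1M of it:

```python
V, d, ffn, n_enc, n_dec = 50265, 768, 3072, 6, 6

emb = V * d                        # shared token embedding matrix
attn = 4 * d * d                   # Q, K, V and output projections
ffn_block = 2 * d * ffn            # the two feed-forward linear layers
enc_layer = attn + ffn_block       # self-attention + FFN
dec_layer = 2 * attn + ffn_block   # self-attention + cross-attention + FFN

total = emb + n_enc * enc_layer + n_dec * dec_layer
print(f"~{total / 1e6:.0f}M parameters")  # ~138M; biases etc. make up the rest
```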
## Dialogue Generation Performance

Each model was fine-tuned based on the following code: [(Dialogue Generator)](https://github.com/2unju/KoBART_Dialogue_Generator). To measure dialogue generation performance, the tokenized responses generated at inference time were first detokenized back to plain text, and then the overlap (BLEU) and distinct metrics between the reference responses and the generated responses were measured using a BPE tokenizer.
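The distinct metric is commonly defined as the ratio of unique n-grams to all n-grams across the generated responses; a minimal sketch of that definition (not the repository's exact evaluation code):

```python
def distinct_n(responses, n=1):
    """Distinct-n: number of unique n-grams / total n-grams over all tokenized responses."""
    ngrams = []
    for tokens in responses:
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

responses = [["밥", "은", "먹었", "어", "?"], ["응", "밥", "은", "먹었", "지"]]
print(distinct_n(responses, 1), distinct_n(responses, 2))
```

Higher values indicate more diverse (less repetitive) generations, complementing the overlap-based BLEU scores below.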
> **Warning** <br>
> Because the model was pre-trained on generally short dialogue data, it performs poorly on tasks that require handling long sentences, such as summarization.
### Experimental Results

- [Emotional dialogue data](https://github.com/songys/Chatbot_data)

|Training|Validation|Test|
|:----:|:----:|:----:|
|9,458|1,182|1,183|

| Model | Param | BLEU-3 | BLEU-4 | Dist-1 | Dist-2 |
|------------------------|:----:|:----:|:----:|:----:|:----:|
| KoBART | 124M | 8.73 | 7.12 | 16.85 | 34.89 |
| KoChatBART | 139M | **12.97** | **11.23** | **19.64** | **44.53** |
| KoT5-ETRI | 324M | 12.10 | 10.14 | 16.97 | 40.09 |

- [Small-business dialogue data](https://github.com/2unju/AIHub_Chitchat_dataset_parser)

|Training|Validation|Test|
|:----:|:----:|:----:|
|29,093|1,616|1,616|

| Model | Param | BLEU-3 | BLEU-4 | Dist-1 | Dist-2 |
|------------------------|:----:|:----:|:----:|:----:|:----:|
| KoBART | 124M | 10.04 | 7.24 | 13.76| 42.09 |
| KoChatBART | 139M | **10.11** | **7.26** | **15.12** | **46.08** |
| KoT5-ETRI | 324M | 9.45 | 6.66 | 14.50 | 45.46 |
## Contributors
<a href="https://github.com/BM-K/KoChatBART/graphs/contributors">
<img src="https://contrib.rocks/image?repo=BM-K/KoChatBART" />
</a>
## Reference
- [KoBART](https://github.com/SKT-AI/KoBART)