# 😎 KoChatBART
[**BART**](https://arxiv.org/pdf/1910.13461.pdf) (**B**idirectional and **A**uto-**R**egressive **T**ransformers) is trained as an `autoencoder`: noise is added to part of the input text, and the model learns to restore the original. Korean Chat BART (hereafter **KoChatBART**) is a Korean `encoder-decoder` language model trained on roughly **10GB** or more of Korean dialogue text, using the `Text Infilling` noising function from the paper. We release the resulting `KoChatBART-base`, a model that is robust for dialogue generation.

## Quick tour
```python
from transformers import AutoTokenizer, BartForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("BM-K/KoChatBART")
model = BartForConditionalGeneration.from_pretrained("BM-K/KoChatBART")

# "안녕 세상아!" means "Hello, world!"
inputs = tokenizer("안녕 세상아!", return_tensors="pt")
outputs = model(**inputs)
```

## Pre-training data preprocessing
Datasets used
- [Topic-specific everyday conversation text data](https://aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=realm&dataSetSn=543)
- [Small-business customer order question-answer text](https://aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=realm&dataSetSn=102)
- [Korean SNS](https://aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=realm&dataSetSn=114)
- [Civil complaint work automation AI language data](https://aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=realm&dataSetSn=619)

To train KoChatBART, we preprocessed these Korean dialogue datasets and merged them into a large Korean dialogue corpus.
1. To reduce redundancy in the data, repeated expressions such as 'ㅋㅋㅋㅋㅋㅋ' were collapsed to two repetitions, as in 'ㅋㅋ'.
2. Because very short examples can hinder training, we kept only examples whose total length exceeds 3 tokens under the KoBART tokenizer.
3. Pseudonymized data was removed.

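The first two preprocessing steps above can be sketched roughly as follows. This is a minimal illustration and not the script used to build the corpus: the regular expression, the function names (`collapse_repeats`, `keep_utterance`, `preprocess`), and the `tokenize` callable are all assumptions introduced here. Step 3 is left out because the source does not say how pseudonymized records are marked.

```python
import re

# Assumed normalization rule: a run of three or more identical characters
# is collapsed to exactly two, e.g. "ㅋㅋㅋㅋㅋㅋ" -> "ㅋㅋ".
_REPEAT = re.compile(r"(.)\1{2,}")


def collapse_repeats(text: str) -> str:
    """Collapse runs of a repeated character down to two occurrences."""
    return _REPEAT.sub(r"\1\1", text)


def keep_utterance(text: str, tokenize, min_tokens: int = 3) -> bool:
    """Return True when the utterance has more than `min_tokens` tokens.

    `tokenize` is any callable mapping a string to a list of tokens;
    the README uses the KoBART tokenizer for this count.
    """
    return len(tokenize(text)) > min_tokens


def preprocess(utterances, tokenize):
    """Apply repeat-collapsing, then drop utterances that are too short."""
    cleaned = (collapse_repeats(u.strip()) for u in utterances)
    return [u for u in cleaned if keep_utterance(u, tokenize)]
```

In practice `tokenize` would be the `tokenize` method of the KoBART tokenizer; any callable that returns a token list works for experimentation, for example `preprocess(lines, str.split)`. Note that this sketch collapses every repeated character, including digits and punctuation (so `1000` would become `100`), which a real pipeline would probably want to restrict to specific characters.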
## Model
| Model | # of params | vocab size | Type | # of layers | # of heads | ffn_dim | hidden_dims |
| ------------- | :---------: | :-----: | :----------: | ---------: | ------: | ----------: | ----------: |
| `KoChatBART` | 139M | 50265 | Encoder | 6 | 16 | 3072 | 768 |
| | | | Decoder | 6 | 16 | 3072 | 768 |

## Dialogue generation performance evaluation
Each model was fine-tuned using the [Dialogue Generator](https://github.com/2unju/KoBART_Dialogue_Generator) code.
To evaluate dialogue generation, the tokenized responses generated at inference time were first restored to plain text, and a BPE tokenizer was then used to measure overlap (BLEU) and distinct (Dist-n) scores between the reference responses and the generated responses.

> **Warning**
> 일반적으둜 짧은 λŒ€ν™” λ°μ΄ν„°λ‘œ λͺ¨λΈμ„ μ‚¬μ „ν•™μŠ΅ν•˜μ˜€κΈ° λ•Œλ¬Έμ— κΈ΄ λ¬Έμž₯ μ²˜λ¦¬κ°€ μš”κ΅¬λ˜λŠ” νƒœμŠ€ν¬(μš”μ•½) 등에 λŒ€ν•΄μ„œλŠ” μ•½ν•œ λͺ¨μŠ΅μ„ λ³΄μž…λ‹ˆλ‹€. ### μ‹€ν—˜ κ²°κ³Ό - [감성 λŒ€ν™” 데이터](https://github.com/songys/Chatbot_data) |Training|Validation|Test| |:----:|:----:|:----:| |9,458|1,182|1,183| | Model | Param | BLEU-3 | BLEU-4 | Dist-1 | Dist-2 | |------------------------|:----:|:----:|:----:|:----:|:----:| | KoBART | 124M | 8.73 | 7.12 | 16.85 | 34.89 | | KoChatBART | 139M | **12.97** | **11.23** | **19.64** | **44.53** | | KoT5-ETRI | 324M | 12.10 | 10.14 | 16.97 | 40.09 | - [μ†Œμƒκ³΅μΈ λŒ€ν™” 데이터](https://github.com/2unju/AIHub_Chitchat_dataset_parser) |Training|Validation|Test| |:----:|:----:|:----:| |29,093|1,616|1,616| | Model | Param | BLEU-3 | BLEU-4 | Dist-1 | Dist-2 | |------------------------|:----:|:----:|:----:|:----:|:----:| | KoBART | 124M | 10.04 | 7.24 | 13.76| 42.09 | | KoChatBART | 139M | **10.11** | **7.26** | **15.12** | **46.08** | | KoT5-ETRI | 324M | 9.45 | 6.66 | 14.50 | 45.46 | ## Contributors ## Reference - [KoBART](https://github.com/SKT-AI/KoBART)