metadata
license: mit
datasets:
- kakaobrain/kor_nli
- kakaobrain/kor_nlu
- klue/klue
language:
- ko
metrics:
- spearmanr
- pearsonr
pipeline_tag: sentence-similarity
SimCSE-KO
1. Intro
This is a Korean SimCSE model (BERT, supervised).
It was trained with newly written code rather than the Princeton NLP code.
It can be used to judge how semantically related two sentences are by computing the cosine similarity between their embeddings.
- Github: https://github.com/snumin44/SimCSE-KO
- Original Code: https://github.com/princeton-nlp/SimCSE
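
The cosine-similarity comparison described in the intro can be sketched as follows. This is a minimal, library-free illustration, not the repository's code: the three short vectors are hypothetical stand-ins for sentence embeddings, which in practice would be the model's `[CLS]` vectors (the settings below list `pooler: cls`).

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical 4-dimensional embeddings (real ones would come from the model).
emb_a = [0.2, 0.1, -0.4, 0.7]      # sentence A
emb_b = [0.25, 0.05, -0.35, 0.6]   # sentence B, close in meaning to A
emb_c = [-0.6, 0.3, 0.5, -0.1]     # sentence C, unrelated to A

score_ab = cosine_similarity(emb_a, emb_b)
score_ac = cosine_similarity(emb_a, emb_c)
print(f"A vs B: {score_ab:.4f}")
print(f"A vs C: {score_ac:.4f}")
```

A higher score indicates that the two sentences are closer in the embedding space. The score ranges from -1 to 1.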
2. Experiment Settings
- Model: klue/bert-base
- Dataset: KorNLI-train (supervised training), KorSTS-dev (evaluation)
- epoch: 1
- max length: 64
- batch size: 256
- learning rate: 5e-5
- dropout: 0.1
- temperature: 0.05
- pooler: cls
- 1 A100 GPU
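
To show where the temperature of 0.05 enters, here is a rough sketch of the supervised SimCSE objective as described in the original paper: each premise embedding is scored against every positive (entailment) and every hard negative (contradiction) in the batch, and cross-entropy is taken over the temperature-scaled cosine similarities. The function names and the toy 2-dimensional vectors are hypothetical, and this repository's training code may differ in detail.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def supervised_simcse_loss(anchors, positives, hard_negatives, temp=0.05):
    """Mean cross-entropy where anchor i must pick positive i among
    all in-batch positives and all in-batch hard negatives."""
    losses = []
    for i, h in enumerate(anchors):
        logits = [cosine(h, p) / temp for p in positives]
        logits += [cosine(h, n) / temp for n in hard_negatives]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


# Toy batch of two triplets (hypothetical embeddings).
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.9, 0.1], [0.1, 0.9]]
hard_negatives = [[-1.0, 0.0], [0.0, -1.0]]

loss_aligned = supervised_simcse_loss(anchors, positives, hard_negatives)
# Swapping the positives pairs each anchor with the wrong sentence.
loss_swapped = supervised_simcse_loss(anchors, positives[::-1], hard_negatives)
print(loss_aligned, loss_swapped)
```

A small temperature sharpens the softmax, so a mismatched positive is penalized heavily while a well-aligned pair yields a loss close to zero.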
3. Performance
(1) KorSTS-test
Model | AVG | Cosine Pearson | Cosine Spearman | Euclidean Pearson | Euclidean Spearman | Manhattan Pearson | Manhattan Spearman | Dot Pearson | Dot Spearman |
---|---|---|---|---|---|---|---|---|---|
SimCSE-BERT-KO (unsup) | 72.85 | 73.00 | 72.77 | 72.96 | 72.92 | 72.93 | 72.86 | 72.80 | 72.53 |
SimCSE-BERT-KO (sup) | 85.98 | 86.05 | 86.00 | 85.88 | 86.08 | 85.90 | 86.08 | 85.96 | 85.89 |
SimCSE-RoBERTa-KO (unsup) | 75.79 | 76.39 | 75.57 | 75.71 | 75.52 | 75.65 | 75.42 | 76.41 | 75.63 |
SimCSE-RoBERTa-KO (sup) | 83.06 | 82.67 | 83.21 | 83.22 | 83.27 | 83.24 | 83.28 | 82.54 | 83.03 |
(2) KLUE-dev
Model | AVG | Cosine Pearson | Cosine Spearman | Euclidean Pearson | Euclidean Spearman | Manhattan Pearson | Manhattan Spearman | Dot Pearson | Dot Spearman |
---|---|---|---|---|---|---|---|---|---|
SimCSE-BERT-KO (unsup) | 65.27 | 66.27 | 64.31 | 66.18 | 64.05 | 66.00 | 63.77 | 66.64 | 64.93 |
SimCSE-BERT-KO (sup) | 83.96 | 82.98 | 84.32 | 84.32 | 84.30 | 84.28 | 84.20 | 83.00 | 84.29 |
SimCSE-RoBERTa-KO (unsup) | 80.78 | 81.20 | 80.35 | 81.27 | 80.36 | 81.28 | 80.40 | 81.13 | 80.26 |
SimCSE-RoBERTa-KO (sup) | 85.31 | 84.14 | 85.64 | 86.09 | 85.68 | 86.04 | 85.65 | 83.94 | 85.30 |
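
The Pearson and Spearman columns in the tables above are correlations, scaled by 100, between the model's similarity scores and the gold labels. The sketch below shows that computation in plain Python. The gold labels and predicted scores are hypothetical, and the actual evaluation script may rely on a library such as SciPy instead.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical gold similarity labels (0 to 5) and predicted cosine scores.
gold = [4.8, 0.5, 3.2, 1.0, 2.5]
pred = [0.93, 0.20, 0.58, 0.12, 0.60]

print(f"Cosine Pearson:  {100 * pearson(gold, pred):.2f}")
print(f"Cosine Spearman: {100 * spearman(gold, pred):.2f}")
```

The AVG column is consistent with the arithmetic mean of the eight correlation columns in each row.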
Citing
@inproceedings{gao2021simcse,
title={{SimCSE}: Simple Contrastive Learning of Sentence Embeddings},
author={Gao, Tianyu and Yao, Xingcheng and Chen, Danqi},
booktitle={Empirical Methods in Natural Language Processing (EMNLP)},
year={2021}
}
@article{ham2020kornli,
title={KorNLI and KorSTS: New Benchmark Datasets for Korean Natural Language Understanding},
author={Ham, Jiyeon and Choe, Yo Joong and Park, Kyubyong and Choi, Ilji and Soh, Hyungjoon},
journal={arXiv preprint arXiv:2004.03289},
year={2020}
}