File size: 7,292 Bytes
4073b61
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
---
license: mit
language:
- ko
pipeline_tag: feature-extraction
---

# KoSimCSE Training on Amazon SageMaker


## Usage

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoConfig, PretrainedConfig, PreTrainedModel
from transformers import AutoModel, AutoTokenizer, logging

class SimCSEConfig(PretrainedConfig):
    def __init__(self, version=1.0, **kwargs):
        self.version = version
        super().__init__(**kwargs)

class SimCSEModel(PreTrainedModel):
    config_class = SimCSEConfig

    def __init__(self, config):
        super().__init__(config)
        self.backbone = AutoModel.from_pretrained(config.base_model)
        self.hidden_size: int = self.backbone.config.hidden_size
        self.dense = nn.Linear(self.hidden_size, self.hidden_size)
        self.activation = nn.Tanh()

    def forward(
        self,
        input_ids: Tensor,
        attention_mask: Tensor = None,
        # RoBERTa variants don't have token_type_ids, so this argument is optional
        token_type_ids: Tensor = None,
    ) -> Tensor:
        # shape of input_ids: (batch_size, seq_len)
        # shape of attention_mask: (batch_size, seq_len)
        outputs: BaseModelOutputWithPoolingAndCrossAttentions = self.backbone(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
        )

        emb = outputs.last_hidden_state[:, 0]

        if self.training:
            emb = self.dense(emb)
            emb = self.activation(emb)

        return emb
    
def show_embedding_score(tokenizer, model, sentences):
    inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    embeddings = model(**inputs)
    score01 = cal_score(embeddings[0,:], embeddings[1,:])
    score02 = cal_score(embeddings[0,:], embeddings[2,:])
    print(score01, score02)

def cal_score(a, b):
    if len(a.shape) == 1: a = a.unsqueeze(0)
    if len(b.shape) == 1: b = b.unsqueeze(0)
    a_norm = a / a.norm(dim=1)[:, None]
    b_norm = b / b.norm(dim=1)[:, None]
    return torch.mm(a_norm, b_norm.transpose(0, 1)) * 100 

# Load pre-trained model
model = SimCSEModel.from_pretrained("daekeun-ml/KoSimCSE-supervised-roberta-large")
tokenizer = AutoTokenizer.from_pretrained("daekeun-ml/KoSimCSE-supervised-roberta-large")

# Inference example
sentences = ['์ด๋ฒˆ ์ฃผ ์ผ์š”์ผ์— ๋ถ„๋‹น ์ด๋งˆํŠธ ์ ์€ ๋ฌธ์„ ์—ฌ๋‚˜์š”?',
             '์ผ์š”์ผ์— ๋ถ„๋‹น ์ด๋งˆํŠธ๋Š” ๋ฌธ ์—ด์–ด์š”?',
             '๋ถ„๋‹น ์ด๋งˆํŠธ ์ ์€ ํ† ์š”์ผ์— ๋ช‡ ์‹œ๊นŒ์ง€ ํ•˜๋‚˜์š”']

show_embedding_score(tokenizer, model.cpu(), sentences)
```
    

## Introduction

[SimCSE](https://aclanthology.org/2021.emnlp-main.552/) is a highly efficient and innovative embedding technique based on the concept of contrastive learning. Unsupervised learning can be performed without the need to prepare ground-truth labels, and high-performance supervised learning can be performed if a good NLI (Natural Language Inference) dataset is prepared. The concept is very simple and the psudeo-code is intuitive, so the implementation is not difficult, but I have seen many people still struggle to train this model.

The official implementation code from the authors of the paper is publicly available, but it is not suitable for a step-by-step implementation. Therefore, we have reorganized the code based on [Simple-SIMCSE's GitHub](https://github.com/hppRC/simple-simcse) so that even ML beginners can train the model from the scratch with a step-by-step implementation. It's minimalist code for beginners, but data scientists and ML engineers can also make good use of it.

### Added over Simple-SimCSE
- Added the Supervised Learning part, which shows you step-by-step how to construct the training dataset.
- Added Distributed Learning Logic. If you have a multi-GPU setup, you can train faster.
- Added SageMaker Training. `ml.g4dn.xlarge` trains well, but we recommend `ml.g4dn.12xlarge` or` ml.g5.12xlarge` for faster training.

## Requirements
We recommend preparing an Amazon SageMaker instance with the specifications below to perform this hands-on.

### SageMaker Notebook instance
- `ml.g4dn.xlarge`

### SageMaker Training instance
- `ml.g4dn.xlarge` (Minimum)
- `ml.g5.12xlarge` (Recommended)

## Datasets

For supervised learning, you need an NLI dataset that specifies the relationship between the two sentences. For unsupervised learning, we recommend using wikipedia raw data separated into sentences. This hands-on uses the dataset registered with huggingface, but you can also configure your own dataset.

The datasets used in this hands-on are as follows

#### Supervised
- [Klue-NLI](https://huggingface.co/datasets/klue/viewer/nli/)
- [Kor-NLI](https://huggingface.co/datasets/kor_nli)

#### Unsupervised 
- [kowiki-sentences](https://huggingface.co/datasets/heegyu/kowiki-sentences): Data from 20221001 Korean wiki split into sentences using kss (backend=mecab) morphological analyzer.

## How to train
- See https://github.com/daekeun-ml/KoSimCSE-SageMaker

## Performance 
We trained with parameters similar to those in the paper and did not perform any parameter tuning. Higher max sequence length does not guarantee higher performance; building a good NLI dataset is more important

```json
{
  "batch_size": 64,
  "num_epochs": 1 (for unsupervised training), 3 (for supervised training)
  "lr": 3e-05,
  "num_warmup_steps": 0,
  "temperature": 0.05,
  "lr_scheduler_type": "linear",
  "max_seq_len": 32,
  "use_fp16": "True",
}
```


### KLUE-STS
| Model                  | Avg | Cosine Pearson | Cosine Spearman | Euclidean Pearson | Euclidean Spearman | Manhattan Pearson | Manhattan Spearman | Dot Pearson | Dot Spearman |
|------------------------|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|
| KoSimCSE-RoBERTa-base (Unsupervised) | 81.17 | 81.27 | 80.96 | 81.70 | 80.97 | 81.63 | 80.89 | 81.12 | 80.81 |
| KoSimCSE-RoBERTa-base (Supervised) | 84.19 | 83.04 | 84.46 | 84.97 | 84.50 | 84.95 | 84.45 | 82.88 | 84.28 |
| KoSimCSE-RoBERTa-large (Unsupervised) | 81.96 | 82.09 | 81.71 | 82.45 | 81.73 | 82.42 | 81.69 | 81.98 | 81.58 |
| KoSimCSE-RoBERTa-large (Supervised) | 85.37 | 84.38 | 85.99 | 85.97 | 85.81 | 86.00 | 85.79 | 83.87 | 85.15 |

### Kor-STS
| Model                  | Avg | Cosine Pearson | Cosine Spearman | Euclidean Pearson | Euclidean Spearman | Manhattan Pearson | Manhattan Spearman | Dot Pearson | Dot Spearman |
|------------------------|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|
| KoSimCSE-RoBERTa-base (Unsupervised) | 81.20 | 81.53 | 81.17 | 80.89 | 81.20 | 80.93 | 81.22 | 81.48 | 81.14 |
| KoSimCSE-RoBERTa-base (Supervised) | 85.33 | 85.16 | 85.46 | 85.37 | 85.45 | 85.31 | 85.37 | 85.13 | 85.41 |
| KoSimCSE-RoBERTa-large (Unsupervised) | 81.71 | 82.10 | 81.78 | 81.12 | 81.78 | 81.15 | 81.80 | 82.15 | 81.80 |
| KoSimCSE-RoBERTa-large (Supervised) | 85.54 | 85.41 | 85.78 | 85.18 | 85.51 | 85.26 | 85.61 | 85.70 | 85.90 |

  
## References
- Simple-SimCSE: https://github.com/hppRC/simple-simcse
- KoSimCSE: https://github.com/BM-K/KoSimCSE-SKT
- SimCSE (official): https://github.com/princeton-nlp/SimCSE
- SimCSE paper: https://aclanthology.org/2021.emnlp-main.552