|
--- |
|
license: mit |
|
language: |
|
- ko |
|
pipeline_tag: feature-extraction |
|
--- |
|
|
|
# KoSimCSE Training on Amazon SageMaker |
|
|
|
|
|
## Usage |
|
|
|
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoConfig, PretrainedConfig, PreTrainedModel
from transformers import AutoModel, AutoTokenizer, logging
from transformers.modeling_outputs import BaseModelOutputWithPoolingAndCrossAttentions


class SimCSEConfig(PretrainedConfig):
    def __init__(self, version=1.0, **kwargs):
        self.version = version
        super().__init__(**kwargs)


class SimCSEModel(PreTrainedModel):
    config_class = SimCSEConfig

    def __init__(self, config):
        super().__init__(config)
        self.backbone = AutoModel.from_pretrained(config.base_model)
        self.hidden_size: int = self.backbone.config.hidden_size
        # Projection head; applied only during training (see forward)
        self.dense = nn.Linear(self.hidden_size, self.hidden_size)
        self.activation = nn.Tanh()

    def forward(
        self,
        input_ids: Tensor,
        attention_mask: Tensor = None,
        # RoBERTa variants don't have token_type_ids, so this argument is optional
        token_type_ids: Tensor = None,
    ) -> Tensor:
        # shape of input_ids: (batch_size, seq_len)
        # shape of attention_mask: (batch_size, seq_len)
        outputs: BaseModelOutputWithPoolingAndCrossAttentions = self.backbone(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
        )

        # [CLS] pooling: the first token's hidden state is the sentence embedding
        emb = outputs.last_hidden_state[:, 0]

        if self.training:
            emb = self.dense(emb)
            emb = self.activation(emb)

        return emb


def show_embedding_score(tokenizer, model, sentences):
    inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    embeddings = model(**inputs)
    score01 = cal_score(embeddings[0, :], embeddings[1, :])  # similar pair
    score02 = cal_score(embeddings[0, :], embeddings[2, :])  # dissimilar pair
    print(score01, score02)


def cal_score(a, b):
    # Cosine similarity scaled to [-100, 100]
    if len(a.shape) == 1:
        a = a.unsqueeze(0)
    if len(b.shape) == 1:
        b = b.unsqueeze(0)
    a_norm = a / a.norm(dim=1)[:, None]
    b_norm = b / b.norm(dim=1)[:, None]
    return torch.mm(a_norm, b_norm.transpose(0, 1)) * 100


# Load pre-trained model and tokenizer
model = SimCSEModel.from_pretrained("daekeun-ml/KoSimCSE-unsupervised-roberta-base")
tokenizer = AutoTokenizer.from_pretrained("daekeun-ml/KoSimCSE-unsupervised-roberta-base")

# Inference example
sentences = ['이번 주 일요일에 분당 이마트 점은 문을 여나요?',
             '일요일에 분당 이마트는 문 열어요?',
             '분당 이마트 점은 토요일에 몇 시까지 하나요']

show_embedding_score(tokenizer, model.cpu(), sentences)
```
|
|
|
|
|
## Introduction |
|
|
|
[SimCSE](https://aclanthology.org/2021.emnlp-main.552/) is a highly efficient and innovative embedding technique based on contrastive learning. Unsupervised training requires no ground-truth labels, and high-performance supervised training is possible if a good NLI (Natural Language Inference) dataset is available. In the unsupervised setting, the same sentence is encoded twice with different dropout masks to form a positive pair, while the other sentences in the batch serve as in-batch negatives; in the supervised setting, entailment pairs act as positives and contradiction pairs as hard negatives. The concept is very simple and the pseudo-code is intuitive, so the implementation is not difficult, but I have seen many people still struggle to train this model.
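The core idea fits in a few lines. The sketch below only illustrates the unsupervised objective (InfoNCE with in-batch negatives); it is not the training code used in this repository, and `encoder` is assumed to behave like the `SimCSEModel` above, returning one embedding per sentence.

```python
import torch
import torch.nn.functional as F


def unsupervised_simcse_loss(encoder, tokenizer, sentences, temperature=0.05):
    """One unsupervised SimCSE step (sketch): encode the same batch twice so that
    different dropout masks produce two views of each sentence."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

    encoder.train()          # dropout must be active; it is the only "augmentation"
    z1 = encoder(**batch)    # (batch_size, hidden_size)
    z2 = encoder(**batch)    # same inputs, different dropout mask

    # Pairwise cosine similarities scaled by the temperature
    sim = F.cosine_similarity(z1.unsqueeze(1), z2.unsqueeze(0), dim=-1) / temperature

    # The positive for sentence i is its own second view; other sentences are in-batch negatives
    labels = torch.arange(sim.size(0))
    return F.cross_entropy(sim, labels)
```

Following the paper, the projection head (`dense` + `tanh` in the model above) is applied only during training and skipped at inference time.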
|
|
|
The official implementation from the paper's authors is publicly available, but it is not well suited to a step-by-step walkthrough. We therefore reorganized the code based on [Simple-SimCSE's GitHub](https://github.com/hppRC/simple-simcse) so that even ML beginners can train the model from scratch, step by step. It is minimalist code aimed at beginners, but data scientists and ML engineers can also make good use of it.
|
|
|
### Added over Simple-SimCSE |
|
- Added the supervised learning part, which shows step by step how to construct the training dataset from NLI data (a minimal sketch follows this list).

- Added distributed training logic, so multi-GPU setups train faster.

- Added SageMaker training. `ml.g4dn.xlarge` trains fine, but we recommend `ml.g4dn.12xlarge` or `ml.g5.12xlarge` for faster training.
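As a rough sketch of the supervised data construction (the step-by-step notebooks live in the GitHub repository referenced below), entailment hypotheses become positives and contradiction hypotheses for the same premise become hard negatives. The column names and label ids below assume the KLUE-NLI dataset; adjust them for other NLI sources.

```python
from collections import defaultdict
from datasets import load_dataset

# KLUE-NLI label ids: 0 = entailment, 1 = neutral, 2 = contradiction
nli = load_dataset("klue", "nli", split="train")

hyps_by_premise = defaultdict(dict)
for row in nli:
    # Keep one hypothesis per (premise, label); good enough for a sketch
    hyps_by_premise[row["premise"]][row["label"]] = row["hypothesis"]

# A supervised training example is (anchor, positive, hard_negative)
triplets = [
    {"anchor": p, "positive": h[0], "hard_negative": h[2]}
    for p, h in hyps_by_premise.items()
    if 0 in h and 2 in h
]
print(len(triplets), triplets[0])
```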
|
|
|
## Requirements |
|
We recommend preparing an Amazon SageMaker instance with the specifications below to perform this hands-on. |
|
|
|
### SageMaker Notebook instance |
|
- `ml.g4dn.xlarge` |
|
|
|
### SageMaker Training instance |
|
- `ml.g4dn.xlarge` (Minimum) |
|
- `ml.g5.12xlarge` (Recommended) |
|
|
|
## Datasets |
|
|
|
For supervised learning, you need an NLI dataset that specifies the relationship between two sentences. For unsupervised learning, we recommend raw Wikipedia text split into sentences. This hands-on uses datasets hosted on the Hugging Face Hub, but you can also bring your own dataset.
|
|
|
The datasets used in this hands-on are listed below (a minimal loading sketch follows the lists):
|
|
|
#### Supervised |
|
- [Klue-NLI](https://huggingface.co/datasets/klue/viewer/nli/) |
|
- [Kor-NLI](https://huggingface.co/datasets/kor_nli) |
|
|
|
#### Unsupervised |
|
- [kowiki-sentences](https://huggingface.co/datasets/heegyu/kowiki-sentences): Korean Wikipedia dump (2022-10-01) split into sentences with kss (backend=mecab).
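A minimal loading sketch with the 🤗 `datasets` library is shown below; the `kor_nli` configuration name and the column layout of `heegyu/kowiki-sentences` are assumptions, so check each dataset card before use.

```python
from datasets import load_dataset

# Supervised: NLI datasets with (premise, hypothesis, label) columns
klue_nli = load_dataset("klue", "nli", split="train")
kor_nli = load_dataset("kor_nli", "multi_nli", split="train")   # config name may differ

# Unsupervised: Korean Wikipedia split into sentences (check the card for column names)
kowiki = load_dataset("heegyu/kowiki-sentences", split="train")

print(klue_nli[0])
print(kowiki[0])
```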
|
|
|
## How to train |
|
- See https://github.com/daekeun-ml/KoSimCSE-SageMaker |
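For reference, launching the training script as a SageMaker training job usually looks like the sketch below. The entry point, source directory, hyperparameter names, and framework versions are illustrative placeholders; use the launcher in the repository above for the actual settings.

```python
import sagemaker
from sagemaker.huggingface import HuggingFace

role = sagemaker.get_execution_role()

# Hypothetical entry point and hyperparameter names: check the repository for the real ones
estimator = HuggingFace(
    entry_point="train.py",          # training script (placeholder name)
    source_dir="./src",              # directory containing the script (placeholder)
    role=role,
    instance_type="ml.g5.12xlarge",  # recommended; ml.g4dn.xlarge is the minimum
    instance_count=1,
    transformers_version="4.26",     # pick a supported DLC version combination
    pytorch_version="1.13",
    py_version="py39",
    hyperparameters={
        "base_model": "klue/roberta-base",
        "batch_size": 64,
        "lr": 3e-5,
        "max_seq_len": 32,
    },
)

estimator.fit()  # optionally pass {"training": "s3://..."} to mount data channels
```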
|
|
|
## Performance |
|
We trained with hyperparameters similar to those in the paper and did not perform any tuning. A higher maximum sequence length does not guarantee higher performance; building a good NLI dataset is more important.
|
|
|
```json
{
    "batch_size": 64,
    "num_epochs": "1 (unsupervised) / 3 (supervised)",
    "lr": 3e-05,
    "num_warmup_steps": 0,
    "temperature": 0.05,
    "lr_scheduler_type": "linear",
    "max_seq_len": 32,
    "use_fp16": "True"
}
```
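The scores below are Pearson and Spearman correlations (multiplied by 100) between similarity scores computed from the sentence embeddings and the gold STS labels, for cosine, Euclidean, Manhattan, and dot-product similarities. A minimal sketch of the cosine-based correlations is shown here; `sentences1`, `sentences2`, and `gold_scores` stand in for an STS evaluation split.

```python
import torch
import torch.nn.functional as F
from scipy.stats import pearsonr, spearmanr


def sts_correlations(model, tokenizer, sentences1, sentences2, gold_scores):
    """Correlate cosine similarity of sentence embeddings with gold STS scores."""
    model.eval()
    with torch.no_grad():
        e1 = model(**tokenizer(sentences1, padding=True, truncation=True, return_tensors="pt"))
        e2 = model(**tokenizer(sentences2, padding=True, truncation=True, return_tensors="pt"))
    cos = F.cosine_similarity(e1, e2, dim=-1).cpu().numpy()
    gold = list(gold_scores)
    return {
        "cosine_pearson": pearsonr(cos, gold)[0] * 100,
        "cosine_spearman": spearmanr(cos, gold).correlation * 100,
    }
```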
|
|
|
|
|
### KLUE-STS |
|
| Model | Avg | Cosine Pearson | Cosine Spearman | Euclidean Pearson | Euclidean Spearman | Manhattan Pearson | Manhattan Spearman | Dot Pearson | Dot Spearman | |
|
|------------------------|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:| |
|
| KoSimCSE-RoBERTa-base (Unsupervised) | 81.17 | 81.27 | 80.96 | 81.70 | 80.97 | 81.63 | 80.89 | 81.12 | 80.81 | |
|
| KoSimCSE-RoBERTa-base (Supervised) | 84.19 | 83.04 | 84.46 | 84.97 | 84.50 | 84.95 | 84.45 | 82.88 | 84.28 | |
|
| KoSimCSE-RoBERTa-large (Unsupervised) | 81.96 | 82.09 | 81.71 | 82.45 | 81.73 | 82.42 | 81.69 | 81.98 | 81.58 | |
|
| KoSimCSE-RoBERTa-large (Supervised) | 85.37 | 84.38 | 85.99 | 85.97 | 85.81 | 86.00 | 85.79 | 83.87 | 85.15 | |
|
|
|
### Kor-STS |
|
| Model | Avg | Cosine Pearson | Cosine Spearman | Euclidean Pearson | Euclidean Spearman | Manhattan Pearson | Manhattan Spearman | Dot Pearson | Dot Spearman | |
|
|------------------------|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:| |
|
| KoSimCSE-RoBERTa-base (Unsupervised) | 81.20 | 81.53 | 81.17 | 80.89 | 81.20 | 80.93 | 81.22 | 81.48 | 81.14 | |
|
| KoSimCSE-RoBERTa-base (Supervised) | 85.33 | 85.16 | 85.46 | 85.37 | 85.45 | 85.31 | 85.37 | 85.13 | 85.41 | |
|
| KoSimCSE-RoBERTa-large (Unsupervised) | 81.71 | 82.10 | 81.78 | 81.12 | 81.78 | 81.15 | 81.80 | 82.15 | 81.80 | |
|
| KoSimCSE-RoBERTa-large (Supervised) | 85.54 | 85.41 | 85.78 | 85.18 | 85.51 | 85.26 | 85.61 | 85.70 | 85.90 | |
|
|
|
|
|
## References |
|
- Simple-SimCSE: https://github.com/hppRC/simple-simcse |
|
- KoSimCSE: https://github.com/BM-K/KoSimCSE-SKT |
|
- SimCSE (official): https://github.com/princeton-nlp/SimCSE |
|
- SimCSE paper: https://aclanthology.org/2021.emnlp-main.552 |