---
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- sentence-similarity
- transformers
- TAACO
language: ko
---

# TAACO_Similarity

λ³Έ λͺ¨λΈμ€ [Sentence-transformers](https://www.SBERT.net)λ₯Ό 기반으둜 ν•˜λ©° KLUE의 STS(Sentence Textual Similarity) 데이터셋을 톡해 ν›ˆλ ¨μ„ μ§„ν–‰ν•œ λͺ¨λΈμž…λ‹ˆλ‹€.    
ν•„μžκ°€ μ œμž‘ν•˜κ³  μžˆλŠ” ν•œκ΅­μ–΄ λ¬Έμž₯κ°„ 결속성 μΈ‘μ • 도ꡬ인 K-TAACO(κ°€μ œ)의 μ§€ν‘œ 쀑 ν•˜λ‚˜μΈ λ¬Έμž₯ κ°„ 의미적 결속성을 μΈ‘μ •ν•˜κΈ° μœ„ν•΄ λ³Έ λͺ¨λΈμ„ μ œμž‘ν•˜μ˜€μŠ΅λ‹ˆλ‹€.    
λ˜ν•œ λͺ¨λ‘μ˜ λ§λ­‰μΉ˜μ˜ λ¬Έμž₯κ°„ μœ μ‚¬λ„ 데이터 λ“± λ‹€μ–‘ν•œ 데이터λ₯Ό ꡬ해 μΆ”κ°€ ν›ˆλ ¨μ„ 진행할 μ˜ˆμ •μž…λ‹ˆλ‹€.

## Train Data
- KLUE-sts-v1.1._train.json
- NLI-sts-train.tsv

## Usage (Sentence-Transformers)

λ³Έ λͺ¨λΈμ„ μ‚¬μš©ν•˜κΈ° μœ„ν•΄μ„œλŠ” [Sentence-transformers](https://www.SBERT.net)λ₯Ό μ„€μΉ˜ν•˜μ—¬μ•Ό ν•©λ‹ˆλ‹€.

```
pip install -U sentence-transformers
```

λͺ¨λΈμ„ μ‚¬μš©ν•˜κΈ° μœ„ν•΄μ„œλŠ” μ•„λž˜ μ½”λ“œλ₯Ό μ°Έμ‘°ν•˜μ‹œκΈΈ λ°”λžλ‹ˆλ‹€.

```python
from sentence_transformers import SentenceTransformer, models

sentences = ["This is an example sentence", "Each sentence is converted"]

# Load the transformer backbone from the Hugging Face Hub.
embedding_model = models.Transformer(
    model_name_or_path="KDHyun08/TAACO_STS",
    max_seq_length=256,
    do_lower_case=True,
)

# Mean pooling over token embeddings yields one vector per sentence.
pooling_model = models.Pooling(
    embedding_model.get_word_embedding_dimension(),
    pooling_mode_mean_tokens=True,
    pooling_mode_cls_token=False,
    pooling_mode_max_tokens=False,
)
model = SentenceTransformer(modules=[embedding_model, pooling_model])

embeddings = model.encode(sentences)
print(embeddings)
```
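
The returned embeddings can be compared directly. As a minimal sketch (reusing the `embeddings` computed above), the cosine similarity between the two example sentences can be obtained with `util.pytorch_cos_sim`:

```python
from sentence_transformers import util

# Cosine similarity between the two example embeddings above;
# scores closer to 1 indicate more similar sentences.
similarity = util.pytorch_cos_sim(embeddings[0], embeddings[1])
print(similarity)
```

The next section applies the same comparison to a realistic set of Korean sentences.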


## Usage (comparing similarity between actual sentences)
After installing [Sentence-transformers](https://www.SBERT.net), you can compare the similarity between sentences as shown below.  
The `query` variable holds the source sentence used as the basis for comparison, and the sentences to compare against it are given as a list in `docs`.

```python
from sentence_transformers import SentenceTransformer, models, util
import torch

embedding_model = models.Transformer(
    model_name_or_path="KDHyun08/TAACO_STS", 
    max_seq_length=256,
    do_lower_case=True
)

pooling_model = models.Pooling(
    embedding_model.get_word_embedding_dimension(),
    pooling_mode_mean_tokens=True,
    pooling_mode_cls_token=False,
    pooling_mode_max_tokens=False,
)
model = SentenceTransformer(modules=[embedding_model, pooling_model])

docs = ['μ–΄μ œλŠ” μ•„λ‚΄μ˜ μƒμΌμ΄μ—ˆλ‹€', '생일을 λ§žμ΄ν•˜μ—¬ 아침을 μ€€λΉ„ν•˜κ² λ‹€κ³  μ˜€μ „ 8μ‹œ 30λΆ„λΆ€ν„° μŒμ‹μ„ μ€€λΉ„ν•˜μ˜€λ‹€. 주된 λ©”λ‰΄λŠ” μŠ€ν…Œμ΄ν¬μ™€ λ‚™μ§€λ³ΆμŒ, λ―Έμ—­κ΅­, μž‘μ±„, μ†Œμ•Ό λ“±μ΄μ—ˆλ‹€', 'μŠ€ν…Œμ΄ν¬λŠ” 자주 ν•˜λŠ” μŒμ‹μ΄μ–΄μ„œ μžμ‹ μ΄ μ€€λΉ„ν•˜λ €κ³  ν–ˆλ‹€', 'μ•žλ’€λ„ 1λΆ„μ”© 3번 뒀집고 λž˜μŠ€νŒ…μ„ 잘 ν•˜λ©΄ μœ‘μ¦™μ΄ κ°€λ“ν•œ μŠ€ν…Œμ΄ν¬κ°€ μ€€λΉ„λ˜λ‹€', '아내도 그런 μŠ€ν…Œμ΄ν¬λ₯Ό μ’‹μ•„ν•œλ‹€. 그런데 상상도 λͺ»ν•œ 일이 λ²Œμ΄μ§€κ³  λ§μ•˜λ‹€', '보톡 μ‹œμ¦ˆλ‹μ΄ λ˜μ§€ μ•Šμ€ μ›μœ‘μ„ μ‚¬μ„œ μŠ€ν…Œμ΄ν¬λ₯Ό ν–ˆλŠ”λ°, μ΄λ²ˆμ—λŠ” μ‹œμ¦ˆλ‹μ΄ 된 뢀챗살을 κ΅¬μž…ν•΄μ„œ ν–ˆλ‹€', '그런데 μΌ€μ΄μŠ€ μ•ˆμ— λ°©λΆ€μ œκ°€ λ“€μ–΄μžˆλŠ” 것을 μΈμ§€ν•˜μ§€ λͺ»ν•˜κ³  λ°©λΆ€μ œμ™€ λ™μ‹œμ— ν”„λΌμ΄νŒ¬μ— μ˜¬λ €λ†“μ„ 것이닀', '그것도 인지 λͺ»ν•œ 체... μ•žλ©΄μ„ μ„Ό λΆˆμ— 1뢄을 κ΅½κ³  λ’€μ§‘λŠ” μˆœκ°„ λ°©λΆ€μ œκ°€ ν•¨κ»˜ ꡬ어진 것을 μ•Œμ•˜λ‹€', 'μ•„λ‚΄μ˜ 생일이라 λ§›μžˆκ²Œ κ΅¬μ›Œλ³΄κ³  μ‹Άμ—ˆλŠ”λ° μ–΄μ²˜κ΅¬λ‹ˆμ—†λŠ” 상황이 λ°œμƒν•œ 것이닀', 'λ°©λΆ€μ œκ°€ μ„Ό λΆˆμ— λ…Ήμ•„μ„œ κ·ΈλŸ°μ§€ 물처럼 ν˜λŸ¬λ‚΄λ Έλ‹€', ' 고민을 ν–ˆλ‹€. λ°©λΆ€μ œκ°€ 묻은 λΆ€λ¬Έλ§Œ μ œκ±°ν•˜κ³  λ‹€μ‹œ ꡬ울까 ν–ˆλŠ”λ° λ°©λΆ€μ œμ— μ ˆλŒ€ 먹지 λ§λΌλŠ” 문ꡬ가 μžˆμ–΄μ„œ μ•„κΉμ§€λ§Œ λ²„λ¦¬λŠ” λ°©ν–₯을 ν–ˆλ‹€', 'λ„ˆλ¬΄λ‚˜ μ•ˆνƒ€κΉŒμ› λ‹€', 'μ•„μΉ¨ 일찍 μ•„λ‚΄κ°€ μ’‹μ•„ν•˜λŠ” μŠ€ν…Œμ΄ν¬λ₯Ό μ€€λΉ„ν•˜κ³  그것을 λ§›μžˆκ²Œ λ¨ΉλŠ” μ•„λ‚΄μ˜ λͺ¨μŠ΅μ„ 보고 μ‹Άμ—ˆλŠ”λ° μ „ν˜€ 생각지도 λͺ»ν•œ 상황이 λ°œμƒν•΄μ„œ... ν•˜μ§€λ§Œ 정신을 μΆ”μŠ€λ₯΄κ³  λ°”λ‘œ λ‹€λ₯Έ λ©”λ‰΄λ‘œ λ³€κ²½ν–ˆλ‹€', 'μ†Œμ•Ό, μ†Œμ‹œμ§€ μ•Όμ±„λ³ΆμŒ..', 'μ•„λ‚΄κ°€ μ’‹μ•„ν•˜λŠ”μ§€ λͺ¨λ₯΄κ² μ§€λ§Œ 냉μž₯κ³  μ•ˆμ— μžˆλŠ” ν›„λž‘ν¬μ†Œμ„Έμ§€λ₯Ό λ³΄λ‹ˆ λ°”λ‘œ μ†Œμ•Όλ₯Ό ν•΄μ•Όκ² λ‹€λŠ” 생각이 λ“€μ—ˆλ‹€. μŒμ‹μ€ μ„±κ³΅μ μœΌλ‘œ 완성이 λ˜μ—ˆλ‹€', '40번째λ₯Ό λ§žμ΄ν•˜λŠ” μ•„λ‚΄μ˜ 생일은 μ„±κ³΅μ μœΌλ‘œ μ€€λΉ„κ°€ λ˜μ—ˆλ‹€', 'λ§›μžˆκ²Œ λ¨Ήμ–΄ μ€€ μ•„λ‚΄μ—κ²Œλ„ κ°μ‚¬ν–ˆλ‹€', '맀년 μ•„λ‚΄μ˜ 생일에 λ§žμ΄ν•˜λ©΄ μ•„μΉ¨λ§ˆλ‹€ 생일을 μ°¨λ €μ•Όκ² λ‹€. μ˜€λŠ˜λ„ 즐거운 ν•˜λ£¨κ°€ λ˜μ—ˆμœΌλ©΄ μ’‹κ² λ‹€', 'μƒμΌμ΄λ‹ˆκΉŒ~']
# Encode each sentence into a vector.
document_embeddings = model.encode(docs)

query = '생일을 λ§žμ΄ν•˜μ—¬ 아침을 μ€€λΉ„ν•˜κ² λ‹€κ³  μ˜€μ „ 8μ‹œ 30λΆ„λΆ€ν„° μŒμ‹μ„ μ€€λΉ„ν•˜μ˜€λ‹€'
query_embedding = model.encode(query)

top_k = min(10, len(docs))

# Compute cosine similarity between the query and every document.
cos_scores = util.pytorch_cos_sim(query_embedding, document_embeddings)[0]

# Extract the top sentences, ranked by cosine similarity.
top_results = torch.topk(cos_scores, k=top_k)

print(f"μž…λ ₯ λ¬Έμž₯: {query}")
print(f"\n<μž…λ ₯ λ¬Έμž₯κ³Ό μœ μ‚¬ν•œ {top_k} 개의 λ¬Έμž₯>\n")

for i, (score, idx) in enumerate(zip(top_results[0], top_results[1])):
    print(f"{i+1}: {docs[idx]} {'(μœ μ‚¬λ„: {:.4f})'.format(score)}\n")
```



## Evaluation Results

Running the Usage code above produces the output below. The closer the score is to 1, the more similar the sentence.

```
μž…λ ₯ λ¬Έμž₯: 생일을 λ§žμ΄ν•˜μ—¬ 아침을 μ€€λΉ„ν•˜κ² λ‹€κ³  μ˜€μ „ 8μ‹œ 30λΆ„λΆ€ν„° μŒμ‹μ„ μ€€λΉ„ν•˜μ˜€λ‹€

<μž…λ ₯ λ¬Έμž₯κ³Ό μœ μ‚¬ν•œ 10 개의 λ¬Έμž₯>

1: 생일을 λ§žμ΄ν•˜μ—¬ 아침을 μ€€λΉ„ν•˜κ² λ‹€κ³  μ˜€μ „ 8μ‹œ 30λΆ„λΆ€ν„° μŒμ‹μ„ μ€€λΉ„ν•˜μ˜€λ‹€. 주된 λ©”λ‰΄λŠ” μŠ€ν…Œμ΄ν¬μ™€ λ‚™μ§€λ³ΆμŒ, λ―Έμ—­κ΅­, μž‘μ±„, μ†Œμ•Ό λ“±μ΄μ—ˆλ‹€ (μœ μ‚¬λ„: 0.6687)

2: 맀년 μ•„λ‚΄μ˜ 생일에 λ§žμ΄ν•˜λ©΄ μ•„μΉ¨λ§ˆλ‹€ 생일을 μ°¨λ €μ•Όκ² λ‹€. μ˜€λŠ˜λ„ 즐거운 ν•˜λ£¨κ°€ λ˜μ—ˆμœΌλ©΄ μ’‹κ² λ‹€ (μœ μ‚¬λ„: 0.6468)

3: 40번째λ₯Ό λ§žμ΄ν•˜λŠ” μ•„λ‚΄μ˜ 생일은 μ„±κ³΅μ μœΌλ‘œ μ€€λΉ„κ°€ λ˜μ—ˆλ‹€ (μœ μ‚¬λ„: 0.4647)

4: μ•„λ‚΄μ˜ 생일이라 λ§›μžˆκ²Œ κ΅¬μ›Œλ³΄κ³  μ‹Άμ—ˆλŠ”λ° μ–΄μ²˜κ΅¬λ‹ˆμ—†λŠ” 상황이 λ°œμƒν•œ 것이닀 (μœ μ‚¬λ„: 0.4469)

5: μƒμΌμ΄λ‹ˆκΉŒ~ (μœ μ‚¬λ„: 0.4218)

6: μ–΄μ œλŠ” μ•„λ‚΄μ˜ μƒμΌμ΄μ—ˆλ‹€ (μœ μ‚¬λ„: 0.4192)

7: μ•„μΉ¨ 일찍 μ•„λ‚΄κ°€ μ’‹μ•„ν•˜λŠ” μŠ€ν…Œμ΄ν¬λ₯Ό μ€€λΉ„ν•˜κ³  그것을 λ§›μžˆκ²Œ λ¨ΉλŠ” μ•„λ‚΄μ˜ λͺ¨μŠ΅μ„ 보고 μ‹Άμ—ˆλŠ”λ° μ „ν˜€ 생각지도 λͺ»ν•œ 상황이 λ°œμƒν•΄μ„œ... ν•˜μ§€λ§Œ 정신을 μΆ”μŠ€λ₯΄κ³  λ°”λ‘œ λ‹€λ₯Έ λ©”λ‰΄λ‘œ λ³€κ²½ν–ˆλ‹€ (μœ μ‚¬λ„: 0.4156)

8: λ§›μžˆκ²Œ λ¨Ήμ–΄ μ€€ μ•„λ‚΄μ—κ²Œλ„ κ°μ‚¬ν–ˆλ‹€ (μœ μ‚¬λ„: 0.3093)

9: μ•„λ‚΄κ°€ μ’‹μ•„ν•˜λŠ”μ§€ λͺ¨λ₯΄κ² μ§€λ§Œ 냉μž₯κ³  μ•ˆμ— μžˆλŠ” ν›„λž‘ν¬μ†Œμ„Έμ§€λ₯Ό λ³΄λ‹ˆ λ°”λ‘œ μ†Œμ•Όλ₯Ό ν•΄μ•Όκ² λ‹€λŠ” 생각이 λ“€μ—ˆλ‹€. μŒμ‹μ€ μ„±κ³΅μ μœΌλ‘œ 완성이 λ˜μ—ˆλ‹€ (μœ μ‚¬λ„: 0.2259)

10: 아내도 그런 μŠ€ν…Œμ΄ν¬λ₯Ό μ’‹μ•„ν•œλ‹€. 그런데 상상도 λͺ»ν•œ 일이 λ²Œμ΄μ§€κ³  λ§μ•˜λ‹€ (μœ μ‚¬λ„: 0.1967)
```


## Training

**DataLoader**:

`torch.utils.data.dataloader.DataLoader` of length 142 with parameters:
```
{'batch_size': 32, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
```

**Loss**:

`sentence_transformers.losses.CosineSimilarityLoss.CosineSimilarityLoss` 

Parameters of the fit()-Method:
```
{
    "epochs": 4,
    "evaluation_steps": 1000,
    "evaluator": "sentence_transformers.evaluation.EmbeddingSimilarityEvaluator.EmbeddingSimilarityEvaluator",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'transformers.optimization.AdamW'>",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 10000,
    "weight_decay": 0.01
}
```
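
The parameters above correspond to a standard sentence-transformers training loop. Below is a minimal sketch of how such a run could be reproduced, assuming the `model` assembled in the Usage section and assuming the training pairs have already been read into `InputExample` objects with labels scaled to [0, 1] (the example pairs here are placeholders, not the actual data):

```python
from sentence_transformers import InputExample, losses
from torch.utils.data import DataLoader

# Placeholder pairs; the real data comes from KLUE-sts-v1.1._train.json
# and NLI-sts-train.tsv.
train_examples = [
    InputExample(texts=["sentence A", "sentence B"], label=0.8),
    InputExample(texts=["sentence C", "sentence D"], label=0.1),
]

# Mirrors the DataLoader described above: batch_size=32 with random sampling.
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.CosineSimilarityLoss(model)

# Mirrors the fit() parameters listed above.
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=4,
    evaluation_steps=1000,
    warmup_steps=10000,
    optimizer_params={"lr": 2e-05},
    weight_decay=0.01,
    max_grad_norm=1,
)
```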


## Full Model Architecture
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
```
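
Because the repository ships this full SentenceTransformer configuration, the model can presumably also be loaded in a single line instead of assembling the `Transformer` and `Pooling` modules by hand:

```python
from sentence_transformers import SentenceTransformer

# Loads the Transformer + Pooling stack defined in the repository config.
model = SentenceTransformer("KDHyun08/TAACO_STS")
```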

## Citing & Authors

<!--- Describe where people can find more information -->