KDHyun08 committed
Commit 752d6bc
1 Parent(s): 8fffa58

Upload with huggingface_hub
README.md CHANGED
@@ -2,130 +2,85 @@
  pipeline_tag: sentence-similarity
  tags:
  - sentence-transformers
  - sentence-similarity
  - transformers
- - TAACO
- language: ko
  ---

- # TAACO_Similarity

- This model is based on [Sentence-transformers](https://www.SBERT.net) and was trained on the KLUE STS (Sentence Textual Similarity) dataset.
- The model was built to measure semantic cohesion between sentences, one of the indices of K-TAACO (working title), a tool the author is developing to measure cohesion between Korean sentences.
- Further training is planned on a variety of additional data, such as the inter-sentence similarity data of the Modu Corpus.

- ## Train Data
- KLUE-sts-v1.1._train.json
- NLI-sts-train.tsv

  ## Usage (Sentence-Transformers)

- To use this model, you must install [Sentence-transformers](https://www.SBERT.net):

  ```
  pip install -U sentence-transformers
  ```

- Refer to the code below to use the model:

  ```python
- from sentence_transformers import SentenceTransformer, models
  sentences = ["This is an example sentence", "Each sentence is converted"]

- embedding_model = models.Transformer(
-     model_name_or_path="KDHyun08/TAACO_STS",
-     max_seq_length=256,
-     do_lower_case=True
- )
-
- pooling_model = models.Pooling(
-     embedding_model.get_word_embedding_dimension(),
-     pooling_mode_mean_tokens=True,
-     pooling_mode_cls_token=False,
-     pooling_mode_max_tokens=False,
- )
- model = SentenceTransformer(modules=[embedding_model, pooling_model])
-
  embeddings = model.encode(sentences)
  print(embeddings)
  ```
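
Since the model is published on the Hugging Face Hub together with its sentence-transformers config, it should also load in a single line; a minimal sketch (assuming the hosted config matches the module setup above):

```python
from sentence_transformers import SentenceTransformer

# One-line load; max_seq_length and do_lower_case come from the hosted config
model = SentenceTransformer("KDHyun08/TAACO_STS")
embeddings = model.encode(["This is an example sentence", "Each sentence is converted"])
print(embeddings.shape)  # (2, 768): one 768-dimensional vector per sentence
```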

- ## Usage (similarity comparison between real sentences)
- After installing [Sentence-transformers](https://www.SBERT.net), you can compare the similarity between sentences as shown below.
- The query variable holds the source sentence that serves as the basis for comparison; the sentences to compare against it go into docs as a list.

  ```python
- from sentence_transformers import SentenceTransformer, models, util
- import torch
-
- embedding_model = models.Transformer(
-     model_name_or_path="KDHyun08/TAACO_STS",
-     max_seq_length=256,
-     do_lower_case=True
- )
-
- pooling_model = models.Pooling(
-     embedding_model.get_word_embedding_dimension(),
-     pooling_mode_mean_tokens=True,
-     pooling_mode_cls_token=False,
-     pooling_mode_max_tokens=False,
- )
- model = SentenceTransformer(modules=[embedding_model, pooling_model])
-
- docs = ['어제는 아내의 생일이었다', '생일을 맞이하여 아침을 준비하겠다고 오전 8시 30분부터 음식을 준비하였다. 주된 메뉴는 스테이크와 낙지볶음, 미역국, 잡채, 소야 등이었다', '스테이크는 자주 하는 음식이어서 자신이 준비하려고 했다', '앞뒤도 1분씩 3번 뒤집고 래스팅을 잘 하면 육즙이 가득한 스테이크가 준비되다', '아내도 그런 스테이크를 좋아한다. 그런데 상상도 못한 일이 벌이지고 말았다', '보통 시즈닝이 되지 않은 원육을 사서 스테이크를 했는데, 이번에는 시즈닝이 된 부챗살을 구입해서 했다', '그런데 케이스 안에 방부제가 들어있는 것을 인지하지 못하고 방부제와 동시에 프라이팬에 올려놓을 것이다', '그것도 인지 못한 체... 앞면을 센 불에 1분을 굽고 뒤집는 순간 방부제가 함께 구어진 것을 알았다', '아내의 생일이라 맛있게 구워보고 싶었는데 어처구니없는 상황이 발생한 것이다', '방부제가 센 불에 녹아서 그런지 물처럼 흘러내렸다', ' 고민을 했다. 방부제가 묻은 부문만 제거하고 다시 구울까 했는데 방부제에 절대 먹지 말라는 문구가 있어서 아깝지만 버리는 방향을 했다', '너무나 안타까웠다', '아침 일찍 아내가 좋아하는 스테이크를 준비하고 그것을 맛있게 먹는 아내의 모습을 보고 싶었는데 전혀 생각지도 못한 상황이 발생해서... 하지만 정신을 추스르고 바로 다른 메뉴로 변경했다', '소야, 소시지 야채볶음..', '아내가 좋아하는지 모르겠지만 냉장고 안에 있는 후랑크소세지를 보니 바로 소야를 해야겠다는 생각이 들었다. 음식은 성공적으로 완성이 되었다', '40번째를 맞이하는 아내의 생일은 성공적으로 준비가 되었다', '맛있게 먹어 준 아내에게도 감사했다', '매년 아내의 생일에 맞이하면 아침마다 생일을 차려야겠다. 오늘도 즐거운 하루가 되었으면 좋겠다', '생일이니까~']
- # Encode each sentence into a vector
- document_embeddings = model.encode(docs)
-
- query = '생일을 맞이하여 아침을 준비하겠다고 오전 8시 30분부터 음식을 준비하였다'
- query_embedding = model.encode(query)
-
- top_k = min(10, len(docs))
-
- # After computing cosine similarity,
- cos_scores = util.pytorch_cos_sim(query_embedding, document_embeddings)[0]
-
- # extract the sentences in order of cosine similarity
- top_results = torch.topk(cos_scores, k=top_k)
-
- print(f"Input sentence: {query}")
- print(f"\n<Top {top_k} sentences similar to the input sentence>\n")
-
- for i, (score, idx) in enumerate(zip(top_results[0], top_results[1])):
-     print(f"{i+1}: {docs[idx]} (similarity: {score:.4f})\n")
  ```
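
As an aside (not in the original card), the manual cosine scoring and top-k selection above can also be expressed with util.semantic_search; this sketch reuses docs, document_embeddings, query_embedding, and top_k from the block above:

```python
# util.semantic_search wraps cosine scoring and top-k ranking in one call
hits = util.semantic_search(query_embedding, document_embeddings, top_k=top_k)[0]
for rank, hit in enumerate(hits, start=1):
    print(f"{rank}: {docs[hit['corpus_id']]} (similarity: {hit['score']:.4f})")
```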


  ## Evaluation Results

- Running the Usage code above produces the output below. The closer the score is to 1, the more similar the two sentences are.
-
- ```
- Input sentence: 생일을 맞이하여 아침을 준비하겠다고 오전 8시 30분부터 음식을 준비하였다
-
- <Top 10 sentences similar to the input sentence>
-
- 1: 생일을 맞이하여 아침을 준비하겠다고 오전 8시 30분부터 음식을 준비하였다. 주된 메뉴는 스테이크와 낙지볶음, 미역국, 잡채, 소야 등이었다 (similarity: 0.6687)
-
- 2: 매년 아내의 생일에 맞이하면 아침마다 생일을 차려야겠다. 오늘도 즐거운 하루가 되었으면 좋겠다 (similarity: 0.6468)
-
- 3: 40번째를 맞이하는 아내의 생일은 성공적으로 준비가 되었다 (similarity: 0.4647)
-
- 4: 아내의 생일이라 맛있게 구워보고 싶었는데 어처구니없는 상황이 발생한 것이다 (similarity: 0.4469)
-
- 5: 생일이니까~ (similarity: 0.4218)
-
- 6: 어제는 아내의 생일이었다 (similarity: 0.4192)
-
- 7: 아침 일찍 아내가 좋아하는 스테이크를 준비하고 그것을 맛있게 먹는 아내의 모습을 보고 싶었는데 전혀 생각지도 못한 상황이 발생해서... 하지만 정신을 추스르고 바로 다른 메뉴로 변경했다 (similarity: 0.4156)
-
- 8: 맛있게 먹어 준 아내에게도 감사했다 (similarity: 0.3093)
-
- 9: 아내가 좋아하는지 모르겠지만 냉장고 안에 있는 후랑크소세지를 보니 바로 소야를 해야겠다는 생각이 들었다. 음식은 성공적으로 완성이 되었다 (similarity: 0.2259)
-
- 10: 아내도 그런 스테이크를 좋아한다. 그런데 상상도 못한 일이 벌이지고 말았다 (similarity: 0.1967)
- ```


  **DataLoader**:

@@ -142,7 +97,7 @@ Parameters of the fit()-Method:
  ```
  {
  "epochs": 4,
- "evaluation_steps": 1000,
  "evaluator": "sentence_transformers.evaluation.EmbeddingSimilarityEvaluator.EmbeddingSimilarityEvaluator",
  "max_grad_norm": 1,
  "optimizer_class": "<class 'transformers.optimization.AdamW'>",
@@ -160,7 +115,7 @@ Parameters of the fit()-Method:
  ## Full Model Architecture
  ```
  SentenceTransformer(
- (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  )
  ```
 
  pipeline_tag: sentence-similarity
  tags:
  - sentence-transformers
+ - feature-extraction
  - sentence-similarity
  - transformers
  ---

+ # {MODEL_NAME}

+ This is a [sentence-transformers](https://www.SBERT.net) model: it maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search.

+ <!--- Describe your model here -->

  ## Usage (Sentence-Transformers)

+ Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:

  ```
  pip install -U sentence-transformers
  ```

+ Then you can use the model like this:

  ```python
+ from sentence_transformers import SentenceTransformer
  sentences = ["This is an example sentence", "Each sentence is converted"]

+ model = SentenceTransformer('{MODEL_NAME}')

  embeddings = model.encode(sentences)
  print(embeddings)
  ```
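
As a sketch of the semantic-search use mentioned above (an added illustration, not part of the generated card; the corpus and query strings are invented placeholders), embeddings can be compared with cosine similarity via sentence_transformers.util:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('{MODEL_NAME}')

# Illustrative placeholder data
corpus = ["A man is eating food.", "A man is riding a horse.", "A woman is playing the violin."]
query = "Someone is eating a meal."

# Encode to tensors so util.cos_sim can score them directly
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Cosine-similarity scores between the query and each corpus sentence
scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
best = int(scores.argmax())
print(f"Best match: {corpus[best]} (score: {scores[best]:.4f})")
```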


+ ## Usage (HuggingFace Transformers)
+ Without [sentence-transformers](https://www.SBERT.net), you can use the model like this: first, you pass your input through the transformer model, then you have to apply the right pooling operation on top of the contextualized word embeddings.

  ```python
+ from transformers import AutoTokenizer, AutoModel
+ import torch

+ # Mean pooling - take the attention mask into account for correct averaging
+ def mean_pooling(model_output, attention_mask):
+     token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
+     input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
+     return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

+ # Sentences we want sentence embeddings for
+ sentences = ['This is an example sentence', 'Each sentence is converted']

+ # Load model from HuggingFace Hub
+ tokenizer = AutoTokenizer.from_pretrained('{MODEL_NAME}')
+ model = AutoModel.from_pretrained('{MODEL_NAME}')

+ # Tokenize sentences
+ encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

+ # Compute token embeddings
+ with torch.no_grad():
+     model_output = model(**encoded_input)

+ # Perform pooling. In this case, mean pooling.
+ sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

+ print("Sentence embeddings:")
+ print(sentence_embeddings)
  ```
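
Continuing from the block above (an added illustration, not part of the generated card), the pooled vectors can be compared directly with torch.nn.functional:

```python
import torch.nn.functional as F

# Cosine similarity between the two pooled sentence embeddings
similarity = F.cosine_similarity(sentence_embeddings[0], sentence_embeddings[1], dim=0)
print(f"Cosine similarity: {similarity.item():.4f}")
```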


  ## Evaluation Results

+ <!--- Describe how your model was evaluated -->

+ For an automated evaluation of this model, see the *Sentence Embeddings Benchmark*: [https://seb.sbert.net](https://seb.sbert.net?model_name={MODEL_NAME})

+ ## Training
+ The model was trained with the parameters:

  **DataLoader**:

  ```
  {
  "epochs": 4,
+ "evaluation_steps": 4538,
  "evaluator": "sentence_transformers.evaluation.EmbeddingSimilarityEvaluator.EmbeddingSimilarityEvaluator",
  "max_grad_norm": 1,
  "optimizer_class": "<class 'transformers.optimization.AdamW'>",

  ## Full Model Architecture
  ```
  SentenceTransformer(
+ (0): Transformer({'max_seq_length': 256, 'do_lower_case': True}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  )
  ```
config.json CHANGED
@@ -1,5 +1,5 @@
  {
- "_name_or_path": "klue/bert-base",
+ "_name_or_path": "KDHyun08/TAACO_STS",
  "architectures": [
  "BertModel"
  ],
pytorch_model.bin CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:173ff08c5273eed03b1809a9bfb99dcfb4f42f8d2024cea8cb561bd60ff46656
+ oid sha256:70d3bfc7b45f4c5d3d33353cfd2d8c6fa3c40943b630cf1edaa8b69c30e79834
  size 442543599
sentence_bert_config.json CHANGED
@@ -1,4 +1,4 @@
  {
- "max_seq_length": 512,
- "do_lower_case": false
+ "max_seq_length": 256,
+ "do_lower_case": true
  }
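
These two fields control how sentence-transformers prepares inputs at load time: inputs longer than max_seq_length are truncated, and do_lower_case lowercases text before tokenization. A minimal sketch of how the setting surfaces (added for illustration, not part of the commit):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("KDHyun08/TAACO_STS")
print(model.max_seq_length)  # 256 after this commit; longer inputs are truncated
model.max_seq_length = 128   # the limit can also be adjusted at runtime
```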
tokenizer.json CHANGED
@@ -2,7 +2,7 @@
  "version": "1.0",
  "truncation": {
  "direction": "Right",
- "max_length": 512,
+ "max_length": 256,
  "strategy": "LongestFirst",
  "stride": 0
  },
tokenizer_config.json CHANGED
@@ -1 +1 @@
- {"do_lower_case": false, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "do_basic_tokenize": true, "never_split": null, "model_max_length": 512, "special_tokens_map_file": "C:\\Users\\DESKTOP/.cache\\huggingface\\transformers\\aeaaa3afd086a040be912f92ffe7b5f85008b744624f4517c4216bcc32b51cf0.054ece8d16bd524c8a00f0e8a976c00d5de22a755ffb79e353ee2954d9289e26", "name_or_path": "klue/bert-base", "tokenizer_class": "BertTokenizer"}
+ {"do_lower_case": false, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "do_basic_tokenize": true, "never_split": null, "model_max_length": 512, "special_tokens_map_file": "C:\\Users\\DESKTOP/.cache\\huggingface\\transformers\\aeaaa3afd086a040be912f92ffe7b5f85008b744624f4517c4216bcc32b51cf0.054ece8d16bd524c8a00f0e8a976c00d5de22a755ffb79e353ee2954d9289e26", "name_or_path": "KDHyun08/TAACO_STS", "tokenizer_class": "BertTokenizer"}