TAACO_STS / README.md
KDHyun08's picture
Upload with huggingface_hub
24fab3b
|
raw
history blame
5.64 kB
metadata
pipeline_tag: sentence-similarity
tags:
  - sentence-transformers
  - feature-extraction
  - sentence-similarity
  - transformers
lan: ko

TAACO_Similarity

This is a sentence-transformers model: It maps sentences & paragraphs to a 768 dimensional dense vector space and can be used for tasks like clustering or semantic search.

Usage (Sentence-Transformers)

Using this model becomes easy when you have sentence-transformers installed:

pip install -U sentence-transformers

Then you can use the model like this:

from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('{MODEL_NAME}')
embeddings = model.encode(sentences)
print(embeddings)

Usage (μ‹€μ œ λ¬Έμž₯ κ°„ μœ μ‚¬λ„ 비ꡐ)

Sentence-transformers sentence-transformers λ₯Ό μ„€μΉ˜ν•œ ν›„ μ•„λž˜ λ‚΄μš©κ³Ό 같이 λ¬Έμž₯ κ°„ μœ μ‚¬λ„λ₯Ό 비ꡐ할 수 μžˆμŠ΅λ‹ˆλ‹€. query λ³€μˆ˜λŠ” 비ꡐ 기쀀이 λ˜λŠ” λ¬Έμž₯(Source Sentence)이고 비ꡐλ₯Ό 진행할 λ¬Έμž₯은 docs에 list ν˜•μ‹μœΌλ‘œ κ΅¬μ„±ν•˜μ‹œλ©΄ λ©λ‹ˆλ‹€.

model = SentenceTransformer("KDHyun08/TAACO_STS")

docs = ['μ–΄μ œλŠ” μ•„λ‚΄μ˜ μƒμΌμ΄μ—ˆλ‹€', '생일을 λ§žμ΄ν•˜μ—¬ 아침을 μ€€λΉ„ν•˜κ² λ‹€κ³  μ˜€μ „ 8μ‹œ 30λΆ„λΆ€ν„° μŒμ‹μ„ μ€€λΉ„ν•˜μ˜€λ‹€. 주된 λ©”λ‰΄λŠ” μŠ€ν…Œμ΄ν¬μ™€ λ‚™μ§€λ³ΆμŒ, λ―Έμ—­κ΅­, μž‘μ±„, μ†Œμ•Ό λ“±μ΄μ—ˆλ‹€', 'μŠ€ν…Œμ΄ν¬λŠ” 자주 ν•˜λŠ” μŒμ‹μ΄μ–΄μ„œ μžμ‹ μ΄ μ€€λΉ„ν•˜λ €κ³  ν–ˆλ‹€', 'μ•žλ’€λ„ 1λΆ„μ”© 3번 뒀집고 λž˜μŠ€νŒ…μ„ 잘 ν•˜λ©΄ μœ‘μ¦™μ΄ κ°€λ“ν•œ μŠ€ν…Œμ΄ν¬κ°€ μ€€λΉ„λ˜λ‹€', '아내도 그런 μŠ€ν…Œμ΄ν¬λ₯Ό μ’‹μ•„ν•œλ‹€. 그런데 상상도 λͺ»ν•œ 일이 λ²Œμ΄μ§€κ³  λ§μ•˜λ‹€', '보톡 μ‹œμ¦ˆλ‹μ΄ λ˜μ§€ μ•Šμ€ μ›μœ‘μ„ μ‚¬μ„œ μŠ€ν…Œμ΄ν¬λ₯Ό ν–ˆλŠ”λ°, μ΄λ²ˆμ—λŠ” μ‹œμ¦ˆλ‹μ΄ 된 뢀챗살을 κ΅¬μž…ν•΄μ„œ ν–ˆλ‹€', '그런데 μΌ€μ΄μŠ€ μ•ˆμ— λ°©λΆ€μ œκ°€ λ“€μ–΄μžˆλŠ” 것을 μΈμ§€ν•˜μ§€ λͺ»ν•˜κ³  λ°©λΆ€μ œμ™€ λ™μ‹œμ— ν”„λΌμ΄νŒ¬μ— μ˜¬λ €λ†“μ„ 것이닀', '그것도 인지 λͺ»ν•œ 체... μ•žλ©΄μ„ μ„Ό λΆˆμ— 1뢄을 κ΅½κ³  λ’€μ§‘λŠ” μˆœκ°„ λ°©λΆ€μ œκ°€ ν•¨κ»˜ ꡬ어진 것을 μ•Œμ•˜λ‹€', 'μ•„λ‚΄μ˜ 생일이라 λ§›μžˆκ²Œ κ΅¬μ›Œλ³΄κ³  μ‹Άμ—ˆλŠ”λ° μ–΄μ²˜κ΅¬λ‹ˆμ—†λŠ” 상황이 λ°œμƒν•œ 것이닀', 'λ°©λΆ€μ œκ°€ μ„Ό λΆˆμ— λ…Ήμ•„μ„œ κ·ΈλŸ°μ§€ 물처럼 ν˜λŸ¬λ‚΄λ Έλ‹€', ' 고민을 ν–ˆλ‹€. λ°©λΆ€μ œκ°€ 묻은 λΆ€λ¬Έλ§Œ μ œκ±°ν•˜κ³  λ‹€μ‹œ ꡬ울까 ν–ˆλŠ”λ° λ°©λΆ€μ œμ— μ ˆλŒ€ 먹지 λ§λΌλŠ” 문ꡬ가 μžˆμ–΄μ„œ μ•„κΉμ§€λ§Œ λ²„λ¦¬λŠ” λ°©ν–₯을 ν–ˆλ‹€', 'λ„ˆλ¬΄λ‚˜ μ•ˆνƒ€κΉŒμ› λ‹€', 'μ•„μΉ¨ 일찍 μ•„λ‚΄κ°€ μ’‹μ•„ν•˜λŠ” μŠ€ν…Œμ΄ν¬λ₯Ό μ€€λΉ„ν•˜κ³  그것을 λ§›μžˆκ²Œ λ¨ΉλŠ” μ•„λ‚΄μ˜ λͺ¨μŠ΅μ„ 보고 μ‹Άμ—ˆλŠ”λ° μ „ν˜€ 생각지도 λͺ»ν•œ 상황이 λ°œμƒν•΄μ„œ... ν•˜μ§€λ§Œ 정신을 μΆ”μŠ€λ₯΄κ³  λ°”λ‘œ λ‹€λ₯Έ λ©”λ‰΄λ‘œ λ³€κ²½ν–ˆλ‹€', 'μ†Œμ•Ό, μ†Œμ‹œμ§€ μ•Όμ±„λ³ΆμŒ..', 'μ•„λ‚΄κ°€ μ’‹μ•„ν•˜λŠ”μ§€ λͺ¨λ₯΄κ² μ§€λ§Œ 냉μž₯κ³  μ•ˆμ— μžˆλŠ” ν›„λž‘ν¬μ†Œμ„Έμ§€λ₯Ό λ³΄λ‹ˆ λ°”λ‘œ μ†Œμ•Όλ₯Ό ν•΄μ•Όκ² λ‹€λŠ” 생각이 λ“€μ—ˆλ‹€. μŒμ‹μ€ μ„±κ³΅μ μœΌλ‘œ 완성이 λ˜μ—ˆλ‹€', '40번째λ₯Ό λ§žμ΄ν•˜λŠ” μ•„λ‚΄μ˜ 생일은 μ„±κ³΅μ μœΌλ‘œ μ€€λΉ„κ°€ λ˜μ—ˆλ‹€', 'λ§›μžˆκ²Œ λ¨Ήμ–΄ μ€€ μ•„λ‚΄μ—κ²Œλ„ κ°μ‚¬ν–ˆλ‹€', '맀년 μ•„λ‚΄μ˜ 생일에 λ§žμ΄ν•˜λ©΄ μ•„μΉ¨λ§ˆλ‹€ 생일을 μ°¨λ €μ•Όκ² λ‹€. μ˜€λŠ˜λ„ 즐거운 ν•˜λ£¨κ°€ λ˜μ—ˆμœΌλ©΄ μ’‹κ² λ‹€', 'μƒμΌμ΄λ‹ˆκΉŒ~']
#각 λ¬Έμž₯의 vectorκ°’ encoding
document_embeddings = model.encode(docs)

query = '생일을 λ§žμ΄ν•˜μ—¬ 아침을 μ€€λΉ„ν•˜κ² λ‹€κ³  μ˜€μ „ 8μ‹œ 30λΆ„λΆ€ν„° μŒμ‹μ„ μ€€λΉ„ν•˜μ˜€λ‹€'
query_embedding = model.encode(query)

top_k = min(10, len(docs))

# 코사인 μœ μ‚¬λ„ 계산 ν›„,
cos_scores = util.pytorch_cos_sim(query_embedding, document_embeddings)[0]

# 코사인 μœ μ‚¬λ„ 순으둜 λ¬Έμž₯ μΆ”μΆœ
top_results = torch.topk(cos_scores, k=top_k)

print(f"μž…λ ₯ λ¬Έμž₯: {query}")
print(f"\n<μž…λ ₯ λ¬Έμž₯κ³Ό μœ μ‚¬ν•œ {top_k} 개의 λ¬Έμž₯>\n")

for i, (score, idx) in enumerate(zip(top_results[0], top_results[1])):
    print(f"{i+1}: {docs[idx]} {'(μœ μ‚¬λ„: {:.4f})'.format(score)}\n")

Evaluation Results

For an automated evaluation of this model, see the Sentence Embeddings Benchmark: https://seb.sbert.net

Training

The model was trained with the parameters:

DataLoader:

torch.utils.data.dataloader.DataLoader of length 142 with parameters:

{'batch_size': 32, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}

Loss:

sentence_transformers.losses.CosineSimilarityLoss.CosineSimilarityLoss

Parameters of the fit()-Method:

{
    "epochs": 4,
    "evaluation_steps": 1000,
    "evaluator": "sentence_transformers.evaluation.EmbeddingSimilarityEvaluator.EmbeddingSimilarityEvaluator",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'transformers.optimization.AdamW'>",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 10000,
    "weight_decay": 0.01
}

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)

Citing & Authors