---
language:
  - zh
license: mit
pipeline_tag: sentence-similarity
---

# SimCSE(sup)

## Model List

The evaluation datasets are in Chinese, and we use the same RoBERTa-base language model across the different methods.

| Model | STS-B(w-avg) | ATEC | BQ | LCQMC | PAWSX | Avg. |
|-------|--------------|------|----|-------|-------|------|
| BERT-Whitening | 65.27 | - | - | - | - | - |
| SimBERT | 70.01 | - | - | - | - | - |
| SBERT-Whitening | 71.75 | - | - | - | - | - |
| BAAI/bge-base-zh | 78.61 | - | - | - | - | - |
| hellonlp/simcse-base-zh(sup) | 80.96 | - | - | - | - | - |
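
The scores above follow the usual STS reporting convention, i.e. Spearman correlation between model cosine similarities and human-annotated gold scores; this protocol is our assumption and is not stated explicitly in this card. A minimal sketch with toy values:

```python
# Assumed evaluation protocol: Spearman correlation between model cosine
# similarities and gold human ratings. The values below are toy examples only.
from scipy.stats import spearmanr

model_similarities = [0.83, 0.12, 0.55]  # hypothetical model predictions
gold_scores = [4.5, 0.8, 2.9]            # hypothetical human ratings (0-5 scale)

rho, p_value = spearmanr(model_similarities, gold_scores)
print(rho)
# 1.0 for this toy example (the rankings agree perfectly)
```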

## Uses

You can use our model to encode sentences into embeddings:

```python
import torch
from transformers import BertTokenizer
from transformers import BertModel
from sklearn.metrics.pairwise import cosine_similarity

# model
simcse_sup_path = "hellonlp/simcse-roberta-base-zh"
tokenizer = BertTokenizer.from_pretrained(simcse_sup_path)
MODEL = BertModel.from_pretrained(simcse_sup_path)

def get_vector_simcse(sentence):
    """
    Compute the SimCSE semantic vector of a sentence.
    """
    input_ids = torch.tensor(tokenizer.encode(sentence)).unsqueeze(0)
    output = MODEL(input_ids)
    # Use the [CLS] token embedding as the sentence representation.
    return output.last_hidden_state[:, 0].squeeze(0)

embeddings = get_vector_simcse("武汉是一个美丽的城市。")
print(embeddings.shape)
# torch.Size([768])
```
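
If you need to embed many sentences at once, a batched variant can reuse the tokenizer and model loaded above. This is a sketch we added for convenience, not part of the original card, and the function name `get_vector_simcse_batch` is our own:

```python
def get_vector_simcse_batch(sentences):
    """Encode a list of sentences into [CLS] embeddings in one forward pass."""
    inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        output = MODEL(**inputs)
    # One [CLS] vector per sentence, shape: (batch_size, 768).
    return output.last_hidden_state[:, 0]

vectors = get_vector_simcse_batch(["武汉是一个美丽的城市。", "你好吗"])
print(vectors.shape)
# torch.Size([2, 768])
```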

You can also compute the cosine similarity between two sentences:

```python
def get_similarity_two(sentence1, sentence2):
    """Cosine similarity between the SimCSE vectors of two sentences."""
    vec1 = get_vector_simcse(sentence1).tolist()
    vec2 = get_vector_simcse(sentence2).tolist()
    similarity = cosine_similarity([vec1], [vec2]).tolist()[0][0]
    return similarity

sentence1 = '你好吗'
sentence2 = '你还好吗'
result = get_similarity_two(sentence1, sentence2)
print(result)
# 0.7996
```
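
As a small usage example (a sketch of ours, with made-up candidate sentences), the helper above can also rank candidates against a query by semantic similarity:

```python
query = '今天天气怎么样'
candidates = ['今天天气如何', '我想去武汉旅游', '你还好吗']

# Sort candidates by their similarity to the query, highest first.
ranked = sorted(candidates,
                key=lambda s: get_similarity_two(query, s),
                reverse=True)
print(ranked)
```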