---
language:
- zh
license: mit
pipeline_tag: sentence-similarity
---

# SimCSE(sup)

## Model List

The evaluation datasets are in Chinese, and all methods use the same language model, **RoBERTa large**.

| Model | STS-B(w-avg) | ATEC | BQ | LCQMC | PAWSX | Avg. |
|:-----------------------:|:------------:|:----:|:--:|:-----:|:-----:|:----:|
| [BAAI/bge-large-zh](https://huggingface.co/BAAI/bge-large-zh) | 78.61 | - | - | - | - | - |
| [BAAI/bge-large-zh-v1.5](https://huggingface.co/BAAI/bge-large-zh-v1.5) | 79.07 | - | - | - | - | - |
| [hellonlp/simcse-large-zh](https://huggingface.co/hellonlp/simcse-roberta-large-zh) | 81.32 | - | - | - | - | - |

## Data List

The following datasets are all in Chinese.

| Data | Link | Size (train) | Size (valid) | Size (test) |
|:-----:|:----:|:------------:|:------------:|:-----------:|
| STS-B | [STS-B](https://link.zhihu.com/?target=https%3A//pan.baidu.com/s/10yfKfTtcmLQ70-jzHIln1A%3Fpwd%3Dgf8y) | 5231 | 1458 | 1361 |
| ATEC | [ATEC](https://link.zhihu.com/?target=https%3A//pan.baidu.com/s/1gmnyz9emqOXwaHhSM9CCUA%3Fpwd%3Db17c) | 62477 | 20000 | 20000 |
| BQ | [BQ](https://link.zhihu.com/?target=https%3A//pan.baidu.com/s/1M-e01yyy5NacVPrph9fbaQ%3Fpwd%3Dtis9) | 100000 | 10000 | 10000 |
| LCQMC | [LCQMC](https://pan.baidu.com/s/16DfE7fHrCkk4e8a2j3SYUg?pwd=bc8w) | 238766 | 8802 | 12500 |
| PAWSX | [PAWSX](https://link.zhihu.com/?target=https%3A//pan.baidu.com/s/1ox0tJY3ZNbevHDeAqDBOPQ%3Fpwd%3Dmgjn) | 49401 | 2000 | 2000 |
| SNLI | [SNLI](https://link.zhihu.com/?target=https%3A//pan.baidu.com/s/1NOgA7JwWghiauwGAUvcm7w%3Fpwd%3Ds75v) | 146828 | 2699 | 2618 |
| MNLI | [MNLI](https://link.zhihu.com/?target=https%3A//pan.baidu.com/s/1xjZKtWk3MAbJ6HX4pvXJ-A%3Fpwd%3D2kte) | 122547 | 2932 | 2397 |

## Uses

You can use our model to encode sentences into embeddings:

```python
import torch
from transformers import BertTokenizer
from transformers import BertModel
from sklearn.metrics.pairwise import cosine_similarity

# Load the tokenizer and model.
simcse_sup_path = "hellonlp/simcse-roberta-large-zh"
tokenizer = BertTokenizer.from_pretrained(simcse_sup_path)
MODEL = BertModel.from_pretrained(simcse_sup_path)
MODEL.eval()

def get_vector_simcse(sentence):
    """
    Compute the SimCSE embedding of a sentence.
    """
    input_ids = torch.tensor(tokenizer.encode(sentence)).unsqueeze(0)
    with torch.no_grad():
        output = MODEL(input_ids)
    # Use the [CLS] token representation as the sentence embedding.
    return output.last_hidden_state[:, 0].squeeze(0)

embeddings = get_vector_simcse("武汉是一个美丽的城市。")
print(embeddings.shape)  # torch.Size([768])
```

You can also compute the cosine similarity between two sentences:

```python
def get_similarity_two(sentence1, sentence2):
    # Encode both sentences and score them with cosine similarity.
    vec1 = get_vector_simcse(sentence1).tolist()
    vec2 = get_vector_simcse(sentence2).tolist()
    similarity = cosine_similarity([vec1], [vec2]).tolist()[0][0]
    return similarity

sentence1 = '你好吗'
sentence2 = '你还好吗'
result = get_similarity_two(sentence1, sentence2)
print(result)  # 0.848331
```
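
If you want to score more than two sentences at once, the same functions can be reused to build a pairwise similarity matrix. The snippet below is a minimal sketch that assumes the `get_vector_simcse` function defined above; the helper name `get_similarity_matrix` and the example sentences are illustrative only, not part of the released API.

```python
from sklearn.metrics.pairwise import cosine_similarity

def get_similarity_matrix(sentences):
    """Compute pairwise cosine similarities for a list of sentences (illustrative helper)."""
    vectors = [get_vector_simcse(s).tolist() for s in sentences]
    return cosine_similarity(vectors, vectors)

sentences = ['你好吗', '你还好吗', '武汉是一个美丽的城市。']
matrix = get_similarity_matrix(sentences)
print(matrix.shape)  # (3, 3); matrix[i][j] is the similarity between sentences[i] and sentences[j]
```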