simcse-roberta-base-zh / README.md

	---
	language:
	- zh
	license: mit
	pipeline_tag: sentence-similarity
	---

	# SimCSE(sup)


	## Data List
	The following datasets are all in Chinese.
	| Data | Size (train) | Size (valid) | Size (test) |
	|:----------------------:|:----------:|:----------:|:----------:|
	| [ATEC](https://link.zhihu.com/?target=https%3A//pan.baidu.com/s/1gmnyz9emqOXwaHhSM9CCUA%3Fpwd%3Db17c) | 62477 | 20000 | 20000 |
	| [BQ](https://link.zhihu.com/?target=https%3A//pan.baidu.com/s/1M-e01yyy5NacVPrph9fbaQ%3Fpwd%3Dtis9) | 100000 | 10000 | 10000 |
	| [LCQMC](https://pan.baidu.com/s/16DfE7fHrCkk4e8a2j3SYUg?pwd=bc8w) | 238766 | 8802 | 12500 |
	| [PAWSX](https://link.zhihu.com/?target=https%3A//pan.baidu.com/s/1ox0tJY3ZNbevHDeAqDBOPQ%3Fpwd%3Dmgjn) | 49401 | 2000 | 2000 |
	| [STS-B](https://link.zhihu.com/?target=https%3A//pan.baidu.com/s/10yfKfTtcmLQ70-jzHIln1A%3Fpwd%3Dgf8y) | 5231 | 1458 | 1361 |
	| [SNLI](https://link.zhihu.com/?target=https%3A//pan.baidu.com/s/1NOgA7JwWghiauwGAUvcm7w%3Fpwd%3Ds75v) | 146828 | 2699 | 2618 |
	| [MNLI](https://link.zhihu.com/?target=https%3A//pan.baidu.com/s/1xjZKtWk3MAbJ6HX4pvXJ-A%3Fpwd%3D2kte) | 122547 | 2932 | 2397 |




	## Model List
	All evaluation datasets are in Chinese, and we applied the different methods to the same language model, RoBERTa base. Because the test split of some datasets is small, which can cause large deviations in the measured accuracy, the evaluation uses the train, valid and test splits together, and the final result is their weighted average (w-avg).
	| Model | STS-B (w-avg) | ATEC | BQ | LCQMC | PAWSX | Avg. |
	|:-----------------------:|:------------:|:-----------:|:----------:|:-------------:|:------------:|:----------:|
	| BERT-Whitening | 65.27 | - | - | - | - | - |
	| SimBERT | 70.01 | - | - | - | - | - |
	| SBERT-Whitening | 71.75 | - | - | - | - | - |
	| [BAAI/bge-base-zh](https://huggingface.co/BAAI/bge-base-zh) | 78.61 | - | - | - | - | - |
	| [hellonlp/simcse-base-zh(sup)](https://huggingface.co/hellonlp/simcse-roberta-base-zh) | 80.96 | - | - | - | - | - |
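
The weighted average (w-avg) over the train, valid and test splits can be sketched as follows. This is a minimal illustration that assumes each split's score is weighted by its number of examples; the README does not spell out the exact weighting, and `weighted_average` and the per-split scores are hypothetical. Only the STS-B split sizes come from the Data List table.

```python
def weighted_average(scores, sizes):
    """Average per-split scores, weighting each split by its number of examples."""
    if len(scores) != len(sizes):
        raise ValueError("scores and sizes must have the same length")
    total = sum(sizes)
    if total == 0:
        raise ValueError("total size must be positive")
    return sum(score * size for score, size in zip(scores, sizes)) / total


# STS-B split sizes from the Data List table (train, valid, test).
stsb_sizes = [5231, 1458, 1361]
# Hypothetical per-split scores, for illustration only; not results of this model.
example_scores = [0.80, 0.82, 0.78]

w_avg = weighted_average(example_scores, stsb_sizes)
print(round(w_avg, 4))
```

With this weighting, the large train split dominates the result, which is what makes the combined score less sensitive to noise in a small test split.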





	## Uses
	You can use our model to encode sentences into embeddings:
	```python
	import torch
	from transformers import BertTokenizer
	from transformers import BertModel
	from sklearn.metrics.pairwise import cosine_similarity

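	# Load the tokenizer and model from the Hugging Face Hub
	simcse_sup_path = "hellonlp/simcse-roberta-base-zh"
	tokenizer = BertTokenizer.from_pretrained(simcse_sup_path)
	MODEL = BertModel.from_pretrained(simcse_sup_path)

	def get_vector_simcse(sentence):
	    """
	    Compute the SimCSE semantic vector of a sentence.
	    """
	    input_ids = torch.tensor(tokenizer.encode(sentence)).unsqueeze(0)
	    with torch.no_grad():  # gradients are not needed for inference
	        output = MODEL(input_ids)
	    # Use the final hidden state of the first ([CLS]) token as the embedding
	    return output.last_hidden_state[:, 0].squeeze(0)

	# "Wuhan is a beautiful city."
	embeddings = get_vector_simcse("武汉是一个美丽的城市。")
	print(embeddings.shape)
	# torch.Size([768])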
	```

	You can also compute the cosine similarity between two sentences:
	```python
	def get_similarity_two(sentence1, sentence2):
	    """Return the cosine similarity between the embeddings of two sentences."""
	    vec1 = get_vector_simcse(sentence1).tolist()
	    vec2 = get_vector_simcse(sentence2).tolist()
	    similarity = cosine_similarity([vec1], [vec2]).tolist()[0][0]
	    return similarity

	sentence1 = '你好吗'
	sentence2 = '你还好吗'
	result = get_similarity_two(sentence1, sentence2)
	print(result)  # 0.7996
	# Scores for other candidate sentences, apparently against '你好吗', as (score, sentence):
	# (1.0, '你好吗')
	# (0.8247, '你好不好')
	# (0.8217, '你现在好吗')
	# (0.7976, '你还好吗')
	# (0.7918, '你好不好呢')
	# (0.712, '你过的好吗')
	# (0.6986, '你怎么样')
	# (0.6693, '你')
	# (0.5442, '你好个鬼')
	# (0.4516, '你吃饭了吗')
	# (0.4, '我好开心啊')
	# (0.29, '我不开心')
	# (0.2782, '我吃了一个苹果')
	```
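
The commented `(score, sentence)` pairs above read like a ranking of candidate sentences against one query, although the README does not show the code that produced them. A minimal sketch of such a ranking follows, assuming the sentence vectors have already been computed (for example with `get_vector_simcse(...).tolist()`). The helpers `cosine` and `rank_by_similarity` are hypothetical names, and toy 2-D vectors stand in for real 768-dimensional embeddings so the block runs without downloading the model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def rank_by_similarity(query_vec, candidates):
    """Return (score, sentence) pairs sorted from most to least similar.

    `candidates` is a list of (sentence, vector) pairs.
    """
    scored = [(round(cosine(query_vec, vec), 4), sentence) for sentence, vec in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Toy vectors for illustration only; real inputs would be model embeddings.
query = [1.0, 0.0]
toy_candidates = [
    ("same direction", [2.0, 0.0]),
    ("orthogonal", [0.0, 3.0]),
    ("diagonal", [1.0, 1.0]),
]
ranking = rank_by_similarity(query, toy_candidates)
for score, sentence in ranking:
    print((score, sentence))
```

Computing each candidate vector once and reusing it avoids re-encoding the query for every pair, which `get_similarity_two` would do if called in a loop.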