---
license: mit
language:
- zh
pipeline_tag: sentence-similarity
---
# PromCSE(sup)
## Data List
The following datasets are all in Chinese.
| Dataset | Train size | Valid size | Test size |
|:-------:|:----------:|:----------:|:---------:|
| [ATEC](https://pan.baidu.com/s/1gmnyz9emqOXwaHhSM9CCUA?pwd=b17c) | 62477 | 20000 | 20000 |
| [BQ](https://pan.baidu.com/s/1M-e01yyy5NacVPrph9fbaQ?pwd=tis9) | 100000 | 10000 | 10000 |
| [LCQMC](https://pan.baidu.com/s/16DfE7fHrCkk4e8a2j3SYUg?pwd=bc8w) | 238766 | 8802 | 12500 |
| [PAWSX](https://pan.baidu.com/s/1ox0tJY3ZNbevHDeAqDBOPQ?pwd=mgjn) | 49401 | 2000 | 2000 |
| [STS-B](https://pan.baidu.com/s/10yfKfTtcmLQ70-jzHIln1A?pwd=gf8y) | 5231 | 1458 | 1361 |
| [*SNLI*](https://pan.baidu.com/s/1NOgA7JwWghiauwGAUvcm7w?pwd=s75v) | 146828 | 2699 | 2618 |
| [*MNLI*](https://pan.baidu.com/s/1xjZKtWk3MAbJ6HX4pvXJ-A?pwd=2kte) | 122547 | 2932 | 2397 |
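The file layout inside these archives is not described here. As a starting point, a minimal loading sketch, assuming each file stores one tab-separated `sentence1<TAB>sentence2<TAB>label` triple per line; the path and field order are assumptions, so adjust them to the files you actually download.

```python
# Hedged sketch: load a sentence-pair dataset, assuming one
# "sentence1<TAB>sentence2<TAB>label" triple per line (an assumption).
def load_pairs(path):
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) == 3:  # skip malformed lines
                sent1, sent2, label = parts
                pairs.append((sent1, sent2, label))
    return pairs

train = load_pairs("LCQMC/train.txt")  # path is an assumption
print(len(train))  # expect 238766 for the LCQMC train split
```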
## Model List
All evaluation datasets are in Chinese, and every method below uses the same backbone language model, **RoBERTa Large**. Because some datasets have small test sets, which could make the evaluation noisy, we evaluate on the train, valid, and test splits combined and report a **weighted average (w-avg)** across the splits (a minimal sketch of this computation follows the table).
| Model | STS-B(w-avg) | ATEC | BQ | LCQMC | PAWSX | Avg. |
|:-----:|:------------:|:----:|:--:|:-----:|:-----:|:----:|
| [BAAI/bge-large-zh](https://huggingface.co/BAAI/bge-large-zh) | 78.61 | - | - | - | - | - |
| [BAAI/bge-large-zh-v1.5](https://huggingface.co/BAAI/bge-large-zh-v1.5) | 79.07 | - | - | - | - | - |
| [hellonlp/simcse-roberta-large-zh](https://huggingface.co/hellonlp/simcse-roberta-large-zh) | 81.32 | - | - | - | - | - |
| [hellonlp/promcse-bert-large-zh](https://huggingface.co/hellonlp/promcse-bert-large-zh) | 81.63 | - | - | - | - | - |
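The exact weighting scheme is not spelled out above; below is a minimal sketch, assuming each split's score is weighted by its number of examples. The split sizes are STS-B's from the data table; the scores are illustrative placeholders, not real results.

```python
# Size-weighted average (w-avg) over dataset splits.
# Sizes are STS-B's from the table above; scores are placeholders.
splits = {
    "train": (5231, 0.81),  # (num_examples, illustrative score)
    "valid": (1458, 0.80),
    "test":  (1361, 0.79),
}

total = sum(n for n, _ in splits.values())
w_avg = sum(n * s for n, s in splits.values()) / total
print(f"w-avg = {w_avg:.4f}")  # ~0.8048 for these placeholder numbers
```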
## Uses
To use the tool, first install the `promcse` package from [PyPI](https://pypi.org/project/promcse/):
```bash
pip install promcse
```
After installing the package, you can load our model with two lines of code:
```python
from promcse import PromCSE

# "cls" selects the pooler; 10 is the prompt length (pre_seq_len).
model = PromCSE("hellonlp/promcse-bert-large-zh", "cls", 10)
```
Then you can use our model to encode sentences into embeddings:
```python
embeddings = model.encode("武汉是一个美丽的城市。")
print(embeddings.shape)
#torch.Size([1024])
```
Finally, compute the cosine similarities between two groups of sentences:
```python
sentences_a = ['你好吗']
sentences_b = ['你怎么样','我吃了一个苹果','你过的好吗','你还好吗','你',
'你好不好','你好不好呢','我不开心','我好开心啊', '你吃饭了吗',
'你好吗','你现在好吗','你好个鬼']
similarities = model.similarity(sentences_a, sentences_b)
print(similarities)
# [(1.0, '你好吗'),
# (0.9324, '你好不好'),
# (0.8945, '你好不好呢'),
# (0.8845, '你还好吗'),
# (0.8382, '你现在好吗'),
# (0.8072, '你过的好吗'),
# (0.7648, '你怎么样'),
# (0.6736, '你'),
# (0.5706, '你吃饭了吗'),
# (0.5417, '你好个鬼'),
# (0.3747, '我好开心啊'),
# (0.0777, '我不开心'),
# (0.0624, '我吃了一个苹果')]
```
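If you need raw scores rather than the ranked list returned by `model.similarity`, here is a minimal sketch that recomputes one pair's score directly from the `encode` embeddings. It assumes `similarity` is plain cosine similarity over `encode` outputs, which the results above suggest but the package does not state here.

```python
import torch.nn.functional as F

# Encode both sentences into 1024-dim vectors (see the shape check above).
emb_a = model.encode('你好吗')
emb_b = model.encode('你还好吗')

# Cosine similarity between the two embeddings.
score = F.cosine_similarity(emb_a.unsqueeze(0), emb_b.unsqueeze(0)).item()
print(round(score, 4))  # expected to match the 0.8845 entry above
```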