YuxinJiang committed
Commit 78131e0 · 1 Parent(s): 2181da0
Update README.md
README.md CHANGED
@@ -1,13 +1,37 @@
|
|
1 |
-
---
|
2 |
-
license: mit
|
3 |
-
---
|
4 |
# PromCSE: Improved Universal Sentence Embeddings with Prompt-based Contrastive Learning and Energy-based Learning
|
5 |
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1lanXViJzbmGM1bwm8AflNUKmrvDidg_3?usp=sharing)
|
6 |
|
7 |
-
|
8 |
-
Published in [**EMNLP 2022**](https://2022.emnlp.org/)
|
9 |
|
10 |
-
|
11 |
|
12 |
We have released our supervised and unsupervised models on Hugging Face, which achieve **Top 1** results on 1 domain-shifted STS task and 4 standard STS tasks:
|
13 |
|
@@ -27,31 +51,73 @@ We have released our supervised and unsupervised models on huggingface, which ac
|
|
27 |
|
28 |
<!-- <img src="https://github.com/YJiangcm/DCPCSE/blob/master/figure/leaderboard.png" width="700" height="380"> -->
|
29 |
|
|
|
30 |
| Model | STS12 | STS13 | STS14 | STS15 | STS16 | STS-B | SICK-R | Avg. |
|
31 |
|:-----------------------:|:-----:|:----------:|:---------:|:-----:|:-----:|:-----:|:-----:|:-----:|
|
32 |
-
| unsup-
|
33 |
-
| sup-
|
34 |
-
| sup-
|
35 |
|
36 |
-
|
37 |
|
38 |
|
39 |
|
40 |
-
##
|
|
|
41 |
|
42 |
-
|
43 |
-
[![Pytorch](https://img.shields.io/badge/pytorch-1.7.1-red?logo=pytorch)](https://pytorch.org/get-started/previous-versions/)
|
44 |
|
45 |
-
|
|
|
|
|
46 |
|
|
|
47 |
```bash
|
48 |
-
pip install
|
49 |
```
|
50 |
|
51 |
## Train PromCSE
|
52 |
|
53 |
In the following section, we describe how to train a PromCSE model by using our code.
|
54 |
|
55 |
|
56 |
### Evaluation
|
57 |
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1lanXViJzbmGM1bwm8AflNUKmrvDidg_3?usp=sharing)
|
@@ -180,84 +246,6 @@ All our experiments are conducted on Nvidia 3090 GPUs.
|
|
180 |
| Valid steps | 125 | 125 | 125 | 125 |
|
181 |
|
182 |
|
183 |
-
## Usage
|
184 |
-
We provide [tool.py](https://github.com/YJiangcm/PromCSE/blob/master/tool.py) which contains the following functions:
|
185 |
-
|
186 |
-
**(1) encode sentences into embedding vectors;
|
187 |
-
(2) compute cosine similarities between sentences;
|
188 |
-
(3) given queries, retrieve the top-k semantically similar sentences for each query.**
|
189 |
-
|
190 |
-
You can try it by running
|
191 |
-
```bash
|
192 |
-
python tool.py \
|
193 |
-
--model_name_or_path YuxinJiang/unsup-promcse-bert-base-uncased \
|
194 |
-
--pooler_type cls_before_pooler \
|
195 |
-
--pre_seq_len 16
|
196 |
-
```
|
197 |
-
|
198 |
-
which is expected to output the following results.
|
199 |
-
```
|
200 |
-
=========Calculate cosine similarities between queries and sentences============
|
201 |
-
|
202 |
-
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1.18it/s]100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 42.26it/s][[0.5904227 0.70516586 0.65185255 0.82756 0.6969594 0.85966974
|
203 |
-
0.58715546 0.8467339 0.6583321 0.6792214 ]
|
204 |
-
[0.6125869 0.73508096 0.61479807 0.6182762 0.6161849 0.59476817
|
205 |
-
0.595963 0.61386335 0.694822 0.938746 ]]
|
206 |
-
|
207 |
-
=========Naive brute force search============
|
208 |
-
|
209 |
-
2022-10-09 11:59:06,004 : Encoding embeddings for sentences...
|
210 |
-
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 46.03it/s]2022-10-09 11:59:06,029 : Building index...
|
211 |
-
2022-10-09 11:59:06,029 : Finished
|
212 |
-
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 95.40it/s]100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 115.25it/s]Retrieval results for query: A man is playing music.
|
213 |
-
A man plays the piano. (cosine similarity: 0.8597)
|
214 |
-
A man plays a guitar. (cosine similarity: 0.8467)
|
215 |
-
A man plays the violin. (cosine similarity: 0.8276)
|
216 |
-
A woman is reading. (cosine similarity: 0.7051)
|
217 |
-
A man is eating food. (cosine similarity: 0.6969)
|
218 |
-
A woman is taking a picture. (cosine similarity: 0.6792)
|
219 |
-
A woman is slicing a meat. (cosine similarity: 0.6583)
|
220 |
-
A man is lifting weights in a garage. (cosine similarity: 0.6518)
|
221 |
-
|
222 |
-
Retrieval results for query: A woman is making a photo.
|
223 |
-
A woman is taking a picture. (cosine similarity: 0.9387)
|
224 |
-
A woman is reading. (cosine similarity: 0.7351)
|
225 |
-
A woman is slicing a meat. (cosine similarity: 0.6948)
|
226 |
-
A man plays the violin. (cosine similarity: 0.6183)
|
227 |
-
A man is eating food. (cosine similarity: 0.6162)
|
228 |
-
A man is lifting weights in a garage. (cosine similarity: 0.6148)
|
229 |
-
A man plays a guitar. (cosine similarity: 0.6139)
|
230 |
-
An animal is biting a persons finger. (cosine similarity: 0.6126)
|
231 |
-
|
232 |
-
|
233 |
-
=========Search with Faiss backend============
|
234 |
-
|
235 |
-
2022-10-09 11:59:06,055 : Loading faiss with AVX2 support.
|
236 |
-
2022-10-09 11:59:06,092 : Successfully loaded faiss with AVX2 support.
|
237 |
-
2022-10-09 11:59:06,093 : Encoding embeddings for sentences...
|
238 |
-
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 4.17it/s]2022-10-09 11:59:06,335 : Building index...
|
239 |
-
2022-10-09 11:59:06,335 : Use GPU-version faiss
|
240 |
-
2022-10-09 11:59:06,447 : Finished
|
241 |
-
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 101.44it/s]Retrieval results for query: A man is playing music.
|
242 |
-
A man plays the piano. (cosine similarity: 0.8597)
|
243 |
-
A man plays a guitar. (cosine similarity: 0.8467)
|
244 |
-
A man plays the violin. (cosine similarity: 0.8276)
|
245 |
-
A woman is reading. (cosine similarity: 0.7052)
|
246 |
-
A man is eating food. (cosine similarity: 0.6970)
|
247 |
-
A woman is taking a picture. (cosine similarity: 0.6792)
|
248 |
-
A woman is slicing a meat. (cosine similarity: 0.6583)
|
249 |
-
A man is lifting weights in a garage. (cosine similarity: 0.6519)
|
250 |
-
|
251 |
-
Retrieval results for query: A woman is making a photo.
|
252 |
-
A woman is taking a picture. (cosine similarity: 0.9387)
|
253 |
-
A woman is reading. (cosine similarity: 0.7351)
|
254 |
-
A woman is slicing a meat. (cosine similarity: 0.6948)
|
255 |
-
A man plays the violin. (cosine similarity: 0.6183)
|
256 |
-
A man is eating food. (cosine similarity: 0.6162)
|
257 |
-
A man is lifting weights in a garage. (cosine similarity: 0.6148)
|
258 |
-
A man plays a guitar. (cosine similarity: 0.6139)
|
259 |
-
An animal is biting a persons finger. (cosine similarity: 0.6126)
|
260 |
-
```
|
261 |
|
262 |
|
263 |
## Citation
|
1 |
# PromCSE: Improved Universal Sentence Embeddings with Prompt-based Contrastive Learning and Energy-based Learning
|
2 |
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1lanXViJzbmGM1bwm8AflNUKmrvDidg_3?usp=sharing)
|
3 |
|
4 |
+
Our code is adapted from [SimCSE](https://github.com/princeton-nlp/SimCSE) and [P-tuning v2](https://github.com/THUDM/P-tuning-v2/). We sincerely thank the authors for their excellent work.
|
|
|
5 |
|
6 |
+
**************************** **Updates** ****************************
|
7 |
+
* 2023/4/5: We released our sentence embedding [Python package](#getting-started).
|
8 |
+
* 2022/3/3: We released a simple [colab notebook](https://colab.research.google.com/drive/1lanXViJzbmGM1bwm8AflNUKmrvDidg_3?usp=sharing) for a quick start!
|
9 |
+
* 2022/1/8: We released our model checkpoints on [Hugging Face](https://huggingface.co/YuxinJiang).
|
10 |
+
* 2022/10/9: We released the second version of [our paper](https://arxiv.org/pdf/2203.06875v2.pdf). Check it out!
|
11 |
+
* 2022/10/6: Our paper has been accepted to [**EMNLP 2022**](https://2022.emnlp.org/).
|
12 |
+
* 2022/3/14: We released the first version of [our paper](https://arxiv.org/pdf/2203.06875v1.pdf). Check it out!
|
13 |
+
|
14 |
+
|
15 |
+
|
16 |
+
|
17 |
+
## Quick Links
|
18 |
+
- [Overview](#overview)
|
19 |
+
- [Model List](#model-list)
|
20 |
+
- [Usage](#usage)
|
21 |
+
- [Train PromCSE](#train-promcse)
|
22 |
+
- [Setups](#setups)
|
23 |
+
- [Evaluation](#evaluation)
|
24 |
+
- [Training](#training)
|
25 |
+
- [Citation](#citation)
|
26 |
+
|
27 |
+
|
28 |
+
|
29 |
+
## Overview
|
30 |
+
<img src="https://github.com/YJiangcm/PromCSE/blob/master/figure/overview.jpg" width="700" height="320">
|
31 |
+
|
32 |
+
|
33 |
+
|
34 |
+
## Model List
|
35 |
|
36 |
We have released our supervised and unsupervised models on Hugging Face, which achieve **Top 1** results on 1 domain-shifted STS task and 4 standard STS tasks:
|
37 |
|
|
|
51 |
|
52 |
<!-- <img src="https://github.com/YJiangcm/DCPCSE/blob/master/figure/leaderboard.png" width="700" height="380"> -->
|
53 |
|
54 |
+
|
55 |
| Model | STS12 | STS13 | STS14 | STS15 | STS16 | STS-B | SICK-R | Avg. |
|
56 |
|:-----------------------:|:-----:|:----------:|:---------:|:-----:|:-----:|:-----:|:-----:|:-----:|
|
57 |
+
| [YuxinJiang/unsup-promcse-bert-base-uncased](https://huggingface.co/YuxinJiang/unsup-promcse-bert-base-uncased) | 73.03 |85.18| 76.70| 84.19 |79.69| 80.62| 70.00| 78.49|
|
58 |
+
| [YuxinJiang/sup-promcse-roberta-base](https://huggingface.co/YuxinJiang/sup-promcse-roberta-base) | 76.75 |85.86| 80.98| 86.51 |83.51| 86.58| 80.41| 82.94|
|
59 |
+
| [YuxinJiang/sup-promcse-roberta-large](https://huggingface.co/YuxinJiang/sup-promcse-roberta-large) | 79.14 |88.64| 83.73| 87.33 |84.57| 87.84| 82.07| 84.76|
|
60 |
|
61 |
+
**Naming rules**: `unsup` and `sup` represent "unsupervised" (trained on Wikipedia corpus) and "supervised" (trained on NLI datasets) respectively.
|
62 |
|
63 |
|
64 |
|
65 |
+
## Usage
|
66 |
+
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1lanXViJzbmGM1bwm8AflNUKmrvDidg_3?usp=sharing)
|
67 |
|
68 |
+
We provide an easy-to-use Python package, `promcse`, which provides the following functions:
|
|
|
69 |
|
70 |
+
**(1) encode sentences into embedding vectors;
|
71 |
+
(2) compute cosine similarities between sentences;
|
72 |
+
(3) given queries, retrieve the top-k semantically similar sentences for each query.**
|
73 |
|
74 |
+
To use the tool, first install the `promcse` package from [PyPI](https://pypi.org/project/promcse/):
|
75 |
```bash
|
76 |
+
pip install promcse
|
77 |
+
```
|
78 |
+
After installing the package, you can load our model with two lines of code:
|
79 |
+
```python
|
80 |
+
from promcse import PromCSE
|
81 |
+
model = PromCSE("YuxinJiang/unsup-promcse-bert-base-uncased", "cls_before_pooler", 16)
|
82 |
+
# model = PromCSE("YuxinJiang/sup-promcse-roberta-base")
|
83 |
+
# model = PromCSE("YuxinJiang/sup-promcse-roberta-large")
|
84 |
+
```
|
85 |
+
|
86 |
+
Then you can use our model for **encoding sentences into embeddings**:
|
87 |
+
```python
|
88 |
+
embeddings = model.encode("A woman is reading.")
|
89 |
+
```
|
90 |
+
|
91 |
+
**Compute the cosine similarities** between two groups of sentences:
|
92 |
+
```python
|
93 |
+
sentences_a = ['A woman is reading.', 'A man is playing a guitar.']
|
94 |
+
sentences_b = ['He plays guitar.', 'A woman is making a photo.']
|
95 |
+
similarities = model.similarity(sentences_a, sentences_b)
|
96 |
```
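For intuition, the result of comparing two groups of sentences can be read as a matrix of pairwise cosine similarities between their embedding vectors. The sketch below is a minimal pure-Python illustration of that computation, not the `promcse` implementation. The helper names `cosine_similarity` and `similarity_matrix` and the toy 2-d vectors are assumptions made for this example, and the row-per-sentence layout is likewise assumed.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def similarity_matrix(vectors_a, vectors_b):
    """Entry [i][j] is the cosine similarity of vectors_a[i] and vectors_b[j]."""
    return [[cosine_similarity(a, b) for b in vectors_b] for a in vectors_a]

# Toy 2-d vectors standing in for sentence embeddings (illustrative only).
toy_a = [[1.0, 0.0], [1.0, 1.0]]
toy_b = [[2.0, 0.0], [0.0, 3.0]]
toy_similarities = similarity_matrix(toy_a, toy_b)
```

Because cosine similarity ignores vector length, `[1.0, 0.0]` and `[2.0, 0.0]` score as identical in direction, while orthogonal vectors score zero.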
|
97 |
|
98 |
+
Or build an index for a group of sentences and **search** among them:
|
99 |
+
```python
|
100 |
+
sentences = ['A woman is reading.', 'A man is playing a guitar.']
|
101 |
+
model.build_index(sentences)
|
102 |
+
results = model.search("He plays guitar.")
|
103 |
+
```
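Conceptually, searching an index of sentences amounts to scoring every indexed embedding against the query embedding and keeping the best matches. The following is a rough brute-force sketch of that idea in pure Python, offered only as an illustration. The function `brute_force_search`, its `top_k` parameter, and the toy vectors are hypothetical and do not describe how `promcse` builds or queries its index.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def brute_force_search(query_vec, sentences, sentence_vecs, top_k=5):
    """Return up to top_k (sentence, score) pairs, highest score first."""
    scored = [
        (sentence, cosine_similarity(query_vec, vec))
        for sentence, vec in zip(sentences, sentence_vecs)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy data: hand-written 2-d vectors stand in for real sentence embeddings.
toy_sentences = ["A woman is reading.", "A man is playing a guitar."]
toy_vecs = [[0.0, 1.0], [1.0, 0.2]]
toy_query = [1.0, 0.0]  # stands in for the embedding of "He plays guitar."
toy_results = brute_force_search(toy_query, toy_sentences, toy_vecs, top_k=1)
```

An exhaustive scan like this is linear in the number of indexed sentences, which is why approximate or GPU-backed indexes are commonly used for large collections.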
|
104 |
+
|
105 |
+
|
106 |
+
|
107 |
## Train PromCSE
|
108 |
|
109 |
In the following section, we describe how to train a PromCSE model by using our code.
|
110 |
|
111 |
+
### Setups
|
112 |
+
|
113 |
+
[![Python](https://img.shields.io/badge/python-3.8.2-blue?logo=python&logoColor=FED643)](https://www.python.org/downloads/release/python-382/)
|
114 |
+
[![Pytorch](https://img.shields.io/badge/pytorch-1.7.1-red?logo=pytorch)](https://pytorch.org/get-started/previous-versions/)
|
115 |
+
|
116 |
+
Run the following command to install the remaining dependencies:
|
117 |
+
|
118 |
+
```bash
|
119 |
+
pip install -r requirements.txt
|
120 |
+
```
|
121 |
|
122 |
### Evaluation
|
123 |
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1lanXViJzbmGM1bwm8AflNUKmrvDidg_3?usp=sharing)
|
|
|
246 |
| Valid steps | 125 | 125 | 125 | 125 |
|
247 |
|
248 |
|
249 |
|
250 |
|
251 |
## Citation
|