---
language:
- zh
pipeline_tag: sentence-similarity
tags:
- PEG
- feature-extraction
- sentence-similarity
- transformers
- mteb
model-index:
- name: PEG
  results:
  - task:
      type: Reranking
    dataset:
      type: C-MTEB/CMedQAv1-reranking
      name: MTEB CMedQAv1
      config: default
      split: test
      revision: None
    metrics:
    - type: map
      value: 84.09137463267582
    - type: mrr
      value: 86.6288888888889
  - task:
      type: Reranking
    dataset:
      type: C-MTEB/CMedQAv2-reranking
      name: MTEB CMedQAv2
      config: default
      split: test
      revision: None
    metrics:
    - type: map
      value: 86.55765031914974
    - type: mrr
      value: 89.4325396825397
  - task:
      type: Retrieval
    dataset:
      type: C-MTEB/CmedqaRetrieval
      name: MTEB CmedqaRetrieval
      config: default
      split: dev
      revision: None
    metrics:
    - type: map_at_1
      value: 23.21
    - type: map_at_10
      value: 33.455
    - type: map_at_100
      value: 35.176
    - type: map_at_1000
      value: 35.332
    - type: map_at_3
      value: 30.058
    - type: map_at_5
      value: 31.863999999999997
    - type: mrr_at_1
      value: 35.634
    - type: mrr_at_10
      value: 42.277
    - type: mrr_at_100
      value: 43.281
    - type: mrr_at_1000
      value: 43.349
    - type: mrr_at_3
      value: 40.127
    - type: mrr_at_5
      value: 41.223
    - type: ndcg_at_1
      value: 35.634
    - type: ndcg_at_10
      value: 39.226
    - type: ndcg_at_100
      value: 46.538000000000004
    - type: ndcg_at_1000
      value: 49.342999999999996
    - type: ndcg_at_3
      value: 35.004999999999995
    - type: ndcg_at_5
      value: 36.492000000000004
    - type: precision_at_1
      value: 35.634
    - type: precision_at_10
      value: 8.591999999999999
    - type: precision_at_100
      value: 1.467
    - type: precision_at_1000
      value: 0.181
    - type: precision_at_3
      value: 19.613
    - type: precision_at_5
      value: 13.988
    - type: recall_at_1
      value: 23.21
    - type: recall_at_10
      value: 47.728
    - type: recall_at_100
      value: 78.669
    - type: recall_at_1000
      value: 97.506
    - type: recall_at_3
      value: 34.361000000000004
    - type: recall_at_5
      value: 39.486
  - task:
      type: Retrieval
    dataset:
      type: C-MTEB/CovidRetrieval
      name: MTEB CovidRetrieval
      config: default
      split: dev
      revision: None
    metrics:
    - type: map_at_1
      value: 70.759
    - type: map_at_10
      value: 79.314
    - type: map_at_100
      value: 79.557
    - type: map_at_1000
      value: 79.561
    - type: map_at_3
      value: 77.819
    - type: map_at_5
      value: 78.717
    - type: mrr_at_1
      value: 71.02199999999999
    - type: mrr_at_10
      value: 79.361
    - type: mrr_at_100
      value: 79.605
    - type: mrr_at_1000
      value: 79.609
    - type: mrr_at_3
      value: 77.959
    - type: mrr_at_5
      value: 78.82900000000001
    - type: ndcg_at_1
      value: 70.917
    - type: ndcg_at_10
      value: 82.931
    - type: ndcg_at_100
      value: 84.016
    - type: ndcg_at_1000
      value: 84.155
    - type: ndcg_at_3
      value: 79.994
    - type: ndcg_at_5
      value: 81.57799999999999
    - type: precision_at_1
      value: 70.917
    - type: precision_at_10
      value: 9.504999999999999
    - type: precision_at_100
      value: 0.9990000000000001
    - type: precision_at_1000
      value: 0.101
    - type: precision_at_3
      value: 28.908
    - type: precision_at_5
      value: 18.145
    - type: recall_at_1
      value: 70.759
    - type: recall_at_10
      value: 93.994
    - type: recall_at_100
      value: 98.84100000000001
    - type: recall_at_1000
      value: 100.0
    - type: recall_at_3
      value: 86.17
    - type: recall_at_5
      value: 89.96300000000001
  - task:
      type: Retrieval
    dataset:
      type: C-MTEB/DuRetrieval
      name: MTEB DuRetrieval
      config: default
      split: dev
      revision: None
    metrics:
    - type: map_at_1
      value: 26.344
    - type: map_at_10
      value: 81.26
    - type: map_at_100
      value: 84.011
    - type: map_at_1000
      value: 84.05
    - type: map_at_3
      value: 56.384
    - type: map_at_5
      value: 71.141
    - type: mrr_at_1
      value: 90.60000000000001
    - type: mrr_at_10
      value: 93.77
    - type: mrr_at_100
      value: 93.816
    - type: mrr_at_1000
      value: 93.82000000000001
    - type: mrr_at_3
      value: 93.542
    - type: mrr_at_5
      value: 93.687
    - type: ndcg_at_1
      value: 90.60000000000001
    - type: ndcg_at_10
      value: 88.352
    - type: ndcg_at_100
      value: 90.767
    - type: ndcg_at_1000
      value: 91.118
    - type: ndcg_at_3
      value: 87.21900000000001
    - type: ndcg_at_5
      value: 86.163
    - type: precision_at_1
      value: 90.60000000000001
    - type: precision_at_10
      value: 42.3
    - type: precision_at_100
      value: 4.82
    - type: precision_at_1000
      value: 0.48900000000000005
    - type: precision_at_3
      value: 78.25
    - type: precision_at_5
      value: 66.03999999999999
    - type: recall_at_1
      value: 26.344
    - type: recall_at_10
      value: 89.59
    - type: recall_at_100
      value: 97.584
    - type: recall_at_1000
      value: 99.46300000000001
    - type: recall_at_3
      value: 58.87
    - type: recall_at_5
      value: 75.81099999999999
  - task:
      type: Retrieval
    dataset:
      type: C-MTEB/EcomRetrieval
      name: MTEB EcomRetrieval
      config: default
      split: dev
      revision: None
    metrics:
    - type: map_at_1
      value: 54.1
    - type: map_at_10
      value: 63.478
    - type: map_at_100
      value: 64.048
    - type: map_at_1000
      value: 64.061
    - type: map_at_3
      value: 61.0
    - type: map_at_5
      value: 62.585
    - type: mrr_at_1
      value: 54.1
    - type: mrr_at_10
      value: 63.478
    - type: mrr_at_100
      value: 64.048
    - type: mrr_at_1000
      value: 64.061
    - type: mrr_at_3
      value: 61.0
    - type: mrr_at_5
      value: 62.585
    - type: ndcg_at_1
      value: 54.1
    - type: ndcg_at_10
      value: 68.14099999999999
    - type: ndcg_at_100
      value: 70.703
    - type: ndcg_at_1000
      value: 71.045
    - type: ndcg_at_3
      value: 63.164
    - type: ndcg_at_5
      value: 66.013
    - type: precision_at_1
      value: 54.1
    - type: precision_at_10
      value: 8.28
    - type: precision_at_100
      value: 0.943
    - type: precision_at_1000
      value: 0.097
    - type: precision_at_3
      value: 23.133
    - type: precision_at_5
      value: 15.260000000000002
    - type: recall_at_1
      value: 54.1
    - type: recall_at_10
      value: 82.8
    - type: recall_at_100
      value: 94.3
    - type: recall_at_1000
      value: 97.0
    - type: recall_at_3
      value: 69.39999999999999
    - type: recall_at_5
      value: 76.3
  - task:
      type: Retrieval
    dataset:
      type: C-MTEB/MMarcoRetrieval
      name: MTEB MMarcoRetrieval
      config: default
      split: dev
      revision: None
    metrics:
    - type: map_at_1
      value: 71.329
    - type: map_at_10
      value: 80.038
    - type: map_at_100
      value: 80.28
    - type: map_at_1000
      value: 80.286
    - type: map_at_3
      value: 78.435
    - type: map_at_5
      value: 79.509
    - type: mrr_at_1
      value: 73.711
    - type: mrr_at_10
      value: 80.53
    - type: mrr_at_100
      value: 80.743
    - type: mrr_at_1000
      value: 80.75
    - type: mrr_at_3
      value: 79.181
    - type: mrr_at_5
      value: 80.077
    - type: ndcg_at_1
      value: 73.711
    - type: ndcg_at_10
      value: 83.333
    - type: ndcg_at_100
      value: 84.375
    - type: ndcg_at_1000
      value: 84.56700000000001
    - type: ndcg_at_3
      value: 80.374
    - type: ndcg_at_5
      value: 82.145
    - type: precision_at_1
      value: 73.711
    - type: precision_at_10
      value: 9.894
    - type: precision_at_100
      value: 1.04
    - type: precision_at_1000
      value: 0.106
    - type: precision_at_3
      value: 30.067
    - type: precision_at_5
      value: 18.989
    - type: recall_at_1
      value: 71.329
    - type: recall_at_10
      value: 92.917
    - type: recall_at_100
      value: 97.543
    - type: recall_at_1000
      value: 99.08500000000001
    - type: recall_at_3
      value: 85.187
    - type: recall_at_5
      value: 89.36500000000001
  - task:
      type: Retrieval
    dataset:
      type: C-MTEB/MedicalRetrieval
      name: MTEB MedicalRetrieval
      config: default
      split: dev
      revision: None
    metrics:
    - type: map_at_1
      value: 54.1
    - type: map_at_10
      value: 59.227
    - type: map_at_100
      value: 59.811
    - type: map_at_1000
      value: 59.86
    - type: map_at_3
      value: 58.083
    - type: map_at_5
      value: 58.743
    - type: mrr_at_1
      value: 54.2
    - type: mrr_at_10
      value: 59.269000000000005
    - type: mrr_at_100
      value: 59.853
    - type: mrr_at_1000
      value: 59.902
    - type: mrr_at_3
      value: 58.099999999999994
    - type: mrr_at_5
      value: 58.785
    - type: ndcg_at_1
      value: 54.1
    - type: ndcg_at_10
      value: 61.732
    - type: ndcg_at_100
      value: 64.832
    - type: ndcg_at_1000
      value: 66.277
    - type: ndcg_at_3
      value: 59.372
    - type: ndcg_at_5
      value: 60.56399999999999
    - type: precision_at_1
      value: 54.1
    - type: precision_at_10
      value: 6.959999999999999
    - type: precision_at_100
      value: 0.847
    - type: precision_at_1000
      value: 0.096
    - type: precision_at_3
      value: 21.032999999999998
    - type: precision_at_5
      value: 13.200000000000001
    - type: recall_at_1
      value: 54.1
    - type: recall_at_10
      value: 69.6
    - type: recall_at_100
      value: 84.7
    - type: recall_at_1000
      value: 96.39999999999999
    - type: recall_at_3
      value: 63.1
    - type: recall_at_5
      value: 66.0
  - task:
      type: Reranking
    dataset:
      type: C-MTEB/Mmarco-reranking
      name: MTEB MMarcoReranking
      config: default
      split: dev
      revision: None
    metrics:
    - type: map
      value: 33.548800108363665
    - type: mrr
      value: 32.529761904761905
  - task:
      type: Reranking
    dataset:
      type: C-MTEB/T2Reranking
      name: MTEB T2Reranking
      config: default
      split: dev
      revision: None
    metrics:
    - type: map
      value: 69.43381583724414
    - type: mrr
      value: 80.47879657392181
  - task:
      type: Retrieval
    dataset:
      type: C-MTEB/T2Retrieval
      name: MTEB T2Retrieval
      config: default
      split: dev
      revision: None
    metrics:
    - type: map_at_1
      value: 27.72
    - type: map_at_10
      value: 79.07300000000001
    - type: map_at_100
      value: 82.636
    - type: map_at_1000
      value: 82.68799999999999
    - type: map_at_3
      value: 55.276
    - type: map_at_5
      value: 68.13900000000001
    - type: mrr_at_1
      value: 90.834
    - type: mrr_at_10
      value: 93.215
    - type: mrr_at_100
      value: 93.28800000000001
    - type: mrr_at_1000
      value: 93.291
    - type: mrr_at_3
      value: 92.83200000000001
    - type: mrr_at_5
      value: 93.072
    - type: ndcg_at_1
      value: 90.834
    - type: ndcg_at_10
      value: 86.26899999999999
    - type: ndcg_at_100
      value: 89.538
    - type: ndcg_at_1000
      value: 90.042
    - type: ndcg_at_3
      value: 87.333
    - type: ndcg_at_5
      value: 86.098
    - type: precision_at_1
      value: 90.834
    - type: precision_at_10
      value: 43.009
    - type: precision_at_100
      value: 5.064
    - type: precision_at_1000
      value: 0.518
    - type: precision_at_3
      value: 76.507
    - type: precision_at_5
      value: 64.30900000000001
    - type: recall_at_1
      value: 27.72
    - type: recall_at_10
      value: 85.439
    - type: recall_at_100
      value: 96.231
    - type: recall_at_1000
      value: 98.844
    - type: recall_at_3
      value: 56.991
    - type: recall_at_5
      value: 71.48299999999999
  - task:
      type: Retrieval
    dataset:
      type: C-MTEB/VideoRetrieval
      name: MTEB VideoRetrieval
      config: default
      split: dev
      revision: None
    metrics:
    - type: map_at_1
      value: 64.3
    - type: map_at_10
      value: 73.585
    - type: map_at_100
      value: 73.935
    - type: map_at_1000
      value: 73.941
    - type: map_at_3
      value: 72.3
    - type: map_at_5
      value: 73.06
    - type: mrr_at_1
      value: 64.3
    - type: mrr_at_10
      value: 73.585
    - type: mrr_at_100
      value: 73.935
    - type: mrr_at_1000
      value: 73.941
    - type: mrr_at_3
      value: 72.3
    - type: mrr_at_5
      value: 73.06
    - type: ndcg_at_1
      value: 64.3
    - type: ndcg_at_10
      value: 77.434
    - type: ndcg_at_100
      value: 79.225
    - type: ndcg_at_1000
      value: 79.4
    - type: ndcg_at_3
      value: 74.824
    - type: ndcg_at_5
      value: 76.188
    - type: precision_at_1
      value: 64.3
    - type: precision_at_10
      value: 8.92
    - type: precision_at_100
      value: 0.9769999999999999
    - type: precision_at_1000
      value: 0.099
    - type: precision_at_3
      value: 27.367
    - type: precision_at_5
      value: 17.080000000000002
    - type: recall_at_1
      value: 64.3
    - type: recall_at_10
      value: 89.2
    - type: recall_at_100
      value: 97.7
    - type: recall_at_1000
      value: 99.1
    - type: recall_at_3
      value: 82.1
    - type: recall_at_5
      value: 85.39999999999999
license: apache-2.0
library_name: transformers
---

<h1 align="center">PEG: Towards Robust Text Retrieval with Progressive Learning</h1>

## Model Details

We propose PEG, a Progressively Learned Textual Embedding model. During training, PEG progressively adjusts the weight that each sample contributes to the loss within an extremely large batch, based on the difficulty of its negative samples.

We have amassed a collection of over 110 million data samples spanning a wide range of fields, such as general knowledge, finance, tourism, and medicine.
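
To make the idea of difficulty-based weighting concrete, here is a minimal sketch of an in-batch contrastive loss for a single query, where each negative is scaled by a weight that grows with its similarity to the query. The names `weighted_contrastive_loss` and `progressive_hardness`, the `temperature` and `hardness` parameters, the exponential weighting rule, and the linear schedule are all illustrative assumptions. They are not the formulation from the technical report, which should be consulted for the actual loss.

```python
import math


def weighted_contrastive_loss(pos_sim, neg_sims, temperature=0.05, hardness=0.0):
    """InfoNCE-style loss for one query with difficulty-weighted negatives.

    pos_sim:  similarity between the query and its positive passage.
    neg_sims: similarities between the query and each in-batch negative.
    hardness: hypothetical knob; 0 weights all negatives equally, larger
              values shift weight toward negatives that score close to
              the query (the "hard" ones).
    """
    if not neg_sims:
        raise ValueError("need at least one negative")

    # Difficulty weights: proportional to exp(hardness * similarity) and
    # rescaled to average 1, so hardness=0 recovers the unweighted loss.
    raw = [math.exp(hardness * s) for s in neg_sims]
    total = sum(raw)
    weights = [len(neg_sims) * r / total for r in raw]

    pos_term = math.exp(pos_sim / temperature)
    neg_term = sum(w * math.exp(s / temperature) for w, s in zip(weights, neg_sims))
    return -math.log(pos_term / (pos_term + neg_term))


def progressive_hardness(step, total_steps, max_hardness=5.0):
    """Hypothetical linear schedule: emphasize hard negatives more over time."""
    return max_hardness * min(step / total_steps, 1.0)
```

In this sketch, raising `hardness` as training proceeds makes the hardest negatives dominate the denominator, so the same batch yields a larger loss late in training than it does early on.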

Our technical report is available at [Paper](https://arxiv.org/pdf/2311.11691.pdf).

## Usage (HuggingFace Transformers)

Install transformers and PyTorch:

```
pip install transformers torch
```

Then load the model and compute sentence embeddings:
```python
from transformers import AutoModel, AutoTokenizer
import torch

# Load the tokenizer and model from the HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('TownsWu/PEG')
model = AutoModel.from_pretrained('TownsWu/PEG')

# Example inputs: "How do I change the bank card linked to Huabei" and
# "Change the bank card linked to Huabei"
sentences = ['如何更换花呗绑定银行卡', '花呗更改绑定银行卡']

# Tokenize the sentences into one padded batch
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings without tracking gradients
with torch.no_grad():
    last_hidden_state = model(**inputs, return_dict=True).last_hidden_state

# Use the hidden state of the first token (the [CLS] position for
# BERT-style tokenizers) as the sentence embedding
embeddings = last_hidden_state[:, 0]
print("embeddings:")
print(embeddings)
```
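
Since the model is tagged for sentence similarity, a natural next step is to compare the embeddings. The helpers below are a minimal sketch in plain Python. `cosine_similarity` and `rank_by_similarity` are illustrative names, and both assume each embedding has already been converted to a list of floats, for example with `embeddings.tolist()`.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors of floats."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, candidates):
    """Return candidate indices ordered from most to least similar to the query."""
    scores = [cosine_similarity(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
```

With the tensor from the snippet above, `vecs = embeddings.tolist()` followed by `cosine_similarity(vecs[0], vecs[1])` gives the score for the two example sentences. When staying in PyTorch, `torch.nn.functional.cosine_similarity` computes the same quantity directly on tensors.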

## Contact

If you have any questions or suggestions about this project, feel free to open an issue or pull request.
You can also email Tong Wu (townswu@tencent.com).

## Citation

If you find our work helpful for your research, please consider citing the following BibTeX entry:

```
@article{wu2023towards,
  title={Towards Robust Text Retrieval with Progressive Learning},
  author={Wu, Tong and Qin, Yulei and Zhang, Enwei and Xu, Zihan and Gao, Yuting and Li, Ke and Sun, Xing},
  journal={arXiv preprint arXiv:2311.11691},
  year={2023}
}
```