---
<h1 align="center">GIST Embedding v0</h1>

*GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning*

The model is fine-tuned on top of the [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) using the [MEDI dataset](https://github.com/xlang-ai/instructor-embedding.git) augmented with mined triplets from the [MTEB Classification](https://huggingface.co/mteb) training dataset (excluding data from the Amazon Polarity Classification task).

The model does not require any instruction for generating embeddings, so queries for retrieval tasks can be encoded directly, without crafting instructions.

Technical paper: [GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning](https://arxiv.org/abs/2402.16829)

# Data

The dataset used is a compilation of the MEDI and MTEB Classification training datasets. Third-party datasets may be subject to additional terms and conditions under their associated licenses. A HuggingFace Dataset version of the compiled dataset, along with the specific revision used to train the model, is available:

- Dataset: [avsolatorio/medi-data-mteb_avs_triplets](https://huggingface.co/datasets/avsolatorio/medi-data-mteb_avs_triplets)
- Revision: 238a0499b6e6b690cc64ea56fde8461daa8341bb

The dataset contains a `task_type` key, which can be used to select only the MTEB classification tasks (prefixed with `mteb_`); see the sketch below.
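
For illustration, a minimal sketch of this filtering with the `datasets` library; the dataset ID and revision are the ones listed above, while the `train` split name is an assumption:

```python
from datasets import load_dataset

# Load the compiled MEDI + MTEB triplets, pinned to the revision listed above.
data = load_dataset(
    "avsolatorio/medi-data-mteb_avs_triplets",
    split="train",  # assumed split name
    revision="238a0499b6e6b690cc64ea56fde8461daa8341bb",
)

# Keep only the MTEB classification triplets (task_type values prefixed with "mteb_").
mteb_only = data.filter(lambda row: row["task_type"].startswith("mteb_"))
print(mteb_only)
```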

The **MEDI Dataset** is published in the following paper: [One Embedder, Any Task: Instruction-Finetuned Text Embeddings](https://arxiv.org/abs/2212.09741).

The MTEB Benchmark results of the GIST embedding model, compared with the base model, suggest that the fine-tuning dataset has perturbed the model considerably, yielding significant improvements on certain tasks while degrading performance on others.

The retrieval performance on the TRECCOVID task is noteworthy. The fine-tuning dataset does not contain significant knowledge about COVID-19, which could have caused the observed performance degradation. We found some evidence, detailed in the paper, that the thematic coverage of the fine-tuning data can affect downstream performance.

# Usage

The model can be easily loaded using the Sentence Transformers library.

```python
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

revision = None  # Replace with the specific revision to ensure reproducibility if the model is updated.

model = SentenceTransformer("avsolatorio/GIST-Embedding-v0", revision=revision)
```
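
Continuing from the block above, a minimal sketch of encoding and scoring; the example texts are placeholders, and the cosine-similarity step is suggested by the `F` import:

```python
texts = [
    "What is the capital of France?",
    "Paris is the capital and largest city of France.",
    "Tokyo is the capital of Japan.",
]

# Encode directly -- no instruction prefix is needed for this model.
embeddings = model.encode(texts, convert_to_tensor=True)

# Pairwise cosine similarities between all encoded texts.
scores = F.cosine_similarity(embeddings.unsqueeze(1), embeddings.unsqueeze(0), dim=-1)
print(scores)
```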

# Training Parameters

```
Checkpoint step = 103500
Contrastive loss temperature = 0.01
```
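
For context on the temperature value above, a generic temperature-scaled in-batch contrastive (InfoNCE) loss is sketched below. This illustrates the loss family only; it is not the exact GISTEmbed training objective, which additionally uses a guide model to select in-sample negatives (see the paper):

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor, pos_emb: torch.Tensor,
                  temperature: float = 0.01) -> torch.Tensor:
    """Generic in-batch contrastive loss: each query's positive is the
    same-index row of pos_emb; all other rows act as negatives."""
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(pos_emb, dim=-1)
    logits = q @ p.T / temperature  # cosine similarities, sharpened by the temperature
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)
```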

# Evaluation

The model was evaluated using the [MTEB Evaluation](https://huggingface.co/mteb) suite.
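
A minimal sketch of running one MTEB task with the `mteb` package; the task choice and output folder are illustrative, and interfaces may differ across `mteb` versions:

```python
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("avsolatorio/GIST-Embedding-v0")

# Run a single illustrative task; the full benchmark covers many more.
evaluation = MTEB(tasks=["Banking77Classification"])
results = evaluation.run(model, output_folder="results/GIST-Embedding-v0")
```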

# Citation

Please cite our work if you use GISTEmbed or the datasets we published in your projects or research. 🤗

```
@article{solatorio2024gistembed,
    title={GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning},
    author={Aivin V. Solatorio},
    journal={arXiv preprint arXiv:2402.16829},
    year={2024},
    url={https://arxiv.org/abs/2402.16829},
    eprint={2402.16829},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}
```

# Acknowledgements

This work is supported by the "KCP IV - Exploring Data Use in the Development Economics Literature using Large Language Models (AI and LLMs)" project funded by the [Knowledge for Change Program (KCP)](https://www.worldbank.org/en/programs/knowledge-for-change) of the World Bank - RA-P503405-RESE-TF0C3444.