---
<h1 align="center">GIST Embedding v0</h1>

*GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning*

The model is fine-tuned on top of the [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) using the [MEDI dataset](https://github.com/xlang-ai/instructor-embedding.git) augmented with mined triplets from the [MTEB Classification](https://huggingface.co/mteb) training dataset (excluding data from the Amazon Polarity Classification task).

The model does not require any instruction for generating embeddings, so queries for retrieval tasks can be encoded directly, without crafting instructions.

Technical paper: [GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning](https://arxiv.org/abs/2402.16829)

# Data

The dataset used is a compilation of the MEDI and MTEB Classification training datasets. Third-party datasets may be subject to additional terms and conditions under their associated licenses. A HuggingFace Dataset version of the compiled dataset, along with the specific revision used to train the model, is available:

- Dataset: [avsolatorio/medi-data-mteb_avs_triplets](https://huggingface.co/datasets/avsolatorio/medi-data-mteb_avs_triplets)
- Revision: 238a0499b6e6b690cc64ea56fde8461daa8341bb

The dataset contains a `task_type` key, which can be used to select only the MTEB classification tasks (prefixed with `mteb_`); see the sketch below.
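
For illustration, a minimal sketch of this filtering with the `datasets` library; the dataset ID and revision are the ones listed above, while the `train` split name is an assumption:

```python
from datasets import load_dataset

# Load the compiled MEDI + MTEB triplets, pinned to the revision listed above.
data = load_dataset(
    "avsolatorio/medi-data-mteb_avs_triplets",
    split="train",  # assumed split name
    revision="238a0499b6e6b690cc64ea56fde8461daa8341bb",
)

# Keep only the MTEB classification triplets (task_type values prefixed with "mteb_").
mteb_only = data.filter(lambda row: row["task_type"].startswith("mteb_"))
print(mteb_only)
```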

The **MEDI Dataset** is published in the following paper: [One Embedder, Any Task: Instruction-Finetuned Text Embeddings](https://arxiv.org/abs/2212.09741).

The MTEB Benchmark results of the GIST embedding model, compared with the base model, suggest that the fine-tuning dataset has perturbed the model considerably, yielding significant improvements on certain tasks while degrading performance on others.

The retrieval performance on the TRECCOVID task is noteworthy. The fine-tuning dataset does not contain significant knowledge about COVID-19, which could have caused the observed performance degradation. We found some evidence, detailed in the paper, that the thematic coverage of the fine-tuning data can affect downstream performance.

# Usage

The model can be easily loaded using the Sentence Transformers library.

```python
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

revision = None  # Replace with the specific revision to ensure reproducibility if the model is updated.

model = SentenceTransformer("avsolatorio/GIST-Embedding-v0", revision=revision)
```
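
Continuing from the block above, a minimal sketch of encoding and scoring; the example texts are placeholders, and the cosine-similarity step is suggested by the `F` import:

```python
texts = [
    "What is the capital of France?",
    "Paris is the capital and largest city of France.",
    "Tokyo is the capital of Japan.",
]

# Encode directly -- no instruction prefix is needed for this model.
embeddings = model.encode(texts, convert_to_tensor=True)

# Pairwise cosine similarities between all encoded texts.
scores = F.cosine_similarity(embeddings.unsqueeze(1), embeddings.unsqueeze(0), dim=-1)
print(scores)
```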

# Training Parameters

```
Checkpoint step = 103500
Contrastive loss temperature = 0.01
```
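
For context on the temperature value above, a generic temperature-scaled in-batch contrastive (InfoNCE) loss is sketched below. This illustrates the loss family only; it is not the exact GISTEmbed training objective, which additionally uses a guide model to select in-sample negatives (see the paper):

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor, pos_emb: torch.Tensor,
                  temperature: float = 0.01) -> torch.Tensor:
    """Generic in-batch contrastive loss: each query's positive is the
    same-index row of pos_emb; all other rows act as negatives."""
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(pos_emb, dim=-1)
    logits = q @ p.T / temperature  # cosine similarities, sharpened by the temperature
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)
```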

# Evaluation

The model was evaluated using the [MTEB Evaluation](https://huggingface.co/mteb) suite.
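
A minimal sketch of running one MTEB task with the `mteb` package; the task choice and output folder are illustrative, and interfaces may differ across `mteb` versions:

```python
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("avsolatorio/GIST-Embedding-v0")

# Run a single illustrative task; the full benchmark covers many more.
evaluation = MTEB(tasks=["Banking77Classification"])
results = evaluation.run(model, output_folder="results/GIST-Embedding-v0")
```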

# Citation

Please cite our work if you use GISTEmbed or the datasets we published in your projects or research. 🤗

```
@article{solatorio2024gistembed,
    title={GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning},
    author={Aivin V. Solatorio},
    journal={arXiv preprint arXiv:2402.16829},
    year={2024},
    url={https://arxiv.org/abs/2402.16829},
    eprint={2402.16829},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}
```

# Acknowledgements

This work is supported by the "KCP IV - Exploring Data Use in the Development Economics Literature using Large Language Models (AI and LLMs)" project funded by the [Knowledge for Change Program (KCP)](https://www.worldbank.org/en/programs/knowledge-for-change) of the World Bank - RA-P503405-RESE-TF0C3444.