Gecko: Versatile Text Embeddings Distilled from Large Language Models

Community Article Published April 1, 2024

The paper introduces Gecko, a compact and versatile text embedding model that leverages the knowledge of large language models (LLMs) through a two-step distillation process. The main idea is to generate diverse synthetic data with an LLM and then refine its quality by retrieving candidate passages and relabeling positives and negatives with the same LLM.

Method Overview

Gecko uses a two-step distillation process that leverages large language models (LLMs) to generate high-quality synthetic training data. The first step generates diverse queries and task descriptions with an LLM in a few-shot prompting setup: the LLM reads a passage sampled from a large web corpus and produces both a task description and a query relevant to that task. This yields a wide variety of query-task pairs spanning different domains and linguistic patterns.
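To make this step concrete, here is a minimal sketch of what the few-shot generation could look like. The prompt template, the `llm` callable, and the output parsing are all assumptions for illustration; the paper's actual prompts are not reproduced here.

```python
from typing import Callable, Tuple

# Hypothetical few-shot prompt: a handful of (passage, task, query) exemplars
# followed by the new passage. The exact template is an assumption.
FEW_SHOT_EXEMPLARS = """\
Passage: The Eiffel Tower was completed in 1889 for the World's Fair.
Task: Given a query, retrieve a passage that answers the question.
Query: when was the eiffel tower built

Passage: Python's list.sort() sorts a list in place and returns None.
Task: Given a search query, retrieve documentation that resolves the issue.
Query: why does list.sort() return None
"""

def generate_task_and_query(llm: Callable[[str], str], passage: str) -> Tuple[str, str]:
    """Ask an LLM for a task description and a matching query for a passage.

    `llm` is any text-in/text-out completion function (an assumption here);
    the parsing assumes the model mirrors the "Task: ... / Query: ..." format.
    """
    prompt = f"{FEW_SHOT_EXEMPLARS}\nPassage: {passage}\nTask:"
    completion = llm(prompt)

    # Expect something like " <task description>\nQuery: <query>"
    task_part, _, query_part = completion.partition("Query:")
    task = task_part.strip()
    query = query_part.strip().splitlines()[0] if query_part.strip() else ""
    return task, query
```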

In the second step, a pre-trained embedding model retrieves the top-k nearest-neighbor passages from the corpus for each generated query. The retrieved passages are then scored by the same LLM, and the scores are used to select a positive passage and a hard negative passage. This relabeling lets Gecko learn from more relevant positive targets and harder negatives than the original seed passages alone would provide.
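Below is a simplified sketch of this mining step, assuming an `embed` function for the pre-trained embedding model and a single `llm_score` function for LLM relevance scoring (the paper combines more than one LLM scoring strategy, which this sketch collapses into one score).

```python
import numpy as np
from typing import Callable, List, Tuple

def mine_positive_and_negative(
    query: str,
    corpus: List[str],
    corpus_embeddings: np.ndarray,           # shape (num_passages, dim), pre-computed
    embed: Callable[[str], np.ndarray],      # pre-trained embedder (assumption)
    llm_score: Callable[[str, str], float],  # LLM relevance score (assumption)
    k: int = 20,
) -> Tuple[str, str]:
    """Retrieve top-k neighbors for the query, rescore them with the LLM,
    and return (positive, hard_negative).

    The positive is the highest-scoring retrieved passage (which may differ
    from the seed passage the query was generated from); the hard negative
    is drawn here from the lowest-scoring close neighbor.
    """
    q = embed(query)
    sims = corpus_embeddings @ q / (
        np.linalg.norm(corpus_embeddings, axis=1) * np.linalg.norm(q) + 1e-9
    )
    top_k = np.argsort(-sims)[:k]

    scored = sorted(top_k, key=lambda i: llm_score(query, corpus[i]), reverse=True)
    positive = corpus[scored[0]]        # most relevant according to the LLM
    hard_negative = corpus[scored[-1]]  # least relevant among the close neighbors
    return positive, hard_negative
```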

Together, these two steps yield the FRet (Few-shot Prompted Retrieval) dataset.
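Conceptually, each FRet example can be represented as a simple record; the field names below are illustrative rather than the paper's exact schema.

```python
from dataclasses import dataclass

@dataclass
class FRetExample:
    """One training example in the unified format (field names are illustrative)."""
    task: str               # LLM-generated task description
    query: str              # LLM-generated query
    positive_passage: str   # LLM-relabeled positive
    negative_passage: str   # LLM-mined hard negative
```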

For the final training of Gecko, the FRet dataset is combined with several academic datasets such as Natural Questions, HotpotQA, FEVER, and others, all formatted into the same unified structure of task description, query, positive passage, and negative passage. This mixture, spanning diverse tasks like question answering, fact checking, and textual entailment, is then used to fine-tune the Gecko model.

The training objective is a contrastive loss: for each query, the model pulls the positive passage closer while pushing away the hard negative passage and the other in-batch negatives. In addition, classification datasets are folded into the same objective by treating each input and its label as a query-positive pair, with other labels serving as negatives. This allows Gecko to learn semantic similarity and classification in a single unified training process; a minimal sketch of this kind of loss follows.
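The sketch below shows a generic in-batch contrastive loss with one hard negative per query, written in PyTorch. It is an illustration under assumed names and an assumed temperature value, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(
    query_emb: torch.Tensor,    # (B, d) query embeddings
    pos_emb: torch.Tensor,      # (B, d) positive passage embeddings
    neg_emb: torch.Tensor,      # (B, d) hard-negative passage embeddings
    temperature: float = 0.05,  # illustrative value, not from the paper
) -> torch.Tensor:
    """In-batch contrastive loss: each query should score its own positive
    higher than the other positives in the batch and its hard negative."""
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(pos_emb, dim=-1)
    n = F.normalize(neg_emb, dim=-1)

    # Candidates for every query: all B positives (in-batch negatives) + B hard negatives.
    candidates = torch.cat([p, n], dim=0)    # (2B, d)
    logits = q @ candidates.T / temperature  # (B, 2B)

    # The correct candidate for query i is positive i (index i in the concatenation).
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)
```

In practice, each batch would contain (query, positive, hard negative) triples from the unified mixture, all encoded by the same Gecko encoder before being passed to a loss of this form.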

Results

On the Massive Text Embedding Benchmark (MTEB), Gecko performs strongly even at an embedding size of 256. With 768-dimensional embeddings, Gecko achieves an average score of 66.31, competing with models that are 7 times larger and use embeddings with 5 times more dimensions.

The ablations over training data are also very interesting: Gecko trained solely on FRet still shows strong performance, further emphasizing the importance of the LLM-generated synthetic data.

Conclusion

By leveraging LLMs for diverse synthetic data generation and relabeling, Gecko achieves strong performance across multiple text embedding tasks while maintaining a compact size. For more information, please consult the full paper.

Congrats to the authors for their work!

Lee, Jinhyuk, et al. "Gecko: Versatile Text Embeddings Distilled from Large Language Models." arXiv preprint arXiv:2403.20327 (2024).