arxiv:2410.02525

Contextual Document Embeddings

Published on Oct 3
Submitted by jxm on Oct 4

Abstract

Dense document embeddings are central to neural retrieval. The dominant paradigm is to train and construct embeddings by running encoders directly on individual documents. In this work, we argue that these embeddings, while effective, are implicitly out-of-context for targeted use cases of retrieval, and that a contextualized document embedding should take into account both the document and neighboring documents in context - analogous to contextualized word embeddings. We propose two complementary methods for contextualized document embeddings: first, an alternative contrastive learning objective that explicitly incorporates the document neighbors into the intra-batch contextual loss; second, a new contextual architecture that explicitly encodes neighbor document information into the encoded representation. Results show that both methods achieve better performance than biencoders in several settings, with differences especially pronounced out-of-domain. We achieve state-of-the-art results on the MTEB benchmark with no hard negative mining, score distillation, dataset-specific instructions, intra-GPU example-sharing, or extremely large batch sizes. Our method can be applied to improve performance on any contrastive learning dataset and any biencoder.

Community

Paper author and submitter:

We spent a year developing cde-small-v1, the best BERT-sized text embedding model in the world. Today, we're releasing the model on HuggingFace, along with the paper on ArXiv.


Typical text embedding models have two main problems:

  1. Training them is complicated and requires many tricks: giant batches, distillation, hard negatives...
  2. The embeddings don't "know" what corpus they will be used in; consequently, all text spans are encoded the same way.

To fix (1), we develop a new training technique: contextual batching. All batches share a lot of context – one batch might be about horse races in Kentucky, the next about differential equations, etc.

This lets us get better performance without big batches or hard negative mining. There's also some cool theory behind it.
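To make the idea concrete, here is a minimal sketch that clusters a corpus and draws each training batch from a single cluster. This is only an illustration of the idea described above, not the paper's actual procedure: the TF-IDF features, cluster count, and batch size are placeholders.

```python
# Sketch of contextual batching: cluster the corpus, then draw each batch from a
# single cluster so all documents in a batch share topical context. Illustrative
# only -- TF-IDF features, cluster count, and batch size are placeholders, not
# the paper's actual setup.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def contextual_batches(docs, n_clusters=8, batch_size=32, seed=0):
    vecs = TfidfVectorizer().fit_transform(docs)  # cheap stand-in for document embeddings
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(vecs)
    rng = np.random.default_rng(seed)
    for c in range(n_clusters):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        # Each batch comes from one cluster, so in-batch negatives share a "context".
        for start in range(0, len(idx) - batch_size + 1, batch_size):
            yield [docs[i] for i in idx[start:start + batch_size]]
```

Each yielded batch can then be fed to an ordinary in-batch contrastive loss; because the negatives come from the same cluster, this approximates the "neighboring documents" setting the abstract describes.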

And for (2), we propose a new contextual embedding architecture. This requires changes to both the training and evaluation pipelines to incorporate contextual tokens – essentially, the model sees extra text from the surrounding corpus and can update the embedding accordingly.
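The real architecture is in the paper; purely to make the "contextual tokens" idea concrete, here is a minimal PyTorch sketch in which a first-stage encoder turns a sample of corpus documents into context vectors and a second-stage encoder prepends them to the document's own tokens before pooling. All sizes, pooling choices, and module names are made up.

```python
# Minimal sketch of a two-stage "contextual" encoder: a first stage embeds a
# sample of corpus documents into contextual token vectors, and a second stage
# prepends those vectors to the document's own token embeddings, so the final
# embedding can depend on the surrounding corpus. Illustrative only.
import torch
import torch.nn as nn

class ContextualEncoder(nn.Module):
    def __init__(self, vocab_size=30522, dim=256, n_heads=4, n_layers=2):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.first_stage = nn.TransformerEncoder(layer, n_layers)   # encodes corpus docs
        self.second_stage = nn.TransformerEncoder(layer, n_layers)  # encodes doc + context

    def embed_context(self, corpus_ids):                # (n_ctx_docs, seq_len)
        # One contextual token per corpus document: mean-pool its token states.
        return self.first_stage(self.tok(corpus_ids)).mean(dim=1)   # (n_ctx_docs, dim)

    def forward(self, doc_ids, context_tokens):          # (batch, seq_len), (n_ctx_docs, dim)
        ctx = context_tokens.unsqueeze(0).expand(doc_ids.size(0), -1, -1)
        x = torch.cat([ctx, self.tok(doc_ids)], dim=1)    # prepend contextual tokens
        return self.second_stage(x).mean(dim=1)           # (batch, dim) document embedding

# Tiny smoke test with random token ids.
enc = ContextualEncoder()
ctx = enc.embed_context(torch.randint(0, 30522, (16, 32)))  # 16 corpus docs as context
emb = enc(torch.randint(0, 30522, (4, 32)), ctx)             # 4 documents to embed
print(emb.shape)  # torch.Size([4, 256])
```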

If you use text embeddings, feel free to try cde-small-v1 on HuggingFace (https://huggingface.co/jxm/cde-small-v1). As noted, it's slightly more involved to use, since there's an extra step of embedding context tokens beforehand.
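Roughly, usage looks like the two-step snippet below (written against sentence-transformers). The keyword arguments `prompt_name` and `dataset_embeddings` reflect one reading of the model card and are assumptions here; the HuggingFace model page has the authoritative example.

```python
# Two-step usage sketch for cde-small-v1: (1) embed a sample of your corpus to get
# context embeddings, (2) embed documents and queries conditioned on that context.
# The `prompt_name` and `dataset_embeddings` arguments are assumptions based on the
# model card -- check https://huggingface.co/jxm/cde-small-v1 for the exact snippet.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("jxm/cde-small-v1", trust_remote_code=True)

corpus = [
    "Seattle Slew won the 1977 Kentucky Derby.",
    "The heat equation is a second-order parabolic PDE.",
]
queries = ["Which horse won the Kentucky Derby in 1977?"]

# Step 1: embed (a sample of) the corpus to produce the contextual embeddings.
dataset_embeddings = model.encode(corpus, prompt_name="document", convert_to_tensor=True)

# Step 2: embed documents and queries, conditioning both on that context.
doc_emb = model.encode(corpus, prompt_name="document",
                       dataset_embeddings=dataset_embeddings, convert_to_tensor=True)
query_emb = model.encode(queries, prompt_name="query",
                         dataset_embeddings=dataset_embeddings, convert_to_tensor=True)

print(query_emb @ doc_emb.T)  # dot-product similarity between the query and each document
```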

Let us know what you think!
