arxiv:2406.15319

LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs

Published on Jun 21
· Submitted by wenhu on Jun 24
#1 Paper of the day

Abstract

In the traditional RAG framework, the basic retrieval units are normally short. Common retrievers like DPR typically work with 100-word Wikipedia paragraphs. Such a design forces the retriever to search over a large corpus to find the 'needle' unit, while the reader only needs to extract answers from the short retrieved units. This imbalanced design, with a 'heavy' retriever and a 'light' reader, can lead to sub-optimal performance. To alleviate the imbalance, we propose a new framework, LongRAG, consisting of a 'long retriever' and a 'long reader'. LongRAG processes the entire Wikipedia into 4K-token units, which are 30x longer than before. By increasing the unit size, we significantly reduce the total number of units from 22M to 700K. This substantially lowers the burden on the retriever, which leads to remarkable retrieval scores: answer recall@1 = 71% on NQ (previously 52%) and answer recall@2 = 72% on HotpotQA full-wiki (previously 47%). We then feed the top-k retrieved units (approximately 30K tokens) to an existing long-context LLM to perform zero-shot answer extraction. Without requiring any training, LongRAG achieves an EM of 62.7% on NQ, which is the best known result. LongRAG also achieves 64.3% on HotpotQA (full-wiki), which is on par with the SoTA model. Our study offers insights into the future roadmap for combining RAG with long-context LLMs.
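To make the pipeline concrete, here is a minimal sketch of the recipe described above, assuming a greedy grouping of passages into ~4K-token units and callable retriever/reader components. This is an illustration, not the authors' released code; the helper names are invented, and the paper's actual grouping of Wikipedia documents may differ.

```python
# Minimal sketch of the LongRAG recipe described in the abstract (an
# illustration, not the authors' released code). The greedy grouping and
# helper names below are assumptions.
from typing import Callable, List

MAX_UNIT_TOKENS = 4096   # ~4K-token retrieval units, per the abstract
READER_BUDGET = 30_000   # ~30K tokens handed to the long-context reader

def group_passages(passages: List[str], count_tokens: Callable[[str], int]) -> List[str]:
    """Greedily merge short passages into long retrieval units of ~4K tokens."""
    units: List[str] = []
    current: List[str] = []
    current_len = 0
    for p in passages:
        n = count_tokens(p)
        if current and current_len + n > MAX_UNIT_TOKENS:
            units.append("\n".join(current))
            current, current_len = [], 0
        current.append(p)
        current_len += n
    if current:
        units.append("\n".join(current))
    return units

def long_rag_answer(
    question: str,
    units: List[str],
    retrieve: Callable[[str, List[str]], List[str]],  # returns units sorted by relevance
    read: Callable[[str], str],                       # any long-context LLM
    count_tokens: Callable[[str], int],
) -> str:
    """Fill a ~30K-token context with the top retrieved units, then extract zero-shot."""
    context: List[str] = []
    used = 0
    for unit in retrieve(question, units):
        n = count_tokens(unit)
        if used + n > READER_BUDGET:
            break
        context.append(unit)
        used += n
    prompt = "Answer the question based on the context.\n\n" + "\n\n".join(context)
    prompt += f"\n\nQuestion: {question}\nAnswer:"
    return read(prompt)
```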

Community

Paper author Paper submitter

How do you combine a long-context LLM with RAG? We provide a solution called LongRAG, which unleashes the power of long-context LLMs in RAG systems. Without any training, it achieves nearly SoTA performance on NQ and HotpotQA.
https://tiger-ai-lab.github.io/LongRAG/

Hi @wenhu ,

Thanks for the great work and for sharing it with the HF community!

I’ve got a couple of questions about the proposed retriever. How does it actually lower the retrieval overhead if it still needs to find the chunk in each document (or group of documents) that maximizes the similarity score? If I’m getting this right, doesn’t it still need to do O(N) similarity calculations with N being the number of chunks?

Also, regarding the claimed recall@1 improvement: how much of that is due to the fact that the retrieved content unit is larger than a passage? For instance, it wouldn’t be too surprising if retrieving 5 grouped documents at once (still considered as recall@1) has a higher recall than pulling 1 passage at a time.
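For readers following this thread, here is a hedged sketch of how such a recall comparison could be measured. The metric definition used below (gold-answer string containment in the retrieved text) is an assumption about what "answer recall" means here, not taken from the paper.

```python
# Hypothetical illustration of the comparison raised above. "Answer recall@k"
# is taken here to mean: the gold answer string appears somewhere in the
# top-k retrieved units (a common definition; the paper's metric may differ).
from typing import List

def answer_recall_at_k(retrievals: List[List[str]], answers: List[List[str]], k: int) -> float:
    """Fraction of questions whose gold answer appears in the top-k retrieved units."""
    hits = 0
    for units, golds in zip(retrievals, answers):
        text = " ".join(units[:k]).lower()
        if any(g.lower() in text for g in golds):
            hits += 1
    return hits / len(retrievals)

# Note: one 4K-token unit covers roughly as much text as thirty 100-word
# passages, so unit-level recall@1 and passage-level recall@1 compare the
# retriever against very different amounts of retrieved context.
```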

Paper author

Thanks a lot for the questions! These are actually discussed in the paper.

  1. This is a good question. Currently, we use the maximum-chunk score because it gives the best performance, though it is more of a makeshift solution (see the sketch after this list). There are long-context retrievers like E5-Mistral that can compress the index substantially; however, they don't yield good performance as of now (see Table 3). We have also tried long-context embeddings from GPT and Cohere, which give better results but are more costly. We believe future research can significantly boost performance and remove the need for the maximum-chunk retriever. Our lab is also working on this, so hopefully we will have a new "long retriever" to share.
  2. The current recall improvement is indeed due to the larger unit, which is the point of the "long retriever". We advocate using larger, cohesive units in RAG instead of 100 smaller, disparate units. We believe this can simplify the RAG system and improve attribution. We analyze this performance in Table 2.
    Overall, I think this paper proposes a new paradigm rather than a "perfect solution". We hope people will think about how to maximize the power of long-context LLMs in RAG.
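As referenced in point 1, here is a minimal sketch of maximum-chunk scoring, under the assumption that each long unit is still embedded chunk by chunk with a standard dense retriever and scored by its best-matching chunk; the exact implementation in the paper may differ.

```python
# Minimal sketch of the maximum-chunk scoring discussed in point 1 (an
# assumption about the approach, not the paper's exact implementation).
# Each long unit keeps the embeddings of its short chunks, and the unit's
# score is its best-matching chunk -- which is why the index is not smaller.
from typing import List
import numpy as np

def score_unit(query_emb: np.ndarray, chunk_embs: np.ndarray) -> float:
    """Score a long unit by the max similarity between the query and any of its chunks."""
    # query_emb: (dim,); chunk_embs: (num_chunks, dim); rows assumed L2-normalized
    return float(np.max(chunk_embs @ query_emb))

def retrieve_top_k(query_emb: np.ndarray, unit_chunk_embs: List[np.ndarray], k: int) -> List[int]:
    """Return the indices of the top-k long units under max-chunk scoring."""
    scores = [score_unit(query_emb, embs) for embs in unit_chunk_embs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
```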

Hi @wenhu ,

Thanks for your response! The paper indeed explores an interesting strategy that's different from traditional RAG, especially with the rise of more powerful long-context LLMs.

For practitioners like me, it would be incredibly valuable to compare:

  1. SOTA passage-based RAG (retrieve -> rerank -> reader) versus
  2. Long-context RAG (retrieve cohesive documents -> reader)

I think simply feeding the top 100 passages from the retriever into the reader model might not do justice to passage-based RAG. For instance, the top 10 passages after being reranked by a SOTA reranking model (like BGE-M3) would likely perform better, since the input is more concise and precise.
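To illustrate the rerank step mentioned above, here is a hedged sketch using sentence-transformers' CrossEncoder with a BGE reranker checkpoint; the library and model name are illustrative assumptions, not what the paper or this comparison actually used.

```python
# Illustrative rerank step for the passage-based pipeline sketched above.
# The library (sentence-transformers) and the reranker checkpoint are
# assumptions chosen for illustration.
from typing import List
from sentence_transformers import CrossEncoder

def rerank(question: str, passages: List[str], top_n: int = 10) -> List[str]:
    """Re-score retrieved passages with a cross-encoder and keep the top_n."""
    reranker = CrossEncoder("BAAI/bge-reranker-base")
    scores = reranker.predict([(question, p) for p in passages])
    ranked = sorted(zip(passages, scores), key=lambda pair: pair[1], reverse=True)
    return [p for p, _ in ranked[:top_n]]
```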

However, here's an anecdote that shows where long cohesive documents can potentially outperform passages: We've worked with documents where the full name of a person is mentioned only at the beginning. For example, an article might start with "John Doe declared..." and then refer to him as "Mr. Doe" throughout the rest of the article. If a passage-based model only includes parts referring to "Mr. Doe," it can sometimes produce incorrect results. Keeping the whole article in context helps the LLM understand who "Mr. Doe" is.

On the flip side, there are cases where including too much content can confuse a weaker reader model. For example, in a long article listing a large number of drugs and their purposes, if a user is only looking for drugs for a specific purpose (like breast cancer), too much content can overwhelm the model. We've seen that even SOTA models like those in the Llama 2 era can struggle with too much information. But that was a while ago... It would be fascinating to see if the best open-source long-context models nowadays, like Qwen 2 72B, benefit from or suffer due to long cohesive documents.

Looking forward to more insights on this!

@andyshieh Thanks for your interest in our work. I’d like to share my thoughts regarding your comments:

  • Some of the fully-supervised RAG baselines in our Table 4 and Table 5 are actually passage-level RAG frameworks with a rerank step. Additionally, our framework should also benefit from a rerank step. In our work, we emphasize simplifying the retriever as much as possible while achieving overall performance on par with frameworks that include a rerank step or an iterative retriever (in a multi-hop setup). (That's why we didn't introduce a rerank step.)

  • Our work aims to discuss the trade-off between the burden on the reader and the retriever. If the reader model is weaker, we place more burden on the retriever. Conversely, if the reader model is stronger, we place more burden on the reader. We have tested a few of the best models known for handling long contexts in our work (although they are not open-source). We may add other models, including open-source ones, as readers in the future.
