Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection
Despite their remarkable capabilities, large language models (LLMs) often produce responses containing factual inaccuracies due to their sole reliance on the parametric knowledge they encapsulate. Retrieval-Augmented Generation (RAG), an ad hoc approach that augments LMs with retrieval of relevant knowledge, decreases such issues. However, indiscriminately retrieving and incorporating a fixed number of retrieved passages, regardless of whether retrieval is necessary, or passages are relevant, diminishes LM versatility or can lead to unhelpful response generation. We introduce a new framework called Self-Reflective Retrieval-Augmented Generation (Self-RAG) that enhances an LM's quality and factuality through retrieval and self-reflection. Our framework trains a single arbitrary LM that adaptively retrieves passages on-demand, and generates and reflects on retrieved passages and its own generations using special tokens, called reflection tokens. Generating reflection tokens makes the LM controllable during the inference phase, enabling it to tailor its behavior to diverse task requirements. Experiments show that Self-RAG (7B and 13B parameters) significantly outperforms state-of-the-art LLMs and retrieval-augmented models on a diverse set of tasks. Specifically, Self-RAG outperforms ChatGPT and retrieval-augmented Llama2-chat on Open-domain QA, reasoning and fact verification tasks, and it shows significant gains in improving factuality and citation accuracy for long-form generations relative to these models.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Retrieval-Generation Synergy Augmented Large Language Models (2023)
- RA-DIT: Retrieval-Augmented Dual Instruction Tuning (2023)
- RaLLe: A Framework for Developing and Evaluating Retrieval-Augmented Large Language Models (2023)
- RAGAS: Automated Evaluation of Retrieval Augmented Generation (2023)
- Making Retrieval-Augmented Language Models Robust to Irrelevant Context (2023)
LLMs make a ton of factual mistakes. This limits using them for real-world stuff where being right matters.
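Earlier work tried "retrieval augmentation": prepending retrieved context (e.g. from Wikipedia) to the prompt to give the LM knowledge. But retrieving a fixed number of passages every time, whether or not they are needed or relevant, is slow and doesn't guarantee the output will actually use the facts correctly.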
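To make the "always retrieve, always prepend" baseline concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the toy corpus, the word-overlap scoring, and the names `retrieve_top_k` and `build_prompt` are not from the paper, and a real system would use a proper retriever.

```python
# Toy corpus; purely illustrative.
CORPUS = [
    "Mount Everest is the highest mountain.",
    "Paris is the capital of France.",
    "The Eiffel Tower is in Paris.",
]

def retrieve_top_k(query, corpus, k=2):
    """Hypothetical retriever: rank passages by words shared with the query."""
    query_words = set(query.lower().split())

    def overlap(passage):
        return len(query_words & set(passage.lower().split()))

    # Always returns k passages, even if none of them is relevant.
    return sorted(corpus, key=overlap, reverse=True)[:k]

def build_prompt(query, corpus, k=2):
    """Prepend the retrieved passages to the question, unconditionally."""
    passages = retrieve_top_k(query, corpus, k)
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return f"{context}\n\nQuestion: {query}\nAnswer:"

print(build_prompt("What is the capital of France?", CORPUS))
print(build_prompt("Write a haiku about rain", CORPUS))
```

Note that the haiku request still gets two passages stuffed into its prompt even though none of them helps. That is the indiscriminate behavior the abstract complains about.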
Researchers from UW, AI2, and IBM take a clever and different approach. They train the LM to "reflect" on itself - deciding when it needs more info and critiquing whether its output matches the evidence.
The model learns to generate special "retrieve" tokens so it fetches relevant passages from Wikipedia only when needed. It also generates "critique tokens" to check if its output is properly supported. Together, these are what the abstract calls reflection tokens.
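A rough sketch of how the retrieve and critique tokens could steer inference is below. This is not the authors' implementation: the token strings, the keyword heuristic in `needs_retrieval`, the substring check in `critique`, and the canned `toy_draft` function are all hypothetical stand-ins for decisions that the trained LM would make by generating tokens.

```python
# Hypothetical token names; the real vocabulary is defined by the paper's training setup.
RETRIEVE, NO_RETRIEVE = "[Retrieve]", "[No Retrieve]"
SUPPORTED, NOT_SUPPORTED = "[Supported]", "[Not Supported]"

CORPUS = [
    "Mount Everest is the highest mountain.",
    "The Eiffel Tower is in Paris.",
    "Paris is the capital of France.",
]

def needs_retrieval(query):
    """Stand-in for the LM emitting a retrieve token (here: a crude keyword rule)."""
    words = query.lower().split()
    first = words[0] if words else ""
    return RETRIEVE if first in ("who", "what", "when", "where", "which") else NO_RETRIEVE

def critique(draft, passage):
    """Stand-in for a critique token: is the draft backed by the passage?"""
    return SUPPORTED if draft.lower() in passage.lower() else NOT_SUPPORTED

def toy_draft(query, passage):
    """Canned 'generation' so the example runs without a model."""
    canned = {"what is the capital of france?": "Paris is the capital of France"}
    return canned.get(query.lower(), "Here is a free-form reply.")

def self_reflective_answer(query, corpus, draft_fn):
    """Retrieve only on demand, then keep the first draft its passage supports."""
    if needs_retrieval(query) == NO_RETRIEVE:
        return draft_fn(query, None), None
    for passage in corpus:
        draft = draft_fn(query, passage)
        if critique(draft, passage) == SUPPORTED:
            return draft, passage  # answer plus the evidence it can cite
    return draft_fn(query, None), None  # nothing supported the draft

print(self_reflective_answer("What is the capital of France?", CORPUS, toy_draft))
print(self_reflective_answer("Write a haiku about rain", CORPUS, toy_draft))
```

The point of the sketch is the control flow: the creative request skips retrieval entirely, while the factual question comes back paired with the passage that supports it, which is what makes citation possible.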
They tested this "SELF-RAG" framework on question answering, reasoning, fact verification, and long-form generation tasks. It beat both plain LLMs and retrieval-augmented ones.
The key findings:
- SELF-RAG models (7B and 13B parameters) won on 6 diverse tasks against strong baselines.
- It had 81% accuracy on a fact checking task, way better than 71% for another new technique.
- For biography generation, it scored 80% on factuality versus just 71% for ChatGPT.
- It achieved much higher citation accuracy - properly attributing claims to evidence sources.
This shows the benefits of having LLMs reflect on their own limitations and selectively get knowledge.
There are still issues though - SELF-RAG can make unsupported claims sometimes. More work is needed on training, knowledge sources, and self-critique reliability.
But overall it's an elegant approach to improving factuality without sacrificing too much of the creative versatility of large models. Really keen to see how this research direction develops!
TLDR: Researchers made LMs learn to notice their own mistakes and retrieve knowledge to correct themselves. Early results show improvements in factual accuracy.