# RAG: Retrieval-Augmented Generation
Paper: [Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks](https://arxiv.org/pdf/2005.11401v4.pdf)
Code: [https://python.langchain.com/docs/use_cases/question_answering/quickstart](https://python.langchain.com/docs/use_cases/question_answering/quickstart)
Similarity Search: [https://arxiv.org/pdf/2403.05440.pdf](https://arxiv.org/pdf/2403.05440.pdf)
Prompt Hub: [https://smith.langchain.com/hub](https://smith.langchain.com/hub)
## Premise
LLMs store factual knowledge in their parameters, but accessing and manipulating that knowledge in a precise way is not easy. Instead, the relevant knowledge can be retrieved from an external corpus via embedding similarity and then added to the prompt of another model for answer generation:
![RAG Architecture](readme_data/rag1.png)
The original paper trains the retriever and generator end-to-end in one pipeline.
Models like GPT-3.5 and GPT-4 don't need that training step if they are used with the OpenAI embedding models.
**Where Is RAG Used?**
* Mitigate hallucinations generated by LLMs.
  * Hallucinations are factually incorrect statements generated by an LLM in response to an instruction or question from a user.
  * Hallucinations are hard to detect and require dedicated methodologies to catch.
  * Even with RAG, the probability of hallucination is not zero.
* Allow LLMs to consume data at inference time that was not part of their training.
  * LLMs are pre-trained on huge amounts of data from public sources.
* Proprietary data is not available to any general pre-trained LLM.
**How RAG Works**
1. One can vectorize the semantics of a piece of text using specialized LLMs called embedding models.
2. A collection of texts can be vectorized ahead of time to be used for answering incoming questions.
3. A question is embedded using the same embedding model, and similar documents are retrieved from a vector database using a similarity metric such as cosine similarity.
4. The retrieved documents, together with the question, are passed to a generator LLM to produce an answer (see the sketch below).
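To make the four steps concrete, here is a minimal, self-contained sketch. The bag-of-words `embed` function is a toy stand-in for a real embedding model, the in-memory list stands in for a vector database, and the generator call is omitted; in a real pipeline you would use an embedding API (e.g., via LangChain) and a proper vector store instead.

```python
import numpy as np

# Toy stand-in for a real embedding model: a bag-of-words vector
# over a tiny vocabulary. In practice, call an embedding API instead.
def embed(text: str, vocab: list[str]) -> np.ndarray:
    tokens = text.lower().split()
    return np.array([tokens.count(w) for w in vocab], dtype=float)

# Steps 1-2: vectorize a small corpus into an in-memory "vector store".
corpus = [
    "RAG retrieves documents and passes them to a generator model",
    "Embedding models map text to fixed-length vectors",
    "Cosine similarity compares the angle between two vectors",
]
vocab = sorted({w for doc in corpus for w in doc.lower().split()})
store = [(doc, embed(doc, vocab)) for doc in corpus]  # (text, embedding) pairs

# Step 3: embed the question and retrieve the most similar document.
def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

question = "what does cosine similarity compare"
q_vec = embed(question, vocab)
best_doc, _ = max(store, key=lambda pair: cosine(q_vec, pair[1]))

# Step 4: build the augmented prompt for the generator LLM (call omitted).
prompt = f"Context: {best_doc}\n\nQuestion: {question}\nAnswer:"
print(prompt)
```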
## RAG Components
Here are different components present in a RAG pipeline:
1. **Embedding Model:** A vectorization model that outputs a fixed-length vector for each input string.
   * The length is the dimension of the model's latent space.
2. **Vector DB:** A specialized database for storing pairs of (text, embedding). Each pair is called a document. Related documents are usually placed in one collection, which makes similarity search easier.
3. **Similarity Metric:** Given two document pairs $(t_1,e_1)$ and $(t_2,e_2)$, the metric estimates the similarity of $t_1$ and $t_2$ by performing a geometric calculation on their respective embeddings (see the sketch after this list).
* **Cosine Similarity:** Calculates the cosine of the angle between embedding1 and embedding2:
$$\cos(\theta)=\frac{e_1 \cdot e_2}{||e_1||\;||e_2||}.$$
* **Inner Product:** Calculates the inner product of $e_1$ and $e_2$:
$$e_1 \cdot e_2.$$
* **Distance:** Calculates the distance of $e_1$ from $e_2$ using $L_p$ norms:
$$||e_1-e_2||_p.$$
4. **Generator Model:** Generates the final answer from the question and the retrieved similar texts, which may contain the answer.
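For reference, the three metrics above can each be written in a few lines of NumPy. These are plain implementations of the formulas, not the optimized routines a vector DB would use:

```python
import numpy as np

def cosine_similarity(e1: np.ndarray, e2: np.ndarray) -> float:
    # cos(theta) = (e1 . e2) / (||e1|| * ||e2||)
    return float(e1 @ e2 / (np.linalg.norm(e1) * np.linalg.norm(e2)))

def inner_product(e1: np.ndarray, e2: np.ndarray) -> float:
    # Unnormalized dot product e1 . e2.
    return float(e1 @ e2)

def lp_distance(e1: np.ndarray, e2: np.ndarray, p: float = 2) -> float:
    # ||e1 - e2||_p; p = 2 gives the usual Euclidean distance.
    return float(np.linalg.norm(e1 - e2, ord=p))

e1, e2 = np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 5.0])
print(cosine_similarity(e1, e2), inner_product(e1, e2), lp_distance(e1, e2))
```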
## Cosine Similarity Problems
* The motivation for using cosine similarity is that the norm of a learned embedding vector is considered less important than the directional alignment between embedding vectors.
* But in practice, cosine similarity can "work better but sometimes also worse than the unnormalized dot-product between embedded vectors."
* The paper derives "analytically how cosine-similarity can yield arbitrary and therefore meaningless ‘similarities.’"
* To do this, they "study embeddings derived from regularized linear models, where closed-form solutions facilitate analytical insights."
* "The underlying reason is not cosine similarity itself, but the fact that the learned embeddings have a degree of freedom that can render arbitrary cosine-similarities even though their (unnormalized) dot-products are well-defined and unique."