# RAG: Retrieval-Augmented Generation

Paper: [Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks](https://arxiv.org/pdf/2005.11401v4.pdf)

Code: [https://python.langchain.com/docs/use_cases/question_answering/quickstart](https://python.langchain.com/docs/use_cases/question_answering/quickstart)

Similarity Search: [https://arxiv.org/pdf/2403.05440.pdf](https://arxiv.org/pdf/2403.05440.pdf)

Prompt Hub: [https://smith.langchain.com/hub](https://smith.langchain.com/hub)

## Premise

LLMs store factual knowledge in their parameters, but accessing and manipulating that knowledge in a precise way is not easy. Instead, the specific knowledge can be retrieved through embedding similarity and added to the prompt of another model for answer generation:

![RAG Architecture](readme_data/rag1.png)

The original paper considers training the retriever and generator models end-to-end in one pipeline. Models like GPT-3.5 and GPT-4 do not need that training step if they are used with the OpenAI embedding models.

**Where Is RAG Used?**

* Mitigating hallucinations generated by LLMs.
  * Hallucinations are factually incorrect statements generated by an LLM in response to an instruction or question from a user.
  * Hallucinations are hard to detect and require additional methodologies to catch them.
  * Even with RAG, the probability of hallucination is not zero.
* Allowing LLMs to consume data at inference time that was not part of their training.
  * LLMs are pre-trained on huge amounts of data from public sources.
  * Proprietary data is not available to any general pre-trained LLM.

**How RAG Works**

1. The semantics of a piece of text can be vectorized using specialized LLMs called embedding models.
2. A collection of texts is vectorized so it can be used for answering incoming questions.
3. A question is embedded with the same embedding model, and similar documents are retrieved from a vector database using a similarity search algorithm such as cosine similarity.
4. The retrieved documents, together with the question, are passed to a generator LLM to produce an answer (an end-to-end sketch follows the component list below).

## RAG Components

Here are the different components of a RAG pipeline:

1. **Embedding Model:** A vectorization model that outputs a vector of fixed length for each input string.
   * The length is the dimension of the model's latent space.
2. **Vector DB:** A specialized database for storing (text, embedding) pairs. Each pair is called a document. Related documents are usually put in one collection, which makes similarity search easier.
3. **Similarity Metric:** Given two documents $(t_1, e_1)$ and $(t_2, e_2)$, the metric estimates the similarity of $t_1$ and $t_2$ by performing a geometric calculation on their embeddings (see the NumPy sketch after this list).
   * **Cosine Similarity:** The cosine of the angle between $e_1$ and $e_2$: $$\cos(\theta)=\frac{e_1 \cdot e_2}{||e_1||\;||e_2||}.$$
   * **Inner Product:** The inner product of $e_1$ and $e_2$: $$e_1 \cdot e_2.$$
   * **Distance:** The distance between $e_1$ and $e_2$ under an $L_p$ norm: $$||e_1-e_2||_p.$$
4. **Generator Model:** Generates the final answer from the question and the retrieved text that may contain the answer.
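The three similarity metrics above are easy to compute directly. Here is a minimal NumPy sketch; the vectors are made-up examples, not real embeddings:

```python
import numpy as np

e1 = np.array([0.2, 0.8, -0.1])   # made-up embedding vectors
e2 = np.array([0.1, 0.9,  0.3])

# Cosine similarity: cos(theta) = (e1 . e2) / (||e1|| ||e2||)
cosine = e1 @ e2 / (np.linalg.norm(e1) * np.linalg.norm(e2))

# Inner product: e1 . e2
inner = e1 @ e2

# L_p distance, here with p = 2 (Euclidean distance)
distance = np.linalg.norm(e1 - e2, ord=2)

print(cosine, inner, distance)
```

Putting the four "How RAG Works" steps together, below is a rough illustrative sketch of the whole pipeline using the OpenAI v1 Python client. The corpus, question, prompt, and model names are placeholder assumptions; a real deployment would use a vector DB instead of the in-memory NumPy search (the LangChain quickstart linked above shows that version):

```python
import numpy as np
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Steps 1-2: vectorize the text collection with an embedding model.
corpus = [
    "The warranty covers manufacturing defects for two years.",
    "Returns are accepted within 30 days with the original receipt.",
    "Our support line is open Monday through Friday, 9am to 5pm.",
]
corpus_vectors = np.array([
    d.embedding
    for d in client.embeddings.create(model="text-embedding-3-small", input=corpus).data
])

# Step 3: embed the question with the same model and retrieve the most
# similar document using cosine similarity.
question = "How long is the warranty?"
q = np.array(
    client.embeddings.create(model="text-embedding-3-small", input=[question]).data[0].embedding
)
scores = corpus_vectors @ q / (np.linalg.norm(corpus_vectors, axis=1) * np.linalg.norm(q))
top_doc = corpus[int(np.argmax(scores))]

# Step 4: pass the retrieved context and the question to a generator LLM.
prompt = (
    "Answer the question using only the context.\n\n"
    f"Context:\n{top_doc}\n\nQuestion: {question}"
)
answer = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
).choices[0].message.content
print(answer)
```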
## Cosine Similarity Problems

* The motivation for using cosine similarity is that the norm of the learned embedding vectors is not as important as the directional alignment between them.
* But cosine similarity can "work better but sometimes also worse than the unnormalized dot-product between embedded vectors" in practice.
* The paper derives "analytically how cosine-similarity can yield arbitrary and therefore meaningless ‘similarities.’"
* To do this, the authors "study embeddings derived from regularized linear models, where closed-form solutions facilitate analytical insights."
* "The underlying reason is not cosine similarity itself, but the fact that the learned embeddings have a degree of freedom that can render arbitrary cosine-similarities even though their (unnormalized) dot-products are well-defined and unique." A small numerical illustration of this degree of freedom follows below.
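As a toy NumPy illustration of that degree of freedom (not taken from the paper itself): a diagonal rescaling applied to two factor matrices leaves all dot products between the two sides unchanged, yet changes the cosine similarities among the embeddings on one side:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up embeddings for two sides of a linear model (e.g., items and users).
A = rng.normal(size=(3, 4))   # three item embeddings, 4-dimensional
B = rng.normal(size=(5, 4))   # five user embeddings, 4-dimensional

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Rescale A -> A @ D and B -> B @ inv(D) with an arbitrary diagonal D.
D = np.diag([10.0, 0.1, 3.0, 0.5])
A2, B2 = A @ D, B @ np.linalg.inv(D)

# The (unnormalized) dot products between the two sides are unchanged ...
assert np.allclose(A @ B.T, A2 @ B2.T)

# ... but the cosine similarity between item embeddings is not.
print(cosine(A[0], A[1]))    # cosine similarity of items 0 and 1 before rescaling
print(cosine(A2[0], A2[1]))  # generally a different value after rescaling
```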