# RAG: Retrieval-Augmented Generation
Paper: [Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks](https://arxiv.org/pdf/2005.11401v4.pdf)  
Code: [https://python.langchain.com/docs/use_cases/question_answering/quickstart](https://python.langchain.com/docs/use_cases/question_answering/quickstart)  
Similarity Search: [https://arxiv.org/pdf/2403.05440.pdf](https://arxiv.org/pdf/2403.05440.pdf)  
Prompt Hub: [https://smith.langchain.com/hub](https://smith.langchain.com/hub)

## Premise
LLMs store factual knowledge in their parameters, but accessing and manipulating this knowledge in a precise way is not easy. Instead, the relevant knowledge can be retrieved from an external collection via embedding similarity and then added to the prompt of another model for answer generation:

![RAG Architecture](readme_data/rag1.png)
The original paper considers training the retriever and generator models end-to-end in one pipeline.   
Models like GPT-3.5 and GPT-4 do not need that training step if they are used with the OpenAI embedding models.

**Where Is RAG Used?**
* Mitigate hallucination generated by LLMs
    * Hallucinations are factually incorrect information generated by an LLM in response to an instruction or question from a user. 
    * Hallucinations are hard to detect and require separate methodologies to catch them.
    * Even with RAG the probability of hallucination is not zero.
* Allow LLMs to consume data at inference time that was not part of their training.
    * LLMs are pre-trained on huge amounts of data from public sources.
    * Proprietary data is not available to any general pre-trained LLM.

**How RAG Works**  
1. One can vectorize the semantics of a piece of text using specialized LLMs, called embedding models. 
2. A collection of texts can be vectorized ahead of time and used for answering incoming questions. 
3. A question is embedded using the same embedding model, and similar documents are retrieved from a vector database with a similarity-search algorithm such as cosine similarity.
4. The retrieved documents, together with the question, are passed to a generator LLM to produce an answer (see the sketch below).
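
A minimal sketch of these four steps. It assumes the OpenAI Python SDK is installed and `OPENAI_API_KEY` is set; the model names and the in-memory list standing in for a vector database are illustrative assumptions, not part of the original paper or the LangChain quickstart.

```python
# Minimal RAG sketch: embed documents, retrieve by cosine similarity, generate an answer.
# Assumes `pip install openai numpy`, OPENAI_API_KEY set, and illustrative model names.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts):
    """Step 1: vectorize text with an embedding model (model name is an assumption)."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

# Step 2: vectorize a small document collection (an in-memory list stands in for a vector DB).
docs = [
    "RAG retrieves documents and adds them to the prompt of a generator LLM.",
    "Cosine similarity measures the angle between two embedding vectors.",
    "Proprietary data is not available to a general pre-trained LLM.",
]
doc_vecs = embed(docs)

# Step 3: embed the question and retrieve the most similar document via cosine similarity.
question = "How does RAG reduce hallucinations?"
q_vec = embed([question])[0]
sims = doc_vecs @ q_vec / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec))
context = docs[int(np.argmax(sims))]

# Step 4: pass the retrieved context and the question to a generator LLM.
answer = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative generator model
    messages=[
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ],
)
print(answer.choices[0].message.content)
```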


## RAG Components
Here are different components present in a RAG pipeline:
1. **Embedding Model:** A vectorization model that outputs a fixed-length vector for each input string. 
    * The length is the dimension of the model's latent space.
2. **Vector DB:** A specialized database for storing pairs of (text, embedding). Each such pair is called a document. Related documents are usually placed in one collection, which makes similarity search easier.
3. **Similarity Metric:** Given two document pairs $(t_1,e_1)$ and $(t_2,e_2)$, the metric estimates the similarity of $t_1$ and $t_2$ by performing a geometric calculation on their respective embeddings (see the sketch after this list). 
    * **Cosine Similarity:** Calculates the cosine of the angle between $e_1$ and $e_2$:
    $$\cos(\theta)=\frac{e_1 \cdot e_2}{||e_1||\;||e_2||}.$$
    * **Inner Product:** Calculates the inner product of $e_1$ and $e_2$:
    $$e_1 \cdot e_2.$$
    * **Distance:** Calculates the distance of $e_1$ from $e_2$ using $L_p$ norms:
    $$||e_1-e_2||_p.$$
4. **Generator Model:** Generates the final answer from the question and the retrieved text that may contain the answer.
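
A small NumPy sketch of the three metrics above; the example vectors are arbitrary.

```python
# Compute the three similarity measures between two example embeddings.
import numpy as np

e1 = np.array([0.2, 0.8, 0.1])
e2 = np.array([0.1, 0.9, 0.3])

# Cosine similarity: cos(theta) = (e1 . e2) / (||e1|| ||e2||)
cosine = e1 @ e2 / (np.linalg.norm(e1) * np.linalg.norm(e2))

# Inner (dot) product: e1 . e2
inner = e1 @ e2

# L_p distance, e.g. p = 2 (Euclidean): ||e1 - e2||_2
distance = np.linalg.norm(e1 - e2, ord=2)

print(cosine, inner, distance)
```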

## Cosine Similarity Problems
* The motivation for using cosine similarity is that the norm of the learned embedding vectors is less important than the directional alignment between them.
* But cosine similarity can "work better but sometimes also worse than the unnormalized dot-product between embedded vectors in practice."
* The paper derives "analytically how cosine-similarity can yield arbitrary and therefore meaningless ‘similarities.’" 
* To do this, they "study embeddings derived from regularized linear models, where closed-form solutions facilitate analytical insights."
* "The underlying reason is not cosine similarity itself, but the fact that the learned embeddings have a degree of freedom that can render arbitrary cosine-similarities even though their (unnormalized) dot-products are well-defined and unique."