RAG: Retrieval-Augmented Generation

Paper: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
Code: https://python.langchain.com/docs/use_cases/question_answering/quickstart
Similarity Search: https://arxiv.org/pdf/2403.05440.pdf
Prompt Hub: https://smith.langchain.com/hub

Premise

LLMs store factual knowledge in their parameters, but accessing and manipulating this knowledge in a precise way is not easy. Instead, the relevant knowledge can be retrieved from an external store via embedding similarity and then added to the prompt of another model for answer generation:

RAG Architecture

The original paper considers training the retriever and generator models end-to-end in one pipeline. Models like GPT-3.5 and GPT-4 don't need that training step if they are used with the OpenAI embedding models.

Where Is RAG Used?

  • Mitigating hallucinations generated by LLMs
    • Hallucinations are factually incorrect information generated by an LLM in response to an instruction or question from a user.
    • Hallucinations are very hard to detect and need other methodologies to catch them.
    • Even with RAG, the probability of hallucination is not zero.
  • Allowing LLMs to consume data at inference time that was not part of their training
    • LLMs are pre-trained on huge amounts of data from public sources.
    • Proprietary data is not available to any general pre-trained LLM.

How RAG Works

  1. One can vectorize the semantics of a piece of text using specialized LLMs, called embedding models.
  2. A collection of texts can be vectorized in advance to be used for answering incoming questions.
  3. A question is embedded using the same embedding model, and similar documents are retrieved from a vector database using a similarity search algorithm, like cosine similarity.
  4. The retrieved documents, together with the question, are passed to a generator LLM to produce an answer (see the sketch below).
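
A minimal end-to-end sketch of these four steps, assuming the sentence-transformers library with the all-MiniLM-L6-v2 embedding model (an illustrative choice, not one prescribed here); the generator call in step 4 is a hypothetical placeholder:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Step 1: an embedding model that maps each string to a fixed-length vector.
embedder = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dimensional output

# Step 2: vectorize a small collection of texts up front.
docs = [
    "RAG combines a retriever with a generator LLM.",
    "Cosine similarity measures the angle between two vectors.",
    "Proprietary data is not seen during public pre-training.",
]
doc_vecs = embedder.encode(docs)  # shape: (3, 384)

# Step 3: embed the question with the same model and rank the
# documents by cosine similarity.
question = "How does RAG use a retriever?"
q_vec = embedder.encode(question)
sims = doc_vecs @ q_vec / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec))
best_doc = docs[int(np.argmax(sims))]

# Step 4: pass the retrieved context plus the question to a generator LLM.
prompt = f"Answer using only this context:\n{best_doc}\n\nQuestion: {question}"
# answer = generator_llm(prompt)  # hypothetical generator call
print(prompt)
```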

RAG Components

Here are the different components present in a RAG pipeline:

  1. Embedding Model: Vectorization model which outputs a vector of fixed length for each string.
    • The length is the dimension of the latent space for this model.
  2. Vector DB: Specialized database for saving pairs of (text, embedding). Each of these pairs is called a document. Usually we put related documents in one collection, which makes the similarity search easier.
  3. Similarity Metric: Given two document pairs $(t_1,e_1)$ and $(t_2,e_2)$, the metric calculates the similarity of $t_1$ and $t_2$ by performing some geometric calculation on their respective embeddings (see the sketch after this list).
    • Cosine Similarity: Calculates the cosine of the angle between $e_1$ and $e_2$: $$\cos(\theta)=\frac{e_1 \cdot e_2}{||e_1|| \; ||e_2||}.$$
    • Inner Product: Calculates the inner product of $e_1$ and $e_2$: $$e_1 \cdot e_2.$$
    • Distance: Calculates the distance of $e_1$ from $e_2$ using $L_p$ norms: $$||e_1-e_2||_p.$$
  4. Generator Model: Generates the final answer based on the question and found similar text in the database that may contain the answer.
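
A small sketch of the three similarity metrics above, computed with plain NumPy on two illustrative embeddings:

```python
import numpy as np

e1 = np.array([1.0, 2.0, 3.0])
e2 = np.array([2.0, 4.0, 5.0])

# Cosine similarity: cos(theta) = (e1 . e2) / (||e1|| ||e2||)
cosine = e1 @ e2 / (np.linalg.norm(e1) * np.linalg.norm(e2))

# Inner product: e1 . e2 (unlike cosine, sensitive to the vectors' norms)
inner = e1 @ e2

# L_p distance, here the Euclidean case p = 2: ||e1 - e2||_2
distance = np.linalg.norm(e1 - e2, ord=2)

print(f"cosine={cosine:.4f}, inner={inner:.1f}, distance={distance:.4f}")
```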

Cosine Similarity Problems

  • The motivation for using cosine similarity is that the norm of the learned embedding-vectors is not as important as the directional alignment between the embedding-vectors.
  • But cosine similarity is found to "work better but sometimes also worse than the unnormalized dot-product between embedded vectors in practice."
  • The paper derives "analytically how cosine-similarity can yield arbitrary and therefore meaningless ‘similarities.’"
  • To do this, they "study embeddings derived from regularized linear models, where closed-form solutions facilitate analytical insights."
  • "The underlying reason is not cosine similarity itself, but the fact that the learned embeddings have a degree of freedom that can render arbitrary cosine-similarities even though their (unnormalized) dot-products are well-defined and unique."