---
title: Simple RAG
jupyter: python3
eval: false
code-annotations: hover

---

```{python}
!pip install -q torch transformers accelerate bitsandbytes sentence-transformers faiss-gpu
```

```{python}
!pip install -q langchain
```

::: callout-note
If running in Google Colab, you may need to run this cell to make sure you're using a UTF-8 locale when installing LangChain:
```{python}
import locale
locale.getpreferredencoding = lambda: "UTF-8"
```
:::


## Prepare the data

In this example, we'll load all of the issues (both open and closed) from the [PEFT library's repo](https://github.com/huggingface/peft).

First, you need to acquire a [GitHub personal access token](https://github.com/settings/tokens?type=beta) to access the GitHub API.

```{python}
from getpass import getpass

ACCESS_TOKEN = getpass("YOUR_GITHUB_PERSONAL_TOKEN")  # <1>
```
1. You can also use an environment variable to store your token.
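For example, here is a minimal sketch of falling back to an environment variable; the variable name `GITHUB_PERSONAL_TOKEN` is just an illustrative choice, use whatever name you exported:

```{python}
import os
from getpass import getpass

# Prefer an environment variable if one is set; otherwise prompt interactively.
# GITHUB_PERSONAL_TOKEN is a hypothetical name; export your token under any name you like.
ACCESS_TOKEN = os.environ.get("GITHUB_PERSONAL_TOKEN") or getpass("YOUR_GITHUB_PERSONAL_TOKEN")
```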

Next, we'll load all of the issues in the [huggingface/peft](https://github.com/huggingface/peft) repo:
- By default, pull requests are considered issues as well; here we choose to exclude them from the data by setting `include_prs=False`.
- Setting `state = "all"` means we will load both open and closed issues.

```{python}
from langchain.document_loaders import GitHubIssuesLoader

loader = GitHubIssuesLoader(
    repo="huggingface/peft",
    access_token=ACCESS_TOKEN,
    include_prs=False,
    state="all"
)

docs = loader.load()
```

The content of individual GitHub issues may be longer than what an embedding model can take as input. If we want to embed all of the available content, we need to chunk the documents into appropriately sized pieces.

The most common and straightforward approach to chunking is to define a fixed chunk size and whether there should be any overlap between chunks. Keeping some overlap between chunks preserves semantic context across chunk boundaries.

Other approaches are typically more involved and take into account the documents' structure and context. For example, one may want to split a document based on sentences or paragraphs, or create chunks based on the document's structure (for example, section headers).

Fixed-size chunking, however, works well for most common cases, so that is what we'll do here.

```{python}
from langchain.text_splitter import CharacterTextSplitter

splitter = CharacterTextSplitter(chunk_size=512, chunk_overlap=30)

chunked_docs = splitter.split_documents(docs)
```
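It can be helpful to quickly sanity-check the result of chunking before moving on. Here is a small sketch (not part of the original pipeline) that prints the number of chunks and the character length of a few of them:

```{python}
# Quick sanity check: how many chunks were produced, and how long are they?
print(f"Number of chunks: {len(chunked_docs)}")
print("Character lengths of the first five chunks:",
      [len(doc.page_content) for doc in chunked_docs[:5]])
```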

## Create the embeddings + retriever

Now that the docs are all of the appropriate size, we can create a database with their embeddings.

To create the document chunk embeddings we'll use `HuggingFaceEmbeddings` with the [`BAAI/bge-base-en-v1.5`](https://huggingface.co/BAAI/bge-base-en-v1.5) embeddings model. To create the vector database, we'll use `FAISS`, a library developed by Facebook AI that offers efficient similarity search and clustering of dense vectors, which is exactly what we need here. FAISS is currently one of the most widely used libraries for nearest-neighbor search in massive datasets.

::: callout-tip
There are many other embeddings models available on the Hub, and you can keep an eye on the best performing ones by checking the [Massive Text Embedding Benchmark (MTEB) Leaderboard](https://huggingface.co/spaces/mteb/leaderboard).
:::

We'll access both the embeddings model and FAISS via the LangChain API.

```{python}
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings

db = FAISS.from_documents(chunked_docs,
                          HuggingFaceEmbeddings(model_name='BAAI/bge-base-en-v1.5'))
```

We need a way to return (retrieve) documents given an unstructured query. For that, we'll use the `as_retriever` method with `db` as the backbone:
- `search_type="similarity"` means we want to perform similarity search between the query and the documents
- `search_kwargs={'k': 4}` instructs the retriever to return the top 4 results.

```{python}
retriever = db.as_retriever(
    search_type="similarity",            # <1>
    search_kwargs={'k': 4}               # <1>
)
```
1. The ideal search type is context dependent, and you should experiment to find the best one for your data.
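Before wiring the retriever into the chain, you can also try it directly on a sample query and inspect what comes back. This is a quick check assuming the cells above have been run; in recent LangChain versions, `retriever.invoke(...)` does the same thing:

```{python}
# Fetch the top-k chunks for a sample query and peek at the first result.
sample_docs = retriever.get_relevant_documents("How do you combine multiple adapters?")
print(f"Retrieved {len(sample_docs)} documents")
print(sample_docs[0].page_content[:200])
```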

The vector database and retriever are now set up; next, we need the final piece of the chain: the model.

## Load quantized model

For this example, we chose [`HuggingFaceH4/zephyr-7b-beta`](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta), a small but powerful model.
To make inference faster, we will load the quantized version of the model:

::: {.callout-tip}
With many models being released every week, you may want to substitute this model with the latest and greatest. The best way to keep track of open-source LLMs is to check the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).
:::

```{python}
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_name = 'HuggingFaceH4/zephyr-7b-beta'

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```

## Set up the LLM chain

Finally, we have all the pieces we need to set up the LLM chain.

First, create a `text-generation` pipeline using the loaded model and its tokenizer.

Next, create a prompt template - this should follow the format of the model, so if you substitute the model checkpoint, make sure to use the appropriate formatting.

```{python}
from langchain.llms import HuggingFacePipeline
from langchain.prompts import PromptTemplate
from transformers import pipeline
from langchain_core.output_parsers import StrOutputParser

text_generation_pipeline = pipeline(
    model=model,                # <1> 
    tokenizer=tokenizer,        # <2> 
    task="text-generation",     # <3> 
    temperature=0.2,            # <4> 
    do_sample=True,             # <5> 
    repetition_penalty=1.1,     # <6> 
    return_full_text=True,      # <7> 
    max_new_tokens=400,         # <8> 
)

llm = HuggingFacePipeline(pipeline=text_generation_pipeline)

prompt_template = """
<|system|>
Answer the question based on your knowledge. Use the following context to help:

{context}

</s>
<|user|>
{question}
</s>
<|assistant|>

 """

prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=prompt_template,
)

llm_chain = prompt | llm | StrOutputParser()
```

1. The pre-trained model for text generation.
2. Tokenizer to preprocess input text and postprocess generated output.
3. Specifies the task as text generation.
4. Controls the randomness in the output generation. Lower values make the output more deterministic.
5. Enables sampling to introduce randomness in the output generation.
6. Penalizes repetition in the output to encourage diversity.
7. Returns the full generated text including the input prompt.
8. Limits the maximum number of new tokens generated.

Note: _You can also use `tokenizer.apply_chat_template` to convert a list of messages (as dicts: `{'role': 'user', 'content': '(...)'}`) into a string with the appropriate chat format._
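For instance, here is a hedged sketch of building an equivalent template string with the tokenizer's chat template; the resulting string may differ slightly from the hand-written template above, and `chat_prompt_template` is just an illustrative variable name:

```{python}
messages = [
    {"role": "system", "content": "Answer the question based on your knowledge. Use the following context to help:\n\n{context}"},
    {"role": "user", "content": "{question}"},
]

# tokenize=False returns the formatted string; add_generation_prompt appends the assistant turn marker.
# The literal {context} and {question} placeholders survive formatting and can still be used with PromptTemplate.
chat_prompt_template = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(chat_prompt_template)
```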


Finally, we need to combine the `llm_chain` with the retriever to create a RAG chain. We pass the original question through to the final generation step, as well as the retrieved context docs:

```{python}
from langchain_core.runnables import RunnablePassthrough

rag_chain = (
 {"context": retriever, "question": RunnablePassthrough()}
    | llm_chain
)
```

## Compare the results

Let's see the difference RAG makes in generating answers to the library-specific questions.

```{python}
question = "How do you combine multiple adapters?"
```

First, let's see what kind of answer we can get with just the model itself, no context added:

```{python}
#| colab: {base_uri: 'https://localhost:8080/', height: 125}
llm_chain.invoke({"context":"", "question": question})
```

As you can see, the model interpreted the question as one about physical computer adapters, while in the context of PEFT, "adapters" refer to LoRA adapters.
Let's see if adding context from GitHub issues helps the model give a more relevant answer:

```{python}
#| colab: {base_uri: 'https://localhost:8080/', height: 125}
rag_chain.invoke(question)
```

As we can see, the added context really helps the exact same model provide a much more relevant and informed answer to the library-specific question.

Notably, combining multiple adapters for inference has since been added to the library, and one can find this information in its documentation, so for the next iteration of this RAG system it may be worth including documentation embeddings as well.
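If you do pursue that next iteration, one possible approach (a sketch only, with a hypothetical local path to the PEFT documentation sources) is to chunk the documentation pages with the same splitter and add them to the existing FAISS index:

```{python}
from langchain.document_loaders import DirectoryLoader, TextLoader

# Hypothetical path: point this at a local checkout of the PEFT docs (markdown sources).
doc_loader = DirectoryLoader("peft/docs/source", glob="**/*.md", loader_cls=TextLoader)
doc_pages = doc_loader.load()

# Reuse the splitter from earlier and extend the existing FAISS index in place.
chunked_doc_pages = splitter.split_documents(doc_pages)
db.add_documents(chunked_doc_pages)
```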