jerpint committed on
Commit
c6dd20e
1 Parent(s): 2642581

Use dropdown to select source (#71)


* add dropdown menu for switching data sources

* Add ability to update Buster's config on the fly

* Add lightning, godot documentation sources

* add download script for the weights (from huggingface dataset)

* update tests

* Add logging to pytest

* Fix source titles when returning results

* return percentages instead of cosine score

* change source directly when you call chat

.gitattributes DELETED
@@ -1 +0,0 @@
1
- *.db filter=lfs diff=lfs merge=lfs -text
 
 
.gitignore CHANGED
@@ -1,3 +1,4 @@
 
1
  # Byte-compiled / optimized / DLL files
2
  __pycache__/
3
  *.py[cod]
 
1
+ buster/apps/data/
2
  # Byte-compiled / optimized / DLL files
3
  __pycache__/
4
  *.py[cod]
buster/apps/bot_configs.py ADDED
@@ -0,0 +1,175 @@
1
+ from buster.buster import BusterConfig
2
+
3
+ huggingface_cfg = BusterConfig(
4
+ unknown_prompt="I'm sorry, but I am an AI language model trained to assist with questions related to the huggingface transformers library. I cannot answer that question as it is not relevant to the library or its usage. Is there anything else I can assist you with?",
5
+ embedding_model="text-embedding-ada-002",
6
+ top_k=3,
7
+ thresh=0.7,
8
+ max_words=3000,
9
+ completer_cfg={
10
+ "name": "ChatGPT",
11
+ "text_before_documents": (
12
+ "You are a chatbot assistant answering technical questions about huggingface transformers, a library to train transformers in python. "
13
+ "You can only respond to a question if the content necessary to answer the question is contained in the following provided documentation. "
14
+ "If the answer is in the documentation, summarize it in a helpful way to the user. "
15
+ "If it isn't, simply reply that you cannot answer the question. "
16
+ "Do not refer to the documentation directly, but use the instructions provided within it to answer questions. "
17
+ "Here is the documentation: "
18
+ "<DOCUMENTS> "
19
+ ),
20
+ "text_before_prompt": (
21
+ "</DOCUMENTS>\n"
22
+ "REMEMBER:\n"
23
+ "You are a chatbot assistant answering technical questions about huggingface transformers, a library to train transformers in python. "
24
+ "Here are the rules you must follow:\n"
25
+ "1) You must only respond with information contained in the documentation above. Say you do not know if the information is not provided.\n"
26
+ "2) Make sure to format your answers in Markdown format, including code block and snippets.\n"
27
+ "3) Do not reference any links, urls or hyperlinks in your answers.\n"
28
+ "4) If you do not know the answer to a question, or if it is completely irrelevant to the library usage, simply reply with:\n"
29
+ "'I'm sorry, but I am an AI language model trained to assist with questions related to the huggingface transformers library. I cannot answer that question as it is not relevant to the library or its usage. Is there anything else I can assist you with?'\n"
30
+ "5) Do not refer to the documentation directly, but use the instructions provided within it to answer questions.\n"
31
+ "For example:\n"
32
+ "What is the meaning of life for huggingface?\n"
33
+ "I'm sorry, but I am an AI language model trained to assist with questions related to the huggingface transformers library. I cannot answer that question as it is not relevant to the library or its usage. Is there anything else I can assist you with?\n"
34
+ "Now answer the following question:\n"
35
+ ),
36
+ "completion_kwargs": {
37
+ "model": "gpt-3.5-turbo",
38
+ },
39
+ },
40
+ response_format="gradio",
41
+ source="huggingface",
42
+ )
43
+
44
+
45
+ pytorch_cfg = BusterConfig(
46
+ unknown_prompt="I'm sorry, but I am an AI language model trained to assist with questions related to the pytorch library. I cannot answer that question as it is not relevant to the library or its usage. Is there anything else I can assist you with?",
47
+ embedding_model="text-embedding-ada-002",
48
+ top_k=3,
49
+ thresh=0.7,
50
+ max_words=3000,
51
+ completer_cfg={
52
+ "name": "ChatGPT",
53
+ "text_before_documents": (
54
+ "You are a chatbot assistant answering technical questions about pytorch, a library to train neural networks in python. "
55
+ "You can only respond to a question if the content necessary to answer the question is contained in the following provided documentation. "
56
+ "If the answer is in the documentation, summarize it in a helpful way to the user. "
57
+ "If it isn't, simply reply that you cannot answer the question. "
58
+ "Do not refer to the documentation directly, but use the instructions provided within it to answer questions. "
59
+ "Here is the documentation: "
60
+ "<DOCUMENTS> "
61
+ ),
62
+ "text_before_prompt": (
63
+ "</DOCUMENTS>\n"
64
+ "REMEMBER:\n"
65
+ "You are a chatbot assistant answering technical questions about pytorch, a library to train neural networks in python. "
66
+ "Here are the rules you must follow:\n"
67
+ "1) You must only respond with information contained in the documentation above. Say you do not know if the information is not provided.\n"
68
+ "2) Make sure to format your answers in Markdown format, including code block and snippets.\n"
69
+ "3) Do not include any links, urls or hyperlinks in your answers.\n"
70
+ "4) If you do not know the answer to a question, or if it is completely irrelevant to the library usage, simply reply with:\n"
71
+ "'I'm sorry, but I am an AI language model trained to assist with questions related to the pytorch library. I cannot answer that question as it is not relevant to the library or its usage. Is there anything else I can assist you with?'\n"
72
+ "5) Do not refer to the documentation directly, but use the instructions provided within it to answer questions.\n"
73
+ "For example:\n"
74
+ "What is the meaning of life for pytorch?\n"
75
+ "I'm sorry, but I am an AI language model trained to assist with questions related to the pytorch library. I cannot answer that question as it is not relevant to the library or its usage. Is there anything else I can assist you with?\n"
76
+ "Now answer the following question:\n"
77
+ ),
78
+ "completion_kwargs": {
79
+ "model": "gpt-3.5-turbo",
80
+ },
81
+ },
82
+ response_format="gradio",
83
+ source="pytorch",
84
+ )
85
+
86
+ lightning_cfg = BusterConfig(
87
+ unknown_prompt="I'm sorry, but I am an AI language model trained to assist with questions related to the pytorch lightning library. I cannot answer that question as it is not relevant to the library or its usage. Is there anything else I can assist you with?",
88
+ embedding_model="text-embedding-ada-002",
89
+ top_k=3,
90
+ thresh=0.7,
91
+ max_words=3000,
92
+ completer_cfg={
93
+ "name": "ChatGPT",
94
+ "text_before_documents": (
95
+ "You are a chatbot assistant answering technical questions about pytorch lightning, a library to train neural networks in python. "
96
+ "You can only respond to a question if the content necessary to answer the question is contained in the following provided documentation. "
97
+ "If the answer is in the documentation, summarize it in a helpful way to the user. "
98
+ "If it isn't, simply reply that you cannot answer the question. "
99
+ "Do not refer to the documentation directly, but use the instructions provided within it to answer questions. "
100
+ "Here is the documentation: "
101
+ "<DOCUMENTS> "
102
+ ),
103
+ "text_before_prompt": (
104
+ "</DOCUMENTS>\n"
105
+ "REMEMBER:\n"
106
+ "You are a chatbot assistant answering technical questions about pytorch lightning, a library to train neural networks in python. "
107
+ "Here are the rules you must follow:\n"
108
+ "1) You must only respond with information contained in the documentation above. Say you do not know if the information is not provided.\n"
109
+ "2) Make sure to format your answers in Markdown format, including code block and snippets.\n"
110
+ "3) Do not include any links, urls or hyperlinks in your answers.\n"
111
+ "4) If you do not know the answer to a question, or if it is completely irrelevant to the library usage, simply reply with:\n"
112
+ "'I'm sorry, but I am an AI language model trained to assist with questions related to the pytorch lightning library. I cannot answer that question as it is not relevant to the library or its usage. Is there anything else I can assist you with?'\n"
113
+ "5) Do not refer to the documentation directly, but use the instructions provided within it to answer questions.\n"
114
+ "For example:\n"
115
+ "What is the meaning of life for pytorch lightning?\n"
116
+ "I'm sorry, but I am an AI language model trained to assist with questions related to the pytorch lightning library. I cannot answer that question as it is not relevant to the library or its usage. Is there anything else I can assist you with?\n"
117
+ "Now answer the following question:\n"
118
+ ),
119
+ "completion_kwargs": {
120
+ "model": "gpt-3.5-turbo",
121
+ },
122
+ },
123
+ response_format="gradio",
124
+ source="lightning",
125
+ )
126
+
127
+
128
+ godot_cfg = BusterConfig(
129
+ unknown_prompt="I'm sorry, but I am an AI language model trained to assist with questions related to the godot library. I cannot answer that question as it is not relevant to the library or its usage. Is there anything else I can assist you with?",
130
+ embedding_model="text-embedding-ada-002",
131
+ top_k=3,
132
+ thresh=0.7,
133
+ max_words=3000,
134
+ completer_cfg={
135
+ "name": "ChatGPT",
136
+ "text_before_documents": (
137
+ "You are a chatbot assistant answering technical questions about godot, a game-engine library. "
138
+ "You can only respond to a question if the content necessary to answer the question is contained in the following provided documentation. "
139
+ "If the answer is in the documentation, summarize it in a helpful way to the user. "
140
+ "If it isn't, simply reply that you cannot answer the question. "
141
+ "Do not refer to the documentation directly, but use the instructions provided within it to answer questions. "
142
+ "Here is the documentation: "
143
+ "<DOCUMENTS> "
144
+ ),
145
+ "text_before_prompt": (
146
+ "</DOCUMENTS>\n"
147
+ "REMEMBER:\n"
148
+ "You are a chatbot assistant answering technical questions about godot, a game-engine library. "
149
+ "Here are the rules you must follow:\n"
150
+ "1) You must only respond with information contained in the documentation above. Say you do not know if the information is not provided.\n"
151
+ "2) Make sure to format your answers in Markdown format, including code block and snippets.\n"
152
+ "3) Do not include any links, urls or hyperlinks in your answers.\n"
153
+ "4) If you do not know the answer to a question, or if it is completely irrelevant to the library usage, simply reply with:\n"
154
+ "'I'm sorry, but I am an AI language model trained to assist with questions related to the godot library. I cannot answer that question as it is not relevant to the library or its usage. Is there anything else I can assist you with?'\n"
155
+ "5) Do not refer to the documentation directly, but use the instructions provided within it to answer questions.\n"
156
+ "For example:\n"
157
+ "What is the meaning of life for godot?\n"
158
+ "I'm sorry, but I am an AI language model trained to assist with questions related to the godot library. I cannot answer that question as it is not relevant to the library or its usage. Is there anything else I can assist you with?\n"
159
+ "Now answer the following question:\n"
160
+ ),
161
+ "completion_kwargs": {
162
+ "model": "gpt-3.5-turbo",
163
+ },
164
+ },
165
+ response_format="gradio",
166
+ source="godot",
167
+ )
168
+
169
+
170
+ available_configs = {
171
+ "huggingface": huggingface_cfg,
172
+ "pytorch": pytorch_cfg,
173
+ "pytorch-lightning": lightning_cfg,
174
+ "godot": godot_cfg,
175
+ }
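
The four configs in `bot_configs.py` differ only in the library name and its one-line description, and the refusal text quoted in each prompt is meant to match `unknown_prompt`. A minimal sketch of one way to derive all of them from a single template is shown here. `make_prompts`, `REFUSAL`, `SOURCES` and `available_prompts` are hypothetical names that are not part of Buster, the prompt text is abbreviated, and the result is a plain dict rather than a `BusterConfig`.

```python
REFUSAL = (
    "I'm sorry, but I am an AI language model trained to assist with questions related to the {library} library. "
    "I cannot answer that question as it is not relevant to the library or its usage. "
    "Is there anything else I can assist you with?"
)


def make_prompts(source: str, library: str, description: str) -> dict:
    """Build per-library prompt strings from one shared template (hypothetical helper)."""
    refusal = REFUSAL.format(library=library)
    intro = f"You are a chatbot assistant answering technical questions about {library}, {description}. "
    return {
        "source": source,
        "unknown_prompt": refusal,
        "text_before_documents": intro + "Here is the documentation: <DOCUMENTS> ",
        "text_before_prompt": (
            "</DOCUMENTS>\n"
            "REMEMBER:\n"
            f"{intro}\n"
            "If you cannot answer, or the question is irrelevant to the library, simply reply with:\n"
            f"'{refusal}'\n"
            "Now answer the following question:\n"
        ),
    }


# (library name, one-line description) per documentation source, taken from the configs in this diff.
SOURCES = {
    "huggingface": ("huggingface transformers", "a library to train transformers in python"),
    "pytorch": ("pytorch", "a library to train neural networks in python"),
    "lightning": ("pytorch lightning", "a library to train neural networks in python"),
    "godot": ("godot", "a game-engine library"),
}

# One entry per source; the refusal text is generated once and reused in both places.
available_prompts = {source: make_prompts(source, *info) for source, info in SOURCES.items()}
```

Because the refusal string is generated once per library, the text the model is told to reply with cannot drift away from the text used for the "unknown" embedding.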
buster/apps/gradio_app.py CHANGED
@@ -3,53 +3,27 @@ import pathlib
3
 
4
  import gradio as gr
5
 
 
6
  from buster.buster import Buster, BusterConfig
 
 
7
 
8
- DATA_DIR = pathlib.Path(__file__).parent.parent.resolve() / "data" # points to ../data/
9
-
10
- buster_cfg = BusterConfig(
11
- documents_file=os.path.join(DATA_DIR, "document_embeddings_huggingface.tar.gz"),
12
- unknown_prompt="I'm sorry, but I am an AI language model trained to assist with questions related to the huggingface transformers library. I cannot answer that question as it is not relevant to the library or its usage. Is there anything else I can assist you with?",
13
- embedding_model="text-embedding-ada-002",
14
- top_k=3,
15
- thresh=0.7,
16
- max_words=3000,
17
- completer_cfg={
18
- "name": "ChatGPT",
19
- "text_before_documents": (
20
- "You are a chatbot assistant answering technical questions about huggingface transformers, a library to train transformers in python. "
21
- "You can only respond to a question if the content necessary to answer the question is contained in the following provided documentation. "
22
- "If it isn't, simply reply that you cannot answer the question. "
23
- "Here is the documentation: "
24
- "<BEGIN_DOCUMENTATION> "
25
- ),
26
- "text_before_prompt": (
27
- "<\END_DOCUMENTATION>\n"
28
- "REMINDER:\n"
29
- "You are a chatbot assistant answering technical questions about huggingface transformers, a library to train transformers in python. "
30
- "Here are the rules you must follow:\n"
31
- "1) You must only respond with information contained in the documentation above. Say you do not know if the information is not provided.\n"
32
- "2) Make sure to format your answers in Markdown format, including code block and snippets.\n"
33
- "3) Do not include any links to urls or hyperlinks in your answers.\n"
34
- "4) If you do not know the answer to a question, or if it is completely irrelevant to the library usage, simply reply with:\n"
35
- "'I'm sorry, but I am an AI language model trained to assist with questions related to the huggingface transformers library. I cannot answer that question as it is not relevant to the library or its usage. Is there anything else I can assist you with?'"
36
- "For example:\n"
37
- "What is the meaning of life for huggingface?\n"
38
- "I'm sorry, but I am an AI language model trained to assist with questions related to the huggingface transformers library. I cannot answer that question as it is not relevant to the library or its usage. Is there anything else I can assist you with?"
39
- "Now answer the following question:\n"
40
- ),
41
- "completion_kwargs": {
42
- "model": "gpt-3.5-turbo",
43
- },
44
- },
45
- response_format="gradio",
46
- )
47
- buster = Buster(buster_cfg)
48
-
49
-
50
- def chat(question, history):
51
- history = history or []
52
 
 
 
 
 
 
53
  answer = buster.process_input(question)
54
 
55
  # formatting hack for code blocks to render properly every time
@@ -59,11 +33,20 @@ def chat(question, history):
59
  return history, history
60
 
61
 
62
- block = gr.Blocks(css=".gradio-container {background-color: lightgray}")
63
 
64
  with block:
65
  with gr.Row():
66
- gr.Markdown("<h3><center>Buster 🤖: A Question-Answering Bot for Huggingface 🤗 Transformers </center></h3>")
 
 
 
 
 
 
 
 
 
67
 
68
  chatbot = gr.Chatbot()
69
 
@@ -75,11 +58,11 @@ with block:
75
  )
76
  submit = gr.Button(value="Send", variant="secondary").style(full_width=False)
77
 
78
- gr.Examples(
 
79
  examples=[
80
  "What kind of models should I use for images and text?",
81
  "When should I finetune a model vs. training it from scratch?",
82
- "How can I deploy my trained huggingface model?",
83
  "Can you give me some python code to quickly finetune a model on my sentiment analysis dataset?",
84
  ],
85
  inputs=message,
@@ -95,8 +78,8 @@ with block:
95
  state = gr.State()
96
  agent_state = gr.State()
97
 
98
- submit.click(chat, inputs=[message, state], outputs=[chatbot, state])
99
- message.submit(chat, inputs=[message, state], outputs=[chatbot, state])
100
 
101
 
102
  block.launch(debug=True)
 
3
 
4
  import gradio as gr
5
 
6
+ from buster.apps.bot_configs import available_configs
7
  from buster.buster import Buster, BusterConfig
8
+ from buster.documents.base import DocumentsManager
9
+ from buster.documents.utils import download_db, get_documents_manager_from_extension
10
 
11
+ DEFAULT_CONFIG = "huggingface"
12
+ DB_URL = "https://huggingface.co/datasets/jerpint/buster-data/resolve/main/documents.db"
13
+
14
+ # Download the db...
15
+ documents_filepath = download_db(db_url=DB_URL, output_dir="./data")
16
+ documents: DocumentsManager = get_documents_manager_from_extension(documents_filepath)(documents_filepath)
17
+
18
+ # initialize buster with the default config...
19
+ default_cfg: BusterConfig = available_configs.get(DEFAULT_CONFIG)
20
+ buster = Buster(cfg=default_cfg, documents=documents)
21
 
22
+
23
+ def chat(question, history, bot_source):
24
+ history = history or []
25
+ cfg = available_configs.get(bot_source)
26
+ buster.update_cfg(cfg)
27
  answer = buster.process_input(question)
28
 
29
  # formatting hack for code blocks to render properly every time
 
33
  return history, history
34
 
35
 
36
+ block = gr.Blocks(css="#chatbot .overflow-y-auto{height:500px}")
37
 
38
  with block:
39
  with gr.Row():
40
+ gr.Markdown("<h3><center>Buster 🤖: A Question-Answering Bot for open-source libraries </center></h3>")
41
+
42
+ doc_source = gr.Dropdown(
43
+ choices=sorted(list(available_configs.keys())),
44
+ value=DEFAULT_CONFIG,
45
+ interactive=True,
46
+ multiselect=False,
47
+ label="Source of Documentation",
48
+ info="The source of documentation to select from",
49
+ )
50
 
51
  chatbot = gr.Chatbot()
52
 
 
58
  )
59
  submit = gr.Button(value="Send", variant="secondary").style(full_width=False)
60
 
61
+ examples = gr.Examples(
62
+ # TODO: seems not possible (for now) to update examples on change...
63
  examples=[
64
  "What kind of models should I use for images and text?",
65
  "When should I finetune a model vs. training it from scratch?",
 
66
  "Can you give me some python code to quickly finetune a model on my sentiment analysis dataset?",
67
  ],
68
  inputs=message,
 
78
  state = gr.State()
79
  agent_state = gr.State()
80
 
81
+ submit.click(chat, inputs=[message, state, doc_source], outputs=[chatbot, state])
82
+ message.submit(chat, inputs=[message, state, doc_source], outputs=[chatbot, state])
83
 
84
 
85
  block.launch(debug=True)
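
In the new `chat` function, `available_configs.get(bot_source)` returns `None` for a key that is not in the dictionary, and `update_cfg` runs on every message. The sketch below shows one possible way to fall back to the default config and to skip the update when the source has not changed. `FakeConfig` and `FakeBot` are hypothetical stand-ins for `BusterConfig` and `Buster` so that the example runs without Gradio or an OpenAI key.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeConfig:
    """Stand-in for BusterConfig; only the field used in this sketch."""

    source: str


class FakeBot:
    """Stand-in for Buster exposing the two methods that chat() relies on."""

    def __init__(self, cfg: FakeConfig):
        self.cfg = cfg

    def update_cfg(self, cfg: FakeConfig) -> None:
        self.cfg = cfg

    def process_input(self, question: str) -> str:
        return f"[{self.cfg.source}] answer to: {question}"


DEFAULT_CONFIG = "huggingface"
available_configs = {name: FakeConfig(source=name) for name in ("huggingface", "pytorch", "godot")}
bot = FakeBot(available_configs[DEFAULT_CONFIG])


def chat(question, history, bot_source):
    history = history or []
    # dict.get would return None for an unknown key, so fall back to the default config.
    cfg = available_configs.get(bot_source, available_configs[DEFAULT_CONFIG])
    # Only reconfigure the bot when the selected source actually changed.
    if cfg is not bot.cfg:
        bot.update_cfg(cfg)
    answer = bot.process_input(question)
    history.append((question, answer))
    return history, history
```

Note that the bot here is a single module-level object, as in the diff, so a source selected by one user also applies to any other session sharing the same process.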
buster/buster.py CHANGED
@@ -1,12 +1,12 @@
1
  import logging
2
  from dataclasses import dataclass, field
 
3
 
4
  import numpy as np
5
  import pandas as pd
6
  from openai.embeddings_utils import cosine_similarity, get_embedding
7
 
8
  from buster.completers import get_completer
9
- from buster.documents import get_documents_manager_from_extension
10
  from buster.formatter import (
11
  Response,
12
  ResponseFormatter,
@@ -33,6 +33,7 @@ class BusterConfig:
33
  unknown_prompt: Prompt to use to generate the "I don't know" embedding to compare to.
34
  text_before_prompt: Text to prompt GPT with before the user prompt, but after the documentation.
35
  response_footnote: Generic response to add to the chatbot's reply.
 
36
  """
37
 
38
  documents_file: str = "buster/data/document_embeddings.tar.gz"
@@ -60,34 +61,45 @@ class BusterConfig:
60
  response_format: str = "slack"
61
  unknown_prompt: str = "I Don't know how to answer your question."
62
  response_footnote: str = "I'm a bot 🤖 and not always perfect."
 
 
 
 
63
 
64
 
65
  class Buster:
66
- def __init__(self, cfg: BusterConfig):
67
- # TODO: right now, the cfg is being passed as an omegaconf, is this what we want?
68
  self.cfg = cfg
69
- self.completer = get_completer(cfg.completer_cfg)
70
- self._init_documents()
71
- self._init_unk_embedding()
72
- self._init_response_formatter()
 
 
 
 
 
 
 
 
 
73
 
74
- def _init_response_formatter(self):
 
 
 
 
 
75
  self.response_formatter = response_formatter_factory(
76
  format=self.cfg.response_format, response_footnote=self.cfg.response_footnote
77
  )
 
78
 
79
- def _init_documents(self):
80
- filepath = self.cfg.documents_file
81
- logger.info(f"loading embeddings from {filepath}...")
82
- self.documents = get_documents_manager_from_extension(filepath)(filepath)
83
- logger.info(f"embeddings loaded.")
84
-
85
- def _init_unk_embedding(self):
86
- logger.info("Generating UNK embedding...")
87
- self.unk_embedding = get_embedding(
88
- self.cfg.unknown_prompt,
89
- engine=self.cfg.embedding_model,
90
- )
91
 
92
  def rank_documents(
93
  self,
@@ -95,16 +107,17 @@ class Buster:
95
  top_k: float,
96
  thresh: float,
97
  engine: str,
 
98
  ) -> pd.DataFrame:
99
  """
100
  Compare the question to the series of documents and return the best matching documents.
101
  """
102
 
103
- query_embedding = get_embedding(
104
  query,
105
  engine=engine,
106
  )
107
- matched_documents = self.documents.retrieve(query_embedding, top_k)
108
 
109
  # log matched_documents to the console
110
  logger.info(f"matched documents before thresh: {matched_documents}")
@@ -119,7 +132,9 @@ class Buster:
119
  def prepare_documents(self, matched_documents: pd.DataFrame, max_words: int) -> str:
120
  # gather the documents in one large plaintext variable
121
  documents_list = matched_documents.content.to_list()
122
- documents_str = " ".join(documents_list)
 
 
123
 
124
  # truncate the documents to fit
125
  # TODO: increase to actual token count
@@ -135,11 +150,13 @@ class Buster:
135
  self,
136
  response,
137
  matched_documents: pd.DataFrame,
138
- unknown_prompt: str,
139
  ):
140
  logger.info(f"GPT Response:\n{response.text}")
141
  sources = (
142
- Source(dct["source"], dct["url"], dct["similarity"]) for dct in matched_documents.to_dict(orient="records")
 
 
 
143
  )
144
 
145
  return sources
@@ -154,7 +171,7 @@ class Buster:
154
 
155
  set the unk_threshold to 0 to essentially turn off this feature.
156
  """
157
- response_embedding = get_embedding(
158
  completion,
159
  engine=engine,
160
  )
@@ -180,17 +197,18 @@ class Buster:
180
  top_k=self.cfg.top_k,
181
  thresh=self.cfg.thresh,
182
  engine=self.cfg.embedding_model,
 
183
  )
184
 
185
  if len(matched_documents) == 0:
186
- response = Response("I did not find any sources to answer your question.")
187
  sources = tuple()
188
  return self.response_formatter(response, sources)
189
 
190
  # generate a completion
191
  documents: str = self.prepare_documents(matched_documents, max_words=self.cfg.max_words)
192
- response = self.completer.generate_response(user_input, documents)
193
- sources = self.add_sources(response, matched_documents, self.cfg.unknown_prompt)
194
 
195
  # check for relevance
196
  relevant = self.check_response_relevance(
 
1
  import logging
2
  from dataclasses import dataclass, field
3
+ from functools import lru_cache
4
 
5
  import numpy as np
6
  import pandas as pd
7
  from openai.embeddings_utils import cosine_similarity, get_embedding
8
 
9
  from buster.completers import get_completer
 
10
  from buster.formatter import (
11
  Response,
12
  ResponseFormatter,
 
33
  unknown_prompt: Prompt to use to generate the "I don't know" embedding to compare to.
34
  text_before_prompt: Text to prompt GPT with before the user prompt, but after the documentation.
35
  response_footnote: Generic response to add to the chatbot's reply.
36
+ source: the source of the document to consider
37
  """
38
 
39
  documents_file: str = "buster/data/document_embeddings.tar.gz"
 
61
  response_format: str = "slack"
62
  unknown_prompt: str = "I Don't know how to answer your question."
63
  response_footnote: str = "I'm a bot 🤖 and not always perfect."
64
+ source: str = ""
65
+
66
+
67
+ from buster.documents.base import DocumentsManager
68
 
69
 
70
  class Buster:
71
+ def __init__(self, cfg: BusterConfig, documents: DocumentsManager):
72
+ self._unk_embedding = None
73
  self.cfg = cfg
74
+ self.update_cfg(cfg)
75
+
76
+ self.documents = documents
77
+
78
+ @property
79
+ def unk_embedding(self):
80
+ return self._unk_embedding
81
+
82
+ @unk_embedding.setter
83
+ def unk_embedding(self, embedding):
84
+ logger.info("Setting new UNK embedding...")
85
+ self._unk_embedding = embedding
86
+ return self._unk_embedding
87
 
88
+ def update_cfg(self, cfg: BusterConfig):
89
+ """Every time we set a new config, we update the things that need to be updated."""
90
+ logger.info(f"Updating config to {cfg.source}:\n{cfg}")
91
+ self.cfg = cfg
92
+ self.completer = get_completer(cfg.completer_cfg)
93
+ self.unk_embedding = self.get_embedding(self.cfg.unknown_prompt, engine=self.cfg.embedding_model)
94
  self.response_formatter = response_formatter_factory(
95
  format=self.cfg.response_format, response_footnote=self.cfg.response_footnote
96
  )
97
+ logger.info("Config updated.")
98
 
99
+ @lru_cache
100
+ def get_embedding(self, query: str, engine: str):
101
+ logger.info("generating embedding")
102
+ return get_embedding(query, engine=engine)
 
 
 
 
 
 
 
 
103
 
104
  def rank_documents(
105
  self,
 
107
  top_k: float,
108
  thresh: float,
109
  engine: str,
110
+ source: str,
111
  ) -> pd.DataFrame:
112
  """
113
  Compare the question to the series of documents and return the best matching documents.
114
  """
115
 
116
+ query_embedding = self.get_embedding(
117
  query,
118
  engine=engine,
119
  )
120
+ matched_documents = self.documents.retrieve(query_embedding, top_k=top_k, source=source)
121
 
122
  # log matched_documents to the console
123
  logger.info(f"matched documents before thresh: {matched_documents}")
 
132
  def prepare_documents(self, matched_documents: pd.DataFrame, max_words: int) -> str:
133
  # gather the documents in one large plaintext variable
134
  documents_list = matched_documents.content.to_list()
135
+ documents_str = ""
136
+ for doc in documents_list:
137
+ documents_str += f"<DOCUMENT> {doc} </DOCUMENT>"
138
 
139
  # truncate the documents to fit
140
  # TODO: increase to actual token count
 
150
  self,
151
  response,
152
  matched_documents: pd.DataFrame,
 
153
  ):
154
  logger.info(f"GPT Response:\n{response.text}")
155
  sources = (
156
+ Source(
157
+ source=dct["source"], title=dct["title"], url=dct["url"], question_similarity=dct["similarity"] * 100
158
+ )
159
+ for dct in matched_documents.to_dict(orient="records")
160
  )
161
 
162
  return sources
 
171
 
172
  set the unk_threshold to 0 to essentially turn off this feature.
173
  """
174
+ response_embedding = self.get_embedding(
175
  completion,
176
  engine=engine,
177
  )
 
197
  top_k=self.cfg.top_k,
198
  thresh=self.cfg.thresh,
199
  engine=self.cfg.embedding_model,
200
+ source=self.cfg.source,
201
  )
202
 
203
  if len(matched_documents) == 0:
204
+ response = Response(self.cfg.unknown_prompt)
205
  sources = tuple()
206
  return self.response_formatter(response, sources)
207
 
208
  # generate a completion
209
  documents: str = self.prepare_documents(matched_documents, max_words=self.cfg.max_words)
210
+ response: Response = self.completer.generate_response(user_input, documents)
211
+ sources = self.add_sources(response, matched_documents)
212
 
213
  # check for relevance
214
  relevant = self.check_response_relevance(
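
When `functools.lru_cache` decorates a method, `self` becomes part of the cache key and the cache holds a reference to the instance for as long as the entry exists. A minimal sketch of an alternative is to cache a module-level function and have the method delegate to it. `fake_embed`, `cached_embedding`, `CALLS` and `Bot` are hypothetical names; `fake_embed` stands in for the OpenAI `get_embedding` call so that the example runs offline.

```python
from functools import lru_cache

CALLS = []  # records every uncached call, for illustration only


def fake_embed(text: str, engine: str) -> list[float]:
    """Hypothetical stand-in for the real embedding call; no network access."""
    CALLS.append((text, engine))
    return [float(len(text)), float(len(engine))]


@lru_cache(maxsize=1024)
def cached_embedding(text: str, engine: str) -> tuple[float, ...]:
    # Return a tuple so callers cannot mutate the cached value in place.
    return tuple(fake_embed(text, engine))


class Bot:
    def get_embedding(self, query: str, engine: str) -> tuple[float, ...]:
        # The cache is keyed on (query, engine) only, so it is shared across instances.
        return cached_embedding(query, engine)
```

With this layout, switching back to a previously used config reuses the stored embedding of its `unknown_prompt` instead of requesting it again, and no instance is kept alive by the cache.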
buster/data/documents.db DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:b86c2b4f5a2ec410c2b9132ed62213528ba10c0dc260162f689e30ba677815f1
3
- size 244338688
 
 
 
 
buster/documents/utils.py CHANGED
@@ -1,4 +1,5 @@
1
  import os
 
2
  from typing import Type
3
 
4
  from buster.documents.base import DocumentsManager
@@ -12,6 +13,18 @@ def get_file_extension(filepath: str) -> str:
12
  return os.path.splitext(filepath)[1]
13
 
14
 
 
 
 
 
 
 
 
 
 
 
 
 
15
  def get_documents_manager_from_extension(filepath: str) -> Type[DocumentsManager]:
16
  ext = get_file_extension(filepath)
17
 
 
1
  import os
2
+ import urllib.request
3
  from typing import Type
4
 
5
  from buster.documents.base import DocumentsManager
 
13
  return os.path.splitext(filepath)[1]
14
 
15
 
16
+ def download_db(db_url: str, output_dir: str):
17
+ os.makedirs(output_dir, exist_ok=True)
18
+ fname = os.path.join(output_dir, "documents.db")
19
+ if not os.path.exists(fname):
20
+ print(f"Downloading db file from {db_url} to {fname}...")
21
+ urllib.request.urlretrieve(db_url, fname)
22
+ print("Downloaded.")
23
+ else:
24
+ print("File already exists. Skipping.")
25
+ return fname
26
+
27
+
28
  def get_documents_manager_from_extension(filepath: str) -> Type[DocumentsManager]:
29
  ext = get_file_extension(filepath)
30
 
buster/formatter/base.py CHANGED
@@ -4,9 +4,10 @@ from typing import Iterable, NamedTuple
4
 
5
  # Should be from the `documents` module.
6
  class Source(NamedTuple):
7
- source: str
8
  url: str
9
  question_similarity: float
 
10
  # TODO Add answer similarity.
11
  # answer_similarity: float
12
 
@@ -22,7 +23,7 @@ class Response:
22
  @dataclass
23
  class ResponseFormatter:
24
  response_footnote: str
25
- source_template: str = "{source.name} (relevance: {source.question_similarity:2.3f})"
26
  error_msg_template: str = """Something went wrong:\n{response.error_msg}"""
27
  error_fallback_template: str = "Something went very wrong."
28
  sourced_answer_template: str = (
 
4
 
5
  # Should be from the `documents` module.
6
  class Source(NamedTuple):
7
+ title: str
8
  url: str
9
  question_similarity: float
10
+ source: str = ""
11
  # TODO Add answer similarity.
12
  # answer_similarity: float
13
 
 
23
  @dataclass
24
  class ResponseFormatter:
25
  response_footnote: str
26
+ source_template: str = "{source.title} (relevance: {source.question_similarity:2.1f})"
27
  error_msg_template: str = """Something went wrong:\n{response.error_msg}"""
28
  error_fallback_template: str = "Something went very wrong."
29
  sourced_answer_template: str = (
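
After this change, `Source` carries a `title` and a similarity expressed as a percentage, with `source` moved to the end as an optional field. A short sketch of how a matched-document record could be converted and rendered is given below. `to_source`, `SOURCE_TEMPLATE` and the sample record are illustrative assumptions; the field names follow the diff.

```python
from typing import NamedTuple


class Source(NamedTuple):
    title: str
    url: str
    question_similarity: float  # percentage in [0, 100]
    source: str = ""


SOURCE_TEMPLATE = "[🔗 {source.title}]({source.url}), relevance: {source.question_similarity:2.1f} %"


def to_source(record: dict) -> Source:
    """Convert one matched-document record, scaling the cosine score to a percentage."""
    return Source(
        title=record["title"],
        url=record["url"],
        question_similarity=record["similarity"] * 100,
        source=record["source"],
    )


# Hypothetical record, shaped like one row of matched_documents.to_dict(orient="records").
record = {"title": "Trainer", "url": "https://example.com/trainer", "similarity": 0.75, "source": "huggingface"}
formatted = SOURCE_TEMPLATE.format(source=to_source(record))
```

Building `Source` with keyword arguments avoids silently swapping `title` and `source` now that the positional order of the fields has changed.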
buster/formatter/gradio.py CHANGED
@@ -17,7 +17,7 @@ class GradioResponseFormatter(ResponseFormatter):
17
  """{footnote}"""
18
  )
19
  unsourced_answer_template: str = "{response.text}<br><br>{footnote}"
20
- source_template: str = """[🔗 {source.source}]({source.url}), relevance: {source.question_similarity:2.3f}"""
21
 
22
  def sources_list(self, sources: Iterable[Source]) -> str | None:
23
  """Format sources into a list."""
 
17
  """{footnote}"""
18
  )
19
  unsourced_answer_template: str = "{response.text}<br><br>{footnote}"
20
+ source_template: str = """[🔗 {source.title}]({source.url}), relevance: {source.question_similarity:2.1f} %"""
21
 
22
  def sources_list(self, sources: Iterable[Source]) -> str | None:
23
  """Format sources into a list."""
buster/formatter/html.py CHANGED
@@ -37,5 +37,5 @@ class HTMLResponseFormatter(ResponseFormatter):
37
  response.error,
38
  html.escape(response.error_msg) if response.error_msg else response.error_msg,
39
  )
40
- sources = (Source(html.escape(source.source), source.url, source.question_similarity) for source in sources)
41
  return super().__call__(response, sources)
 
37
  response.error,
38
  html.escape(response.error_msg) if response.error_msg else response.error_msg,
39
  )
40
+ sources = (Source(html.escape(source.title), source.url, source.question_similarity) for source in sources)
41
  return super().__call__(response, sources)
buster/formatter/markdown.py CHANGED
@@ -8,7 +8,7 @@ from buster.formatter.base import ResponseFormatter, Source
  class MarkdownResponseFormatter(ResponseFormatter):
      """Format the answer in markdown."""

-     source_template: str = """[🔗 {source.source}]({source.url}), relevance: {source.question_similarity:2.3f}"""
+     source_template: str = """[🔗 {source.title}]({source.url}), relevance: {source.question_similarity:2.3f}"""

      def sources_list(self, sources: Iterable[Source]) -> str | None:
          """Format sources into a list."""
buster/formatter/slack.py CHANGED
@@ -8,7 +8,7 @@ from buster.formatter import ResponseFormatter, Source
  class SlackResponseFormatter(ResponseFormatter):
      """Format the answer for Slack."""

-     source_template: str = """<{source.url}|🔗 {source.source}>, relevance: {source.question_similarity:2.3f}"""
+     source_template: str = """<{source.url}|🔗 {source.title}>, relevance: {source.question_similarity:2.3f}"""

      def sources_list(self, sources: Iterable[Source]) -> str | None:
          """Format sources into a list."""
pyproject.toml CHANGED
@@ -18,3 +18,7 @@ profile = "black"

  [tool.black]
  line-length = 120
+
+ [tool.pytest.ini_options]
+ log_cli = true
+ log_cli_level = "INFO"
tests/test_chatbot.py CHANGED
@@ -5,7 +5,9 @@ import numpy as np
  import pandas as pd

  from buster.buster import Buster, BusterConfig
- from buster.documents import DocumentsManager
+ from buster.completers.base import Completer
+ from buster.documents import DocumentsManager, get_documents_manager_from_extension
+ from buster.formatter.base import Response

  TEST_DATA_DIR = Path(__file__).resolve().parent / "data"
  DOCUMENTS_FILE = os.path.join(str(TEST_DATA_DIR), "document_embeddings_huggingface_subset.tar.gz")
@@ -16,6 +18,17 @@ def get_fake_embedding(length=1536):
      return list(rng.random(length, dtype=np.float32))


+ class MockCompleter(Completer):
+     def __init__(self, expected_answer):
+         self.expected_answer = expected_answer
+
+     def complete(self):
+         return
+
+     def generate_response(self, user_input, documents) -> Response:
+         return Response(self.expected_answer)
+
+
  class DocumentsMock(DocumentsManager):
      def __init__(self, filepath):
          self.filepath = filepath
@@ -39,20 +52,24 @@ class DocumentsMock(DocumentsManager):
          return self.documents


+ import logging
+
+ logging.basicConfig(level=logging.INFO)
+
+
  def test_chatbot_mock_data(tmp_path, monkeypatch):
      gpt_expected_answer = "this is GPT answer"
-     monkeypatch.setattr("buster.buster.get_documents_manager_from_extension", lambda filepath: DocumentsMock)
-     monkeypatch.setattr("buster.buster.get_embedding", lambda x, engine: get_fake_embedding())
-     monkeypatch.setattr("openai.Completion.create", lambda **kwargs: {"choices": [{"text": gpt_expected_answer}]})
+     monkeypatch.setattr(Buster, "get_embedding", lambda self, prompt, engine: get_fake_embedding())
+     monkeypatch.setattr("buster.buster.get_completer", lambda x: MockCompleter(expected_answer=gpt_expected_answer))

      hf_transformers_cfg = BusterConfig(
-         documents_file=tmp_path / "not_a_real_file.tar.gz",
          unknown_prompt="This doesn't seem to be related to the huggingface library. I am not sure how to answer.",
          embedding_model="text-embedding-ada-002",
          top_k=3,
-         thresh=0.7,
+         thresh=0,
          max_words=3000,
          response_format="slack",
+         source="fake source",
          completer_cfg={
              "name": "GPT3",
              "text_before_prompt": (
@@ -72,7 +89,9 @@ def test_chatbot_mock_data(tmp_path, monkeypatch):
              },
          },
      )
-     buster = Buster(hf_transformers_cfg)
+     filepath = tmp_path / "not_a_real_file.tar.gz"
+     documents = DocumentsMock(filepath)
+     buster = Buster(cfg=hf_transformers_cfg, documents=documents)
      answer = buster.process_input("What is a transformer?")
      assert isinstance(answer, str)
      assert answer.startswith(gpt_expected_answer)
@@ -80,7 +99,6 @@ def test_chatbot_mock_data(tmp_path, monkeypatch):

  def test_chatbot_real_data__chatGPT():
      hf_transformers_cfg = BusterConfig(
-         documents_file=DOCUMENTS_FILE,
          unknown_prompt="I'm sorry, but I am an AI language model trained to assist with questions related to the huggingface transformers library. I cannot answer that question as it is not relevant to the library or its usage. Is there anything else I can assist you with?",
          embedding_model="text-embedding-ada-002",
          top_k=3,
@@ -101,14 +119,14 @@ def test_chatbot_real_data__chatGPT():
              },
          },
      )
-     buster = Buster(hf_transformers_cfg)
+     documents = get_documents_manager_from_extension(DOCUMENTS_FILE)(DOCUMENTS_FILE)
+     buster = Buster(cfg=hf_transformers_cfg, documents=documents)
      answer = buster.process_input("What is a transformer?")
      assert isinstance(answer, str)


  def test_chatbot_real_data__chatGPT_OOD():
      buster_cfg = BusterConfig(
-         documents_file=DOCUMENTS_FILE,
          unknown_prompt="I'm sorry, but I am an AI language model trained to assist with questions related to the huggingface transformers library. I cannot answer that question as it is not relevant to the library or its usage. Is there anything else I can assist you with?",
          embedding_model="text-embedding-ada-002",
          top_k=3,
@@ -122,7 +140,7 @@ def test_chatbot_real_data__chatGPT_OOD():
                  """Do not include any links to urls or hyperlinks in your answers. """
                  """If you do not know the answer to a question, or if it is completely irrelevant to the library usage, let the user know you cannot answer. """
                  """Use this response: """
-                 """I'm sorry, but I am an AI language model trained to assist with questions related to the huggingface transformers library. I cannot answer that question as it is not relevant to the library or its usage. Is there anything else I can assist you with?"""
+                 """'I'm sorry, but I am an AI language model trained to assist with questions related to the huggingface transformers library. I cannot answer that question as it is not relevant to the library or its usage. Is there anything else I can assist you with?'\n"""
                  """For example:\n"""
                  """What is the meaning of life for huggingface?\n"""
                  """I'm sorry, but I am an AI language model trained to assist with questions related to the huggingface transformers library. I cannot answer that question as it is not relevant to the library or its usage. Is there anything else I can assist you with?"""
@@ -135,7 +153,8 @@ def test_chatbot_real_data__chatGPT_OOD():
          },
          response_format="gradio",
      )
-     buster = Buster(buster_cfg)
+     documents = get_documents_manager_from_extension(DOCUMENTS_FILE)(DOCUMENTS_FILE)
+     buster = Buster(cfg=buster_cfg, documents=documents)
      answer = buster.process_input("What is a good recipe for brocolli soup?")
      assert isinstance(answer, str)
      assert buster_cfg.unknown_prompt in answer
@@ -143,7 +162,6 @@ def test_chatbot_real_data__chatGPT_OOD():

  def test_chatbot_real_data__GPT():
      hf_transformers_cfg = BusterConfig(
-         documents_file=DOCUMENTS_FILE,
          unknown_prompt="This doesn't seem to be related to the huggingface library. I am not sure how to answer.",
          embedding_model="text-embedding-ada-002",
          top_k=3,
@@ -169,6 +187,7 @@ def test_chatbot_real_data__GPT():
              },
          },
      )
-     buster = Buster(hf_transformers_cfg)
+     documents = get_documents_manager_from_extension(DOCUMENTS_FILE)(DOCUMENTS_FILE)
+     buster = Buster(cfg=hf_transformers_cfg, documents=documents)
      answer = buster.process_input("What is a transformer?")
      assert isinstance(answer, str)
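The updated mock test stops patching `openai.Completion.create` and instead passes the documents in directly and swaps the completer for a stub that returns a canned `Response`. The pattern can be sketched in isolation as below. `Response`, `StubCompleter`, and `TinyBot` are hypothetical, heavily simplified stand-ins written for this illustration; they are not Buster's actual classes or signatures.

```python
from dataclasses import dataclass


@dataclass
class Response:
    # Hypothetical stand-in for buster.formatter.base.Response.
    text: str


class StubCompleter:
    """Returns a canned answer instead of calling a language model."""

    def __init__(self, expected_answer: str):
        self.expected_answer = expected_answer

    def generate_response(self, user_input, documents) -> Response:
        # Ignores its inputs on purpose: the test only checks the plumbing.
        return Response(self.expected_answer)


class TinyBot:
    """Hypothetical, simplified stand-in for Buster with injected dependencies."""

    def __init__(self, completer, documents):
        self.completer = completer
        self.documents = documents

    def process_input(self, user_input: str) -> str:
        response = self.completer.generate_response(user_input, self.documents)
        return response.text


bot = TinyBot(StubCompleter("this is GPT answer"), documents=[])
answer = bot.process_input("What is a transformer?")
print(answer)
```

Injecting the documents and the completer keeps the test independent of both the network and the on-disk embeddings file, which is presumably why `documents_file` could be dropped from `BusterConfig` in these tests.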