jphillips committed on
Commit
566d4e4
·
unverified ·
1 Parent(s): 2fbdd0c

Vector storage (#1)

* clean logging

* Add VectorStores and Embeddings

* Clear outputs

* Add qa chain

.gitignore CHANGED
@@ -1,3 +1,7 @@
1
  # Byte-compiled / optimized / DLL files
2
  __pycache__/
3
  *.py[cod]
 
1
+ ### Project specific
2
+ db/*
3
+ data/*
4
+ flagged
5
  # Byte-compiled / optimized / DLL files
6
  __pycache__/
7
  *.py[cod]
docs/similarity_search.md ADDED
@@ -0,0 +1,82 @@
1
+ # Notes on Similarity Search
2
+ Similarity search, also known as similarity measurement, is a key concept in many domains such as data mining, information retrieval, and machine learning. It quantifies the likeness or sameness between two data entities. Here, we explore three widely used methods for similarity search: Jaccard Similarity, W-Shingling, and Levenshtein Distance.
3
+
4
+ ## Jaccard Similarity
5
+ Jaccard similarity is a measure of how similar two sets are. It is defined as the size of the intersection divided by the size of the union of the two sets. It is a useful metric for comparing sets because it is always normalized to the range [0, 1], whatever the size of the sets, and it is symmetric: the Jaccard similarity of A and B is the same as the Jaccard similarity of B and A.
6
+
7
+ Jaccard similarity is commonly used in information retrieval applications like document clustering and collaborative filtering. It is also used in machine learning applications like k-means clustering and k-nearest neighbors.
8
+ ### Implementation:
9
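+ ```python
+ def jaccard(x: str, y: str) -> float:
+     """Jaccard similarity of two strings, compared as sets of whitespace-separated words."""
+     x_words = set(x.split())
+     y_words = set(y.split())
+     shared = x_words.intersection(y_words)
+     union = x_words.union(y_words)
+     # Two empty strings have an empty union; treat them as identical.
+     if not union:
+         return 1.0
+     return len(shared) / len(union)
+ ```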
18
+ ### Pros:
19
+ - It's simple to understand and implement.
20
+ - It's good for comparing sets of data, such as lists or documents.
21
+ - It's binary, meaning it only cares if items exist, not how many times they exist.
22
+ ### Cons:
23
+ - It is sensitive to differences in set size. If one set is much larger than the other, the union grows and the score stays low, even when the smaller set is fully contained in the larger one.
24
+ - It does not take into account the frequency of the items.
25
+ ### Example:
26
+ You have two sets of data, A = {1, 2, 3, 4} and B = {3, 4, 5, 6}. The intersection of A and B is {3, 4}, and the union of A and B is {1, 2, 3, 4, 5, 6}. So, the Jaccard similarity is 2 (size of intersection) divided by 6 (size of union), which is approximately 0.33.
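The arithmetic in this example can be checked in a few lines of Python. This is a minimal sketch: `jaccard_sets` is an illustrative helper name that is not part of the commit, and it works on sets directly rather than on strings.

```python
def jaccard_sets(a: set, b: set) -> float:
    """Size of the intersection divided by the size of the union."""
    union = a | b
    if not union:
        # Assumption: two empty sets are treated as identical.
        return 1.0
    return len(a & b) / len(union)


A = {1, 2, 3, 4}
B = {3, 4, 5, 6}
score = jaccard_sets(A, B)
print(round(score, 2))  # 0.33
```

Swapping the arguments gives the same score, which is the symmetry property described above.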
27
+
28
+ ## W-Shingling
29
+ W-shingling is a preprocessing method for strings or documents. It breaks the data into overlapping groups of W consecutive items, called shingles. For example, if W = 2, then the string "I love to play football" is broken into the set {"I love", "love to", "to play", "play football"}. W-shingling is useful for comparing documents or strings because it can detect similarities even if the documents are not exactly the same. For example, if two documents are identical except for one word, most of their shingles still match, so the overlap between the two shingle sets remains high.
30
+
31
+ ### Implementation:
32
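+ ```python
+ def w_shingling(a: str, w: int = 2) -> set[str]:
+     """Return the set of overlapping w-word shingles of a string."""
+     words = a.split()
+     # Join each window of w consecutive words into one hashable string.
+     return {" ".join(words[i:i + w]) for i in range(len(words) - w + 1)}
+ ```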
37
+
38
+ ### Pros:
39
+ - It's useful for comparing documents or strings.
40
+ - It's able to detect similarities in different parts of the data, not just exact matches.
41
+ - It's robust to small changes or errors in the data.
42
+
43
+ ### Cons:
44
+ - The choice of the length of the shingles (W) can greatly affect the result. Too small, and it might not capture meaningful similarities. Too large, and it might miss important differences.
45
+ - It can be computationally intensive, especially for large documents or strings.
46
+
47
+ ### Example:
48
+ You have two sentences, "I love to play football" and "I like to play football". If we take 2-shingles (two-word groups), we get the following sets: {"I love", "love to", "to play", "play football"} and {"I like", "like to", "to play", "play football"}. The intersection is {"to play", "play football"}, and the union contains all 6 unique shingles, so the Jaccard similarity of the 2-shingles is 2 / 6, which is approximately 0.33.
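Counting the shingles in code makes the numbers in this example easy to verify. The sketch below is self-contained; `shingles` is an illustrative name, and splitting on whitespace is an assumption about how the text is tokenized.

```python
def shingles(text: str, w: int = 2) -> set[str]:
    """Return the set of overlapping w-word shingles of a string."""
    words = text.split()
    return {" ".join(words[i:i + w]) for i in range(len(words) - w + 1)}


s1 = shingles("I love to play football")
s2 = shingles("I like to play football")
shared = s1 & s2
union = s1 | s2
similarity = len(shared) / len(union)

print(sorted(shared))                    # ['play football', 'to play']
print(len(union), round(similarity, 2))  # 6 0.33
```

A single changed word removes two shingles from the intersection here, because each word takes part in up to two 2-shingles.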
49
+
50
+ ## Levenshtein Distance
51
+ Suppose you have two words, 'cat' and 'bat', and want to know how similar they are. One way is to count how many single-character edits (insertions, deletions, or substitutions) are needed to turn 'cat' into 'bat'. Here you only need to substitute 'b' for 'c', so the Levenshtein distance between 'cat' and 'bat' is 1. In general, the Levenshtein distance measures how similar two strings are by the minimum number of such edits needed to turn one into the other.
52
+ ### Implementation:
53
+ ```python
+ import numpy as np
+
+ def levenshtein_distance(a: str, b: str):
+     # lev[i, j] is the distance between the first i characters of a and the first j of b
+     lev = np.zeros((len(a) + 1, len(b) + 1), dtype=int)
+     for i in range(len(a) + 1):
+         for j in range(len(b) + 1):
+             if min(i, j) == 0:
+                 # an empty prefix needs max(i, j) insertions or deletions
+                 lev[i, j] = max(i, j)
+             else:
+                 # calculate the three possible operations
+                 x = lev[i - 1, j] + 1  # deletion
+                 y = lev[i, j - 1] + 1  # insertion
+                 # substitution costs one only if the two characters differ
+                 z = lev[i - 1, j - 1] + (a[i - 1] != b[j - 1])
+                 # take the minimum of the three
+                 lev[i, j] = min(x, y, z)
+     return lev, int(lev[-1, -1])
+ ```
72
+
73
+ ### Pros:
74
+ - It's useful for comparing strings or sequences.
75
+ - It's able to quantify the difference between two pieces of data.
76
+ - It's useful in applications like spell checking, where you want to find the smallest number of edits to turn one word into another.
77
+ ### Cons:
78
+ - It can be computationally expensive for long strings.
79
+ - It does not handle transpositions (two adjacent characters being swapped) well: a swap is counted as two operations instead of one.
80
+ ### Example:
81
+ The words "kitten" and "sitting" have a Levenshtein distance of 3 because three operations are needed to turn "kitten" into "sitting": replace 'k' with 's', replace 'e' with 'i', and append 'g'.
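The distances quoted in this section can be reproduced with a compact dynamic-programming sketch. It is written in plain Python and keeps only two rows of the table; the function name `levenshtein` is illustrative and is not the implementation from the commit.

```python
def levenshtein(a: str, b: str) -> int:
    # prev[j] holds the distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein("cat", "bat"))         # 1
print(levenshtein("ab", "ba"))           # 2: a swap counts as two edits
```

The last call illustrates the transposition limitation listed under Cons.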
82
+
docs/vector_similarity_search.md ADDED
@@ -0,0 +1,108 @@
1
+ # Vector-Based Similarity Search
2
+ Vector-based similarity search, also known as vector space modeling, is a collection of techniques used in information retrieval and natural language processing. In these models, texts are represented as vectors in a multi-dimensional space, where each dimension corresponds to a separate term or concept. Similarity between texts can then be computed by comparing the vectors. Here, we explore three widely used methods for vector-based similarity search: TF-IDF, BM25, and SBERT.
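To make the idea of "texts as vectors" concrete, here is a minimal sketch that uses raw term counts as the vector components and cosine similarity as the comparison. The helper names, the example sentences, and the choice of plain counts are all assumptions made for illustration; the methods described in this document weight the terms more carefully.

```python
import math
from collections import Counter


def count_vector(text: str, vocab: list[str]) -> list[int]:
    """One dimension per vocabulary term, holding that term's count in the text."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocab]


def cosine(u, v) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


texts = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stock prices fell sharply",
]
vocab = sorted({word for text in texts for word in text.lower().split()})
vectors = [count_vector(text, vocab) for text in texts]

sim_01 = cosine(vectors[0], vectors[1])
sim_02 = cosine(vectors[0], vectors[2])
print(round(sim_01, 2), round(sim_02, 2))  # 0.75 0.0
```

The first two sentences share several terms and point in a similar direction, while the third shares none and scores zero.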
3
+
4
+ ## TF-IDF (Term Frequency-Inverse Document Frequency)
5
+ TF-IDF is a statistical measure used to evaluate the importance of a word in a document, relative to a corpus. The importance of a word increases proportionally to the number of times it appears in the document, but is offset by the frequency of the word in the corpus.
6
+
7
+ TF-IDF is commonly used in information retrieval and text mining, where it is used to rank documents by relevance in response to a query.
8
+
9
+ ### Implementation:
10
+
11
+ ```python
+ import numpy as np
+
+ # a, b and c are placeholder example sentences
+ a, b, c = "the zebra ran", "the dog ran", "the cat sat"
+ docs: list[str] = [a, b, c]
+ vocab = set(" ".join(docs).split())
+
+ def tf_idf(word: str, sentence: str) -> float:
+     tokens = sentence.split()
+     term_frequency = tokens.count(word) / len(tokens)
+     # number of documents containing the word (at least one for any word in vocab)
+     doc_count = sum(1 for doc in docs if word in doc.split())
+     inverse_document_frequency = np.log10(len(docs) / doc_count)
+     return round(term_frequency * inverse_document_frequency, 4)
+
+ def vector_tf_idf(a: str, b: str, vocab: set[str]):
+     vec_a = []
+     vec_b = []
+     for word in vocab:
+         vec_a.append(tf_idf(word, a))
+         vec_b.append(tf_idf(word, b))
+     return vec_a, vec_b
+ ```
30
+
31
+ ### Pros:
32
+ - It's simple to understand and implement.
33
+ - It's good for comparing documents in a corpus.
34
+ - It takes into account not only the frequency of a term in a single document (TF), but also the distribution of the term in the entire document set (IDF).
35
+
36
+ ### Cons:
37
+ - It assumes that the terms are independent, which is often not the case in natural language.
38
+ ### Example:
39
+ Suppose we have a document set consisting of five documents. The term "the" appears often in all documents, while the term "zebra" appears many times in one document, but not in others. TF-IDF will assign a higher weight to "zebra" because it is more important for distinguishing documents in the set.
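The "the" versus "zebra" contrast can be reproduced with a small sketch. The five documents below are made-up placeholders, and the formula (term frequency times a base-10 log IDF) is one common variant, assumed here for illustration.

```python
import math

# Hypothetical five-document corpus: "the" is everywhere, "zebra" only in the first.
docs = [
    "the zebra grazed near the zebra herd",
    "the dog chased the ball",
    "the cat slept on the sofa",
    "the bird sang in the tree",
    "the fish swam in the pond",
]
tokenized = [doc.split() for doc in docs]


def tf_idf(term: str, doc_tokens: list[str]) -> float:
    tf = doc_tokens.count(term) / len(doc_tokens)
    df = sum(1 for tokens in tokenized if term in tokens)
    # A term found in every document gets idf = log10(1) = 0.
    idf = math.log10(len(tokenized) / df) if df else 0.0
    return tf * idf


the_weight = tf_idf("the", tokenized[0])
zebra_weight = tf_idf("zebra", tokenized[0])
print(round(the_weight, 4), round(zebra_weight, 4))  # 0.0 0.1997
```

Both words occur twice in the first document, so the whole difference in weight comes from the IDF term.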
40
+
41
+ ## BM25 (Best Matching 25)
42
+ BM25 is a ranking function used by search engines to rank matching documents according to their relevance to a given search query. BM25 can be viewed as an enhanced version of TF-IDF. It's a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document. It normalizes term frequency by comparing the length of the current document with the average document length, and it dampens the effect of repeated terms so that the score saturates rather than growing linearly.
43
+
44
+ ### Implementation:
45
+ ```python
+ import numpy as np
+
+ # placeholder example documents
+ docs = ["the zebra ran", "the dog ran", "the cat sat"]
+
+ # lengths are measured in words
+ avg_doc_length = sum(len(doc.split()) for doc in docs) / len(docs)
+ N = len(docs)
+
+ def bm25(word: str, sentence: str, k: float = 1.2, b: float = 0.75) -> float:
+     tokens = sentence.split()
+     freq = tokens.count(word)
+     term_freq = freq * (k + 1) / (freq + k * (1 - b + b * len(tokens) / avg_doc_length))
+     # N_q is the number of documents that contain the word
+     N_q = sum(1 for doc in docs if word in doc.split())
+     inverse_document_frequency = np.log(((N - N_q + 0.5) / (N_q + 0.5)) + 1)
+     return round(term_freq * inverse_document_frequency, 4)
+ ```
60
+
61
+ ### Pros:
62
+ - It's effective for ranking documents in response to a user query.
63
+ - It takes into account term frequency and document length.
64
+ ### Cons:
65
+ - Like TF-IDF, it assumes that the terms are independent.
66
+ ### Example:
67
+ Consider a document set consisting of five documents. If a user's query is "zebra", the BM25 score for each document will be calculated based on the occurrence of "zebra" and the length of the document. Documents with a higher frequency of "zebra" and shorter lengths will get higher scores.
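The length effect described here can be seen in a small sketch. The three documents are invented for illustration, and the parameters `k=1.2` and `b=0.75` are the commonly used defaults, assumed rather than tuned.

```python
import math

# Hypothetical corpus: the first two documents mention "zebra" twice each,
# but the second is much longer.
docs = [
    "zebra zebra stripes",
    "zebra zebra stripes and a long list of other savanna animals",
    "the dog chased the ball",
]
tokenized = [doc.split() for doc in docs]
avg_len = sum(len(tokens) for tokens in tokenized) / len(tokenized)


def bm25(term: str, tokens: list[str], k: float = 1.2, b: float = 0.75) -> float:
    freq = tokens.count(term)
    n_q = sum(1 for t in tokenized if term in t)  # documents containing the term
    idf = math.log((len(tokenized) - n_q + 0.5) / (n_q + 0.5) + 1)
    tf = freq * (k + 1) / (freq + k * (1 - b + b * len(tokens) / avg_len))
    return idf * tf


scores = [bm25("zebra", tokens) for tokens in tokenized]
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking)  # [0, 1, 2]
```

With equal term counts, the shorter document ranks first, and the document without the query term scores zero.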
68
+
69
+ ## SBERT (Sentence-BERT)
70
+ SBERT is a modification of the pre-trained BERT network that is specifically designed for sentence embeddings. It uses siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine-similarity.
71
+
72
+ SBERT utilizes dense vector representations of sentences that are trained on large datasets. It can be used for a wide range of language understanding tasks, including sentence similarity, semantic search, and clustering. This allows for more semantic similarity detection than TF-IDF or BM25.
73
+
74
+ ### Implementation:
75
+ ```python
+ from sentence_transformers import SentenceTransformer
+
+ # docs is expected to be a list of six example sentences (labelled a to f in the plot)
+
+ def compute_sbert(docs: list[str]):
+     model = SentenceTransformer('bert-base-nli-mean-tokens')
+     # encode returns one embedding row per input sentence
+     embeddings = model.encode(docs)
+     return embeddings
+
+ from sklearn.metrics.pairwise import cosine_similarity
+ import numpy as np
+
+ def score_sbert(sentence_embeddings: np.ndarray):
+     scores = np.zeros((sentence_embeddings.shape[0], sentence_embeddings.shape[0]))
+     for i in range(sentence_embeddings.shape[0]):
+         # cosine_similarity needs 2D inputs, so slice to keep row i as a (1, dim) array
+         scores[i,:] = cosine_similarity(sentence_embeddings[i:i+1], sentence_embeddings)[0]
+     return scores
+
+ import matplotlib.pyplot as plt
+ import seaborn as sns
+
+ def plot_scores(scores):
+     plt.figure(figsize=(10,9))
+     labels=['a','b','c','d','e','f']
+     sns.heatmap(scores, xticklabels=labels, yticklabels=labels, annot=True)
+ ```
102
+ ### Pros:
103
+ - It's effective for comparing sentence-level semantic similarity.
104
+ - It can handle a wide range of language understanding tasks.
105
+ ### Cons:
106
+ - It requires significant computational resources and time to train.
107
+ ### Example:
108
+ Suppose we have three sentences: "I have a dog", "I have a pet", and "The car is red". If we compute the SBERT embeddings for these sentences and then calculate the cosine similarity between the embeddings, we'll find that the first two sentences ("I have a dog" and "I have a pet") are more similar to each other than either is to the third sentence ("The car is red"). This is because SBERT is able to capture the semantic similarity between "dog" and "pet".
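The comparison step in this example is plain cosine similarity over the embedding rows. The sketch below shows only that step: the three-dimensional vectors are hand-made stand-ins chosen for illustration, not real SBERT output, which has hundreds of dimensions and requires downloading a model.

```python
import numpy as np

# Hypothetical toy vectors standing in for real sentence embeddings.
embeddings = np.array([
    [0.9, 0.1, 0.0],  # "I have a dog"
    [0.8, 0.2, 0.1],  # "I have a pet"
    [0.0, 0.1, 0.9],  # "The car is red"
])


def cosine_matrix(x: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of x."""
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    return unit @ unit.T


scores = cosine_matrix(embeddings)
print(np.round(scores, 2))
```

Normalizing each row once and taking a single matrix product gives every pairwise score at the same time, so no per-row loop is needed.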
modules/chroma_sandbox.ipynb ADDED
@@ -0,0 +1,185 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "code",
5
+ "execution_count": null,
6
+ "metadata": {},
7
+ "outputs": [],
8
+ "source": [
9
+ "from langchain.vectorstores import FAISS\n",
10
+ "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
11
+ "from langchain import OpenAI\n",
12
+ "from langchain.chains import RetrievalQA\n",
13
+ "from langchain.document_loaders import DirectoryLoader\n",
14
+ "import magic\n",
15
+ "import os\n",
16
+ "import nltk\n",
17
+ "\n",
18
+ "openai_api_key = os.getenv(\"OPENAI_API_KEY\")\n",
19
+ "data_location= os.getenv(\"VECTOR_DATA_DIR\")"
20
+ ]
21
+ },
22
+ {
23
+ "cell_type": "markdown",
24
+ "metadata": {},
25
+ "source": [
26
+ "## Chroma"
27
+ ]
28
+ },
29
+ {
30
+ "cell_type": "code",
31
+ "execution_count": null,
32
+ "metadata": {},
33
+ "outputs": [],
34
+ "source": [
35
+ "from modules.vector_stores.vector_stores.chroma_manager import get_default_chroma_mgr\n",
36
+ "\n",
37
+ "chroma_mgr = get_default_chroma_mgr(persisted=True)"
38
+ ]
39
+ },
40
+ {
41
+ "cell_type": "code",
42
+ "execution_count": null,
43
+ "metadata": {},
44
+ "outputs": [],
45
+ "source": [
46
+ "chroma_mgr.persist()"
47
+ ]
48
+ },
49
+ {
50
+ "cell_type": "code",
51
+ "execution_count": null,
52
+ "metadata": {},
53
+ "outputs": [],
54
+ "source": [
55
+ "from modules.vector_stores.retrieval.basic_qa import get_default_qa\n",
56
+ "\n",
57
+ "qa = get_default_qa(chroma_mgr.db)\n"
58
+ ]
59
+ },
60
+ {
61
+ "cell_type": "code",
62
+ "execution_count": null,
63
+ "metadata": {},
64
+ "outputs": [],
65
+ "source": [
66
+ "## Cite sources\n",
67
+ "def process_llm_response(llm_response):\n",
68
+ " print(llm_response['result'])\n",
69
+ " print('\\n\\nSources:')\n",
70
+ " for source in llm_response[\"source_documents\"]:\n",
71
+ " print(source.metadata['source'])"
72
+ ]
73
+ },
74
+ {
75
+ "cell_type": "code",
76
+ "execution_count": null,
77
+ "metadata": {},
78
+ "outputs": [],
79
+ "source": [
80
+ "# full example\n",
81
+ "query = \"What is a date table?\"\n",
82
+ "resp = qa.ask(query)"
83
+ ]
84
+ },
85
+ {
86
+ "cell_type": "markdown",
87
+ "metadata": {},
88
+ "source": [
89
+ "## FAISS"
90
+ ]
91
+ },
92
+ {
93
+ "cell_type": "code",
94
+ "execution_count": null,
95
+ "metadata": {},
96
+ "outputs": [],
97
+ "source": [
98
+ "from modules.vector_stores.loaders.pypdf_load_strategy import PyPDFLoadStrategy, PyPDFConfig, get_default_pypdf_loader\n",
99
+ "from modules.vector_stores.embedding.openai import OpenAIEmbeddings, OpenAIEmbedConfig, get_default_openai_embeddings\n",
100
+ "def get_example_pdf_embedding():\n",
101
+ " dir_location = \"../data\"\n",
102
+ " loader = get_default_pypdf_loader(dir_location)\n",
103
+ " documents = loader.load()\n",
104
+ " embeddings = get_default_openai_embeddings()\n",
105
+ " index = FAISS.from_documents(documents, embeddings)\n",
106
+ " return index\n",
107
+ "index = get_example_pdf_embedding()\n",
108
+ "llm = OpenAI(openai_api_key=openai_api_key)\n",
109
+ "qa = RetrievalQA.from_chain_type(llm=llm, chain_type=\"stuff\", retriever=index.as_retriever())\n",
110
+ "qa = RetrievalQA.from_chain_type(llm=llm,\n",
111
+ " chain_type=\"stuff\",\n",
112
+ " retriever=index.as_retriever(),\n",
113
+ " return_source_documents=True)\n",
114
+ "query = \"What is a date table?\"\n",
115
+ "result = qa({\"query\": query})"
116
+ ]
117
+ },
118
+ {
119
+ "cell_type": "code",
120
+ "execution_count": null,
121
+ "metadata": {},
122
+ "outputs": [],
123
+ "source": [
124
+ "result"
125
+ ]
126
+ },
127
+ {
128
+ "cell_type": "code",
129
+ "execution_count": null,
130
+ "metadata": {},
131
+ "outputs": [],
132
+ "source": [
133
+ "\n",
134
+ "docsearch = FAISS.from_documents(documents, embeddings)\n",
135
+ "llm = OpenAI(openai_api_key=openai_api_key)\n",
136
+ "qa = RetrievalQA.from_chain_type(llm=llm, chain_type=\"stuff\", retriever=docsearch.as_retriever())\n"
137
+ ]
138
+ },
139
+ {
140
+ "cell_type": "code",
141
+ "execution_count": null,
142
+ "metadata": {},
143
+ "outputs": [],
144
+ "source": [
145
+ "qa = RetrievalQA.from_chain_type(llm=llm,\n",
146
+ " chain_type=\"stuff\",\n",
147
+ " retriever=docsearch.as_retriever(),\n",
148
+ " return_source_documents=True)\n",
149
+ "query = \"What is a date table?\"\n",
150
+ "result = qa({\"query\": query})"
151
+ ]
152
+ },
153
+ {
154
+ "cell_type": "code",
155
+ "execution_count": null,
156
+ "metadata": {},
157
+ "outputs": [],
158
+ "source": [
159
+ "result\n"
160
+ ]
161
+ }
162
+ ],
163
+ "metadata": {
164
+ "kernelspec": {
165
+ "display_name": ".venv",
166
+ "language": "python",
167
+ "name": "python3"
168
+ },
169
+ "language_info": {
170
+ "codemirror_mode": {
171
+ "name": "ipython",
172
+ "version": 3
173
+ },
174
+ "file_extension": ".py",
175
+ "mimetype": "text/x-python",
176
+ "name": "python",
177
+ "nbconvert_exporter": "python",
178
+ "pygments_lexer": "ipython3",
179
+ "version": "3.10.6"
180
+ },
181
+ "orig_nbformat": 4
182
+ },
183
+ "nbformat": 4,
184
+ "nbformat_minor": 2
185
+ }
modules/knowledge_retrieval/destination_chain.py CHANGED
@@ -36,8 +36,7 @@ class DestinationChainStrategy(DestinationChain):
36
  def __init__(self, config: LLMChainConfig, display: Callable, knowledge_domain: KnowledgeDomain, usage: str):
37
  settings = UserSettings.get_instance()
38
  api_key = settings.get_api_key()
39
- print("Api key")
40
- print(api_key)
41
  super().__init__(api_key=api_key, knowledge_domain=knowledge_domain, llm=config.llm_class, display=display, usage=usage)
42
 
43
  self.llm = config.llm_class(temperature=config.temperature, max_tokens=config.max_tokens)
 
36
  def __init__(self, config: LLMChainConfig, display: Callable, knowledge_domain: KnowledgeDomain, usage: str):
37
  settings = UserSettings.get_instance()
38
  api_key = settings.get_api_key()
39
+
 
40
  super().__init__(api_key=api_key, knowledge_domain=knowledge_domain, llm=config.llm_class, display=display, usage=usage)
41
 
42
  self.llm = config.llm_class(temperature=config.temperature, max_tokens=config.max_tokens)
modules/llm/__init__.py ADDED
File without changes
modules/llm/defaults.py ADDED
@@ -0,0 +1,32 @@
1
+ import os
2
+ from langchain import OpenAI
3
+ from langchain.chat_models import ChatOpenAI
4
+ OPENAI_API_KEY = os.environ.get('OPENAI_API_KEY')
5
+
6
+
7
+
+ def get_default_cloud_chat_llm():
+     """
+     Returns a default chat LLM instance with the OpenAI API key set in the environment.
+
+     Returns:
+         ChatOpenAI: A new ChatOpenAI instance.
+     """
+     llm = ChatOpenAI(model="gpt-3.5-turbo", openai_api_key=OPENAI_API_KEY, temperature=0)
+     return llm
17
+
18
+ def get_default_cloud_completion_llm():
19
+ """
20
+ Returns a default LLM instance with the OpenAI API key set in the environment.
21
+
22
+ Returns:
23
+ OpenAI: A new OpenAI instance.
24
+ """
25
+ llm = OpenAI(openai_api_key=OPENAI_API_KEY)
26
+ return llm
27
+
28
+ def get_default_local_llm():
29
+ """
30
+ Coming soon!
31
+ """
32
+ pass
modules/reasoning/chain_of_thought.py CHANGED
@@ -1,5 +1,4 @@
1
  from langchain import PromptTemplate, LLMChain
2
- import streamlit as st
3
  from .reasoning_strategy import ReasoningStrategy, ReasoningConfig
4
  from typing import Callable
5
  import pprint
 
1
  from langchain import PromptTemplate, LLMChain
 
2
  from .reasoning_strategy import ReasoningStrategy, ReasoningConfig
3
  from typing import Callable
4
  import pprint
modules/reasoning/reasoning_strategy.py CHANGED
@@ -2,8 +2,6 @@ from langchain.llms import OpenAI
2
  from pydantic import BaseModel
3
  from langchain.llms.base import BaseLLM
4
  from typing import Type, Callable
5
- import streamlit as st
6
- import os
7
 
8
 
9
 
 
2
  from pydantic import BaseModel
3
  from langchain.llms.base import BaseLLM
4
  from typing import Type, Callable
 
 
5
 
6
 
7
 
modules/vector_stores/__init__.py ADDED
File without changes
modules/vector_stores/embedding/__init__.py ADDED
File without changes
modules/vector_stores/embedding/instructorxl.py ADDED
@@ -0,0 +1,7 @@
1
+ from langchain.embeddings import HuggingFaceInstructEmbeddings
2
+
3
+
4
+ def get_default_instructor_embedding():
5
+ instructor_embeddings = HuggingFaceInstructEmbeddings(model_name="hkunlp/instructor-xl",
6
+ model_kwargs={"device": "cuda"})
7
+ return instructor_embeddings
modules/vector_stores/embedding/openai.py ADDED
@@ -0,0 +1,18 @@
1
+ from langchain.embeddings.openai import OpenAIEmbeddings
2
+ from dataclasses import dataclass
3
+ import os
4
+
5
+ @dataclass
6
+ class OpenAIEmbedConfig:
7
+ openai_api_key: str
8
+
9
+ def get_default_openai_embeddings() -> OpenAIEmbeddings:
10
+ """
11
+ Returns a default OpenAIEmbeddings instance with a default API key.
12
+
13
+ Returns:
14
+ OpenAIEmbeddings: A new OpenAIEmbeddings instance.
15
+ """
16
+ openai_api_key = os.environ.get('OPENAI_API_KEY')
17
+ embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)
18
+ return embeddings
modules/vector_stores/embedding_bases.py ADDED
@@ -0,0 +1,9 @@
1
+ from abc import ABC, abstractmethod
2
+ class DocumentLoadStrategy(ABC):
3
+ @abstractmethod
4
+ def load(self):
5
+ pass
6
+
7
+ @abstractmethod
8
+ def split(self, documents, chunk_size, chunk_overlap):
9
+ pass
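As an illustration of how this abstract base is meant to be used, the sketch below defines a concrete strategy. The interface is repeated so the example runs on its own; `InMemoryLoadStrategy` and its character-window splitting are hypothetical and are not part of the commit.

```python
from abc import ABC, abstractmethod


class DocumentLoadStrategy(ABC):
    """Copy of the interface above so the sketch runs on its own."""

    @abstractmethod
    def load(self):
        pass

    @abstractmethod
    def split(self, documents, chunk_size, chunk_overlap):
        pass


class InMemoryLoadStrategy(DocumentLoadStrategy):
    """Hypothetical strategy: serves strings from a list instead of reading files."""

    def __init__(self, texts):
        self.texts = list(texts)

    def load(self):
        return self.texts

    def split(self, documents, chunk_size, chunk_overlap):
        # Slide a window of chunk_size characters, stepping by chunk_size - chunk_overlap.
        step = chunk_size - chunk_overlap
        if step <= 0:
            raise ValueError("chunk_overlap must be smaller than chunk_size")
        chunks = []
        for text in documents:
            for start in range(0, len(text), step):
                chunks.append(text[start:start + chunk_size])
        return chunks


strategy = InMemoryLoadStrategy(["abcdefghij"])
chunks = strategy.split(strategy.load(), chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

Because both methods are abstract, instantiating `DocumentLoadStrategy` directly, or a subclass that omits one of them, raises `TypeError`.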
modules/vector_stores/loaders/__init__.py ADDED
@@ -0,0 +1 @@
1
+ from .pypdf_load_strategy import PyPDFLoadStrategy, PyPDFConfig, PyPDFLoader
modules/vector_stores/loaders/pypdf_load_strategy.py ADDED
@@ -0,0 +1,77 @@
1
+ from typing import List, Iterable
2
+ from langchain.document_loaders import PyPDFLoader, DirectoryLoader
3
+ from loguru import logger
4
+ from langchain.text_splitter import RecursiveCharacterTextSplitter
5
+ from modules.vector_stores.embedding_bases import DocumentLoadStrategy
6
+ from langchain.schema import Document
7
+ from dataclasses import dataclass
8
+
9
+ @dataclass
10
+ class PyPDFConfig:
11
+ dir_location: str
12
+ glob_pattern: str = "./*.pdf"
13
+ chunk_size: int = 1000
14
+ chunk_overlap: int = 200
15
+
16
+ class PyPDFLoadStrategy(DocumentLoadStrategy):
17
+ def __init__(self, config: PyPDFConfig):
18
+ """
19
+ A document load strategy that loads PDF files using PyPDF.
20
+
21
+ Args:
22
+ dir_path (str): The directory path to load PDF files from.
23
+ glob_pattern (str): The glob pattern to match PDF files.
24
+
25
+ Attributes:
26
+ logger (logging.Logger): The logger instance for this class.
27
+ dir_path (str): The directory path to load PDF files from.
28
+ glob_pattern (str): The glob pattern to match PDF files.
29
+ """
30
+ self.logger = logger
31
+ self.dir_path = config.dir_location
32
+ self.glob_pattern = config.glob_pattern
33
+ self.chunk_size = config.chunk_size
34
+ self.chunk_overlap = config.chunk_overlap
35
+
36
+
37
+ def load(self) -> Iterable[Document]:
38
+ """
39
+ Loads PDF files from the specified directory path and returns an iterable of `Document` instances.
40
+
41
+ Returns:
42
+ Iterable[Document]: An iterable of `Document` instances.
43
+ """
44
+ loader = DirectoryLoader(
45
+ self.dir_path, glob=self.glob_pattern, loader_cls=PyPDFLoader
46
+ ) # Note: If you're using PyPDFLoader then it will split by page for you already
47
+ documents = loader.load()
48
+ self.logger.info(f"Loaded {len(documents)} documents from {self.dir_path}")
49
+ return documents
50
+
+     def split(self, documents: Iterable[Document]) -> List[Document]:
+         """
+         Splits the specified documents into text chunks using a recursive character text splitter.
+         The chunk size and overlap come from the `PyPDFConfig` passed to the constructor.
+
+         Args:
+             documents (Iterable[Document]): The documents to split.
+
+         Returns:
+             List[Document]: A list of chunked documents.
+         """
+         text_splitter = RecursiveCharacterTextSplitter(
+             chunk_size=self.chunk_size, chunk_overlap=self.chunk_overlap
+         )
+         texts = text_splitter.split_documents(documents)
+         self.logger.info(f"Split {len(documents)} documents into {len(texts)} chunks")
+         return texts
69
+
70
+
71
+ def get_default_pypdf_loader(dir_location: str) -> PyPDFLoadStrategy:
72
+ dir_path = dir_location
73
+
74
+ config: PyPDFConfig = PyPDFConfig(
75
+ dir_location=dir_path
76
+ )
77
+ return PyPDFLoadStrategy(config)
modules/vector_stores/retrieval/__init__.py ADDED
File without changes
modules/vector_stores/retrieval/basic_qa.py ADDED
@@ -0,0 +1,35 @@
1
+ from modules.llm.defaults import get_default_cloud_chat_llm
2
+ from langchain.chains import RetrievalQA
3
+ from langchain.chains.retrieval_qa.base import BaseRetrievalQA
4
+ from langchain.vectorstores.base import VectorStore
5
+ from dataclasses import dataclass
6
+
7
+ @dataclass
8
+ class QAgentConfig:
9
+ index: VectorStore
10
+ qa: BaseRetrievalQA
11
+
12
+ class QAgent:
13
+ def __init__(self, index: VectorStore, qa: BaseRetrievalQA):
14
+ self.index = index
15
+ self.llm = get_default_cloud_chat_llm()
16
+ self.qa_chain = qa
17
+
18
+ def ask(self, question: str):
19
+ resp = self.qa_chain(question)
20
+ self.process_llm_response(resp)
21
+ return resp
22
+
23
+ ## Cite sources
24
+ def process_llm_response(self, llm_response):
25
+ print(llm_response['result'])
26
+ print('\n\nSources:')
27
+ for source in llm_response["source_documents"]:
28
+ print(source.metadata['source'])
29
+
30
+
31
+ def get_default_qa(index: VectorStore) -> QAgent:
32
+ llm = get_default_cloud_chat_llm()
33
+ qa_chain = RetrievalQA.from_chain_type(llm,chain_type="stuff",retriever=index.as_retriever(), return_source_documents=True)
34
+ qagent = QAgent(index, qa_chain)
35
+ return qagent
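For readers who want to see the response shape that `process_llm_response` expects without calling an LLM, here is a small sketch. The response dictionary is a hand-built stand-in with placeholder values: the `result` and `source_documents` keys mirror what the code above reads, while the file paths, the answer text, and the helper name `format_llm_response` are hypothetical.

```python
from types import SimpleNamespace


def format_llm_response(llm_response: dict) -> str:
    """Build an answer-plus-sources string from a retrieval QA style response."""
    lines = [llm_response["result"], "", "Sources:"]
    for source in llm_response["source_documents"]:
        lines.append(source.metadata["source"])
    return "\n".join(lines)


# Stand-in for a chain response; each source document only needs a metadata dict here.
fake_response = {
    "result": "placeholder answer",
    "source_documents": [
        SimpleNamespace(metadata={"source": "data/example_a.pdf"}),
        SimpleNamespace(metadata={"source": "data/example_b.pdf"}),
    ],
}

text = format_llm_response(fake_response)
print(text)
```

Returning a string instead of printing inside the helper makes the formatting easy to test and to reuse in a UI.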
modules/vector_stores/vector_stores/__init__.py ADDED
File without changes
modules/vector_stores/vector_stores/chroma_manager.py ADDED
@@ -0,0 +1,84 @@
1
+ from langchain.vectorstores import Chroma
2
+ from langchain.vectorstores.base import VectorStoreRetriever
3
+ from modules.vector_stores.loaders.pypdf_load_strategy import (
4
+ get_default_pypdf_loader,
5
+ )
6
+ from modules.vector_stores.embedding.instructorxl import get_default_instructor_embedding
7
+
8
+ instruct_embed = get_default_instructor_embedding()
9
+ from dataclasses import dataclass
10
+ from langchain.embeddings.base import Embeddings
11
+ from typing import Iterable
12
+ from langchain.schema import Document
13
+ from loguru import logger
14
+
15
+
16
+ @dataclass
17
+ class ChromaConfig:
18
+ documents: Iterable[Document]
19
+ persist_directory: str
20
+ embedding: Embeddings
21
+ persisted: bool = False
22
+
23
+
24
+ class ChromaManager:
25
+ def __init__(self, config: ChromaConfig):
26
+ self.documents = config.documents
27
+ self.persist_directory = config.persist_directory
28
+ self.embedding = config.embedding
29
+ if config.persisted:
30
+ self.db = Chroma(
31
+ persist_directory=config.persist_directory, embedding_function=config.embedding
32
+ )
33
+ else:
34
+ self.db = Chroma.from_documents(
35
+ documents=config.documents,
36
+ embedding=config.embedding,
37
+ persist_directory=config.persist_directory,
38
+ )
39
+
+     def persist(self):
+         logger.info("Persisting Chroma to disk...")
+         self.db.persist()
+         # loguru formats messages with str.format-style braces, not %s
+         logger.info("Chroma saved to {}", self.persist_directory)
+
+     def delete(self):
+         logger.info("Deleting Chroma from disk...")
+         self.db.delete_collection()
+         self.db.persist()
+         logger.info("Chroma deleted from {}", self.persist_directory)
+
+     def fetch_documents(self, query):
+         logger.info("Fetching documents from Chroma...")
+         retriever: VectorStoreRetriever = self.db.as_retriever()
+         documents = retriever.get_relevant_documents(query)
+         logger.info("Fetched {} documents from Chroma", len(documents))
+         return documents
57
+
58
+
+ def get_default_chroma_mgr(persisted=False):
+     """
+     Returns a default ChromaManager instance. The default currently only reads in pdf files from the data directory.
+
+     Args:
+         persisted (bool): If True, open the store already saved in the persist directory instead of rebuilding it from the documents.
+
+     Returns:
+         ChromaManager: A new ChromaManager instance.
+     """
66
+ dir_location = "../data"
67
+ persist_directory = "../db"
68
+ loader = get_default_pypdf_loader(dir_location)
69
+ documents: Iterable[Document] = loader.load()
70
+ embedding = get_default_instructor_embedding()
71
+ if persisted:
72
+ config = ChromaConfig(
73
+ documents=documents,
74
+ persist_directory=persist_directory,
75
+ embedding=embedding,
76
+ persisted=True,
77
+ )
78
+ else:
79
+ config = ChromaConfig(
80
+ documents=documents, persist_directory=persist_directory, embedding=embedding
81
+ )
82
+ chroma_mgr = ChromaManager(config)
83
+ return chroma_mgr
84
+
modules/vector_stores/vector_stores/pinecone_manager.py ADDED
@@ -0,0 +1,56 @@
1
+ import os
2
+ from langchain.vectorstores import Pinecone
3
+ from modules.vector_stores.embedding.openai import get_default_openai_embeddings
4
+ import pinecone
5
+
6
+
7
+ PINECONE_API_KEY = os.environ.get('PINECONE_API_KEY')
8
+ PINECONE_API_ENV = os.environ.get('PINECONE_API_ENV') # You may need to switch with your env
9
+
10
+ class PineconeSessionManager:
11
+ """
12
+ A class for managing Pinecone sessions and indexes.
13
+
14
+ Attributes:
15
+ embeddings (OpenAIEmbeddings): The embeddings object to use for indexing.
16
+ index_name (str): The name of the Pinecone index to use.
17
+ index (pinecone.GRPCIndex): The Pinecone index object.
18
+ docsearch (Pinecone): The Pinecone search object.
19
+ """
20
+ def __init__(self, embeddings, index_name):
21
+ """
22
+ Initializes a new PineconeSessionManager instance.
23
+
24
+ Args:
25
+ embeddings (OpenAIEmbeddings): The embeddings object to use for indexing.
26
+ index_name (str): The name of the Pinecone index to use.
27
+ """
28
+ self.embeddings = embeddings
29
+ self.index_name = index_name
30
+ # initialize pinecone
31
+ pinecone.init(
32
+ api_key=PINECONE_API_KEY, # find at app.pinecone.io
33
+ environment=PINECONE_API_ENV # next to api key in console
34
+ )
35
+
+         if index_name not in pinecone.list_indexes():
+             # we create a new index
+             pinecone.create_index(
+                 name=index_name,
+                 metric='cosine',
+                 dimension=1536  # 1536 dim of text-embedding-ada-002
+             )
43
+ self.index = pinecone.GRPCIndex(index_name)
44
+ self.docsearch: Pinecone = Pinecone.from_existing_index(index_name=index_name, embedding=embeddings)
45
+
46
+ def get_default_pinecone_session(index: str) -> PineconeSessionManager:
47
+ """
48
+ Returns a PineconeSessionManager instance that uses the default OpenAI embeddings and the given index name.
49
+
50
+ Returns:
51
+ PineconeSessionManager: A new PineconeSessionManager instance.
52
+ """
53
+ embeddings = get_default_openai_embeddings()
54
+ index_name = index  # name of the Pinecone index to connect to
55
+ pc_session = PineconeSessionManager(embeddings, index_name)
56
+ return pc_session
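The create-if-missing step in `__init__` needs the embedding dimension before the index exists. One way to obtain it, sketched here under assumptions, is to embed a short probe string and measure the length of the returned vector (LangChain embedding objects expose `embed_query` for this). `InMemoryClient` and `ensure_index` are hypothetical names introduced only so the sketch runs without network access; they are not part of pinecone-client or of this repository.

```python
from typing import Callable, Dict, List


class InMemoryClient:
    """Toy stand-in for the pinecone module (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self.indexes: Dict[str, dict] = {}

    def list_indexes(self) -> List[str]:
        return list(self.indexes)

    def create_index(self, name: str, dimension: int, metric: str) -> None:
        self.indexes[name] = {"dimension": dimension, "metric": metric}


def ensure_index(
    client,
    index_name: str,
    embed_query: Callable[[str], List[float]],
    metric: str = "cosine",
) -> bool:
    """Create index_name if it is missing; return True when a new index was created."""
    if index_name in client.list_indexes():
        return False
    # Derive the dimension from a real embedding instead of hard-coding it.
    dimension = len(embed_query("probe"))
    client.create_index(name=index_name, dimension=dimension, metric=metric)
    return True
```

With a real session, `embed_query` would be the bound method of the embeddings object passed to `PineconeSessionManager`. The probe costs one embedding call, which is the trade-off against a hard-coded dimension.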
pyproject.toml ADDED
@@ -0,0 +1,12 @@
1
+ [build-system]
2
+ requires = ["setuptools>=61.0","setuptools-scm","pytest", "wheel"]
3
+ build-backend = "setuptools.build_meta"
4
+
5
+ [project]
6
+ name = "AIAgents"
7
+ version = "0.1"
8
+ description = "Fearnworks AI Agents"
9
+
10
+ [tool.setuptools]
11
+ py-modules = ["modules"]
12
+
requirements.txt CHANGED
@@ -1,8 +1,24 @@
1
- langchain
2
- streamlit
3
- gradio
 
 
4
  python-dotenv
5
- openai
6
- wikipedia
7
  ipykernel
8
- loguru
1
+ ###### General
2
+ pandas
3
+ torch
4
+ ###### Util
5
+ loguru
6
  python-dotenv
 
 
7
  ipykernel
8
+ ###### Langchain
9
+ langchain
10
+ unstructured
11
+ pypdf
12
+ wikipedia
13
+ ###### Open AI Libs
14
+ tiktoken
15
+ openai
16
+ ###### Vectorstores
17
+ pinecone-client[grpc] # if using the Pinecone vectorstore
18
+ chromadb
19
+ ###### UIs
20
+ # streamlit
21
+ gradio
22
+ ###### Embedding
23
+ sentence-transformers
24
+ InstructorEmbedding
sandbox.ipynb CHANGED
@@ -2,56 +2,68 @@
2
  "cells": [
3
  {
4
  "cell_type": "code",
5
- "execution_count": 1,
6
  "metadata": {},
7
- "outputs": [
8
- {
9
- "name": "stderr",
10
- "output_type": "stream",
11
- "text": [
12
- "/home/jphillips/ai_agents/.venv/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
13
- " from .autonotebook import tqdm as notebook_tqdm\n"
14
- ]
15
- },
16
- {
17
- "name": "stdout",
18
- "output_type": "stream",
19
- "text": [
20
- "Running on local URL: http://127.0.0.1:7860\n",
21
- "\n",
22
- "To create a public link, set `share=True` in `launch()`.\n"
23
- ]
24
- },
25
- {
26
- "data": {
27
- "text/html": [
28
- "<div><iframe src=\"http://127.0.0.1:7860/\" width=\"100%\" height=\"500\" allow=\"autoplay; camera; microphone; clipboard-read; clipboard-write;\" frameborder=\"0\" allowfullscreen></iframe></div>"
29
- ],
30
- "text/plain": [
31
- "<IPython.core.display.HTML object>"
32
- ]
33
- },
34
- "metadata": {},
35
- "output_type": "display_data"
36
- },
37
- {
38
- "data": {
39
- "text/plain": []
40
- },
41
- "execution_count": 1,
42
- "metadata": {},
43
- "output_type": "execute_result"
44
- }
45
- ],
46
  "source": [
47
- "import gradio as gr\n",
 
 
 
 
 
 
48
  "\n",
49
- "def greet(name):\n",
50
- " return \"Hello \" + name + \"!\"\n",
 
 
51
  "\n",
52
- "demo = gr.Interface(fn=greet, inputs=\"text\", outputs=\"text\")\n",
53
  "\n",
54
- "demo.launch() "
55
  ]
56
  }
57
  ],
 
2
  "cells": [
3
  {
4
  "cell_type": "code",
5
+ "execution_count": null,
6
  "metadata": {},
7
+ "outputs": [],
8
  "source": [
9
+ "import os\n",
10
+ "from langchain.vectorstores import Chroma, Pinecone\n",
11
+ "from langchain.embeddings.openai import OpenAIEmbeddings\n",
12
+ "import pinecone \n",
13
+ "import os\n",
14
+ "from dotenv import load_dotenv, find_dotenv\n",
15
+ "load_dotenv(find_dotenv())\n",
16
  "\n",
17
+ "pinecone.init(\n",
18
+ " api_key=os.environ[\"PINECONE_API_KEY\"], \n",
19
+ " environment=os.environ[\"PINECONE_API_ENV\"]\n",
20
+ ")\n",
21
  "\n",
22
+ "OPENAI_API_KEY = os.environ.get('OPENAI_API_KEY')\n",
23
  "\n",
24
+ "PINECONE_API_KEY = os.environ.get('PINECONE_API_KEY')\n",
25
+ "PINECONE_API_ENV = os.environ.get('PINECONE_API_ENV') # Set this to your own Pinecone environment\n",
26
+ "PINECONE_INDEX_NAME= os.environ.get('PINECONE_INDEX_NAME')\n",
27
+ "\n",
28
+ "def get_default_index() -> pinecone.GRPCIndex:\n",
29
+ " embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)\n",
30
+ " # initialize pinecone\n",
31
+ " pinecone.init(\n",
32
+ " api_key=PINECONE_API_KEY, # find at app.pinecone.io\n",
33
+ " environment=PINECONE_API_ENV # next to api key in console\n",
34
+ " )\n",
35
+ " index_name = PINECONE_INDEX_NAME # read from the PINECONE_INDEX_NAME environment variable\n",
36
+ " print(PINECONE_API_ENV)\n",
37
+ " print(\"PINECONE_API_KEY is set:\", PINECONE_API_KEY is not None) # avoid printing the secret itself\n",
38
+ " if index_name not in pinecone.list_indexes():\n",
39
+ " # we create a new index\n",
40
+ " pinecone.create_index(\n",
41
+ " name=index_name,\n",
42
+ " metric='cosine',\n",
43
+ " dimension=1536 # assumes text-embedding-ada-002, which produces 1536-dimensional vectors\n",
44
+ " )\n",
45
+ " index = pinecone.GRPCIndex(index_name)\n",
46
+ " return index\n",
47
+ "\n",
48
+ "index = get_default_index()\n",
49
+ "index.describe_index_stats()"
50
+ ]
51
+ },
52
+ {
53
+ "cell_type": "code",
54
+ "execution_count": null,
55
+ "metadata": {},
56
+ "outputs": [],
57
+ "source": [
58
+ "from modules.vector_stores.vector_stores.pinecone_manager import get_default_pinecone_session, PineconeSessionManager\n",
59
+ "import os\n",
60
+ "from langchain.vectorstores import Pinecone\n",
61
+ "PINECONE_INDEX_NAME= os.environ.get('PINECONE_INDEX_NAME')\n",
62
+ "print(PINECONE_INDEX_NAME)\n",
63
+ "### This is the default pinecone session manager\n",
64
+ "pinecone_session: PineconeSessionManager = get_default_pinecone_session(PINECONE_INDEX_NAME)\n",
65
+ "### This is the index for the default pinecone session manager\n",
66
+ "vector_index: Pinecone = pinecone_session.docsearch"
67
  ]
68
  }
69
  ],
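Both notebook cells read their Pinecone settings from environment variables, some through `os.environ[...]` and some through `os.environ.get(...)`, so a missing variable can surface either as a `KeyError` or as a silent `None` passed to the client. A small sketch of a helper that fails early with one clear message; `require_env` is a hypothetical name, not something defined in this repository.

```python
import os
from typing import Dict, Iterable, Mapping, Optional


def require_env(
    names: Iterable[str], env: Optional[Mapping[str, str]] = None
) -> Dict[str, str]:
    """Return the requested variables, raising if any is unset or empty."""
    source = os.environ if env is None else env
    wanted = list(names)  # materialize once so generators are also accepted
    missing = [name for name in wanted if not source.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: source[name] for name in wanted}
```

In the notebook this could be called once after `load_dotenv(...)`, for example with the variable names already used there (`PINECONE_API_KEY`, `PINECONE_API_ENV`, `PINECONE_INDEX_NAME`), before any call to `pinecone.init`.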
setup.py ADDED
@@ -0,0 +1,9 @@
1
+ import setuptools
2
+
3
+ setuptools.setup(
4
+ name="AIAgents",
5
+ version="0.1",
6
+ description="Fearnworks AI Agents",
7
+ packages=setuptools.find_packages() + ["modules"],
8
+ python_requires=">=3.10",
9
+ )