jphillips committed on
Commit
566d4e4
1 Parent(s): 2fbdd0c

Vector storage (#1)


* clean logging

* Add VectorStores and Embeddings

* Clear outputs

* Add qa chain

.gitignore CHANGED
@@ -1,3 +1,7 @@
 
 
 
 
1
  # Byte-compiled / optimized / DLL files
2
  __pycache__/
3
  *.py[cod]
 
1
+ ### Project specific
2
+ db/*
3
+ data/*
4
+ flagged
5
  # Byte-compiled / optimized / DLL files
6
  __pycache__/
7
  *.py[cod]
docs/similarity_search.md ADDED
@@ -0,0 +1,82 @@
1
+ # Notes on Similarity Search
2
+ Similarity search, also known as similarity measurement, is a key concept in many domains such as data mining, information retrieval, and machine learning. It quantifies the likeness or sameness between two data entities. Here, we explore three widely used methods for similarity search: Jaccard Similarity, W-Shingling, and Levenshtein Distance.
3
+
4
+ ## Jaccard Similarity
5
+ Jaccard similarity is a measure of how similar two sets are. It is defined as the size of the intersection divided by the size of the union of the two sets, so it always falls between 0 and 1. It is a useful metric for comparing sets because it is normalized by the size of the union rather than by the absolute sizes of the sets, and it is symmetric: the Jaccard similarity of A and B is the same as the Jaccard similarity of B and A.
6
+
7
+ Jaccard similarity is commonly used in information retrieval applications like document clustering and collaborative filtering. It is also used in machine learning applications like k-means clustering and k-nearest neighbors.
8
+ ### Implementation:
9
+ ```python
10
+ def jaccard(x: str, y: str):
+     """Jaccard similarity of two strings, treated as sets of words."""
+     x = set(x.split())
+     y = set(y.split())
+     shared = x.intersection(y)
+     union = x.union(y)
+     return len(shared) / len(union)
17
+ ```
18
+ ### Pros:
19
+ - It's simple to understand and implement.
20
+ - It's good for comparing sets of data, such as lists or documents.
21
+ - It's binary, meaning it only cares if items exist, not how many times they exist.
22
+ ### Cons:
23
+ - It can be sensitive to the size of the data. If the data sets are large but the intersection is small, the similarity can be perceived as low.
24
+ - It does not take into account the frequency of the items.
25
+ ### Example:
26
+ You have two sets of data, A = {1, 2, 3, 4} and B = {3, 4, 5, 6}. The intersection of A and B is {3, 4}, and the union of A and B is {1, 2, 3, 4, 5, 6}. So, the Jaccard similarity is 2 (size of intersection) divided by 6 (size of union), which is approximately 0.33.
27
+
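+ As a quick check of that arithmetic, a minimal sketch in plain Python (the sets A and B are the ones from the example above):
+ ```python
+ A = {1, 2, 3, 4}
+ B = {3, 4, 5, 6}
+ print(len(A & B) / len(A | B))  # 2 / 6 ≈ 0.33
+ ```
+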
28
+ ## W-Shingling
29
+ W-shingling is a preprocessing method for strings or documents. It breaks the data into overlapping groups of W items. For example, if W = 2, the string "I love to play football" would be broken into the set {"I love", "love to", "to play", "play football"}. W-shingling is useful for comparing documents or strings because it can detect similarities even when the documents are not exactly the same. For example, if you have two documents that are identical except for one word, W-shingling will still detect the overlap between them.
30
+
31
+ ### Implementation:
32
+ ```python
33
+ def w_shingling(a: str, w: int = 2):
+     words = a.split()
+     # join each group of w consecutive words into a single shingle string
+     return set(" ".join(words[i:i + w]) for i in range(len(words) - w + 1))
36
+ ```
37
+
38
+ ### Pros:
39
+ - It's useful for comparing documents or strings.
40
+ - It's able to detect similarities in different parts of the data, not just exact matches.
41
+ - It's robust to small changes or errors in the data.
42
+
43
+ ### Cons:
44
+ - The choice of the length of the shingles (W) can greatly affect the result. Too small, and it might not capture meaningful similarities. Too large, and it might miss important differences.
45
+ - It can be computationally intensive, especially for large documents or strings.
46
+
47
+ ### Example:
48
+ You have two sentences, "I love to play football" and "I like to play football". Taking 2-shingles (two-word groups) gives the sets {"I love", "love to", "to play", "play football"} and {"I like", "like to", "to play", "play football"}. The intersection is {"to play", "play football"} (2 shingles) and the union contains 6 unique shingles, so the Jaccard similarity of the 2-shingles is 2/6 ≈ 0.33.
49
+
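+ The same calculation in code, reusing the `w_shingling` helper from the implementation above:
+ ```python
+ a = "I love to play football"
+ b = "I like to play football"
+ shingles_a = w_shingling(a)  # {'I love', 'love to', 'to play', 'play football'}
+ shingles_b = w_shingling(b)
+ # Jaccard similarity over the shingle sets
+ print(len(shingles_a & shingles_b) / len(shingles_a | shingles_b))  # 2 / 6 ≈ 0.33
+ ```
+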
50
+ ## Levenshtein Distance
51
+ Let's consider you have two words, say 'cat' and 'bat'. You want to find out how similar these two words are. One way to do this is to see how many letters you need to change in 'cat' to make it 'bat'. In this case, you only need to change the 'c' in 'cat' to a 'b' to make it 'bat'. So, the Levenshtein distance between 'cat' and 'bat' is 1. This method is used to find out how similar two pieces of data are by measuring the minimum number of changes needed to turn one piece of data into the other.
52
+ ### Implementation:
53
+ ```python
54
+ import numpy as np
+
+ def levenshtein_distance(a: str, b: str):
+     # lev[i, j] = distance between the first i characters of a and the first j characters of b
+     lev = np.zeros((len(a) + 1, len(b) + 1), dtype=int)
+     for i in range(len(a) + 1):
+         for j in range(len(b) + 1):
+             if min(i, j) == 0:
+                 # one prefix is empty: the distance is the length of the other prefix
+                 lev[i, j] = max(i, j)
+             else:
+                 # calculate three possible operations
+                 x = lev[i - 1, j]      # deletion
+                 y = lev[i, j - 1]      # insertion
+                 z = lev[i - 1, j - 1]  # substitution
+                 # take the minimum of the three
+                 lev[i, j] = min(x, y, z)
+                 if a[i - 1] != b[j - 1]:
+                     # add one if the two characters are different
+                     lev[i, j] += 1
+     return lev, lev[-1, -1]
71
+ ```
72
+
73
+ ### Pros:
74
+ - It's useful for comparing strings or sequences.
75
+ - It's able to quantify the difference between two pieces of data.
76
+ - It's useful in applications like spell checking, where you want to find the smallest number of edits to turn one word into another.
77
+ ### Cons:
78
+ - It can be computationally expensive for long strings.
79
+ - It does not handle transpositions well: two adjacent characters that are swapped are counted as two operations instead of one.
80
+ ### Example:
81
+ The words "kitten" and "sitting" have a Levenshtein distance of 3 because three operations are needed to turn "kitten" into "sitting": replace 'k' with 's', replace 'e' with 'i', and append 'g'.
82
+
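+ Using the implementation above, the same example can be checked directly:
+ ```python
+ matrix, distance = levenshtein_distance("kitten", "sitting")
+ print(distance)  # 3
+ ```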
docs/vector_similarity_search.md ADDED
@@ -0,0 +1,108 @@
1
+ # Vector-Based Similarity Search
2
+ Vector-based similarity search, also known as vector space modeling, is a collection of techniques used in information retrieval and natural language processing. In these models, texts are represented as vectors in a multi-dimensional space, where each dimension corresponds to a separate term or concept. Similarity between texts can then be computed by comparing the vectors. Here, we explore three widely used methods for vector-based similarity search: TF-IDF, BM25, and SBERT.
3
+
4
+ ## TF-IDF (Term Frequency-Inverse Document Frequency)
5
+ TF-IDF is a statistical measure used to evaluate the importance of a word in a document, relative to a corpus. The importance of a word increases proportionally to the number of times it appears in the document, but is offset by the frequency of the word in the corpus.
6
+
7
+ TF-IDF is commonly used in information retrieval and text mining, where it is used to rank documents by relevance in response to a query.
8
+
9
+ ### Implementation:
10
+
11
+ ```python
12
+ import numpy as np
+
+ # a, b, c are example sentences (plain strings) that make up the corpus
+ docs: list[str] = [a, b, c]
+ vocab = set((a + " " + b + " " + c).split())
+
+ def tf_idf(word: str, sentence: str):
+     words = sentence.split()
+     term_frequency = words.count(word) / len(words)
+     inverse_document_frequency = np.log10(len(docs) / sum(1 for doc in docs if word in doc))
+     return round(term_frequency * inverse_document_frequency, 4)
+
+ def vector_tf_idf(a: str, b: str, vocab: set[str]):
+     vec_a = []
+     vec_b = []
+     for word in vocab:
+         vec_a.append(tf_idf(word, a))
+         vec_b.append(tf_idf(word, b))
+     return vec_a, vec_b
29
+ ```
30
+
31
+ ### Pros:
32
+ - It's simple to understand and implement.
33
+ - It's good for comparing documents in a corpus.
34
+ - It takes into account not only the frequency of a term in a single document (TF), but also the distribution of the term in the entire document set (IDF).
35
+
36
+ ### Cons:
37
+ - It assumes that the terms are independent, which is often not the case in natural language.
38
+ ### Example:
39
+ Suppose we have a document set consisting of five documents. The term "the" appears often in all documents, while the term "zebra" appears many times in one document, but not in others. TF-IDF will assign a higher weight to "zebra" because it is more important for distinguishing documents in the set.
40
+
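+ To actually compare two documents, the TF-IDF vectors produced by `vector_tf_idf` can be scored with cosine similarity. A minimal sketch, assuming `a`, `b`, `vocab`, and `vector_tf_idf` from the implementation above:
+ ```python
+ import numpy as np
+
+ def cosine(u, v):
+     u, v = np.array(u), np.array(v)
+     denom = np.linalg.norm(u) * np.linalg.norm(v)
+     # guard against all-zero vectors (no overlapping vocabulary weight)
+     return float(u @ v / denom) if denom else 0.0
+
+ vec_a, vec_b = vector_tf_idf(a, b, vocab)
+ print(cosine(vec_a, vec_b))
+ ```
+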
41
+ ## BM25 (Best Matching 25)
42
+ BM25 is a ranking function used by search engines to rank matching documents according to their relevance to a given search query. It can be viewed as an enhanced version of TF-IDF: a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document, while normalizing term frequency by taking the current and average document length into account.
43
+
44
+ ### Implementation:
45
+ ```python
46
+ import numpy as np
+
+ # a..f are example sentences (plain strings) that make up the corpus
+ docs = [a, b, c, d, e, f]
+
+ avg_doc_length = sum(len(doc) for doc in docs) / len(docs)
+ N = len(docs)
+
+ def bm25(word: str, sentence: str, k: float = 1.2, b: float = 0.75):
+     freq = sentence.count(word)
+     term_freq = freq * (k + 1) / (freq + k * (1 - b + b * len(sentence) / avg_doc_length))
+     N_q = sum(1 for doc in docs if word in doc)  # number of documents containing the term
+     inverse_document_frequency = np.log(((N - N_q + 0.5) / (N_q + 0.5)) + 1)
+     return round(term_freq * inverse_document_frequency, 4)
58
+
59
+ ```
60
+
61
+ ### Pros:
62
+ - It's effective for ranking documents in response to a user query.
63
+ - It takes into account term frequency and document length.
64
+ ### Cons:
65
+ - Like TF-IDF, it assumes that the terms are independent.
66
+ ### Example:
67
+ Consider a document set consisting of five documents. If a user's query is "zebra", the BM25 score for each document will be calculated based on the occurrence of "zebra" and the length of the document. Documents with a higher frequency of "zebra" and shorter lengths will get higher scores.
68
+
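+ A sketch of how the `bm25` helper above could rank the example corpus for a one-word query (the documents `a`..`f` are placeholder sentence strings, as in the implementation):
+ ```python
+ query = "zebra"
+ scores = {i: bm25(query, doc) for i, doc in enumerate(docs)}
+ # highest-scoring (most relevant) documents first
+ ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
+ print(ranked)
+ ```
+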
69
+ ## SBERT (Sentence-BERT)
70
+ SBERT is a modification of the pre-trained BERT network that is specifically designed for sentence embeddings. It uses siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine-similarity.
71
+
72
+ SBERT utilizes dense vector representations of sentences that are trained on large datasets. It can be used for a wide range of language understanding tasks, including sentence similarity, semantic search, and clustering. This allows for more semantic similarity detection than TF-IDF or BM25.
73
+
74
+ ### Implementation:
75
+ ```python
76
+ from sentence_transformers import SentenceTransformer
+ from sklearn.metrics.pairwise import cosine_similarity
+ import numpy as np
+ import matplotlib.pyplot as plt
+ import seaborn as sns
+
+ # a..f are example sentences (plain strings) that make up the corpus
+ docs = [a, b, c, d, e, f]
+
+ def compute_sbert(docs: list[str]) -> np.ndarray:
+     model = SentenceTransformer('bert-base-nli-mean-tokens')
+     sentence_embeddings = model.encode(docs)
+     return sentence_embeddings
+
+ def score_sbert(sentence_embeddings: np.ndarray) -> np.ndarray:
+     scores = np.zeros((sentence_embeddings.shape[0], sentence_embeddings.shape[0]))
+     for i in range(sentence_embeddings.shape[0]):
+         scores[i, :] = cosine_similarity([sentence_embeddings[i]], sentence_embeddings)[0]
+     return scores
+
+ def plot_scores(scores):
+     plt.figure(figsize=(10, 9))
+     labels = ['a', 'b', 'c', 'd', 'e', 'f']
+     sns.heatmap(scores, xticklabels=labels, yticklabels=labels, annot=True)
101
+ ```
102
+ ### Pros:
103
+ - It's effective for comparing sentence-level semantic similarity.
104
+ - It can handle a wide range of language understanding tasks.
105
+ ### Cons:
106
+ - It requires significant computational resources and time to train.
107
+ ### Example:
108
+ Suppose we have three sentences: "I have a dog", "I have a pet", and "The car is red". If we compute the SBERT embeddings for these sentences and then calculate the cosine similarity between the embeddings, we'll find that the first two sentences ("I have a dog" and "I have a pet") are more similar to each other than either is to the third sentence ("The car is red"). This is because SBERT is able to capture the semantic similarity between "dog" and "pet".
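+
+ A minimal sketch of that example using the helpers above (the model is downloaded on first use):
+ ```python
+ sentences = ["I have a dog", "I have a pet", "The car is red"]
+ embeddings = compute_sbert(sentences)
+ scores = score_sbert(embeddings)
+ print(scores[0, 1], scores[0, 2])  # the dog/pet pair should score higher than dog/car
+ ```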
modules/chroma_sandbox.ipynb ADDED
@@ -0,0 +1,185 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "code",
5
+ "execution_count": null,
6
+ "metadata": {},
7
+ "outputs": [],
8
+ "source": [
9
+ "from langchain.vectorstores import FAISS\n",
10
+ "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
11
+ "from langchain import OpenAI\n",
12
+ "from langchain.chains import RetrievalQA\n",
13
+ "from langchain.document_loaders import DirectoryLoader\n",
14
+ "import magic\n",
15
+ "import os\n",
16
+ "import nltk\n",
17
+ "\n",
18
+ "openai_api_key = os.getenv(\"OPENAI_API_KEY\")\n",
19
+ "data_location= os.getenv(\"VECTOR_DATA_DIR\")"
20
+ ]
21
+ },
22
+ {
23
+ "cell_type": "markdown",
24
+ "metadata": {},
25
+ "source": [
26
+ "## Chroma"
27
+ ]
28
+ },
29
+ {
30
+ "cell_type": "code",
31
+ "execution_count": null,
32
+ "metadata": {},
33
+ "outputs": [],
34
+ "source": [
35
+ "from modules.vector_stores.vector_stores.chroma_manager import get_default_chroma_mgr\n",
36
+ "\n",
37
+ "chroma_mgr = get_default_chroma_mgr(persisted=True)"
38
+ ]
39
+ },
40
+ {
41
+ "cell_type": "code",
42
+ "execution_count": null,
43
+ "metadata": {},
44
+ "outputs": [],
45
+ "source": [
46
+ "chroma_mgr.persist()"
47
+ ]
48
+ },
49
+ {
50
+ "cell_type": "code",
51
+ "execution_count": null,
52
+ "metadata": {},
53
+ "outputs": [],
54
+ "source": [
55
+ "from modules.vector_stores.retrieval.basic_qa import get_default_qa\n",
56
+ "\n",
57
+ "qa = get_default_qa(chroma_mgr.db)\n"
58
+ ]
59
+ },
60
+ {
61
+ "cell_type": "code",
62
+ "execution_count": null,
63
+ "metadata": {},
64
+ "outputs": [],
65
+ "source": [
66
+ "## Cite sources\n",
67
+ "def process_llm_response(llm_response):\n",
68
+ " print(llm_response['result'])\n",
69
+ " print('\\n\\nSources:')\n",
70
+ " for source in llm_response[\"source_documents\"]:\n",
71
+ " print(source.metadata['source'])"
72
+ ]
73
+ },
74
+ {
75
+ "cell_type": "code",
76
+ "execution_count": null,
77
+ "metadata": {},
78
+ "outputs": [],
79
+ "source": [
80
+ "# full example\n",
81
+ "query = \"What is a date table?\"\n",
82
+ "resp = qa.ask(query)"
83
+ ]
84
+ },
85
+ {
86
+ "cell_type": "markdown",
87
+ "metadata": {},
88
+ "source": [
89
+ "## FAISS"
90
+ ]
91
+ },
92
+ {
93
+ "cell_type": "code",
94
+ "execution_count": null,
95
+ "metadata": {},
96
+ "outputs": [],
97
+ "source": [
98
+ "from modules.vector_stores.loaders.pypdf_load_strategy import PyPDFLoadStrategy, PyPDFConfig, get_default_pypdf_loader\n",
99
+ "from modules.vector_stores.embedding.openai import OpenAIEmbeddings, OpenAIEmbedConfig, get_default_openai_embeddings\n",
100
+ "def get_example_pdf_embedding():\n",
101
+ " dir_location = \"../data\"\n",
102
+ " loader = get_default_pypdf_loader(dir_location)\n",
103
+ " documents = loader.load()\n",
104
+ " embeddings = get_default_openai_embeddings()\n",
105
+ " index = FAISS.from_documents(documents, embeddings)\n",
106
+ " return index\n",
107
+ "index = get_example_pdf_embedding()\n",
108
+ "llm = OpenAI(openai_api_key=openai_api_key)\n",
109
+ "qa = RetrievalQA.from_chain_type(llm=llm, chain_type=\"stuff\", retriever=index.as_retriever())\n",
110
+ "qa = RetrievalQA.from_chain_type(llm=llm,\n",
111
+ " chain_type=\"stuff\",\n",
112
+ " retriever=index.as_retriever(),\n",
113
+ " return_source_documents=True)\n",
114
+ "query = \"What is a date table?\"\n",
115
+ "result = qa({\"query\": query})"
116
+ ]
117
+ },
118
+ {
119
+ "cell_type": "code",
120
+ "execution_count": null,
121
+ "metadata": {},
122
+ "outputs": [],
123
+ "source": [
124
+ "result"
125
+ ]
126
+ },
127
+ {
128
+ "cell_type": "code",
129
+ "execution_count": null,
130
+ "metadata": {},
131
+ "outputs": [],
132
+ "source": [
133
+ "\n",
134
+ "docsearch = FAISS.from_documents(documents, embeddings)\n",
135
+ "llm = OpenAI(openai_api_key=openai_api_key)\n",
136
+ "qa = RetrievalQA.from_chain_type(llm=llm, chain_type=\"stuff\", retriever=docsearch.as_retriever())\n"
137
+ ]
138
+ },
139
+ {
140
+ "cell_type": "code",
141
+ "execution_count": null,
142
+ "metadata": {},
143
+ "outputs": [],
144
+ "source": [
145
+ "qa = RetrievalQA.from_chain_type(llm=llm,\n",
146
+ " chain_type=\"stuff\",\n",
147
+ " retriever=docsearch.as_retriever(),\n",
148
+ " return_source_documents=True)\n",
149
+ "query = \"What is a date table?\"\n",
150
+ "result = qa({\"query\": query})"
151
+ ]
152
+ },
153
+ {
154
+ "cell_type": "code",
155
+ "execution_count": null,
156
+ "metadata": {},
157
+ "outputs": [],
158
+ "source": [
159
+ "result\n"
160
+ ]
161
+ }
162
+ ],
163
+ "metadata": {
164
+ "kernelspec": {
165
+ "display_name": ".venv",
166
+ "language": "python",
167
+ "name": "python3"
168
+ },
169
+ "language_info": {
170
+ "codemirror_mode": {
171
+ "name": "ipython",
172
+ "version": 3
173
+ },
174
+ "file_extension": ".py",
175
+ "mimetype": "text/x-python",
176
+ "name": "python",
177
+ "nbconvert_exporter": "python",
178
+ "pygments_lexer": "ipython3",
179
+ "version": "3.10.6"
180
+ },
181
+ "orig_nbformat": 4
182
+ },
183
+ "nbformat": 4,
184
+ "nbformat_minor": 2
185
+ }
modules/knowledge_retrieval/destination_chain.py CHANGED
@@ -36,8 +36,7 @@ class DestinationChainStrategy(DestinationChain):
36
  def __init__(self, config: LLMChainConfig, display: Callable, knowledge_domain: KnowledgeDomain, usage: str):
37
  settings = UserSettings.get_instance()
38
  api_key = settings.get_api_key()
39
- print("Api key")
40
- print(api_key)
41
  super().__init__(api_key=api_key, knowledge_domain=knowledge_domain, llm=config.llm_class, display=display, usage=usage)
42
 
43
  self.llm = config.llm_class(temperature=config.temperature, max_tokens=config.max_tokens)
 
36
  def __init__(self, config: LLMChainConfig, display: Callable, knowledge_domain: KnowledgeDomain, usage: str):
37
  settings = UserSettings.get_instance()
38
  api_key = settings.get_api_key()
39
+
 
40
  super().__init__(api_key=api_key, knowledge_domain=knowledge_domain, llm=config.llm_class, display=display, usage=usage)
41
 
42
  self.llm = config.llm_class(temperature=config.temperature, max_tokens=config.max_tokens)
modules/llm/__init__.py ADDED
File without changes
modules/llm/defaults.py ADDED
@@ -0,0 +1,32 @@
1
+ import os
2
+ from langchain import OpenAI
3
+ from langchain.chat_models import ChatOpenAI
4
+ OPENAI_API_KEY = os.environ.get('OPENAI_API_KEY')
5
+
6
+
7
+
8
+ def get_default_cloud_chat_llm():
9
+ """
10
+ Returns a default LLM instance with the OpenAI API key set in the environment.
11
+
12
+ Returns:
13
+ ChatOpenAI: A new ChatOpenAI instance.
14
+ """
15
+ llm = ChatOpenAI(model="gpt-3.5-turbo", openai_api_key=OPENAI_API_KEY, temperature=0)
16
+ return llm
17
+
18
+ def get_default_cloud_completion_llm():
19
+ """
20
+ Returns a default LLM instance with the OpenAI API key set in the environment.
21
+
22
+ Returns:
23
+ OpenAI: A new OpenAI instance.
24
+ """
25
+ llm = OpenAI(openai_api_key=OPENAI_API_KEY)
26
+ return llm
27
+
28
+ def get_default_local_llm():
29
+ """
30
+ Coming soon!
31
+ """
32
+ pass
modules/reasoning/chain_of_thought.py CHANGED
@@ -1,5 +1,4 @@
1
  from langchain import PromptTemplate, LLMChain
2
- import streamlit as st
3
  from .reasoning_strategy import ReasoningStrategy, ReasoningConfig
4
  from typing import Callable
5
  import pprint
 
1
  from langchain import PromptTemplate, LLMChain
 
2
  from .reasoning_strategy import ReasoningStrategy, ReasoningConfig
3
  from typing import Callable
4
  import pprint
modules/reasoning/reasoning_strategy.py CHANGED
@@ -2,8 +2,6 @@ from langchain.llms import OpenAI
2
  from pydantic import BaseModel
3
  from langchain.llms.base import BaseLLM
4
  from typing import Type, Callable
5
- import streamlit as st
6
- import os
7
 
8
 
9
 
 
2
  from pydantic import BaseModel
3
  from langchain.llms.base import BaseLLM
4
  from typing import Type, Callable
 
 
5
 
6
 
7
 
modules/vector_stores/__init__.py ADDED
File without changes
modules/vector_stores/embedding/__init__.py ADDED
File without changes
modules/vector_stores/embedding/instructorxl.py ADDED
@@ -0,0 +1,7 @@
1
+ from langchain.embeddings import HuggingFaceInstructEmbeddings
2
+
3
+
4
+ def get_default_instructor_embedding():
5
+ instructor_embeddings = HuggingFaceInstructEmbeddings(model_name="hkunlp/instructor-xl",
6
+ model_kwargs={"device": "cuda"})
7
+ return instructor_embeddings
modules/vector_stores/embedding/openai.py ADDED
@@ -0,0 +1,18 @@
1
+ from langchain.embeddings.openai import OpenAIEmbeddings
2
+ from dataclasses import dataclass
3
+ import os
4
+
5
+ @dataclass
6
+ class OpenAIEmbedConfig:
7
+ openai_api_key: str
8
+
9
+ def get_default_openai_embeddings() -> OpenAIEmbeddings:
10
+ """
11
+ Returns a default OpenAIEmbeddings instance with a default API key.
12
+
13
+ Returns:
14
+ OpenAIEmbeddings: A new OpenAIEmbeddings instance.
15
+ """
16
+ openai_api_key = os.environ.get('OPENAI_API_KEY')
17
+ embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)
18
+ return embeddings
modules/vector_stores/embedding_bases.py ADDED
@@ -0,0 +1,9 @@
1
+ from abc import ABC, abstractmethod
2
+ class DocumentLoadStrategy(ABC):
3
+ @abstractmethod
4
+ def load(self):
5
+ pass
6
+
7
+ @abstractmethod
8
+ def split(self, documents, chunk_size, chunk_overlap):
9
+ pass
modules/vector_stores/loaders/__init__.py ADDED
@@ -0,0 +1 @@
1
+ from .pypdf_load_strategy import PyPDFLoadStrategy, PyPDFConfig, PyPDFLoader
modules/vector_stores/loaders/pypdf_load_strategy.py ADDED
@@ -0,0 +1,77 @@
1
+ from typing import List, Iterable
2
+ from langchain.document_loaders import PyPDFLoader, DirectoryLoader
3
+ from loguru import logger
4
+ from langchain.text_splitter import RecursiveCharacterTextSplitter
5
+ from modules.vector_stores.embedding_bases import DocumentLoadStrategy
6
+ from langchain.schema import Document
7
+ from dataclasses import dataclass
8
+
9
+ @dataclass
10
+ class PyPDFConfig:
11
+ dir_location: str
12
+ glob_pattern: str = "./*.pdf"
13
+ chunk_size: int = 1000
14
+ chunk_overlap: int = 200
15
+
16
+ class PyPDFLoadStrategy(DocumentLoadStrategy):
17
+ def __init__(self, config: PyPDFConfig):
18
+ """
19
+ A document load strategy that loads PDF files using PyPDF.
20
+
21
+ Args:
22
+ dir_path (str): The directory path to load PDF files from.
23
+ glob_pattern (str): The glob pattern to match PDF files.
24
+
25
+ Attributes:
26
+ logger (logging.Logger): The logger instance for this class.
27
+ dir_path (str): The directory path to load PDF files from.
28
+ glob_pattern (str): The glob pattern to match PDF files.
29
+ """
30
+ self.logger = logger
31
+ self.dir_path = config.dir_location
32
+ self.glob_pattern = config.glob_pattern
33
+ self.chunk_size = config.chunk_size
34
+ self.chunk_overlap = config.chunk_overlap
35
+
36
+
37
+ def load(self) -> Iterable[Document]:
38
+ """
39
+ Loads PDF files from the specified directory path and returns an iterable of `Document` instances.
40
+
41
+ Returns:
42
+ Iterable[Document]: An iterable of `Document` instances.
43
+ """
44
+ loader = DirectoryLoader(
45
+ self.dir_path, glob=self.glob_pattern, loader_cls=PyPDFLoader
46
+ ) # Note: If you're using PyPDFLoader then it will split by page for you already
47
+ documents = loader.load()
48
+ self.logger.info(f"Loaded {len(documents)} documents from {self.dir_path}")
49
+ return documents
50
+
51
+ def split(self, documents: Iterable[Document]):
52
+ """
53
+ Splits the specified list of PyPDFLoader instances into text chunks using a recursive character text splitter.
54
+
55
+ Args:
56
+ documents (Iterable[Document]): The documents to split.
57
+ chunk_size (int): The size of each text chunk.
58
+ chunk_overlap (int): The overlap between adjacent text chunks.
59
+
60
+ Returns:
61
+ List[str]: A list of text chunks.
62
+ """
63
+ text_splitter = RecursiveCharacterTextSplitter(
64
+ chunk_size=self.chunk_size, chunk_overlap=self.chunk_overlap
65
+ )
66
+ texts = text_splitter.split_documents(documents)
67
+ self.logger.info(f"Split {len(documents)} documents into {len(texts)}")
68
+ return texts
69
+
70
+
71
+ def get_default_pypdf_loader(dir_location: str) -> PyPDFLoadStrategy:
72
+ dir_path = dir_location
73
+
74
+ config: PyPDFConfig = PyPDFConfig(
75
+ dir_location=dir_path
76
+ )
77
+ return PyPDFLoadStrategy(config)
modules/vector_stores/retrieval/__init__.py ADDED
File without changes
modules/vector_stores/retrieval/basic_qa.py ADDED
@@ -0,0 +1,35 @@
1
+ from modules.llm.defaults import get_default_cloud_chat_llm
2
+ from langchain.chains import RetrievalQA
3
+ from langchain.chains.retrieval_qa.base import BaseRetrievalQA
4
+ from langchain.vectorstores.base import VectorStore
5
+ from dataclasses import dataclass
6
+
7
+ @dataclass
8
+ class QAgentConfig:
9
+ index: VectorStore
10
+ qa: BaseRetrievalQA
11
+
12
+ class QAgent:
13
+ def __init__(self, index: VectorStore, qa: BaseRetrievalQA):
14
+ self.index = index
15
+ self.llm = get_default_cloud_chat_llm()
16
+ self.qa_chain = qa
17
+
18
+ def ask(self, question: str):
19
+ resp = self.qa_chain(question)
20
+ self.process_llm_response(resp)
21
+ return resp
22
+
23
+ ## Cite sources
24
+ def process_llm_response(self, llm_response):
25
+ print(llm_response['result'])
26
+ print('\n\nSources:')
27
+ for source in llm_response["source_documents"]:
28
+ print(source.metadata['source'])
29
+
30
+
31
+ def get_default_qa(index: VectorStore) -> QAgent:
32
+ llm = get_default_cloud_chat_llm()
33
+ qa_chain = RetrievalQA.from_chain_type(llm,chain_type="stuff",retriever=index.as_retriever(), return_source_documents=True)
34
+ qagent = QAgent(index, qa_chain)
35
+ return qagent
modules/vector_stores/vector_stores/__init__.py ADDED
File without changes
modules/vector_stores/vector_stores/chroma_manager.py ADDED
@@ -0,0 +1,84 @@
1
+ from langchain.vectorstores import Chroma
2
+ from langchain.vectorstores.base import VectorStoreRetriever
3
+ from modules.vector_stores.loaders.pypdf_load_strategy import (
4
+ get_default_pypdf_loader,
5
+ )
6
+ from modules.vector_stores.embedding.instructorxl import get_default_instructor_embedding
7
+
8
+ instruct_embed = get_default_instructor_embedding()
9
+ from dataclasses import dataclass
10
+ from langchain.embeddings.base import Embeddings
11
+ from typing import Iterable
12
+ from langchain.schema import Document
13
+ from loguru import logger
14
+
15
+
16
+ @dataclass
17
+ class ChromaConfig:
18
+ documents: Iterable[Document]
19
+ persist_directory: str
20
+ embedding: Embeddings
21
+ persisted: bool = False
22
+
23
+
24
+ class ChromaManager:
25
+ def __init__(self, config: ChromaConfig):
26
+ self.documents = config.documents
27
+ self.persist_directory = config.persist_directory
28
+ self.embedding = config.embedding
29
+ if config.persisted:
30
+ self.db = Chroma(
31
+ persist_directory=config.persist_directory, embedding_function=config.embedding
32
+ )
33
+ else:
34
+ self.db = Chroma.from_documents(
35
+ documents=config.documents,
36
+ embedding=config.embedding,
37
+ persist_directory=config.persist_directory,
38
+ )
39
+
40
+ def persist(self):
41
+ logger.info("Persisting Chroma to disk...")
42
+ self.db.persist()
43
+ logger.info("Chroma saved to {}", self.persist_directory)
44
+
45
+ def delete(self):
46
+ logger.info("Deleting Chroma from disk...")
47
+ self.db.delete_collection()
48
+ self.db.persist()
49
+ logger.info("Chroma deleted from {}", self.persist_directory)
50
+
51
+ def fetch_documents(self, query):
52
+ logger.info("Fetching documents from Chroma...")
53
+ retriever: VectorStoreRetriever = self.db.as_retriever()
54
+ documents = retriever.get_relevant_documents(query)
55
+ logger.info("Fetched {} documents from Chroma", len(documents))
56
+ return documents
57
+
58
+
59
+ def get_default_chroma_mgr(persisted=False):
60
+ """
61
+ Returns a default ChromaConfig instance. The default currently only reads in pdf files from the data directory.
62
+
63
+ Returns:
64
+ ChromaConfig: A new ChromaConfig instance.
65
+ """
66
+ dir_location = "../data"
67
+ persist_directory = "../db"
68
+ loader = get_default_pypdf_loader(dir_location)
69
+ documents: Iterable[Document] = loader.load()
70
+ embedding = get_default_instructor_embedding()
71
+ if persisted:
72
+ config = ChromaConfig(
73
+ documents=documents,
74
+ persist_directory=persist_directory,
75
+ embedding=embedding,
76
+ persisted=True,
77
+ )
78
+ else:
79
+ config = ChromaConfig(
80
+ documents=documents, persist_directory=persist_directory, embedding=embedding
81
+ )
82
+ chroma_mgr = ChromaManager(config)
83
+ return chroma_mgr
84
+
modules/vector_stores/vector_stores/pinecone_manager.py ADDED
@@ -0,0 +1,56 @@
1
+ import os
2
+ from langchain.vectorstores import Pinecone
3
+ from modules.vector_stores.embedding.openai import get_default_openai_embeddings
4
+ import pinecone
5
+
6
+
7
+ PINECONE_API_KEY = os.environ.get('PINECONE_API_KEY')
8
+ PINECONE_API_ENV = os.environ.get('PINECONE_API_ENV') # You may need to switch with your env
9
+
10
+ class PineconeSessionManager:
11
+ """
12
+ A class for managing Pinecone sessions and indexes.
13
+
14
+ Attributes:
15
+ embeddings (OpenAIEmbeddings): The embeddings object to use for indexing.
16
+ index_name (str): The name of the Pinecone index to use.
17
+ index (pinecone.GRPCIndex): The Pinecone index object.
18
+ docsearch (Pinecone): The Pinecone search object.
19
+ """
20
+ def __init__(self, embeddings, index_name):
21
+ """
22
+ Initializes a new PineconeSessionManager instance.
23
+
24
+ Args:
25
+ embeddings (OpenAIEmbeddings): The embeddings object to use for indexing.
26
+ index_name (str): The name of the Pinecone index to use.
27
+ """
28
+ self.embeddings = embeddings
29
+ self.index_name = index_name
30
+ # initialize pinecone
31
+ pinecone.init(
32
+ api_key=PINECONE_API_KEY, # find at app.pinecone.io
33
+ environment=PINECONE_API_ENV # next to api key in console
34
+ )
35
+
36
+ if index_name not in pinecone.list_indexes():
37
+ # we create a new index
38
+ pinecone.create_index(
39
+ name=index_name,
40
+ metric='cosine',
41
+ dimension=len(res[0]) # 1536 dim of text-embedding-ada-002
42
+ )
43
+ self.index = pinecone.GRPCIndex(index_name)
44
+ self.docsearch: Pinecone = Pinecone.from_existing_index(index_name=index_name, embedding=embeddings)
45
+
46
+ def get_default_pinecone_session(index: str) -> PineconeSessionManager:
47
+ """
48
+ Returns a default PineconeSessionManager instance with OpenAI embeddings and a default index name.
49
+
50
+ Returns:
51
+ PineconeSessionManager: A new PineconeSessionManager instance.
52
+ """
53
+ embeddings = get_default_openai_embeddings()
54
+ index_name = index # put in the name of your pinecone index here
55
+ pc_session = PineconeSessionManager(embeddings, index_name)
56
+ return pc_session
pyproject.toml ADDED
@@ -0,0 +1,12 @@
1
+ [build-system]
2
+ requires = ["setuptools>=61.0","setuptools-scm","pytest", "wheel"]
3
+ build-backend = "setuptools.build_meta"
4
+
5
+ [project]
6
+ name = "AIAgents"
7
+ version = "0.1"
8
+ description = "Fearnworks AI Agents"
9
+
10
+ [options]
11
+ py_modules = ["modules"]
12
+
requirements.txt CHANGED
@@ -1,8 +1,24 @@
1
- langchain
2
- streamlit
3
- gradio
 
 
4
  python-dotenv
5
- openai
6
- wikipedia
7
  ipykernel
8
- loguru
1
+ ###### General
2
+ pandas
3
+ torch
4
+ ###### Util
5
+ loguru
6
  python-dotenv
 
 
7
  ipykernel
8
+ ###### Langchain
9
+ langchain
10
+ unstructured
11
+ pypdf
12
+ wikipedia
13
+ ###### Open AI Libs
14
+ tiktoken
15
+ openai
16
+ ###### Vectorstores
17
+ pinecone-client[grpc] # if using the Pinecone vector store
18
+ chromadb
19
+ ##### UIs
20
+ # streamlit
21
+ gradio
22
+ ###### Embedding
23
+ sentence-transformers
24
+ InstructorEmbedding
sandbox.ipynb CHANGED
@@ -2,56 +2,68 @@
2
  "cells": [
3
  {
4
  "cell_type": "code",
5
- "execution_count": 1,
6
  "metadata": {},
7
- "outputs": [
8
- {
9
- "name": "stderr",
10
- "output_type": "stream",
11
- "text": [
12
- "/home/jphillips/ai_agents/.venv/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
13
- " from .autonotebook import tqdm as notebook_tqdm\n"
14
- ]
15
- },
16
- {
17
- "name": "stdout",
18
- "output_type": "stream",
19
- "text": [
20
- "Running on local URL: http://127.0.0.1:7860\n",
21
- "\n",
22
- "To create a public link, set `share=True` in `launch()`.\n"
23
- ]
24
- },
25
- {
26
- "data": {
27
- "text/html": [
28
- "<div><iframe src=\"http://127.0.0.1:7860/\" width=\"100%\" height=\"500\" allow=\"autoplay; camera; microphone; clipboard-read; clipboard-write;\" frameborder=\"0\" allowfullscreen></iframe></div>"
29
- ],
30
- "text/plain": [
31
- "<IPython.core.display.HTML object>"
32
- ]
33
- },
34
- "metadata": {},
35
- "output_type": "display_data"
36
- },
37
- {
38
- "data": {
39
- "text/plain": []
40
- },
41
- "execution_count": 1,
42
- "metadata": {},
43
- "output_type": "execute_result"
44
- }
45
- ],
46
  "source": [
47
- "import gradio as gr\n",
48
  "\n",
49
- "def greet(name):\n",
50
- " return \"Hello \" + name + \"!\"\n",
 
 
51
  "\n",
52
- "demo = gr.Interface(fn=greet, inputs=\"text\", outputs=\"text\")\n",
53
  "\n",
54
- "demo.launch() "
 
55
  ]
56
  }
57
  ],
 
2
  "cells": [
3
  {
4
  "cell_type": "code",
5
+ "execution_count": null,
6
  "metadata": {},
7
+ "outputs": [],
8
  "source": [
9
+ "import os\n",
10
+ "from langchain.vectorstores import Chroma, Pinecone\n",
11
+ "from langchain.embeddings.openai import OpenAIEmbeddings\n",
12
+ "import pinecone \n",
13
+ "import os\n",
14
+ "from dotenv import load_dotenv, find_dotenv\n",
15
+ "load_dotenv(find_dotenv() )\n",
16
  "\n",
17
+ "pinecone.init(\n",
18
+ " api_key=os.environ[\"PINECONE_API_KEY\"], \n",
19
+ " environment=os.environ[\"PINECONE_ENV\"]\n",
20
+ ")\n",
21
  "\n",
22
+ "OPENAI_API_KEY = os.environ.get('OPENAI_API_KEY')\n",
23
  "\n",
24
+ "PINECONE_API_KEY = os.environ.get('PINECONE_API_KEY')\n",
25
+ "PINECONE_API_ENV = os.environ.get('PINECONE_API_ENV') # You may need to switch with your env\n",
26
+ "PINECONE_INDEX_NAME= os.environ.get('PINECONE_INDEX_NAME')\n",
27
+ "\n",
28
+ "def get_default_index() -> Pinecone:\n",
29
+ " embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)\n",
30
+ " # initialize pinecone\n",
31
+ " pinecone.init(\n",
32
+ " api_key=PINECONE_API_KEY, # find at app.pinecone.io\n",
33
+ " environment=PINECONE_API_ENV # next to api key in console\n",
34
+ " )\n",
35
+ " index_name = PINECONE_INDEX_NAME # put in the name of your pinecone index here\n",
36
+ " print(PINECONE_API_ENV)\n",
37
+ " print(PINECONE_API_KEY)\n",
38
+ " if index_name not in pinecone.list_indexes():\n",
39
+ " # we create a new index\n",
40
+ " pinecone.create_index(\n",
41
+ " name=index_name,\n",
42
+ " metric='cosine',\n",
43
+ " dimension=1536 # dimension of text-embedding-ada-002 embeddings\n",
44
+ " )\n",
45
+ " index = pinecone.GRPCIndex(index_name)\n",
46
+ " return index\n",
47
+ "\n",
48
+ "index = get_default_index()\n",
49
+ "index.describe_index_stats()"
50
+ ]
51
+ },
52
+ {
53
+ "cell_type": "code",
54
+ "execution_count": null,
55
+ "metadata": {},
56
+ "outputs": [],
57
+ "source": [
58
+ "from modules.vector_stores.vector_stores.pinecone_manager import get_default_pinecone_session, PineconeSessionManager\n",
59
+ "import os\n",
60
+ "from langchain.vectorstores import Pinecone\n",
61
+ "PINECONE_INDEX_NAME= os.environ.get('PINECONE_INDEX_NAME')\n",
62
+ "print(PINECONE_INDEX_NAME)\n",
63
+ "### This is the default pinecone session manager\n",
64
+ "pinecone_session: PineconeSessionManager = get_default_pinecone_session(PINECONE_INDEX_NAME)\n",
65
+ "### This is the index for the default pinecone session manager\n",
66
+ "vector_index: Pinecone = pinecone_session.docsearch"
67
  ]
68
  }
69
  ],
setup.py ADDED
@@ -0,0 +1,9 @@
1
+ import setuptools
2
+
3
+ setuptools.setup(
4
+ name="river",
5
+ version="0.1",
6
+ description="Fearnworks AI Agents",
7
+ packages=setuptools.find_packages() + ["modules"],
8
+ python_requires=">=3.10",
9
+ )