anpigon committed on
Commit
6c93025
1 Parent(s): 0a54092

feat: add support for document ingestion and vectorization

Add a new file `constants.py` that defines `persist_directory`, the directory where the vectorized documents are stored.

In `ingest.py`, import the necessary modules and define a `loader` (`PyPDFDirectoryLoader`) that loads PDF documents from the `docs/` directory. A `text_splitter` then splits the documents into smaller, overlapping chunks, and an `embedding` (`OpenAIEmbeddings`) converts the text chunks into vectors. Finally, a `vectordb` is created with the `Chroma` vector store and persisted to `persist_directory`.
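
To make the chunking step concrete, here is a minimal sketch of fixed-width splitting with overlap. It is a simplification and not how `RecursiveCharacterTextSplitter` works internally: the real splitter prefers to break on the configured separators, while this sketch cuts at fixed character offsets. The function name `sliding_chunks` is hypothetical; the defaults mirror the `chunk_size` and `chunk_overlap` values used in this commit.

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-width chunks that overlap by chunk_overlap characters.

    Simplified illustration only: separators are ignored.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    # Each new chunk starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

With `chunk_size=1000` and `chunk_overlap=200`, consecutive chunks share 200 characters, so a sentence that straddles a boundary is likely to appear whole in at least one chunk.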

Files changed (2)
  1. constants.py +1 -0
  2. ingest.py +27 -0
constants.py ADDED
@@ -0,0 +1 @@
+ persist_directory = 'db'
ingest.py ADDED
@@ -0,0 +1,27 @@
+ from langchain.document_loaders import PyPDFDirectoryLoader
+ from langchain.embeddings.openai import OpenAIEmbeddings
+ from langchain.text_splitter import RecursiveCharacterTextSplitter
+ from langchain.vectorstores import Chroma
+
+ from constants import persist_directory
+
+ loader = PyPDFDirectoryLoader("docs/")
+ documents = loader.load()
+
+ text_splitter = RecursiveCharacterTextSplitter(
+     chunk_size=1000,
+     chunk_overlap=200,
+     separators=["\n\n", "\n", ".", "!", ",", " ", ""],
+     keep_separator=True,
+ )
+ texts = text_splitter.split_documents(documents)
+
+ embedding = OpenAIEmbeddings()
+
+ vectordb = Chroma.from_documents(
+     documents=texts,
+     embedding=embedding,
+     persist_directory=persist_directory,
+ )
+
+ vectordb.persist()
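
Once `ingest.py` has run, the persisted store could be reopened and queried roughly as follows. This is a sketch and not part of the commit: the helper names `open_store` and `top_chunks` are hypothetical, the default `"db"` mirrors the value in `constants.py`, and the imports assume the same legacy `langchain` package layout used in the diff. Actually opening the store requires an `OPENAI_API_KEY` in the environment.

```python
def open_store(persist_directory="db"):
    """Reopen the Chroma store written by ingest.py (hypothetical helper)."""
    # Imported lazily so top_chunks() can be used without langchain installed.
    from langchain.embeddings.openai import OpenAIEmbeddings
    from langchain.vectorstores import Chroma

    # Queries must be embedded with the same model used at ingestion time.
    return Chroma(
        persist_directory=persist_directory,
        embedding_function=OpenAIEmbeddings(),
    )


def top_chunks(store, query, k=3):
    """Return the text of the k chunks most similar to the query."""
    docs = store.similarity_search(query, k=k)
    return [doc.page_content for doc in docs]


# Example usage (needs a populated store and an API key):
#   store = open_store()
#   for chunk in top_chunks(store, "What is this document about?"):
#       print(chunk[:200])
```

Passing the store in as an argument keeps `top_chunks` independent of how the store was created, so it works with any object that exposes `similarity_search`.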