feat: add support for document ingestion and vectorization
Add a new file `constants.py` with a constant `persist_directory` naming the directory where the vectorized documents are stored.
In `ingest.py`, import necessary modules and define a `loader` to load documents from a directory. Then, use a `text_splitter` to split the documents into smaller chunks. Next, use an `embedding` to convert the text chunks into vectors. Finally, create a `vectordb` using the `Chroma` vector store and persist it to the `persist_directory`.
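The splitting step works on overlapping windows: each chunk is at most `chunk_size` characters and shares `chunk_overlap` characters with its neighbor, so no sentence is lost at a boundary. A minimal stdlib sketch of that idea (this is not the real `RecursiveCharacterTextSplitter`, which additionally tries to break on the listed separators rather than at fixed offsets):

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Naive fixed-window chunker: advance by chunk_size - chunk_overlap."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text("x" * 2500, chunk_size=1000, chunk_overlap=200)
print(len(chunks))      # → 4 (windows start at 0, 800, 1600, 2400)
print(len(chunks[0]))   # → 1000
```

The langchain splitter improves on this by preferring paragraph, line, and sentence boundaries, but the chunk-size/overlap arithmetic is the same.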
- constants.py +1 -0
- ingest.py +27 -0
constants.py
ADDED
@@ -0,0 +1 @@
+persist_directory = 'db'
ingest.py
ADDED
@@ -0,0 +1,27 @@
+from langchain.document_loaders import PyPDFDirectoryLoader
+from langchain.embeddings.openai import OpenAIEmbeddings
+from langchain.text_splitter import RecursiveCharacterTextSplitter
+from langchain.vectorstores import Chroma
+
+from constants import persist_directory
+
+loader = PyPDFDirectoryLoader("docs/")
+documents = loader.load()
+
+text_splitter = RecursiveCharacterTextSplitter(
+    chunk_size=1000,
+    chunk_overlap=200,
+    separators=["\n\n", "\n", ".", "!", ",", " ", ""],
+    keep_separator=True,
+)
+texts = text_splitter.split_documents(documents)
+
+embedding = OpenAIEmbeddings()
+
+vectordb = Chroma.from_documents(
+    documents=texts,
+    embedding=embedding,
+    persist_directory=persist_directory,
+)
+
+vectordb.persist()