juliaturc committed
Commit
2db1bb0
·
2 Parent(s): 40b4763 5b5303c

Merge pull request #13 from Storia-AI/julia/marqo

Files changed (8)
  1. README.md +55 -27
  2. requirements.txt +1 -0
  3. src/chat.py +14 -23
  4. src/chunker.py +25 -21
  5. src/embedder.py +63 -22
  6. src/index.py +51 -28
  7. src/repo_manager.py +7 -21
  8. src/vector_store.py +59 -6
README.md CHANGED
@@ -7,40 +7,68 @@
7
  **Ok, but why chat with a codebase?**
8
 
9
  Sometimes you just want to learn how a codebase works and how to integrate it, without spending hours sifting through
10
- the code itself.
11
 
12
- `repo2vec` is like GitHub Copilot but with the most up-to-date information about your repo.
13
 
14
- Features:
15
  - **Dead-simple set-up.** Run *two scripts* and you have a functional chat interface for your code. That's really it.
16
  - **Heavily documented answers.** Every response shows where in the code the context for the answer was pulled from. Let's build trust in the AI.
17
  - **Plug-and-play.** Want to improve the algorithms powering the code understanding/generation? We've made every component of the pipeline easily swappable. Customize to your heart's content.
18
 
19
- Here are the two scripts you need to run:
20
- ```
21
- pip install -r requirements.txt
22
 
23
- export GITHUB_REPO_NAME=...
24
- export OPENAI_API_KEY=...
25
- export PINECONE_API_KEY=...
26
- export PINECONE_INDEX_NAME=...
27
 
28
- python src/index.py $GITHUB_REPO_NAME --pinecone_index_name=$PINECONE_INDEX_NAME
29
- python src/chat.py $GITHUB_REPO_NAME --pinecone_index_name=$PINECONE_INDEX_NAME
30
- ```
31
- This will index your entire codebase in a vector DB, then bring up a `gradio` app where you can ask questions about it.
 
 
32
 
33
- The assistant responses always include GitHub links to the documents retrieved for each query.
 
 
34
 
35
- If you want to publicly host your chat experience, set `--share=true`:
36
- ```
37
- python src/chat.py $GITHUB_REPO_NAME --share=true ...
38
  ```
 
39
 
40
- That's it.
 
 
 
 
 
41
 
42
- Here is, for example, a conversation about the repo [Storia-AI/image-eval](https://github.com/Storia-AI/image-eval):
43
- ![screenshot](assets/chat_screenshot.png)
44
 
45
  # Peeking under the hood
46
 
@@ -50,10 +78,11 @@ The `src/index.py` script performs the following steps:
50
  - Make sure to set the `GITHUB_TOKEN` environment variable for private repositories.
51
  2. **Chunks files**. See [Chunker](src/chunker.py).
52
  - For code files, we implement a special `CodeChunker` that takes the parse tree into account.
53
- 3. **Batch-embeds chunks**. See [Embedder](src/embedder.py).
54
- - By default, we use OpenAI's [batch embedding API](https://platform.openai.com/docs/guides/batch/overview), which is much faster and cheaper than the regular synchronous embedding API.
 
55
  4. **Stores embeddings in a vector store**. See [VectorStore](src/vector_store.py).
56
- - By default, we use [Pinecone](https://pinecone.io) as a vector store, but you can easily plug in your own.
57
 
58
  Note you can specify an inclusion or exclusion set for the file extensions you want indexed. To specify an extension inclusion set, you can add the `--include` flag:
59
  ```
@@ -77,10 +106,9 @@ The sources are conveniently surfaced in the chat and linked directly to GitHub.
77
 
78
  # Want your repository hosted?
79
 
80
- We're working to make all code on the internet searchable and understandable for devs. If you would like help hosting
81
- your repository, we're onboarding a handful of repos onto our infrastructure **for free**.
82
 
83
- You'll get a dedicated url for your repo like `https://sage.storia.ai/[REPO_NAME]`. Just send us a message at [founders@storia.ai](mailto:founders@storia.ai)!
84
 
85
  ![](assets/sage.gif)
86
 
 
7
  **Ok, but why chat with a codebase?**
8
 
9
  Sometimes you just want to learn how a codebase works and how to integrate it, without spending hours sifting through
10
+ the code itself.
11
 
12
+ `repo2vec` is like GitHub Copilot but with the most up-to-date information about your repo.
13
 
14
+ Features:
15
  - **Dead-simple set-up.** Run *two scripts* and you have a functional chat interface for your code. That's really it.
16
  - **Heavily documented answers.** Every response shows where in the code the context for the answer was pulled from. Let's build trust in the AI.
17
  - **Plug-and-play.** Want to improve the algorithms powering the code understanding/generation? We've made every component of the pipeline easily swappable. Customize to your heart's content.
18
 
19
+ # How to run it
20
+ ## Indexing the codebase
21
+ We currently support two options for indexing the codebase:
22
 
23
+ 1. **Locally**, using the open-source [Marqo vector store](https://github.com/marqo-ai/marqo). Marqo is both an embedder (you can choose your favorite embedding model from Hugging Face) and a vector store.
 
 
 
24
 
25
+ You can bring up a Marqo instance using Docker:
26
+ ```
27
+ docker rm -f marqo
28
+ docker pull marqoai/marqo:latest
29
+ docker run --name marqo -it -p 8882:8882 marqoai/marqo:latest
30
+ ```
31
 
32
+ Then, to index your codebase, run:
33
+ ```
34
+ pip install -r requirements.txt
35
+
36
+ python src/index.py \
37
+ github-repo-name \
38
+ --embedder_type=marqo \
39
+ --vector_store_type=marqo \
40
+ --index_name=your-index-name  # github-repo-name is e.g. Storia-AI/repo2vec
41
+ ```
42
+
43
+ 2. **Using external providers** (OpenAI for embeddings and [Pinecone](https://www.pinecone.io/) for the vector store). To index your codebase, run:
44
+ ```
45
+ pip install -r requirements.txt
46
+
47
+ export OPENAI_API_KEY=...
48
+ export PINECONE_API_KEY=...
49
+
50
+ python src/index.py \
51
+ github-repo-name \
52
+ --embedder_type=openai \
53
+ --vector_store_type=pinecone \
54
+ --index_name=your-index-name  # github-repo-name is e.g. Storia-AI/repo2vec
55
+ ```
56
+ We are planning on adding more providers soon, so that you can mix and match them. Contributions are also welcome!
57
+
58
+ ## Chatting with the codebase
59
+ To bring up a `gradio` app where you can chat with your codebase, simply point it at your vector store:
60
 
 
 
 
61
  ```
62
+ export OPENAI_API_KEY=...
63
 
64
+ python src/chat.py \
65
+ github-repo-name \
66
+ --vector_store_type=marqo \
67
+ --index_name=your-index-name  # vector_store_type can also be pinecone
68
+ ```
69
+ To get a public URL for your chat app, set `--share=true`.
70
 
71
+ Currently, the chat will use OpenAI's GPT-4, but we are working on adding support for other providers and local LLMs. Stay tuned!
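The "Sources" footer appended to each answer boils down to deduplicating the filenames of the retrieved chunks (preserving order) and turning them into GitHub links. A minimal sketch; the link format here is an assumption, since the real URLs are built by the repo's `RepoManager.github_link_for_file`:

```python
def append_sources_to_response(answer: str, retrieved_filenames: list[str], repo_id: str) -> str:
    """Appends deduplicated GitHub links for the retrieved files to the answer."""
    # dict.fromkeys deduplicates while preserving the order of first occurrence.
    filenames = list(dict.fromkeys(retrieved_filenames))
    # Hypothetical link format; the real one comes from RepoManager.
    links = [f"https://github.com/{repo_id}/blob/main/{f}" for f in filenames]
    return answer + "\n\nSources:\n" + "\n".join(links)

print(append_sources_to_response(
    "The chunker is token-aware.",
    ["src/chunker.py", "README.md", "src/chunker.py"],
    "Storia-AI/repo2vec",
))
```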
 
72
 
73
  # Peeking under the hood
74
 
 
78
  - Make sure to set the `GITHUB_TOKEN` environment variable for private repositories.
79
  2. **Chunks files**. See [Chunker](src/chunker.py).
80
  - For code files, we implement a special `CodeChunker` that takes the parse tree into account.
81
+ 3. **Batch-embeds chunks**. See [Embedder](src/embedder.py). We currently support:
82
+ - [Marqo](https://github.com/marqo-ai/marqo) as an embedder, which allows you to specify your favorite Hugging Face embedding model;
83
+ - OpenAI's [batch embedding API](https://platform.openai.com/docs/guides/batch/overview), which is much faster and cheaper than the regular synchronous embedding API.
84
  4. **Stores embeddings in a vector store**. See [VectorStore](src/vector_store.py).
85
+ - We currently support [Marqo](https://github.com/marqo-ai/marqo) and [Pinecone](https://pinecone.io), but you can easily plug in your own.
86
 
87
  Note you can specify an inclusion or exclusion set for the file extensions you want indexed. To specify an extension inclusion set, you can add the `--include` flag:
88
  ```
 
106
 
107
  # Want your repository hosted?
108
 
109
+ We're working to make all code on the internet searchable and understandable for devs. You can check out our early product, [Code Sage](https://sage.storia.ai). We pre-indexed a slew of OSS repos, and you can index your desired ones by simply pasting a GitHub URL.
 
110
 
111
+ If you're the maintainer of an OSS repo and would like a dedicated page on Code Sage (e.g. `sage.storia.ai/your-repo`), then send us a message at [founders@storia.ai](mailto:founders@storia.ai). We'll do it for free!
112
 
113
  ![](assets/sage.gif)
114
 
requirements.txt CHANGED
@@ -4,6 +4,7 @@ gradio==4.42.0
4
  langchain==0.2.14
5
  langchain-community==0.2.12
6
  langchain-openai==0.1.22
 
7
  nbformat==5.10.4
8
  openai==1.42.0
9
  pinecone==5.0.1
 
4
  langchain==0.2.14
5
  langchain-community==0.2.12
6
  langchain-openai==0.1.22
7
+ marqo==3.7.0
8
  nbformat==5.10.4
9
  openai==1.42.0
10
  pinecone==5.0.1
src/chat.py CHANGED
@@ -5,16 +5,16 @@ You must run main.py first in order to index the codebase into a vector store.
5
 
6
  import argparse
7
 
8
- from dotenv import load_dotenv
9
-
10
  import gradio as gr
11
- from langchain.chains import create_history_aware_retriever, create_retrieval_chain
 
 
12
  from langchain.chains.combine_documents import create_stuff_documents_chain
13
  from langchain.schema import AIMessage, HumanMessage
14
- from langchain_community.vectorstores import Pinecone
15
  from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
16
- from langchain_openai import ChatOpenAI, OpenAIEmbeddings
17
 
 
18
  from repo_manager import RepoManager
19
 
20
  load_dotenv()
@@ -23,14 +23,7 @@ load_dotenv()
23
  def build_rag_chain(args):
24
  """Builds a RAG chain via LangChain."""
25
  llm = ChatOpenAI(model=args.openai_model)
26
-
27
- vectorstore = Pinecone.from_existing_index(
28
- index_name=args.pinecone_index_name,
29
- embedding=OpenAIEmbeddings(),
30
- namespace=args.repo_id,
31
- )
32
-
33
- retriever = vectorstore.as_retriever()
34
 
35
  # Prompt to contextualize the latest query based on the chat history.
36
  contextualize_q_system_prompt = (
@@ -45,9 +38,7 @@ def build_rag_chain(args):
45
  ("human", "{input}"),
46
  ]
47
  )
48
- history_aware_retriever = create_history_aware_retriever(
49
- llm, retriever, contextualize_q_prompt
50
- )
51
 
52
  qa_system_prompt = (
53
  f"You are my coding buddy, helping me quickly understand a GitHub repository called {args.repo_id}."
@@ -76,9 +67,7 @@ def append_sources_to_response(response):
76
  # Deduplicate filenames while preserving their order.
77
  filenames = list(dict.fromkeys(filenames))
78
  repo_manager = RepoManager(args.repo_id)
79
- github_links = [
80
- repo_manager.github_link_for_file(filename) for filename in filenames
81
- ]
82
  return response["answer"] + "\n\nSources:\n" + "\n".join(github_links)
83
 
84
 
@@ -90,8 +79,12 @@ if __name__ == "__main__":
90
  default="gpt-4",
91
  help="The OpenAI model to use for response generation",
92
  )
 
 
93
  parser.add_argument(
94
- "--pinecone_index_name", required=True, help="Pinecone index name"
 
 
95
  )
96
  parser.add_argument(
97
  "--share",
@@ -109,9 +102,7 @@ if __name__ == "__main__":
109
  history_langchain_format.append(HumanMessage(content=human))
110
  history_langchain_format.append(AIMessage(content=ai))
111
  history_langchain_format.append(HumanMessage(content=message))
112
- response = rag_chain.invoke(
113
- {"input": message, "chat_history": history_langchain_format}
114
- )
115
  answer = append_sources_to_response(response)
116
  return answer
117
 
 
5
 
6
  import argparse
7
 
 
 
8
  import gradio as gr
9
+ from dotenv import load_dotenv
10
+ from langchain.chains import (create_history_aware_retriever,
11
+ create_retrieval_chain)
12
  from langchain.chains.combine_documents import create_stuff_documents_chain
13
  from langchain.schema import AIMessage, HumanMessage
 
14
  from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
15
+ from langchain_openai import ChatOpenAI
16
 
17
+ import vector_store
18
  from repo_manager import RepoManager
19
 
20
  load_dotenv()
 
23
  def build_rag_chain(args):
24
  """Builds a RAG chain via LangChain."""
25
  llm = ChatOpenAI(model=args.openai_model)
26
+ retriever = vector_store.build_from_args(args).to_langchain().as_retriever()
 
 
 
 
 
 
 
27
 
28
  # Prompt to contextualize the latest query based on the chat history.
29
  contextualize_q_system_prompt = (
 
38
  ("human", "{input}"),
39
  ]
40
  )
41
+ history_aware_retriever = create_history_aware_retriever(llm, retriever, contextualize_q_prompt)
 
 
42
 
43
  qa_system_prompt = (
44
  f"You are my coding buddy, helping me quickly understand a GitHub repository called {args.repo_id}."
 
67
  # Deduplicate filenames while preserving their order.
68
  filenames = list(dict.fromkeys(filenames))
69
  repo_manager = RepoManager(args.repo_id)
70
+ github_links = [repo_manager.github_link_for_file(filename) for filename in filenames]
 
 
71
  return response["answer"] + "\n\nSources:\n" + "\n".join(github_links)
72
 
73
 
 
79
  default="gpt-4",
80
  help="The OpenAI model to use for response generation",
81
  )
82
+ parser.add_argument("--vector_store_type", default="pinecone", choices=["pinecone", "marqo"])
83
+ parser.add_argument("--index_name", required=True, help="Vector store index name")
84
  parser.add_argument(
85
+ "--marqo_url",
86
+ default="http://localhost:8882",
87
+ help="URL for the Marqo server. Required if using Marqo as embedder or vector store.",
88
  )
89
  parser.add_argument(
90
  "--share",
 
102
  history_langchain_format.append(HumanMessage(content=human))
103
  history_langchain_format.append(AIMessage(content=ai))
104
  history_langchain_format.append(HumanMessage(content=message))
105
+ response = rag_chain.invoke({"input": message, "chat_history": history_langchain_format})
 
 
106
  answer = append_sources_to_response(response)
107
  return answer
108
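The new `vector_store.build_from_args(args)` call in `chat.py` implies a small factory that dispatches on `--vector_store_type`. A hedged sketch of that pattern; the class names here are illustrative stand-ins, not the repo's actual API:

```python
import argparse
from dataclasses import dataclass


@dataclass
class MarqoStore:  # stand-in for the repo's Marqo-backed vector store
    index_name: str
    url: str


@dataclass
class PineconeStore:  # stand-in for the repo's Pinecone-backed vector store
    index_name: str


def build_from_args(args) -> object:
    """Builds the right vector store based on CLI args."""
    if args.vector_store_type == "marqo":
        return MarqoStore(index_name=args.index_name, url=args.marqo_url)
    if args.vector_store_type == "pinecone":
        return PineconeStore(index_name=args.index_name)
    raise ValueError(f"Unrecognized vector store type {args.vector_store_type}")


args = argparse.Namespace(
    vector_store_type="marqo", index_name="my-index", marqo_url="http://localhost:8882"
)
store = build_from_args(args)
```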
 
src/chunker.py CHANGED
@@ -1,12 +1,12 @@
1
  """Chunker abstraction and implementations."""
2
 
3
  import logging
4
- import nbformat
5
  from abc import ABC, abstractmethod
6
  from dataclasses import dataclass
7
  from functools import lru_cache
8
  from typing import List, Optional
9
 
 
10
  import pygments
11
  import tiktoken
12
  from semchunk import chunk as chunk_via_semchunk
@@ -30,11 +30,26 @@ class Chunk:
30
  """The text content to be embedded. Might contain information beyond just the text snippet from the file."""
31
  return self._content
32
 
 
33
  def populate_content(self, file_content: str):
34
  """Populates the content of the chunk with the file path and file content."""
35
- self._content = (
36
- self.filename + "\n\n" + file_content[self.start_byte : self.end_byte]
37
- )
38
 
39
  def num_tokens(self, tokenizer):
40
  """Counts the number of tokens in the chunk."""
@@ -98,9 +113,7 @@ class CodeChunker(Chunker):
98
 
99
  if not node.children:
100
  # This is a leaf node, but it's too long. We'll have to split it with a text tokenizer.
101
- return self.text_chunker.chunk(
102
- filename, file_content[node.start_byte : node.end_byte]
103
- )
104
 
105
  chunks = []
106
  for child in node.children:
@@ -116,11 +129,7 @@ class CodeChunker(Chunker):
116
  for chunk in chunks:
117
  if not merged_chunks:
118
  merged_chunks.append(chunk)
119
- elif (
120
- merged_chunks[-1].num_tokens(self.tokenizer)
121
- + chunk.num_tokens(self.tokenizer)
122
- < self.max_tokens - 50
123
- ):
124
  # There's a good chance that merging these two chunks will be under the token limit. We're not 100% sure
125
  # at this point, because tokenization is not necessarily additive.
126
  merged = Chunk(
@@ -186,9 +195,7 @@ class CodeChunker(Chunker):
186
  # a bug in the code.
187
  assert chunk.content
188
  size = chunk.num_tokens(self.tokenizer)
189
- assert (
190
- size <= self.max_tokens
191
- ), f"Chunk size {size} exceeds max_tokens {self.max_tokens}."
192
 
193
  return chunks
194
 
@@ -200,17 +207,13 @@ class TextChunker(Chunker):
200
  self.max_tokens = max_tokens
201
 
202
  tokenizer = tiktoken.get_encoding("cl100k_base")
203
- self.count_tokens = lambda text: len(
204
- tokenizer.encode(text, disallowed_special=())
205
- )
206
 
207
  def chunk(self, file_path: str, file_content: str) -> List[Chunk]:
208
  """Chunks a text file into smaller pieces."""
209
  # We need to allocate some tokens for the filename, which is part of the chunk content.
210
  extra_tokens = self.count_tokens(file_path + "\n\n")
211
- text_chunks = chunk_via_semchunk(
212
- file_content, self.max_tokens - extra_tokens, self.count_tokens
213
- )
214
 
215
  chunks = []
216
  start = 0
@@ -235,6 +238,7 @@ class IPYNBChunker(Chunker):
235
 
236
  Based on https://github.com/GoogleCloudPlatform/generative-ai/blob/main/language/code/code_retrieval_augmented_generation.ipynb
237
  """
 
238
  def __init__(self, code_chunker: CodeChunker):
239
  self.code_chunker = code_chunker
240
 
 
1
  """Chunker abstraction and implementations."""
2
 
3
  import logging
 
4
  from abc import ABC, abstractmethod
5
  from dataclasses import dataclass
6
  from functools import lru_cache
7
  from typing import List, Optional
8
 
9
+ import nbformat
10
  import pygments
11
  import tiktoken
12
  from semchunk import chunk as chunk_via_semchunk
 
30
  """The text content to be embedded. Might contain information beyond just the text snippet from the file."""
31
  return self._content
32
 
33
+ @property
34
+ def to_metadata(self):
35
+ """Converts the chunk to a dictionary that can be passed to a vector store."""
36
+ # Some vector stores require the IDs to be ASCII.
37
+ filename_ascii = self.filename.encode("ascii", "ignore").decode("ascii")
38
+ return {
39
40
+ "id": f"{filename_ascii}_{self.start_byte}_{self.end_byte}",
41
+ "filename": self.filename,
42
+ "start_byte": self.start_byte,
43
+ "end_byte": self.end_byte,
44
+ # Note to developer: When choosing a large chunk size, you might exceed the vector store's metadata
45
+ # size limit. In that case, you can simply store the start/end bytes above, and fetch the content
46
+ # directly from the repository when needed.
47
+ "text": self.content,
48
+ }
49
+
50
  def populate_content(self, file_content: str):
51
  """Populates the content of the chunk with the file path and file content."""
52
+ self._content = self.filename + "\n\n" + file_content[self.start_byte : self.end_byte]
 
 
53
 
54
  def num_tokens(self, tokenizer):
55
  """Counts the number of tokens in the chunk."""
 
113
 
114
  if not node.children:
115
  # This is a leaf node, but it's too long. We'll have to split it with a text tokenizer.
116
+ return self.text_chunker.chunk(filename, file_content[node.start_byte : node.end_byte])
 
 
117
 
118
  chunks = []
119
  for child in node.children:
 
129
  for chunk in chunks:
130
  if not merged_chunks:
131
  merged_chunks.append(chunk)
132
+ elif merged_chunks[-1].num_tokens(self.tokenizer) + chunk.num_tokens(self.tokenizer) < self.max_tokens - 50:
 
 
 
 
133
  # There's a good chance that merging these two chunks will be under the token limit. We're not 100% sure
134
  # at this point, because tokenization is not necessarily additive.
135
  merged = Chunk(
 
195
  # a bug in the code.
196
  assert chunk.content
197
  size = chunk.num_tokens(self.tokenizer)
198
+ assert size <= self.max_tokens, f"Chunk size {size} exceeds max_tokens {self.max_tokens}."
 
 
199
 
200
  return chunks
201
 
 
207
  self.max_tokens = max_tokens
208
 
209
  tokenizer = tiktoken.get_encoding("cl100k_base")
210
+ self.count_tokens = lambda text: len(tokenizer.encode(text, disallowed_special=()))
 
 
211
 
212
  def chunk(self, file_path: str, file_content: str) -> List[Chunk]:
213
  """Chunks a text file into smaller pieces."""
214
  # We need to allocate some tokens for the filename, which is part of the chunk content.
215
  extra_tokens = self.count_tokens(file_path + "\n\n")
216
+ text_chunks = chunk_via_semchunk(file_content, self.max_tokens - extra_tokens, self.count_tokens)
 
 
217
 
218
  chunks = []
219
  start = 0
 
238
 
239
  Based on https://github.com/GoogleCloudPlatform/generative-ai/blob/main/language/code/code_retrieval_augmented_generation.ipynb
240
  """
241
+
242
  def __init__(self, code_chunker: CodeChunker):
243
  self.code_chunker = code_chunker
244
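The `to_metadata` property added to `Chunk` above can be exercised in isolation. A small sketch of the ASCII-ID logic; the `Chunk` here is a trimmed stand-in for the repo's dataclass:

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    filename: str
    start_byte: int
    end_byte: int
    content: str = ""

    @property
    def to_metadata(self) -> dict:
        # Some vector stores require IDs to be ASCII, so non-ASCII characters
        # are dropped from the ID (but kept in the stored filename).
        filename_ascii = self.filename.encode("ascii", "ignore").decode("ascii")
        return {
            "id": f"{filename_ascii}_{self.start_byte}_{self.end_byte}",
            "filename": self.filename,
            "start_byte": self.start_byte,
            "end_byte": self.end_byte,
            "text": self.content,
        }


chunk = Chunk(filename="src/café.py", start_byte=0, end_byte=42, content="def foo(): ...")
print(chunk.to_metadata["id"])  # the non-ASCII "é" is stripped from the ID only
```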
 
src/embedder.py CHANGED
@@ -7,6 +7,7 @@ from abc import ABC, abstractmethod
7
  from collections import Counter
8
  from typing import Dict, Generator, List, Tuple
9
 
 
10
  from openai import OpenAI
11
 
12
  from chunker import Chunk, Chunker
@@ -19,7 +20,7 @@ class BatchEmbedder(ABC):
19
  """Abstract class for batch embedding of a repository."""
20
 
21
  @abstractmethod
22
- def embed_repo(self, chunks_per_batch: int):
23
  """Issues batch embedding jobs for the entire repository."""
24
 
25
  @abstractmethod
@@ -62,7 +63,7 @@ class OpenAIBatchEmbedder(BatchEmbedder):
62
  openai_batch_id = self._issue_job_for_chunks(
63
  sub_batch, batch_id=f"{repo_name}/{len(self.openai_batch_ids)}"
64
  )
65
- self.openai_batch_ids[openai_batch_id] = self._metadata_for_chunks(sub_batch)
66
  if max_embedding_jobs and len(self.openai_batch_ids) >= max_embedding_jobs:
67
  logging.info("Reached the maximum number of embedding jobs. Stopping.")
68
  return
@@ -71,7 +72,7 @@ class OpenAIBatchEmbedder(BatchEmbedder):
71
  # Finally, commit the last batch.
72
  if batch:
73
  openai_batch_id = self._issue_job_for_chunks(batch, batch_id=f"{repo_name}/{len(self.openai_batch_ids)}")
74
- self.openai_batch_ids[openai_batch_id] = self._metadata_for_chunks(batch)
75
  logging.info("Issued %d jobs for %d chunks.", len(self.openai_batch_ids), chunk_count)
76
 
77
  # Save the job IDs to a file, just in case this script is terminated by mistake.
@@ -171,22 +172,62 @@ class OpenAIBatchEmbedder(BatchEmbedder):
171
  },
172
  }
173
 
174
- @staticmethod
175
- def _metadata_for_chunks(chunks):
176
- metadata = []
177
- for chunk in chunks:
178
- filename_ascii = chunk.filename.encode("ascii", "ignore").decode("ascii")
179
- metadata.append(
180
- {
181
- # Some vector stores require the IDs to be ASCII.
182
- "id": f"{filename_ascii}_{chunk.start_byte}_{chunk.end_byte}",
183
- "filename": chunk.filename,
184
- "start_byte": chunk.start_byte,
185
- "end_byte": chunk.end_byte,
186
- # Note to developer: When choosing a large chunk size, you might exceed the vector store's metadata
187
- # size limit. In that case, you can simply store the start/end bytes above, and fetch the content
188
- # directly from the repository when needed.
189
- "text": chunk.content,
190
- }
191
- )
192
- return metadata
 
 
7
  from collections import Counter
8
  from typing import Dict, Generator, List, Tuple
9
 
10
+ import marqo
11
  from openai import OpenAI
12
 
13
  from chunker import Chunk, Chunker
 
20
  """Abstract class for batch embedding of a repository."""
21
 
22
  @abstractmethod
23
+ def embed_repo(self, chunks_per_batch: int, max_embedding_jobs: int = None):
24
  """Issues batch embedding jobs for the entire repository."""
25
 
26
  @abstractmethod
 
63
  openai_batch_id = self._issue_job_for_chunks(
64
  sub_batch, batch_id=f"{repo_name}/{len(self.openai_batch_ids)}"
65
  )
66
+ self.openai_batch_ids[openai_batch_id] = [chunk.to_metadata for chunk in sub_batch]
67
  if max_embedding_jobs and len(self.openai_batch_ids) >= max_embedding_jobs:
68
  logging.info("Reached the maximum number of embedding jobs. Stopping.")
69
  return
 
72
  # Finally, commit the last batch.
73
  if batch:
74
  openai_batch_id = self._issue_job_for_chunks(batch, batch_id=f"{repo_name}/{len(self.openai_batch_ids)}")
75
+ self.openai_batch_ids[openai_batch_id] = [chunk.to_metadata for chunk in batch]
76
  logging.info("Issued %d jobs for %d chunks.", len(self.openai_batch_ids), chunk_count)
77
 
78
  # Save the job IDs to a file, just in case this script is terminated by mistake.
 
172
  },
173
  }
174
 
175
+
176
+ class MarqoEmbedder(BatchEmbedder):
177
+ """Embedder that uses the open-source Marqo vector search engine.
178
+
179
+ Embeddings can be stored locally (in which case the `url` passed to the constructor should point to localhost) or in the cloud.
180
+ """
181
+
182
+ def __init__(self, repo_manager: RepoManager, chunker: Chunker, index_name: str, url: str, model="hf/e5-base-v2"):
183
+ self.repo_manager = repo_manager
184
+ self.chunker = chunker
185
+ self.client = marqo.Client(url=url)
186
+ self.index = self.client.index(index_name)
187
+
188
+ all_index_names = [result["indexName"] for result in self.client.get_indexes()["results"]]
189
+ if index_name not in all_index_names:
190
+ self.client.create_index(index_name, model=model)
191
+
192
+ def embed_repo(self, chunks_per_batch: int, max_embedding_jobs: int = None):
193
+ """Issues batch embedding jobs for the entire repository."""
194
+ if chunks_per_batch > 64:
195
+ raise ValueError("Marqo enforces a limit of 64 chunks per batch.")
196
+
197
+ chunk_count = 0
198
+ batch = []
199
+
200
+ for filepath, content in self.repo_manager.walk():
201
+ chunks = self.chunker.chunk(filepath, content)
202
+ chunk_count += len(chunks)
203
+ batch.extend(chunks)
204
+
205
+ if len(batch) > chunks_per_batch:
206
+ for i in range(0, len(batch), chunks_per_batch):
207
+ sub_batch = batch[i : i + chunks_per_batch]
208
+ logging.info("Indexing %d chunks...", len(sub_batch))
209
+ self.index.add_documents(
210
+ documents=[chunk.to_metadata for chunk in sub_batch],
211
+ tensor_fields=["text"],
212
+ )
213
+
214
+ if max_embedding_jobs and len(self.openai_batch_ids) >= max_embedding_jobs:
215
+ logging.info("Reached the maximum number of embedding jobs. Stopping.")
216
+ return
217
+ batch = []
218
+
219
+ # Finally, commit the last batch.
220
+ if batch:
221
+ self.index.add_documents(documents=[chunk.to_metadata for chunk in batch], tensor_fields=["text"])
222
+ logging.info(f"Successfully embedded {chunk_count} chunks.")
223
+
224
+ def embeddings_are_ready(self) -> bool:
225
+ """Checks whether the batch embedding jobs are done."""
226
+ # Marqo indexes documents synchronously, so once embed_repo() returns, the embeddings are ready.
227
+ return True
228
+
229
+ def download_embeddings(self) -> Generator[Vector, None, None]:
230
+ """Yields (chunk_metadata, embedding) pairs for each chunk in the repository."""
231
+ # Marqo stores embeddings as they are created, so they're already in the vector store. No need to download them
232
+ # as we would with e.g. OpenAI, Cohere, or some other cloud-based embedding service.
233
+ return []
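The sub-batching in `MarqoEmbedder.embed_repo` (Marqo caps a batch at 64 documents) reduces to a plain slicing loop. A standalone sketch with `add_documents` stubbed out as a callback:

```python
MARQO_MAX_BATCH = 64  # Marqo enforces a limit of 64 documents per add_documents call.


def index_in_sub_batches(documents: list[dict], add_documents, chunks_per_batch: int = MARQO_MAX_BATCH) -> int:
    """Sends documents to the vector store in slices of at most chunks_per_batch.

    Returns the number of indexing calls issued.
    """
    if chunks_per_batch > MARQO_MAX_BATCH:
        raise ValueError("Marqo enforces a limit of 64 chunks per batch.")
    jobs = 0
    for i in range(0, len(documents), chunks_per_batch):
        add_documents(documents[i : i + chunks_per_batch])
        jobs += 1
    return jobs


batches = []  # collects each sub-batch in place of a real Marqo client
n = index_in_sub_batches([{"id": str(i)} for i in range(150)], batches.append)
```

With 150 documents and the default batch size, this issues three calls of sizes 64, 64, and 22.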
src/index.py CHANGED
@@ -5,19 +5,14 @@ import logging
5
  import time
6
 
7
  from chunker import UniversalChunker
8
- from embedder import OpenAIBatchEmbedder
9
  from repo_manager import RepoManager
10
- from vector_store import PineconeVectorStore
11
 
12
  logging.basicConfig(level=logging.INFO)
13
 
14
- OPENAI_EMBEDDING_SIZE = 1536
15
- MAX_TOKENS_PER_CHUNK = (
16
- 8192 # The ADA embedder from OpenAI has a maximum of 8192 tokens.
17
- )
18
- MAX_CHUNKS_PER_BATCH = (
19
- 2048 # The OpenAI batch embedding API enforces a maximum of 2048 chunks per batch.
20
- )
21
  MAX_TOKENS_PER_JOB = 3_000_000 # The OpenAI batch embedding API enforces a maximum of 3M tokens processed at once.
22
 
23
 
@@ -29,6 +24,8 @@ def _read_extensions(path):
29
  def main():
30
  parser = argparse.ArgumentParser(description="Batch-embeds a repository")
31
  parser.add_argument("repo_id", help="The ID of the repository to index")
 
 
32
  parser.add_argument(
33
  "--local_dir",
34
  default="repos",
@@ -41,11 +38,12 @@ def main():
41
  help="https://arxiv.org/pdf/2406.14497 recommends a value between 200-800.",
42
  )
43
  parser.add_argument(
44
- "--chunks_per_batch", type=int, default=2000, help="Maximum chunks per batch"
45
- )
46
- parser.add_argument(
47
- "--pinecone_index_name", required=True, help="Pinecone index name"
48
  )
 
49
  parser.add_argument(
50
  "--include",
51
  help="Path to a file containing a list of extensions to include. One extension per line.",
@@ -56,22 +54,37 @@ def main():
56
  help="Path to a file containing a list of extensions to exclude. One extension per line.",
57
  )
58
  parser.add_argument(
59
- "--max_embedding_jobs", type=int,
 
60
  help="Maximum number of embedding jobs to run. Specifying this might result in "
61
  "indexing only part of the repository, but prevents you from burning through OpenAI credits.",
62
  )
63
-
 
 
 
 
 
 
 
 
 
64
  args = parser.parse_args()
65
 
66
- # Validate the arguments.
 
 
 
 
 
 
 
 
 
67
  if args.tokens_per_chunk > MAX_TOKENS_PER_CHUNK:
68
- parser.error(
69
- f"The maximum number of tokens per chunk is {MAX_TOKENS_PER_CHUNK}."
70
- )
71
  if args.chunks_per_batch > MAX_CHUNKS_PER_BATCH:
72
- parser.error(
73
- f"The maximum number of chunks per batch is {MAX_CHUNKS_PER_BATCH}."
74
- )
75
  if args.tokens_per_chunk * args.chunks_per_batch >= MAX_TOKENS_PER_JOB:
76
parser.error(f"The maximum number of tokens per job is {MAX_TOKENS_PER_JOB}.")
77
  if args.include and args.exclude:
@@ -91,9 +104,23 @@ def main():
91
 
92
  logging.info("Issuing embedding jobs...")
93
  chunker = UniversalChunker(max_tokens=args.tokens_per_chunk)
94
- embedder = OpenAIBatchEmbedder(repo_manager, chunker, args.local_dir)
 
 
 
 
 
 
 
 
 
95
  embedder.embed_repo(args.chunks_per_batch, args.max_embedding_jobs)
96
 
 
 
 
 
 
97
  logging.info("Waiting for embeddings to be ready...")
98
  while not embedder.embeddings_are_ready():
99
  logging.info("Sleeping for 30 seconds...")
@@ -101,11 +128,7 @@ def main():
101
 
102
  logging.info("Moving embeddings to the vector store...")
103
  # Note to developer: Replace this with your preferred vector store.
104
- vector_store = PineconeVectorStore(
105
- index_name=args.pinecone_index_name,
106
- dimension=OPENAI_EMBEDDING_SIZE,
107
- namespace=repo_manager.repo_id,
108
- )
109
  vector_store.ensure_exists()
110
  vector_store.upsert(embedder.download_embeddings())
111
  logging.info("Done!")
 
5
  import time
6
 
  from chunker import UniversalChunker
+ from embedder import MarqoEmbedder, OpenAIBatchEmbedder
  from repo_manager import RepoManager
+ from vector_store import build_from_args

  logging.basicConfig(level=logging.INFO)

+ MAX_TOKENS_PER_CHUNK = 8192  # The ADA embedder from OpenAI has a maximum of 8192 tokens.
+ MAX_CHUNKS_PER_BATCH = 2048  # The OpenAI batch embedding API enforces a maximum of 2048 chunks per batch.
  MAX_TOKENS_PER_JOB = 3_000_000  # The OpenAI batch embedding API enforces a maximum of 3M tokens processed at once.


  def main():
      parser = argparse.ArgumentParser(description="Batch-embeds a repository")
      parser.add_argument("repo_id", help="The ID of the repository to index")
+     parser.add_argument("--embedder_type", default="openai", choices=["openai", "marqo"])
+     parser.add_argument("--vector_store_type", default="pinecone", choices=["pinecone", "marqo"])
      parser.add_argument(
          "--local_dir",
          default="repos",

          help="https://arxiv.org/pdf/2406.14497 recommends a value between 200-800.",
      )
      parser.add_argument(
+         "--chunks_per_batch",
+         type=int,
+         default=2000,
+         help="Maximum chunks per batch. We recommend 2000 for the OpenAI embedder. Marqo enforces a limit of 64.",
      )
+     parser.add_argument("--index_name", required=True, help="Vector store index name")
      parser.add_argument(
          "--include",
          help="Path to a file containing a list of extensions to include. One extension per line.",

          help="Path to a file containing a list of extensions to exclude. One extension per line.",
      )
      parser.add_argument(
+         "--max_embedding_jobs",
+         type=int,
          help="Maximum number of embedding jobs to run. Specifying this might result in "
          "indexing only part of the repository, but prevents you from burning through OpenAI credits.",
      )
+     parser.add_argument(
+         "--marqo_url",
+         default="http://localhost:8882",
+         help="URL for the Marqo server. Required if using Marqo as embedder or vector store.",
+     )
+     parser.add_argument(
+         "--marqo_embedding_model",
+         default="hf/e5-base-v2",
+         help="The embedding model to use for Marqo.",
+     )
      args = parser.parse_args()

+     # Validate embedder and vector store compatibility.
+     if args.embedder_type == "openai" and args.vector_store_type != "pinecone":
+         parser.error("When using the OpenAI embedder, the vector store type must be Pinecone.")
+     if args.embedder_type == "marqo" and args.vector_store_type != "marqo":
+         parser.error("When using the Marqo embedder, the vector store type must also be Marqo.")
+     if args.embedder_type == "marqo" and args.chunks_per_batch > 64:
+         args.chunks_per_batch = 64
+         logging.warning("Marqo enforces a limit of 64 chunks per batch. Setting --chunks_per_batch to 64.")
+
+     # Validate other arguments.
      if args.tokens_per_chunk > MAX_TOKENS_PER_CHUNK:
+         parser.error(f"The maximum number of tokens per chunk is {MAX_TOKENS_PER_CHUNK}.")
      if args.chunks_per_batch > MAX_CHUNKS_PER_BATCH:
+         parser.error(f"The maximum number of chunks per batch is {MAX_CHUNKS_PER_BATCH}.")
      if args.tokens_per_chunk * args.chunks_per_batch >= MAX_TOKENS_PER_JOB:
          parser.error(f"The maximum number of tokens per job is {MAX_TOKENS_PER_JOB}.")
      if args.include and args.exclude:

      logging.info("Issuing embedding jobs...")
      chunker = UniversalChunker(max_tokens=args.tokens_per_chunk)
+
+     if args.embedder_type == "openai":
+         embedder = OpenAIBatchEmbedder(repo_manager, chunker, args.local_dir)
+     elif args.embedder_type == "marqo":
+         embedder = MarqoEmbedder(
+             repo_manager, chunker, index_name=args.index_name, url=args.marqo_url, model=args.marqo_embedding_model
+         )
+     else:
+         raise ValueError(f"Unrecognized embedder type {args.embedder_type}")
+
      embedder.embed_repo(args.chunks_per_batch, args.max_embedding_jobs)

+     if args.vector_store_type == "marqo":
+         # Marqo computes embeddings and stores them in the vector store at once, so we're done.
+         logging.info("Done!")
+         return
+
      logging.info("Waiting for embeddings to be ready...")
      while not embedder.embeddings_are_ready():
          logging.info("Sleeping for 30 seconds...")

      logging.info("Moving embeddings to the vector store...")
      # Note to developer: Replace this with your preferred vector store.
+     vector_store = build_from_args(args)
      vector_store.ensure_exists()
      vector_store.upsert(embedder.download_embeddings())
      logging.info("Done!")
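The compatibility checks added to `main()` above reduce to three rules: the OpenAI embedder requires Pinecone, the Marqo embedder requires Marqo, and Marqo batches are capped at 64 chunks. A standalone sketch of that logic (the helper name `resolve_config` and the free-function form are ours, for illustration only):

```python
MARQO_MAX_CHUNKS_PER_BATCH = 64  # Marqo rejects batches larger than 64 documents.


def resolve_config(embedder_type: str, vector_store_type: str, chunks_per_batch: int) -> int:
    """Validates the embedder/vector-store pairing and returns the effective batch size."""
    if embedder_type == "openai" and vector_store_type != "pinecone":
        raise ValueError("When using the OpenAI embedder, the vector store type must be Pinecone.")
    if embedder_type == "marqo" and vector_store_type != "marqo":
        raise ValueError("When using the Marqo embedder, the vector store type must also be Marqo.")
    if embedder_type == "marqo" and chunks_per_batch > MARQO_MAX_CHUNKS_PER_BATCH:
        # index.py logs a warning and clamps the batch size rather than erroring out.
        return MARQO_MAX_CHUNKS_PER_BATCH
    return chunks_per_batch
```

In the script itself the same checks run through `parser.error`, which exits with a usage message instead of raising.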
src/repo_manager.py CHANGED
@@ -35,9 +35,7 @@ class RepoManager:
      @cached_property
      def is_public(self) -> bool:
          """Checks whether a GitHub repository is publicly visible."""
-         response = requests.get(
-             f"https://api.github.com/repos/{self.repo_id}", timeout=10
-         )
+         response = requests.get(f"https://api.github.com/repos/{self.repo_id}", timeout=10)
          # Note that the response will be 404 for both private and non-existent repos.
          return response.status_code == 200

@@ -50,17 +48,13 @@ class RepoManager:
          if self.access_token:
              headers["Authorization"] = f"token {self.access_token}"

-         response = requests.get(
-             f"https://api.github.com/repos/{self.repo_id}", headers=headers
-         )
+         response = requests.get(f"https://api.github.com/repos/{self.repo_id}", headers=headers)
          if response.status_code == 200:
              branch = response.json().get("default_branch", "main")
          else:
              # This happens sometimes when we exceed the Github rate limit. The best bet in this case is to assume the
              # most common naming for the default branch ("main").
-             logging.warn(
-                 f"Unable to fetch default branch for {self.repo_id}: {response.text}"
-             )
+             logging.warn(f"Unable to fetch default branch for {self.repo_id}: {response.text}")
              branch = "main"
          return branch

@@ -81,9 +75,7 @@ class RepoManager:
          try:
              Repo.clone_from(clone_url, self.local_path, depth=1, single_branch=True)
          except GitCommandError as e:
-             logging.error(
-                 "Unable to clone %s from %s. Error: %s", self.repo_id, clone_url, e
-             )
+             logging.error("Unable to clone %s from %s. Error: %s", self.repo_id, clone_url, e)
              return False
          return True

@@ -130,9 +122,7 @@ class RepoManager:
          for path in included_file_paths:
              f.write(path + "\n")

-         excluded_file_paths = set(file_paths).difference(
-             set(included_file_paths)
-         )
+         excluded_file_paths = set(file_paths).difference(set(included_file_paths))
          with open(excluded_log_file, "a") as f:
              for path in excluded_file_paths:
                  f.write(path + "\n")

@@ -142,15 +132,11 @@ class RepoManager:
          try:
              contents = f.read()
          except UnicodeDecodeError:
-             logging.warning(
-                 "Unable to decode file %s. Skipping.", file_path
-             )
+             logging.warning("Unable to decode file %s. Skipping.", file_path)
              continue
          yield file_path[len(self.local_dir) + 1 :], contents

      def github_link_for_file(self, file_path: str) -> str:
          """Converts a repository file path to a GitHub link."""
          file_path = file_path[len(self.repo_id) :]
-         return (
-             f"https://github.com/{self.repo_id}/blob/{self.default_branch}/{file_path}"
-         )
+         return f"https://github.com/{self.repo_id}/blob/{self.default_branch}/{file_path}"
 
src/vector_store.py CHANGED
@@ -3,13 +3,19 @@
  from abc import ABC, abstractmethod
  from typing import Dict, Generator, List, Tuple

+ import marqo
+ from langchain_community.vectorstores import Marqo
+ from langchain_core.documents import Document
+ from langchain_openai import OpenAIEmbeddings
  from pinecone import Pinecone

+ OPENAI_EMBEDDING_SIZE = 1536
  Vector = Tuple[Dict, List[float]]  # (metadata, embedding)


  class VectorStore(ABC):
      """Abstract class for a vector store."""
+
      @abstractmethod
      def ensure_exists(self):
          """Ensures that the vector store exists. Creates it if it doesn't."""

@@ -29,11 +35,15 @@ class VectorStore(ABC):
          if batch:
              self.upsert_batch(batch)

+     @abstractmethod
+     def to_langchain(self):
+         """Converts the vector store to a LangChain vector store object."""
+

  class PineconeVectorStore(VectorStore):
      """Vector store implementation using Pinecone."""

-     def __init__(self, index_name: str, dimension: int, namespace: str):
+     def __init__(self, index_name: str, namespace: str, dimension: int = OPENAI_EMBEDDING_SIZE):
          self.index_name = index_name
          self.dimension = dimension
          self.client = Pinecone()

@@ -42,13 +52,56 @@ class PineconeVectorStore(VectorStore):
      def ensure_exists(self):
          if self.index_name not in self.client.list_indexes().names():
-             self.client.create_index(
-                 name=self.index_name, dimension=self.dimension, metric="cosine"
-             )
+             self.client.create_index(name=self.index_name, dimension=self.dimension, metric="cosine")

      def upsert_batch(self, vectors: List[Vector]):
          pinecone_vectors = [
-             (metadata.get("id", str(i)), embedding, metadata)
-             for i, (metadata, embedding) in enumerate(vectors)
+             (metadata.get("id", str(i)), embedding, metadata) for i, (metadata, embedding) in enumerate(vectors)
          ]
          self.index.upsert(vectors=pinecone_vectors, namespace=self.namespace)
+
+     def to_langchain(self):
+         return Pinecone.from_existing_index(
+             index_name=self.index_name, embedding=OpenAIEmbeddings(), namespace=self.namespace
+         )
+
+
+ class MarqoVectorStore(VectorStore):
+     """Vector store implementation using Marqo."""
+
+     def __init__(self, url: str, index_name: str):
+         self.client = marqo.Client(url=url)
+         self.index_name = index_name
+
+     def ensure_exists(self):
+         pass
+
+     def upsert_batch(self, vectors: List[Vector]):
+         # Since Marqo is both an embedder and a vector store, the embedder is already doing the upsert.
+         pass
+
+     def to_langchain(self):
+         vectorstore = Marqo(client=self.client, index_name=self.index_name)
+
+         # Monkey-patch the _construct_documents_from_results_without_score method to not expect a "metadata"
+         # field in the result, and instead take the "filename" directly from the result.
+         def patched_method(self, results):
+             documents: List[Document] = []
+             for res in results["hits"]:
+                 documents.append(Document(page_content=res["text"], metadata={"filename": res["filename"]}))
+             return documents
+
+         vectorstore._construct_documents_from_results_without_score = patched_method.__get__(
+             vectorstore, vectorstore.__class__
+         )
+         return vectorstore
+
+
+ def build_from_args(args) -> VectorStore:
+     """Builds a vector store from the given command-line arguments."""
+     if args.vector_store_type == "pinecone":
+         return PineconeVectorStore(index_name=args.index_name, namespace=args.repo_id)
+     elif args.vector_store_type == "marqo":
+         return MarqoVectorStore(url=args.marqo_url, index_name=args.index_name)
+     else:
+         raise ValueError(f"Unrecognized vector store type {args.vector_store_type}")
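The `patched_method.__get__(vectorstore, vectorstore.__class__)` call in `to_langchain` above uses Python's descriptor protocol to bind a plain function as a method on one instance, leaving the class and all other instances untouched. A minimal self-contained illustration (the `Store` class is invented for the demo):

```python
class Store:
    def describe(self) -> str:
        return "original"


def patched(self) -> str:
    # `self` is supplied by the bound-method machinery, exactly as in the diff above.
    return f"patched {type(self).__name__}"


store = Store()
other = Store()

# function.__get__(instance, cls) returns a bound method; assigning it to the
# instance shadows the class-level method for that instance only.
store.describe = patched.__get__(store, Store)

assert store.describe() == "patched Store"  # only this instance is affected
assert other.describe() == "original"
```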