Index GitHub Issues (#21)
* Generalize RepoManager into DataManager
* Add chunker for GitHub issues
* Update README, fix flags.
- README.md +21 -14
- src/chat.py +10 -14
- src/chunker.py +93 -72
- src/{repo_manager.py → data_manager.py} +49 -25
- src/embedder.py +50 -34
- src/github.py +226 -0
- src/index.py +91 -52
README.md
CHANGED

````diff
@@ -38,9 +38,9 @@ We currently support two options for indexing the codebase:
 
 python src/index.py \
     github-repo-name \  # e.g. Storia-AI/repo2vec
-    --
-    --
-    --
+    --embedder-type=marqo \
+    --vector-store-type=marqo \
+    --index-name=your-index-name
 ```
 
 2. **Using external providers** (OpenAI for embeddings and [Pinecone](https://www.pinecone.io/) for the vector store). To index your codebase, run:
@@ -52,12 +52,15 @@ We currently support two options for indexing the codebase:
 
 python src/index.py \
     github-repo-name \  # e.g. Storia-AI/repo2vec
-    --
-    --
-    --
+    --embedder-type=openai \
+    --vector-store-type=pinecone \
+    --index-name=your-index-name
 ```
 We are planning on adding more providers soon, so that you can mix and match them. Contributions are also welcome!
 
+## Indexing GitHub Issues
+By default, we also index the open GitHub issues associated with a codebase. You can control what gets indexed with the `--index-repo` and `--index-issues` flags (and their converse `--no-index-repo` and `--no-index-issues`).
+
 ## Chatting with the codebase
 We provide a `gradio` app where you can chat with your codebase. You can use either a local LLM (via [Ollama](https://ollama.com)), or a cloud provider like OpenAI or Anthropic.
 
@@ -68,10 +71,10 @@ To chat with a local LLM:
 ```
 python src/chat.py \
     github-repo-name \  # e.g. Storia-AI/repo2vec
-    --
-    --
-    --
-    --
+    --llm-provider=ollama \
+    --llm-model=llama3.1 \
+    --vector-store-type=marqo \  # or pinecone
+    --index-name=your-index-name
 ```
 
 To chat with a cloud-based LLM, for instance Anthropic's Claude:
@@ -80,10 +83,10 @@ export ANTHROPIC_API_KEY=...
 
 python src/chat.py \
     github-repo-name \  # e.g. Storia-AI/repo2vec
-    --
-    --
-    --
-    --
+    --llm-provider=anthropic \
+    --llm-model=claude-3-opus-20240229 \
+    --vector-store-type=marqo \  # or pinecone
+    --index-name=your-index-name
 ```
 To get a public URL for your chat app, set `--share=true`.
 
@@ -121,6 +124,10 @@ The `src/chat.py` brings up a [Gradio app](https://www.gradio.app/) with a chat
 
 The sources are conveniently surfaced in the chat and linked directly to GitHub.
 
+# Changelog
+- 2024-09-03: Support for indexing GitHub issues.
+- 2024-08-30: Support for running everything locally (Marqo for embeddings, Ollama for LLMs).
+
 # Want your repository hosted?
 
 We're working to make all code on the internet searchable and understandable for devs. You can check out our early product, [Code Sage](https://sage.storia.ai). We pre-indexed a slew of OSS repos, and you can index your desired ones by simply pasting a GitHub URL.
````
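The `--index-repo`/`--no-index-repo` and `--index-issues`/`--no-index-issues` flag pairs described in the README can be expressed in `argparse` with `BooleanOptionalAction`, which auto-generates the `--no-*` converse for each flag. This is a minimal sketch of that pattern; the diff does not show how the repo actually defines these flags, so the parser setup below is an assumption:

```python
import argparse

# Sketch (not the repo's actual code) of paired boolean flags like
# --index-issues / --no-index-issues. BooleanOptionalAction requires Python 3.9+.
parser = argparse.ArgumentParser()
parser.add_argument("--index-repo", default=True, action=argparse.BooleanOptionalAction)
parser.add_argument("--index-issues", default=True, action=argparse.BooleanOptionalAction)

args = parser.parse_args(["--no-index-issues"])
print(args.index_repo, args.index_issues)  # True False
```

Both flags default to on, matching the README's "by default, we also index the open GitHub issues" behavior.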
src/chat.py
CHANGED

```diff
@@ -7,15 +7,13 @@ import argparse
 
 import gradio as gr
 from dotenv import load_dotenv
-from langchain.chains import
-    create_retrieval_chain)
+from langchain.chains import create_history_aware_retriever, create_retrieval_chain
 from langchain.chains.combine_documents import create_stuff_documents_chain
 from langchain.schema import AIMessage, HumanMessage
 from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
 
 import vector_store
 from llm import build_llm_via_langchain
-from repo_manager import RepoManager
 
 load_dotenv()
 
@@ -63,26 +61,24 @@ def build_rag_chain(args):
 
 def append_sources_to_response(response):
     """Given an OpenAI completion response, appends to it GitHub links of the context sources."""
-
-    # Deduplicate
-
-
-    github_links = [repo_manager.github_link_for_file(filename) for filename in filenames]
-    return response["answer"] + "\n\nSources:\n" + "\n".join(github_links)
+    urls = [document.metadata["url"] for document in response["context"]]
+    # Deduplicate urls while preserving their order.
+    urls = list(dict.fromkeys(urls))
+    return response["answer"] + "\n\nSources:\n" + "\n".join(urls)
 
 
 if __name__ == "__main__":
     parser = argparse.ArgumentParser(description="UI to chat with your codebase")
     parser.add_argument("repo_id", help="The ID of the repository to index")
-    parser.add_argument("--
+    parser.add_argument("--llm-provider", default="anthropic", choices=["openai", "anthropic", "ollama"])
     parser.add_argument(
-        "--
+        "--llm-model",
         help="The LLM name. Must be supported by the provider specified via --llm_provider.",
     )
-    parser.add_argument("--
-    parser.add_argument("--
+    parser.add_argument("--vector-store-type", default="pinecone", choices=["pinecone", "marqo"])
+    parser.add_argument("--index-name", required=True, help="Vector store index name")
     parser.add_argument(
-        "--
+        "--marqo-url",
         default="http://localhost:8882",
         help="URL for the Marqo server. Required if using Marqo as embedder or vector store.",
     )
```
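The new `append_sources_to_response` deduplicates source URLs with `dict.fromkeys` rather than `set`, which keeps the first-seen order of the retrieved documents. A standalone illustration of that idiom (the URLs are placeholders):

```python
urls = [
    "https://github.com/Storia-AI/repo2vec/blob/main/src/chat.py",
    "https://github.com/Storia-AI/repo2vec/blob/main/src/index.py",
    "https://github.com/Storia-AI/repo2vec/blob/main/src/chat.py",  # duplicate
]

# dict keys are unique and (since Python 3.7) preserve insertion order,
# so this drops duplicates while keeping the first occurrence of each URL.
deduped = list(dict.fromkeys(urls))
print(len(deduped))  # 2
```

A plain `set(urls)` would also deduplicate, but scrambles the order in which sources are shown to the user.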
src/chunker.py
CHANGED

```diff
@@ -3,8 +3,8 @@
 import logging
 from abc import ABC, abstractmethod
 from dataclasses import dataclass
-from functools import
-from typing import List, Optional
+from functools import cached_property
+from typing import Any, Dict, List, Optional
 
 import nbformat
 import pygments
@@ -14,31 +14,47 @@ from tree_sitter import Node
 from tree_sitter_language_pack import get_parser
 
 logger = logging.getLogger(__name__)
+tokenizer = tiktoken.get_encoding("cl100k_base")
 
 
-@dataclass
 class Chunk:
+    @abstractmethod
+    def content(self) -> str:
+        """The content of the chunk to be indexed."""
+
+    @abstractmethod
+    def metadata(self) -> Dict:
+        """Metadata for the chunk to be indexed."""
+
+
+@dataclass
+class FileChunk(Chunk):
     """A chunk of code or text extracted from a file in the repository."""
 
-
+    file_content: str  # The content of the entire file, not just this chunk.
+    file_metadata: Dict  # Metadata of the entire file, not just this chunk.
     start_byte: int
     end_byte: int
-    _content: Optional[str] = None
 
-    @
+    @cached_property
+    def filename(self):
+        if not "file_path" in self.file_metadata:
+            raise ValueError("file_metadata must contain a 'file_path' key.")
+        return self.file_metadata["file_path"]
+
+    @cached_property
     def content(self) -> Optional[str]:
         """The text content to be embedded. Might contain information beyond just the text snippet from the file."""
-        return self.
+        return self.filename + "\n\n" + self.file_content[self.start_byte : self.end_byte]
 
-    @
-    def
+    @cached_property
+    def metadata(self):
         """Converts the chunk to a dictionary that can be passed to a vector store."""
         # Some vector stores require the IDs to be ASCII.
         filename_ascii = self.filename.encode("ascii", "ignore").decode("ascii")
-
+        chunk_metadata = {
             # Some vector stores require the IDs to be ASCII.
             "id": f"{filename_ascii}_{self.start_byte}_{self.end_byte}",
-            "filename": self.filename,
             "start_byte": self.start_byte,
             "end_byte": self.end_byte,
             # Note to developer: When choosing a large chunk size, you might exceed the vector store's metadata
@@ -46,22 +62,13 @@ class Chunk:
             # directly from the repository when needed.
             "text": self.content,
         }
+        chunk_metadata.update(self.file_metadata)
+        return chunk_metadata
 
-
-
-
-    def num_tokens(self, tokenizer):
-        """Counts the number of tokens in the chunk."""
-        if not self.content:
-            raise ValueError("Content not populated.")
-        return Chunk._cached_num_tokens(self.content, tokenizer)
-
-    @staticmethod
-    @lru_cache(maxsize=1024)
-    def _cached_num_tokens(content: str, tokenizer):
-        """Static method to cache token counts."""
-        return len(tokenizer.encode(content, disallowed_special=()))
+    @cached_property
+    def num_tokens(self):
+        """Number of tokens in this chunk."""
+        return len(tokenizer.encode(self.content, disallowed_special=()))
 
     def __eq__(self, other):
         if isinstance(other, Chunk):
@@ -77,20 +84,19 @@ class Chunk:
 
 
 class Chunker(ABC):
-    """Abstract class for chunking a
+    """Abstract class for chunking a datum into smaller pieces."""
 
     @abstractmethod
-    def chunk(self,
-        """Chunks a
+    def chunk(self, content: Any, metadata: Dict) -> List[Chunk]:
+        """Chunks a datum into smaller pieces."""
 
 
-class
+class CodeFileChunker(Chunker):
     """Splits a code file into chunks of at most `max_tokens` tokens each."""
 
     def __init__(self, max_tokens: int):
         self.max_tokens = max_tokens
-        self.
-        self.text_chunker = TextChunker(max_tokens)
+        self.text_chunker = TextFileChunker(max_tokens)
 
     @staticmethod
     def _get_language_from_filename(filename: str):
@@ -103,25 +109,24 @@ class CodeChunker(Chunker):
         except pygments.util.ClassNotFound:
             return None
 
-    def _chunk_node(self, node: Node,
+    def _chunk_node(self, node: Node, file_content: str, file_metadata: Dict) -> List[FileChunk]:
         """Splits a node in the parse tree into a flat list of chunks."""
-        node_chunk =
-        node_chunk.populate_content(file_content)
+        node_chunk = FileChunk(file_content, file_metadata, node.start_byte, node.end_byte)
 
-        if node_chunk.num_tokens
+        if node_chunk.num_tokens <= self.max_tokens:
             return [node_chunk]
 
         if not node.children:
             # This is a leaf node, but it's too long. We'll have to split it with a text tokenizer.
-            return self.text_chunker.chunk(
+            return self.text_chunker.chunk(file_content[node.start_byte : node.end_byte], file_metadata)
 
         chunks = []
         for child in node.children:
-            chunks.extend(self._chunk_node(child,
+            chunks.extend(self._chunk_node(child, file_content, file_metadata))
 
         for chunk in chunks:
             # This should always be true. Otherwise there must be a bug in the code.
-            assert chunk.
+            assert chunk.num_tokens <= self.max_tokens
 
         # Merge neighboring chunks if their combined size doesn't exceed max_tokens. The goal is to avoid pathologically
         # small chunks that end up being undeservedly preferred by the retriever.
@@ -129,16 +134,16 @@ class CodeChunker(Chunker):
         for chunk in chunks:
             if not merged_chunks:
                 merged_chunks.append(chunk)
-            elif merged_chunks[-1].num_tokens
+            elif merged_chunks[-1].num_tokens + chunk.num_tokens < self.max_tokens - 50:
                 # There's a good chance that merging these two chunks will be under the token limit. We're not 100% sure
                 # at this point, because tokenization is not necessarily additive.
-                merged =
-
+                merged = FileChunk(
+                    file_content,
+                    file_metadata,
                     merged_chunks[-1].start_byte,
                     chunk.end_byte,
                 )
-                merged.
-                if merged.num_tokens(self.tokenizer) <= self.max_tokens:
+                if merged.num_tokens <= self.max_tokens:
                     merged_chunks[-1] = merged
                 else:
                     merged_chunks.append(chunk)
@@ -148,20 +153,20 @@ class CodeChunker(Chunker):
 
         for chunk in merged_chunks:
             # This should always be true. Otherwise there's a bug worth investigating.
-            assert chunk.
+            assert chunk.num_tokens <= self.max_tokens
 
         return merged_chunks
 
     @staticmethod
     def is_code_file(filename: str) -> bool:
         """Checks whether pygment & tree_sitter can parse the file as code."""
-        language =
+        language = CodeFileChunker._get_language_from_filename(filename)
         return language and language not in ["text only", "None"]
 
     @staticmethod
     def parse_tree(filename: str, content: str) -> List[str]:
         """Parses the code in a file and returns the parse tree."""
-        language =
+        language = CodeFileChunker._get_language_from_filename(filename)
 
         if not language or language in ["text only", "None"]:
             logging.debug("%s doesn't seem to be a code file.", filename)
@@ -180,8 +185,12 @@ class CodeChunker(Chunker):
             return None
         return tree
 
-    def chunk(self,
+    def chunk(self, content: Any, metadata: Dict) -> List[Chunk]:
         """Chunks a code file into smaller pieces."""
+        file_content = content
+        file_metadata = metadata
+        file_path = metadata["file_path"]
+
         if not file_content.strip():
             return []
 
@@ -189,33 +198,33 @@ class CodeChunker(Chunker):
         if tree is None:
             return []
 
-
-        for chunk in
+        file_chunks = self._chunk_node(tree.root_node, file_content, file_metadata)
+        for chunk in file_chunks:
             # Make sure that the chunk has content and doesn't exceed the max_tokens limit. Otherwise there must be
             # a bug in the code.
-            assert chunk.
-            size = chunk.num_tokens(self.tokenizer)
-            assert size <= self.max_tokens, f"Chunk size {size} exceeds max_tokens {self.max_tokens}."
+            assert chunk.num_tokens <= self.max_tokens, f"Chunk size {chunk.num_tokens} exceeds max_tokens {self.max_tokens}."
 
-        return
+        return file_chunks
 
 
-class
+class TextFileChunker(Chunker):
     """Wrapper around semchunk: https://github.com/umarbutler/semchunk."""
 
     def __init__(self, max_tokens: int):
         self.max_tokens = max_tokens
-
-        tokenizer = tiktoken.get_encoding("cl100k_base")
         self.count_tokens = lambda text: len(tokenizer.encode(text, disallowed_special=()))
 
-    def chunk(self,
+    def chunk(self, content: Any, metadata: Dict) -> List[Chunk]:
         """Chunks a text file into smaller pieces."""
+        file_content = content
+        file_metadata = metadata
+        file_path = file_metadata["file_path"]
+
         # We need to allocate some tokens for the filename, which is part of the chunk content.
         extra_tokens = self.count_tokens(file_path + "\n\n")
         text_chunks = chunk_via_semchunk(file_content, self.max_tokens - extra_tokens, self.count_tokens)
 
-
+        file_chunks = []
         start = 0
         for text_chunk in text_chunks:
             # This assertion should always be true. Otherwise there's a bug worth finding.
@@ -227,22 +236,25 @@ class TextChunker(Chunker):
                 logging.warning("Couldn't find semchunk in content: %s", text_chunk)
             else:
                 end = start + len(text_chunk)
-
+                file_chunks.append(FileChunk(file_content, file_metadata, start, end))
 
                 start = end
-
+
+        return file_chunks
 
 
-class
+class IpynbFileChunker(Chunker):
     """Extracts the python code from a Jupyter notebook, removing all the boilerplate.
 
     Based on https://github.com/GoogleCloudPlatform/generative-ai/blob/main/language/code/code_retrieval_augmented_generation.ipynb
     """
 
-    def __init__(self, code_chunker:
+    def __init__(self, code_chunker: CodeFileChunker):
         self.code_chunker = code_chunker
 
-    def chunk(self,
+    def chunk(self, content: Any, metadata: Dict) -> List[Chunk]:
+        filename = metadata["file_path"]
+
         if not filename.lower().endswith(".ipynb"):
             logging.warn("IPYNBChunker is only for .ipynb files.")
             return []
@@ -256,16 +268,25 @@ class IPYNBChunker(Chunker):
         return chunks
 
 
-class
+class UniversalFileChunker(Chunker):
     """Chunks a file into smaller pieces, regardless of whether it's code or text."""
 
     def __init__(self, max_tokens: int):
-        self.code_chunker =
-        self.
+        self.code_chunker = CodeFileChunker(max_tokens)
+        self.ipynb_chunker = IpynbFileChunker(self.code_chunker)
+        self.text_chunker = TextFileChunker(max_tokens)
 
-    def chunk(self,
+    def chunk(self, content: Any, metadata: Dict) -> List[Chunk]:
+        if not "file_path" in metadata:
+            raise ValueError("metadata must contain a 'file_path' key.")
+        file_path = metadata["file_path"]
+
+        # Figure out the appropriate chunker to use.
         if file_path.lower().endswith(".ipynb"):
-
-        if
-
-
+            chunker = self.ipynb_chunker
+        if CodeFileChunker.is_code_file(file_path):
+            chunker = self.code_chunker
+        else:
+            chunker = self.text_chunker
+
+        return chunker.chunk(content, metadata)
```
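The subtle part of `CodeFileChunker._chunk_node` above is the merging pass: neighboring chunks are greedily coalesced while they fit under `max_tokens`, and the merged span is re-counted because token counts are not additive. This is a self-contained sketch of that loop, with hypothetical names (`Span`, `merge_neighbors`) and a whitespace word count standing in for the real tiktoken `cl100k_base` tokenizer:

```python
from dataclasses import dataclass

def num_tokens(text: str) -> int:
    # Stand-in tokenizer: the real chunker counts BPE tokens via tiktoken.
    return len(text.split())

@dataclass
class Span:
    start: int  # Byte/char offset where the chunk begins.
    end: int    # Offset one past the chunk's last character.

def merge_neighbors(content: str, spans, max_tokens: int):
    """Greedily merge adjacent spans while the merged span stays under max_tokens."""
    merged = []
    for span in spans:
        if not merged:
            merged.append(span)
            continue
        candidate = Span(merged[-1].start, span.end)
        # Re-count on the merged text: tokenization is not necessarily additive.
        if num_tokens(content[candidate.start:candidate.end]) <= max_tokens:
            merged[-1] = candidate
        else:
            merged.append(span)
    return merged

content = "one two three four five six"
spans = [Span(0, 7), Span(8, 13), Span(14, 27)]  # "one two", "three", "four five six"
print([content[s.start:s.end] for s in merge_neighbors(content, spans, 4)])
# ['one two three', 'four five six']
```

The goal, as the diff's comment says, is to avoid pathologically small chunks that the retriever would otherwise undeservedly prefer.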
src/{repo_manager.py → data_manager.py}
RENAMED

```diff
@@ -2,13 +2,28 @@
 
 import logging
 import os
+from abc import abstractmethod
 from functools import cached_property
+from typing import Any, Dict, Generator, Tuple
 
 import requests
 from git import GitCommandError, Repo
 
 
-class
+class DataManager:
+    def __init__(self, dataset_id: str):
+        self.dataset_id = dataset_id
+
+    @abstractmethod
+    def download(self) -> bool:
+        """Downloads the data from a remote location."""
+
+    @abstractmethod
+    def walk(self) -> Generator[Tuple[Any, Dict], None, None]:
+        """Yields a tuple of (data, metadata) for each data item in the dataset."""
+
+
+class GitHubRepoManager(DataManager):
     """Class to manage a local clone of a GitHub repository."""
 
     def __init__(
@@ -23,11 +38,18 @@ class RepoManager:
             repo_id: The identifier of the repository in owner/repo format, e.g. "Storia-AI/repo2vec".
             local_dir: The local directory where the repository will be cloned.
         """
+        super().__init__(dataset_id=repo_id)
         self.repo_id = repo_id
+
         self.local_dir = local_dir or "/tmp/"
         if not os.path.exists(self.local_dir):
             os.makedirs(self.local_dir)
         self.local_path = os.path.join(self.local_dir, repo_id)
+
+        self.log_dir = os.path.join(self.local_dir, "logs", repo_id)
+        if not os.path.exists(self.log_dir):
+            os.makedirs(self.log_dir)
+
         self.access_token = os.getenv("GITHUB_TOKEN")
         self.included_extensions = included_extensions
         self.excluded_extensions = excluded_extensions
@@ -58,7 +80,7 @@ class RepoManager:
             branch = "main"
         return branch
 
-    def
+    def download(self) -> bool:
         """Clones the repository to the local directory, if it's not already cloned."""
         if os.path.exists(self.local_path):
             # The repository is already cloned.
@@ -94,38 +116,35 @@ class RepoManager:
             return False
         return True
 
-    def walk(self
-        """Walks the local repository path and yields a tuple of (
+    def walk(self) -> Generator[Tuple[Any, Dict], None, None]:
+        """Walks the local repository path and yields a tuple of (content, metadata) for each file.
         The filepath is relative to the root of the repository (e.g. "org/repo/your/file/path.py").
 
         Args:
             included_extensions: Optional set of extensions to include.
             excluded_extensions: Optional set of extensions to exclude.
-            log_dir: Optional directory where to log the included and excluded files.
         """
         # We will keep appending to these files during the iteration, so we need to clear them first.
+        repo_name = self.repo_id.replace("/", "_")
+        included_log_file = os.path.join(self.log_dir, f"included_{repo_name}.txt")
+        excluded_log_file = os.path.join(self.log_dir, f"excluded_{repo_name}.txt")
-
-
-
-
-
-        os.remove(excluded_log_file)
 
         for root, _, files in os.walk(self.local_path):
             file_paths = [os.path.join(root, file) for file in files]
             included_file_paths = [f for f in file_paths if self._should_include(f)]
 
-
-
-
-                f.write(path + "\n")
 
-
-
-
 
         for file_path in included_file_paths:
             with open(file_path, "r") as f:
@@ -134,9 +153,14 @@ class RepoManager:
             except UnicodeDecodeError:
                 logging.warning("Unable to decode file %s. Skipping.", file_path)
                 continue
-
-
-
         """Converts a repository file path to a GitHub link."""
-        file_path = file_path[len(self.repo_id) :]
         return f"https://github.com/{self.repo_id}/blob/{self.default_branch}/{file_path}"
```
|
| 131 |
+
if os.path.exists(included_log_file):
|
| 132 |
+
os.remove(included_log_file)
|
| 133 |
+
if os.path.exists(excluded_log_file):
|
| 134 |
+
os.remove(excluded_log_file)
|
|
|
|
| 135 |
|
| 136 |
for root, _, files in os.walk(self.local_path):
|
| 137 |
file_paths = [os.path.join(root, file) for file in files]
|
| 138 |
included_file_paths = [f for f in file_paths if self._should_include(f)]
|
| 139 |
|
| 140 |
+
with open(included_log_file, "a") as f:
|
| 141 |
+
for path in included_file_paths:
|
| 142 |
+
f.write(path + "\n")
|
|
|
|
| 143 |
|
| 144 |
+
excluded_file_paths = set(file_paths).difference(set(included_file_paths))
|
| 145 |
+
with open(excluded_log_file, "a") as f:
|
| 146 |
+
for path in excluded_file_paths:
|
| 147 |
+
f.write(path + "\n")
|
| 148 |
|
| 149 |
for file_path in included_file_paths:
|
| 150 |
with open(file_path, "r") as f:
|
|
|
|
| 153 |
except UnicodeDecodeError:
|
| 154 |
logging.warning("Unable to decode file %s. Skipping.", file_path)
|
| 155 |
continue
|
| 156 |
+
relative_file_path = file_path[len(self.local_dir) + 1 :]
|
| 157 |
+
metadata = {
|
| 158 |
+
"file_path": relative_file_path,
|
| 159 |
+
"url": self.url_for_file(relative_file_path),
|
| 160 |
+
}
|
| 161 |
+
yield contents, metadata
|
| 162 |
+
|
| 163 |
+
def url_for_file(self, file_path: str) -> str:
|
| 164 |
"""Converts a repository file path to a GitHub link."""
|
| 165 |
+
file_path = file_path[len(self.repo_id) + 1 :]
|
| 166 |
return f"https://github.com/{self.repo_id}/blob/{self.default_branch}/{file_path}"
|
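As a quick check of the `DataManager` contract above (`download()` fetches data, `walk()` yields `(content, metadata)` pairs), here is a minimal sketch with a toy in-memory subclass. `InMemoryManager` and its hard-coded documents are illustrative only, not part of the commit:

```python
from abc import abstractmethod
from typing import Any, Dict, Generator, Tuple


class DataManager:
    """Mirrors the interface from src/data_manager.py."""

    def __init__(self, dataset_id: str):
        self.dataset_id = dataset_id

    @abstractmethod
    def download(self) -> bool:
        """Downloads the data from a remote location."""

    @abstractmethod
    def walk(self) -> Generator[Tuple[Any, Dict], None, None]:
        """Yields a tuple of (data, metadata) for each data item in the dataset."""


class InMemoryManager(DataManager):
    """Toy manager that serves a hard-coded list of documents."""

    def __init__(self):
        super().__init__(dataset_id="toy/dataset")
        self._docs = [("hello world", {"id": 1}), ("goodbye", {"id": 2})]

    def download(self) -> bool:
        return True  # nothing to fetch

    def walk(self) -> Generator[Tuple[Any, Dict], None, None]:
        yield from self._docs


manager = InMemoryManager()
manager.download()
contents = [content for content, _ in manager.walk()]  # ["hello world", "goodbye"]
```

Any such subclass (repository files, GitHub issues, or anything else) can then be fed to the embedders below, which only rely on `walk()`.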
src/embedder.py
CHANGED
The diff for `src/embedder.py` threads the new `DataManager` through both embedders and makes the OpenAI `dimensions` parameter optional. The new version, with regions the diff view elides marked `# ...`:

```python
import logging
import os
from abc import ABC, abstractmethod
from collections import Counter
from typing import Dict, Generator, List, Optional, Tuple

import marqo
from openai import OpenAI

from chunker import Chunk, Chunker
from data_manager import DataManager

Vector = Tuple[Dict, List[float]]  # (metadata, embedding)


class BatchEmbedder(ABC):
    """Abstract class for batch embedding of a dataset."""

    @abstractmethod
    def embed_dataset(self, chunks_per_batch: int, max_embedding_jobs: int = None):
        """Issues batch embedding jobs for the entire dataset."""

    @abstractmethod
    def embeddings_are_ready(self) -> bool:
        # ...

    @abstractmethod
    def download_embeddings(self) -> Generator[Vector, None, None]:
        """Yields (chunk_metadata, embedding) pairs for each chunk in the dataset."""


class OpenAIBatchEmbedder(BatchEmbedder):
    """Batch embedder that calls OpenAI. See https://platform.openai.com/docs/guides/batch/overview."""

    def __init__(
        self, data_manager: DataManager, chunker: Chunker, local_dir: str, embedding_model: str, embedding_size: int
    ):
        self.data_manager = data_manager
        self.chunker = chunker
        self.local_dir = local_dir
        self.embedding_model = embedding_model
        # ...
        self.openai_batch_ids = {}
        self.client = OpenAI()

    def embed_dataset(self, chunks_per_batch: int, max_embedding_jobs: int = None):
        """Issues batch embedding jobs for the entire dataset."""
        if self.openai_batch_ids:
            raise ValueError("Embeddings are in progress.")

        batch = []
        chunk_count = 0
        dataset_name = self.data_manager.dataset_id.split("/")[-1]

        for content, metadata in self.data_manager.walk():
            chunks = self.chunker.chunk(content, metadata)
            chunk_count += len(chunks)
            batch.extend(chunks)

            # ...
            for i in range(0, len(batch), chunks_per_batch):
                sub_batch = batch[i : i + chunks_per_batch]
                openai_batch_id = self._issue_job_for_chunks(
                    sub_batch, batch_id=f"{dataset_name}/{len(self.openai_batch_ids)}"
                )
                self.openai_batch_ids[openai_batch_id] = [chunk.metadata for chunk in sub_batch]
                if max_embedding_jobs and len(self.openai_batch_ids) >= max_embedding_jobs:
                    logging.info("Reached the maximum number of embedding jobs. Stopping.")
                    return
        # ...

        # Finally, commit the last batch.
        if batch:
            openai_batch_id = self._issue_job_for_chunks(batch, batch_id=f"{dataset_name}/{len(self.openai_batch_ids)}")
            self.openai_batch_ids[openai_batch_id] = [chunk.metadata for chunk in batch]
        logging.info("Issued %d jobs for %d chunks.", len(self.openai_batch_ids), chunk_count)

        # Save the job IDs to a file, just in case this script is terminated by mistake.
        # ...
        return are_ready

    def download_embeddings(self) -> Generator[Vector, None, None]:
        """Yield a (chunk_metadata, embedding) pair for each chunk in the dataset."""
        job_ids = self.openai_batch_ids.keys()
        statuses = [self.client.batches.retrieve(job_id.strip()) for job_id in job_ids]
        # ...
            f.write("\n")

    @staticmethod
    def _chunks_to_request(chunks: List[Chunk], batch_id: str, model: str, dimensions: Optional[int] = None) -> Dict:
        """Convert a list of chunks to a batch request."""
        body = {
            "model": model,
            "input": [chunk.content for chunk in chunks],
        }

        # These are the only two models that support a dynamic embedding size.
        if model in ["text-embedding-3-small", "text-embedding-3-large"] and dimensions is not None:
            body["dimensions"] = dimensions

        return {
            "custom_id": batch_id,
            "method": "POST",
            "url": "/v1/embeddings",
            "body": body,
        }


class MarqoEmbedder(BatchEmbedder):
    # ...
    Embeddings can be stored locally (in which case `url` in the constructor should point to localhost) or in the cloud.
    """

    def __init__(self, data_manager: DataManager, chunker: Chunker, index_name: str, url: str, model="hf/e5-base-v2"):
        self.data_manager = data_manager
        self.chunker = chunker
        self.client = marqo.Client(url=url)
        self.index = self.client.index(index_name)
        # ...
        if not index_name in all_index_names:
            self.client.create_index(index_name, model=model)

    def embed_dataset(self, chunks_per_batch: int, max_embedding_jobs: int = None):
        """Issues batch embedding jobs for the entire dataset."""
        if chunks_per_batch > 64:
            raise ValueError("Marqo enforces a limit of 64 chunks per batch.")

        chunk_count = 0
        batch = []

        for content, metadata in self.data_manager.walk():
            chunks = self.chunker.chunk(content, metadata)
            chunk_count += len(chunks)
            batch.extend(chunks)

            # ...
            sub_batch = batch[i : i + chunks_per_batch]
            logging.info("Indexing %d chunks...", len(sub_batch))
            self.index.add_documents(
                documents=[chunk.metadata for chunk in sub_batch],
                tensor_fields=["text"],
            )
        # ...

        # Finally, commit the last batch.
        if batch:
            self.index.add_documents(documents=[chunk.metadata for chunk in batch], tensor_fields=["text"])
        logging.info(f"Successfully embedded {chunk_count} chunks.")

    def embeddings_are_ready(self) -> bool:
        """Checks whether the batch embedding jobs are done."""
        # Marqo indexes documents synchronously, so once embed_dataset() returns, the embeddings are ready.
        return True

    def download_embeddings(self) -> Generator[Vector, None, None]:
        """Yields (chunk_metadata, embedding) pairs for each chunk in the dataset."""
        # Marqo stores embeddings as they are created, so they're already in the vector store. No need to download them
        # as we would with e.g. OpenAI, Cohere, or some other cloud-based embedding service.
        return []


def build_batch_embedder_from_flags(data_manager: DataManager, chunker: Chunker, args) -> BatchEmbedder:
    if args.embedder_type == "openai":
        return OpenAIBatchEmbedder(data_manager, chunker, args.local_dir, args.embedding_model, args.embedding_size)
    elif args.embedder_type == "marqo":
        return MarqoEmbedder(
            data_manager, chunker, index_name=args.index_name, url=args.marqo_url, model=args.embedding_model
        )
    else:
        raise ValueError(f"Unrecognized embedder type {args.embedder_type}")
```
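Both embedders build one flat `batch` of chunks and then slice it with `range(0, len(batch), chunks_per_batch)`. That slicing can be checked in isolation; in this sketch, plain strings stand in for `Chunk` objects:

```python
def split_into_sub_batches(batch, chunks_per_batch):
    """Mirrors the sub-batch slicing used by both embedders."""
    return [batch[i : i + chunks_per_batch] for i in range(0, len(batch), chunks_per_batch)]


batch = [f"chunk-{i}" for i in range(10)]
sub_batches = split_into_sub_batches(batch, chunks_per_batch=4)
# Sizes are 4, 4, 2: every chunk lands in exactly one sub-batch, in order.
```

The final partial slice is why both embedders need the trailing "commit the last batch" step when they issue jobs incrementally.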
src/github.py
ADDED
`src/github.py` is new. It defines dataclasses for issues and comments, a `GitHubIssuesManager` that pages through the GitHub REST API, and a `GitHubIssuesChunker` that packs contiguous comments into token-bounded chunks:

```python
"""GitHub-specific implementations for DataManager and Chunker."""

import os
from dataclasses import dataclass
from typing import Any, Dict, Generator, List, Tuple

import logging
import requests
import tiktoken

from chunker import Chunk, Chunker
from data_manager import DataManager

tokenizer = tiktoken.get_encoding("cl100k_base")


@dataclass
class GitHubIssueComment:
    """A comment on a GitHub issue."""

    url: str
    html_url: str
    body: str

    @property
    def pretty(self):
        return f"""## Comment: {self.body}"""


@dataclass
class GitHubIssue:
    """A GitHub issue."""

    url: str
    html_url: str
    title: str
    body: str
    comments: List[GitHubIssueComment]

    @property
    def pretty(self):
        # Do not include the comments.
        return f"# Issue: {self.title}\n{self.body}"


class GitHubIssuesManager(DataManager):
    """Class to manage the GitHub issues of a particular repository."""

    def __init__(self, repo_id: str, max_issues: int = None):
        super().__init__(dataset_id=repo_id + "/issues")
        self.repo_id = repo_id
        self.max_issues = max_issues
        self.access_token = os.getenv("GITHUB_TOKEN")
        if not self.access_token:
            raise ValueError("Please set the GITHUB_TOKEN environment variable when indexing GitHub issues.")
        self.issues = []

    def download(self) -> bool:
        """Downloads all open issues from a GitHub repository (including the comments)."""
        per_page = min(self.max_issues or 100, 100)  # 100 is the maximum per page
        url = f"https://api.github.com/repos/{self.repo_id}/issues?per_page={per_page}"
        while url:
            print(f"Fetching issues from {url}")
            response = self._get_page_of_issues(url)
            response.raise_for_status()
            for issue in response.json():
                if not "pull_request" in issue:
                    self.issues.append(
                        GitHubIssue(
                            url=issue["url"],
                            html_url=issue["html_url"],
                            title=issue["title"],
                            # When there's no body, issue["body"] is None.
                            body=issue["body"] or "",
                            comments=self._get_comments(issue["comments_url"]),
                        )
                    )
            if self.max_issues and len(self.issues) >= self.max_issues:
                break
            url = GitHubIssuesManager._get_next_link_from_header(response)
        return True

    def walk(self) -> Generator[Tuple[Any, Dict], None, None]:
        """Yields a tuple of (issue_content, issue_metadata) for each GitHub issue in the repository."""
        for issue in self.issues:
            yield issue, {}  # empty metadata

    @staticmethod
    def _get_next_link_from_header(response):
        """
        Given a response from a paginated request, extracts the URL of the next page.

        Example:
            response.headers.get("link") = '<https://api.github.com/repositories/2503910/issues?per_page=10&page=2>; rel="next", <https://api.github.com/repositories/2503910/issues?per_page=10&page=2>; rel="last"'
            get_next_link_from_header(response) = 'https://api.github.com/repositories/2503910/issues?per_page=10&page=2'
        """
        link_header = response.headers.get("link")
        if link_header:
            links = link_header.split(", ")
            for link in links:
                url, rel = link.split("; ")
                url = url[1:-1]  # The URL is enclosed in angle brackets
                rel = rel[5:-1]  # e.g. rel="next" -> next
                if rel == "next":
                    return url
        return None

    def _get_page_of_issues(self, url):
        """Downloads a single page of issues. Note that GitHub uses pagination for long lists of objects."""
        return requests.get(
            url,
            headers={
                "Authorization": f"Bearer {self.access_token}",
                "X-GitHub-Api-Version": "2022-11-28",
            },
        )

    def _get_comments(self, comments_url) -> List[GitHubIssueComment]:
        """Downloads all the comments associated with an issue; returns an empty list if the request times out."""
        try:
            response = requests.get(
                comments_url,
                headers={
                    "Authorization": f"Bearer {self.access_token}",
                    "X-GitHub-Api-Version": "2022-11-28",
                },
            )
        except requests.exceptions.ConnectionTimeout:
            logging.warn(f"Timeout fetching comments from {comments_url}")
            return []
        comments = []
        for comment in response.json():
            comments.append(
                GitHubIssueComment(
                    url=comment["url"],
                    html_url=comment["html_url"],
                    body=comment["body"],
                )
            )
        return comments


@dataclass
class IssueChunk(Chunk):
    """A chunk from a GitHub issue with a contiguous (sub)set of comments.

    Note that, in comparison to FileChunk, its properties are not cached. We want to allow fields to be changed in place
    and have e.g. the token count be recomputed. Compared to files, GitHub issues are typically smaller, so the overhead
    is less problematic.
    """

    issue: GitHubIssue
    start_comment: int
    end_comment: int  # exclusive

    @property
    def content(self) -> str:
        """The title of the issue, followed by the comments in the chunk."""
        if self.start_comment == 0:
            # This is the first subsequence of comments. We'll include the entire body of the issue.
            issue_str = self.issue.pretty
        else:
            # This is a middle subsequence of comments. We'll only include the title of the issue.
            issue_str = f"# Issue: {self.issue.title}"
        # Now add the comments themselves.
        comments = self.issue.comments[self.start_comment : self.end_comment]
        comments_str = "\n\n".join([comment.pretty for comment in comments])
        return issue_str + "\n\n" + comments_str

    @property
    def metadata(self):
        """Converts the chunk to a dictionary that can be passed to a vector store."""
        return {
            "id": f"{self.issue.html_url}_{self.start_comment}_{self.end_comment}",
            "url": self.issue.html_url,
            "start_comment": self.start_comment,
            "end_comment": self.end_comment,
            # Note to developer: When choosing a large chunk size, you might exceed the vector store's metadata
            # size limit. In that case, you can simply store the start/end comment indices above, and fetch the
            # content of the issue on demand from the URL.
            "text": self.content,
        }

    @property
    def num_tokens(self):
        """Number of tokens in this chunk."""
        return len(tokenizer.encode(self.content, disallowed_special=()))


class GitHubIssuesChunker(Chunker):
    """Chunks a GitHub issue into smaller pieces of contiguous (sub)sets of comments."""

    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens

    def chunk(self, content: Any, metadata: Dict) -> List[Chunk]:
        """Chunks a GitHub issue into subsequences of comments."""
        del metadata  # The metadata of the input issue is unused.

        issue = content  # Rename for clarity.
        if not isinstance(issue, GitHubIssue):
            raise ValueError(f"Expected a GitHubIssue, got {type(issue)}.")

        chunks = []

        # First, create a chunk for the issue body.
        issue_body_chunk = IssueChunk(issue, 0, 0)
        chunks.append(issue_body_chunk)

        for comment_idx, comment in enumerate(issue.comments):
            # This is just approximate, because when we actually add a comment to the chunk there might be some extra
            # tokens, like a "Comment:" prefix.
            approx_comment_size = len(tokenizer.encode(comment.body, disallowed_special=())) + 20  # 20 for buffer

            if chunks[-1].num_tokens + approx_comment_size > self.max_tokens:
                # Create a new chunk starting from this comment.
                chunks.append(
                    IssueChunk(
                        issue=issue,
                        start_comment=comment_idx,
                        end_comment=comment_idx + 1,
                    )
                )
            else:
                # Add the comment to the existing chunk.
                chunks[-1].end_comment = comment_idx + 1
        return chunks
```
src/index.py
CHANGED
@@ -4,9 +4,10 @@ import argparse
|
|
| 4 |
import logging
|
| 5 |
import time
|
| 6 |
|
| 7 |
-
from chunker import
|
| 8 |
-
from
|
| 9 |
-
from
|
|
|
|
| 10 |
from vector_store import build_from_args
|
| 11 |
|
| 12 |
logging.basicConfig(level=logging.INFO)
|
|
@@ -31,43 +32,42 @@ def _read_extensions(path):
|
|
| 31 |
|
| 32 |
|
| 33 |
def main():
|
| 34 |
-
parser = argparse.ArgumentParser(description="Batch-embeds a repository")
|
| 35 |
parser.add_argument("repo_id", help="The ID of the repository to index")
|
| 36 |
-
parser.add_argument("--
|
| 37 |
parser.add_argument(
|
| 38 |
-
"--
|
| 39 |
type=str,
|
| 40 |
default=None,
|
| 41 |
help="The embedding model. Defaults to `text-embedding-ada-002` for OpenAI and `hf/e5-base-v2` for Marqo.",
|
| 42 |
)
|
| 43 |
parser.add_argument(
|
| 44 |
-
"--
|
| 45 |
type=int,
|
| 46 |
default=None,
|
| 47 |
-
help="The embedding size to use for OpenAI
|
| 48 |
-
"
|
| 49 |
-
"No need to specify an embedding size for Marqo, since the embedding model determines it.",
|
| 50 |
)
|
| 51 |
-
parser.add_argument("--
|
| 52 |
parser.add_argument(
|
| 53 |
-
"--
|
| 54 |
default="repos",
|
| 55 |
help="The local directory to store the repository",
|
| 56 |
)
|
| 57 |
parser.add_argument(
|
| 58 |
-
"--
|
| 59 |
type=int,
|
| 60 |
default=800,
|
| 61 |
help="https://arxiv.org/pdf/2406.14497 recommends a value between 200-800.",
|
| 62 |
)
|
| 63 |
parser.add_argument(
|
| 64 |
-
"--
|
| 65 |
type=int,
|
| 66 |
default=2000,
|
| 67 |
help="Maximum chunks per batch. We recommend 2000 for the OpenAI embedder. Marqo enforces a limit of 64.",
|
| 68 |
)
|
| 69 |
parser.add_argument(
|
| 70 |
-
"--
|
| 71 |
required=True,
|
| 72 |
help="Vector store index name. For Pinecone, make sure to create it with the right embedding size.",
|
| 73 |
)
|
|
@@ -81,16 +81,30 @@ def main():
|
|
| 81 |
help="Path to a file containing a list of extensions to exclude. One extension per line.",
|
| 82 |
)
|
| 83 |
parser.add_argument(
|
| 84 |
-
"--
|
| 85 |
type=int,
|
| 86 |
help="Maximum number of embedding jobs to run. Specifying this might result in "
|
| 87 |
"indexing only part of the repository, but prevents you from burning through OpenAI credits.",
|
| 88 |
)
|
| 89 |
parser.add_argument(
|
| 90 |
-
"--
|
| 91 |
default="http://localhost:8882",
|
| 92 |
help="URL for the Marqo server. Required if using Marqo as embedder or vector store.",
|
| 93 |
)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 94 |
args = parser.parse_args()
|
| 95 |
|
| 96 |
# Validate embedder and vector store compatibility.
|
|
@@ -111,56 +125,81 @@ def main():
|
|
| 111 |
parser.error(f"The maximum number of chunks per job is {MAX_TOKENS_PER_JOB}.")
|
| 112 |
if args.include and args.exclude:
|
| 113 |
parser.error("At most one of --include and --exclude can be specified.")
|
|
|
|
|
|
|
| 114 |
|
| 115 |
# Set default values based on other arguments
|
| 116 |
if args.embedding_model is None:
|
| 117 |
args.embedding_model = "text-embedding-ada-002" if args.embedder_type == "openai" else "hf/e5-base-v2"
|
| 118 |
if args.embedding_size is None and args.embedder_type == "openai":
|
| 119 |
args.embedding_size = OPENAI_DEFAULT_EMBEDDING_SIZE.get(args.embedding_model)
|
| 120 |
-
# No need to set embedding_size for Marqo, since the embedding model determines the embedding size.
|
| 121 |
-
logging.warn("--embedding_size is ignored for Marqo embedder.")
|
| 122 |
-
|
| 123 |
-
included_extensions = _read_extensions(args.include) if args.include else None
|
| 124 |
-
excluded_extensions = _read_extensions(args.exclude) if args.exclude else None
|
| 125 |
-
|
| 126 |
-
logging.info("Cloning the repository...")
|
| 127 |
-
repo_manager = RepoManager(
|
| 128 |
-
args.repo_id,
|
| 129 |
-
local_dir=args.local_dir,
|
| 130 |
-
included_extensions=included_extensions,
|
| 131 |
-
excluded_extensions=excluded_extensions,
|
| 132 |
-
)
|
| 133 |
-
repo_manager.clone()
|
| 134 |
|
| 135 |
-
|
| 136 |
-
|
| 137 |
-
|
| 138 |
-
|
| 139 |
-
|
| 140 |
-
|
| 141 |
-
|
| 142 |
-
|
| 143 |
)
|
| 144 |
-
|
| 145 |
-
|
| 146 |
-
|
| 147 |
-
|
| 148 |
|
| 149 |
if args.vector_store_type == "marqo":
|
| 150 |
# Marqo computes embeddings and stores them in the vector store at once, so we're done.
|
| 151 |
logging.info("Done!")
|
| 152 |
return
|
| 153 |
|
| 154 |
-
|
| 155 |
-
|
| 156 |
-
|
| 157 |
-
|
| 158 |
|
| 159 |
-
logging.info("Moving embeddings to the vector store...")
|
| 160 |
-
# Note to developer: Replace this with your preferred vector store.
|
| 161 |
-
vector_store = build_from_args(args)
|
| 162 |
-
vector_store.ensure_exists()
|
| 163 |
-
vector_store.upsert(embedder.download_embeddings())
|
| 164 |
logging.info("Done!")
|
| 165 |
|
| 166 |
|
| 4 |
import logging
|
| 5 |
import time
|
| 6 |
|
| 7 |
+
from chunker import UniversalFileChunker
|
| 8 |
+
from data_manager import GitHubRepoManager
|
| 9 |
+
from embedder import build_batch_embedder_from_flags
|
| 10 |
+
from github import GitHubIssuesChunker, GitHubIssuesManager
|
| 11 |
from vector_store import build_from_args
|
| 12 |
|
| 13 |
logging.basicConfig(level=logging.INFO)
|
| 32 |
|
| 33 |
|
| 34 |
def main():
|
| 35 |
+
parser = argparse.ArgumentParser(description="Batch-embeds a GitHub repository and its issues.")
|
| 36 |
parser.add_argument("repo_id", help="The ID of the repository to index")
|
| 37 |
+
parser.add_argument("--embedder-type", default="openai", choices=["openai", "marqo"])
|
| 38 |
parser.add_argument(
|
| 39 |
+
"--embedding-model",
|
| 40 |
type=str,
|
| 41 |
default=None,
|
| 42 |
help="The embedding model. Defaults to `text-embedding-ada-002` for OpenAI and `hf/e5-base-v2` for Marqo.",
|
| 43 |
)
|
| 44 |
parser.add_argument(
|
| 45 |
+
"--embedding-size",
|
| 46 |
type=int,
|
| 47 |
default=None,
|
| 48 |
+
help="The embedding size to use for OpenAI text-embedding-3* models. Defaults to 1536 for small and 3072 for "
|
| 49 |
+
"large. Note that no other OpenAI models support a dynamic embedding size, nor do models used with Marqo.",
|
| 50 |
)
|
| 51 |
+
parser.add_argument("--vector-store-type", default="pinecone", choices=["pinecone", "marqo"])
|
| 52 |
parser.add_argument(
|
| 53 |
+
"--local-dir",
|
| 54 |
default="repos",
|
| 55 |
help="The local directory to store the repository",
|
| 56 |
)
|
| 57 |
parser.add_argument(
|
| 58 |
+
"--tokens-per-chunk",
|
| 59 |
type=int,
|
| 60 |
default=800,
|
| 61 |
help="https://arxiv.org/pdf/2406.14497 recommends a value between 200-800.",
|
| 62 |
)
|
| 63 |
parser.add_argument(
|
| 64 |
+
"--chunks-per-batch",
|
| 65 |
type=int,
|
| 66 |
default=2000,
|
| 67 |
help="Maximum chunks per batch. We recommend 2000 for the OpenAI embedder. Marqo enforces a limit of 64.",
|
| 68 |
)
|
| 69 |
parser.add_argument(
|
| 70 |
+
"--index-name",
|
| 71 |
required=True,
|
| 72 |
help="Vector store index name. For Pinecone, make sure to create it with the right embedding size.",
|
| 73 |
)
|
| 81 |
help="Path to a file containing a list of extensions to exclude. One extension per line.",
|
| 82 |
)
|
| 83 |
parser.add_argument(
|
| 84 |
+
"--max-embedding-jobs",
|
| 85 |
type=int,
|
| 86 |
help="Maximum number of embedding jobs to run. Specifying this might result in "
|
| 87 |
"indexing only part of the repository, but prevents you from burning through OpenAI credits.",
|
| 88 |
)
|
| 89 |
parser.add_argument(
|
| 90 |
+
"--marqo-url",
|
| 91 |
default="http://localhost:8882",
|
| 92 |
help="URL for the Marqo server. Required if using Marqo as embedder or vector store.",
|
| 93 |
)
|
| 94 |
+
# Pass --no-index-repo in order to not index the repository.
|
| 95 |
+
parser.add_argument(
|
| 96 |
+
"--index-repo",
|
| 97 |
+
action=argparse.BooleanOptionalAction,
|
| 98 |
+
default=True,
|
| 99 |
+
help="Whether to index the repository. At least one of --index-repo and --index-issues must be True.",
|
| 100 |
+
)
|
| 101 |
+
# Pass --no-index-issues in order to not index the issues.
|
| 102 |
+
parser.add_argument(
|
| 103 |
+
"--index-issues",
|
| 104 |
+
action=argparse.BooleanOptionalAction,
|
| 105 |
+
default=True,
|
| 106 |
+
help="Whether to index GitHub issues. At least one of --index-repo and --index-issues must be True.",
|
| 107 |
+
)
|
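The new `--index-repo` / `--index-issues` flags use `argparse.BooleanOptionalAction` (Python 3.9+), which auto-generates the `--no-*` negations mentioned in the comments above. A minimal standalone sketch of that behavior:

```python
import argparse

parser = argparse.ArgumentParser()
# BooleanOptionalAction also registers a --no-index-issues negation flag.
parser.add_argument("--index-issues", action=argparse.BooleanOptionalAction, default=True)

assert parser.parse_args([]).index_issues is True            # default
assert parser.parse_args(["--no-index-issues"]).index_issues is False
assert parser.parse_args(["--index-issues"]).index_issues is True
```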
| 108 |
args = parser.parse_args()
|
| 109 |
|
| 110 |
# Validate embedder and vector store compatibility.
|
| 125 |
parser.error(f"The maximum number of tokens per job is {MAX_TOKENS_PER_JOB}.")
|
| 126 |
if args.include and args.exclude:
|
| 127 |
parser.error("At most one of --include and --exclude can be specified.")
|
| 128 |
+
if not args.index_repo and not args.index_issues:
|
| 129 |
+
parser.error("At least one of --index-repo and --index-issues must be true.")
|
| 130 |
|
| 131 |
# Set default values based on other arguments
|
| 132 |
if args.embedding_model is None:
|
| 133 |
args.embedding_model = "text-embedding-ada-002" if args.embedder_type == "openai" else "hf/e5-base-v2"
|
| 134 |
if args.embedding_size is None and args.embedder_type == "openai":
|
| 135 |
args.embedding_size = OPENAI_DEFAULT_EMBEDDING_SIZE.get(args.embedding_model)
|
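The flag-defaulting logic above can be isolated as a pure function. Note that the `OPENAI_DEFAULT_EMBEDDING_SIZE` values below are assumptions inferred from the `--embedding-size` help text (1536 for `text-embedding-3-small`, 3072 for `text-embedding-3-large`), not the actual constant from this repository:

```python
# Assumed mapping, inferred from the --embedding-size help text.
OPENAI_DEFAULT_EMBEDDING_SIZE = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
}

def resolve_embedding_defaults(embedder_type, embedding_model=None, embedding_size=None):
    """Mirror the defaulting performed in main() after parse_args()."""
    if embedding_model is None:
        embedding_model = "text-embedding-ada-002" if embedder_type == "openai" else "hf/e5-base-v2"
    if embedding_size is None and embedder_type == "openai":
        # Models without a dynamic embedding size resolve to None here.
        embedding_size = OPENAI_DEFAULT_EMBEDDING_SIZE.get(embedding_model)
    return embedding_model, embedding_size
```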
| 136 |
|
| 137 |
+
######################
|
| 138 |
+
# Step 1: Embeddings #
|
| 139 |
+
######################
|
| 140 |
+
|
| 141 |
+
# Index the repository.
|
| 142 |
+
repo_embedder = None
|
| 143 |
+
if args.index_repo:
|
| 144 |
+
included_extensions = _read_extensions(args.include) if args.include else None
|
| 145 |
+
excluded_extensions = _read_extensions(args.exclude) if args.exclude else None
|
| 146 |
+
|
| 147 |
+
logging.info("Cloning the repository...")
|
| 148 |
+
repo_manager = GitHubRepoManager(
|
| 149 |
+
args.repo_id,
|
| 150 |
+
local_dir=args.local_dir,
|
| 151 |
+
included_extensions=included_extensions,
|
| 152 |
+
excluded_extensions=excluded_extensions,
|
| 153 |
)
|
| 154 |
+
repo_manager.download()
|
| 155 |
+
logging.info("Embedding the repo...")
|
| 156 |
+
chunker = UniversalFileChunker(max_tokens=args.tokens_per_chunk)
|
| 157 |
+
repo_embedder = build_batch_embedder_from_flags(repo_manager, chunker, args)
|
| 158 |
+
repo_embedder.embed_dataset(args.chunks_per_batch, args.max_embedding_jobs)
|
| 159 |
+
|
| 160 |
+
# Index the GitHub issues.
|
| 161 |
+
issues_embedder = None
|
| 162 |
+
|
| 163 |
+
if args.index_issues:
|
| 164 |
+
logging.info("Issuing embedding jobs for GitHub issues...")
|
| 165 |
+
issues_manager = GitHubIssuesManager(args.repo_id)
|
| 166 |
+
issues_manager.download()
|
| 167 |
+
logging.info("Embedding GitHub issues...")
|
| 168 |
+
chunker = GitHubIssuesChunker(max_tokens=args.tokens_per_chunk)
|
| 169 |
+
issues_embedder = build_batch_embedder_from_flags(issues_manager, chunker, args)
|
| 170 |
+
issues_embedder.embed_dataset(args.chunks_per_batch, args.max_embedding_jobs)
|
| 171 |
+
|
| 172 |
+
########################
|
| 173 |
+
# Step 2: Vector Store #
|
| 174 |
+
########################
|
| 175 |
|
| 176 |
if args.vector_store_type == "marqo":
|
| 177 |
# Marqo computes embeddings and stores them in the vector store at once, so we're done.
|
| 178 |
logging.info("Done!")
|
| 179 |
return
|
| 180 |
|
| 181 |
+
if repo_embedder is not None:
|
| 182 |
+
logging.info("Waiting for repo embeddings to be ready...")
|
| 183 |
+
while not repo_embedder.embeddings_are_ready():
|
| 184 |
+
logging.info("Sleeping for 30 seconds...")
|
| 185 |
+
time.sleep(30)
|
| 186 |
+
|
| 187 |
+
logging.info("Moving embeddings to the repo vector store...")
|
| 188 |
+
repo_vector_store = build_from_args(args)
|
| 189 |
+
repo_vector_store.ensure_exists()
|
| 190 |
+
repo_vector_store.upsert(repo_embedder.download_embeddings())
|
| 191 |
+
|
| 192 |
+
if issues_embedder is not None:
|
| 193 |
+
logging.info("Waiting for issue embeddings to be ready...")
|
| 194 |
+
while not issues_embedder.embeddings_are_ready():
|
| 195 |
+
logging.info("Sleeping for 30 seconds...")
|
| 196 |
+
time.sleep(30)
|
| 197 |
+
|
| 198 |
+
logging.info("Moving embeddings to the issues vector store...")
|
| 199 |
+
issues_vector_store = build_from_args(args)
|
| 200 |
+
issues_vector_store.ensure_exists()
|
| 201 |
+
issues_vector_store.upsert(issues_embedder.download_embeddings())
|
| 202 |
|
|
| 203 |
logging.info("Done!")
|
| 204 |
|
| 205 |
|
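Both wait loops above poll `embeddings_are_ready()` every 30 seconds with no upper bound, so a stalled batch job would keep the script alive forever. The same pattern with a timeout guard can be sketched as follows (the helper name is hypothetical, not part of this codebase):

```python
import time

def wait_until_ready(is_ready, poll_seconds=30, timeout_seconds=4 * 3600):
    """Poll is_ready() until it returns True, or raise TimeoutError at the deadline."""
    deadline = time.monotonic() + timeout_seconds
    while not is_ready():
        if time.monotonic() >= deadline:
            raise TimeoutError("Embeddings were not ready before the deadline.")
        time.sleep(poll_seconds)

# Example: a callable that becomes ready on the third poll.
calls = {"count": 0}
def fake_ready():
    calls["count"] += 1
    return calls["count"] >= 3

wait_until_ready(fake_ready, poll_seconds=0, timeout_seconds=5)
assert calls["count"] == 3
```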