juliaturc committed on
Commit 451b0cb
1 Parent(s): 6abbb16

Improve the docs based on stargazers' feedback (#25)

Files changed (1)
  1. README.md +108 -91
README.md CHANGED
@@ -1,126 +1,142 @@
- # What is this?
-
- *TL;DR*: `repo2vec` is a simple-to-use, modular library enabling you to chat with any public or private codebase.
-
- ![screenshot](assets/chat_screenshot.png)
-
- **Ok, but why chat with a codebase?**
-
- Sometimes you just want to learn how a codebase works and how to integrate it, without spending hours sifting through
- the code itself.
-
- `repo2vec` is like GitHub Copilot but with the most up-to-date information about your repo.
-
- Features:
- - **Dead-simple set-up.** Run *two scripts* and you have a functional chat interface for your code. That's really it.
- - **Heavily documented answers.** Every response shows where in the code the context for the answer was pulled from. Let's build trust in the AI.
- - **Runs locally or on the cloud.**
-   - Want privacy? No problem: you can use [Marqo](https://github.com/marqo-ai/marqo) for embeddings + vector store and [Ollama](https://ollama.com) for the chat LLM.
-   - Want speed and high performance? Also no problem. We support OpenAI batch embeddings + [Pinecone](https://www.pinecone.io/) for the vector store + OpenAI or Anthropic for the chat LLM.
- - **Plug-and-play.** Want to improve the algorithms powering the code understanding/generation? We've made every component of the pipeline easily swappable. Google-grade engineering standards allow you to customize to your heart's content.
-
- # How to run it
-
- ## Installation
- To install the library, simply run `pip install repo2vec`.
-
- ## Indexing the codebase
- We currently support two options for indexing the codebase:
-
- 1. **Locally**, using the open-source [Marqo vector store](https://github.com/marqo-ai/marqo). Marqo is both an embedder (you can choose your favorite embedding model from Hugging Face) and a vector store.
-
-    You can bring up a Marqo instance using Docker:
-    ```
-    docker rm -f marqo
-    docker pull marqoai/marqo:latest
-    docker run --name marqo -it -p 8882:8882 marqoai/marqo:latest
-    ```
-
-    Then, to index your codebase, run:
-    ```
-    index github-repo-name \ # e.g. Storia-AI/repo2vec
-        --embedder-type=marqo \
-        --vector-store-type=marqo \
-        --index-name=your-index-name
-    ```
-
- 2. **Using external providers** (OpenAI for embeddings and [Pinecone](https://www.pinecone.io/) for the vector store). To index your codebase, run:
-    ```
-    export OPENAI_API_KEY=...
-    export PINECONE_API_KEY=...
-
-    index github-repo-name \ # e.g. Storia-AI/repo2vec
-        --embedder-type=openai \
-        --vector-store-type=pinecone \
-        --index-name=your-index-name
-    ```
- We are planning on adding more providers soon, so that you can mix and match them. Contributions are also welcome!
-
- ## Indexing GitHub issues
- You can additionally index GitHub issues by setting the `--index-issues` flag. Conversely, you can turn off indexing the code (and solely index issues) by passing `--no-index-repo`.
-
- ## Chatting with the codebase
- We provide a `gradio` app where you can chat with your codebase. You can use either a local LLM (via [Ollama](https://ollama.com)), or a cloud provider like OpenAI or Anthropic.
-
- To chat with a local LLM:
- 1. Head over to [ollama.com](https://ollama.com) to download the appropriate binary for your machine.
- 2. Pull the desired model, e.g. `ollama pull llama3.1`.
- 3. Start the `gradio` app:
-    ```
-    chat github-repo-name \ # e.g. Storia-AI/repo2vec
-        --llm-provider=ollama \
-        --llm-model=llama3.1 \
-        --vector-store-type=marqo \ # or pinecone
-        --index-name=your-index-name
-    ```
-
- To chat with a cloud-based LLM, for instance Anthropic's Claude:
-    ```
-    export ANTHROPIC_API_KEY=...
-
-    chat github-repo-name \ # e.g. Storia-AI/repo2vec
-        --llm-provider=anthropic \
-        --llm-model=claude-3-opus-20240229 \
-        --vector-store-type=marqo \ # or pinecone
-        --index-name=your-index-name
-    ```
- To get a public URL for your chat app, set `--share=true`.
-
- # Peeking under the hood
-
- ## Indexing the repo
- The `repo2vec/index.py` script performs the following steps:
- 1. **Clones a GitHub repository**. See [RepoManager](repo2vec/repo_manager.py).
-    - Make sure to set the `GITHUB_TOKEN` environment variable for private repositories.
- 2. **Chunks files**. See [Chunker](repo2vec/chunker.py).
-    - For code files, we implement a special `CodeChunker` that takes the parse tree into account.
- 3. **Batch-embeds chunks**. See [Embedder](repo2vec/embedder.py). We currently support:
-    - [Marqo](https://github.com/marqo-ai/marqo) as an embedder, which allows you to specify your favorite Hugging Face embedding model, and
-    - OpenAI's [batch embedding API](https://platform.openai.com/docs/guides/batch/overview), which is much faster and cheaper than the regular synchronous embedding API.
- 4. **Stores embeddings in a vector store**. See [VectorStore](repo2vec/vector_store.py).
-    - We currently support [Marqo](https://github.com/marqo-ai/marqo) and [Pinecone](https://www.pinecone.io/), but you can easily plug in your own.
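The clone/chunk/embed/store pipeline described above can be sketched in miniature. Everything below is an illustrative stand-in, not the actual `repo2vec` API: the class and function names only mirror the real components in `repo2vec/` (the real `CodeChunker` respects the parse tree, and the real embedders call Marqo or OpenAI).

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    """A contiguous span of a source file, ready for embedding."""
    filename: str
    content: str

def chunk_file(filename: str, text: str, max_chars: int = 80) -> list[Chunk]:
    """Naive chunker: fixed character windows (a stand-in for CodeChunker)."""
    return [Chunk(filename, text[i:i + max_chars]) for i in range(0, len(text), max_chars)]

def embed(chunks: list[Chunk]) -> list[list[float]]:
    """Stand-in embedder: a trivial one-dimensional bag-of-bytes vector per chunk."""
    return [[sum(map(ord, c.content)) / max(len(c.content), 1)] for c in chunks]

# Toy "vector store": a parallel list of (vector, chunk) pairs.
store: list[tuple[list[float], Chunk]] = []

chunks = chunk_file("README.md", "repo2vec indexes your code so you can chat with it." * 3)
store.extend(zip(embed(chunks), chunks))
```

Swapping any stage for a real provider is exactly what the abstract classes in `repo2vec/` are for.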
-
- Note that you can specify an inclusion or exclusion set for the file extensions you want indexed. To specify an extension inclusion set, add the `--include` flag:
- ```
- index repo-org/repo-name --include=/path/to/file/with/extensions
- ```
- Conversely, to specify an extension exclusion set, add the `--exclude` flag:
- ```
- index repo-org/repo-name --exclude=repo2vec/sample-exclude.txt
- ```
- Extensions must be specified one per line, in the form `.ext`.
-
- ## Chatting via RAG
- The `repo2vec/chat.py` script brings up a [Gradio app](https://www.gradio.app/) with a chat interface as shown above. We use [LangChain](https://langchain.com) to define a RAG chain which, given a user query about the repository:
-
- 1. Rewrites the query to be self-contained based on previous queries
- 2. Embeds the rewritten query using OpenAI embeddings
- 3. Retrieves relevant documents from the vector store
- 4. Calls a chat LLM to respond to the user query based on the retrieved documents.
-
- The sources are conveniently surfaced in the chat and linked directly to GitHub.
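The four-step RAG chain can be illustrated with plain-Python stubs. This is not the actual LangChain chain from `repo2vec/chat.py`; every function below is a simplified stand-in for the step named in its docstring.

```python
def rewrite_query(query: str, history: list[str]) -> str:
    """Step 1 (stub): make the query self-contained. A real chain asks an LLM
    to resolve references like 'it' against the chat history."""
    return query if not history else f"{history[-1]} -> {query}"

def embed_query(query: str) -> list[float]:
    """Step 2 (stub): stand-in for OpenAI embeddings."""
    return [float(len(query))]

def retrieve(vector: list[float], docs: dict[str, list[float]], k: int = 1) -> list[str]:
    """Step 3: nearest neighbors by (toy) L1 distance over the vector store."""
    return sorted(docs, key=lambda d: abs(docs[d][0] - vector[0]))[:k]

def answer(query: str, sources: list[str]) -> str:
    """Step 4 (stub): a real chain sends the query plus retrieved chunks to a chat LLM."""
    return f"Answer to {query!r} based on {sources}"

docs = {"chunker.py": [10.0], "embedder.py": [30.0]}
q = rewrite_query("how does chunking work?", history=[])
print(answer(q, retrieve(embed_query(q), docs)))
```

The real chain also keeps the retrieved chunks around so that each response can link its sources back to GitHub.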
 
  # Changelog
  - 2024-09-03: `repo2vec` is now available on pypi.
  - 2024-09-03: Support for indexing GitHub issues.
  - 2024-08-30: Support for running everything locally (Marqo for embeddings, Ollama for LLMs).
@@ -134,6 +150,7 @@ If you're the maintainer of an OSS repo and would like a dedicated page on Code
  ![](assets/sage.gif)

  # Extensions & Contributions
  We purposely built the code to be modular so that you can plug in your desired embedding, LLM and vector store providers by simply implementing the relevant abstract classes.

- Feel free to send feature requests to [founders@storia.ai](mailto:founders@storia.ai) or make a pull request!
 
+ <div align="center">
+   <h1 align="center">repo2vec</h1>
+   <p align="center">An open-source pair programmer for chatting with any codebase.</p>
+   <figure>
+     <img src="assets/chat_screenshot2.png" alt="screenshot" style="max-height: 500px; border: 1px solid black;">
+     <figcaption align="center" style="font-size: smaller;">Our chat window, showing a conversation with the Transformers library. 🚀</figcaption>
+   </figure>
+ </div>
+
+ # Getting started
+
+ ## Installation
+
+ To install the library, simply run `pip install repo2vec`!
+
+ ## Prerequisites
+ `repo2vec` performs two steps:
+
+ 1. Indexes your codebase (requiring an embedder and a vector store)
+ 2. Enables chatting via LLM + RAG (requiring access to an LLM)
+
+ <details open>
+ <summary><strong>:computer: Running locally</strong></summary>
+
+ 1. To index the codebase locally, we use the open-source project <a href="https://github.com/marqo-ai/marqo">Marqo</a>, which is both an embedder and a vector store. To bring up a Marqo instance:
+    ```
+    docker rm -f marqo
+    docker pull marqoai/marqo:latest
+    docker run --name marqo -it -p 8882:8882 marqoai/marqo:latest
+    ```
+
+ 2. To chat with an LLM locally, we use <a href="https://github.com/ollama/ollama">Ollama</a>:
+
+    - Head over to [ollama.com](https://ollama.com) to download the appropriate binary for your machine.
+    - Pull the desired model, e.g. `ollama pull llama3.1`.
+ </details>
+
+ <details>
+ <summary><strong>:cloud: Using external providers</strong></summary>
+
+ 1. We support <a href="https://openai.com/">OpenAI</a> for embeddings (they have a super fast batch embedding API) and <a href="https://www.pinecone.io/">Pinecone</a> for the vector store, so you will need two API keys:
+
+    ```
+    export OPENAI_API_KEY=...
+    export PINECONE_API_KEY=...
+    ```
+
+ 2. For chatting with an LLM, we support OpenAI and Anthropic. For the latter, set an additional API key:
+
+    ```
+    export ANTHROPIC_API_KEY=...
+    ```
+ </details>
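When using external providers, a quick sanity check that the keys are exported saves a failed `index` run partway through embedding. This helper is not part of `repo2vec`; it is a small sketch you could run (or adapt) yourself before invoking the CLI.

```python
import os

def missing_keys(required: list[str]) -> list[str]:
    """Return the names of required environment variables that are unset or empty."""
    return [k for k in required if not os.environ.get(k)]

# For the external-provider path you need both of these; add ANTHROPIC_API_KEY
# if you plan to chat via Claude.
required = ["OPENAI_API_KEY", "PINECONE_API_KEY"]
print("missing:", missing_keys(required))
```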
+
+ <br>
+ <details>
+ <summary><strong>Optional</strong></summary>
+ If you are planning on indexing GitHub issues in addition to the codebase, you will need a GitHub token:
+
+ ```
+ export GITHUB_TOKEN=...
+ ```
+ </details>
+
+ ## Running it
+ <details open>
+ <summary><strong>:computer: Running locally</strong></summary>
+ <p>To index the codebase:</p>
+
+ ```
+ index github-repo-name \ # e.g. Storia-AI/repo2vec
+     --embedder-type=marqo \
+     --vector-store-type=marqo \
+     --index-name=your-index-name
+ ```
+
+ <p>To chat with your codebase:</p>
+
+ ```
+ chat github-repo-name \
+     --vector-store-type=marqo \
+     --index-name=your-index-name \
+     --llm-provider=ollama \
+     --llm-model=llama3.1
+ ```
+ </details>
+
+ <details open>
+ <summary><strong>:cloud: Using external providers</strong></summary>
+ <p>To index the codebase:</p>
+
+ ```
+ index github-repo-name \ # e.g. Storia-AI/repo2vec
+     --embedder-type=openai \
+     --vector-store-type=pinecone \
+     --index-name=your-index-name
+ ```
+
+ <p>To chat with your codebase:</p>
+
+ ```
+ chat github-repo-name \
+     --vector-store-type=pinecone \
+     --index-name=your-index-name \
+     --llm-provider=openai \
+     --llm-model=gpt-4
+ ```
+
+ To get a public URL for your chat app, set `--share=true`.
+ </details>
+
+ ## Additional features
+ - **Control which files get indexed** based on their extension. You can whitelist or blacklist extensions by passing a file with one extension per line (in the format `.ext`):
+   - To only index a whitelist of files:
+     ```
+     index ... --include=/path/to/extensions/file
+     ```
+   - To index all code except a blacklist of files:
+     ```
+     index ... --exclude=/path/to/extensions/file
+     ```
+ - **Index open GitHub issues** (remember to `export GITHUB_TOKEN=...`):
+   - To index GitHub issues without comments:
+     ```
+     index ... --index-issues
+     ```
+   - To index GitHub issues with comments:
+     ```
+     index ... --index-issues --index-issue-comments
+     ```
+   - To index GitHub issues, but not the codebase:
+     ```
+     index ... --index-issues --no-index-repo
+     ```
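The extension file passed to `--include`/`--exclude` is plain text with one `.ext` entry per line. As a quick illustration (the filename `my-include.txt` below is our own choice, not a `repo2vec` convention), you can generate one like this:

```python
from pathlib import Path

# Write a whitelist file in the format `index --include` expects:
# one extension per line, each starting with a dot.
extensions = [".py", ".md", ".toml"]
include_file = Path("my-include.txt")
include_file.write_text("\n".join(extensions) + "\n")

print(include_file.read_text())
```

Then pass it to the indexer: `index repo-org/repo-name --include=my-include.txt`.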
+
+ # Why chat with a codebase?
+
+ Sometimes you just want to learn how a codebase works and how to integrate it, without spending hours sifting through
+ the code itself.
+
+ `repo2vec` is like an open-source GitHub Copilot with the most up-to-date information about your repo.
+
+ Features:
+
+ - **Dead-simple set-up.** Run *two scripts* and you have a functional chat interface for your code. That's really it.
+ - **Heavily documented answers.** Every response shows where in the code the context for the answer was pulled from. Let's build trust in the AI.
+ - **Runs locally or on the cloud.**
+ - **Plug-and-play.** Want to improve the algorithms powering the code understanding/generation? We've made every component of the pipeline easily swappable. Google-grade engineering standards allow you to customize to your heart's content.
  # Changelog
+
  - 2024-09-03: `repo2vec` is now available on pypi.
  - 2024-09-03: Support for indexing GitHub issues.
  - 2024-08-30: Support for running everything locally (Marqo for embeddings, Ollama for LLMs).
 
  ![](assets/sage.gif)

  # Extensions & Contributions
+
  We purposely built the code to be modular so that you can plug in your desired embedding, LLM and vector store providers by simply implementing the relevant abstract classes.
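As a sketch of what "implementing the relevant abstract classes" looks like, here is a toy custom embedder. The `Embedder` base class below is a stand-in we define ourselves; the real interface lives in `repo2vec/embedder.py` and its method names and signatures may differ.

```python
from abc import ABC, abstractmethod

class Embedder(ABC):
    """Stand-in for the abstract embedder interface (the real one is in
    repo2vec/embedder.py and may differ)."""

    @abstractmethod
    def embed(self, texts: list[str]) -> list[list[float]]: ...

class CharCountEmbedder(Embedder):
    """A toy provider that embeds each text as a single-feature vector."""

    def embed(self, texts: list[str]) -> list[list[float]]:
        return [[float(len(t))] for t in texts]

vectors = CharCountEmbedder().embed(["chat", "with your repo"])
print(vectors)  # [[4.0], [14.0]]
```

The same pattern applies to swapping in a custom LLM or vector store provider.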
+ Feel free to send feature requests to [founders@storia.ai](mailto:founders@storia.ai) or make a pull request!