shahzeb171 committed on
Commit
e0b4927
·
1 Parent(s): 60344c1

Readme update

Files changed (1)
  1. README.md +16 -224
README.md CHANGED
@@ -1,224 +1,16 @@
1
- # 🔍 code-compass
2
-
3
- An AI-powered tool for analyzing code repositories using hierarchical chunking and semantic search with the Pinecone vector database.
4
-
5
- ## 🚀 Features
6
-
7
- - **📥 Multiple Input Methods**: GitHub URLs or ZIP file uploads
8
- - **🧠 Hierarchical Chunking**: Smart code parsing at multiple levels (file → class → function → block)
9
- - **🔍 Semantic Search**: AI-powered natural language queries using Pinecone vector database
10
- - **🤖 Intelligent Analysis**: Local LLM integration with Qwen2.5-Coder-7B-Instruct
11
- - **💬 Conversation History**: Maintains context across multiple queries
12
- - **📊 Repository Analytics**: Comprehensive statistics and structure analysis
13
- - **🎯 Pinecone Integration**: Scalable vector database with automatic embedding generation
14
- - **⚡ Optimized Performance**: Quantized models for efficient local inference
15
-
16
- ## 🛠️ Setup
17
-
18
- ### Prerequisites
19
-
20
- 1. **Python 3.8+**
21
- 2. **Pinecone Account**: Create a free account at [Pinecone.io](https://www.pinecone.io/)
22
- 3. **System Requirements** for the LLM:
23
- - **RAM**: 8GB minimum (16GB+ recommended)
24
- - **Storage**: 5-8GB free space for model
25
- - **CPU**: Multi-core processor (supports GPU acceleration if available)
26
-
27
- ### Installation
28
-
29
- 1. **Clone or download this project**
30
- ```bash
31
- git clone https://github.com/shahzeb171/code-compass.git
32
- cd code-compass
33
- ```
34
-
35
- 2. **Install dependencies**
36
- ```bash
37
- pip install -r requirements.txt
38
- ```
39
-
40
- 3. **Download the LLM model**
41
- ```bash
42
- wget https://huggingface.co/bartowski/Qwen2.5-Coder-7B-Instruct-GGUF/resolve/main/Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf
43
- ```
44
- **Recommended**: Select Q4_K_M for the best balance of quality and performance.
45
-
46
- 4. **Set up Pinecone API Key**
47
-
48
- Create a `config.py` file:
49
- ```python
50
- PINECONE_API_KEY = "your-pinecone-api-key-here"
51
- PINECONE_INDEX_NAME = "code-compass-index"  # your index name (lowercase letters, digits, hyphens)
52
- PINECONE_EMBEDDING_MODEL = "llama-text-embed-v2"  # see the Pinecone docs for other models
53
- MODEL_PATH = "Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf"  # path to the downloaded model
54
- ```
55
-
56
- ### Getting Your Pinecone API Key
57
-
58
- 1. Go to [Pinecone.io](https://www.pinecone.io/) and sign up for a free account
59
- 2. Navigate to the "API Keys" section in your dashboard
60
- 3. Create a new API key or copy an existing one
61
- 4. The free tier includes:
62
- - 1 index
63
- - 5M vector dimensions
64
- - Enough for most code analysis projects!
65
-
66
- ## 🚀 Usage
67
-
68
- 1. **Start the application**
69
- ```bash
70
- python main.py
71
- ```
72
-
73
- 2. **Open your browser** to `http://localhost:7860`
74
-
75
- 3. **Load a repository**
76
- - Enter a GitHub URL (e.g., `https://github.com/pallets/flask`)
77
- - Or upload a ZIP file of your code
78
- - Click "📁 Load Repository"
79
-
80
- 4. **Process the repository**
81
- - Click "🚀 Process Repository" to analyze and chunk your code
82
- - This creates hierarchical chunks and stores them in Pinecone with automatic embedding generation
83
- - Wait for processing to complete (may take 1-5 minutes depending on repo size)
84
-
85
- 5. **Initialize the AI model** (Optional but recommended)
86
- - Click "🚀 Initialize LLM" to start loading the local AI model
87
- - This will load Qwen2.5-Coder-7B-Instruct for intelligent code analysis
88
- - Initial loading takes 1-3 minutes
89
-
90
- 6. **Query your code**
91
- - Ask natural language questions like:
92
- - "What does this repository do?"
93
- - "Show me authentication functions"
94
- - "How is error handling implemented?"
95
- - "What are the main classes?"
96
- - Toggle "Use AI Analysis" for intelligent responses vs basic search results
97
- - The AI maintains conversation context for follow-up questions
98
-
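The conversation-context behavior in step 6 can be sketched as a running message list that is passed back to the model on every turn. The structure below is assumed for illustration; it is not the app's actual code.

```python
# Sketch of conversation history: each turn is appended so follow-up
# questions carry context (structure assumed, not the app's actual code).
history = []

def ask(question, answer_fn):
    history.append({"role": "user", "content": question})
    answer = answer_fn(history)          # the model sees the full history
    history.append({"role": "assistant", "content": answer})
    return answer

ask("What does this repository do?", lambda h: "It analyzes code repositories.")
follow_up = ask("How is the code chunked?", lambda h: f"answered with {len(h) - 1} prior messages")
```

Because the assistant's replies are appended too, a follow-up question arrives with the whole exchange attached.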
99
- ## 📊 How It Works
100
-
101
- ### Hierarchical Chunking Strategy
102
-
103
- The system creates multiple levels of code chunks:
104
-
105
- **Level 1: File Context**
106
- - Complete file overview with imports and purpose
107
- - Metadata: file path, language, total lines
108
-
109
- **Level 2: Class Chunks**
110
- - Full class definitions with inheritance and methods
111
- - Metadata: class name, methods list, relationships
112
-
113
- **Level 3: Function Chunks**
114
- - Individual function implementations with signatures
115
- - Metadata: function name, arguments, complexity score
116
-
117
- **Level 4: Code Block Chunks**
118
- - Sub-chunks for complex functions (loops, conditionals, error handling)
119
- - Metadata: block type, purpose, parent function
120
-
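A chunk at any of the four levels can be pictured as a record pairing the code text with its metadata. The field names below are assumptions for illustration, not the project's actual schema.

```python
# Hypothetical chunk record matching the four levels described above
from dataclasses import dataclass, field

@dataclass
class CodeChunk:
    level: int        # 1 = file, 2 = class, 3 = function, 4 = block
    chunk_type: str   # "file" | "class" | "function" | "block"
    content: str      # the code text that gets embedded
    file_path: str
    metadata: dict = field(default_factory=dict)

# A Level 3 (function) chunk carrying the metadata the README lists
fn_chunk = CodeChunk(
    level=3,
    chunk_type="function",
    content="def login(user, password): ...",
    file_path="auth/views.py",
    metadata={"name": "login", "args": ["user", "password"], "complexity": 2},
)
```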
121
- ### Vector Search Process
122
-
123
- 1. **Embedding Generation**: Code chunks are converted to vector embeddings using SentenceTransformers
124
- 2. **Vector Storage**: Embeddings stored in Pinecone with rich metadata
125
- 3. **Semantic Search**: User queries are embedded and matched against stored vectors
126
- 4. **Hybrid Filtering**: Results filtered by chunk type, file path, repository, etc.
127
- 5. **Ranked Results**: Most relevant code sections returned with similarity scores
128
-
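The five steps above can be illustrated end to end with a toy example. Real embeddings come from SentenceTransformers and live in Pinecone; a bag-of-words stand-in keeps this sketch self-contained.

```python
# Toy illustration of embed -> store -> search -> filter -> rank
import math
from collections import Counter

def embed(text):
    return Counter(text.lower().split())        # stand-in "embedding"

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

store = [
    {"text": "def check_password(user): verify password hash", "chunk_type": "function"},
    {"text": "class Logger: write error messages to a file", "chunk_type": "class"},
]
for item in store:
    item["vector"] = embed(item["text"])        # steps 1-2: embed and store

query = embed("check password")                 # step 3: embed the user query
hits = sorted(                                  # step 4: filter by chunk type
    (i for i in store if i["chunk_type"] == "function"),
    key=lambda i: cosine(query, i["vector"]),   # step 5: rank by similarity
    reverse=True,
)
```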
129
- ## 🔧 Configuration Options
130
-
131
- ### Supported Languages
132
-
133
- Currently optimized for Python with basic support for:
134
- - JavaScript/TypeScript
135
- - Java
136
- - C/C++
137
- - Go
138
- - Rust
139
- - PHP
140
- - Ruby
141
-
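Language support of this kind is typically routed by file extension; a minimal sketch follows (the mapping is illustrative, not the project's actual table).

```python
# Hypothetical extension-to-language routing for the list above
from pathlib import Path

EXT_TO_LANG = {
    ".py": "python", ".js": "javascript", ".ts": "typescript",
    ".java": "java", ".c": "c", ".cpp": "c++", ".go": "go",
    ".rs": "rust", ".php": "php", ".rb": "ruby",
}

def detect_language(path):
    # Returns None for unsupported files, which are skipped during chunking
    return EXT_TO_LANG.get(Path(path).suffix.lower())
```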
142
- ## 📝 Example Repositories
143
-
144
- Try these public repositories:
145
-
146
- - **Flask**: `https://github.com/pallets/flask` - Web framework
147
- - **Requests**: `https://github.com/requests/requests` - HTTP library
148
- - **FastAPI**: `https://github.com/tiangolo/fastapi` - Modern web framework
149
- - **Black**: `https://github.com/psf/black` - Code formatter
150
-
151
- ## 🔍 Example Queries
152
-
153
- ### General Repository Understanding
154
- - "What is the main purpose of this repository?"
155
- - "What are the core components and how do they interact?"
156
- - "Show me the project architecture overview"
157
-
158
- ### Function & Class Discovery
159
- - "What are the main classes and their responsibilities?"
160
- - "Show me all authentication-related functions"
161
- - "Find functions that handle file operations"
162
- - "What utility functions are available?"
163
-
164
- ### Implementation Analysis
165
- - "How is error handling implemented?"
166
- - "Show me configuration management code"
167
- - "Find database-related functions"
168
- - "How does logging work in this project?"
169
-
170
- ### Code Patterns
171
- - "Show me decorator implementations"
172
- - "Find async/await usage patterns"
173
- - "What design patterns are used?"
174
- - "How are tests structured?"
175
-
176
- ## 🛟 Troubleshooting
177
-
178
- ### Common Issues
179
-
180
- **"Pinecone API key is required"**
181
- - Make sure you've set the `PINECONE_API_KEY` environment variable
182
- - Or enter it in the Advanced Options section
183
-
184
- **"Error downloading repository"**
185
- - Check that the GitHub URL is correct and public
186
- Ensure you have an internet connection
187
- Large repositories may time out - try smaller repos first
188
-
189
- **"No chunks generated"**
190
- - Make sure the repository contains supported code files
191
- - Check that ZIP files aren't corrupted
192
- - Python files work best currently
193
-
194
- **"Vector store initialization failed"**
195
- - Verify your Pinecone API key is valid
196
- - Check your Pinecone account hasn't exceeded free tier limits
197
- - Try a different environment region if needed
198
-
199
- ### Performance Tips
200
-
201
- - Start with smaller repositories (< 100 files) to test
202
- - Python repositories work best currently
203
- - Processing time scales with repository size
204
- - Queries are fast once processing is complete
205
-
206
- ## 🔮 Future Enhancements
207
-
208
- - **More Language Support**: Better parsing for JavaScript, Java, etc.
209
- - **Code Generation**: AI-powered code completion and generation
210
- - **Diff Analysis**: Compare changes between repository versions
211
- - **Team Collaboration**: Share analyzed repositories
212
- - **Custom Embeddings**: Fine-tuned models for specific domains
213
- - **API Integration**: REST API for programmatic access
214
-
215
- ## 🤝 Contributing
216
-
217
- Contributions are welcome! Please open issues or submit pull requests.
218
-
219
- ## 📞 Support
220
-
221
- For issues or questions:
222
- 1. Check the troubleshooting section above
223
- 2. Open a GitHub issue with detailed error messages
224
- 3. Include your Python version and OS information
 
1
+ ---
2
+ title: Code Compass
3
+ emoji: 💬
4
+ colorFrom: yellow
5
+ colorTo: purple
6
+ sdk: gradio
7
+ sdk_version: 5.42.0
8
+ app_file: app.py
9
+ pinned: false
10
+ hf_oauth: true
11
+ hf_oauth_scopes:
12
+ - inference-api
13
+ short_description: An AI-powered tool for analyzing code repositories
14
+ ---
15
+
16
+ An example chatbot using [Gradio](https://gradio.app), [`huggingface_hub`](https://huggingface.co/docs/huggingface_hub/v0.22.2/en/index), and the [Hugging Face Inference API](https://huggingface.co/docs/api-inference/index).