Commit 66142ac (parent: b75c304)

docs: Completely rewrite README addressing 'Vibe Coding' deficit with 3D Visualizer demo

Files changed: README.md (+47, −126) · assets/demo.webp (+0, −0, binary)
README.md (CHANGED)

@@ -13,161 +13,82 @@ tags:

Removed (previous version):

- coding-agent
---

# 🔍 …

> **The system …**
| Dimension | What It Measures |
|----------------|---------------|
| Navigation efficiency | Did it read relevant files first? |
| Reasoning patterns | Did it follow read→write→test? |
| Context usage | How much of what it read was useful? |
| Security | Did it write safe code? |
| Robustness | Can it handle misleading comments? |
```
…
→ runs tests to verify
→ submits for final grade
```
| Task | Difficulty | Description | Variants |
|------|-----------|-------------|----------|
| task1 | Easy | Single-file bug repair | 5 |
| task2 | Medium | Cross-module interface bug + regression test | 5 |
| task3 | Hard | Feature implementation from spec | 5 |
### 1. Run Locally (No Docker)
```bash
pip install -r requirements.txt
python app.py  # Gradio UI at http://localhost:7860
```

### 2. Run Agent (No LLM needed)
```bash
python run_agent.py             # deterministic agent demo
python run_agent.py --all-tasks # run all 3 tasks
```
### …
```bash
export HF_TOKEN=hf_xxxxx
# …
```

### …
```bash
docker build -t codebase-nav-env .
docker run -p 7860:7860 codebase-nav-env
```
## …
```bash
# Reset
curl -X POST "http://localhost:7860/reset?task=task1"

# Take action
curl -X POST http://localhost:7860/step \
  -H "Content-Type: application/json" \
  -d '{"action_type":"read_file","path":"src/auth.py"}'

# Submit
curl -X POST http://localhost:7860/step \
  -H "Content-Type: application/json" \
  -d '{"action_type":"submit"}'

# Get evaluation
curl http://localhost:7860/evaluate
```
## API Endpoints

### Core (OpenEnv-compliant)
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/…` | | |
| `/…` | | |
| `/…` | | |
| `/…` | | |

### Evaluation Layer
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/trajectory` | GET | Full action log with timing and diffs |
| `/evaluate` | GET | Multi-dimensional scores (6 axes) |
| `/metrics` | GET | Memory, security, timeline stats |
| `/fault-config` | POST | Enable fault injection |

## Evaluation Dimensions

```
efficiency  [████████████████░░░░] 0.800 — 5 steps vs 4 optimal
navigation  [████████████████████] 1.000 — read relevant files first
correctness [██████████████░░░░░░] 0.714 — 71.4% tests passing
reasoning   [████████████████████] 1.000 — correct read→write→test pattern
robustness  [████████████████████] 1.000 — no errors encountered
security    [████████████████████] 1.000 — no unsafe code detected
```
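The efficiency read-out above is consistent with a simple optimal-versus-actual step ratio (4 optimal steps over 5 taken = 0.800). A minimal sketch of that interpretation; the real formula lives in `server/evaluator.py` and is not shown here, so the capping and bar-rendering details below are assumptions:

```python
# Hypothetical sketch: reproduce the "efficiency" read-out above.
# Assumption: efficiency = optimal_steps / actual_steps, capped at 1.0.
# The actual scoring in server/evaluator.py may differ.

def efficiency_score(actual_steps: int, optimal_steps: int) -> float:
    if actual_steps <= 0:
        return 0.0
    return min(1.0, optimal_steps / actual_steps)

def render_bar(name: str, score: float, width: int = 20) -> str:
    # Render a 20-cell bar like the ones in the README read-out.
    filled = round(score * width)
    bar = "█" * filled + "░" * (width - filled)
    return f"{name:<11} [{bar}] {score:.3f}"

if __name__ == "__main__":
    print(render_bar("efficiency", efficiency_score(5, 4)))
```

With 5 steps taken against 4 optimal this prints a 16/20-filled bar at 0.800, matching the figure shown.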
## Project Structure

```
codebase-nav-env/
├── app.py                 # Gradio UI + FastAPI (HF Space entry point)
├── run_agent.py           # Standalone HF agent (deterministic + LLM)
├── inference.py           # OpenEnv inference script ([START]/[STEP]/[END])
├── server/
│   ├── app.py             # FastAPI endpoints
│   ├── environment.py     # Core RL environment
│   ├── models.py          # Pydantic models
│   ├── grader.py          # pytest runner
│   ├── repo_loader.py     # Template loader
│   ├── sandbox.py         # Secure subprocess
│   ├── trajectory.py      # Full trajectory recording
│   ├── evaluator.py       # 6-dimension scoring engine
│   ├── fault_injection.py # Robustness testing
│   ├── security.py        # Unsafe code detection
│   └── memory.py          # Context efficiency tracking
├── repo_templates/        # 15 task variants
│   ├── task1/             # 5 single-file bug variants
│   ├── task2/             # 5 cross-module bug variants
│   └── task3/             # 5 feature implementation variants
├── openenv.yaml           # Environment metadata
├── Dockerfile             # Docker build
├── requirements.txt       # Dependencies
└── README.md              # This file
```
## Why This Is Real-World

This isn't a toy benchmark. It tests the **exact capabilities** production coding agents need:

- **Navigate unfamiliar code** — agent sees only file names, not contents
- **Budget exploration** — finite steps mean strategic reading matters
- **Verify fixes** — must run tests, not just hope the fix works
- **Handle noise** — real repos have misleading comments and dead code
- **Write safe code** — production agents can't `eval()` or `os.system()`

## License

…
Added (new version):
# 🔍 The Antidote to "Vibe Coding" — AI Reliability & Navigation Platform

> **The system repairing the era of blind AI coding by making agents structural, testable, and deeply debuggable.**

**Play with the live environment:** [Interactive Hugging Face Space](https://huggingface.co/spaces/Chirag0123/codebase-nav-env)

## 🚨 The End of "Vibe Coding"

We are officially in the era of **Vibe Coding**: AI-generated code is exploding in volume, yet developers and AI agents (Copilot, Devin, Claude Code) increasingly write and submit code *blindly*.

Most agents don't actually know **where the issue exists**, what the **code flow** looks like, or how the **function dependencies** cascade. They simply guess edits based on the prompt until a test happens to pass. When an AI agent claims "I fixed the bug," how do you verify *how* it did it? Did it actually navigate to the source of the crash, or did it change syntax at random until the test turned green?

Current benchmarks only evaluate the final outcome. **They don't evaluate cognition.**

## 💡 The Solution: 3D Visualization & Deep Analytic Execution
This project is not just another environment benchmark: it is a **diagnostic platform**. It forces autonomous AI agents to explore an unknown Python repository file by file, then exposes their **exact cognitive flow** through a bespoke **3D Visualizer**.

By tracking structural behavior instead of just outcomes, the platform gives researchers and operators complete visibility into the agent's actual navigation flow and decision-making.

### 🎬 See It In Action (Demo)

![3D Agent Visualizer Demo](assets/demo.webp)
*(A live recording of the 3D agent visualizer tracking test files, source files, and resolving dependencies)*

## 🧠 Core Intelligence Modules
Unlike existing benchmarks, we evaluate **how** the agent works, using several research-grade engines:

| Module | What It Does (The Antidote to Vibe Coding) |
|--------|--------------------------------------------|
| **Causal Graph Probe** | Detects "shortcut learning". Did the agent actually read the test file, trace its imported module, and fix the root cause, or did it guess blindly? |
| **Confidence Calibrator** | Infers behavioral confidence from commit speed, rewrite hesitation, and test-verification ratios. |
| **Counterfactual Engine** | Analyzes the precise trace to determine whether the agent's strategy is brittle and relies heavily on memorized repository layouts. |
| **Episodic Memory Bank** | A cross-episode RAG store that captures mistakes (like failing to run tests before committing) and injects those lessons into future system prompts. |
| **3D Trace Visualizer** | An interactive, fully interpolated 3D engine that renders repos as geometric maps (cubes for source, prisms for tests) and draws the agent's navigation traces as glowing Catmull-Rom tube curves. |
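The Causal Graph Probe's core question can be sketched offline: given the action trace and an import graph, was every edited file reachable (via imports) from a test file the agent actually read? A simplified, hypothetical illustration; the real probe builds its DAG server-side and its internals are not shown here:

```python
# Hypothetical sketch of the Causal Graph Probe idea, not the server's code.
# An edit counts as "grounded" if it is reachable through the import graph
# from a test file the agent read before editing blindly.

def causally_grounded(trace: list[dict], imports: dict[str, list[str]]) -> bool:
    """trace: ordered actions; imports: file -> list of files it imports."""
    read = [a["path"] for a in trace if a["action_type"] == "read_file"]
    edits = [a["path"] for a in trace if a["action_type"] == "write_file"]
    reachable: set[str] = set()
    stack = [f for f in read if f.startswith("tests/")]
    while stack:                      # walk the import graph from each read test
        node = stack.pop()
        for dep in imports.get(node, []):
            if dep not in reachable:
                reachable.add(dep)
                stack.append(dep)
    return bool(edits) and all(e in reachable for e in edits)

trace = [
    {"action_type": "read_file", "path": "tests/test_auth.py"},
    {"action_type": "read_file", "path": "src/auth.py"},
    {"action_type": "write_file", "path": "src/auth.py"},
]
imports = {"tests/test_auth.py": ["src/auth.py"], "src/auth.py": ["src/utils.py"]}
print(causally_grounded(trace, imports))  # True: the edit was traced from the test
```

A trace that writes to `src/auth.py` without ever reading the test would score `False` — the "guessed blindly" case the probe is meant to flag.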
## ⚙️ How It Works (The OpenEnv Standard)

1. **Agent loads an unfamiliar environment** → sees the repo file tree (NOT file contents).
2. Agent reads files one at a time (each read costs an exploration step).
3. Agent identifies structural vulnerabilities via function-flow analysis.
4. Agent writes the fixed code.
5. Agent verifies functionality through containerized `pytest` execution.
6. The environment scores the agent across 6 research axes.
## 🚀 Quick Start

### 1. Run Locally (No Docker)
```bash
pip install -r requirements.txt
python app.py  # Gradio UI + FastAPI at http://localhost:7860
```
### 2. Connect Your Custom LLM Agent
```bash
export HF_TOKEN=hf_xxxxx
# Point your script at the local FastAPI /step endpoint
python inference.py
```
### 3. Deploy via Docker
```bash
docker build -t codebase-nav-env .
docker run -p 7860:7860 codebase-nav-env
```
## 📊 Evaluation API Layers

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/step` | POST | Takes a single OpenEnv navigation action (`read_file`, `write_file`) |
| `/evaluate` | GET | Baseline evaluation metrics |
| `/causal-probe` | GET | Builds a directed acyclic graph mapping true root-cause logic |
| `/confidence` | GET | Returns behavior-derived confidence estimates |
| `/counterfactual` | POST | Subjects the agent to 6 robustness ablation tests to detect hallucination |
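The `/confidence` endpoint is described as behavioral (commit speed, rewrite hesitation, test-verification ratios). A hypothetical sketch of how such signals could be combined; the feature names and 50/50 weighting below are illustrative assumptions, not the server's actual formula:

```python
# Hypothetical sketch of a behavioral confidence signal.
# The real Confidence Calibrator runs server-side; the features and
# weights here are illustrative assumptions only.
from dataclasses import dataclass

@dataclass
class EpisodeStats:
    writes: int             # total write_file actions
    rewrites: int           # writes that touched an already-edited file
    tests_after_write: int  # writes followed by a test run before the next write

def behavioral_confidence(stats: EpisodeStats) -> float:
    """Score in [0, 1]: low rewrite churn and frequent verification read as confident."""
    if stats.writes == 0:
        return 0.0
    hesitation = stats.rewrites / stats.writes           # 0 = no second-guessing
    verification = stats.tests_after_write / stats.writes
    score = 0.5 * (1.0 - hesitation) + 0.5 * verification
    return max(0.0, min(1.0, score))

print(behavioral_confidence(EpisodeStats(writes=4, rewrites=1, tests_after_write=3)))  # 0.75
```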
*Stop trusting the vibe. Force the cognition.*

assets/demo.webp (ADDED)