Chirag0123 committed on
Commit 66142ac · 1 Parent(s): b75c304

docs: Completely rewrite README addressing 'Vibe Coding' deficit with 3D Visualizer demo

Files changed (2)
  1. README.md +47 -126
  2. assets/demo.webp +0 -0
README.md CHANGED
@@ -13,161 +13,82 @@ tags:
13
  - coding-agent
14
  ---
15
 
16
- # 🔍 Codebase Navigation & Repair OpenEnv
17
 
18
- > **The system that makes AI coding agents reliable, testable, and debuggable.**
19
 
20
- ## The Problem
21
 
22
- AI coding agents (Copilot, Devin, Cursor) fail on 25% or more of complex tasks. Current benchmarks tell you the score but not **why** the agent failed. Was it poor navigation? Wasted steps? Hallucinated code? There is no way to know.
23
 
24
- ## Our Solution
25
 
26
- An RL environment where agents navigate unfamiliar Python repos, find bugs, and fix them graded by **actual pytest execution** with **process-level evaluation**.
27
 
28
- Unlike existing benchmarks, we evaluate **how** the agent works, not just the final output:
29
 
30
- | What We Measure | Why It Matters |
31
- |----------------|---------------|
32
- | Navigation efficiency | Did it read relevant files first? |
33
- | Reasoning patterns | Did it follow read→write→test? |
34
- | Context usage | How much of what it read was useful? |
35
- | Security | Did it write safe code? |
36
- | Robustness | Can it handle misleading comments? |
37
 
38
- ## How It Works
39
 
40
- ```
41
- Agent resets environment → sees repo file tree (NOT contents)
42
- reads files one at a time (costs steps)
43
- → identifies bugs in source code
44
- writes fixed code
45
- → runs tests to verify
46
- → submits for final grade
47
- ```
48
 
49
- ### Tasks
50
 
51
- | Task | Difficulty | Description | Variants |
52
- |------|-----------|-------------|----------|
53
- | task1 | Easy | Single-file bug repair | 5 |
54
- | task2 | Medium | Cross-module interface bug + regression test | 5 |
55
- | task3 | Hard | Feature implementation from spec | 5 |
56
 
57
- Each variant has structurally different code, so the agent can't memorize solutions.
58
 
59
- ## Quick Start
 
60
 
61
  ### 1. Run Locally (No Docker)
62
  ```bash
63
  pip install -r requirements.txt
64
- python app.py # Gradio UI at http://localhost:7860
65
- ```
66
-
67
- ### 2. Run Agent (No LLM needed)
68
- ```bash
69
- python run_agent.py # deterministic agent demo
70
- python run_agent.py --all-tasks # run all 3 tasks
71
  ```
72
 
73
- ### 3. Run Agent with LLM
74
  ```bash
75
  export HF_TOKEN=hf_xxxxx
76
- python run_agent.py --llm --task task1
 
77
  ```
78
 
79
- ### 4. Docker
80
  ```bash
81
  docker build -t codebase-nav-env .
82
  docker run -p 7860:7860 codebase-nav-env
83
  ```
84
 
85
- ### 5. API Usage
86
- ```bash
87
- # Reset
88
- curl -X POST "http://localhost:7860/reset?task=task1"
89
-
90
- # Take action
91
- curl -X POST http://localhost:7860/step \
92
- -H "Content-Type: application/json" \
93
- -d '{"action_type":"read_file","path":"src/auth.py"}'
94
-
95
- # Submit
96
- curl -X POST http://localhost:7860/step \
97
- -d '{"action_type":"submit"}'
98
-
99
- # Get evaluation
100
- curl http://localhost:7860/evaluate
101
- ```
102
-
103
- ## API Endpoints
104
 
105
- ### Core (OpenEnv-compliant)
106
  | Endpoint | Method | Description |
107
  |----------|--------|-------------|
108
- | `/reset?task=task1` | POST | Start new episode |
109
- | `/step` | POST | Take one action |
110
- | `/state` | GET | Get current state |
111
- | `/health` | GET | Health check |
112
-
113
- ### Evaluation Layer
114
- | Endpoint | Method | Description |
115
- |----------|--------|-------------|
116
- | `/trajectory` | GET | Full action log with timing and diffs |
117
- | `/evaluate` | GET | Multi-dimensional scores (6 axes) |
118
- | `/metrics` | GET | Memory, security, timeline stats |
119
- | `/fault-config` | POST | Enable fault injection |
120
-
121
- ## Evaluation Dimensions
122
-
123
- ```
124
- efficiency [████████████████░░░░] 0.800 — 5 steps vs 4 optimal
125
- navigation [████████████████████] 1.000 — read relevant files first
126
- correctness [██████████████░░░░░░] 0.714 — 71.4% tests passing
127
- reasoning [████████████████████] 1.000 — correct read→write→test pattern
128
- robustness [████████████████████] 1.000 — no errors encountered
129
- security [████████████████████] 1.000 — no unsafe code detected
130
- ```
131
-
132
- ## Project Structure
133
-
134
- ```
135
- codebase-nav-env/
136
- ├── app.py # Gradio UI + FastAPI (HF Space entry point)
137
- ├── run_agent.py # Standalone HF agent (deterministic + LLM)
138
- ├── inference.py # OpenEnv inference script ([START]/[STEP]/[END])
139
- ├── server/
140
- │ ├── app.py # FastAPI endpoints
141
- │ ├── environment.py # Core RL environment
142
- │ ├── models.py # Pydantic models
143
- │ ├── grader.py # pytest runner
144
- │ ├── repo_loader.py # Template loader
145
- │ ├── sandbox.py # Secure subprocess
146
- │ ├── trajectory.py # Full trajectory recording
147
- │ ├── evaluator.py # 6-dimension scoring engine
148
- │ ├── fault_injection.py # Robustness testing
149
- │ ├── security.py # Unsafe code detection
150
- │ └── memory.py # Context efficiency tracking
151
- ├── repo_templates/ # 15 task variants
152
- │ ├── task1/ # 5 single-file bug variants
153
- │ ├── task2/ # 5 cross-module bug variants
154
- │ └── task3/ # 5 feature implementation variants
155
- ├── openenv.yaml # Environment metadata
156
- ├── Dockerfile # Docker build
157
- ├── requirements.txt # Dependencies
158
- └── README.md # This file
159
- ```
160
-
161
- ## Why This Is Real-World
162
-
163
- This isn't a toy benchmark. It tests the **exact capabilities** production coding agents need:
164
-
165
- - **Navigate unfamiliar code** — agent sees only file names, not contents
166
- - **Budget exploration** — finite steps mean strategic reading matters
167
- - **Verify fixes** — must run tests, not just hope the fix works
168
- - **Handle noise** — real repos have misleading comments and dead code
169
- - **Write safe code** — production agents can't `eval()` or `os.system()`
170
-
171
- ## License
172
 
173
- MIT
 
13
  - coding-agent
14
  ---
15
 
16
+ # 🔍 The Antidote to "Vibe Coding": AI Reliability & Navigation Platform
17
 
18
+ > **The system that ends the era of blind AI coding by making agents structured, testable, and deeply debuggable.**
19
 
20
+ **Play with the live environment:** [Interactive Hugging Face Space](https://huggingface.co/spaces/Chirag0123/codebase-nav-env)
21
 
22
+ ## 🚨 The End of "Vibe Coding"
23
 
24
+ We are officially in the era of **Vibe Coding**. The amount of AI-generated code is exploding, yet developers and AI Agents (Copilot, Devin, Claude Code) are increasingly writing and submitting code *blindly*.
25
 
26
+ Most agents don't actually know **where the issue exists**, what the **code flow** looks like, or how the **function dependencies** cascade. They simply guess edits based on the prompt until a test arbitrarily passes. When an AI agent claims "I fixed the bug," how do you verify *how* it did it? Did it actually navigate to the source of the crash, or did it randomly change syntax until the test turned green?
27
 
28
+ Current benchmarks only evaluate the final outcome. **They don't evaluate cognition.**
29
 
30
+ ## 💡 The Solution: 3D Visualization & Deep Analytic Execution
31
 
32
+ This project is not just a benchmark environment: it is a **diagnostic platform**. It forces autonomous AI agents to explore an unknown Python repository file by file, then exposes their **exact cognitive flow** through our bespoke **3D Visualizer**.
33
 
34
+ By tracking structural behavior instead of just outcomes, our platform gives researchers and operators complete visibility into the AI's actual thought processes and navigation flow.
35
+
36
+ ### 🎬 See It In Action (Demo)
37
+
38
+ ![3D Visualizer Architecture Trace](assets/demo.webp)
39
 
40
+ *(A live recording of the 3D agent visualizer tracking test files, source files, and resolving dependencies)*
41
 
42
+ ## 🧠 Core Intelligence Modules
43
 
44
+ Unlike existing benchmarks, we evaluate **how** the agent works, using several proprietary research-grade engines:
45
 
46
+ | Module | What It Does (The Antidote to Vibe Coding) |
47
+ |--------|--------------------------------------------|
48
+ | **Causal Graph Probe** | Detects "Shortcut Learning". Did the agent actually read the test file, trace its imported module, and fix the root cause, or did it guess blindly? |
49
+ | **Confidence Calibrator** | Infers agent behavioral confidence based on commit speed, rewrite hesitation, and test verification ratios. |
50
+ | **Counterfactual Engine** | Analyzes the trace to determine whether the agent's strategy is brittle and heavily reliant on memorized repository layouts. |
51
+ | **Episodic Memory Bank** | A cross-episode RAG store that captures mistakes (like failing to run tests before committing) and injects hard lessons into future iteration system prompts. |
52
+ | **3D Trace Visualizer** | A seamless, fully-interpolated 3D environment engine that renders repos as geometric maps (Cubes for Source, Prisms for Tests) and visualizes the exact agent navigation traces with glowing Catmull-Rom tube curves. |
53
+
54
+ ## ⚙️ How It Works (The OpenEnv Standard)
55
+
56
+ 1. **Agent loads unfamiliar environment** → sees repo file tree (NOT contents).
57
+ 2. Agent reads files one at a time (costs strict exploration steps).
58
+ 3. Agent identifies structural vulnerabilities via function-flow analysis.
59
+ 4. Agent writes fixed code.
60
+ 5. Agent verifies functionality through containerized `pytest` execution.
61
+ 6. The environment scores the agent dynamically across 6 separate research axes.
62
+
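The loop above can be sketched as a minimal HTTP client. This is a hedged sketch, not the repo's client code: `/reset` and `/step` are the endpoints documented below, and `read_file`/`write_file`/`submit` are the action names this README uses, but the `run_tests` action name and the exact payload fields are illustrative assumptions.

```python
# Minimal sketch of one episode against the local environment server.
# /reset and /step are the documented endpoints; the "run_tests"
# action name and exact payload fields are assumptions.
import json
import urllib.request

BASE = "http://localhost:7860"

def action(action_type, **fields):
    """Build one /step payload, e.g. action("read_file", path="src/auth.py")."""
    return {"action_type": action_type, **fields}

def post(path, payload=None):
    """POST JSON to the environment server and decode the JSON reply."""
    req = urllib.request.Request(
        BASE + path,
        data=json.dumps(payload or {}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__":
    post("/reset?task=task1")                                      # 1. see the file tree only
    post("/step", action("read_file", path="src/auth.py"))         # 2. read files one at a time
    post("/step", action("write_file", path="src/auth.py",
                         content="# fixed source here"))           # 4. write the fix
    post("/step", action("run_tests"))                             # 5. verify via pytest (assumed name)
    print(post("/step", action("submit")))                         # 6. final grade
```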
63
+ ## 🚀 Quick Start
64
 
65
  ### 1. Run Locally (No Docker)
66
  ```bash
67
  pip install -r requirements.txt
68
+ python app.py # Gradio UI + FastAPI at http://localhost:7860
69
  ```
70
 
71
+ ### 2. Connect Your Custom LLM Agent
72
  ```bash
73
  export HF_TOKEN=hf_xxxxx
74
+ # Point your agent at the local FastAPI /step endpoint
75
+ python inference.py
76
  ```
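One way to wire in a custom LLM policy is to keep the environment loop outside the model and ask it for exactly one JSON action per step. The sketch below is not the repo's `inference.py`: the prompt wording, the observation fields, and the model client are assumptions; only the action names come from this README.

```python
# Sketch of an LLM-driven policy: the model answers each observation
# with exactly one JSON action for /step. Prompt wording and the
# observation fields are illustrative assumptions.
import json
import os

SYSTEM_PROMPT = (
    "You are navigating an unfamiliar Python repo to find and fix a bug. "
    "Reply with exactly one JSON action: "
    '{"action_type": "read_file", "path": ...}, '
    '{"action_type": "write_file", "path": ..., "content": ...}, '
    'or {"action_type": "submit"}.'
)

def build_prompt(observation):
    """Chat messages for one decision step."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": json.dumps(observation)},
    ]

if __name__ == "__main__":
    from huggingface_hub import InferenceClient  # pip install huggingface_hub
    client = InferenceClient(token=os.environ["HF_TOKEN"])
    obs = {"file_tree": ["src/auth.py", "tests/test_auth.py"], "steps_left": 10}
    reply = client.chat_completion(messages=build_prompt(obs), max_tokens=256)
    next_action = json.loads(reply.choices[0].message.content)
    # POST next_action to http://localhost:7860/step, then repeat with the new observation.
```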
77
 
78
+ ### 3. Deploy via Docker
79
  ```bash
80
  docker build -t codebase-nav-env .
81
  docker run -p 7860:7860 codebase-nav-env
82
  ```
83
 
84
+ ## 📊 Evaluation API Layers
85
 
 
86
  | Endpoint | Method | Description |
87
  |----------|--------|-------------|
88
+ | `/step` | POST | Takes a single OpenEnv action (`read_file`, `write_file`, `submit`) |
89
+ | `/evaluate` | GET | Baseline evaluation metrics |
90
+ | `/causal-probe` | GET | Builds a directed acyclic graph mapping the agent's actions to the true root cause |
91
+ | `/confidence` | GET | Returns behavior-based confidence estimates (commit speed, rewrite hesitation, test verification ratios) |
92
+ | `/counterfactual` | POST | Subjects agent to 6 robustness ablation tests to detect hallucination |
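After an episode, the diagnostics can be pulled and rendered like the score bars shown in the UI. A sketch under stated assumptions: the endpoint paths match the table above, but the `{"scores": {axis: float}}` response shape is an assumed schema, not documented.

```python
# Sketch: fetch post-episode diagnostics and render them as score bars.
# Endpoint paths match the table above; the {"scores": {axis: float}}
# response shape is an assumption, not a documented schema.
import json
import urllib.request

BASE = "http://localhost:7860"

def get(path):
    """GET a JSON document from the environment server."""
    with urllib.request.urlopen(BASE + path) as resp:
        return json.load(resp)

def format_scores(scores, width=20):
    """Render {axis: score in [0, 1]} as aligned text bars, one per axis."""
    return "\n".join(
        f"{axis:<12} [{'█' * round(score * width):<{width}}] {score:.3f}"
        for axis, score in scores.items()
    )

if __name__ == "__main__":
    print(format_scores(get("/evaluate")["scores"]))   # "scores" field is an assumption
    print(json.dumps(get("/causal-probe"), indent=2))
```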
93
 
94
+ *Stop trusting the vibe. Force the cognition.*
assets/demo.webp ADDED