havinashpatil committed 3f9399a (parent: a448db8): "docs: Rewrite README for hackathon submission"

README.md CHANGED
@@ -1,26 +1,85 @@
 # CodeArena RL Benchmark

-

 ## Features

-- **Adaptive Curriculum**: The environment supports an `auto` difficulty mode that dynamically scales task complexity
-- **Complex Shaped Rewards**: Rewards are a weighted composite
-
-
-
-
-
 - **Extensive Task Categories**: Includes standard algorithmic tasks, `type_errors`, and `security_bugs`.
-- **

 ## Architecture

-
 - `frontend/`: React + Vite frontend for live monitoring and manual intervention.
 - `tasks/`: Task definitions stored in OpenEnv-compatible JSON schema.
 - `inference.py`: CLI runner for evaluating RL agents, supporting both OpenAI-compatible APIs and native HuggingFace `transformers` pipelines.

 ## Setup

 1. **Install Dependencies:**
@@ -30,7 +89,7 @@ CodeArena is an OpenEnv-compatible reinforcement learning benchmark for autonomo
 ```

 2. **Generate New Tasks:**
-To populate the extended task categories (`type_errors` and `security_bugs`), run
 ```bash
 python create_tasks.py
 ```
@@ -78,3 +137,9 @@ This generates `reward_curve.png` and `reward_by_task.png` in the `results/` dir
 ## OpenEnv Compatibility

 This benchmark strictly adheres to the OpenEnv specification. See `openenv.yaml` for full configuration details.
# CodeArena RL Benchmark

GitHub Copilot, Cursor, Devin: every major coding AI is benchmarked on generation. Can it write a function? Can it complete a snippet? Nobody benchmarks what happens when the code breaks and the agent has to reason about failure, iterate on fixes, and recover from mistakes.

CodeArena measures exactly that. It is the first standardized, open-source reinforcement learning environment built specifically for iterative code repair, graded not just on test pass rates but on whether the fix is correct, secure, and written to a professional standard.
## Features

- **Adaptive Curriculum**: The environment supports an `auto` difficulty mode that dynamically scales task complexity based on the agent's recent rolling average reward.
- **Complex Shaped Rewards**: Rewards are a weighted composite:

| Component | Weight | What it measures |
|---|---|---|
| `compile_score` | 20% | Code compiles without error |
| `test_pass_ratio` | 40% | Fraction of unit tests passed |
| `efficiency_score` | 10% | Speed vs. optimal runtime |
| `llm_judge_score` | 30% | Correctness + security + code quality |
| `step_penalty` | -0.02/step | Rewards faster fixes |
| `novelty_penalty` | -0.10 | Penalises repeating identical fixes |

All rewards are clamped to [0.001, 0.999].

- **Extensive Task Categories**: Includes standard algorithmic tasks, `type_errors`, and `security_bugs`.
- **Real-time Reward Visualization**: Watch the compile score, test ratio, and LLM judge score update live in the React frontend as the agent works.
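The weighted composite above can be sketched as a single function. This is an illustrative reconstruction from the table, not the benchmark's actual implementation; the function name and argument names are assumptions.

```python
# Hypothetical sketch of the composite reward; weights and clamp range
# come from the table above, everything else is illustrative.
def composite_reward(compile_score, test_pass_ratio, efficiency_score,
                     llm_judge_score, steps, repeated_fix):
    reward = (0.20 * compile_score
              + 0.40 * test_pass_ratio
              + 0.10 * efficiency_score
              + 0.30 * llm_judge_score)
    reward -= 0.02 * steps           # step_penalty: rewards faster fixes
    if repeated_fix:
        reward -= 0.10               # novelty_penalty for identical fixes
    return min(max(reward, 0.001), 0.999)  # clamp to [0.001, 0.999]
```

For example, a perfect fix in zero steps scores 1.0 before clamping and is returned as 0.999.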
## Adaptive Curriculum

CodeArena tracks the agent's rolling average reward and escalates or de-escalates difficulty automatically, so an agent cannot plateau by memorising easy tasks.

| Condition | Transition |
|---|---|
| avg reward > 0.80 on easy | → medium |
| avg reward > 0.75 on medium | → hard |
| avg reward < 0.35 on hard | → medium |
| avg reward < 0.35 on medium | → easy |

A minimum of 3 episodes is required at each level before any transition.

- Enable with: `POST /reset` with `{"task_id": "auto"}`
- Monitor live with: `GET /curriculum`
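The transition rules above can be expressed as a small controller. This is a minimal sketch of the described behaviour, not the server's code; the class name, window size, and reset-on-transition detail are assumptions, while the thresholds and the 3-episode minimum come from the section above.

```python
# Illustrative curriculum controller implementing the transition table.
from collections import deque

LEVELS = ["easy", "medium", "hard"]
PROMOTE = {"easy": 0.80, "medium": 0.75}   # avg reward needed to move up
DEMOTE_BELOW = 0.35                        # avg reward that moves down
MIN_EPISODES = 3                           # episodes required per level

class Curriculum:
    def __init__(self, window=10):
        self.level = "easy"
        self.rewards = deque(maxlen=window)  # rolling reward window

    def record(self, reward):
        self.rewards.append(reward)
        if len(self.rewards) < MIN_EPISODES:
            return self.level                # too few episodes to decide
        avg = sum(self.rewards) / len(self.rewards)
        i = LEVELS.index(self.level)
        if self.level in PROMOTE and avg > PROMOTE[self.level]:
            self.level = LEVELS[i + 1]
            self.rewards.clear()             # restart the episode count
        elif avg < DEMOTE_BELOW and i > 0:
            self.level = LEVELS[i - 1]
            self.rewards.clear()
        return self.level
```

Three episodes averaging above 0.80 on easy promote the agent to medium; three below 0.35 send it back down.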
## Architecture

**Data Flow:** Agent → `/reset` → `buggy_code` → `/step` → subprocess → LLM judge → reward → Agent

- `server/`: FastAPI backend acting as the OpenEnv entrypoint.
- `frontend/`: React + Vite frontend for live monitoring and manual intervention.
- `tasks/`: Task definitions stored in OpenEnv-compatible JSON schema.
- `inference.py`: CLI runner for evaluating RL agents, supporting both OpenAI-compatible APIs and native HuggingFace `transformers` pipelines.
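The data flow above implies a simple agent loop. The real environment is served over HTTP (`POST /reset`, `POST /step`); the local stub below only illustrates the request/response shape, and all field names are assumptions based on this README, not the server's actual schema.

```python
# Local stand-in for the CodeArena server, illustrating the loop:
# reset returns buggy code, step grades a patch and returns a reward.
class StubCodeArena:
    def reset(self, task_id="auto"):
        # The server would pick a task and return its buggy code.
        return {"task_id": "sum_pair", "buggy_code": "def add(a, b): return a - b"}

    def step(self, patched_code):
        # The server would run the patch in a subprocess, execute the
        # unit tests, query the LLM judge, and compose the reward.
        passed = "a + b" in patched_code
        return {"reward": 0.9 if passed else 0.1, "done": passed}

env = StubCodeArena()
obs = env.reset()
# The agent repairs the bug and submits the patch via /step.
result = env.step(obs["buggy_code"].replace("a - b", "a + b"))
```

Against the real server, the two calls would be HTTP requests to `/reset` and `/step` with the same loop structure.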
## Results

![Reward Curve](results/reward_curve.png)
*Episode reward over training steps. Rolling 10-step average shown.*

![Reward by Task](results/reward_by_task.png)
*Average reward per task category.*

| Model | Easy | Medium | Hard | Avg |
|---|---|---|---|---|
| GPT-4o | - | - | - | - |
| Qwen-72B | - | - | - | - |
| Llama-3-8B | - | - | - | - |
+
## Why It Matters
|
| 75 |
+
|
| 76 |
+
Every production coding AI needs to debug, not just write.
|
| 77 |
+
There is no other standardized RL environment that trains
|
| 78 |
+
and benchmarks iterative repair. The hybrid grader β
|
| 79 |
+
deterministic test execution plus LLM quality judgment β
|
| 80 |
+
means agents cannot game the reward by memorising solutions
|
| 81 |
+
or producing syntactically correct but semantically wrong fixes.
|
| 82 |
+
|
## Setup

1. **Install Dependencies:**

   ```
   …
   ```

2. **Generate New Tasks:**

   To populate the extended task categories (`type_errors` and `security_bugs`), run the task generator. It must be run first, or the new task categories won't exist.

   ```bash
   python create_tasks.py
   ```
## OpenEnv Compatibility

This benchmark strictly adheres to the OpenEnv specification. See `openenv.yaml` for full configuration details.

## Links

- HuggingFace Space: [URL]
- Colab Training Notebook: [URL]
- HuggingFace Blog Post: [URL]
- Demo Video: [URL]
|