havinashpatil committed
Commit 3f9399a · 1 Parent(s): a448db8

docs: Rewrite README for hackathon submission

Files changed (1)
  1. README.md +76 -11
README.md CHANGED
@@ -1,26 +1,85 @@
  # CodeArena RL Benchmark

- CodeArena is an OpenEnv-compatible reinforcement learning benchmark for autonomous code repair. In this environment, an agent receives buggy Python code, proposes fixes, and is iteratively evaluated based on test execution feedback and LLM-based quality metrics.

  ## Features

- - **Adaptive Curriculum**: The environment supports an `auto` difficulty mode that dynamically scales task complexity (`easy`, `medium`, `hard`) based on the agent's recent rolling average rewards.
- - **Complex Shaped Rewards**: Rewards are a weighted composite of:
-   - `compile_score` (0.2)
-   - `test_pass_ratio` (0.4)
-   - `efficiency_score` (0.1)
-   - `llm_judge_score` (0.3): Correctness, Security, and Code Quality evaluated via LLM-as-a-judge.
- - **Novelty & Step Penalties**: The agent receives penalties for repeating identical failed fixes or taking too many steps.
  - **Extensive Task Categories**: Includes standard algorithmic tasks, `type_errors`, and `security_bugs`.
- - **Live React Frontend**: Connect a local LLM (like Ollama) or HuggingFace models to interactively visualize step-by-step progress, execution outputs, and live reward components.

  ## Architecture

- - `server/`: FastAPI backend acting as the OpenEnv entrypoint. Handles state, execution sandbox (`executor.py`), and reward grading (`grader.py`).
  - `frontend/`: React + Vite frontend for live monitoring and manual intervention.
  - `tasks/`: Task definitions stored in OpenEnv-compatible JSON schema.
  - `inference.py`: CLI runner for evaluating RL agents, supporting both OpenAI-compatible APIs and native HuggingFace `transformers` pipelines.

  ## Setup

  1. **Install Dependencies:**
@@ -30,7 +89,7 @@ CodeArena is an OpenEnv-compatible reinforcement learning benchmark for autonomo
  ```

  2. **Generate New Tasks:**
- To populate the extended task categories (`type_errors` and `security_bugs`), run:
  ```bash
  python create_tasks.py
  ```
@@ -78,3 +137,9 @@ This generates `reward_curve.png` and `reward_by_task.png` in the `results/` dir
  ## OpenEnv Compatibility

  This benchmark strictly adheres to the OpenEnv specification. See `openenv.yaml` for full configuration details.
  # CodeArena RL Benchmark

+ GitHub Copilot, Cursor, Devin: every major coding AI is
+ benchmarked on generation. Can it write a function? Can it
+ complete a snippet? Nobody benchmarks what happens when the
+ code breaks and the agent has to reason about failure, iterate
+ on fixes, and recover from mistakes.
+
+ CodeArena measures exactly that. It is the first standardized,
+ open-source reinforcement learning environment built specifically
+ for iterative code repair, graded not just on test pass rates
+ but on whether the fix is correct, secure, and written to a
+ professional standard.

  ## Features

+ - **Adaptive Curriculum**: The environment supports an `auto` difficulty mode that dynamically scales task complexity based on the agent's recent rolling average rewards.
+ - **Complex Shaped Rewards**: Rewards are a weighted composite (see the sketch after this list):
+
+ | Component | Weight | What it measures |
+ |---|---|---|
+ | compile_score | 20% | Code compiles without error |
+ | test_pass_ratio | 40% | Fraction of unit tests passed |
+ | efficiency_score | 10% | Speed vs. optimal runtime |
+ | llm_judge_score | 30% | Correctness + Security + Code Quality |
+ | step_penalty | -0.02/step | Rewards faster fixes |
+ | novelty_penalty | -0.10 | Penalises repeating identical fixes |
+
+ All rewards are clamped to [0.001, 0.999].
+
  - **Extensive Task Categories**: Includes standard algorithmic tasks, `type_errors`, and `security_bugs`.
+ - **Real-time Reward Visualization**: Watch the compile score, test ratio, and LLM judge scores update live in the React frontend as the agent works.
+
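A minimal sketch of how the weighted composite above combines into a single scalar. The weights, penalties, and clamp range come from the table; the function signature and argument names are illustrative, not the actual `grader.py` API.

```python
def composite_reward(compile_score: float, test_pass_ratio: float,
                     efficiency_score: float, llm_judge_score: float,
                     steps_taken: int, repeated_fix: bool = False) -> float:
    """Weighted sum of the graded components minus the documented penalties."""
    reward = (0.20 * compile_score        # code compiles without error
              + 0.40 * test_pass_ratio    # fraction of unit tests passed
              + 0.10 * efficiency_score   # speed vs. optimal runtime
              + 0.30 * llm_judge_score)   # correctness + security + code quality
    reward -= 0.02 * steps_taken          # step penalty: rewards faster fixes
    if repeated_fix:
        reward -= 0.10                    # novelty penalty for identical retries
    return min(max(reward, 0.001), 0.999)  # clamp to [0.001, 0.999]
```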
+ ## Adaptive Curriculum
+
+ CodeArena tracks the agent's rolling average reward and
+ escalates or de-escalates difficulty automatically.
+ An agent cannot plateau by memorising easy tasks.
+
+ | Condition | Transition |
+ |---|---|
+ | avg reward > 0.80 on easy | → medium |
+ | avg reward > 0.75 on medium | → hard |
+ | avg reward < 0.35 on hard | → medium |
+ | avg reward < 0.35 on medium | → easy |
+
+ At least 3 episodes are required at each level before any transition.
+ Enable with: `POST /reset` with `{"task_id": "auto"}`
+ Monitor live with: `GET /curriculum`
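A minimal sketch of the transition rule the table above describes, assuming a rolling window of 10 episodes (the window size is an assumption; the thresholds and the 3-episode minimum are from the text). The class is illustrative, not the server's implementation.

```python
from collections import deque

LEVELS = ["easy", "medium", "hard"]
PROMOTE = {"easy": 0.80, "medium": 0.75}   # promote when rolling avg exceeds this
DEMOTE = {"medium": 0.35, "hard": 0.35}    # demote when rolling avg falls below this

class Curriculum:
    """Illustrative controller tracking a rolling average of episode rewards."""

    def __init__(self, window: int = 10):
        self.level = "easy"
        self.rewards = deque(maxlen=window)

    def record(self, episode_reward: float) -> str:
        self.rewards.append(episode_reward)
        if len(self.rewards) < 3:                     # minimum 3 episodes per level
            return self.level
        avg = sum(self.rewards) / len(self.rewards)
        idx = LEVELS.index(self.level)
        if self.level in PROMOTE and avg > PROMOTE[self.level]:
            self.level = LEVELS[idx + 1]              # escalate difficulty
            self.rewards.clear()
        elif self.level in DEMOTE and avg < DEMOTE[self.level]:
            self.level = LEVELS[idx - 1]              # de-escalate difficulty
            self.rewards.clear()
        return self.level
```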

  ## Architecture

+ **Data Flow:** Agent → `/reset` → buggy_code → `/step` → subprocess → LLM judge → reward → Agent (sketched below)
+
+ - `server/`: FastAPI backend acting as the OpenEnv entrypoint.
  - `frontend/`: React + Vite frontend for live monitoring and manual intervention.
  - `tasks/`: Task definitions stored in OpenEnv-compatible JSON schema.
  - `inference.py`: CLI runner for evaluating RL agents, supporting both OpenAI-compatible APIs and native HuggingFace `transformers` pipelines.
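A minimal sketch of the data flow above as a client loop, assuming the server runs locally on port 8000. The payload keys and response fields (`buggy_code`, `code`, `reward`, `done`) are guesses at the schema for illustration, and `propose_fix` is a placeholder for a real agent.

```python
import requests

BASE = "http://localhost:8000"  # assumed local server address

def propose_fix(code: str) -> str:
    """Placeholder agent; a real agent would call an LLM here."""
    return code

# Start an auto-curriculum episode and read the buggy code from the observation.
obs = requests.post(f"{BASE}/reset", json={"task_id": "auto"}).json()
code = obs.get("buggy_code", "")

for _ in range(10):  # cap the number of repair attempts
    result = requests.post(f"{BASE}/step", json={"code": propose_fix(code)}).json()
    print(result.get("reward"), result.get("done"))
    if result.get("done"):
        break
    code = result.get("code", code)  # carry forward the latest candidate

# Check the current curriculum level and rolling average.
print(requests.get(f"{BASE}/curriculum").json())
```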
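`inference.py` supports two model backends. A minimal sketch of how each is typically driven, using the standard `openai` client and the `transformers` pipeline; the endpoint URL and model names are placeholders, and this is not the repo's actual code.

```python
from openai import OpenAI
from transformers import pipeline

buggy_code = "def add(a, b):\n    return a - b\n"
prompt = "Fix the bug in this function:\n" + buggy_code

# Backend 1: any OpenAI-compatible endpoint (e.g. a local Ollama or vLLM server).
client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")
resp = client.chat.completions.create(
    model="llama3",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
)
fix_from_api = resp.choices[0].message.content

# Backend 2: a native HuggingFace transformers text-generation pipeline.
generator = pipeline("text-generation", model="Qwen/Qwen2.5-Coder-1.5B-Instruct")
fix_from_pipeline = generator(prompt, max_new_tokens=256)[0]["generated_text"]
```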
 
+ ## Results
+
+ ![Reward Curve](results/reward_curve.png)
+ *Episode reward over training steps. Rolling 10-step average shown.*
+
+ ![Reward by Task](results/reward_by_task.png)
+ *Average reward per task category.*
+
+ | Model | Easy | Medium | Hard | Avg |
+ |---|---|---|---|---|
+ | GPT-4o | - | - | - | - |
+ | Qwen-72B | - | - | - | - |
+ | Llama-3-8B | - | - | - | - |
+
+ ## Why It Matters
+
+ Every production coding AI needs to debug, not just write.
+ There is no other standardized RL environment that trains
+ and benchmarks iterative repair. The hybrid grader
+ (deterministic test execution plus LLM quality judgment)
+ means agents cannot game the reward by memorising solutions
+ or producing syntactically correct but semantically wrong fixes.
+
  ## Setup

  1. **Install Dependencies:**
 
  ```

  2. **Generate New Tasks:**
+ To populate the extended task categories (`type_errors` and `security_bugs`), run the task generator. This must be run before the environment is used, or the new task categories won't exist.
  ```bash
  python create_tasks.py
  ```
 
  ## OpenEnv Compatibility

  This benchmark strictly adheres to the OpenEnv specification. See `openenv.yaml` for full configuration details.
+
+ ## Links
+ - HuggingFace Space: [URL]
+ - Colab Training Notebook: [URL]
+ - HuggingFace Blog Post: [URL]
+ - Demo Video: [URL]