avichauhan committed (verified)
Commit df48515 · 1 Parent(s): 6b40cb9

Upload folder using huggingface_hub

Files changed (1):
  1. README.md +7 -4
README.md CHANGED
@@ -102,7 +102,7 @@ The environment returns an `APIDebugObservation` with:
 | feedback | string | Structured validation feedback from last action |
 | message | string | Human-readable status |
 | done | bool | Whether the episode has ended |
-| reward | float | Reward signal (0.0 to 1.0) |
+| reward | float | Reward signal in open interval (0, 1) |
 
 ## Reward Design
 
@@ -137,10 +137,10 @@ Scores from running inference.py against the live HF Space (3 episodes per task,
 
 | Task | Episodes | Qwen2.5-72B-Instruct | gpt-4o-mini |
 |------|----------|----------------------|-------------|
-| easy | 3 | 1.000 | 0.667 |
-| medium | 3 | 1.000 | 1.000 |
+| easy | 3 | 0.999 | 0.667 |
+| medium | 3 | 0.999 | 0.999 |
 | hard | 3 | 0.780 | 0.760 |
-| **overall** | **9** | **0.927** | **0.809** |
+| **overall** | **9** | **0.926** | **0.809** |
 
 Hard task uses LLM-as-judge (gpt-4o-mini) for explanation quality scoring, which is stricter than a heuristic baseline. The agent must fix 2-3 simultaneous errors and provide a developer-facing explanation to score high. Larger models perform better on the hard task, showing meaningful difficulty progression.
 
@@ -211,6 +211,9 @@ api-debug-env/
 ├── client.py              # APIDebugEnv(EnvClient)
 ├── validate-submission.sh # Pre-submission validator
 ├── __init__.py
+├── tests/
+│   ├── __init__.py
+│   └── test_environment.py # 79 unit tests
 └── server/
     ├── __init__.py
     ├── app.py             # FastAPI app via create_app()
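For reference, the observation schema documented in the diff can be sketched as a plain dataclass. This is a hypothetical illustration of the table's shape only, not the environment's actual `APIDebugObservation` implementation; the example values are invented.

```python
from dataclasses import dataclass


@dataclass
class APIDebugObservation:
    """Sketch of the README's documented observation fields (illustrative only)."""

    feedback: str   # structured validation feedback from the last action
    message: str    # human-readable status
    done: bool      # whether the episode has ended
    reward: float   # reward signal in the open interval (0, 1)


# Example values are made up for illustration.
obs = APIDebugObservation(
    feedback="validation failed: missing 'Authorization' header",
    message="retry the request with corrected headers",
    done=False,
    reward=0.25,
)
assert 0.0 < obs.reward < 1.0  # matches the documented open-interval reward range
```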